Data and Text Mining

What is data and text mining? Data mining is a class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. It helps analysts recognize significant facts, relationships, trends, patterns, exceptions and anomalies that might otherwise go unnoticed. The key properties of data mining are automatic discovery of patterns, prediction of likely outcomes, creation of actionable information, and a focus on large data sets and databases. Text mining, on the other hand, is the analysis of data contained in natural language text. Text mining works by converting the words and phrases in unstructured data into numerical values, which can then be linked with structured data in a database and analyzed with traditional data mining techniques.
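To make the idea of turning words into numerical values more concrete, here is a minimal sketch in Python of one common approach, a simple word count table (often called a "bag of words"). The two sample sentences and the function names are made up for illustration, and real text mining tools do considerably more than this (removing common words, grouping word forms, and so on).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, rows), with one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct word seen in any document, sorted A-Z.
    vocabulary = sorted(set().union(*counts))
    # Each row holds how often each vocabulary word occurs in one document.
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical sample sentences, not taken from any real data set.
documents = [
    "The prisoner stole a silver watch.",
    "The watch was found on the prisoner.",
]

vocabulary, rows = term_document_matrix(documents)
print(vocabulary)
for row in rows:
    print(row)
```

Each sentence becomes a row of numbers, one column per word, and rows like these can be stored next to structured data and fed to ordinary data mining techniques.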

The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. We do not have programs that can “read” text, and will not have them for the foreseeable future. Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.

One of our tasks in DITA labs was to explore the Old Bailey Online site (the API demonstrator) and the Utrecht University Digital Humanities Lab to compare the way in which each used data mining.

I explored the Utrecht University Digital Humanities Lab and was particularly interested in the project Dynamics of the medieval manuscript: Text collection from a European Perspective. I noticed that the site only gives a synopsis of the project and has no hyperlink to text analysis tools. See the screenshot below:

Digital Humanities Lab

By contrast, the Old Bailey Online site allows data to be exported directly to Voyant, and you can choose how many search results to export: 10, 50 or 100. See the example below:

Old Bailey Online

One of the data sets exported to Voyant Tools produced the data visualization shown below:


Experimenting with the text visualization tools was very interesting!


Open Data, Data Visualization and Analysis: Making sense of it!

Open data is information that organizations release to the public as data sets. A data set, if you recall, is a collection of factual information in electronic form. Open data supports Freedom of Information (FOI) and increases transparency; as A. Rae (2004) posits, ‘open data has increased transparency, improved access to information and helped places begin to understand and solve problems.’ However, this data should be presented in such a way that anyone can interpret it.

What is data visualization? Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. In short, it is a visual representation of data that goes beyond the standard charts and graphs commonly used in Excel spreadsheets. Today’s data visualization tools display data in more sophisticated ways, such as heat maps and bar, pie and fever charts, among others.

This was illustrated in one of our DITA sessions, where TAGS were created for the top #citylis tweeters and a data visualization of the results was presented. The data revealed here was amazing!
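As a rough illustration of how a list of tweets can become a "top tweeters" picture, here is a small Python sketch that counts tweets per user and draws a plain-text bar chart. The usernames are invented, the function names are my own, and this is not how TAGS itself works; it only shows the general idea of counting and then visualizing.

```python
from collections import Counter


def top_tweeters(usernames, n=3):
    """Return the n most frequent usernames with their tweet counts."""
    return Counter(usernames).most_common(n)


def text_bar_chart(ranked):
    """Turn (name, count) pairs into aligned text bars, one line per user."""
    width = max((len(name) for name, _ in ranked), default=0)
    return [f"{name.ljust(width)} | {'#' * count} {count}" for name, count in ranked]


# Hypothetical data: one entry per tweet, naming the account that sent it.
usernames = ["@alice", "@bob", "@alice", "@carol", "@alice", "@bob"]

ranked = top_tweeters(usernames)
chart = text_bar_chart(ranked)
for line in chart:
    print(line)
```

Even in this tiny example the bars make it obvious at a glance who tweets the most, which is much harder to see in the raw list.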

Data analysis is the process of discovering and understanding the meaning of the data presented to us; it is making sense of the information. Data visualization is therefore a core, and usually essential, means of performing data analysis effectively.