Data and Text Mining

What are data mining and text mining? Data mining is a class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior. It uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining helps analysts recognize significant facts, relationships, trends, patterns, exceptions and anomalies that might otherwise go unnoticed. Its key properties are the automatic discovery of patterns, the prediction of likely outcomes, the creation of actionable information, and a focus on large data sets and databases.

Text mining, on the other hand, is the analysis of data contained in natural language text. It works by converting words and phrases in unstructured data into numerical values, which can then be linked with structured data in a database and analyzed with traditional data mining techniques.
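To make the idea of turning words into numerical values more concrete, here is a minimal sketch in Python. It assumes a simple word-count ("bag of words") representation, which is only one of several ways text can be converted into numbers. The records, the field names `year` and `text`, and the helper functions `tokenize` and `document_term_matrix` are all made up for illustration and do not come from any of the sites discussed in this post.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, rows): one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct word seen in any document, sorted.
    vocabulary = sorted(set().union(*counts))
    # Each row holds the count of every vocabulary word in one document.
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical records: a structured field (year) plus unstructured text.
records = [
    {"year": 1750, "text": "The prisoner stole a silver watch."},
    {"year": 1820, "text": "The watch was found; the prisoner was acquitted."},
]

vocabulary, rows = document_term_matrix([r["text"] for r in records])

# Link each numeric row back to the structured field of its record.
for record, row in zip(records, rows):
    print(record["year"], dict(zip(vocabulary, row)))
```

Each text becomes a row of numbers that sits alongside its structured field, so it could in principle be analyzed with the same techniques used for ordinary database columns.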

The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. We do not have programs that can “read” text and will not have such for the foreseeable future. Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.

One of our tasks in the DITA labs was to explore the Old Bailey Online site (the API demonstrator) and the Utrecht University Digital Humanities Lab, and to compare the way in which each uses data mining.

I explored the Utrecht University Digital Humanities Lab and was particularly interested in the project Dynamics of the Medieval Manuscript: Text Collection from a European Perspective. I noted that the site gives only a synopsis of the project and has no hyperlink to text analysis tools. See the screenshot below:

Digital Humanities Lab

By contrast, the Old Bailey Online site allows the direct export of data to Voyant, and you can choose how many search results to export: 10, 50 or 100. See the example below:

Old Bailey Online

One of the data sets exported to Voyant Tools produced the data visualization shown below:


Experimenting with the text visualization tools was very interesting!