Digital Humanities Workbench



Text mining

Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.
Source: Wikipedia (https://en.wikipedia.org/wiki/Text_mining), consulted on 5-4-2017.
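As a rough illustration of the pipeline this definition describes, the Python sketch below structures raw input text (tokenization), derives simple patterns (token and word-pair frequencies), and leaves evaluation and interpretation to the researcher. It uses only the standard library; the sample sentence and the function names are invented for illustration, not taken from any particular tool.

    import re
    from collections import Counter

    def structure(text):
        # Structure the raw input: lowercase it and tokenize into words
        return re.findall(r"[a-z]+", text.lower())

    def derive_patterns(tokens):
        # Derive simple patterns: token frequencies and adjacent word pairs
        freq = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        return freq, bigrams

    text = "Text mining derives patterns from text; patterns emerge from text statistics."
    freq, bigrams = derive_patterns(structure(text))
    print(freq.most_common(3))     # most frequent tokens
    print(bigrams.most_common(3))  # most frequent word pairs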

Text mining involves the automated analysis of large amounts of text. This fits in with the concept of Big Data: vast collections of data that can no longer be stored and analysed using conventional computer techniques. Although Big Data can refer to various types of data (from large databases to audio and video files, and from structured to unstructured material), text mining focuses on searching through and analysing large amounts of (mostly unstructured) text. This is interesting for humanities research because text collections that are too large to analyse properly with more traditional text analysis software (such as concordance programs) can still be used as research material. Examples include historical or modern digitized journalistic texts (from Delpher or LexisNexis Academic, for example), collections of tweets, or a large corpus of digitized 18th-century books (Early Dutch Books Online).
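When a corpus is too large to load into memory at once, even a simple analysis has to process it in a streaming fashion. The Python sketch below counts word frequencies file by file and line by line; the path corpus/*.txt is a hypothetical location for downloaded texts, not a fixed convention of any of the collections mentioned above.

    import glob
    import re
    from collections import Counter

    def count_words(paths):
        # Stream each file line by line so corpora larger than memory can be analysed
        counts = Counter()
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    counts.update(re.findall(r"\w+", line.lower()))
        return counts

    counts = count_words(glob.glob("corpus/*.txt"))
    print(counts.most_common(10))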

The computer technology used in text mining enables researchers to collect, edit and store large amounts of textual data, and to highlight and interpret all sorts of relevant connections in the data. It is used in communication science, for example, where an automated version of traditional content analysis supports research on social media and newspaper articles based on collections of millions of items (see, for example, Flaounas et al., 2012).
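A minimal form of such automated content analysis is dictionary-based coding: each category is defined by a set of indicator terms, and an article is assigned every category whose terms it mentions. The categories, terms and sample headlines in the sketch below are invented for illustration.

    from collections import Counter

    # Hypothetical coding scheme: categories mapped to indicator terms
    CATEGORIES = {
        "economy": {"market", "economy", "inflation", "jobs"},
        "politics": {"election", "parliament", "minister", "vote"},
    }

    def code_article(text):
        # Assign each article the categories whose indicator terms it mentions
        words = set(text.lower().split())
        return [cat for cat, terms in CATEGORIES.items() if words & terms]

    articles = [
        "Parliament debates inflation as jobs report looms",
        "Minister calls early election after lost vote",
    ]
    tally = Counter(cat for article in articles for cat in code_article(article))
    print(tally)  # e.g. Counter({'politics': 2, 'economy': 1})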

Certain individual text mining techniques can also be used to automatically annotate smaller text collections that are analysed in a more traditional manner. This can include adding semantic labels and classifications, as well as named entity recognition, which locates and classifies specific elements in a text, such as personal names, geographical locations, organizations, years and dates.
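Named entity recognition is available in several open-source toolkits. The sketch below uses spaCy, one common option (not necessarily the tool used in the projects mentioned on this page), and assumes the small English model has been installed with: pip install spacy and python -m spacy download en_core_web_sm.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("In 1813 William of Orange returned to Amsterdam from England.")

    for ent in doc.ents:
        # Each entity carries its text span and a class such as PERSON, GPE or DATE
        print(ent.text, ent.label_)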

Text mining techniques are used in various research projects in our faculty. Examples are:

  • NewsReader: a computer program that "reads" daily streams of news and stores exactly what happened, where and when in the world, and who was involved.
  • Semantics of History: this project has developed a historical ontology and a lexicon that are used in a new type of information system that can handle the time-based dynamics and varying perspectives in historical archives.
For more information about using this technique, please contact the faculty's Computational Lexicology & Terminology Lab (CLTL).

For support in working with the freeware data mining program Weka, you can also contact Onno Huber (o.huber at vu.nl).

More information