Digital Humanities Workbench


Basic text analysis

In this workbench, we use the term basic text analysis for the analysis of the occurrence and usage of certain words in a text or a collection of texts. One might also call this lexical analysis; essentially it focuses on the vocabulary of texts. This can involve the frequency of words in a text (e.g. in contrastive analysis), but it usually comes down to searching for certain words or phrases, word patterns and annotations in a text. Other word-related aspects, such as measuring the distribution of certain words in a text and establishing a text-specific vocabulary, are also part of this method of analysis.
As early as the 1960s and 1970s, researchers investigated how computers can be used to support this type of analysis. The scientific literature offers various examples from the fields of thematic, stylistic and structural analysis, intertextuality and prosodic analysis (rhyme and metre). A comprehensive overview can be found in Chapter 5 ("Literary Analysis") of Electronic Texts in the Humanities, by Susan Hockey (Oxford: OUP, 2000).
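
The contrastive use of word frequencies and the idea of a text-specific vocabulary, both mentioned above, can be illustrated with a small Python sketch. It ranks the words of a target text by how much more frequent they are there than in a reference text. The function names, the two sample texts and the scoring (a smoothed ratio of relative frequencies) are assumptions made for this illustration; keyword tools commonly use statistical measures such as log-likelihood instead.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters (a deliberately simple rule)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def specific_vocabulary(target, reference, top=5):
    """Rank words by how much more frequent they are in `target` than in `reference`."""
    t_counts = Counter(tokenize(target))
    r_counts = Counter(tokenize(reference))
    t_total = sum(t_counts.values())
    r_total = sum(r_counts.values())
    vocab = len(set(t_counts) | set(r_counts))
    scores = {}
    for word, count in t_counts.items():
        # Add-one smoothing, so that words absent from the reference
        # do not cause a division by zero.
        t_rel = (count + 1) / (t_total + vocab)
        r_rel = (r_counts[word] + 1) / (r_total + vocab)
        scores[word] = t_rel / r_rel
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Hypothetical sample texts, invented for this example.
target = "the whale and the sea and the whale"
reference = "the city and the street and the city"
print(specific_vocabulary(target, reference, top=2))  # ['whale', 'sea']
```

Function words such as "the" and "and" occur equally often in both samples and therefore score low, whereas "whale" and "sea" stand out as specific to the target text.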


Software for lexical text analysis makes it possible to examine quickly and efficiently how certain words are used in a text or a collection of texts, answering questions such as the following: How often do they occur? In which context do they occur? With which other words are they combined? In which part of the text do they occur? For a detailed overview of the various functions these programs have, see: functions of text analysis software. Such software thus allows researchers to approach texts in a different way than by (linear) reading alone, and it offers many more specific ways of analysing texts than programs such as Word. Online texts are often only partially searchable: you can search only those elements that the creator of the website has made searchable (or you can search the web page on which the texts are displayed by using the default browser search function, Ctrl+F). More detailed analyses are not usually possible.
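
To make three of these questions concrete (how often, with which other words, and in which part of the text), here is a minimal Python sketch. It is not how any particular program works: the function names, the simple tokenisation rule and the sample text are assumptions made for this illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters (a deliberately simple rule)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def frequency(tokens, word):
    """How often does `word` occur?"""
    return Counter(tokens)[word]

def collocates(tokens, word, window=2):
    """With which other words is `word` combined (within `window` tokens)?"""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

def distribution(tokens, word, parts=3):
    """In which part of the text does `word` occur? Counts per equal-sized segment."""
    counts = [0] * parts
    for i, tok in enumerate(tokens):
        if tok == word:
            counts[i * parts // len(tokens)] += 1
    return counts

# Hypothetical sample text, invented for this example.
text = ("The sea was calm. The ship left the harbour and the sea grew rough. "
        "At night the sea was calm again.")
tokens = tokenize(text)

print(frequency(tokens, "sea"))                             # 3
print(collocates(tokens, "sea", window=1).most_common(2))   # [('the', 3), ('was', 2)]
print(distribution(tokens, "sea", parts=3))                 # [1, 1, 1]
```

Dedicated programs add much more on top of this (lemmatisation, sorting, statistics, visualisation), but the underlying counts are of this kind.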

Thematic analysis

In the case of thematic analysis it is usually not clear in advance which exact words are to be searched for. You can approach it in the following ways:
  • Look up potentially relevant words in a frequency list and request a concordance for these words (see: functions of text analysis software).
  • Manually tag the relevant words in the text.
  • Work with semantically oriented search software (see: text mining). This technique is currently still in development.

Working from a frequency list is (much) less labour-intensive than annotating a text. In addition, it is a more intuitive and flexible process than annotation (which is usually applied only once and rarely changed afterwards, so that it may end up steering the analysis). A disadvantage compared to annotation is that words often have homonyms that are not relevant to your research and that you will have to filter out of the concordance. Moreover, not all concepts, thematic aspects and the like can be described with specific words. A particular theme may very well be represented through words that you would not initially associate with it.
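
The frequency-list approach described above can be sketched in Python as follows: build a frequency list, pick a candidate word from it, request a concordance, and filter out the irrelevant homonyms. The function names, the sample text and the keyword-based filter are assumptions made for this illustration; in practice the filtering is usually done by reading the concordance lines yourself.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters (a deliberately simple rule)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, word, width=3):
    """Return keyword-in-context lines with `width` tokens on either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Hypothetical sample text in which "bank" is a homonym.
text = "She sat on the bank of the river. He took the money to the bank in town."
tokens = tokenize(text)

print(frequency_list(tokens)[:2])   # [('the', 4), ('bank', 2)]

lines = concordance(tokens, "bank")
for line in lines:
    print(line)
# sat on the [bank] of the river
# money to the [bank] in town

# Keep only the sense that matters for a (hypothetical) study of landscape.
# Here a context word stands in for the researcher's own judgement.
relevant = [line for line in lines if "river" in line]
print(relevant)                     # ['sat on the [bank] of the river']
```

The second concordance line shows the disadvantage mentioned above: the financial sense of "bank" turns up in the results and has to be removed before the remaining lines can be interpreted.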

Note: in order to use text analysis programs, it is usually necessary to prepare the text you want to analyse. How much preparation is needed depends partly on the format in which the file has been stored (see: preparation and file formats). It may also be useful, or even necessary, to enhance the text with certain structural or analytical information (see: formal annotation).


Some concordance programs that were commonly used in the humanities in the past are Oxford Concordance Program and TACT. Nowadays, WordSmith (for which our faculty has a licence) and Concordance are quite popular.
AntConc is a freeware program that is available on all VU PCs for students and staff members of the Faculty of Humanities. The online program Voyant Tools is currently experiencing a rise in popularity because of the ways in which it allows users to visualise word usage in a text, such as word clouds, bubble lines, scatter plots and networks.