Digital Humanities Workbench

Corpus analysis

Corpus analysis is an empirical research strategy, widely used in language research, that works with authentic (real, actually attested) language material. A corpus (also known as a text corpus) is a digital collection of texts, text fragments and/or transcripts of spoken language, selected so that they represent a particular language, dialect or text type as well as possible. This makes the collection as a whole a reliable source for linguistic research. Such research can be descriptive or exploratory, or it can be designed to test linguistic hypotheses.

Many corpora that linguistic researchers can use have been developed worldwide. See the faculty corpus overview for the corpora available to staff and students of our faculty. If the linguistic material you want to investigate has not yet been integrated into a corpus, you will have to build your own. In both cases, the usability of a corpus strongly depends on its composition and design. It is therefore very important to find out these details when working with an existing corpus, and to consider your needs carefully when building your own.

Tasks and activities

Compiling a corpus: If available text corpora contain no useful data for a research project, it may be necessary to build your own corpus. For more information, see the page about compiling a corpus.
Enhancing a corpus: In many cases, the original corpus text is enhanced (supplemented) with additional information. This can be information unrelated to content (source information, information about the speakers, textual structure, etc.), as well as information that does relate to content. This information, which is usually called annotation, can be added in various ways. For more information about this topic, see the page about formal annotation.
Exploring a corpus: The way corpora are structured and stored determines how they can be searched. For more information about this topic, see the page about corpus exploration.
Analyzing a corpus: The final analysis of the corpus data can be done in different ways. This partly relates to the way in which the corpus has been annotated (see the page about formal annotation). If there is a quantitative research component, a form of statistical analysis will be required.
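As an illustration of the quantitative component mentioned under "Analyzing a corpus", the following Python sketch counts word frequencies and normalises them per 1,000 tokens, so that texts of different sizes can be compared. It is a minimal sketch under stated assumptions: the function names `tokenize` and `relative_frequencies`, the naive tokenisation rule and the sample sentences are hypothetical and do not come from any particular corpus or tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens, per=1000):
    """Return each word's frequency normalised per `per` tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * per / total for word, count in counts.items()}


# Hypothetical two-sentence "corpus", for illustration only.
sample = "The cat sat on the mat. The dog sat too."
tokens = tokenize(sample)
freqs = relative_frequencies(tokens)

# Print the three most frequent words (ties are broken alphabetically).
for word, value in sorted(freqs.items(), key=lambda item: (-item[1], item[0]))[:3]:
    print(f"{word}\t{value:.1f}")
```

In real research the normalised counts would normally be passed on to a statistics package for significance testing. The sketch only shows the counting step.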


Different tools support the different stages of corpus research described above. The table below gives a brief overview of the most important tools available to staff and students of our faculty. Each programme name is also a link to a more detailed description.

Programme | Application(s) | Type
NoteTab | pre-processing | editor; HTML stripper
Soundscriber | transcription | transcription tool
XMLPad | annotation | XML editor
WordSmith Tools | exploration | concordancer
AntConc | exploration | concordancer
Transana | transcription and analysis | research tool
SPSS | statistical analysis | statistics suite
R | statistical analysis | statistics suite
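
To give an impression of what a concordancer does during exploration, the Python sketch below produces a simple keyword-in-context (KWIC) listing: every occurrence of a search word together with a few words of context on either side. This is only an illustrative sketch. The function name `kwic`, its `width` parameter and the sample text are assumptions made for this example, and the code does not reflect how WordSmith Tools or AntConc are implemented.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left, match, right) tuples for each occurrence of `keyword`,
    with up to `width` tokens of context on each side."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        # Case-insensitive match on whole tokens only.
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. Then the dog sat beside the cat."
lines = kwic(sample, "sat", width=2)

# Right-align the left context so the keyword lines up in one column.
for left, match, right in lines:
    print(f"{left:>12}  {match}  {right}")
```

Dedicated concordancers add sorting, wildcard and regular-expression searches, and support for annotated corpora, which this sketch leaves out.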

More information

Corpus linguistics
Online tutorial, based on the book Corpus linguistics by T. McEnery & A. Wilson (Edinburgh University Press, 1996). [Available at the VU University Library]

McEnery, T., R. Xiao and Y. Tono (2006). Corpus-based language studies: an advanced resource book. London: Routledge.
This book provides a comprehensive introduction to all aspects of corpus research and gives many examples of concrete research.

International Journal of Corpus Linguistics (IJCL) and Corpora
These journals provide an overview of the role of corpora in all kinds of linguistic research. Both publications are available in digital form via UBVU; a number of older volumes of Corpora are freely accessible (see "Archive").