Digital Humanities Workbench



Compiling a corpus

Compiling a text corpus is a time-consuming task, so it is recommended that you use existing corpus material where possible. See the faculty corpus overview for the corpora available to staff and students in our faculty.

If the available text corpora contain no data that is useful for your research project, you may need to build your own corpus, known as a DIY (do-it-yourself) corpus. When compiling a DIY corpus, consider the following points.

Collecting material

  • Your research objectives and/or research questions should be clear before compiling the corpus, because these determine what material you have to collect.
  • The internet can be an important source of all kinds of textual material, but you need to know, and record, the origin of every text in your corpus. This information is not always easy to find, so where possible it is preferable to use predefined digital text collections such as LexisNexis Academic, which is available as an e-resource through the UB VU.
  • Many texts that you find on the internet will have to be converted from HTML, Word, or PDF format to plain text files before you can explore them with software such as WordSmith or AntConc. See the section called 'Preparing material' below.
  • If you do look for texts on the internet, certain tools can help you search efficiently for texts on specific topics. One example is WebBootCat (free trial account for 30 days).
  • Because of copyright constraints, it is usually not permitted to distribute a DIY corpus freely. This is one more reason to keep good records of the sources of your texts.
  • The texts for the corpus have to be collected systematically, under controlled conditions, and in such a way that the corpus adequately represents the text type or genre to be studied. Important concepts are balance, representativeness, and sample (see McEnery, Xiao & Tono (2006), listed under 'More information' below).
  • The ideal size of the corpus depends on the frequency and distribution of the linguistic characteristics that you want to investigate.
N.B. The last two points are essential if your research has a quantitative component and you want to draw statistical conclusions from the corpus.
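
One way to keep the source records mentioned above is a simple manifest file that is updated every time a text is added to the corpus. The following Python sketch is only an illustration: the function name record_source, the file name sources.csv and the column names are assumptions made for this example, not a standard.

```python
import csv
import hashlib
from datetime import date
from pathlib import Path

# Column names are illustrative assumptions, not a fixed standard.
FIELDS = ["filename", "source", "retrieved", "sha256"]


def record_source(manifest: Path, text_file: Path, source: str, retrieved: str = "") -> dict:
    """Append one provenance row for text_file to a CSV manifest."""
    row = {
        "filename": text_file.name,
        "source": source,
        # Default to today's date if no retrieval date is given.
        "retrieved": retrieved or date.today().isoformat(),
        # A checksum makes it possible to notice later changes to the file.
        "sha256": hashlib.sha256(text_file.read_bytes()).hexdigest(),
    }
    is_new = not manifest.exists()
    with manifest.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            # Write the header row only when the manifest is first created.
            writer.writeheader()
        writer.writerow(row)
    return row
```

Calling record_source(Path("sources.csv"), Path("article01.txt"), "https://example.org/article01") after saving each text would build up one row per document. Further columns (author, publication date, genre) could be added to FIELDS if the research question requires them.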

Preparing material

If you have collected textual material for compiling a corpus yourself, it is often necessary to prepare the material for further investigation. This can consist of the following activities:
  • Converting the file format of the collected material. Much of the software used for further processing and/or analysis of the corpus cannot handle every file format. Concordance programs such as WordSmith and AntConc, for example, cannot search Word and PDF files. These first have to be converted into plain text files (which usually have the extension .txt).
  • For texts from websites: removing HTML tags and other non-text elements of the web text (such as JavaScript and PHP code). There are several freeware programs available for this purpose, usually called HTML strippers. One program you can use is NoteTab, a freeware text editor.
  • Ensuring that all collected documents use the same character set (character encoding), for example UTF-8. This is especially necessary if the material comes from various sources.
  • In many cases, information about aspects such as origin and structure will need to be added to the corpus. This is referred to as markup. It will also be necessary to enhance the content of the corpus through annotation. Although these tasks can partly be performed (semi-)automatically, this is usually still a time-consuming process. For more information, see the pages about annotation and formal annotation.
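
As an alternative to a dedicated HTML stripper, the tag removal and character set steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a full solution: the names TextExtractor, strip_html and decode_bytes are invented for this example, and the list of candidate encodings (UTF-8, then Windows-1252) is an assumption that should be adapted to your own sources.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and the contents of script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text of an HTML document with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def decode_bytes(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Decode raw bytes by trying each candidate encoding in turn."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

A typical use would be to read each downloaded file as bytes, pass it through decode_bytes and then strip_html, and save the result as a UTF-8 encoded .txt file. Two limitations are worth noting: joining text fragments with spaces can split a word that contains inline markup, and trying encodings in a fixed order is a heuristic that can misread a file, so a sample of the output should always be checked by hand.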

If you have recorded spoken language for your DIY corpus, you will usually need to transcribe it before subjecting it to further analysis. Adequate automatic transcription (by means of automatic speech recognition) is not yet possible, so transcription usually has to be done by hand. There are, however, several tools available to support this task. For more information, see the page about transcription of speech.

More information

McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: an advanced resource book. New York: Routledge.
Chapter 8 of this book, entitled 'Going solo: DIY corpora', provides a guide for creating do-it-yourself corpora. N.B. A hard copy of this book is available at the University Library VU.

Wynne, M. (ed.) (2005). Developing linguistic corpora: a guide to good practice.
This online handbook contains expert advice on compiling a reliable text corpus that meets your research requirements.
