Overview of text corpora in the Faculty of Humanities

Corpus exploration

Corpora that are available via the Internet usually have their own web-based query forms. The website in question provides online instructions on how to use these forms.

A number of corpora that contain specific data (e.g. speech data) and/or complex annotations (e.g. syntactic tree structures or complex XML annotations) come with dedicated exploration programs. Examples are the British National Corpus (which can be explored with Xaira), the International Corpus of English - British component (software: ICECUP), the Corpus Gesproken Nederlands (software: Corex) and CHILDES (software: CLAN). On this Corpus overview website you will find links to the manuals for these programs under the entries for the corpora concerned.

The text corpora that are available on the faculty network as plain text files can be explored with standard text analysis software. In our faculty, WordSmith and Windows Grep are used for this purpose. WordSmith is most suitable for producing frequency counts and for relatively simple text searches, the results of which are presented as a concordance. Windows Grep can be used for more complex text searches (using so-called regular expressions). You should also use Windows Grep if line breaks in the corpus text must be preserved in the search output (in a WordSmith concordance, line breaks are ignored). WordSmith is available on all faculty PCs but cannot be distributed for use at home. Windows Grep is available on all faculty PCs and can be downloaded as a trial version for use at home.
N.B. A freeware alternative to WordSmith is AntConc. This program offers less functionality than WordSmith (e.g. for manipulating the concordance output and for handling XML annotations), but it is well suited to basic text searches, including regular expression searches.
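
To make the two basic tasks mentioned above more concrete, here is a minimal Python sketch of a frequency count and a keyword-in-context concordance over a plain text string. It is not how WordSmith, Windows Grep or AntConc work internally. The function names, the simplistic tokenization rule and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed, simplistic tokenization: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text, top=5):
    # Return the `top` most frequent word forms with their counts.
    return Counter(tokenize(text)).most_common(top)

def concordance(text, keyword, width=20):
    # Collapse line breaks and other whitespace to single spaces,
    # so each hit is shown on one line with `width` characters of context.
    flat = " ".join(text.split())
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, standing in for a corpus file.
sample = "The cat sat on the mat.\nThe dog saw the cat."

print(frequency_list(sample, top=2))
for line in concordance(sample, "cat"):
    print(line)
```

Because the concordance function flattens whitespace first, line breaks in the text do not appear in its output, which corresponds to the concordance behaviour described above.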

WordSmith website
The official WordSmith website provides a lot of information about the program and about related issues. You can also consult and/or download WordSmith manuals here.

WordSmith 5 manual
Online manual for WordSmith 5.0.

WordSmith tutorial   Internal publication of the Faculty of Humanities - VU Amsterdam
Tutorial of approximately 2 hours that introduces you to the main features of WordSmith.

Windows Grep website
On this website you can find information about Windows Grep and download a trial version for use at home.

Instruction sheet Windows Grep   Internal publication of the Faculty of Humanities - VU Amsterdam
Instruction sheet explaining the basic functions of Windows Grep.

Online regular expression tutorial
More extensive introduction to the use of regular expressions.
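
As a small illustration of what a regular expression search does, the sketch below uses Python's re module rather than Windows Grep, whose own pattern syntax may differ in details. It reports each matching line together with its line number, so the line breaks of the text are kept in the output. The function name, the pattern and the sample lines are invented for this example.

```python
import re

def grep_lines(text, pattern):
    # Return (line number, line) pairs for every line that matches the pattern.
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]

# Invented sample text, one sentence per line.
sample = "She walked home.\nHe walks to work.\nThey are talking.\nA long walk."

# Match 'walk' as a whole word, optionally followed by -s, -ed or -ing.
hits = grep_lines(sample, r"\bwalk(s|ed|ing)?\b")
for number, line in hits:
    print(f"{number}: {line}")
```

The single pattern finds several inflected forms of the same verb in one search, while the word boundaries (\b) keep it from matching inside longer, unrelated words.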

A number of specific search engines have been developed with which you can search the Internet for language data. One of these is WebCorp, a suite of tools which allows access to the World Wide Web as a corpus and presents the output of a web search in a concordance format.
N.B. Using the Internet as a source of linguistic data has a number of drawbacks. You have almost no control over the composition of the 'corpus'. Certain text genres are overrepresented on the Web, while others are underrepresented or almost absent (e.g. transcripts of spoken language). Moreover, seemingly identical text genres (e.g. newspaper articles) may differ considerably in their printed and digital forms. In addition, it is quite laborious to establish the origin and the text type of the resulting documents. It is therefore difficult to establish whether your findings are in any way representative of a certain language in general or of a specific text type in particular. Finally, because of the dynamic character of the Web, searches usually cannot be reproduced, whereas reproducibility is a basic requirement of academic research.