Overview of text corpora in the Faculty of Humanities

Overview of English corpora

Name
Description
A Standard Corpus of Present-Day Edited American English
(Brown Corpus)
1,000,000 words of running text of edited English prose printed in the United States during the calendar year 1961.
Academic Discourse Verbal Interactions Corpus
(ADVICe)
ADVICe is a small single-genre corpus of spoken university English in a New Zealand context.
British National Corpus
(BNC)
A 100-million-word collection of samples of written and spoken language from a wide range of sources.
Business Letter Corpus
Online, searchable collection of business letters.
Child Language Data Exchange System
(CHILDES)
Corpus material related to first language acquisition.
Corpus of Contemporary American English
(COCA)
425 million words of American English, drawn from spoken language, fiction, popular magazines, newspapers, and academic texts.
Corpus of Global Web-Based English
(GloWbE)
GloWbE is composed of 1.9 billion words from 1.8 million web pages in 20 different English-speaking countries.
Dutch Parallel Corpus
(DPC)
DPC is a parallel corpus of 10 million words containing the language pairs Dutch-English and Dutch-French.
ICAME corpora
Set of corpora containing different varieties of English.
International Corpus of English, British component
(ICE-GB)
One million words of spoken and written English, fully grammatically analysed.
International Corpus of Learner English
(ICLE)
Over 2 million words of writing by advanced/university learners of English from 19 different mother tongue backgrounds.
Lancaster-Oslo/Bergen Corpus
(LOB Corpus)
1,000,000 words of running text of edited British English prose printed in the UK during the calendar year 1961.
Lancaster/IBM Spoken English Corpus
(SEC)
52,000 words of mostly prepared (and mostly monologic) southern British English speech (approximating to RP).
London-Lund Corpus of Spoken English
(LLC)
510,000 words of spoken British English recorded from 1953 to 1987, prosodically transcribed.
Michigan Corpus of Academic Spoken English
(MICASE)
Online, searchable collection of transcripts of academic speech events.
MicroConcord Corpus Collection
Short samples of journalistic text and academic prose from books and papers, totalling 2,000,000 words.
Reuters Corpus (RCV1)
A collection of 810,000 newswire stories from Reuters covering one year, from 20 August 1996 to 19 August 1997.
Santa Barbara Corpus of Spoken American English
(Santa Barbara Corpus)
The SBCSAE is based on a large body of recordings of naturally occurring spoken interaction from all over the United States.
Scottish Corpus of Texts and Speech
(SCOTS)
Online corpus of both written and spoken Scottish English and Scots.
SUSANNE Corpus
130,000-word cross-section of written American English, syntactically analysed (treebanked).
TalkBank
TalkBank is a multilingual corpus containing sample databases from within several subfields of communication.
Time Corpus
More than 100 million words of text of American English from 1923 to the present, as found in TIME magazine.
Translational English Corpus
(TEC)
10-million-word corpus of contemporary translational English (written texts translated into English).
Wellington Corpus of Spoken New Zealand English
(WSC)
Spoken New Zealand English collected in the years 1988 to 1994.