Overview of text corpora in the Faculty of Humanities

Corpus-based frequency lists

This section provides links to a number of authoritative corpus-based frequency lists.

Dutch

The following lists of the 5000 most frequent words in a number of Dutch corpora (as distributed by the TST-Centrale in March 2012) are available:

  1. INL Miljoenencorpora:
  2. PAROLE-corpus 2004: lemmas and word forms
  3. Corpus Gesproken Nederlands: lemmas and word forms
  4. Corpus Algemeen Nederlands Woordenboek: lemmas and word forms
  5. Eindhoven Corpus: word forms
  6. D-Coi-corpus: lemmas and word forms

Of these, the D-Coi-corpus and the Corpus Algemeen Nederlands Woordenboek (ANW) are the most recent and the largest. D-Coi covers many different text types, whereas the ANW corpus is largely based on texts extracted from the Internet. More information about these frequency lists can be found in the document Informatieblad Frequentielijsten Corpora.

British English

The Lancaster University web page http://ucrel.lancs.ac.uk/bncfreq/flists.html provides plain-text versions of the frequency lists contained in the book:

Leech, Geoffrey, Paul Rayson & Andrew Wilson (2001). Word frequencies in written and spoken English: based on the British National Corpus. Harlow (etc.): Longman. [available at the UB VU]

The lists are based on the British National Corpus (BNC) and contain information about both written and spoken British English. They are raw, unedited frequency lists and do not contain the many additional notes supplied in the book itself. The lists are tab-delimited plain text, so they can be imported into your preferred spreadsheet program. For the main lists, a key to the columns is provided. More details on how the lists were prepared can be found in the introduction to the book.
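
As an alternative to a spreadsheet, a tab-delimited list can also be read with a short script. The following is a minimal sketch in Python, not part of the Lancaster materials. The function name read_frequency_list and the column positions word_col and freq_col are assumptions made for illustration, as is the made-up sample data; consult the key to the columns for the actual layout of each list.

```python
import csv
import io


def read_frequency_list(handle, word_col=0, freq_col=1):
    """Return (word, frequency) pairs from a tab-delimited file object.

    word_col and freq_col are assumptions: check the key to the columns
    for the list you downloaded and adjust them accordingly.
    """
    entries = []
    # QUOTE_NONE keeps quotation marks in the word column as literal text.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip blank lines and rows too short to hold both columns.
        if len(row) <= max(word_col, freq_col):
            continue
        word = row[word_col].strip()
        try:
            freq = float(row[freq_col])
        except ValueError:
            # Header lines and non-numeric cells are ignored.
            continue
        if word:
            entries.append((word, freq))
    return entries


# Made-up sample in a hypothetical two-column layout (not real BNC data).
sample = "word\tfreq\nalpha\t120\nbeta\t45.5\n\ngamma\t300\n"
entries = read_frequency_list(io.StringIO(sample))

# Sort by frequency, highest first, and show the two most frequent items.
top = sorted(entries, key=lambda e: e[1], reverse=True)
print(top[:2])
```

For a downloaded file, pass an opened file object instead of the in-memory sample, for example open(path, encoding="utf-8", newline=""), where path and the encoding are likewise assumptions to be checked against the file itself.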

American English

The website www.wordfrequency.info provides a list of the top 5000 words/lemmas from the Corpus of Contemporary American English (COCA). N.B. Loading the list may take some time. COCA is a 385-million-word corpus of nearly 150,000 texts, evenly balanced between spoken English (unscripted conversation from radio and TV shows), fiction (books, short stories, movie scripts), more than 100 popular magazines, ten newspapers, and 100 academic journals.
The website also offers a number of other frequency lists based on this corpus (e.g. collocates and n-grams). Printed versions of these lists can be found in the book

Davies, Mark & Dee Gardner (2010). A frequency dictionary of contemporary American English: word sketches, collocates and thematic lists. London & New York: Routledge.