Digital Humanities Workbench

Faculty of Humanities, VU University Amsterdam




Text corpora

Collections of digital or digitized texts in many languages are available for empirical research in the field of language and communication. These collections are called text corpora. Many corpora now also contain various forms of annotation. Most are enriched with morphosyntactic information, for example, which records the word class and inflection of every word in a text (e.g. church: noun_singular). An overview of the text corpora available for teaching and research in our faculty can be found on the following website:

Overview of text corpora in the Faculty of Humanities

The corpora described on this website are accessible via the faculty network or via the Internet. Note: the site also provides an overview of corpus-based frequency lists.
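To make the idea of morphosyntactic annotation and corpus-based frequency lists more concrete, here is a minimal Python sketch. It assumes a toy annotation format modelled on the example "church: noun_singular" above. The sample sentence, the tag names and the helper parse_token are hypothetical and do not correspond to the format of any particular corpus.

```python
from collections import Counter

# Hypothetical annotated tokens in "word: wordclass_inflection" form,
# modelled on the example "church: noun_singular". The tag names are
# illustrative only, not those of any real corpus.
annotated_text = [
    "the: determiner",
    "church: noun_singular",
    "bells: noun_plural",
    "ring: verb_present",
    "near: preposition",
    "the: determiner",
    "church: noun_singular",
]


def parse_token(entry):
    """Split 'word: tag' into (word, word_class, inflection or None)."""
    word, tag = (part.strip() for part in entry.split(":", 1))
    word_class, _, inflection = tag.partition("_")
    return word, word_class, inflection or None


tokens = [parse_token(entry) for entry in annotated_text]

# Frequency list of word forms.
word_frequencies = Counter(word for word, _, _ in tokens)

# Frequency list of word classes, which only the annotation makes possible.
class_frequencies = Counter(word_class for _, word_class, _ in tokens)

for word, count in word_frequencies.most_common():
    print(word, count)
```

Real corpora use their own tag sets and file formats, so the parsing step would need to be adapted to the documentation of the corpus in question.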

Many more corpora are available than those included in the overview above. Institutions that distribute corpora include the Dutch-Flemish HLT Agency (for Dutch), the European Language Resources Association (ELRA) and the Linguistic Data Consortium (LDC). There are also institutions that maintain overviews of existing corpora, including:

Corpus Resource Database
CoRD is an open-access online resource which academic corpus compilers can use to make basic information about their corpora available. It is part of the eVARIENG online services, provided and maintained by the Research Unit for Variation, Contacts and Change in English (University of Helsinki).

Texts & Corpora
Overview provided by The Linguist List.

Corpora, Collections, Data Archives
Overview on the website Bookmarks for Corpus-Based Linguists by David Lee.
Note: although the site has not been updated since 2010, it still contains a useful list of corpora that were available at that time.

Finally, the website corpus.byu.edu is also worth mentioning. After registering on this site, you get free online access to a large number of web-based corpora of English (from various countries), Spanish and Portuguese.
