Digital Humanities Workbench
Text corpora

Collections of digital or digitized texts in many languages are available for empirical research in the field of language and communication. These collections are also called text corpora. Many text corpora now also contain various forms of annotation. Most are enriched with morphosyntactic information, for example, which records the word class and inflection of every word in a text (e.g. church: noun_singular).

An overview of relevant text corpora available for teaching and research in our faculty can be found on the website
The corpora described on this website are accessible via the faculty network or via the Internet. Note: the site also provides an overview of corpus-based frequency lists.

Many more corpora are available than those included in the overview above. Institutions that distribute corpora include the Dutch-Flemish HLT Agency (Dutch), the European Language Resources Association (ELRA) and the Linguistic Data Consortium (LDC). There are also institutions that maintain overviews of existing corpora, including:
Corpus Resource Database
Texts & Corpora
Corpora, Collections, Data Archives

Finally, the website corpus.byu.edu is also worth mentioning. After registering at this site, you can get free online access to a vast number of web-based corpora of English (from various countries), Spanish and Portuguese texts.