Digital Humanities Workbench


Language technology instruments

There are various approaches within language technology:

  1. The rule-based approach, focused on the analysis of natural language expressions using linguistic rules. Although there was a focus on the lexical, morphological and syntactic components in the past, over the last few years there has been a lot of interest in the semantic level (for speech technology) and the pragmatic level. There is not one single rule-based approach: it can also be a synthesis of various linguistic approaches.
  2. The probabilistic approach, where analysis is performed based on statistical processes that have (often) been developed on the basis of (large) text corpora.
  3. A combination of (1) and (2), which uses statistical techniques to resolve ambiguous analyses, for example. A well-known application is so-called trigram analysis, in which the word class of a word is determined by looking at the classes of the words to its left and right, and choosing the combination that has the highest statistical probability in that context.
This page provides an overview of a number of common instruments that are used in language technology. Most instruments are available in our faculty and can be used for (thesis) research that requires automatic analysis of text corpora. Please contact Piek Vossen (p.vossen at vu.nl) or Onno Huber (o.huber at vu.nl) for further information.

Tokenizers

For a computer, a text is nothing but a stream of characters (letters, numbers, and punctuation). Before this stream can be analysed, it must first be broken up (segmented) into words (tokens). This can be done with a tokenizer. The process is more difficult than it looks. Punctuation, for example, often has multiple functions: compare the full stops in "I found the info about the H.B.S. on page 15.", or the hyphen within "lay-out" and the hyphen that splits "lay-" at the end of a line from "out" at the beginning of the next line. In addition, each language must have its own tokenizer.
Tokenizers are usually not used independently, but as part of the methods that are discussed below.
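
The punctuation problems described above can be illustrated with a minimal sketch in Python. It is not one of the faculty tools: the abbreviation list, the regular expression and the function names are assumptions made for this example, and a real tokenizer needs a much larger, language-specific list. Whether the hyphen of a word split across a line break should be kept is also a simplification here; the sketch always keeps it.

```python
import re

# Hypothetical abbreviation list; a real tokenizer uses a language-specific lexicon.
ABBREVIATIONS = {"H.B.S.", "e.g.", "i.e."}

# A word (optionally with internal hyphens, full stops or apostrophes and an
# optional trailing full stop), or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.']\w+)*\.?|[^\w\s]")


def rejoin_line_breaks(text):
    """Join a word that was hyphenated across a line break: 'lay-\\nout' -> 'lay-out'."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1-\2", text)


def tokenize(text):
    """Split text into tokens, keeping known abbreviations in one piece."""
    tokens = []
    for candidate in TOKEN_PATTERN.findall(rejoin_line_breaks(text)):
        if len(candidate) > 1 and candidate.endswith(".") and candidate not in ABBREVIATIONS:
            # Not an abbreviation: treat the full stop as a separate token.
            tokens.extend([candidate[:-1], "."])
        else:
            tokens.append(candidate)
    return tokens


print(tokenize("I found the info about the H.B.S. on page 15."))
```

The full stops in "H.B.S." stay attached to the abbreviation, while the full stop after "15" becomes a token of its own. An abbreviation that happens to end a sentence is not handled by this sketch.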

Taggers

Taggers can be used to assign word classes and any other characteristics (such as singular-plural or verb tense) to words in a sentence. Taggers usually also indicate a word's stem form, or lemma (the lemma of "dyed" is "dye"), which is why they are often called tagger-lemmatizers.
Taggers are usually based on a lexicon, which is compared with the words in a text. Because words often belong to more than one word class (take the word "paint", for example, which can be both a noun and a verb), taggers have to be able to choose the right option based on context. An additional problem is that many texts contain words that are not included in lexicons, such as proper names, place names and very uncommon words. Meaningful units consisting of more than one word (multiwords, such as "after all") can also make tagging more difficult.
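
As an illustration of choosing a word class based on context, the following Python sketch combines a lexicon lookup with the trigram idea mentioned in the introduction. It does not show how Frog or TreeTagger work internally. The lexicon, the tag names and the trigram scores are all invented for this example; a real tagger estimates such scores from a large annotated corpus.

```python
# Hypothetical mini-lexicon: each word maps to its possible word classes.
LEXICON = {
    "the": {"DET"},
    "we": {"PRON"},
    "paint": {"NOUN", "VERB"},
    "is": {"VERB"},
    "dry": {"ADJ"},
    "walls": {"NOUN"},
}

# Invented scores for (class to the left, candidate class, class to the right).
TRIGRAM_SCORES = {
    ("DET", "NOUN", "VERB"): 0.6,
    ("DET", "VERB", "VERB"): 0.01,
    ("PRON", "VERB", "NOUN"): 0.5,
    ("PRON", "NOUN", "NOUN"): 0.05,
}


def tag(words):
    """Assign one word class per word, resolving ambiguity with trigram scores."""
    options = [LEXICON.get(word.lower(), {"UNKNOWN"}) for word in words]
    tags = []
    for i, candidates in enumerate(options):
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        # Left neighbour: the class already chosen; right neighbour: all its options.
        left = tags[i - 1] if i > 0 else "START"
        right_options = options[i + 1] if i + 1 < len(options) else {"END"}
        best = max(
            sorted(candidates),
            key=lambda c: max(TRIGRAM_SCORES.get((left, c, r), 0.0) for r in right_options),
        )
        tags.append(best)
    return list(zip(words, tags))


print(tag(["the", "paint", "is", "dry"]))
print(tag(["we", "paint", "walls"]))
```

In the first sentence "paint" is tagged as a noun, because it stands between a determiner and a verb; in the second it is tagged as a verb. Words that are missing from the lexicon simply receive the label UNKNOWN here, which is exactly the problem that proper names and uncommon words cause.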
The taggers we use in the faculty are Frog (for Dutch) and TreeTagger (for multiple languages, including English, French and German). The CGN tagger-lemmatizer was used to analyse the CGN corpus, and can be tested online.
For more information about this topic, please see the Wikipedia entry for Part-of-speech tagging.

Parsers

A parser (based on pars, the Latin word for part) is a computer program, or component of a program, that parses the grammatical structure of input according to a fixed grammar. A parser converts textual input into a data structure. Parsers usually produce a tree structure, known as the syntax tree (source: Wikipedia, see the link below).
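
To make the idea of a fixed grammar and a syntax tree concrete, here is a minimal sketch of a recursive-descent parser in Python. The toy grammar (S -> NP VP; NP -> DET NOUN | PRON; VP -> VERB NP | VERB), the tag names and the nested-tuple tree format are assumptions made for this example. Real parsers such as Alpino use far richer grammars and do not work this way internally.

```python
def parse_np(tagged, pos):
    """NP -> DET NOUN | PRON. Returns (subtree, next position) or None."""
    if pos < len(tagged) and tagged[pos][1] == "PRON":
        return ("NP", [tagged[pos]]), pos + 1
    if (pos + 1 < len(tagged)
            and tagged[pos][1] == "DET"
            and tagged[pos + 1][1] == "NOUN"):
        return ("NP", [tagged[pos], tagged[pos + 1]]), pos + 2
    return None


def parse_vp(tagged, pos):
    """VP -> VERB NP | VERB. Returns (subtree, next position) or None."""
    if pos < len(tagged) and tagged[pos][1] == "VERB":
        verb = tagged[pos]
        obj = parse_np(tagged, pos + 1)
        if obj is not None:
            subtree, next_pos = obj
            return ("VP", [verb, subtree]), next_pos
        return ("VP", [verb]), pos + 1
    return None


def parse_sentence(tagged):
    """S -> NP VP. Returns a nested tree, or None if the input does not fit the grammar."""
    subject = parse_np(tagged, 0)
    if subject is None:
        return None
    subject_tree, pos = subject
    predicate = parse_vp(tagged, pos)
    if predicate is None:
        return None
    predicate_tree, pos = predicate
    if pos != len(tagged):
        return None  # Unparsed words are left over.
    return ("S", [subject_tree, predicate_tree])


words = [("the", "DET"), ("painter", "NOUN"), ("paints", "VERB"),
         ("the", "DET"), ("wall", "NOUN")]
print(parse_sentence(words))
```

The input is a list of (word, word class) pairs, so the sketch assumes that tagging has already been done. The result is a tree in which the sentence node S contains a noun phrase and a verb phrase, and the verb phrase in turn contains the object noun phrase.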
Parsers that are used in our faculty include Alpino (for Dutch; the website also offers an online demo) and FreeLing (for English and Spanish). NB: the FreeLing suite also includes modules for word class tagging, multiword detection, named-entity recognition, word-sense disambiguation, WordNet sense annotation and coreference resolution (the latter only for Spanish).
For more information about this topic, please see the Wikipedia entry for Text Parsing.

Named-entity recognition

Named-entity recognition automatically classifies certain elements in text according to predefined categories, such as proper names, names of organisations and places, time and date, numbers and monetary values. This makes named-entity recognition an important addition to tagger-lemmatizers, particularly in applications in the field of information retrieval.
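
A very simple form of named-entity recognition can be sketched in Python with a list of known names (a gazetteer) and a few patterns for dates and monetary values. The names, categories and patterns below are assumptions made for this example; actual systems typically rely on statistical models and much larger resources.

```python
import re

# Hypothetical gazetteer of known names and their categories.
GAZETTEER = {"Amsterdam": "PLACE", "Margaret": "PERSON"}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Simplified patterns; they cover only one date format and three currency signs.
PATTERNS = [
    ("MONEY", re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*")),
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")),
]


def recognise_entities(text):
    """Return (entity text, category) pairs in order of appearance."""
    found = []
    for category, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), category))
    for name, category in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), category))
    return [(entity, category) for _, entity, category in sorted(found)]


print(recognise_entities("Margaret paid $12.50 in Amsterdam on 3 May 2015."))
```

Names that are not in the gazetteer are missed entirely, which shows why this approach is only a starting point.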
For more information about this topic, please see the Wikipedia entry for Named-entity recognition.

Word-sense disambiguation

Word-sense disambiguation is a method that can be used to determine the meaning of words that can have several meanings (such as "bank", which can denote the place where you keep your money or the riverside).
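
One classic way to approach this is to compare the words around the ambiguous word with a dictionary definition of each sense, and to pick the sense with the largest overlap (a simplified version of the Lesk algorithm). The Python sketch below does this for "bank". The sense labels, the definitions and the stop word list are invented for this example and are not taken from a real dictionary or from WordNet.

```python
import re

# Hypothetical sense inventory with invented definitions.
SENSES = {
    "bank": {
        "financial institution": "an institution where people keep and borrow money",
        "riverside": "the sloping land along the edge of a river or stream",
    }
}

# Small hypothetical stop word list: words too common to be informative.
STOPWORDS = {"the", "a", "an", "of", "or", "and", "to", "on", "in",
             "at", "where", "we", "my", "i"}


def content_words(text):
    """Lower-case the text and return its words, without stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(word, sentence):
    """Pick the sense whose definition shares the most words with the sentence."""
    context = content_words(sentence) - {word}
    senses = SENSES[word]
    return max(sorted(senses),
               key=lambda sense: len(context & content_words(senses[sense])))


print(disambiguate("bank", "I keep my money at the bank"))
print(disambiguate("bank", "We walked along the bank of the river"))
```

The first sentence shares "keep" and "money" with the financial definition, the second shares "along" and "river" with the riverside definition. When no words overlap at all, the sketch falls back on the alphabetically first sense, which is an arbitrary choice.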
For more information about this topic, please see the Wikipedia entry for Word-sense disambiguation.

Co-reference resolution

A text generally contains several elements that refer to the same entity. Put in linguistic terms, these elements have the same referent. This phenomenon is usually not restricted to individual sentences. Take, for example, the fragment "Margaret shook her head. She didn't feel like hurrying up", in which "Margaret", "her" and "she" refer to the same person. Co-reference resolution is a method that can be used to automatically detect such references.
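
The example fragment can be handled by a deliberately naive heuristic: link each pronoun to the most recently mentioned name of the same gender. The Python sketch below implements only this heuristic. The name and pronoun lists are assumptions made for this example, and real co-reference resolution has to deal with many cases that the heuristic gets wrong.

```python
import re

# Hypothetical lexicons; real systems need far richer information.
NAME_GENDER = {"Margaret": "f", "Peter": "m"}
PRONOUN_GENDER = {"she": "f", "her": "f", "he": "m", "him": "m", "his": "m"}


def resolve_pronouns(text):
    """Link each pronoun to the most recently seen name of the same gender."""
    last_seen = {}  # gender -> most recent name with that gender
    links = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token in NAME_GENDER:
            last_seen[NAME_GENDER[token]] = token
        else:
            gender = PRONOUN_GENDER.get(token.lower())
            if gender in last_seen:
                links.append((token, last_seen[gender]))
    return links


print(resolve_pronouns("Margaret shook her head. She didn't feel like hurrying up"))
```

Both "her" and "She" are linked to "Margaret", across the sentence boundary. A pronoun that appears before any matching name is left unresolved.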

Other methods

In addition to the methods described above, there are also methods for speech recognition, speech generation and language generation. These topics, however, are too extensive to cover in this workbench. See Applications for a brief description of a number of applications of these methods.
