Digital Humanities Workbench
Language technology instruments

There are various approaches within language technology:
Tokenizers

For a computer, a text is nothing but a stream of characters (letters, numbers and punctuation). Before this stream can be analysed, it must first be broken up (segmented) into words (tokens). This can be done with a tokenizer. While this may seem easy, the process can be more difficult than it looks. Punctuation, for example, often has multiple functions: compare the full stops in "I found the info about the H.B.S. on page 15", or the hyphen in "lay-out" with the hyphen between "lay-" at the end of a line and "out" at the beginning of the next line. In addition, each language needs its own tokenizer.

Tokenizers are usually not used independently, but as part of the methods that are discussed below.
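The problems described above can be illustrated with a minimal rule-based tokenizer. This is only a sketch of the idea, not one of the tools mentioned in this workbench: the patterns for abbreviations and line-break hyphens are simplifications invented for this example.

```python
import re

# Illustrative tokenizer sketch. The order of the alternatives matters:
# abbreviations such as "H.B.S." must be tried before ordinary words,
# otherwise their full stops would be split off as punctuation.
TOKEN_RE = re.compile(
    r"""(?:[A-Za-z]\.){2,}   # abbreviations such as H.B.S.
      | \w+(?:-\w+)*         # words, including hyphenated ones like lay-out
      | [^\w\s]              # any other single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text):
    # Rejoin words broken across a line break ("lay-\nout" -> "lay-out").
    # A real tokenizer must also decide whether the hyphen belongs to the
    # word, which this sketch does not attempt.
    text = re.sub(r"-\n", "-", text)
    return TOKEN_RE.findall(text)
```

Note how the abbreviation survives intact while the sentence-final full stop becomes a separate token: `tokenize("I found the info about the H.B.S. on page 15.")` yields `["I", "found", ..., "H.B.S.", "on", "page", "15", "."]`.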
Taggers

Taggers can be used to assign word classes and other characteristics (such as number or verb tense) to words in a sentence. Taggers usually also indicate a word's stem form, or lemma (the lemma of "dyed" is "dye"), which is why they are often called tagger-lemmatizers.
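The idea of a tagger-lemmatizer can be sketched with a toy lexicon lookup. Real taggers are trained on annotated corpora and handle unknown words statistically; the small lexicon below is invented purely for illustration.

```python
# Hypothetical lexicon mapping a word form to (word class, lemma).
LEXICON = {
    "dyed":   ("VERB", "dye"),    # past tense; lemma is "dye"
    "she":    ("PRON", "she"),
    "the":    ("DET",  "the"),
    "shirts": ("NOUN", "shirt"),  # plural; lemma is the singular
}

def tag(tokens):
    # Return (token, word class, lemma) triples; words not in the
    # lexicon get the placeholder class "X" and an unchanged lemma.
    return [(t,) + LEXICON.get(t.lower(), ("X", t.lower())) for t in tokens]
```

For example, `tag(["She", "dyed", "the", "shirts"])` assigns both a word class and a lemma to each token, combining tagging and lemmatization in one pass.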
Parsers

A parser (from pars, the Latin word for "part") is a computer program, or a component of a program, that analyses the grammatical structure of its input according to a fixed grammar. A parser converts textual input into a data structure, usually a tree structure known as the syntax tree (source: Wikipedia, see the link below).

Parsers that are used in our faculty include Alpino (for Dutch; the website also offers an online demo) and FreeLing (for English and Spanish). NB: the FreeLing suite also includes modules for word-class tagging, multiword detection, named-entity recognition, word-sense disambiguation, WordNet sense annotation and coreference resolution (the latter for Spanish only). For more information about this topic, please see the Wikipedia entry for Text Parsing.
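How a parser turns a sentence into a syntax tree can be shown with a tiny recursive-descent parser. The grammar (S → NP VP, NP → DET N, VP → V NP) and the lexicon are invented for this sketch; they have nothing to do with the grammars used by Alpino or FreeLing.

```python
# Toy lexicon assigning each word a category for the toy grammar.
LEX = {"the": "DET", "a": "DET", "cat": "N", "dog": "N", "saw": "V"}

def parse(tokens):
    """Parse a token list with the grammar S -> NP VP, NP -> DET N,
    VP -> V NP, returning the syntax tree as nested tuples."""
    pos = 0

    def expect(cat):
        nonlocal pos
        if pos < len(tokens) and LEX.get(tokens[pos]) == cat:
            node = (cat, tokens[pos])
            pos += 1
            return node
        raise SyntaxError(f"expected {cat} at position {pos}")

    def np():
        return ("NP", expect("DET"), expect("N"))

    def vp():
        return ("VP", expect("V"), np())

    tree = ("S", np(), vp())
    if pos != len(tokens):
        raise SyntaxError("unparsed input remains")
    return tree
```

Parsing "the cat saw a dog" produces the nested structure `("S", ("NP", ...), ("VP", ...))`, a data-structure view of the syntax tree described above.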
Named-entity recognition

Named-entity recognition automatically classifies certain elements in a text according to predefined categories, such as proper names, names of organisations and places, times and dates, numbers and monetary values. This makes named-entity recognition an important addition to tagger-lemmatizers, particularly in applications in the field of information retrieval.

For more information about this topic, please see the Wikipedia entry for Named-entity recognition.
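The mapping from text spans to predefined categories can be sketched with fixed patterns. Production systems use trained models rather than hand-written regular expressions; the three patterns below (monetary values, dates, personal names) are simplifications for illustration only.

```python
import re

# Each entry pairs a category label with a crude pattern for it.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)?")),          # e.g. $20 or $19.95
    ("DATE",  re.compile(r"\b\d{1,2} (?:January|February|March|April|May|"
                         r"June|July|August|September|October|November|"
                         r"December) \d{4}\b")),        # e.g. 4 May 1979
    ("NAME",  re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")),  # Firstname Lastname
]

def find_entities(text):
    # Collect (category, matched text) pairs, grouped by category.
    hits = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits
```

Applied to "Margaret Thatcher paid $20 on 4 May 1979.", this finds one entity per category, which is exactly the kind of information an information-retrieval application can index.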
Word-sense disambiguation

Word-sense disambiguation is a method that can be used to determine the meaning of words that can have several meanings (such as "bank", which can denote the place where you keep your money or the side of a river).

For more information about this topic, please see the Wikipedia entry for Word-sense disambiguation.

Co-reference resolution

A text generally contains several elements that refer to the same entity; put in linguistic terms, these elements have the same referent. This phenomenon is usually not restricted to individual sentences. Take, for example, the fragment "Margaret shook her head. She didn't feel like hurrying up", in which "Margaret", "her" and "she" all refer to the same person. Co-reference resolution is a method that can be used to automatically detect such references.
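The word-sense disambiguation described above can be sketched with the simplified Lesk approach: pick the sense whose dictionary gloss shares the most words with the surrounding context. The two glosses for "bank" below are invented for this example, not taken from an actual dictionary or WordNet.

```python
# Hypothetical sense inventory: sense label -> gloss.
SENSES = {
    "bank": {
        "financial institution": "an institution where you keep or borrow money",
        "riverside": "the sloping land alongside a river or lake",
    }
}

def disambiguate(word, context):
    """Simplified Lesk: choose the sense whose gloss overlaps most
    with the words of the context sentence."""
    ctx = {w.lower().strip(".,") for w in context.split()}

    def overlap(sense):
        return len(ctx & set(SENSES[word][sense].split()))

    return max(SENSES[word], key=overlap)
```

In "you keep your money in the bank", the context shares "keep" and "money" with the first gloss, so the financial sense wins; in "the river bank was steep", the word "river" tips the choice to the riverside sense.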
Other methods

In addition to the methods described above, there are also methods for speech recognition, speech generation and language generation. These topics, however, are too elaborate to feature in this workbench. See Applications for a brief description of a number of applications of these methods.