Digital Humanities Workbench



Annotation in text files

For various reasons, additional information may be added to the actual text of digitized text files. This is called annotation of the text, and it is usually applied by means of codes. We distinguish between the following three types of annotation.

  • Metadata
    Descriptive metadata identifies the source of the text, for example by providing information about the edition of the text or its provenance (archive, library, etc.). Administrative metadata provides information about how the digital version of the source was created, such as when and how it was made, including editing decisions and a revision history. Increasingly, this information is added to the text itself in a so-called header.
  • Structural elements
    Marking structural elements in a text makes it possible to limit searches to certain portions of the text or to exclude certain portions from search operations. Relevant markers in a novel could include the title page, the book number (I, II, III, etc.), the preface, chapters, and possibly even paragraphs. In plays, acts, scenes, stage directions and the speeches of the different characters are often marked. Marking the speeches, for example, allows researchers to investigate differences in word and language usage between characters. In poems, titles, stanzas and lines could be marked, and in letters the name of the addressee, the date, the salutation, paragraphs, closing and signature. In (historical) newspaper texts, headings and bylines may be distinguished from the body text by certain markers.
  • Content-based aspects
    For certain types of research, content-based annotation is added to a text or collection of texts. This can involve, for example, marking names, references to other texts, thematic units, metaphors, or elements governing the narrative structure or perspective.
    It can also be useful to mark grammatical or stylistic aspects of a text, such as direct versus indirect speech, direct versus indirect thought, or epithets (such as "fleet-footed Achilles" and "owl-eyed Athena" in the works of Homer).

Annotation systems

There are various ways to add annotations to a text. It is generally done by adding codes, a practice usually called markup. The easiest way is to add a code after a reserved character in the text. In such a system, epithets could be encoded as follows:
    fleet-footed#ep Achilles  
A thematic annotation could be represented as:
    {theme=love}
An advantage of this approach is that it is simple. The downside is that it has not been standardized and that software for text analysis is not specifically tailored to processing these arbitrary codes.
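
Because such codes are not standardized, they usually have to be processed with a small purpose-built script. The following Python sketch shows one possible way to do this. It assumes that the two conventions shown above are the only ones in use: a code attached to a word with "#", and a marker of the form {key=value}. The function name and the sample string are hypothetical.

```python
import re

# Assumed conventions, taken from the two examples above:
#   word#code    -> a code attached to the preceding word
#   {key=value}  -> a standalone marker
WORD_CODE = re.compile(r"(\S+)#(\w+)")
KEY_VALUE = re.compile(r"\{(\w+)=([^}]*)\}")


def extract_annotations(text):
    """Return (word_codes, markers, plain_text) for an annotated string."""
    word_codes = WORD_CODE.findall(text)   # list of (word, code) pairs
    markers = KEY_VALUE.findall(text)      # list of (key, value) pairs
    plain = KEY_VALUE.sub("", text)        # drop the standalone markers
    plain = WORD_CODE.sub(r"\1", plain)    # keep the word, drop "#code"
    plain = " ".join(plain.split())        # normalize whitespace
    return word_codes, markers, plain


sample = "{theme=love} fleet-footed#ep Achilles"
codes, markers, plain = extract_annotations(sample)
print(codes)    # [('fleet-footed', 'ep')]
print(markers)  # [('theme', 'love')]
print(plain)    # fleet-footed Achilles
```

A script like this only works for as long as the reserved characters never occur in the text itself, which is one of the practical weaknesses of ad hoc codes.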

COCOA is an annotation system that was frequently used a few decades ago and that you can still find in texts that were digitized in the 20th century. The principle of COCOA is that a code is placed between angle brackets and that every code can consist of two parts: the first part specifies the marker type, and the optional second part adds a value. COCOA markers are placed at the start of the element they refer to. The start of Act 3 in a play could thus be marked with the following COCOA code: <act 3>.
COCOA markup example
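
Since COCOA markers only indicate where an element starts, a program reading such a text has to remember the most recent value of each marker type. The Python sketch below illustrates this idea. It assumes that a marker stays in force until the next marker of the same type appears; the function name and the sample lines are invented for illustration and do not come from an actual COCOA-encoded text.

```python
import re

# Assumption: a marker is "<type>" or "<type value>" and remains in force
# until the next marker of the same type.
MARKER = re.compile(r"<(\w+)(?:\s+([^>]*))?>")


def read_cocoa(lines):
    """Yield (state, text) pairs; state maps marker types to current values."""
    state = {}
    for line in lines:
        # Update the current state with every marker found on this line.
        for kind, value in MARKER.findall(line):
            state[kind] = value
        # Whatever remains after removing the markers is actual text.
        text = MARKER.sub("", line).strip()
        if text:
            yield dict(state), text


sample = [
    "<act 3>",
    "<scene 1>",
    "First line of the scene.",
    "<scene 2>",
    "First line of the next scene.",
]
tagged = list(read_cocoa(sample))
for state, text in tagged:
    print(state, text)
```

Each line of text is paired with the act and scene that were current when it was read, which makes it possible to restrict a search to, say, Act 3.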

XML is now more commonly used to annotate texts. In text archives you can still find many texts that are encoded with SGML, the precursor to XML. The main advantage of using XML is that it is a widespread standard that is handled well by most modern software. Based on XML, the Text Encoding Initiative (TEI) has developed a number of encoding sets for use in the humanities, including an encoding set for novels and plays.
XML markup example. [Source: Shakespeare XML project (accessed on 22-2-2016)].
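
Because XML is a standard, structural markup can be queried with general-purpose tools. The Python sketch below uses the standard library module xml.etree.ElementTree to count the words spoken by each character while ignoring stage directions. The element and attribute names (play, act, scene, stage, speech, speaker, line) and the sample text are invented for illustration; they are not the TEI encoding set, nor the markup of the Shakespeare XML project.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: element names and text are illustrative only.
SAMPLE = """
<play>
  <act n="3">
    <scene n="1">
      <stage>Enter two characters.</stage>
      <speech speaker="A"><line>Good morrow to you.</line></speech>
      <speech speaker="B">
        <line>And to you.</line>
        <line>Farewell, friend.</line>
      </speech>
    </scene>
  </act>
</play>
"""


def words_per_speaker(xml_text):
    """Count whitespace-separated words in <line> elements, per speaker."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    for speech in root.iter("speech"):
        speaker = speech.get("speaker")
        for line in speech.iter("line"):
            counts[speaker] += len((line.text or "").split())
    return counts


counts = words_per_speaker(SAMPLE)
print(dict(counts))  # {'A': 4, 'B': 5}
```

Only text inside line elements is counted, so the stage direction is excluded. This is the kind of restricted search that structural annotation makes possible.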

Further information

For further information about the process of annotation, see the texts about formal annotation and free annotation in this Workbench.

More information about XML and TEI.
