Digital Humanities Workbench



Formal annotation

In much humanities research, information is added to the objects being studied (such as texts, audio files, images, or video files) in order to arrive at a formal analysis of certain phenomena. When this is done on the basis of a predetermined classification system, we call it formal annotation. The added information may be non-content-based, such as source information (metadata) and information about the structure of the text, but it can also be interpretative and/or analytical. For more information on this topic, which is relevant for all disciplines, please see the page about annotation in textual data in this workbench. See Leech (2004) for more specific information about linguistic annotation.

Note: there is also a section on free annotation in this workbench, which involves making notes in a text whilst reading/studying by underlining words or phrases and adding exclamation marks, labels or textual notes, for instance.

For many types of research there are standard classification systems for structural and source-related information, such as those defined by the Text Encoding Initiative (TEI). This is different for content-based annotation: in most cases the researcher must create their own classification system or typology, based on a specific research question, before starting the annotation process. The classification system can be adjusted during annotation if necessary, but this should be avoided as much as possible, since earlier annotations may then have to be revised to remain consistent.
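As a minimal sketch of what fixing a classification system before annotation can look like in practice, the Python fragment below defines a small typology and refuses any label that is not part of it. The category names and the annotate function are hypothetical and were invented for this illustration; they are not taken from any standard.

```python
from enum import Enum


class AbbreviationType(Enum):
    """Hypothetical typology, defined before annotation starts."""
    ACRONYM = "acronym"
    CLIPPING = "clipping"
    LETTER_NUMBER = "letter_number"


def annotate(token: str, label: str) -> tuple[str, AbbreviationType]:
    """Attach a label to a token, refusing labels outside the typology."""
    try:
        return token, AbbreviationType(label)
    except ValueError:
        allowed = ", ".join(t.value for t in AbbreviationType)
        raise ValueError(f"Unknown label {label!r}; allowed: {allowed}") from None


print(annotate("lol", "acronym"))
```

Because the set of labels lives in one place, a later change to the typology is explicit and can be followed by a check of the material that was already annotated.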

Tools

Certain types of annotation can be applied automatically (see language technology instruments), but annotation is often done manually by the researcher. Various programs can support this process. Which program is best depends on the research method and on a number of factors, the most important of which are:
  • the complexity of the classification system;
  • the tagging system (ad hoc tagging; XML);
  • the relationship between the research subjects and the annotation subjects (does annotation take place on several levels?);
  • whether or not a predefined classification system is used for annotation;
  • the way the data must be further modified and analysed;
  • how well the program supports the researcher (for instance: an intuitive interface, prevention of input errors, enforcement of consistent annotation, the learning curve, and the availability of documentation).

See below for an overview of the most frequently used annotation software in our faculty. The marker [I] after the name of a program indicates that image files can also be annotated; [A] indicates audio files and [V] video files. The faculty Computerization Office can advise on the selection of an appropriate annotation system and can provide support when using it.

Text editor - XML editor - Microsoft Excel - Microsoft Access - UAM CorpusTool - UAM ImageTool [I] - Transana [A;V] - Atlas.ti [I;A;V] - AmCAT

Text editor

For annotation with a simple, self-defined classification system (where no standard annotation system is used), a text editor is sufficient. Options include Windows Notepad and more specialised editors such as NoteTab, which offers useful features such as macros and advanced search and replace operations.

XML editor

XML is often used as an annotation system in scientific projects. Although XML annotation can be done in basically any text editor, a dedicated XML editor is recommended, especially when the annotation system is set up before the text is analysed, so that all possible annotations are known in advance. XMLPad is a relatively simple freeware editor that can be used for this. More advanced XML editors include Oxygen and XMetal. Note: the faculty does not have a license for these programs.
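To give an impression of what inline XML annotation looks like and how it can be processed afterwards, here is a small sketch in Python using the standard library. The element name abbr, its type attribute and the example message are assumptions made for this illustration; they do not follow TEI or any other published schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical inline annotation: <abbr> and its "type" attribute are
# invented for this example and do not follow any published schema.
ANNOTATED = (
    '<message id="m1">See you '
    '<abbr type="letter_number">2morrow</abbr>, '
    '<abbr type="acronym">lol</abbr></message>'
)

root = ET.fromstring(ANNOTATED)

# Count the annotations per category.
counts = Counter(el.get("type") for el in root.iter("abbr"))

# Recover the unannotated text by joining all text nodes in document order.
plain_text = "".join(root.itertext())

print(counts)
print(plain_text)
```

An XML editor adds to this by checking the tags against a schema while you type, which a plain text editor cannot do.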

Microsoft Excel

If the research subject is the same as the annotation subject (e.g. if you are counting the number of abbreviations in text messages) and the classification system is simple, Excel can be used for annotation. Excel is easy to use, and the annotated data can easily be imported into statistical software such as SPSS. Excel is much less suitable when the classification system is more complicated and/or when multiple annotation subjects can occur within one research unit (e.g. if you also annotate every abbreviation in a text message individually without losing the relationship between them).
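The spreadsheet layout described here, with one row per research unit and one column per annotated property, can be sketched as follows. The column names and values are hypothetical; the sketch writes CSV, a plain-text format that spreadsheet and statistics programs can generally import.

```python
import csv
import io

# Hypothetical data: one row per text message
# (research unit and annotation subject coincide).
rows = [
    {"message_id": 1, "sender_age": 17, "n_abbreviations": 3},
    {"message_id": 2, "sender_age": 42, "n_abbreviations": 0},
]

# Write to an in-memory buffer; replace it with an open file to save to disk.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["message_id", "sender_age", "n_abbreviations"]
)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

As soon as each abbreviation needs its own set of properties, a single row per message no longer fits, which is the limitation mentioned above.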

Microsoft Access

This database program is particularly useful for annotation when the research subject is the same as the annotation subject. Compared to Excel, Access allows you to work more efficiently with complex annotation systems and lets you input data in a controlled manner. If multiple aspects are annotated for every annotation subject, the process becomes clearer when the data is entered in an annotation window that shows one annotation subject at a time (instead of a long horizontal row in a spreadsheet that requires constant scrolling). Setting up the database and developing input forms does require some expertise. Finally, Access offers more possibilities for processing and analysing the data than Excel does.
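The idea of controlled input can be illustrated with a small relational sketch. Note that it uses SQLite through Python's standard sqlite3 module as a stand-in, not Access itself, and that the table names, column names and category labels are hypothetical. The allowed labels are stored in a lookup table, and the database rejects any annotation that uses a label outside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE category (label TEXT PRIMARY KEY);
    CREATE TABLE message (
        id       INTEGER PRIMARY KEY,
        body     TEXT NOT NULL,
        register TEXT NOT NULL REFERENCES category(label)
    );
    INSERT INTO category VALUES ('formal'), ('informal');
""")

insert = "INSERT INTO message (body, register) VALUES (?, ?)"

# A label from the lookup table is accepted.
conn.execute(insert, ("c u 2morrow", "informal"))

# A label that is not in the lookup table is refused by the database.
try:
    conn.execute(insert, ("Dear Sir", "polite"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT body, register FROM message").fetchall()
print(rejected, stored)
```

In Access the same effect is typically achieved with a relationship between tables and a drop-down list on the input form.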

Atlas.ti [I;A;V]

Atlas.ti was developed to support qualitative content analysis (which usually involves free annotation), but can also be used in combination with a fixed set of tags. A major disadvantage of this program is that the tagged text is stored in a fairly specific way, so that, practically speaking, Atlas.ti must also be used for further analysis, especially if that analysis is not quantitative in nature. This also makes exchanging the annotated material rather tricky. An advantage of working with Atlas.ti, however, is that it allows controlled annotation with a predefined set of labels, as well as a combination of formal annotation and free annotation. In addition, Atlas.ti can be used to annotate images, audio files and video files. Atlas.ti is used for research in a large number of fields.

UAM CorpusTool

The UAM CorpusTool was developed specially for the annotation of text files based on a user-defined classification system. The text can be annotated on multiple levels (e.g. on text level, sentence level and clause level), which also makes this program suitable for situations in which the annotation subject is not the same as the research unit. The annotation is made in the text itself in a natural way and is presented in a visually attractive manner. The program also supports analysis of the annotated material, both through various search functions and through (relatively basic) statistical analyses, including, for instance, comparative statistical analysis of language use in different genres. Because all annotations are saved in XML files, they can also be edited and processed with other software. In many cases that will not be necessary, because the program itself provides plenty of features for analysis.
Note: this program is not (yet) available on faculty PCs, but it is available as a free download from the link above.
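Annotation on multiple levels can be pictured as a set of labelled segments that each point to a stretch of the same text. The sketch below shows that idea in Python; it is an assumption-laden illustration and not the file format of the UAM CorpusTool. The Segment class, the level names and the labels are all hypothetical.

```python
from dataclasses import dataclass

TEXT = "I came. I saw."


@dataclass(frozen=True)
class Segment:
    level: str  # annotation level, e.g. "text" or "sentence"
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # category from the (hypothetical) classification system


segments = [
    Segment("text", 0, len(TEXT), "narrative"),
    Segment("sentence", 0, 7, "declarative"),
    Segment("sentence", 8, 14, "declarative"),
]


def covered(segment: Segment) -> str:
    """Return the stretch of TEXT that a segment annotates."""
    return TEXT[segment.start:segment.end]


# Select one level without touching the annotations on the other levels.
sentences = [covered(s) for s in segments if s.level == "sentence"]
print(sentences)
```

Because each level is stored separately, the same text can be analysed per text, per sentence or per clause without the levels interfering with each other.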

UAM ImageTool [I]

The UAM ImageTool, a derivative of the UAM CorpusTool, was developed for the annotation of images.

Transana [A;V]

Transana can be used to transcribe, annotate and analyse digital audio and video material. In our faculty this program is primarily used to support conversation analysis.

AmCAT

The AmCAT system was developed to support content analysis. The central element of AmCAT is a database that contains all documents that need to be analysed (such as newspaper articles and contributions to web forums) and the annotations and analyses that are associated with these documents. Through a web interface, researchers can explore the data, quickly perform automatic analyses, add documents for manual annotation and view and analyse the results.
