Digital Humanities Workbench



Introduction to XML

XML (Extensible Markup Language) was developed to store the contents of files in a structured way. A central aspect of XML is the use of tags that mark and annotate the structure of a document as a whole and of the individual meaningful elements within it. XML documents consist only of letters, numbers, and punctuation (they are so-called plain text files) and contain no program-specific binary code for formatting or structuring, as is the case with Word documents and Excel files, for example. As a result, XML documents are program- and platform-independent: a text file with XML tags created with program A on operating system C can also be processed by program B on operating system D. In addition, XML is an open standard that can be used for free. Using XML is a way to make digital information more future-proof: an open and relatively simple standard is a good basis for future reuse of (research) data.
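
To make the idea of tags concrete, here is a minimal sketch in Python that reads a small XML document with the standard-library module xml.etree.ElementTree. The sample letter and all of its element and attribute names (letter, sender, recipient, body, p, place, date) are invented for illustration and do not belong to any particular XML application.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the element and attribute names are
# made up for this illustration, not taken from an existing standard.
SAMPLE = """<letter date="1652-03-14">
  <sender>A. Jansen</sender>
  <recipient>B. de Vries</recipient>
  <body>
    <p>I arrived in <place>Utrecht</place> yesterday.</p>
  </body>
</letter>"""

# Parse the plain text into a tree of elements.
root = ET.fromstring(SAMPLE)

# Tags give access to the document as a whole (the root element and its
# attributes) and to individual meaningful elements inside it.
letter_info = {
    "root_tag": root.tag,
    "date": root.get("date"),
    "sender": root.findtext("sender"),
    "place": root.findtext("body/p/place"),
}
print(letter_info)
```

Because the sample is plain text, any other XML-aware program on any platform could read the same file and retrieve the same elements.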

Some key benefits of using XML are:

  • Documents encoded with XML can be edited and processed in various ways. This can be done at the document level, at the level of the individual tagged elements, or at both levels. For research purposes, it is useful, for example, that the content of XML documents can be converted to SPSS files for further statistical analysis.
  • Optimal access to the contents of the documents (text retrieval): these documents can be searched in various ways, making optimal use of the various tagged elements. Ease of access can be improved by indexes, which can be generated automatically based on the tagged elements.
  • XML documents can be shared easily, so that they can also be used by other people.
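
The first two benefits can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the tiny corpus, its tag names (corpus, doc, person, place) and the helper functions build_index and to_csv are all hypothetical. CSV is used here as a simple tabular stand-in for a format that statistical packages can import. It is not a real SPSS file.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mini corpus; all tag names are invented for this sketch.
CORPUS = """<corpus>
  <doc id="d1"><p><person>Anna</person> travelled to <place>Leiden</place>.</p></doc>
  <doc id="d2"><p><person>Bram</person> wrote from <place>Leiden</place> to <person>Anna</person>.</p></doc>
</corpus>"""


def build_index(root, tag):
    """Map the text of every `tag` element to the ids of the docs it occurs in."""
    index = defaultdict(list)
    for doc in root.iter("doc"):
        doc_id = doc.get("id")
        for element in doc.iter(tag):
            if doc_id not in index[element.text]:
                index[element.text].append(doc_id)
    return dict(index)


def to_csv(root, tags):
    """Write one row per tagged element: document id, tag name, element text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["doc_id", "tag", "text"])
    for doc in root.iter("doc"):
        for element in doc.iter():
            if element.tag in tags:
                writer.writerow([doc.get("id"), element.tag, element.text])
    return buffer.getvalue()


root = ET.fromstring(CORPUS)
person_index = build_index(root, "person")
place_index = build_index(root, "place")
table = to_csv(root, {"person", "place"})
print(person_index)
print(table)
```

The index is generated automatically from the tagged elements, and the same elements supply the rows of the table, so no information has to be re-entered by hand.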

Many different XML applications have been developed for encoding specific types of documents (see, for example, the overview of XML Applications and Initiatives on the Cover Pages website). Nevertheless, anyone can develop an entirely new encoding system that meets the needs of, say, a specific research project. It is also possible to take an existing XML application and adjust or extend it for such a project.

XML has hundreds of applications and is used in many different disciplines. Here are some examples:

  • In the tech industry, XML is widely used to exchange data between different computer programs and systems.
  • On the Internet, XHTML (the XML version of HTML) has been used to design web pages since 2001 (see, for example, the HTML vs XHTML page on www.W3schools.com). HTML5 (the successor of HTML4 and XHTML) is not itself based on XML, but it can also be written in an XML syntax (sometimes called XHTML5). In addition, more and more dynamic websites are based on files encoded with XML (instead of data stored in databases).
  • In the publishing world, XML is used to save documents in such a way that they (or parts of them) can be published on various media. This is also called 'medium-neutral storage' (see e.g. Kunst 2010). 'Printing on demand', which allows people to buy just a few relevant digital chapters of a book rather than the full book, is also made possible by XML.
  • In the library world, XML plays a role in the exchange and joint use of bibliographic data (see e.g. Banerjee 2008).
  • In the archiving world, XML is used for the structured storage of archive inventories. The XML application in question is called Encoded Archival Description (for more information, see e.g. the DEN Foundation's page about EAD).

This Workbench focuses on how XML can be used to enhance (mainly textual) documents for research purposes in the humanities, where extra information is often added to digitized textual sources to aid analysis of these texts. This can include information about the origin and structure of the documents, as well as more content-related information that is used to classify the content of documents in any number of ways. This process is usually called annotation. Research material enhanced with XML tags can be edited, searched, analysed and presented in various ways (on a website, for example). XML can greatly improve the accessibility of research material, provided that the XML tags are applied correctly.
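
As a rough illustration of what annotation means in practice, the Python sketch below wraps one term in a sentence in a tag and attaches extra information to it as an attribute. The function annotate, the element names s and place, and the attribute type are hypothetical choices made for this example. Real projects would normally follow the conventions of an XML application such as TEI.

```python
import xml.etree.ElementTree as ET


def annotate(sentence, term, tag, **attributes):
    """Return an <s> element in which the first occurrence of `term` is
    wrapped in an element named `tag` (hypothetical helper for this sketch)."""
    before, found, after = sentence.partition(term)
    s = ET.Element("s")
    if not found:
        # Term not present: keep the sentence as unannotated text.
        s.text = sentence
        return s
    s.text = before                       # text in front of the annotation
    child = ET.SubElement(s, tag, attributes)
    child.text = term                     # the annotated term itself
    child.tail = after                    # text following the annotation
    return s


annotated = annotate(
    "The letter was sent from Utrecht in 1652.",
    "Utrecht",
    "place",
    type="city",
)
xml_text = ET.tostring(annotated, encoding="unicode")
print(xml_text)
# <s>The letter was sent from <place type="city">Utrecht</place> in 1652.</s>
```

The original wording of the source is left untouched. The annotation only adds a layer of information that can later be searched or counted.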

In linguistic research and textual analysis, XML is used for the markup and annotation of text corpora, both standard corpora such as the British National Corpus and SoNaR and corpora compiled for specific research projects. In literary research, XML is used for the annotation of digitized literary texts. The TEI By Example project carried out by the Royal Academy of Dutch Language and Literature provides a good overview of possible applications of XML for annotating poetry. In historical and cultural-historical research, XML is used for the annotation and disclosure of, for example, letters and other historical documents, as well as for the markup of more structured data sets (based on personal archives, for example).

In addition to annotation for research purposes, XML is also widely used to prepare primary texts, manuscripts and other documents for delivery and digital publication.