Digital Humanities Workbench



Basic principles of XML

XML allows you to define tagging systems for different types of documents, which can be used to store the content of those documents in a structured way. The two basic building blocks of such a tagging system are elements and attributes.

Elements

An element is a section of text marked with XML tags. This can be a complete text, a paragraph, a word or even part of a word. Basically, there are two types of elements: structural elements, which mark the hierarchical and linear structure of the parts of a text, and so-called floating elements, which can occur at any place within another element. An XML tag consists of a name enclosed in angle brackets. The end tag of an element has the same name as its start tag, preceded by a forward slash placed directly after the opening angle bracket. Various elements can be distinguished in the document type 'newspaper article', for example. As a whole, the document forms the element 'artikel' (article):

<artikel>
... here is the text of the article ...
</artikel>

Within this element you can distinguish between the elements 'kop' (header), 'byline', 'plaats' (place), 'intro' and 'tekst' (text), for example. These elements are embedded in the element 'artikel' and therefore have a hierarchical relationship with this parent element. Because they follow one another in sequence, they have a linear relationship to each other. Note: it does not matter whether XML tags are placed on separate lines or on the same line as the text they enclose. This is only a matter of layout.

<artikel>
<kop> Apple weer slachtoffer van eigen succes </kop>
<byline> Edward Deiters </byline>
<plaats> Amsterdam </plaats>
<intro>
Bekend patroon: 1. Apple introduceert zijn nieuwste gadget. 2. Binnen een paar dagen zijn er miljoenen van verkocht. 3. Het ding heeft een probleem.
</intro>
<tekst>
... here is the text of the article ...
</tekst>
</artikel>

There is no fixed way to structure a document with XML. If it is useful or necessary to distinguish between paragraphs, that hierarchical level can easily be introduced:

<tekst>
<paragraaf>
... here is the text of paragraph 1
</paragraaf>
<paragraaf>
... here is the text of paragraph 2
</paragraaf>
(...)
</tekst>

If it is relevant (for textual research, for example) to make a distinction between the introduction, the main body and the conclusion of an article, you can also define specific elements for this purpose, which will in turn contain the paragraph elements. If it is important to be able to search efficiently for names in a text, the element 'naam' (name) can be divided into a number of sub-elements:

<naam><voornaam>Edward</voornaam><achternaam>Deiters</achternaam></naam>
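The first suggestion above, separate elements for the introduction, the main body and the conclusion, might be sketched as follows. The element names 'inleiding', 'kern' and 'slot' are hypothetical and chosen only for illustration; any meaningful names would do.

```xml
<tekst>
  <inleiding>
    <paragraaf> ... here is the text of the introductory paragraph ... </paragraaf>
  </inleiding>
  <kern>
    <paragraaf> ... here is the text of the first body paragraph ... </paragraaf>
    <paragraaf> ... here is the text of the second body paragraph ... </paragraaf>
  </kern>
  <slot>
    <paragraaf> ... here is the text of the concluding paragraph ... </paragraaf>
  </slot>
</tekst>
```

The paragraph elements now sit one level deeper in the hierarchy, so a search can be restricted to, say, the paragraphs of the conclusion only.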

Elements that have no strictly hierarchical or linear order, but can occur anywhere in the text, are called floating elements. In a newspaper article, it might be relevant to highlight names and quotes:

<paragraaf>
En de primeur blijkt nog van Nederlandse bodem te komen ook. De Nederlandse tech-website <naam>Tweakers.net</naam> stelde maandag met een hittegevoelige camera vast dat de nieuwe <naam>iPad</naam> aanzienlijk warmer werd dan zijn voorganger. <citaat>'Dat nieuws werd eerst alleen opgepikt door een aantal grote IT-sites'</citaat>, zegt <naam>Tweakers</naam>-redacteur <naam>Dimitri Reijerman</naam>.
</paragraaf>

Attributes

Attributes can be used to tag certain characteristics of elements. Attributes are specified in the start tag. They can be used to highlight characteristics of the article ('metadata'), for example,

<artikel bron="De Pers" datum="22-03-2012" katern="Eerst" onderwerp="consumentenzaken">
(...)
</artikel>

as well as to number elements or to indicate the type of a name:

<paragraaf nr="3">
En de primeur blijkt nog van Nederlandse bodem te komen ook. De Nederlandse tech-website <naam type="bedrijf">Tweakers.net</naam> stelde maandag met een hittegevoelige camera vast dat de nieuwe <naam type="product">iPad</naam> aanzienlijk warmer werd dan zijn voorganger. <citaat>'Dat nieuws werd eerst alleen opgepikt door een aantal grote IT-sites'</citaat>, zegt <naam type="bedrijf">Tweakers</naam>-redacteur <naam type="persoon">Dimitri Reijerman</naam>.
</paragraaf>

Free or fixed document structure

As mentioned above, you are largely free to structure documents however you want. You are also free to choose the names of your tags (although it is of course advisable to use meaningful names). An XML document is well-formed if it meets a number of general basic requirements, such as:
  • start tags must begin with '<' and end with '>'; end tags must begin with '</' and end with '>';
  • an XML document must always have exactly one parent element in which all other elements are nested, which is called the root element or start element;
  • all tags must be properly nested (overlapping elements are not permitted);
  • an XML document must be balanced (there must always be a corresponding end tag for each start tag);
  • element and attribute names are case-sensitive (the tags <Paragraaf> and </paragraaf> do not correspond);
  • attribute values must be enclosed in quotation marks (either " or ').
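The requirements listed above can be illustrated with a small document that meets all of them. This is only a sketch: the content is abbreviated, and the comments point out what would break well-formedness at each spot.

```xml
<?xml version="1.0"?>
<!-- A single root element, artikel, encloses all other elements. -->
<artikel bron="De Pers">
  <tekst>
    <!-- The attribute value is quoted; nr='1' would also be accepted, nr=1 would not. -->
    <paragraaf nr="1">
      <!-- citaat starts and ends inside paragraaf, so the elements nest and do not overlap. -->
      ... <citaat>'...'</citaat> ...
    </paragraaf>
    <!-- Each end tag repeats the name of its start tag exactly; </Paragraaf> would not close <paragraaf>. -->
  </tekst>
</artikel>
```

An XML parser will reject a document that violates any of these rules, which makes well-formedness an easy first check on tagged material.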

However, working with a completely free document structure might have its disadvantages. The main disadvantages are that it can lead to messy XML documents that are not very workable and that it makes the tagging process more prone to error and more labour-intensive.

That is why many (research) projects work with a specified document structure, which is laid down in a so-called document type definition (DTD) or an XML schema. These consist of a set of grammar-like rules that specify which elements are required for a certain document type and which are optional, which elements can appear multiple times in succession, the permitted hierarchical and linear order of the elements, and which elements can occur as floating elements (and where they may occur). XML documents that are well-formed and also conform to such rules are called 'valid'. A simple example of a DTD for the (highly simplified!) document type 'article', based on the examples given above, can be found below:

<?xml version="1.0"?>
<!-- XML document for the document type 'newspaper article' -->
<!-- Eric Akkerman, 23-03-2012 -->
<!DOCTYPE artikel [
<!ELEMENT artikel        (kop, byline, plaats, intro, tekst) >
<!ELEMENT tekst          (paragraaf)+ >
<!ELEMENT paragraaf      (#PCDATA | citaat | naam)*>
<!ELEMENT naam           (voornaam, tussenvoegsel?, achternaam)>
<!ELEMENT kop            (#PCDATA)>
<!ELEMENT byline         (#PCDATA)>
<!ELEMENT plaats         (#PCDATA)>
<!ELEMENT intro          (#PCDATA)>
<!ELEMENT citaat         (#PCDATA)>
<!ELEMENT voornaam       (#PCDATA)>
<!ELEMENT tussenvoegsel  (#PCDATA)>
<!ELEMENT achternaam     (#PCDATA)>
<!ATTLIST naam type CDATA #REQUIRED >
<!ATTLIST artikel bron CDATA #REQUIRED
                  datum CDATA #REQUIRED
                  katern CDATA #REQUIRED
                  onderwerp CDATA #REQUIRED>
<!ATTLIST paragraaf nr CDATA #REQUIRED >
]>

Brief explanation: a comma between two elements indicates that the first element must be immediately followed by the second. A '+' indicates that an element (or group of elements) occurs one or more times in a row, a '*' indicates that it can occur zero or more times, and a '?' indicates that it can occur zero times or one time. A '|' separates alternatives. (#PCDATA) indicates that the content of an element consists of actual text. The element 'paragraaf' consists of a mix of text and the elements 'citaat' (quote) and 'naam' (name); this construction indicates that the elements in question can occur anywhere in the running text.
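To make these rules concrete, here is a sketch of an article that should be valid against the declarations above. It assumes that the declaration for 'paragraaf' reads (#PCDATA | citaat | naam)* and that the date attribute of 'artikel' is called 'datum', as in the attribute example earlier on this page. In a complete document this element would follow directly after the closing ']>' of the DOCTYPE declaration.

```xml
<artikel bron="De Pers" datum="22-03-2012" katern="Eerst" onderwerp="consumentenzaken">
  <kop>Apple weer slachtoffer van eigen succes</kop>
  <byline>Edward Deiters</byline>
  <plaats>Amsterdam</plaats>
  <intro> ... here is the text of the intro ... </intro>
  <tekst>
    <paragraaf nr="1"><citaat>'Dat nieuws werd eerst alleen opgepikt door een aantal grote IT-sites'</citaat>, zegt <naam type="persoon"><voornaam>Dimitri</voornaam> <achternaam>Reijerman</achternaam></naam>.</paragraaf>
  </tekst>
</artikel>
```

Note that under these declarations a 'naam' element that holds plain text, such as <naam type="bedrijf">Tweakers.net</naam>, would not be valid, because 'naam' is declared to contain only the elements 'voornaam', 'tussenvoegsel' and 'achternaam'. A less simplified DTD would have to allow for such names as well.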

An example of a DTD for literary texts (novels, stories, poems and plays) is XLDL by Ister-ORG.
