Digital Humanities Workbench



Examples of how XML is used in linguistic research

XML is increasingly used for the structural and analytical annotation of text corpora. In many cases, the tags follow the guidelines of the Text Encoding Initiative (TEI).

Example of structural markup (taken from the English Gigaword Corpus).
N.B. <P> stands for "Paragraph".

<DOC id="AFE19940514.0014" type="story" >
<HEADLINE>
Queen Beatrix to appoint party negotiators to explore coalition
</HEADLINE>
<DATELINE>
THE HAGUE, May 14 (AFP)
</DATELINE>
<TEXT>
<P>
Queen Beatrix was expected Saturday to formally appoint three party officials to negotiate a broad coalition government for the Netherlands, thrown into political turmoil after this month's general election.
</P>
<P>
The Christian Democrats (CDA), who have dominated the political scene for most of this century, lost 20 seats in the vote on May 3, retaining only 34 in the 150-seat lower house of parliament.
</P>
(...)
</TEXT>
</DOC>
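
Structural markup of this kind can be read with any XML parser. The following is a minimal Python sketch using the standard-library module xml.etree.ElementTree. It assumes a well-formed document, so the sample carries a closing </DOC> tag, and its paragraphs are shortened for brevity. The function name read_story is hypothetical and not part of any corpus tooling.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the excerpt: paragraphs shortened, closing </DOC> present.
SAMPLE = """<DOC id="AFE19940514.0014" type="story" >
<HEADLINE>
Queen Beatrix to appoint party negotiators to explore coalition
</HEADLINE>
<DATELINE>
THE HAGUE, May 14 (AFP)
</DATELINE>
<TEXT>
<P>
Queen Beatrix was expected Saturday to formally appoint three party officials.
</P>
<P>
The Christian Democrats (CDA) lost 20 seats in the vote on May 3.
</P>
</TEXT>
</DOC>"""


def read_story(xml_text):
    """Return the id, headline and paragraph texts of one <DOC> element."""
    doc = ET.fromstring(xml_text)
    return {
        "id": doc.get("id"),
        "headline": doc.findtext("HEADLINE", default="").strip(),
        # Each <P> holds one paragraph; strip the surrounding line breaks.
        "paragraphs": [(p.text or "").strip() for p in doc.iter("P")],
    }


story = read_story(SAMPLE)
print(story["id"], story["headline"], len(story["paragraphs"]), sep=" | ")
```

A file that contains several <DOC> elements in a row has no single root element, so it would have to be wrapped in one before it can be parsed this way.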



Example of the annotation of clauses, in the context of research into speech, thought and writing presentation.
N.B. Each clause is marked as an 'sptag' element. The 'cat' attribute indicates whether it is a 'reporting clause' (NRS), 'direct speech' (DS), 'free indirect speech' (FIS), etc. The 'who' attribute contains a speaker code, 'w' gives the number of words in the clause and 's' gives the proportion of the sentence that the clause makes up (as a fraction between 0 and 1). Taken from The Lancaster Speech, Writing and Thought Presentation Written Corpus, converted from SGML to XML.

<sptag cat="NRS" who="B" next="IS" whonext="B" s="0.21" w="3">
I asked him
</sptag>
<sptag cat="FIS" who="B" next="DS" whonext="B" s="0.43" w="6">
what Franco was doing down here.
</sptag>
<sptag cat="DS" who="L" next="NRS" whonext="L" s="0.64" w="7">
'He is opening the new Almeria airport,'
</sptag>
<sptag cat="NRS" who="L" next="N" s="0.36" w="4">
he said with pride.
<p />
</sptag>
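
Because the categories and word counts are stored as attributes, simple statistics can be computed directly from the markup. The sketch below, in Python with the standard-library xml.etree.ElementTree module, adds up the 'w' values per 'cat' value. The wrapping <corpus> element and the function name words_per_category are assumptions made for this illustration; they do not come from the corpus itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The four clauses shown above, copied unchanged.
FRAGMENT = """<sptag cat="NRS" who="B" next="IS" whonext="B" s="0.21" w="3">
I asked him
</sptag>
<sptag cat="FIS" who="B" next="DS" whonext="B" s="0.43" w="6">
what Franco was doing down here.
</sptag>
<sptag cat="DS" who="L" next="NRS" whonext="L" s="0.64" w="7">
'He is opening the new Almeria airport,'
</sptag>
<sptag cat="NRS" who="L" next="N" s="0.36" w="4">
he said with pride.
<p />
</sptag>"""


def words_per_category(fragment):
    """Sum the 'w' (word count) attribute for each 'cat' value."""
    # The fragment has no single root element, so it is wrapped in a
    # hypothetical <corpus> element to make it parseable.
    root = ET.fromstring("<corpus>" + fragment + "</corpus>")
    totals = Counter()
    for tag in root.iter("sptag"):
        totals[tag.get("cat")] += int(tag.get("w"))
    return totals


totals = words_per_category(FRAGMENT)
print(dict(totals))  # {'NRS': 7, 'FIS': 6, 'DS': 7}
```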



Example of morphosyntactic annotation of the sentence 'Moreover, the analysis of skills provides a common topic of research for both art and science historians.' Derived from the BNC Baby corpus.

<s n="820">
  <w type="AV0" lemma="moreover">Moreover</w>
  <c type="PUN">, </c>
  <w type="AT0" lemma="the">the </w>
  <w type="NN1" lemma="analysis">analysis </w>
  <w type="PRF" lemma="of">of </w>
  <w type="NN2" lemma="skill">skills </w>
  <w type="VVZ" lemma="provide">provides </w>
  <w type="AT0" lemma="a">a </w>
  <w type="AJ0" lemma="common">common </w>
  <w type="NN1" lemma="topic">topic </w>
  <w type="PRF" lemma="of">of </w>
  <w type="NN1" lemma="research">research </w>
  <w type="PRP" lemma="for">for </w>
  <w type="AV0" lemma="both">both </w>
  <w type="NN1" lemma="art">art </w>
  <w type="CJC" lemma="and">and </w>
  <w type="NN1" lemma="science">science </w>
  <w type="NN2" lemma="historian">historians</w>
  <c type="PUN">.</c>
</s>
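
In this annotation every word is a <w> element whose 'type' attribute holds the part-of-speech tag and whose 'lemma' attribute holds the base form, so word lists can be filtered by tag. The Python sketch below (standard-library xml.etree.ElementTree) works on the first part of the sentence only. The function names tokens and noun_lemmas are hypothetical, and treating every tag that starts with "NN" as a noun is an assumption based on the tags visible in the example.

```python
import xml.etree.ElementTree as ET

# First seven tokens of the sentence above; the rest is left out for brevity.
SENTENCE = """<s n="820">
  <w type="AV0" lemma="moreover">Moreover</w>
  <c type="PUN">, </c>
  <w type="AT0" lemma="the">the </w>
  <w type="NN1" lemma="analysis">analysis </w>
  <w type="PRF" lemma="of">of </w>
  <w type="NN2" lemma="skill">skills </w>
  <w type="VVZ" lemma="provide">provides </w>
</s>"""


def tokens(sentence_xml):
    """Return (word form, tag, lemma) for every <w> element, in order."""
    s = ET.fromstring(sentence_xml)
    # Word forms carry a trailing space in the markup, so strip it.
    return [(w.text.strip(), w.get("type"), w.get("lemma")) for w in s.iter("w")]


def noun_lemmas(sentence_xml):
    """Return the lemmas of all words whose tag starts with 'NN'."""
    return [lemma for _, tag, lemma in tokens(sentence_xml) if tag.startswith("NN")]


print(tokens(SENTENCE)[0])    # ('Moreover', 'AV0', 'moreover')
print(noun_lemmas(SENTENCE))  # ['analysis', 'skill']
```

Punctuation is held in <c> elements and is therefore skipped automatically when only <w> elements are iterated.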



Example of syntactic annotation of the Dutch sentence 'Er klinkt een zacht geluid.' ('A soft sound is heard.'), produced by automatic syntactic analysis with the Alpino parser.

<?xml version="1.0" encoding="ISO-8859-1"?>
<alpino_ds version="1.0">
  <node id="0" rel="top" cat="top" begin="0" end="6">
    <node id="1" rel="--" cat="smain" begin="0" end="5">
      <node id="2" rel="mod" pos="adv" begin="0" end="1" root="er" word="Er"/>
      <node id="3" rel="hd" pos="verb" begin="1" end="2" root="klink" word="klinkt"/>
      <node id="4" rel="su" cat="np" begin="2" end="5">
        <node id="5" rel="det" pos="det" begin="2" end="3" root="een" word="een"/>
        <node id="6" rel="mod" pos="adj" begin="3" end="4" root="zacht" word="zacht"/>
        <node id="7" rel="hd" pos="noun" begin="4" end="5" root="geluid" word="geluid"/>
      </node>
    </node>
    <node id="8" rel="--" pos="punct" begin="5" end="6" root="." word="."/>
  </node>
  <sentence>Er klinkt een zacht geluid.</sentence>
</alpino_ds>
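
The nested <node> elements form a tree: nodes with a 'cat' attribute are phrases, nodes with a 'word' attribute are the words themselves, and 'rel' gives the grammatical relation of a node to its parent. The Python sketch below (standard-library xml.etree.ElementTree) lists the words in sentence order and collects the words of the subject. The function names leaves and subject_phrase are hypothetical, and the XML declaration is left out of the sample string only to keep it a plain Python string.

```python
import xml.etree.ElementTree as ET

# The parse shown above, without the XML declaration line.
ALPINO = """<alpino_ds version="1.0">
  <node id="0" rel="top" cat="top" begin="0" end="6">
    <node id="1" rel="--" cat="smain" begin="0" end="5">
      <node id="2" rel="mod" pos="adv" begin="0" end="1" root="er" word="Er"/>
      <node id="3" rel="hd" pos="verb" begin="1" end="2" root="klink" word="klinkt"/>
      <node id="4" rel="su" cat="np" begin="2" end="5">
        <node id="5" rel="det" pos="det" begin="2" end="3" root="een" word="een"/>
        <node id="6" rel="mod" pos="adj" begin="3" end="4" root="zacht" word="zacht"/>
        <node id="7" rel="hd" pos="noun" begin="4" end="5" root="geluid" word="geluid"/>
      </node>
    </node>
    <node id="8" rel="--" pos="punct" begin="5" end="6" root="." word="."/>
  </node>
  <sentence>Er klinkt een zacht geluid.</sentence>
</alpino_ds>"""


def leaves(root):
    """Return (word, pos, rel) for each word node, ordered by 'begin'."""
    words = [n for n in root.iter("node") if n.get("word") is not None]
    words.sort(key=lambda n: int(n.get("begin")))
    return [(n.get("word"), n.get("pos"), n.get("rel")) for n in words]


def subject_phrase(root):
    """Return the words under the first node with rel="su", or None."""
    for node in root.iter("node"):
        if node.get("rel") == "su":
            # iter() visits the node and all its descendants in document order.
            return " ".join(w.get("word") for w in node.iter("node") if w.get("word"))
    return None


root = ET.fromstring(ALPINO)
print(leaves(root)[1])       # ('klinkt', 'verb', 'hd')
print(subject_phrase(root))  # een zacht geluid
```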