Digital Humanities Workbench



Structured data

Structured data are data that are organized and stored according to a predefined format. Many types of analysis (such as statistical, network, geospatial and temporal analysis) require structured data. Certain types of data collection, such as surveys and experiments, by their nature result in a structured data set.

Digitized textual and (audio)visual sources, by contrast, are usually not organized in a predefined manner, which is why they are often referred to as unstructured data. In many cases they nevertheless contain information that can be subjected to the types of analysis mentioned above, provided that this information is extracted from the source and stored in a structured manner. Usually, this concerns descriptive data (resulting from an inventory, e.g. based on archive sources) and/or interpretative data (resulting from analysis), combined with metadata describing the provenance and other characteristics of the source. The process by which textual and (audio)visual data are converted into structured data is called data modelling.

In humanities research, structured data are typically stored in tables, either in an Excel spreadsheet (for simple, so-called flat data structures) or in a database. These tables contain the results of a survey or an experiment or, if your research is based on the study of primary sources, the information extracted from the texts together with the relevant metadata. An important characteristic of structured data is their repetitive nature: every record in a structured data set conforms to the same format.
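The repetitive nature of structured data can be made concrete with a minimal Python sketch. The survey records and field names below are invented for illustration; the point is only that every record has the same fields, which is what makes it possible to read off a whole column at once.

```python
# Hypothetical survey records; the field names and values are invented for illustration.
rows = [
    {"respondent_id": 1, "age": 34, "answer": "yes"},
    {"respondent_id": 2, "age": 51, "answer": "no"},
    {"respondent_id": 3, "age": 27, "answer": "yes"},
]


def conforms(rows):
    """Return True if every row has exactly the same fields as the first row."""
    if not rows:
        return True
    expected = set(rows[0])
    return all(set(row) == expected for row in rows)


def column(rows, name):
    """Collect the values of one field across all rows."""
    return [row[name] for row in rows]


is_uniform = conforms(rows)
ages = column(rows, "age")
```

A record with a missing or extra field would make `conforms` return `False`, which is a simple way to spot rows that do not fit the predefined format before analysis.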

Example

For her research into verb-particle combinations in English (such as "to give up" and "to take off"), Olga Steenhoek extracted verb-particle combinations from a text corpus and categorised them on the basis of several characteristics, including the verb pattern, the metaphorical meaning of the verb and/or the particle, and semantic aspects. She stored the resulting data set in a table with the following structure:


[Image: structure of the table holding the verb-particle data set]

N.B. The table is simplified for instructional reasons.
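To give an impression of what such a flat table could look like in code, here is a minimal Python sketch. The column names, the pattern label and the annotation values are assumptions made for this illustration; they are not the actual columns or annotations of the research data set.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class VerbParticleRecord:
    """One row of the table; all field names are hypothetical."""
    record_id: int
    source_file: str            # reference back to the source text
    sentence: str               # part of the source copied into the table
    verb: str
    particle: str
    verb_pattern: str           # placeholder label, not the study's coding scheme
    verb_metaphorical: bool
    particle_metaphorical: bool


def to_csv(records):
    """Serialize records to CSV text: one header row, then one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(VerbParticleRecord)]
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder values for illustration only; they are not actual annotations.
records = [
    VerbParticleRecord(1, "text_001.txt", "He decided to give up.",
                       "give", "up", "V-Prt", True, True),
    VerbParticleRecord(2, "text_002.txt", "The plane will take off soon.",
                       "take", "off", "V-Prt", False, True),
]

csv_text = to_csv(records)
```

The resulting CSV text can be opened in a spreadsheet program or imported into a database. Each annotation lives in its own column, and only the `source_file` and `sentence` columns point back to the text.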

Semi-structured data

If a data set is stored in one or more tables, the annotations are disconnected from the text itself. The only connection between the data and the source is that the data set may contain a reference to the source, such as a filename, an identification number or an archive number, and possibly a link to the source itself. Sometimes, as in the example given above, part of the source (e.g. a sentence) is copied to the database, but even then there is no direct link between specific annotations and the parts of the sentence they refer to.

When the source itself, or the relationship between the extracted data and the source, is important, the text itself can be annotated. Because this results in a mix of unstructured and structured data, we often speak of semi-structured data. Semi-structured data are typically created with the markup language XML. Below is an example of an XML document containing the same kind of information as the table in the example above, for the sentence "Mr Gummer should pick up Mr MacSharry's ideas and remould them to meet sensible criteria":
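As a rough illustration of inline annotation, the Python sketch below marks up the verb-particle combination in that sentence and then reads both the plain text and the annotations back out using the standard library module `xml.etree.ElementTree`. The element and attribute names (`sentence`, `vpc`, `verb`, `particle`, `pattern`, `lemma`, `source`) and the attribute values are assumptions made for this sketch, not the markup scheme used in the research.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tags wrap the exact words that the annotations refer to.
xml_text = (
    '<sentence id="s1" source="text_001.txt">Mr Gummer should '
    '<vpc pattern="V-Prt-NP"><verb lemma="pick">pick</verb> '
    '<particle>up</particle></vpc>'
    " Mr MacSharry's ideas and remould them to meet sensible criteria"
    '</sentence>'
)

root = ET.fromstring(xml_text)

# The unstructured part: the original sentence, recovered by dropping the tags.
plain_text = "".join(root.itertext())

# The structured part: one record per annotated verb-particle combination.
annotations = [
    {
        "verb": vpc.findtext("verb"),
        "particle": vpc.findtext("particle"),
        "pattern": vpc.get("pattern"),
    }
    for vpc in root.iter("vpc")
]
```

Unlike a row in a table, each annotation here is attached to the words it describes, while the sentence as a whole remains readable as running text.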


More information

More information about (aspects of) structured data can be found in the sections data modelling, databases, formal annotation and XML elsewhere in this workbench.
