Digital Humanities Workbench


Data modelling

Data modeling is part of the crucial "DH-specific" intellectual work of translating between the (often implicit) understandings that scholars have of the objects they study and the affordances of a particular digital technology. (*)

Digitized sources in their original form (whether as an image or as a scanned or transcribed text) offer limited possibilities for analysis. The main reason is that most text analysis software has difficulty extracting conceptual information from such sources. Text may contain ambiguities, different nuances of meaning, indirect or vague descriptions, and so on. Moreover, since texts are not organised in a pre-defined manner, they usually contain all kinds of irregularities: information concerning facts, persons, places, dates, etc. can be found in different places in a text and is often not clear-cut. A date, for example, may be precise or broad, may be written in numbers (1900) or in letters (19th century), or may be descriptive ("in the year World War II broke out"). Relations between persons, places and dates are usually not described in a structured way either. This is why these sources are often referred to as unstructured data.

Many types of digital analysis (such as statistical analysis, network analysis, and geospatial and temporal analysis) require structured data: data organized and stored according to a predefined format. This format is laid down in a so-called data model, which explicitly determines the structure of the data. Certain types of data collection, such as surveys and experiments, by their nature result in a structured data set; here, data modelling is part of designing the survey or the experiment. For unstructured data, a scholar must employ specific techniques to translate concepts and source characteristics into computable objects. Data modelling is thus a kind of formal knowledge organisation or knowledge representation. It results in a data structure that can store descriptive data (resulting from an inventory, e.g. based on archive sources) and/or interpretative data (resulting from analysis), combined with metadata describing the provenance and other characteristics of the source.

A data model organizes data elements and specifies how they relate to one another and by which properties they are described. For example, a data model for a correspondence network based on a collection of letters may specify that the data element representing a letter is composed of properties like source (archive and archive number), date, sender, addressee, location of addressee, etc. Another data element will represent the persons related to the letters (sender, addressee, but also persons who are mentioned in the letters), with properties like first name, last name, sex, profession, function, affiliation, etc.
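The letter/person model just described can be sketched as two record types. The following is a minimal Python sketch; all field names are illustrative, not part of any fixed standard:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Person:
    """One record per person, whether sender, addressee, or merely mentioned."""
    person_id: int
    first_name: str
    last_name: str
    profession: Optional[str] = None

@dataclass
class Letter:
    """One record per letter; persons are referenced by their person_id."""
    source: str                      # archive and archive number
    date: str                        # may be precise ("1900") or broad ("19th century")
    sender_id: int
    addressee_id: int
    addressee_location: str
    mentioned_ids: list = field(default_factory=list)  # persons mentioned in the letter
```

Keeping persons as separate records referenced by an identifier is exactly what prepares the ground for the database model discussed further on.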

Data modelling consists of a number of steps. The first is conceptual data modelling: the identification and description of the entities in the sources relating to the subject of research, and of the relationships between them. A popular notation for the conceptual model is the so-called entity-relationship diagram. The second step is logical data modelling, in which the tables of a database are defined according to the underlying relational model. The third step, physical data modelling, deals with optimizing the database for performance in an actual implementation.
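As a sketch of the first two steps, assume a conceptual model with two entities, Person and Letter, connected by a "sends" relationship. The logical step turns this into relational table definitions, shown here with Python's built-in sqlite3 module (all table and column names are illustrative):

```python
import sqlite3

# Conceptual model (ER sketch):  PERSON --sends--> LETTER
# Logical model: each entity becomes a table; the relationship becomes
# foreign-key columns referring to the person table's primary key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT
);
CREATE TABLE letter (
    letter_id    INTEGER PRIMARY KEY,
    archive_no   TEXT,
    letter_date  TEXT,
    sender_id    INTEGER REFERENCES person(person_id),
    addressee_id INTEGER REFERENCES person(person_id)
);
""")
```

The third, physical step would then add indexes and storage choices on top of these definitions.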

There are various types of data models. In humanities research the following models are prevalent: the database model, the document-oriented data model and the RDF model.

Database model

The database model is commonly used to model data that have a fixed structure. In this model, objects (often called entities) with certain properties (attributes) are stored in one or more tables. Each object is described in one so-called record. Most data models aim to minimize redundancy (the repetition of identical data). For example, if a certain person is connected to more than one letter, you will only want to store the information concerning that person once. This also allows you to link variant names for the same person to one record in the persons table. If a database consists of more than one table, an essential part of the modelling process is to ensure that these tables are linked correctly. For this purpose, in each table a particular attribute or combination of attributes is designated as a primary key that can be referred to in other tables.

Although there are various kinds of database models, the two most common in the humanities are the flat model and the relational model. The flat model is the simplest: all the data are listed in a single table consisting of columns and rows. The relational model can be used to store complex data efficiently and without redundancy.
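The following minimal sketch, again using Python's built-in sqlite3 module, shows the relational model's advantage (the person and dates are invented for illustration): the sender is stored once and referred to by primary key, however many letters they sent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE letter (
    letter_id   INTEGER PRIMARY KEY,
    letter_date TEXT,
    sender_id   INTEGER REFERENCES person(person_id)
);
-- The sender is stored once and referenced twice: no redundancy.
INSERT INTO person VALUES (1, 'Hugo Grotius');
INSERT INTO letter VALUES (10, '1625-03-01', 1);
INSERT INTO letter VALUES (11, '1627-06-15', 1);
""")

# Joining on the primary key reunites each letter with its sender.
rows = conn.execute("""
    SELECT p.name, l.letter_date
    FROM letter AS l JOIN person AS p ON l.sender_id = p.person_id
    ORDER BY l.letter_date
""").fetchall()
# rows == [('Hugo Grotius', '1625-03-01'), ('Hugo Grotius', '1627-06-15')]
```

In a flat model, by contrast, the sender's name would be repeated in every letter row, and correcting a name variant would mean editing every occurrence.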


Document-oriented data model

Whereas databases are commonly used for structured data, textual documents usually do not have a fixed structure. In the humanities, the XML model is often used to model textual data, including the text itself. Using XML tags, the structure of the text can be marked: chapters, paragraphs, headings, notes, etc. XML is also particularly suited to marking textual elements that may occur anywhere in a text, such as names, dates, metaphors, syntactic elements, etc. In the so-called attributes of these tags, specific characteristics of these elements can be recorded. For names, for example, you can indicate whether a name concerns a person, a geographic location, a company, etc. Attributes also allow you to indicate relations between elements in a text. Textual metadata are usually stored in a so-called file header at the beginning of the text file.

When you work with XML, the data model is laid down in a so-called document type definition (DTD) or in an XML schema.
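A small sketch of such markup, processed with Python's built-in xml.etree.ElementTree module; the tag and attribute names are illustrative (loosely TEI-flavoured), not a prescribed schema:

```python
import xml.etree.ElementTree as ET

xml = """<p>In <date when="1625">1625</date>,
<name type="person">Grotius</name> escaped to
<name type="place">Paris</name>.</p>"""

root = ET.fromstring(xml)

# The type attribute distinguishes kinds of names: persons vs. places.
persons = [el.text for el in root.iter("name") if el.get("type") == "person"]
places = [el.text for el in root.iter("name") if el.get("type") == "place"]
# persons == ['Grotius'], places == ['Paris']
```

Note how the same element (name) serves both kinds of entities, with the attribute carrying the distinction; a DTD or XML schema would declare which elements and attributes are allowed and where.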


RDF model

The Resource Description Framework (RDF) was originally designed as a data model for metadata. Nowadays it is used as a general method for the conceptual description or modelling of information implemented in web resources. It is the data model underlying Linked Data and the Semantic Web: data are expressed as triples, which link together into a network or graph of nodes and arcs.
See the page on Linked Open Data in this workbench for more information about RDF.
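The triple structure can be sketched in plain Python; the ex: identifiers below are invented for illustration, and real Linked Data would use full URIs, typically handled with a library such as rdflib:

```python
# Each statement is a (subject, predicate, object) triple; together the
# triples form a graph whose nodes are resources and literal values and
# whose arcs are predicates.
triples = [
    ("ex:letter42", "ex:sentBy",      "ex:grotius"),
    ("ex:letter42", "ex:addressedTo", "ex:oxenstierna"),
    ("ex:grotius",  "ex:name",        "Hugo Grotius"),
]

# Querying the graph is pattern matching over triples: who sent ex:letter42?
senders = [o for (s, p, o) in triples
           if s == "ex:letter42" and p == "ex:sentBy"]
# senders == ['ex:grotius']
```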
