Digital Humanities Workbench
Data modelling
Data modeling is part of the crucial "DH-specific" intellectual work of translating between the (often implicit) understandings that scholars have of the objects they study and the affordances of a particular digital technology. (*)
Digitized sources in their original form (whether as an image or as a scanned or transcribed text) offer limited possibilities for analysis. The main reason is that most text analysis software has difficulty extracting conceptual information from these sources. Texts may contain ambiguities, different nuances of meaning, indirect or vague descriptions, etc. Moreover, since they are not organised in a predefined manner, they usually contain all kinds of irregularities: information concerning facts, persons, places, dates, etc., can be found at different places in a text and is often not clear-cut. A date, for example, may be precise or broad, may be written in numbers (1900) or in words (19th century), or may be descriptive ("in the year World War II broke out"). Relations between persons, places and dates are usually not described in a structured way either. This is why these sources are often referred to as unstructured data.
Many types of digital analysis (such as statistical analysis, network analysis and geospatial and temporal analysis) require structured data, which are organized and stored according to a predefined format. This format is laid down in a so-called data model, which explicitly determines the structure of the data. Certain types of data collection, like surveys and experiments, by their nature result in a structured data set. Here, data modelling is part of designing the survey or the experiment. For unstructured data, a scholar must employ specific techniques to translate concepts and source characteristics into computable objects.
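As an illustration of turning a source characteristic into a computable object, the three kinds of date mentioned above could be stored as a range of years together with the original wording. This is only a minimal sketch: the `DateRange` class, the `model_date` function and the hand-made lookup table are hypothetical, and counting the 19th century as 1801 to 1900 is one convention among several.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DateRange:
    """A date stored as an inclusive range of years, plus the source wording."""
    earliest: int
    latest: int
    source_text: str

    @property
    def is_precise(self) -> bool:
        # A date is precise when the range covers a single year.
        return self.earliest == self.latest


def model_date(source_text: str) -> Optional[DateRange]:
    """Map a date expression to a DateRange using a hand-made lookup table.

    The table is an assumption for this sketch; in a real project these
    values would come from manual annotation or a dedicated date parser.
    """
    known = {
        "1900": (1900, 1900),
        "19th century": (1801, 1900),
        "in the year World War II broke out": (1939, 1939),
    }
    years = known.get(source_text)
    if years is None:
        return None  # the expression still needs to be interpreted by hand
    return DateRange(years[0], years[1], source_text)


for text in ("1900", "19th century", "in the year World War II broke out"):
    date = model_date(text)
    print(date.source_text, date.earliest, date.latest, date.is_precise)
```

Keeping the original wording next to the interpreted range means the interpretation can always be checked against the source.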
Thus, data modelling is a kind of formal knowledge organisation or knowledge representation. It results in a data structure that can store descriptive data (resulting from an inventory, e.g. based on archive sources) and/or interpretative data (resulting from analysis), combined with metadata describing the provenance and other characteristics of the source. A data model organizes data elements and specifies how they relate to one another and by which properties they are described. For example, a data model for a correspondence network based on a collection of letters may specify that the data element representing a letter is composed of properties like source (archive and archive number), date, sender, addressee, location of addressee, etc. Another data element will represent the persons related to the letters (sender, addressee, but also persons who are mentioned in the letters), with properties like first name, last name, sex, profession, function, affiliation, etc.
Data modelling consists of a number of different steps. The first is conceptual data modelling: the identification and description of the entities and their relationships in the sources relating to the subject of research. A popular notation for the conceptual model is the so-called entity-relationship diagram. The second step is logical data modelling, in which the tables of a database are defined according to the underlying relational model. The third step, physical data modelling, deals with optimizing the database for performance in an actual implementation.
There are various types of data models. In humanities research the following models are prevalent: the database model, the document-oriented data model and the RDF model.
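The correspondence example described above, with a letter element and a person element, could be sketched as two Python data classes. This is a hedged illustration rather than a prescribed model: the class and field names, the identifiers and the sample values are all invented for the example, and only some of the properties listed above are included.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Person:
    """A person related to the letters (sender, addressee or mentioned)."""
    person_id: int
    first_name: str
    last_name: str
    sex: Optional[str] = None
    profession: Optional[str] = None
    affiliation: Optional[str] = None


@dataclass
class Letter:
    """A letter, described by its source, date and the persons involved."""
    letter_id: int
    archive: str
    archive_number: str
    date: str
    sender_id: int      # refers to Person.person_id
    addressee_id: int   # refers to Person.person_id
    addressee_location: Optional[str] = None
    mentioned_ids: List[int] = field(default_factory=list)


def correspondence_edges(letters: List[Letter]) -> List[Tuple[int, int]]:
    """Return (sender, addressee) pairs, the raw material for network analysis."""
    return [(letter.sender_id, letter.addressee_id) for letter in letters]


# Invented sample data, not taken from any real archive.
persons = [
    Person(1, "Anna", "Example", profession="publisher"),
    Person(2, "Bram", "Sample", profession="poet"),
]
letters = [
    Letter(1, "Example Archive", "A-001", "1900-01-15",
           sender_id=1, addressee_id=2, addressee_location="Amsterdam"),
    Letter(2, "Example Archive", "A-002", "1900-02-03",
           sender_id=2, addressee_id=1),
]
edges = correspondence_edges(letters)
print(edges)
```

Letters refer to persons by identifier instead of repeating their names, which is the relationship between the two data elements that the model makes explicit.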
Although there are various kinds of database models, the two most common in the humanities are the flat model and the relational model. The flat model is the simplest data model: all the data are listed in a single table consisting of columns and rows. The relational model divides the data over several linked tables, which makes it possible to store complex data efficiently and without redundancy.
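The difference between the two database models can be sketched with Python's built-in `sqlite3` module. The table names, column names and sample rows below are assumptions made for this illustration. Each person is stored once in a `person` table, and a join produces the flat, single-table view in which the same details recur on every letter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database

# Relational model: persons and letters live in separate, linked tables.
conn.executescript("""
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    profession TEXT
);
CREATE TABLE letter (
    letter_id    INTEGER PRIMARY KEY,
    date         TEXT,
    sender_id    INTEGER REFERENCES person(person_id),
    addressee_id INTEGER REFERENCES person(person_id)
);
""")

# Invented sample data.
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", [
    (1, "Anna", "Example", "publisher"),
    (2, "Bram", "Sample", "poet"),
])
conn.executemany("INSERT INTO letter VALUES (?, ?, ?, ?)", [
    (1, "1900-01-15", 1, 2),
    (2, "1900-02-03", 2, 1),
    (3, "1900-03-20", 1, 2),
])

# Flat view: one row per letter, with the sender's details repeated each time.
flat_rows = conn.execute("""
SELECT l.letter_id, l.date, s.last_name, s.profession, a.last_name
FROM letter AS l
JOIN person AS s ON s.person_id = l.sender_id
JOIN person AS a ON a.person_id = l.addressee_id
ORDER BY l.letter_id
""").fetchall()

for row in flat_rows:
    print(row)
```

In the flat view the sender's profession appears on every letter that person wrote, so a correction would have to be made in several rows. In the relational tables it is stored, and corrected, in one place.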
When you work with XML, the data model is laid down in a so-called document type definition (DTD) or in an XML schema.
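To give an impression of what such a definition constrains, the sketch below checks by hand that every `letter` element in a small XML document has the child elements a model might require. The element names, the `REQUIRED_CHILDREN` rule and the sample document are invented for this illustration. Python's standard `xml.etree.ElementTree` module parses XML but does not validate it against a DTD or XML schema, so a real project would use a validating parser instead of this manual check.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the second letter lacks an addressee.
DOCUMENT = """
<correspondence>
  <letter id="1">
    <date>1900-01-15</date>
    <sender>Anna Example</sender>
    <addressee>Bram Sample</addressee>
  </letter>
  <letter id="2">
    <date>1900-02-03</date>
    <sender>Bram Sample</sender>
  </letter>
</correspondence>
"""

# Hypothetical rule standing in for what a DTD or schema would declare.
REQUIRED_CHILDREN = ("date", "sender", "addressee")


def missing_children(letter):
    """Return the required child elements that this letter element lacks."""
    return [tag for tag in REQUIRED_CHILDREN if letter.find(tag) is None]


root = ET.fromstring(DOCUMENT.strip())
problems = {
    letter.get("id"): missing_children(letter)
    for letter in root.iter("letter")
}
print(problems)
```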