Digital Humanities Workbench



From source to data

Digital humanities research can be based on sources in many different formats. The most basic form is a digital copy of the original, produced by digitisation: a scan or a digital photograph. If you want to use the computer to analyse textual sources, the digital images of those sources must be converted to computer-readable text. An automatic way to achieve this, especially suitable for relatively recent documents with clean print, is optical character recognition (OCR). For older documents, whether handwritten or printed, this usually has to be done by the researcher, an activity called transcription (see transcription of text). In many cases, spoken texts must also be transcribed by hand (see transcription of speech), although automatic speech recognition has improved enormously over the last decade.

For further research, certain information is often added to the data, a process called annotation. This may be either free annotation, where the scholar adds remarks of any kind to the text (usually interpretative or analytic), or formal annotation, which uses a pre-established set of annotation codes. Annotation can capture different types of information: the provenance of the source (so-called metadata), its structure, or classifications of the data itself.
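The difference between free and formal annotation can be made concrete with a small sketch. The tag set and sample sentence below are invented for illustration; real projects would typically use an established scheme such as the TEI Guidelines.

```python
# A minimal sketch of formal annotation: each known term is wrapped in a
# code from a pre-established tag set (here, TEI-like element names).
# TAG_SET and the example text are invented for illustration only.
TAG_SET = {
    "Amsterdam": "placeName",
    "Rembrandt": "persName",
}

def annotate(text: str, tags: dict) -> str:
    """Wrap every occurrence of a known term in its annotation code."""
    for term, tag in tags.items():
        text = text.replace(term, f"<{tag}>{term}</{tag}>")
    return text

annotated = annotate("Rembrandt worked in Amsterdam.", TAG_SET)
print(annotated)
# <persName>Rembrandt</persName> worked in <placeName>Amsterdam</placeName>.
```

Because the codes come from a fixed set, annotations like these can later be searched and counted automatically, which is what distinguishes formal annotation from free marginal notes.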

Certain analyses (such as statistical, network or spatial analysis) cannot be performed on unstructured textual or (audio)visual data. These analyses require the data to be stored in a fixed format, e.g. in tables. The conversion of unstructured sources into a structured data set is a process called data modelling.
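As a sketch of what data modelling involves, the following converts semi-structured text lines into fixed-format records. The "Name (year)" input format is invented for illustration; real sources are rarely this regular.

```python
import re

# A minimal sketch of data modelling: semi-structured lines (an invented
# "Name (year)" format) are parsed into fixed-format records, which could
# then feed a table for statistical analysis.
PATTERN = re.compile(r"(?P<name>[^(]+)\((?P<year>\d{4})\)")

def to_records(lines):
    """Parse 'Name (year)' lines into structured dicts; skip non-matches."""
    records = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            records.append({"name": match.group("name").strip(),
                            "year": int(match.group("year"))})
    return records

rows = to_records(["Vincent van Gogh (1853)", "illegible note in margin"])
print(rows)
# [{'name': 'Vincent van Gogh', 'year': 1853}]
```

Note that the non-matching line is silently dropped; in practice such lines would be flagged for manual transcription rather than discarded.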

At all stages of the process, certain types of analysis are possible. In general, however, each extra stage opens up further possibilities for analysis. Examples of common working methods are:

analogue source(s) → digital copy → computer readable text → analysis
analogue source(s) → digital copy → computer readable text → annotated text → analysis
analogue source(s) → digital copy → annotated data → analysis
analogue source(s) → digital copy → structured data → analysis
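The first working method above can be sketched as a chain of composable steps. Every function here is an invented stand-in for a real tool (a scanner driver, an OCR engine, an analysis script); only the shape of the pipeline is the point.

```python
# A sketch of: analogue source -> digital copy -> computer-readable text
# -> analysis. All three functions are hypothetical stand-ins.
def digitise(source: str) -> bytes:
    """Analogue source -> digital copy (here: fake image bytes)."""
    return source.encode("utf-8")

def recognise(image: bytes) -> str:
    """Digital copy -> computer-readable text (stand-in for OCR)."""
    return image.decode("utf-8")

def analyse(text: str) -> dict:
    """Computer-readable text -> analysis (here: a simple word count)."""
    return {"tokens": len(text.split())}

result = analyse(recognise(digitise("an analogue letter")))
print(result)
# {'tokens': 3}
```

The other working methods differ only in which intermediate steps (annotation, data modelling) are inserted into this chain before the analysis stage.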

Research data lifecycle

Digitising, preparing and processing sources for analysis are part of what is often called the research data lifecycle. The encompassing set of activities is called data management or data curation. This concerns the overall organisation of the data, including aspects such as storage, archiving and preservation.

Other topics in this section: Digitisation   Transcription   Preparation   Annotation   Data modelling   Data management