Digital Humanities Workbench

homepage Faculty of Humanities VU University Amsterdam




Digitisation

If texts or other objects are not available in digital form (either as born-digital material, i.e. material that was created digitally, or as digital copies in an archive or elsewhere on the internet; see digital archives), digitisation becomes part of the research project for which the data are required. Digitisation means that printed texts, manuscripts, photographs or audiovisual objects are converted into digital copies. The digitisation process usually consists of a number of stages.

N.B. Strictly speaking, a digital copy is not always necessary as the basis for further data processing. You can, for example, directly transcribe an original source. However, having a digital copy may be very useful, for a number of reasons: (i) you can restrict your time in an archive to copying the sources, after which you can do your transcription work anywhere; (ii) during all stages of your research you can inspect the original source; (iii) transcriptions, but also structured data in a database, can be linked to digital copies of the sources.

Scanning and photography

There are three main devices for making a digital image of physical objects.

A scanner can be used to make a digital copy of a document, manuscript or picture. Although there are various types of scanners (for mass digitisation projects like those of major libraries and Google Books very advanced document scanners are used), so-called flatbed scanners will do the job for most smaller research projects. Most modern flatbed scanners offer good quality scans (with regard to aspects like resolution and colour depth). Differences in price are mainly reflected in speed and functionalities. The resulting digital copy may be an image or a PDF file. A PDF file will be most suitable for longer texts that have to be saved as one file. For single pages (of historic documents or manuscripts) and for pictures an image file is usually preferred. The two image file formats that are used in most cases are JPG (compressed; you should use a high quality level) and TIFF (lossless; high resolution). You should scan objects at a resolution of at least 300 dpi (dots per inch). See the Workbench page about digital images for more information about this subject.
Scanners are operated by means of special scanning software. You are advised to study the different settings in relation to both the objects that will be scanned and the purpose of the scanning, before embarking on a scanning project.
A digital camera can be used to make digital images of two-dimensional objects. It can also be used to digitise documents and manuscripts, when these may not leave an archive (and the archive itself has no scanning service). Although the digital quality and resolution of a photographed object are usually high, with a camera it is more difficult to make untilted images of documents.
A 3D scanner can make three-dimensional images of objects. Currently, in the humanities this technique is mainly used for cultural heritage artefacts. For an example, see this rotatable 3D image of a censer (old incense or perfume burner) of the University of Exeter.
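
To get a feel for what the 300 dpi guideline for flatbed scans means in practice, the following minimal sketch computes the pixel dimensions and the uncompressed size of a scan. The A4 page size (210 x 297 mm) and the 24-bit colour depth are assumptions chosen for illustration, and the function name is hypothetical.

```python
MM_PER_INCH = 25.4


def scan_dimensions(width_mm, height_mm, dpi=300, bits_per_pixel=24):
    """Return (width_px, height_px, uncompressed_bytes) for a scanned page."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    # Every pixel needs bits_per_pixel bits; 8 bits make one byte.
    uncompressed_bytes = width_px * height_px * bits_per_pixel // 8
    return width_px, height_px, uncompressed_bytes


# Assumed example: an A4 page scanned in 24-bit colour at 300 dpi.
width, height, size = scan_dimensions(210, 297)
print(f"{width} x {height} pixels, about {size / 1_000_000:.1f} MB uncompressed")
```

Under these assumptions a single page takes roughly 26 MB before compression, which is worth keeping in mind when choosing between JPG and TIFF and when planning storage.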

Conversion

If objects have already been recorded, there are a number of ways in which analogue recordings can be converted to a digital copy.

A film scanner can be used for converting analogue photographic film (of photographs and photographic slides) to digital images. Although this can also be done with most flatbed scanners, using a specialised film scanner has several advantages, which mainly have to do with accuracy, speed and the functionalities of the accompanying scanning software.
Audiovisual data on analogue media (LP records, cassette and video tapes) can also be converted to a digital medium. This requires an appropriate analogue playing device, an analogue to digital converter and software to capture the resulting digital data stream.

N.B. There are many companies offering these services. It is advisable to request a quote from one of them before embarking on this task yourself.

Post-processing

Depending on the use of the digitised data, it might be necessary to tidy up the scans, especially for documents that are scanned from books. Often, the text will be slightly tilted and will have dark margins that you would like to crop. If two pages are scanned to one file, it might be desirable to divide the left and right pages into separate files. Most scanning software allows you to perform these actions, but it is also possible to use photo editing software like Microsoft Paint or Photoshop afterwards.
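
The cropping and page-splitting steps described above can be sketched in a few lines. This is only an illustration of the logic, not a replacement for scanning or photo editing software: it works on a small grid of grey values (0 is black, 255 is white) instead of a real image file, the darkness threshold is an assumed value, and straightening tilted text is not covered. All function names are hypothetical.

```python
def content_box(pixels, dark_threshold=60):
    """Return (left, top, right, bottom) after trimming all-dark edges.

    right and bottom are exclusive, so they can be used directly in slices.
    """
    def is_dark(values):
        return all(value < dark_threshold for value in values)

    rows = [y for y, row in enumerate(pixels) if not is_dark(row)]
    cols = [x for x in range(len(pixels[0]))
            if not is_dark(row[x] for row in pixels)]
    if not rows or not cols:
        raise ValueError("image is entirely dark")
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def crop(pixels, box):
    """Cut the rectangle described by box out of the pixel grid."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


def split_pages(pixels):
    """Split a two-page scan down the middle into (left_page, right_page)."""
    middle = len(pixels[0]) // 2
    return [row[:middle] for row in pixels], [row[middle:] for row in pixels]


D, P = 0, 255  # dark margin, light paper
scan = [
    [D, D, D, D, D, D],
    [D, P, P, P, P, D],
    [D, P, P, P, P, D],
    [D, D, D, D, D, D],
]
spread = crop(scan, content_box(scan))       # dark margins removed
left_page, right_page = split_pages(spread)  # one grid per page
```

Splitting exactly in the middle assumes that the book gutter is centred in the scan; for real scans you would check this on a few samples first.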

File management

First, you have to decide whether you want to store scans of textual data in one file (for the whole document) or in separate files (one for each page). Scanned files are automatically named during the scanning process (resulting in names like IMG00023.TIFF). If your scans are stored in separate files, it might be useful to rename the files in two ways: (i) giving them a prefix (with fixed length) denoting the source and (ii) numbering them according to the page numbers in the original source. When renumbering, make use of leading zeros for ordering purposes (e.g. Defoe-RC007.JPG). You should also make a plan for the storage (in one folder or in different folders; on your pc, in a project folder on your institute's network or in the cloud) and back-up of the files.
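
The renaming scheme described above (a fixed prefix plus a zero-padded page number) is easy to automate. The sketch below only builds a renaming plan and does not touch any files. It assumes that the scanner numbered the files in page order and that the first scan corresponds to page 7 of the source; the function name and the example file names are hypothetical.

```python
from pathlib import PurePath


def plan_renames(filenames, prefix, first_page=1, digits=3):
    """Map scanner-generated names to prefix + zero-padded page number.

    Files are processed in sorted order, which assumes that the scanner
    numbered them in the order of the pages.
    """
    plan = {}
    for page, name in enumerate(sorted(filenames), start=first_page):
        suffix = PurePath(name).suffix  # keep the original extension
        plan[name] = f"{prefix}{page:0{digits}d}{suffix}"
    return plan


scans = ["IMG00025.JPG", "IMG00023.JPG", "IMG00024.JPG"]
plan = plan_renames(scans, "Defoe-RC", first_page=7)
for old, new in plan.items():
    print(old, "->", new)

# To apply the plan to real files in a folder, something like
#     (folder / old).rename(folder / new)
# with folder being a pathlib.Path could be used, after checking the plan.
```

Printing the plan before renaming anything makes it possible to spot skipped or duplicated pages while the original names are still intact.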

Optical character recognition (OCR)

If you want to use the computer to analyse textual sources, the digital images of those sources must be converted to computer-readable text. For certain types of documents this can be done by optical character recognition (OCR). You can use specialised software for this, but often the software that comes with the scanner also has OCR functionality. The quality of the OCR depends on the resolution of the scanned document, the format and the quality of the print materials. The resolution should be at least 300 dpi (dots per inch); the higher the resolution, the better the OCR results will usually be. The language of the original document and the diacritical marks that are used also affect the quality.

The result of OCR will never be 100% accurate, although modern techniques can achieve a very high success rate for recent, high-quality print. OCR results for print of lower quality may contain a considerable number of errors. Usually, you have to assess whether the material must be corrected. Correction can be automated to a certain extent by using global search-and-replace macros for errors that occur regularly and by using a spelling checker (for recent material). However, manual correction will almost always be required as well. Because correction is very time-consuming, it will usually not be feasible for computational analyses of large numbers of texts, whereas it may be essential for certain types of qualitative and interpretative analysis. You should therefore always decide whether you will work with rough OCR that contains errors or add a correction stage. Always inspect the results of the OCR process on a number of samples from your data to establish the quality of the OCR.
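
A global search-and-replace step for regularly occurring errors can be sketched as follows. The correction table is purely hypothetical; in a real project it would be compiled from the errors you actually find when inspecting samples of your own OCR output.

```python
import re

# Hypothetical correction table: recurring OCR misreadings and their fixes.
CORRECTIONS = [
    (r"\btbe\b", "the"),
    (r"\bwbich\b", "which"),
    (r"(?<=[a-z])1(?=[a-z])", "l"),  # digit 1 inside a word, meant as letter l
]


def apply_corrections(text, corrections=CORRECTIONS):
    """Apply every pattern in order.

    Returns the corrected text and the number of replacements per pattern.
    """
    counts = {}
    for pattern, replacement in corrections:
        text, number = re.subn(pattern, replacement, text)
        counts[pattern] = number
    return text, counts


raw = "It was tbe on1y book wbich tbe sailor had."
fixed, counts = apply_corrections(raw)
print(fixed)
print(counts)
```

Keeping the replacement counts gives a quick check on whether a rule fires far more often than expected, which may indicate that it also changes correct text.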

For printed historic documents, OCR often does not produce satisfactory results, although progress has certainly been made in this area over the last decade. OCR errors may occur because of damaged material, irregular layout and the use of historic fonts, but also because historical language usually contains many spelling and orthographical variants. As with scanner software, it may be advantageous to experiment with the settings in order to obtain maximum quality. When dealing with older documents, despeckling (removing small specks of noise from the image) may improve OCR results, for example. Some software packages also allow you to 'train' the program on a sample of the text, so that it can learn the typeface that is used.
For handwritten documents (like historical manuscripts, letters and children's writing), OCR usually is very problematic, if possible at all.

If a digital text cannot be converted adequately to computer readable text by means of OCR, it must be transcribed by hand (see Transcription of text).
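
Whether OCR output is adequate can be estimated on a sample by comparing it with a careful manual transcription of the same passage. One common measure, used here as an assumption rather than as a method prescribed by this page, is the character error rate: the number of single-character insertions, deletions and substitutions needed to turn the OCR text into the reference, divided by the length of the reference. The function names and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance relative to the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


ocr_sample = "tbe quick brovvn fox"
reference = "the quick brown fox"
rate = character_error_rate(ocr_sample, reference)
print(f"character error rate: {rate:.1%}")
```

Which error rate is still acceptable depends on the intended analysis, so the figure is best read alongside a visual inspection of the sample.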

Documentation

During the digitisation process, two aspects of data management are especially important: file management (see above) and documentation of the digitised objects. The documentation, usually stored in the form of so-called metadata, may describe both the provenance and the type of the object. When you digitise a large collection of objects, these metadata have the added value of enabling you to search for objects that have specific characteristics. It is advised to incorporate documentation as a part of the digitisation process, because that is when most of the relevant information is closest at hand. Metadata can be stored in a table, but with textual data, they can also be incorporated in the file itself. Usually this is done in a so-called file header.

A general standard for describing digital sources is the Dublin Core Metadata Element Set (usually referred to as Dublin Core). In its simplest form, Dublin Core is composed of 15 fields, holding information about the following characteristics of an object: title, creator, subject, description, publisher, contributor(s), date, resource type, file format, identifier, source, language, relation, coverage and rights. The general character of this metadata set also makes it very suitable as an exchange format for data collections.
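
Storing Dublin Core metadata in a table can be as simple as writing one row per digitised object to a CSV file. The sketch below uses the official element names of the standard, in which format and identifier hold the file format and the file identifier and type holds the kind of resource. The record values are invented for illustration and the function name is hypothetical.

```python
import csv
import io

# The 15 element names of the Dublin Core Metadata Element Set.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def metadata_table(records):
    """Return the records as CSV text with one column per element.

    Elements missing from a record are left empty; keys that are not
    Dublin Core elements make csv.DictWriter raise a ValueError.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DC_ELEMENTS, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented example record for a single scanned page.
record = {
    "title": "Robinson Crusoe, page 7",
    "creator": "Daniel Defoe",
    "format": "image/jpeg",
    "identifier": "Defoe-RC007.JPG",
    "language": "en",
}
print(metadata_table([record]))
```

Using the file name as the identifier links each metadata row to its scan, so the table can be searched and the matching image retrieved afterwards.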