Digital Humanities Workbench

Digitisation
If texts or other objects are not available in digital form (either as born-digital material, i.e. material that was created digitally, or in an archive or elsewhere on the internet; see digital archives), digitisation is part of the research project for which the data are required. Digitisation implies that printed texts, manuscripts, photographs or audiovisual objects are converted into electronic images. The digitisation process usually consists of a number of stages.

N.B. Strictly speaking, a digital copy is not always necessary as the basis for further data processing. You can, for example, directly transcribe an original source. However, having a digital copy may be very useful, for a number of reasons: (i) you can restrict your time in an archive to copying the sources, after which you can do your transcription work anywhere; (ii) during all stages of your research you can inspect the original source; (iii) transcriptions, but also structured data in a database, can be linked to digital copies of the sources.

Scanning and photography
There are three main devices for making a digital image of physical objects.
Conversion
If objects have already been recorded, there are a number of ways in which analogue recordings can be converted to a digital copy.
Post-processing

File management
First, you have to decide whether you want to store scans of textual data in one file (for the whole document) or in separate files (one for each page). Scanned files are automatically named during the scanning process (resulting in names like IMG00023.TIFF). If your scans are stored in separate files, it might be useful to rename the files in two ways: (i) giving them a prefix (with fixed length) denoting the source and (ii) numbering them according to the page numbers in the original source. When renumbering, make use of leading zeros for ordering purposes (e.g. Defoe-RC007.JPG). You should also make a plan for the storage (in one folder or in different folders; on your pc, in a project folder on your institute's network or in the cloud) and back-up of the files.

Optical character recognition (OCR)
If you want to use the computer to analyse textual sources, the digital images of those sources must be converted to computer-readable text. For certain types of documents this can be realised by optical character recognition (OCR). You can use specialised software for this, but often the software that comes with the scanner also has OCR functionalities. The quality of the OCR will depend on the resolution of the scanned document, the format and the quality of the print materials. The resolution should be at least 300 dpi (dots per inch); the higher the resolution, the better the OCR results will usually be. The language of the original document and the diacritical marks that are used also affect the quality.

The result of OCR will never be 100% accurate, although modern techniques can result in a very high success rate for recent, high-quality print. OCR results for print of lower quality may contain quite a number of errors. Usually, an assessment has to be made whether the material must be corrected.
Correction can be automated to a certain extent by using global search and replace macros to correct errors that occur regularly and by using a spelling checker (for recent material). However, manual correction will almost always be required as well. The correction process will usually not be possible for certain types of computational analyses of large numbers of texts, whereas it may be essential for certain types of qualitative and interpretative analysis. However, correction is a very time-consuming process, so you should always assess whether you will use rough OCR with errors or add a correction stage. Always inspect the results of the OCR process on a number of samples from your data, to establish the quality of the OCR.
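The global search-and-replace step and the sample inspection described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended rule set: the correction rules, the sample sentence and its OCR errors are invented for the example, and the similarity score from the standard library's difflib is only a rough stand-in for a proper character error rate.

```python
import re
from difflib import SequenceMatcher

# Hypothetical correction rules: (pattern, replacement) pairs for errors
# that recur in one particular OCR output. Real rules have to be derived
# from inspecting your own data.
CORRECTION_RULES = [
    (re.compile(r"\btbe\b"), "the"),             # 'h' misread as 'b'
    (re.compile(r"\bvv"), "w"),                  # 'w' misread as 'vv' at word start
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),  # digit 1 between two letters
]


def apply_rules(text, rules=CORRECTION_RULES):
    """Apply every search-and-replace rule to the text, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def similarity(ocr_text, reference_text):
    """Rough character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, ocr_text, reference_text).ratio()


# Illustrative sample: raw OCR output with invented errors, and a
# hand-corrected transcription of the same passage.
raw_sample = "I vvas born in tbe year 1632, in tbe city of York, of a good fami1y."
reference = "I was born in the year 1632, in the city of York, of a good family."

corrected = apply_rules(raw_sample)
print(f"before correction: {similarity(raw_sample, reference):.3f}")
print(f"after correction:  {similarity(corrected, reference):.3f}")
```

Comparing the score before and after on a few hand-corrected samples gives an indication of whether the rules help and how much manual correction is likely to remain. Rules that are too broad can also introduce new errors, which is why the digit rule here only fires between two lowercase letters and leaves the year 1632 untouched.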
For printed historic documents, OCR often does not produce satisfactory results, although progress has certainly been made in this area in the last decade. OCR errors may occur because of damaged material, irregular layout, and the use of historic fonts, but also because historical language usually contains many spelling and orthographical variants. As with scanner software, it may be advantageous to experiment with the settings in order to obtain maximum quality. When dealing with older documents, despeckling may improve OCR results, for example. Some software packages also allow you to 'train' the program on a sample of the text, so that it can learn the typeface that is used. If a digital text cannot be converted adequately to computer-readable text by means of OCR, it must be transcribed by hand (see Transcription of text).

Documentation
During the digitisation process, two aspects of data management are especially important: file management (see above) and documentation of the digitised objects. The documentation, usually stored in the form of so-called metadata, may describe both the provenance and the type of the object. When you digitise a large collection of objects, these metadata have the added value of enabling you to search for objects that have specific characteristics. It is advisable to incorporate documentation as a part of the digitisation process, because that is when most of the relevant information is closest at hand. Metadata can be stored in a table, but with textual data, they can also be incorporated in the file itself. Usually this is done in a so-called file header.
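As a minimal sketch of how file naming and a metadata table can go together, the Python example below builds names with a source prefix and zero-padded page numbers (as in Defoe-RC007.JPG), records one metadata row per scan, and then searches the table. The helper planned_name, the column names and the sample values are all assumptions made for illustration; a real project would choose its own fields, for instance following a metadata standard.

```python
import csv
import io


def planned_name(prefix, page, extension, width=3):
    """Build a file name such as 'Defoe-RC007.JPG' from its parts."""
    return f"{prefix}{page:0{width}d}.{extension}"


# Hypothetical scanner output: automatic file name -> page number in the source.
scans = {"IMG00023.TIFF": 7, "IMG00024.TIFF": 8, "IMG00025.TIFF": 9}

# One metadata row per scan. The column names are illustrative only.
rows = [
    {
        "file": planned_name("Defoe-RC", page, "TIFF"),
        "scanner_name": original,
        "source": "Robinson Crusoe",
        "page": page,
        "object_type": "printed text",
    }
    for original, page in scans.items()
]

# Write the table as CSV. An in-memory buffer is used here; a real
# project would write to a file in the project folder.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
metadata_csv = buffer.getvalue()

# Read the table back and search for objects with a specific characteristic.
matches = [
    row["file"]
    for row in csv.DictReader(io.StringIO(metadata_csv))
    if int(row["page"]) >= 8
]
print(matches)
```

Keeping the scanner's original file name in the table preserves the link between each renamed file and the scanning session, and the zero-padded page numbers make the files sort in page order in any file browser.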
Further reading
Cornell Digital Imaging Tutorial
This tutorial offers base-level information on the use of digital imaging to convert and make accessible cultural heritage materials.
Digitization — Scanning, OCR, and Re-keying
OCR challenges in historic documents and the contribution of IMPACT (Canadian Council of Archives)
Mining for the Meanings of a Murder: The Impact of OCR Quality on the Use of Digitized Historical Newspapers
Step-by-step Guides to Digitisation Projects (Canadian Council of Archives)
Dublin Core Metadata (DC) in Digital Libraries