Digital Humanities Workbench



Transcription of text

If you want to analyse textual sources by computer, the digital images of those sources must first be converted to machine-readable text. For relatively recent printed documents, this can often be done with optical character recognition (OCR). For historical printed documents, however, OCR often does not produce satisfactory results, although considerable progress has been made in this area over the last decade (see the section about digitisation for more information about OCR). For handwritten documents (such as historical manuscripts, letters and children's writing), OCR is usually very problematic, if it is possible at all.

[Image: Jane Skipwith's Love Letter (Folgerpedia)]

Therefore, many older printed documents and most handwritten manuscripts have to be transcribed manually. Usually, this is done in the form of a so-called diplomatic transcription, which follows the original document as closely as possible. In a normalized (also called regularized) transcription, by contrast, the original text is cleaned up to make it more easily readable, e.g. by using modern orthography. Because a normalized transcription can be made on the basis of a diplomatic transcription, but not vice versa, diplomatic transcriptions are often preferred. Making a diplomatic transcription means that decisions have to be made about how to deal with certain aspects of the original text: page layout (including line length); typeface (capitalization, use of bold and italics, underline, strikeout, accent markers); punctuation (or lack of it); illegible text; older spelling and misspelling; archaic abbreviations; handwritten notes in printed text; and images and drawings in the text. It is important that all transcription decisions you make in this respect are well documented.
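The one-way relationship between the two kinds of transcription can be illustrated with a short Python sketch. Everything in it is an assumption made for the example: the sample sentence, the inline markup ([del: ...] for struck-out text, {abbreviation=expansion} for abbreviations, [illegible] for unreadable text) and the small spelling table are hypothetical and do not follow any particular standard or project.

```python
import re

# Hypothetical diplomatic transcription: original spelling and line breaks are
# kept, and an invented inline markup records editorial observations:
#   [del: ...]                struck-out text
#   {abbreviation=expansion}  an archaic abbreviation with its expansion
#   [illegible]               text that could not be read
DIPLOMATIC = (
    "I haue {recd=received} {yr=your} [del: kinde] louing\n"
    "lettre, and [illegible] vnto you"
)

# Hypothetical spelling table; a real project would document its own rules.
MODERN_SPELLING = {
    "haue": "have",
    "louing": "loving",
    "lettre": "letter",
    "vnto": "unto",
}


def normalize(diplomatic: str) -> str:
    """Derive a normalized reading text from a diplomatic transcription."""
    # Drop text that was struck out in the original.
    text = re.sub(r"\[del:[^\]]*\]", "", diplomatic)
    # Replace each abbreviation by its recorded expansion.
    text = re.sub(r"\{[^=}]*=([^}]*)\}", r"\1", text)
    # Modernise the spelling of words listed in the table.
    text = re.sub(
        r"[A-Za-z]+",
        lambda m: MODERN_SPELLING.get(m.group(0), m.group(0)),
        text,
    )
    # Ignore the original line breaks and collapse runs of whitespace.
    return " ".join(text.split())


print(normalize(DIPLOMATIC))
```

The normalized result no longer contains the line break, the deleted word, the abbreviated forms or the original spelling, so the diplomatic transcription cannot be reconstructed from it. This is why the sketch only goes in one direction.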

Collaboration and crowd sourcing

As with many modern applications, transcription can be done online, which enables groups of students and/or scholars to work together on the transcription of a single (larger) document or a collection of documents. In a growing number of larger transcription projects (usually conducted by academic departments, libraries or digital archives), participation is not restricted to the research group: all interested individuals are asked to take part. Examples of such crowd-sourced transcription projects are Transcribe Bentham (a double award-winning collaborative transcription initiative, which is digitising and making available digital images of the unpublished manuscripts of the philosopher and reformer Jeremy Bentham through a platform known as the Transcription Desk), Making History - Transcribe (Virginia Memory) and Smithsonian Digital Volunteers, but nowadays there are many more projects of this kind. Usually this transcription method implies a workflow in which all participants may be involved in transcribing and in reviewing the work of others, followed by a final check and approval by the project team.
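The workflow described above (transcription, review by another participant, final approval by the project team) can be sketched as a small state machine. This is only an illustration under stated assumptions: the status names, the user names and the PROJECT_TEAM set are hypothetical and do not describe how any of the projects mentioned actually implement their workflow.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of user names that belong to the project team.
PROJECT_TEAM = {"editor"}


@dataclass
class Page:
    """One page moving through: untranscribed -> transcribed -> reviewed -> approved."""
    text: str = ""
    status: str = "untranscribed"
    transcriber: Optional[str] = None
    reviewer: Optional[str] = None


def transcribe(page: Page, user: str, text: str) -> None:
    """Any participant may submit a first transcription of an untouched page."""
    if page.status != "untranscribed":
        raise ValueError("page has already been transcribed")
    page.text, page.transcriber, page.status = text, user, "transcribed"


def review(page: Page, user: str, corrected_text: Optional[str] = None) -> None:
    """Participants review the work of others, optionally correcting the text."""
    if page.status != "transcribed":
        raise ValueError("page is not waiting for review")
    if user == page.transcriber:
        raise PermissionError("a transcription must be reviewed by someone else")
    if corrected_text is not None:
        page.text = corrected_text
    page.reviewer, page.status = user, "reviewed"


def approve(page: Page, user: str) -> None:
    """The final check and approval is reserved for the project team."""
    if page.status != "reviewed":
        raise ValueError("page has not been reviewed yet")
    if user not in PROJECT_TEAM:
        raise PermissionError("only the project team can approve a page")
    page.status = "approved"


page = Page()
transcribe(page, "alice", "Deare sir")
review(page, "bob", "Deare Sir")
approve(page, "editor")
print(page.status, "-", page.text)
```

Keeping the transcriber and reviewer on the page record makes it possible to enforce that nobody reviews their own work, which is the point of asking participants to check each other's transcriptions.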

Tools

You can make a transcription of a document by opening two windows: one in which the digital image is displayed and one in which you transcribe it with an editor (as txt, HTML, XML or rtf / docx). However, a number of dedicated tools are available to support the transcription process.

Transcript
Transcript is a desktop-based manuscript transcription tool, in which the viewer and editor are integrated in one program. From within the editor you can move the visible part of the image in many ways using keyboard shortcuts. It is free for personal use (a paid version offers more functions) and runs on Windows only.

FromThePage
FromThePage is an open-source tool that allows volunteers to collaborate to transcribe handwritten documents. It can also be used by individuals to transcribe documents online.

Transkribus
Transkribus supports scholars who are engaged in the transcription of printed or handwritten documents. It offers a number of tools for the automated processing of documents, such as OCR, handwritten text recognition, layout analysis, document understanding and writer identification. All Transkribus services are available via a web interface and are provided for free.

eLaborate
eLaborate is an online work environment in which scholars can upload scans, transcribe and annotate text, and publish the results as an online text edition which is freely available to all users. It is possible to use it for transcription only, either by individuals or as a collaborative enterprise.
