Digital Humanities Workbench



E-text

In a general sense, an e-text (electronic text) is any textual information available in a digital format that people can read on a computer. E-texts are created either by retyping or transcribing existing texts (manual labour) or by scanning a page image and using optical character recognition (OCR) to convert it into processable text. OCR works best on high-quality print, which usually means recently printed matter. A correction phase is always necessary to obtain an error-free text: e-texts that have not been corrected, or that have only been corrected (semi-)automatically, will usually still contain errors. When the original of a digital text is not available, quality control can be difficult.

E-texts allow researchers to consult publications that are less accessible in printed form. Another major advantage compared to printed texts is the ability to search and analyse the texts using textual analysis programs.

The term 'e-text' is used for different types of text files, each with its own features. Based on these features, we can distinguish three subtypes: plain e-text, annotated e-text and formatted e-text. Some subtypes are more suitable for computer-assisted text analysis than others, although this also depends on the specific file format of the text.

Plain e-text

The most basic form of an e-text is a text file that contains only a digital version of the original text, without formatting, different fonts, links, images, etc. These files contain only letters, digits, punctuation marks, spaces, tabs and line breaks (traditionally ASCII characters, although plain text can also be stored in a Unicode encoding such as UTF-8) and no (internal) coding for text formatting such as bold face, italics or different font types. The full text is usually contained in a single file, which makes a plain e-text a good basis for various forms of computer-assisted text analysis, because it can be searched and processed in many ways. Plain e-texts are usually less suitable for reading from the screen, and since a book, for example, is more than just text, these e-texts often fail to convey the full reading experience (especially in terms of presentation).
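As an illustration of what "searched and processed in many ways" can mean in practice, the following is a minimal Python sketch that counts word frequencies and builds a simple keyword-in-context list from a plain e-text. The sample text, the function names and the context width are assumptions made for this example; a real analysis would read the text from a file and would probably use a dedicated text analysis program.

```python
import re
from collections import Counter

# Hypothetical sample; in practice the text would be read from a file, e.g.
# text = open("novel.txt", encoding="utf-8").read()
text = (
    "It was the best of times, it was the worst of times,\n"
    "it was the age of wisdom, it was the age of foolishness."
)


def tokenize(s):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", s.lower())


def concordance(s, keyword, width=15):
    """Return each occurrence of `keyword` with `width` characters of context."""
    flat = " ".join(s.split())  # collapse line breaks and repeated spaces
    pattern = r"\b" + re.escape(keyword) + r"\b"
    hits = []
    for match in re.finditer(pattern, flat, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(flat), match.end() + width)
        hits.append(flat[start:end])
    return hits


frequencies = Counter(tokenize(text))
print(frequencies.most_common(4))

for line in concordance(text, "age"):
    print(line)
```

Because the file contains nothing but the text itself, no pre-processing is needed before tokenizing. The tokenizer here is deliberately crude (it treats any run of word characters as a token) and would need refinement for hyphenated words, apostrophes or historical spelling.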

Annotated e-text

An annotated e-text is a plain e-text enriched with markup for research purposes. The annotations can record information about the source (so-called metadata), about the structure of the text or about its content. In most cases, these files still contain only letters, digits and punctuation (so-called ASCII characters). Such e-texts are especially suited to computer-assisted text analysis, although this obviously depends on the type of annotation used. The disadvantages of plain e-texts mentioned above (poor on-screen readability and loss of the original presentation) are, however, even more pronounced with these texts.

Examples:

  • William Shakespeare, Romeo and Juliet (with COCOA markup references) and the related information file. The play with structure tags (source: Oxford Text Archive, file 0128).
  • Shakespeare's Sonnets 1609. Sonnets with structure tags. (source: University of Toronto Library).
  • Of the Lawes of Irelande by Sir John Davies (1609). Fragment of a historical text with XML tags for passages in another language (such as Latin), notes, corrections and titles (of books, for instance). For comparison, see the HTML version.

Although a growing number of e-texts contain source and structural annotations, few available e-texts have been annotated for content. Researchers typically add this type of annotation only to their own research copies of the e-text(s). For more information, see the page about annotation in text files.
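To give an impression of how such annotations can be exploited, the sketch below uses Python's standard xml.etree.ElementTree module to list all passages tagged as being in another language and to produce a reading text without editorial notes. The XML fragment is invented for this example and is not taken from any of the texts listed above; the tag names (head, foreign, title, note) and the lang attribute are assumptions, and real archives will use their own tag sets.

```python
import xml.etree.ElementTree as ET

# Invented fragment with hypothetical tag names; not a quotation from a real e-text.
sample = """<text>
  <head>Sample heading</head>
  <p>The judges held that the custom was void, <foreign lang="la">ab initio</foreign>,
  as appears in <title>the Year Books</title>.<note>Editorial note.</note></p>
</text>"""

root = ET.fromstring(sample)

# 1. Use the annotation: collect every passage marked as foreign-language text.
foreign = [(el.get("lang"), el.text) for el in root.iter("foreign")]


# 2. Strip the annotation: rebuild the running text, leaving out editorial notes.
def plain_text(element, skip=("note",)):
    """Concatenate an element's text, omitting the content of tags in `skip`."""
    parts = []
    if element.tag not in skip:
        parts.append(element.text or "")
        for child in element:
            parts.append(plain_text(child, skip))
            # Text following a child belongs to the parent, so it is always kept.
            parts.append(child.tail or "")
    return "".join(parts)


reading_text = " ".join(plain_text(root).split())

print(foreign)
print(reading_text)
```

The same file thus serves two purposes: the tags can be queried (for example, to study the use of Latin), or removed to obtain a plain e-text for other kinds of analysis.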

Formatted e-text

More and more e-texts are published with some form of formatting. Although such files also occur in DOC(X) or PDF format, the most frequent format is HTML, which can be read with a web browser. Although the formatting usually makes it easier to read the text from the screen, these texts are typically still not perfect copies of the original text. In the case of older prints, a (digital) facsimile is required in order to guarantee a perfect copy. Moreover, text archives often present all their texts in a uniform manner, which is another indication that the layout will not be identical to that of the original publication.

Such e-texts usually need to be pre-processed before they can be used for computer-assisted text analysis. It is often necessary to remove HTML code and any scripting code from the text. In addition, such texts are often split into a number of subfiles (for example, one subfile for each chapter of a book), which makes them less searchable unless the website hosting the files provides an umbrella search feature. Downloading texts that are split up into a large number of subfiles can also be a laborious task.
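The pre-processing described above can be sketched with Python's standard html.parser module: the example removes HTML tags, drops the content of script, style and title elements, and merges several chapter subfiles into one plain text. The two HTML strings stand in for downloaded subfiles and are invented for this example; the class name, the function name and the list of block-level tags are likewise assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page."""

    SKIP = {"script", "style", "title"}  # content of these tags is discarded
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # keep block elements apart

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of `html` with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Invented stand-ins for chapter subfiles downloaded from a text archive.
chapters = [
    "<html><head><title>Chapter 1</title><script>var x = 1;</script></head>"
    "<body><h1>Chapter 1</h1><p>It was a <i>dark</i> night.</p></body></html>",
    "<html><body><h1>Chapter 2</h1><p>Morning came &amp; went.</p></body></html>",
]

# Merge all subfiles into a single plain e-text, one chapter per paragraph.
merged = "\n\n".join(html_to_text(chapter) for chapter in chapters)
print(merged)
```

Note that this approach flattens each chapter to a single line and discards all formatting, so it is only suitable when the analysis does not depend on layout. Pages with navigation menus or other site furniture would need additional filtering.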

Examples: