Digital Humanities Workbench


Preparing digital texts

Before you can use digital texts found on the internet for computer-assisted text analysis, you may first have to prepare them. N.B. This page focuses on technical preparation, not on adding annotation.

Fragmentation

Online texts are often fragmented, for example because each chapter is included in a separate file. For certain types of computer-assisted text analysis, it can be more convenient to merge these fragments of text into a single file.
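
Merging can be sketched in a few lines of Python. The example below is a minimal illustration, not a prescribed method: the function name merge_fragments and the file names in the usage note are hypothetical, and the files are assumed to be UTF-8 encoded plain text.

```python
from pathlib import Path


def merge_fragments(paths, output_path, separator="\n\n"):
    """Concatenate text files in the given order and write the result to one file."""
    # Assumption: every fragment is a UTF-8 encoded plain-text file.
    parts = [Path(p).read_text(encoding="utf-8").strip() for p in paths]
    merged = separator.join(parts) + "\n"
    Path(output_path).write_text(merged, encoding="utf-8")
    return merged
```

A call such as merge_fragments(["chapter01.txt", "chapter02.txt"], "novel.txt") would write both chapters to novel.txt, separated by a blank line. The order of the list determines the order in the merged file, so check it before merging: sorting by file name places "chapter10" before "chapter2" unless the numbers are zero-padded.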

Page layout

Plain-text (TXT) files that have retained the page layout of the original after digitization sometimes contain elements that can reduce the accuracy of computer-assisted text analysis. Examples:
  • Hyphenated words at the end of a line, which are no longer counted as a single word because of the hyphen. If "sweet-" ends one line and "ness" starts the next, for example, they will be seen as two words: "sweet" and "ness".
  • Page numbers at the end of a page can interfere with the recognition of certain word combinations as a single phrase. The phrase "come and go" will not be recognized as such if it is broken up by a page number. Of course, page numbers can be important for bibliographies and indexes, so simply removing them is not always an option.
There are several ways to deal with this.
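
One possible approach is to clean the text with regular expressions. The Python sketch below is only an illustration under two assumptions that will not hold for every text: a page number is a line that contains nothing but digits, and a hyphen at the end of a line always marks a broken word. The function names are hypothetical.

```python
import re

# Assumption: a page number is a line that contains only digits.
PAGE_NUMBER_LINE = re.compile(r"^[ \t]*\d+[ \t]*\n", re.MULTILINE)

# A word character, a hyphen, a line break, then a lowercase letter.
LINE_END_HYPHEN = re.compile(r"(\w)-[ \t]*\n[ \t]*([a-z])")


def remove_page_numbers(text):
    """Delete every line that consists of digits only."""
    return PAGE_NUMBER_LINE.sub("", text)


def join_hyphenated(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    return LINE_END_HYPHEN.sub(r"\1\2", text)
```

Run remove_page_numbers first, so that a page number standing between the two halves of a word does not block join_hyphenated. Both functions are deliberately crude: a genuine compound such as "well-known" that happens to break at the hyphen will be joined into "wellknown", and a line that contains only a year will be deleted as if it were a page number. Work on a copy of the file, so that the page numbers remain available for reference.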

Removing tags

HTML tags or fragments of JavaScript in a text can make it difficult for text analysis programs to search that text. Various free tools are available to remove HTML tags. For instance, you can use NoteTab (option Modify > Strip HTML Tags). Because NoteTab is a text editor, you can also use it to correct other irregularities manually (such as those in the page layout).
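
If you prefer a script to an editor, the same result can be approximated with Python's standard library. The sketch below is an illustration rather than a replacement for the tools mentioned above; the class name TextExtractor and the function name strip_html are hypothetical. It keeps the text between tags and discards the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML string without tags, scripts, or styles."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Character references such as &amp; are converted to the characters they stand for. Whitespace and line breaks from the HTML source are kept as they are, so some manual tidying may still be needed afterwards.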

Formal annotation is now often done with XML tags. Analysis programs such as WordSmith and AntConc can deal with XML tags: you can use the tags or ignore them. This does not apply to all text analysis software, however. Removing XML tags is not always easy and is often undesirable, since they have been added for a reason.
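
When a program cannot handle XML, one option is to produce a tag-free copy and keep the annotated original. The Python sketch below is a minimal illustration that assumes the file is well-formed XML; the function name xml_to_text and the sample markup in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string):
    """Return the text content of a well-formed XML string, without tags."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text inside and between elements, in document order.
    return "".join(root.itertext())
```

For the input '<s><w pos="VERB">come</w> and <w pos="VERB">go</w></s>' the function returns "come and go". All information stored in tags and attributes is lost in the copy, which is why the annotated file should be kept.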

File conversion

In some cases it is necessary to convert a file to plain text (txt), because your chosen program cannot handle the original format. This can be the case for doc, pdf and epub files.

  • Word: Use the menu option Save as. Under "Save as:" (at the bottom of the screen), select "Plain text". After you have named the file and clicked on [Save], you will see a second dialogue box. Here you must make sure the option "Windows (Standard)" is checked. Usually nothing else needs to be checked. Word will automatically assign the .txt extension to the file. It is recommended that you do not change this.
  • PDF: You need a program such as Acrobat Professional (or a clone) to convert pdf files to txt files. In some cases, the author of the file may have blocked format conversion, in which case the pdf file cannot be converted.
  • Epub: Epub files can be converted using the freeware program Calibre.
