Digital Humanities Workbench
Preparing digital texts
Before you can use digital texts found on the internet for computer-assisted text analysis, you may have to prepare them first. N.B. This page focuses on technical preparation, not on adding annotation.
Fragmentation
Online texts are often fragmented, for example because each chapter is stored in a separate file. For certain types of computer-assisted text analysis, it can be more convenient to merge these fragments into a single file.
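Merging fragments can be scripted. The following is a minimal Python sketch, assuming the chapters are UTF-8 text files in one folder and that their file names sort into reading order (for example chapter01.txt, chapter02.txt). The function name merge_text_files and the file names are illustrative, not part of any particular tool.

```python
from pathlib import Path


def merge_text_files(source_dir, output_file, pattern="*.txt"):
    """Concatenate all files matching `pattern` in `source_dir` into one file.

    Returns the names of the merged files in the order they were written.
    """
    source_dir = Path(source_dir)
    output_file = Path(output_file)
    # Sort by name; this only gives reading order if the numbers in the
    # file names are zero-padded (chapter02 before chapter10).
    # The output file itself is excluded in case it lives in the same folder.
    parts = sorted(
        p for p in source_dir.glob(pattern)
        if p.resolve() != output_file.resolve()
    )
    with output_file.open("w", encoding="utf-8") as out:
        for part in parts:
            text = part.read_text(encoding="utf-8")
            # Separate fragments with one blank line.
            out.write(text.rstrip("\n") + "\n\n")
    return [p.name for p in parts]
```

A call such as merge_text_files("chapters", "book.txt") would then write all chapters to book.txt. If the source files use a different encoding, the encoding argument has to be adjusted.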
Page layout
TXT files that have retained their layout during digitization sometimes contain elements that can reduce the performance of computer-assisted text analysis. Examples:
Removing tags
Texts that contain HTML tags or bits of JavaScript can make it difficult for specialised text analysis programs to search the text. There are various (free) tools available to remove HTML tags from a text. For instance, you can use NoteTab (option Modify > Strip HTML Tags). As NoteTab is a text editor, you can also use it to manually correct other irregularities (such as in the page layout).
Formal annotation is now often done with XML tags. Analysis programs such as WordSmith and AntConc can deal with XML tags: you can use the tags or ignore them. This does not apply to all text analysis software, however. Removing XML tags is not always easy, and it is often undesirable (they were added for a reason).
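HTML tags can also be stripped with a short script instead of an editor. The following is a minimal sketch using the HTMLParser class from Python's standard library; the names TextExtractor and strip_html are illustrative. It keeps the text between tags and drops the contents of script and style elements, and it is not meant for XML annotation that you want to preserve.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(markup):
    """Return the text of `markup` with all HTML tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

For example, strip_html("<p>Tom &amp; <b>Jerry</b></p>") returns the plain string "Tom & Jerry". Whitespace and line breaks from the original markup are kept as they are, so some manual clean-up may still be needed.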
File conversion
In some cases it is necessary to convert a file to plain text (txt), because your chosen program cannot handle the original format. This can be the case for doc, pdf and epub files.
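As an illustration for one of these formats: an epub file is a ZIP archive that contains the book's text as (X)HTML documents, so a rough conversion is possible with Python's standard library alone. The sketch below rests on several assumptions: the function name epub_to_text is illustrative, the content files are read in alphabetical order of their names (the real reading order is defined in the epub's package file and may differ), and the files are UTF-8 encoded. Converting doc and pdf files requires other tools and is not covered here.

```python
import zipfile
from html.parser import HTMLParser


class _BodyText(HTMLParser):
    """Collect text, skipping <head>, <script> and <style> elements."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def epub_to_text(epub_path):
    """Return the text of all (X)HTML documents in an epub as one string."""
    parts = []
    with zipfile.ZipFile(epub_path) as archive:
        # Assumption: alphabetical file order approximates reading order.
        names = sorted(
            n for n in archive.namelist()
            if n.lower().endswith((".xhtml", ".html", ".htm"))
        )
        for name in names:
            markup = archive.read(name).decode("utf-8", errors="replace")
            parser = _BodyText()
            parser.feed(markup)
            parser.close()
            text = "".join(parser.chunks).strip()
            if text:
                parts.append(text)
    # Separate the documents with a blank line.
    return "\n\n".join(parts)
```

The returned string can then be saved as a txt file. It is worth checking the result against the original, since the order of the chapters is only an approximation here.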