Digital Humanities Workbench



File formats

The term file format (also known as file type) refers to the way in which the information in a computer file is stored. The format largely determines what you can do with a file. There are two main types of file formats: text and binary. Files in text formats contain only readable characters and can be read by many computer programs. One of their main characteristics is that they can be read and edited with text editors such as Notepad or NoteTab, which is not the case with binary files. A file in text format can contain all kinds of structural or content-based annotations, but these tags or codes also consist exclusively of letters, numbers, and punctuation.

Binary files store their data as byte sequences that are not meant to be read as characters and can only be interpreted by software that knows the format. Many binary file formats are so-called 'closed' (proprietary) formats, which are protected by a patent or copyright. The company that developed them typically does not publish the specifications of the format, so that the files can often only be handled with the company's own software.
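As a rough illustration of the distinction between text and binary files, the following Python sketch guesses whether a file is in a text format. The function name looks_like_text and the 4096-byte sample size are assumptions made for this example. The check is only a heuristic: it treats a file as text when a sample contains no NUL bytes and decodes as UTF-8, so text saved in other encodings (UTF-16, for instance) will be misclassified.

```python
import codecs


def looks_like_text(path, sample_size=4096):
    """Heuristically guess whether the file at `path` is in a text format."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    # NUL bytes are very rare in text files but common in binary formats.
    if b"\x00" in sample:
        return False
    # An incremental decoder tolerates a multi-byte character that was cut
    # off at the end of the sample, but still rejects invalid byte sequences.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A call such as looks_like_text("novel.txt") would normally return True, while an image or a compiled program would normally return False.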

The extension of a file name usually indicates its format. A file extension is an addition to the end of a file name; it consists of one or more letters (usually three or four) and is separated from the file name by a full stop. There are a great many file formats and therefore also a great many file extensions, as the overview on Wikipedia shows. The tables below list the main file formats used for digital text files.
Note: Many different formats are also used for e-books. Wikipedia offers a comprehensive overview.
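The way an extension is read off a file name can be sketched in Python with the standard pathlib module. The set TEXT_EXTENSIONS and the two helper functions are hypothetical names introduced for this example; the set contains only the text-format extensions listed on this page. Bear in mind that an extension is merely a naming convention and does not guarantee what a file actually contains.

```python
from pathlib import Path

# Hypothetical lookup set: only the text-format extensions named on this page.
TEXT_EXTENSIONS = {".txt", ".htm", ".html", ".xml", ".sgm", ".sgml"}


def extension_of(filename):
    """Return the extension in lower case, including the full stop."""
    return Path(filename).suffix.lower()


def is_listed_text_format(filename):
    """Check the extension against the text formats listed on this page."""
    return extension_of(filename) in TEXT_EXTENSIONS
```

For a name with several full stops, such as archive.tar.gz, only the last part (.gz) counts as the extension here.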

A. Text Files

Extension Description
.txt File that contains only letters, numbers, punctuation marks, spaces, tabs and line-breaks. These files do not contain any textual formatting and can be read and edited by virtually all programs on all platforms. Text analysis software is good at handling .txt files.
.htm, .html File encoded with HTML (hypertext markup language). HTML is used for the presentation of web pages (in web browsers). Seeing as they are text files, HTML files can be opened in most text analysis programs. Because these files typically contain a large number of HTML tags, however, proper analysis is often hard to do, which means it is often sensible (or even necessary) to remove the HTML tags from these files first.
.xml File annotated with XML (extensible markup language). XML has many applications, one of which is the annotation of texts to open up content to scholarly analysis. XML annotation can be quite complex, which means that special software might sometimes be required in order to process or analyse these files. For more information, see the pages about XML and annotation in this Workbench.
.sgm, .sgml File annotated with SGML (standard generalized markup language). This is a forerunner of XML that shares the same functions. You will still find this format in some text archives, because not every file annotated with SGML has been converted to XML.
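Removing HTML tags before analysis, as recommended for .htm and .html files above, can be sketched with Python's standard html.parser module. The class TextExtractor, the function strip_html and the selection of tags in BLOCK_TAGS are assumptions made for this example, not part of any particular text analysis tool. The sketch keeps the visible text, skips the contents of script and style elements, and inserts a space at the boundaries of common block-level elements so that adjacent paragraphs do not run together.

```python
from html.parser import HTMLParser

# Assumed selection of tags that should separate words in the output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_html(markup):
    """Return the text content of an HTML string without its tags."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

The result is a single line of plain text that can be saved as a .txt file. Real web pages often contain menus and other navigation text that this sketch does not filter out.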

B. Binary files

Extension Description
.jpg/.jpeg, .gif, .tif/.tiff, .png, .bmp These are some of the most common file formats for storing images in digital form (there are more). For more information, see the page about digital images in this Workbench. Note that digital images (copies) of texts cannot be searched at word level. Images of more recent texts can be converted to text files reasonably well using optical character recognition (OCR), after which their contents can be analysed.
.doc, .docx The format of Microsoft Word documents. Not all text analysis software can read this format: in that case the documents must first be saved as text files using Microsoft Word. The file format with the .docx extension was introduced in Word 2007.
.rtf Rich Text Format. Document format developed by Microsoft in 1987 for exchanging documents between different computer systems. Strictly speaking, RTF files are stored as plain text with formatting codes, but in practice they are handled like word processor documents. Most word processors can read RTF documents. Not all text analysis programs, however, can handle this format, in which case these documents must first be saved as text files.
.pdf Portable Document Format. Widespread file format that was developed by Adobe to ensure that formatted files could be displayed and printed identically on all computer systems. Acrobat Reader (or another PDF viewer) is required to read, search and print these files. Not all text analysis programs can handle PDF files. Whether, and to what extent, these files can be converted to text files depends on how they were created; the (expensive) program Acrobat Professional is much better for this than the (free) Acrobat Reader.
Note: the PDF format is also widely used for e-books.
.xps, .oxps XML Paper Specification. (Open) XPS is a printing and document format developed by Microsoft as an alternative to (and competitor of) PDF.
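Where saving Word documents as text by hand is impractical, the text of a .docx file can also be extracted by a program, because a .docx file is a ZIP archive whose main text is stored as XML in word/document.xml. The following Python sketch uses only the standard library; the function name docx_paragraphs is an assumption made for this example. It is a minimal sketch, not a full converter: it does not work for the older .doc format, and it ignores footnotes, headers and other parts stored elsewhere in the archive.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by WordprocessingML, the markup inside .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the text of each paragraph in the main body of a .docx file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph consists of one or more runs, each with text in <w:t>.
        pieces = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Joining the returned list with line breaks gives plain text that can be written to a .txt file for analysis.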


One technical aspect of file formats relates to character sets, specifically how characters are encoded by the computer. The ASCII and ANSI character sets were traditionally used to encode the Western alphabet. Nowadays, Unicode is the prevalent character set, as it is designed to cover virtually all of the world's scripts. UTF-8 is one of the encodings of Unicode and is the dominant character encoding on the World Wide Web.
You may be confronted with the existence of different character sets when you open a text in an analysis program and find that it appears to contain a lot of 'strange characters'. In that case the program is probably reading the file with a different character set than the one in which it was saved, and you will have to change the encoding setting of the program (if the program supports different character sets). For more information, see the entry for character encoding on Wikipedia.
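The 'strange characters' can be reproduced in a few lines of Python. The sketch below assumes that 'ANSI' refers to the Windows-1252 code page (called cp1252 in Python), which is a common but not universal meaning of the term. The function name decode_with_fallback and the order in which encodings are tried are likewise assumptions made for this example.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn; return the text and the encoding that worked."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway and mark unreadable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace"), encodings[0]


# The same bytes read with the wrong character set yield 'strange characters'.
utf8_bytes = "café".encode("utf-8")
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # cafÃ©
```

UTF-8 is tried first because it rejects most byte sequences that were written in another encoding, whereas cp1252 accepts almost anything and would therefore hide the mismatch.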
