Germanic Lexicon Project
File formats

Graphics formats

My original intent in posting page images was for volunteers to perform OCR on them, producing sharable text. Since TIFF is the most widely used format for scanning and OCR, all of the scanned images were originally posted only in TIFF format.

Later, when I looked at the web server logs, I found something I did not expect. A large number of users were downloading the images; the traffic was heavy and sustained over many months. Occasionally, a volunteer does create a text version of one of these scanned books, but the great majority of the downloads clearly have not been for this purpose.

Evidently, many users are willing to make do with raw page images if no online text is available. Since it will be some time before all the scanned books on this site are converted to corrected online text, I decided that I might as well make the page images more convenient to view. So I made the images available in PNG format as well, since that format displays directly in a web browser.

The GIF format is suitable on technical grounds and is widely supported. However, like many developers, I will not use the GIF format because of patent issues and because of unethical behavior by Unisys. (This site explains the issue in more detail.)

JPEG is a poor fit for scanned text because it is lossy. It works well for photographs with large continuous gradations of color, but it often leaves swimmy compression artifacts at sharp edges between one color and another, which is exactly what black-and-white text consists of.

Text formats

As Unicode comes to be better supported by software tools, I will eventually migrate all of the text data on this site to Unicode character encoding. I could do this now, but the software tools which are available to me at present are not convenient for managing Unicode-encoded data.

During correction, I usually work with the data as (nearly-)plain text files, usually in ISO-8859-1 character encoding (the distributed Bosworth/Toller and Cleasby/Vigfusson files are an exception, since portability is a major issue for a volunteer-based project). Characters not in that set are represented with XML-style entities. Bold and italic text are represented with HTML-style <b>...</b> and <i>...</i> tags. I also include elements to show where the page breaks are, since this information is useful for many processing tasks during correction.
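
As a rough illustration only (not the scripts actually used for this site), the working format described above can be approximated in Python. The sketch assumes numeric character references such as &#491; as a stand-in for the XML-style entities, since the actual entity names are not given here; the function names and the sample headword are likewise made up for the example.

```python
import re

# Matches decimal or hexadecimal numeric character references,
# e.g. &#491; or &#x1EB;
_NUMERIC_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")

# Code points for &, < and >. These stay escaped so that the
# <b>...</b> and <i>...</i> markup is not corrupted on decoding.
_MARKUP_CHARS = {0x26, 0x3C, 0x3E}


def to_working_bytes(text: str) -> bytes:
    """Encode text as ISO-8859-1; characters outside that set become
    numeric character references (an assumed stand-in for entities)."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def from_working_bytes(data: bytes) -> str:
    """Decode ISO-8859-1 bytes and expand numeric character references."""

    def expand(match: re.Match) -> str:
        ref = match.group(1)
        code = int(ref[1:], 16) if ref.startswith("x") else int(ref)
        if code in _MARKUP_CHARS:
            return match.group(0)  # leave escaped markup characters alone
        return chr(code)

    return _NUMERIC_REF.sub(expand, data.decode("iso-8859-1"))


# Illustrative sample: o-with-ogonek (U+01EB) is outside ISO-8859-1,
# while eth (U+00F0) is inside it.
sample = "<b>h\u01ebfu\u00f0</b> <i>n.</i>"
working = to_working_bytes(sample)
restored = from_working_bytes(working)
```

The same decoding step is roughly what an eventual migration to Unicode would involve: read the ISO-8859-1 bytes, expand the entities, and write the result out in a Unicode encoding such as UTF-8.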

After the major correction is over, I mark up the files to produce validated XML files. Usually, there is relatively little markup at first (the <entry>...</entry> tags are easily generated, for example). Later in the project, there is a full semantic markup which makes the structure of the entries fully explicit. I have accomplished this before by writing custom parsers in Perl; each parser exploits whatever idiosyncratic properties of a particular dictionary make it possible to determine the structure of an entry.
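
The first, light markup pass might look something like the following Python sketch. It is not the Perl code mentioned above, and it rests on assumptions that are not stated in this document: that entries are separated by blank lines, and that a <dictionary> root element is acceptable. The sample entries are placeholders, not real dictionary text. Note also that the standard-library parser only checks well-formedness; validating against a DTD or schema would need a separate tool.

```python
import xml.etree.ElementTree as ET


def wrap_entries(text: str) -> str:
    """Wrap each blank-line-separated block in <entry>...</entry>.

    Assumes one entry per block and a hypothetical <dictionary> root.
    """
    blocks = [block.strip() for block in text.split("\n\n") if block.strip()]
    body = "\n".join(f"<entry>{block}</entry>" for block in blocks)
    return f"<dictionary>\n{body}\n</dictionary>"


# Placeholder entries using the <b>/<i> conventions of the working files.
sample = (
    "<b>headword-one</b> <i>n.</i> first gloss.\n"
    "\n"
    "<b>headword-two</b> <i>v.</i> second gloss."
)

xml_text = wrap_entries(sample)

# Parsing raises xml.etree.ElementTree.ParseError if the result is not
# well-formed, for example because of an unclosed <b> tag in the input.
root = ET.fromstring(xml_text)
entries = root.findall("entry")
headwords = [entry.find("b").text for entry in entries]
```

A well-formedness check of this kind is a cheap way to catch stray or unbalanced tags left over from correction before the heavier semantic markup begins.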

I plan for the fully completed texts to conform with the TEI guidelines. None of the texts on this site are yet at the stage of full conformance with TEI.

The one area where I may need to extend the TEI guidelines is etymology. The TEI guidelines have the following to say on this type of information:

The element <etym> marks a block of etymological information. Etymologies may contain highly structured lists of words in an order indicating their descent from each other, but often also include related words and forms outside the direct line of descent, for comparison. Not infrequently, etymologies include commentary of various sorts, and can grow into short (or long!) essays with prose-like structure. This variation in structure makes it impracticable to define tags which capture the entire intellectual structure of the etymology or record the precise interrelation of all the words mentioned. It is, however, feasible to mark some of the more obvious phrase-level elements frequently found in etymologies, using tags defined in the core tag set or elsewhere in this chapter.

(Emphasis added)

Recording the precise formal interrelations among all of the words in an etymology is in fact exactly what I propose to do, because my intent is to create applications which can automatically manipulate and evaluate etymological information. This is a difficult markup problem, but I do not think that it is an impossible one, and I have sketched what such a markup scheme would need to accomplish and what its general form would need to be. Since the TEI scheme makes no attempt to encode etymological relationships precisely, some extension of the guidelines will almost surely be needed.
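
To make the goal concrete, here is a deliberately small Python sketch of the kind of question an application could answer once relationships are encoded explicitly. It is purely hypothetical and is not the markup scheme referred to above: the Form class, the PARENT table, and the function names are all invented for illustration, and only direct descent is modeled (no borrowing, no uncertain or competing etymologies).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Form:
    """A word form in a given language (hypothetical record type)."""
    language: str
    word: str


OE_FAEDER = Form("Old English", "f\u00e6der")
ON_FADIR = Form("Old Norse", "fa\u00f0ir")
PGMC_FADER = Form("Proto-Germanic", "*fad\u0113r")

# Hypothetical relation table: each form maps to the form it descends from.
PARENT = {
    OE_FAEDER: PGMC_FADER,
    ON_FADIR: PGMC_FADER,
}


def ancestors(form: Form) -> list:
    """Return the chain of forms that `form` descends from, nearest first."""
    chain = []
    while form in PARENT:
        form = PARENT[form]
        chain.append(form)
    return chain


def common_ancestor(a: Form, b: Form) -> Optional[Form]:
    """Return the nearest form from which both a and b descend, if any."""
    line_of_a = {a, *ancestors(a)}
    for candidate in [b, *ancestors(b)]:
        if candidate in line_of_a:
            return candidate
    return None


shared = common_ancestor(OE_FAEDER, ON_FADIR)
```

Even this toy version shows why prose-like etymologies are not enough for automatic processing: the program needs each relationship stated as data, not implied by word order or commentary.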