Please write to me or post on the message board before you start a project with the materials here. Somebody else might already be doing the same project. It's a big waste of effort if two groups create essentially identical online copies of the same text.

Things you can do to help:

The biggest project right now is the correction of Cleasby/Vigfusson and Bosworth/Toller; please click on the "Volunteers" tab to find out how to help with this. However, if you want, there are other ways you can help as well.

1. Scan more books.

If you would like to pick a book on the older Germanic languages and scan it, I will be glad to host the page images. Contact me to discuss how we can transfer the images.

Be sure that the book you choose is out of copyright. In the United States, all books published before 1923 are out of copyright, and since 2019 the cutoff has moved forward by one year each January.

Please read the first section of the page "How to digitize a text" for recommendations on scanner settings, filenames, optional image post-processing, etc. Please at least make the filenames match the page numbers before you hand them off to me.

2. If you have OCR software, perform OCR on any of the documents here.

It's OK if you just crank the images thru and send me the text without doing correction. I can post your OCRed text, and somebody else will probably eventually pick up the correction.

If the text contains special characters which you don't have convenient support for, it's OK to have your OCR software substitute some other character which never occurs in the text (perhaps @ or % or £, for example). The important thing is never to lose the distinction between characters.
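Here is a minimal Python sketch of how such stand-in characters might be turned back into the real ones later. The mapping shown (@ for þ, % for ð, £ for æ) is only an assumption for illustration; use whatever stand-ins suit your text. The check at the top is there because the scheme only works if every stand-in maps to a different character.

```python
# Hypothetical placeholder scheme: each stand-in is assumed never to occur
# in the printed text itself.
PLACEHOLDERS = {"@": "þ", "%": "ð", "£": "æ"}


def restore(text, mapping=PLACEHOLDERS):
    """Replace each stand-in character with the special character it represents."""
    targets = list(mapping.values())
    # If two stand-ins mapped to the same character, or two characters had
    # shared one stand-in, a distinction between characters would be lost.
    if len(set(targets)) != len(targets):
        raise ValueError("two placeholders map to the same character")
    return text.translate(str.maketrans(mapping))


print(restore("@at"))  # þat
```

Because the substitution is one-to-one, it can be undone mechanically at any time, so nothing is lost by postponing it.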

It's helpful if your OCR software can preserve type attributes such as italics or boldface. A programmer can use this information to help recover the data type of each field (headword, definition, etc.). One good choice would be to have your OCR software output HTML files, since HTML can encode text attributes such as italics.
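To give an idea of why the type attributes are useful, here is a rough Python sketch that splits OCR output in HTML into runs of bold, italic, and plain text. It assumes the OCR software marks styles with plain b/strong and i/em tags, which may not be true of your software, and the sample entry is made up for the example. Guessing that bold means headword and italic means definition is likewise only an assumption; each book has its own conventions.

```python
from html.parser import HTMLParser

# Tags treated as style markers (an assumption about the OCR output).
STYLE_TAGS = {"b": "bold", "strong": "bold", "i": "italic", "em": "italic"}


class StyledRuns(HTMLParser):
    """Collect (style, text) pairs from simple HTML."""

    def __init__(self):
        super().__init__()
        self.active = []  # stack of currently open style tags
        self.runs = []    # collected (style, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_TAGS:
            self.active.append(tag)

    def handle_endtag(self, tag):
        if tag in self.active:
            # Close the most recently opened tag of this name.
            idx = len(self.active) - 1 - self.active[::-1].index(tag)
            del self.active[idx]

    def handle_data(self, data):
        text = data.strip()
        if text:
            style = STYLE_TAGS[self.active[-1]] if self.active else "plain"
            self.runs.append((style, text))


def styled_runs(html_text):
    parser = StyledRuns()
    parser.feed(html_text)
    parser.close()
    return parser.runs


# Made-up sample entry, not copied from any of the dictionaries.
sample = "<p><b>afi</b>, a, m. <i>a grandfather</i>.</p>"
for style, text in styled_runs(sample):
    print(style, repr(text))
```

With only the innermost open tag recorded, text that is both bold and italic is reported under a single style; that is a simplification of this sketch.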

1. Make page content lists.

For any text here (dictionary or textbook), you could make a list of the header information for each page, like this:


Until the TIFF/PNG page images are eventually converted to text, a lot of people use the materials on this site by simply viewing the page images. Having an index list makes it much easier to find things.

If you want to do this, just send me a list like the example above (page number followed by page information). No need to make an HTML page; I have a script which will automatically produce the web page using the list.
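To show why a plain list is all that's needed, here is a minimal Python sketch of the kind of conversion involved. It is not the script actually used on this site. It assumes one entry per line, with the page number first, then whitespace, then the page information, and the link pattern ({page}.png) and the sample entries are made up for the example.

```python
import html


def parse_index(lines):
    """Turn 'page info' lines into (page, info) pairs, skipping blank lines."""
    entries = []
    for line in lines:
        parts = line.strip().split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue
        page = parts[0]
        info = parts[1] if len(parts) > 1 else ""
        entries.append((page, info))
    return entries


def render_index(entries, link_template="{page}.png"):
    """Build a simple HTML list with one link per page image."""
    rows = []
    for page, info in entries:
        href = html.escape(link_template.format(page=page), quote=True)
        rows.append(f'<li><a href="{href}">{html.escape(page)}</a> {html.escape(info)}</li>')
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


# Made-up sample list, for illustration only.
sample = ["b0001 first word to last word on page 1", "b0002 first word to last word on page 2"]
print(render_index(parse_index(sample)))
```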

3. Hand-correct the text.

This is the most important task, and also the most time-consuming and labor-intensive. However, it is not difficult: you simply compare the online text against the original page image and make corrections.

One way to do this is to print out the text and images, and compare them side by side, keeping a finger on both to keep your place. You can use a red pen to mark errors, and then enter the corrections at your computer. Of course, this is just a suggestion based on my own experience; you can do it however you like.

It's perfectly OK to correct just part of a text. Starting a whole book doesn't obligate you to finish it, and even a few pages get us that much closer to completion.

A lot of times I get questions on how the text should be encoded, etc. There really isn't any one right answer here. The important thing is that none of the contrasts between characters be lost. See below for further comments on this issue and on strategies you might choose.

4. Separate text into fields.

The task here is to mark up the text to indicate fields such as headwords, definitions, etymologies, etc. This task takes a little bit more technical understanding than the others; see below for a discussion of why this task is necessary and what it involves.

5. Other stuff

Occasionally, folks have scanned books and sent me the image files to post. If you want to do this, then PLEASE make the filenames match the page numbers. This will save me a ton of work. The numbering system I use is to start with a0001, a0002, etc. for Roman-numeralled or other introductory pages, and b0001, b0002, etc. for the numbered pages of the main text (putting zeroes at the front helps the files to sort in the desired order in a directory). Also, if you've scanned both the left and right pages to a single image, it would really be a nice thing if you use a graphics program to separate the two. Some programs, such as Photoshop, will allow you to define a macro to process all the files automatically. (The really important thing is the filenames, however.)
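If you'd rather not rename the files by hand, here is a minimal Python sketch of the numbering scheme. It assumes one page per image, that the files are listed in page order, that no pages were skipped, and that you know how many introductory pages come first. The function names and sample filenames are made up for the example. It only builds a list of old and new names; doing the actual renaming is left to you.

```python
from pathlib import Path


def page_filename(section, number, ext="png"):
    """Build a name like a0001.png (introductory pages) or b0001.png (main text)."""
    if section not in ("a", "b"):
        raise ValueError("section must be 'a' or 'b'")
    return f"{section}{number:04d}.{ext}"  # zero-padded to four digits


def rename_plan(scans, front_matter_pages):
    """Pair each scan (in page order) with its new name; nothing is renamed here."""
    plan = []
    for i, old in enumerate(scans):
        ext = Path(old).suffix.lstrip(".")
        if i < front_matter_pages:
            new = page_filename("a", i + 1, ext)
        else:
            new = page_filename("b", i - front_matter_pages + 1, ext)
        plan.append((old, new))
    return plan


for old, new in rename_plan(["scan1.tif", "scan2.tif", "scan3.tif"], front_matter_pages=2):
    print(old, "->", new)
```

The zero padding is what makes the files sort correctly: b0002 comes before b0010 in a directory listing, whereas b2 would come after b10.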

Sometimes folks have sent me files which are a trivial conversion of one format to another, such as taking a plain text file and converting it to HTML by simply adding <P> tags to separate the lines. This is generally not very useful unless some new information is added in the process, such as separating fields within the data as described below.