Germanic Lexicon Project
An introduction to character encoding issues

When you sit down to enter text in a pre-modern language, you usually face the problem that not all the characters you need are readily available to you. There is often no straightforward way to type or display these special characters.

For example, suppose you are entering a Gothic text. Gothic uses an unusual character, ƕ, which is not found in any widely supported standard character set such as ISO-8859-1. This character is included in the Unicode character set, but as of this writing, support for Unicode is still very spotty. Even programs which do allow you to work with Unicode text often don't have a way to enter or display rare characters such as ƕ.

So if you're typing in a Gothic text (a dictionary of Gothic with the glosses in English, for example), what do you do when you encounter this character? Following are various strategies you might use, together with the problem with each strategy. The worst strategies are listed first.

Strategy #1: substitute other characters.

Since the Gothic character ƕ represents a "hw" sound, why not just type "hw" in its place? After all, anyone who knows anything about Gothic will understand that this is intended to represent ƕ.

Here's why this is a problem. Consider this glossary entry in Wright's textbook of Gothic:

ƕassei, wf. sharpness, severity, 138, 344. Cp. OE. hwæss, OHG. hwas, sharp.

When we substitute "hw" for "ƕ", the entry turns out as follows:

hwassei, wf. sharpness, severity, 138, 344. Cp. OE. hwæss, OHG. hwas, sharp.

Now suppose that someone else has a way of displaying the ƕ character properly, and wants to restore it to the text. No problem, you think: just do a global search-and-replace and change the characters "hw" to "ƕ":

ƕassei, wf. sharpness, severity, 138, 344. Cp. OE. ƕæss, OHG. ƕas, sharp.

Notice the problem! The character ƕ does not belong in Old English or Old High German. The only way somebody can correctly restore the character is by going through the text and deciding in every single case whether "hw" should be changed to "ƕ". This is a very labor-intensive process, and one which tends to result in errors.
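
The lossy round trip described above can be sketched in a few lines of Python. This is only an illustration, not a tool from this project; the variable names are made up, and the special character is written as the Unicode escape \u0195 (Latin small letter hv) so that the script does not depend on your editor being able to display it.

```python
# Illustrative sketch: replacing a special character with ordinary letters
# cannot be undone by a global search-and-replace.
HWAIR = "\u0195"  # the special Gothic character discussed in the text

original = (HWAIR + "assei, wf. sharpness, severity, 138, 344. "
            "Cp. OE. hw\u00e6ss, OHG. hwas, sharp.")

# Strategy #1: type "hw" wherever the special character occurs.
substituted = original.replace(HWAIR, "hw")

# Later, someone tries to restore the character with a global replace.
restored = substituted.replace("hw", HWAIR)

# The original had one special character; the "restored" text has three,
# because the Old English and Old High German forms were changed as well.
print(original.count(HWAIR), restored.count(HWAIR))
print(restored == original)
```

Once the substitution has been made, nothing in the text itself records which instances of "hw" were originally the special character, so no script can repair it without human judgment.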

This problem might not seem like a big one now, but you can save yourself big problems down the road if you plan ahead and think about how you or others might want to use the data later. The most important thing is never to lose the contrast between characters. Any other considerations are secondary to this.

One stopgap solution is to use some other character which does not conflict with any character already in the text. For example, the yen (¥) character is very unlikely to be found in a glossary of Gothic, so you might use ¥ in place of ƕ, entering ƕassei as ¥assei. This has the advantage of not losing the contrast between characters. The ¥ character might not look anything like the ƕ character, but there is likely to be little confusion in the case where there are relatively few special characters. Someone else can always substitute whatever other encoding they prefer by doing a global search-and-replace, since ¥ always stands only for ƕ. If you do follow this strategy, I strongly recommend adding a chart at the beginning of the document so that someone else can easily tell what the substituted characters stand for.
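
As a rough sketch of the stopgap, the following Python fragment substitutes the yen sign for the special character and then reverses the substitution. The function names are hypothetical, and the check for a pre-existing yen sign is an added safeguard of my own, not something the strategy strictly requires.

```python
# Illustrative sketch: a placeholder that occurs nowhere else in the text
# preserves the contrast between characters, so the substitution is reversible.
HWAIR = "\u0195"        # the special Gothic character
PLACEHOLDER = "\u00a5"  # the yen sign, which is part of ISO-8859-1

def to_placeholder(text):
    # If the placeholder already occurs in the text, using it would
    # lose the contrast between characters, so refuse to continue.
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text")
    return text.replace(HWAIR, PLACEHOLDER)

def from_placeholder(text):
    return text.replace(PLACEHOLDER, HWAIR)

entry = (HWAIR + "assei, wf. sharpness, severity, 138, 344. "
         "Cp. OE. hw\u00e6ss, OHG. hwas, sharp.")

encoded = to_placeholder(entry)
latin1_bytes = encoded.encode("latin-1")  # now fits in a single-byte character set
print(from_placeholder(encoded) == entry)
```

Unlike the "hw" substitution, this round trip returns exactly the original entry, because the yen sign never stands for anything else.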

Strategy #2: Use a special font which includes the characters you need.

You might be able to find or create a font which includes special characters such as ƕ. For example, you might create a special font which uses character number 231 for the ƕ character. Why not just use this font to enter your text?

You can do this, and the character might well display properly on your own computer. Notice, however, that you are in effect defining a new character encoding, one which is not standard and which depends on your specific custom font for proper display. Even if you make your font available for others to install, the font format you choose may not be supported on all platforms; and further, fonts with idiosyncratic encodings tend not to remain well supported over many years.

More commonly, what will happen is that the characters simply end up jumbled on other computers. To use the example given above, the character code 231 represents the c-cedilla (ç) character in the widely used ISO-8859-1 character set, and this is how the character will probably be displayed for most users.

Still, this is not an entirely horrible situation, because you have at least encoded ƕ with its own unique character code. Somebody else can easily search-and-replace that code with something else if desired. For this reason, it is very helpful if you add a chart to the beginning of your document giving an example of each special character, followed by a description of that character so that someone else need not have to spend a lot of time figuring it out.
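
To make this concrete, here is a small Python sketch of what happens to text typed with such a custom font, and of how the unique code can later be remapped. The byte string and the mapping table are hypothetical examples built around the code 231 mentioned above; the target code point U+0195 is the Unicode position of the special character.

```python
# Illustrative sketch: text entered with a custom font that draws the
# special Gothic character at character code 231.
raw = bytes([231]) + b"assei"

# Without the custom font, most software interprets the bytes as ISO-8859-1
# ("latin-1" in Python), where code 231 is c-cedilla.
as_displayed = raw.decode("latin-1")
print(as_displayed)

# Because code 231 is used for nothing else in this document, it can be
# remapped mechanically. This table plays the role of the chart at the
# beginning of the document.
custom_to_unicode = {231: 0x0195}
repaired = as_displayed.translate(custom_to_unicode)
print(repaired == "\u0195assei")
```

The remapping is only safe because the document never uses code 231 for a genuine c-cedilla; if it did, the contrast would already have been lost.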

However, there is an even worse form of this problem. Suppose your special font substitutes "ƕ" in place of the ordinary "h" character (character code 104). Thus, if you type an "h" while using this font, the "ƕ" character appears. No problem, you think: use both the special font and an ordinary font in your document, and mark the instances of "ƕ" as being in the special font.

This might make things display correctly on your own computer, but it is a very fragile form of encoding which should be strongly avoided. If you export the data from your word processor to another format such as plain text, then all of the "ƕ" characters end up as "h", losing the distinction between the two. No word processor format is universally accepted across platforms, and word processor formats always go out of general use after some number of years.
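
The fragility can be illustrated with a toy model of a word processor document as a list of (text, font) runs. This is a deliberate simplification, not how any real word processor stores its data, and the font names are invented.

```python
# Illustrative sketch: the contrast between two characters is carried only
# by the font attribute, so it disappears on export to plain text.

# A word in which "h" set in the special font is displayed as the
# special Gothic character...
with_special = [("h", "SpecialGothicFont"), ("as", "OrdinaryFont")]

# ...and a different word which really does begin with an ordinary "h".
with_plain_h = [("has", "OrdinaryFont")]

def export_plain_text(runs):
    # Plain text keeps the character codes and drops the font information.
    return "".join(text for text, _font in runs)

print(export_plain_text(with_special))
print(export_plain_text(with_plain_h))
print(export_plain_text(with_special) == export_plain_text(with_plain_h))
```

Two documents that looked different on screen export to identical plain text, which is exactly the loss of contrast that should be avoided.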

Even if your text is all in one font (which means that you won't lose the contrast between any characters if your text attributes such as the font selection are lost), there is an additional problem. Some kinds of font have a limit of 255 characters per font. This is not enough for all texts; for example, the Fick/Falk/Torp dictionary of Proto-Germanic contains well over 300 different characters, which is above the 255 character limit.

It's better to not even think in terms of fonts. Fonts are platform-specific and tend not to be stable over time (especially if they are specially defined fonts for some unusual purpose). What font to use for a particular character is a display issue, not a question of how your underlying data should be structured for the long term.

Strategy #3: Use XML-style entities.

If you have ever worked with HTML, you have probably noticed that a special character such as á is encoded as an entity such as &aacute;. Every entity begins with an ampersand & and ends with a semicolon. The browser is a kind of display system which knows that this abstract entity should be drawn with the concrete character á.

In the widely used markup scheme known as XML, you can define your own entities. For example, you can create an entity &hw; as your way of encoding the ƕ character. (Many characters have standard entity names, such as &aacute; for á in both HTML and XML. I recommend that you use standard entity names wherever possible, creating a new one only when you can't find the one you need in any list of standard entities.)

Notice that this way of encoding your special characters is not tied to any particular font or program.
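
As a minimal sketch of a user-defined entity, the following Python fragment parses a tiny XML document in which &hw; is declared in the document type declaration. The element names (glossary, entry, headword, gloss) are invented for the example, and mapping the entity to the Unicode code point U+0195 is just one possible choice of declaration.

```python
import xml.etree.ElementTree as ET

# Illustrative sketch: a small XML document that declares its own entity.
# The element names are hypothetical; only the &hw; entity comes from the text.
document = """<!DOCTYPE glossary [
  <!ENTITY hw "&#x195;">
]>
<glossary>
  <entry>
    <headword>&hw;assei</headword>
    <gloss>sharpness, severity</gloss>
  </entry>
</glossary>"""

root = ET.fromstring(document)
headword = root.find("entry/headword").text

# The parser expands the entity using the declaration at the top of the file.
print(headword == "\u0195assei")
```

The body of the document contains only ordinary ASCII characters; what the entity stands for is stated in one place and can be changed there without touching the entries.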

This strategy has the advantage of being very sturdy, meaning that the encoding tends not to get jumbled up when you transmit the text from one platform to another. It also overcomes the 256 character limit for single-byte encoding, since you can define as many entities as you need.

A disadvantage to this strategy is that entities are harder for a human to read than a direct visual representation of the character itself. However, you need not think of this encoding as being intended for direct human use.

The model which has become widely adopted is that of an abstract base document, which fully encodes all of the distinctions needed in your text but which is not directly concerned with presentation details such as what font to use. From this base document, you use programs or scripts to derive various presentation forms. The commonly used format for base documents is XML; the presentation forms might include HTML, PDF, PostScript, RTF, etc. This kind of approach is said to be data-centric rather than application-centric; the questions as to how your data should be encoded are guided by the internal needs of your data and not by the concerns of any single platform, program, or font.

This approach recognizes that the software environment changes over time. Your abstract base document is not concerned with concrete presentation details such as how to display your &hw; entity. These presentation details will change over time; but your abstract base document need not change. What changes is how you map your abstract base document to its concrete presentation forms. Today the preferred presentation forms are HTML, PDF, etc., but things might be very different 10-20 years down the road. Even with all of these changes in the software environment, your base document need not change; you simply write new scripts or use new software to convert your unchanging base document to whatever presentation forms are needed.
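
The idea of deriving presentation forms from an unchanging base document can be sketched as follows. The two mapping tables and the render function are hypothetical; a real project would more likely use an XML toolchain, but the principle is the same.

```python
import re

# Illustrative sketch: one base text, several presentation forms.
# Each table maps entity names to the form wanted in one kind of output.
HTML_FORMS = {"hw": "&#x195;", "aacute": "&aacute;"}
UNICODE_FORMS = {"hw": "\u0195", "aacute": "\u00e1"}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def render(base_text, table):
    def lookup(match):
        name = match.group(1)
        if name not in table:
            # Fail loudly rather than silently dropping a character.
            raise KeyError("no presentation form defined for &" + name + ";")
        return table[name]
    return ENTITY.sub(lookup, base_text)

base = "&hw;assei, wf. sharpness, severity."

html_form = render(base, HTML_FORMS)
unicode_form = render(base, UNICODE_FORMS)
print(html_form)
print(unicode_form == "\u0195assei, wf. sharpness, severity.")
```

When a new output format comes along, only a new table (or a new script) is needed; the base text stays as it is.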

Strategy #4: Use Unicode.

Most older character sets make use of a single-byte encoding, meaning that they can include at most 256 (= 2^8) distinct characters. Unicode, by contrast, is stored using multi-byte encodings (2 bytes per character in the most usual cases when the UTF-16 encoding is used). Unicode includes many thousands of distinct characters; nearly any character you are likely to ever need to represent is probably already included in Unicode.
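
The difference between single-byte and multi-byte encodings can be checked directly in Python. This is only a sketch; it uses the codec names "latin-1" (ISO-8859-1), "utf-16-be", and "utf-8" from Python's standard library, and the special Gothic character at its Unicode position U+0195.

```python
# Illustrative sketch: a single-byte character set has room for only
# 2**8 = 256 characters, and the special Gothic character is not among them.
HWAIR = "\u0195"

try:
    HWAIR.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

# Two common ways of storing Unicode text as bytes.
utf16_bytes = HWAIR.encode("utf-16-be")
utf8_bytes = HWAIR.encode("utf-8")

print(2 ** 8, fits_in_latin1)
print(len(utf16_bytes), len(utf8_bytes))
```

Both Unicode encodings need two bytes for this character, whereas ISO-8859-1 cannot represent it at all.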

The current problem in using Unicode is not with the Unicode standard itself, but rather with the present software environment. Nearly all of the major players in the software industry recognize that the adoption of Unicode is a very desirable goal and are gradually working toward that end. The progress is happening slowly, however. At present, tools for editing and manipulating text often have support for Unicode which ranges from limited to none. This situation is improving over time, but it will probably be some number of years before Unicode truly becomes the lingua franca for character encoding which it is intended to be.

