Germanic Lexicon Project
Character encoding
Downloadables
- Character database: A machine-readable file used by all of the project-internal subsystems. It includes the Unicode conversion for every character, except for the few characters which aren't in Unicode. You could use this file as a lookup table if you want to write a little script to convert the project documents to Unicode.
- Typesetting test document (pdf): A test page which shows the results of typesetting each of the project-internal entities. This is mainly used to test the typesetting system, but you might find it useful as a reference.
Following are the project-internal character encoding standards for the text documents in the Germanic Lexicon Project.
The encoding scheme for the base documents has been informed by two considerations:
- The character set must be one which is currently widely supported by text editors, text display systems, and other tools. (Thus, Unicode is unfortunately not a practical choice at present, altho I hope that it becomes one within a few years. However, there is a conversion table below which allows these documents to easily be converted to Unicode.)
- The raw text must be as legible and comprehensible to a human as possible, since humans need to edit and correct it. (Thus, an entity such as &aacute; is preferable to &#225; or to &00E1;. However, a directly-encoded character such as á is preferable to an entity such as &aacute;.)
With these considerations in mind, the following scheme has been adopted:
- Base documents use the ISO-8859-1 (Latin 1) character encoding. If a character exists within this set, we directly encode that character as itself in the base document.
- If a character is not included in the ISO-8859-1 set, we represent the character with an XML-style entity such as &oelig;. If there is a standard entity name for the character, we use that entity name. Otherwise, we create a novel entity name according to a specific set of rules.
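As a rough illustration of how a base document encoded this way might be converted to Unicode, here is a minimal Python sketch. The names ENTITY_TO_CODES and base_to_unicode are hypothetical, and the three table entries are copied by hand from the database on this page; the format of the downloadable character database is not described here, so loading it is left out.

```python
import re

# Hypothetical hand-written excerpt of the lookup table:
# entity name -> Unicode conversion string as it appears on this page.
ENTITY_TO_CODES = {
    "oelig": "U0153",
    "hw": "U0195",
    "a-acute-hook": "U0061+U0301+U0328",
}

# Entity names consist of letters, digits and hyphens between "&" and ";".
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9-]*);")


def codes_to_text(codes: str) -> str:
    """Turn a string such as 'U0061+U0301' into the Unicode text it names."""
    return "".join(chr(int(part[1:], 16)) for part in codes.split("+"))


def base_to_unicode(raw: bytes) -> str:
    """Decode ISO-8859-1 bytes, then replace every known entity."""
    text = raw.decode("iso-8859-1")

    def replace(match: re.Match) -> str:
        codes = ENTITY_TO_CODES.get(match.group(1))
        # Unknown entities are left untouched so they can be reviewed by hand.
        return match.group(0) if codes is None else codes_to_text(codes)

    return ENTITY_RE.sub(replace, text)
```

Directly encoded ISO-8859-1 characters need no table lookup in this sketch, because the first 256 Unicode code points coincide with ISO-8859-1.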
Rules for novel entity names
If the character is atomic (i.e., cannot be decomposed into a base character plus diacritics), then we simply pick a suitable name which does not conflict with any existing standard character name, such as &hw; for the Gothic character ƕ. Hyphens may occur in an atomic entity name, as in the case of &s-tall;, &r-runic; and &dash-uncertain;.
However, if the character is not atomic, then the entity name consists of the base character (or the entity name for the base character in the case of non-ASCII base characters, such as &aelig; for æ) followed by a list of diacritic names, separated by hyphens.
The order for the list of diacritics is as follows:
- First are diacritics which cross any portion of the base character, such as the slash in the ø character.
- Next are diacritics above the base character (such as the acute accent on á). These diacritics are ordered from innermost (lowest) to outermost (highest). If there is more than one diacritic at a horizontal level (as can happen in Greek, for example), the diacritics within that level are listed left-to-right.
- Finally come diacritics below the base character, ordered from innermost (highest) to outermost (lowest).
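Thus, an imaginary character consisting of an o with a slash through it, then a long mark, a tilde and an acute accent stacked above it (from lowest to highest), and a hook below it, would have this encoding: &o-slash-long-tilde-acute-hook;.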
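The ordering rules above can be sketched as a small Python helper. The function name make_entity_name and its parameters are hypothetical; the caller is assumed to supply each group already in the order the rules require (above: lowest to highest; below: highest to lowest).

```python
def make_entity_name(base, crossing=(), above=(), below=()):
    """Build a novel entity name from a base character and its diacritics.

    base     -- the base character, or its entity name for non-ASCII bases
    crossing -- diacritics which cross the base character
    above    -- diacritics above the base, innermost (lowest) first
    below    -- diacritics below the base, innermost (highest) first
    """
    # Crossing diacritics come first, then those above, then those below.
    parts = [base, *crossing, *above, *below]
    return "&" + "-".join(parts) + ";"


print(make_entity_name("o", crossing=["slash"],
                       above=["long", "tilde", "acute"],
                       below=["hook"]))
```

Run as written, this prints &o-slash-long-tilde-acute-hook;, the encoding of the imaginary character described above.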
The rules for the ordering of diacritics are essentially the same as the Unicode rules. The conversion of these entities to Unicode is transparent.
The list of valid diacritic names is as follows:
Name         Unicode combining diacritic code
acute        U0301
bar          U0336
cedil        U0327
circ         U0302
dasia        U0314 (Greek only)
diar         U0308 (Greek only)
hachek       U030C
hook         U0328
long         U0304
nonsyllabic  U032F
ocomma       U0315
odot         U0307
ohook        U0313
oring        U030A
oxia         U0301 (Greek only)
peri         U0342 (Greek only)
psili        U0313 (Greek only)
short        U0306
slash        U0338
tilde        U0303
udot         U0323
uml          U0308
uring        U0325
varia        U0300 (Greek only)
ypo          U037A (Greek only)

Most HTML/XML standard entities do not separate diacritics with hyphens (&aacute;, not &a-acute;). By contrast, the novel entities coined according to the scheme here do use hyphen separators (e.g. &a-circ-acute;, not &acircacute;). The reason for this difference is legibility, particularly in the case of characters with multiple diacritics.
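To illustrate how the diacritic names relate to Unicode combining sequences, here is a minimal Python sketch that decomposes a novel entity name with a single ASCII letter as its base. The function name decompose is hypothetical, and the sketch makes simplifying assumptions: it does not handle entity-named bases such as &aelig-circ; or the Greek letters, and it always emits a fully decomposed sequence, whereas the project database sometimes uses a precomposed code point (for example U00F8+U0304 for &o-slash-long;). The database remains the authoritative source.

```python
# Diacritic names and codes copied from the list on this page.
DIACRITICS = {
    "acute": 0x0301, "bar": 0x0336, "cedil": 0x0327, "circ": 0x0302,
    "dasia": 0x0314, "diar": 0x0308, "hachek": 0x030C, "hook": 0x0328,
    "long": 0x0304, "nonsyllabic": 0x032F, "ocomma": 0x0315,
    "odot": 0x0307, "ohook": 0x0313, "oring": 0x030A, "oxia": 0x0301,
    "peri": 0x0342, "psili": 0x0313, "short": 0x0306, "slash": 0x0338,
    "tilde": 0x0303, "udot": 0x0323, "uml": 0x0308, "uring": 0x0325,
    "varia": 0x0300, "ypo": 0x037A,
}


def decompose(entity: str) -> str:
    """Return base letter + combining marks for a name like '&a-acute-hook;'."""
    base, *marks = entity.strip("&;").split("-")
    if len(base) != 1 or not base.isascii() or not base.isalpha():
        raise ValueError(f"unsupported base character: {base!r}")
    unknown = [m for m in marks if m not in DIACRITICS]
    if unknown:
        # Atomic names containing hyphens, such as s-tall, end up here.
        raise ValueError(f"not a diacritic name: {unknown[0]!r}")
    # The entity lists diacritics in the same order as the Unicode sequence.
    return base + "".join(chr(DIACRITICS[m]) for m in marks)
```

For &a-acute-hook; this yields a followed by U+0301 and U+0328, which agrees with the U0061+U0301+U0328 entry in the database below.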
Note that these standards are only followed in the base documents. In derived HTML documents, all pretense of consistency and elegance is dropped; the entities are simply mapped to whatever representations will allow the text to display reasonably correctly on a broad base of web browsers. Many of the less-supported characters are displayed in the derived HTML files by means of embedded image files.
Database of accepted entities
Following is the database of characters outside the ASCII range (whether encoded as ISO-8859-1 characters or as entities) which we recognize as valid within the base documents. Other entities are to be rejected as errors, even if the entity is a standard one. For example, if the entity &yen; for the ¥ character occurs in one of our base documents, it is almost certainly an error, because it is very unlikely that the yen sign is actually to be found in a text on the early Germanic languages. Additional characters will of course be added to the following list when the legitimate need to encode a previously unencountered character arises.
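A check of this kind could be sketched in Python as follows. The names ACCEPTED and find_rejected_entities are hypothetical, and ACCEPTED holds only a handful of names from the database for illustration; a real check would load the full list and would also need to verify directly encoded ISO-8859-1 characters, which this sketch does not do.

```python
import re

# Hypothetical excerpt of the accepted entity names (without "&" and ";").
ACCEPTED = {"hw", "a-long", "oelig", "s-tall", "sup4"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9-]*);")


def find_rejected_entities(text, accepted=ACCEPTED):
    """Return (offset, name) for every entity which is not in the accepted set."""
    return [
        (match.start(), match.group(1))
        for match in ENTITY_RE.finditer(text)
        if match.group(1) not in accepted
    ]


# &yen; is a standard entity, but it is not in the accepted set,
# so it is reported together with its offset in the text.
print(find_rejected_entities("&hw;a &yen;5 &a-long;"))
```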
Character  Entity name              Unicode conversion  Is entity name standard?
"          &quot;                   U0022               Y
&          &amp;                    U0026               Y
<          &lt;                     U003C               Y
>          &gt;                     U003E               Y
˜          &tilde;                  U007E               N
§          &sect;                   U00A7               Y
«          &laquo;                  U00AB               Y
¶          &para;                   U00B6               Y
»          &raquo;                  U00BB               Y
À          &Agrave;                 U00C0               Y
Á          &Aacute;                 U00C1               Y
Â          &Acirc;                  U00C2               Y
Ã          &Atilde;                 U00C3               Y
Ä          &Auml;                   U00C4               Y
Å          &Aring;                  U00C5               Y
Æ          &AElig;                  U00C6               Y
Ç          &Ccedil;                 U00C7               Y
È          &Egrave;                 U00C8               Y
É          &Eacute;                 U00C9               Y
Ê          &Ecirc;                  U00CA               Y
Ë          &Euml;                   U00CB               Y
Ì          &Igrave;                 U00CC               Y
Í          &Iacute;                 U00CD               Y
Î          &Icirc;                  U00CE               Y
Ï          &Iuml;                   U00CF               Y
Ð          &ETH;                    U00D0               Y
Ñ          &Ntilde;                 U00D1               Y
Ò          &Ograve;                 U00D2               Y
Ó          &Oacute;                 U00D3               Y
Ô          &Ocirc;                  U00D4               Y
Õ          &Otilde;                 U00D5               Y
Ö          &Ouml;                   U00D6               Y
Ø          &Oslash;                 U00D8               Y
Ù          &Ugrave;                 U00D9               Y
Ú          &Uacute;                 U00DA               Y
Û          &Ucirc;                  U00DB               Y
Ü          &Uuml;                   U00DC               Y
Ý          &Yacute;                 U00DD               Y
Þ          &THORN;                  U00DE               Y
ß          &szlig;                  U00DF               Y
à          &agrave;                 U00E0               Y
á          &aacute;                 U00E1               Y
â          &acirc;                  U00E2               Y
ã          &atilde;                 U00E3               Y
ä          &auml;                   U00E4               Y
å          &aring;                  U00E5               Y
æ          &aelig;                  U00E6               Y
ç          &ccedil;                 U00E7               Y
è          &egrave;                 U00E8               Y
é          &eacute;                 U00E9               Y
ê          &ecirc;                  U00EA               Y
ë          &euml;                   U00EB               Y
ì          &igrave;                 U00EC               Y
í          &iacute;                 U00ED               Y
î          &icirc;                  U00EE               Y
ï          &iuml;                   U00EF               Y
ð          &eth;                    U00F0               Y
ñ          &ntilde;                 U00F1               Y
ò          &ograve;                 U00F2               Y
ó          &oacute;                 U00F3               Y
ô          &ocirc;                  U00F4               Y
õ          &otilde;                 U00F5               Y
ö          &ouml;                   U00F6               Y
ø          &oslash;                 U00F8               Y
ù          &ugrave;                 U00F9               Y
ú          &uacute;                 U00FA               Y
û          &ucirc;                  U00FB               Y
ü          &uuml;                   U00FC               Y
ý          &yacute;                 U00FD               Y
þ          &thorn;                  U00FE               Y
ÿ          &yuml;                   U00FF               Y
           &dash-uncertain;         -                   N
           &e-sub;                  -                   N
           &u-super;                -                   N
           &aolig;                  -                   N
           &aolig-acute;            -                   N
           &thorn-bar;              -                   N
           &a-acute-hook;           U0061+U0301+U0328   N
           &a-long-acute;           U0061+U0304+U0301   N
           &a-long-short;           U0061+U0304+U0306   N
           &a-odot-acute;           U0061+U0307+U0301   N
           &a-uml-circ;             U0061+U0308+U0302   N
           &a-ohook;                U0061+U0313         N
           &c-hachek-udot;          U0063+U030C+U0323   N
           &c-tilde;                U0063+U0303         N
           &c-udot;                 U0063+U0323         N
           &e-acute-hook;           U0065+U0301+U0328   N
           &e-circ-acute;           U0065+U0302+U0301   N
           &e-tilde-hook;           U0065+U0303+U0328   N
           &e-long-short;           U0065+U0304+U0306   N
           &e-long-hook;            U0065+U0304+U0328   N
           &e-odot-acute;           U0065+U0307+U0301   N
           &e-odot-tilde;           U0065+U0307+U0303   N
           &e-uml-acute;            U0065+U0308+U0301   N
           &e-uml-tilde;            U0065+U0308+U0303   N
           &e-ohook;                U0065+U0313         N
           &g-ocomma;               U0067+U0315         N
           &i-circ-acute;           U0069+U0302+U0301   N
           &i-tilde-hook;           U0069+U0303+U0328   N
           &i-long-acute;           U0069+U0304+U0301   N
           &i-long-short;           U0069+U0304+U0306   N
           &i-oring;                U0069+U030A         N
           &i-oring-acute;          U0069+U030A+U0301   N
           &i-oring-tilde;          U0069+U030A+U0303   N
           &i-nonsyllabic;          U0069+U032F         N
           &k-circ;                 U006B+U0302         N
           &k-ocomma;               U006B+U0315         N
           &l-tilde;                U006C+U0303         N
           &l-ocomma;               U006C+U0315         N
           &l-uring;                U006C+U0325         N
           &m-tilde;                U006D+U0303         N
           &m-uring;                U006D+U0325         N
           &n-ocomma;               U006E+U0315         N
           &n-uring;                U006E+U0325         N
           &o-acute-hook;           U006F+U0301+U0328   N
           &o-circ-acute;           U006F+U0302+U0301   N
           &o-circ-hook;            U006F+U0302+U0328   N
           &o-long-acute;           U006F+U0304+U0301   N
           &o-long-short;           U006F+U0304+U0306   N
           &o-uml-circ;             U006F+U0308+U0302   N
           &q-bar;                  U0071+U0336         N
           &r-acute-udot;           U0072+U0301+U0323   N
           &r-tilde;                U0072+U0303         N
           &r-long;                 U0072+U0304         N
           &r-uring;                U0072+U0325         N
           &s-ocomma;               U0073+U0315         N
           &t-ocomma;               U0074+U0315         N
           &u-circ-acute;           U0075+U0302+U0301   N
           &u-long-acute;           U0075+U0304+U0301   N
           &u-long-short;           U0075+U0304+U0306   N
           &u-odot;                 U0075+U0307         N
           &u-uml-circ;             U0075+U0308+U0302   N
           &u-oring-acute;          U0075+U030A+U0301   N
           &u-oring-tilde;          U0075+U030A+U0303   N
           &u-nonsyllabic;          U0075+U032F         N
           &v-long;                 U0076+U0304         N
           &y-short;                U0079+U0306         N
           &z-odot;                 U007A+U0307         N
           &O-slash-long;           U00D8+U0304         N
           &a-tilde;                U00E3               N
           &aelig-circ;             U00E6+U0302         N
           &o-slash-long;           U00F8+U0304         N
           &A-long;                 U0100               N
           &a-long;                 U0101               N
           &A-short;                U0102               N
           &a-short;                U0103               N
           &a-hook;                 U0105               N
           &c-acute;                U0107               N
           &c-hachek;               U010D               N
           &d-bar;                  U0111               N
           &E-long;                 U0112               N
           &e-long;                 U0113               N
           &e-short;                U0115               N
           &e-odot;                 U0117               N
           &e-hook;                 U0119               N
           &e-hachek;               U011B               N
           &g-circ;                 U011D               N
           &i-tilde;                U0129               N
           &I-long;                 U012A               N
           &i-long;                 U012B               N
           &i-short;                U012D               N
           &i-hook;                 U012F               N
           &k-cedil;                U0137               N
           &l-bar;                  U0142               N
           &n-acute;                U0144               N
           &O-long;                 U014C               N
           &o-long;                 U014D               N
           &o-short;                U014F               N
           &OElig;                  U0152               Y
           &oelig;                  U0153               Y
           &oelig-acute;            U0153+U0301         N
           &r-hachek;               U0159               N
           &s-acute;                U015B               N
           &s-hachek;               U0161               N
           &u-tilde;                U0169               N
           &U-long;                 U016A               N
           &u-long;                 U016B               N
           &u-short;                U016D               N
           &u-oring;                U016F               N
           &w-circ;                 U0175               N
           &y-circ;                 U0177               N
           &z-hachek;               U017E               N
           &s-tall;                 U017F               N
           &b-bar;                  U0180               N
           &hw;                     U0195               N
           &wynn;                   U01BF               N
           &a-hachek;               U01CE               N
           &u-hachek;               U01D4               N
           &aelig-long;             U01E3               N
           &O-hook;                 U01EA               N
           &o-hook;                 U01EB               N
           &j-hachek;               U01F0               N
           &g-acute;                U01F5               N
           &AElig-acute;            U01FC               N
           &aelig-acute;            U01FD               N
           &YOGH;                   U021C               N
           &yogh;                   U021D               N
           &z-tail;                 U0225               N
           &a-odot;                 U0227               N
           &Y-long;                 U0232               N
           &y-long;                 U0233               N
           &schwa;                  U0259               N
           &r-runic;                U0280               N
           &Alpha;                  U0391               Y
           &Beta;                   U0392               Y
           &Gamma;                  U0393               Y
           &Delta;                  U0394               Y
           &Epsilon;                U0395               Y
           &Zeta;                   U0396               Y
           &Eta;                    U0397               Y
           &Theta;                  U0398               Y
           &Iota;                   U0399               Y
           &Kappa;                  U039A               Y
           &Lambda;                 U039B               Y
           &Mu;                     U039C               Y
           &Nu;                     U039D               Y
           &Xi;                     U039E               Y
           &Omicron;                U039F               Y
           &Pi;                     U03A0               Y
           &Rho;                    U03A1               Y
           &Sigma;                  U03A3               Y
           &Tau;                    U03A4               Y
           &Upsilon;                U03A5               Y
           &Phi;                    U03A6               Y
           &Chi;                    U03A7               Y
           &Psi;                    U03A8               Y
           &Omega;                  U03A9               Y
           &alpha;                  U03B1               Y
           &beta;                   U03B2               Y
           &gamma;                  U03B3               Y
           &delta;                  U03B4               Y
           &epsilon;                U03B5               Y
           &epsilon-long;           U03B5+U0304         N
           &zeta;                   U03B6               Y
           &eta;                    U03B7               Y
           &theta;                  U03B8               Y
           &iota;                   U03B9               Y
           &kappa;                  U03BA               Y
           &lambda;                 U03BB               Y
           &mu;                     U03BC               Y
           &nu;                     U03BD               Y
           &xi;                     U03BE               Y
           &omicron;                U03BF               Y
           &pi;                     U03C0               Y
           &rho;                    U03C1               Y
           &sigmaf;                 U03C2               Y
           &sigma;                  U03C3               Y
           &tau;                    U03C4               Y
           &upsilon;                U03C5               Y
           &phi;                    U03C6               Y
           &chi;                    U03C7               Y
           &psi;                    U03C8               Y
           &omega;                  U03C9               Y
           &iota-diar;              U03CA               N
           &left-half-ring;         U0559               N
           &d-udot;                 U1E0D               N
           &h-udot;                 U1E25               N
           &l-udot;                 U1E37               N
           &m-odot;                 U1E41               N
           &m-udot;                 U1E43               N
           &n-odot;                 U1E45               N
           &n-udot;                 U1E47               N
           &r-odot;                 U1E59               N
           &r-udot;                 U1E5B               N
           &s-udot;                 U1E63               N
           &t-udot;                 U1E6D               N
           &v-udot;                 U1E7F               N
           &a-udot;                 U1EA1               N
           &a-circ-acute;           U1EA5               N
           &e-udot;                 U1EB9               N
           &e-tilde;                U1EBD               N
           &y-tilde;                U1EF9               N
           &alpha-psili;            U1F00               N
           &alpha-dasia;            U1F01               N
           &alpha-dasia-oxia;       U1F04               N
           &alpha-psili-oxia;       U1F04               N
           &Alpha-psili;            U1F08               N
           &Alpha-dasia;            U1F09               N
           &epsilon-psili;          U1F10               N
           &epsilon-dasia;          U1F11               N
           &epsilon-psili-oxia;     U1F14               N
           &epsilon-dasia-oxia;     U1F15               N
           &eta-psili;              U1F20               N
           &eta-psili-oxia;         U1F24               N
           &eta-dasia-oxia;         U1F25               N
           &eta-psili-peri;         U1F26               N
           &eta-dasia-peri;         U1F27               N
           &iota-psili;             U1F30               N
           &iota-dasia;             U1F31               N
           &iota-psili-oxia;        U1F34               N
           &iota-dasia-oxia;        U1F35               N
           &iota-psili-peri;        U1F36               N
           &iota-dasia-peri;        U1F37               N
           &omicron-psili;          U1F40               N
           &omicron-psili-peri;     U1F40+U0342         N
           &omicron-dasia;          U1F41               N
           &omicron-psili-oxia;     U1F44               N
           &omicron-dasia-oxia;     U1F45               N
           &upsilon-psili;          U1F50               N
           &upsilon-dasia;          U1F51               N
           &upsilon-psili-oxia;     U1F54               N
           &upsilon-dasia-oxia;     U1F55               N
           &upsilon-psili-peri;     U1F56               N
           &omega-psili;            U1F60               N
           &omega-psili-ypo;        U1F60+U0345         N
           &omega-psili-oxia;       U1F64               N
           &omega-dasia-oxia;       U1F65               N
           &omega-psili-peri;       U1F66               N
           &omega-dasia-peri;       U1F67               N
           &alpha-oxia;             U1F71               N
           &epsilon-oxia;           U1F73               N
           &eta-dasia;              U1F74               N
           &eta-oxia;               U1F75               N
           &iota-oxia;              U1F77               N
           &omicron-varia;          U1F78               N
           &omicron-oxia;           U1F79               N
           &upsilon-oxia;           U1F7B               N
           &omega-oxia;             U1F7D               N
           &alpha-long;             U1FB1               N
           &alpha-long-oxia;        U1FB1+U0301         N
           &alpha-long-psili-oxia;  U1FB1+U0313+U0301   N
           &alpha-peri;             U1FB6               N
           &eta-ypo;                U1FC3               N
           &eta-peri;               U1FC6               N
           &iota-long;              U1FD1               N
           &iota-long-oxia;         U1FD1+U0301         N
           &iota-long-psili;        U1FD1+U0313         N
           &iota-diar-oxia;         U1FD3               N
           &iota-peri;              U1FD6               N
           &iota-psili;             U1FD6               N
           &upsilon-long;           U1FE1               N
           &upsilon-long-oxia;      U1FE1+U0301         N
           &upsilon-long-dasia;     U1FE1+U0314         N
           &upsilon-diar-oxia;      U1FE3               N
           &rho-dasia;              U1FE5               N
           &upsilon-peri;           U1FE6               N
           &omega-ypo;              U1FF3               N
           &omega-peri;             U1FF6               N
           &omega-peri-ypo;         U1FF7               N
           &ndash;                  U2013               Y
           &mdash;                  U2014               Y
           &dash-acute;             U2014+U0301         N
           &highquote;              U2018               N
           &lowquote;               U201A               N
           &bull;                   U2022               Y
           &sup4;                   U2074               N

This page was last updated on 09 Jan 2006.