Unicode

What is Unicode?

Simple Introduction to Unicode

Since THDL is an international digital library whose members and materials use a variety of scripts, including special "diacritic" marks in Roman script, fonts and their underlying encoding are a crucial issue. Computing and the Web have been bedeviled from the start by problems relating to multilingual computing, since there has been poor support for non-Roman scripts, or even for special diacritic marks used with Roman script. As a result, much of the world, specialists included, has had to make do with poor solutions far removed from the ease of use and convertibility we have come to expect from standard Roman script fonts. In THDL, we face this most directly with Tibetan script, Devanagari for Nepali and other Indic languages, and diacritic marks used in Roman script to represent these languages. Unicode is an exciting initiative that is establishing worldwide standards for every script in the world, with the promise of standardized computing support for each standard. While not completely implemented yet, Unicode is already transforming multilingual computing, and we at THDL have fully embraced it.

The prevalence of non-European languages in THDL requires that two overarching issues be addressed: the use of non-Roman fonts to represent each language's own script, and the representation of those languages with special diacritic marks in Roman script. A common problem on both fronts is the widespread use of specific fonts whose underlying encoding schemes are unique to them. For this reason, text entered using these fonts cannot be converted to and from other fonts, and browsers and programs often won't support input or display of these fonts. To understand this problem, it is important to distinguish three elements involved in using scripts on a computer:

Glyphs/fonts: this refers to the actual visual form of the character that one sees on the screen.
Character encoding: this refers to the underlying value that is assigned to each glyph, or piece of a glyph, for use by the computer. The ordinary user would never have any reason to see or consider the character encoding. However, this is the primary feature understood by the computer. When fonts are perfectly convertible back and forth, as with most Roman script fonts, it means they share the same underlying character encoding, which is then mapped onto their distinct glyphs. A single "font" or set of glyphs could be mapped to any number of character encoding schemes; the relationship is entirely arbitrary.
Keyboard/input method: this refers to the keyboard layout or input method that determines which character one gets when one presses a key. One can use a great variety of keyboards for a specific character encoding; the character encoding in no way determines what type of input method or keyboard one uses. Ideally, keyboards are based upon ease of use, so that the most frequent characters are typed using the most accessible keys; in practice, however, certain keyboards become popular for contingent reasons, regardless of how optimal they may or may not be, and these become the standards. A case in point is the standard QWERTY keyboard for typing in English, named after the first six letters on the top row of letter keys, starting from the upper left-hand corner of the keyboard. It is widely regarded as a poor layout in terms of ease of use, and by the usual account it was designed to keep the first generation of typewriters, which could not handle fast typing, from jamming. However, it became habitual, and no one has since managed to establish a new keyboard layout that is widely used for English.
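
The distinction between a glyph and its character encoding can be made concrete with a few lines of Python. This is only a minimal sketch: the three sample characters (one Tibetan, one Devanagari, one Roman letter with a diacritic) and the helper name describe are our own choices for illustration, not part of any THDL tool. How each character looks on screen depends on the font, while the values printed here stay the same in every Unicode font.

```python
import unicodedata

# Sample characters chosen for illustration only: one Tibetan letter,
# one Devanagari letter, and one Roman letter carrying a diacritic (macron).
SAMPLES = ["\u0F40", "\u0915", "\u0101"]


def describe(ch):
    """Return (code point label, Unicode character name, UTF-8 bytes as hex).

    `describe` is a hypothetical helper written for this example.
    """
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())


for ch in SAMPLES:
    # The glyph shown for `ch` depends on the installed font;
    # the code point, name, and bytes do not.
    print(ch, describe(ch))
```

The code point (for example U+0F40) is the "underlying value" described above; the UTF-8 bytes are one standard way of storing that value in a file.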

Thus the "glyphs" and "keyboards" one might use are irrelevant; from our perspective as publishers, we only care about the character encoding. The solution to the problem of idiosyncratic character encodings for different scripts and diacritic marks is Unicode. Unicode is a global initiative to build and implement character encoding standards for all of the world's scripts, so that all fonts for a given script, and for sets of diacritic marks in Roman script, will be completely interoperable and usable within all digital environments. This means that text in a "Unicode-mapped font" can take full advantage of all software and browsers supporting Unicode, can be switched to any other Unicode font for the same script, and should survive long into the future without needing to be converted manually into other fonts or character encodings.
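
The problem of idiosyncratic encodings can be sketched in Python as well. The example below does not model any actual font-specific encoding used for Tibetan or Devanagari; as an assumption made purely for illustration, it uses three legacy 8-bit code pages as stand-ins. They show the same effect: one stored value means a different character depending on which scheme is assumed, whereas Unicode text survives a round trip unchanged.

```python
# One stored byte value, interpreted under three legacy 8-bit encodings.
# These code pages are stand-ins for font-specific encodings (an assumption
# made for this example only).
raw = b"\xe1"
legacy = {name: raw.decode(name) for name in ("latin-1", "cp1251", "iso8859_7")}

for name, ch in legacy.items():
    # Same byte, different character under each scheme.
    print(name, ch, f"U+{ord(ch):04X}")

# Unicode text mixing Tibetan, Devanagari, and a Roman letter with a
# diacritic: encoding to UTF-8 and decoding again yields the same string.
text = "\u0F40\u0915\u0101"
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)
```

With the legacy schemes, a document is only readable if the reader knows which scheme (or which font) was used to create it; with Unicode, each character has one agreed value regardless of font.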

While this is clearly the future of fonts, and THDL fully embraces Unicode for that reason, the present has only partially realized the full promise of Unicode. Problems are of three varieties:

Lack of Unicode standards, fonts or operating system support for some scripts
Lack of adequate Unicode fonts and/or keyboards for supported scripts
Lack of software support for Unicode-mapped fonts

The first problem is that, for some scripts of the world, the Unicode character encoding standard may still have to be created and ratified; the standard may exist, but fonts may not yet have been created; or the necessary operating system support for that Unicode standard in Windows or Mac OS may not yet be implemented. In these cases, there simply may not be a viable solution for that script in Unicode. However, it is important to keep in mind that the situation is in tremendous flux, and what isn't ratified or implemented today is likely to change in the near future.

The second problem is that even if all the necessary standards and support are in place, no one may have yet made a font using the standards. More likely, there may be fonts, but they are not of high quality: they may have poor resolution on the screen, the italics may be missing, they may look strange at small sizes, and so on. Thus the big trade-off is that one may be using the encoding scheme of the future, but the actual appearance is significantly inferior to what one has come to expect from computer fonts. Likewise, a decent font may be available, but no one may be supplying a ready-made keyboard corresponding to what you generally expect to use, or any keyboard at all. However, here too the situation is in rapid flux, and there is every reason to expect increasingly high-quality and diverse sets of Unicode-mapped fonts to appear every six months or so. The same applies to keyboards; in addition, creating new keyboards is not an overly difficult task.

The third problem is that a wide range of software still does not fully support Unicode-mapped fonts. This is one of the biggest obstacles. However, here too, each year new versions of standard software packages are released with full support for Unicode, and hence support is expanding rapidly.

How is Unicode being used in THDL?

In THDL, we have the following major foci in terms of fonts and character encoding:

Tibetan script fonts and input methods
Diacritic support for representing Asian languages and phonetics with Roman script
Devanagari script for Nepali and Hindi
Chinese characters
Creating and browsing Unicode Web Pages

Please see the respective THDL sites hyperlinked above for additional information.

How Can I Explore Unicode Further?

For further resources on Unicode, please see the following sites:

Provided for unrestricted use by the Tibetan and Himalayan Digital Library