Conversion-reversion For Tibetan Fonts

Unicode is the global standard for encoding the world's scripts on computers, so only Unicode fonts should be used when working with Tibetan on a computer. However, much work was done in non-Unicode "legacy" fonts before Tibetan Unicode was established. To take advantage of this work, it is necessary to use "converters", which convert legacy font encodings or romanized Tibetan text into Tibetan Unicode. There may remain a very few contexts where you want to "revert" Tibetan Unicode documents into legacy fonts, for which you use "reverter" programs. Finally, there may be rare contexts in which you want to convert from one legacy font encoding (say, Sambhota) to another (say, Tibet Machine). Unfortunately, there is no single universal program that does all of these conversions and reversions. Rather, there is a confusing array of such programs, each of which performs specific conversions and many of which have flaws in their output.

The present page (see below) provides a short introduction to these issues, followed by a short survey of steps necessary in general to convert from legacy Tibetan fonts to Unicode, or from romanized Tibetan text to Unicode. We strongly recommend you first read this page fully. Then use these two pages to find the converter you should use:

Recommendations: this is organized from your starting point, namely what font or romanization your text to be converted is in.
All Converters and Reverters: this is a list of all converters and reverters.

Introduction To Converters-Reverters For Tibetan

Unicode is the global initiative that aims to establish character encoding standards for all the scripts of the world. Once a standard is established for a specific script, Unicode promises that all fonts made for that script following its standards will be perfectly interchangeable, while new versions of software and operating systems will provide robust support for the use of those fonts. In other words, Unicode holds out the promise that all scripts can eventually enjoy the seamless computing support that roman script and a few other major scripts have enjoyed for years. Unicode support for Tibetan is slowly evolving and, though still far from perfect, is good enough that some individuals and organizations have already switched over to its exclusive use. We expect that by 2008, very little serious work will be done in Tibetan that is not done in Unicode.

However, up until 2005, very little Tibetan text input was done using Unicode Tibetan fonts. Instead, it was done using roman script in transliteration or with Tibetan fonts with private, idiosyncratic encodings. This previous work must be updated for permanent archiving by "converting" it into Unicode encoding. Unfortunately, the process of conversion is fraught with difficulties, and mapping the "legacy" format to Unicode can introduce errors. In addition, most converters do not convert special formatting that one may have applied to an input Tibetan text, such as "styles" in Microsoft Word. Such formatting can represent extensive work, all of which must be redone if the conversion to Unicode strips it out. It should also be noted that not all converters are equal, and that a poorly made converter will produce conversions ridden with errors yet not inform the user.

However, once a text has been converted to Unicode, it will never need to be converted again. Thus the text will be able to take advantage of all the latest software and Web formats over the coming years. In addition, since it won't require conversion again, one can be confident that all proofing and error correction, as well as formatting/markup, will never be altered in the future.

A crucial part of the Tibetan Unicode solution is thus a type of digital Rosetta Stone that allows for the automated conversion and reversion of legacy formats, fonts and transliteration schemes with proprietary encoding into Unicode encoding, while preserving as much as possible of the original value-added formatting. "Legacy" or "heritage" systems are all Tibetan font and software systems other than Unicode. Such systems will ultimately be of no value once Unicode is established as the global standard for Tibetan computing. However, a huge amount of Tibetan material has already been entered in these systems, so it's essential that a clear migration path be established for moving those materials into Unicode. One of the complexities in this transfer is that many of these materials have complex formatting, and if the formatting is not preserved as well, a large amount of work will be lost. "Reverters" go from Unicode to "legacy" fonts, and "converters" go from legacy fonts to Unicode.

"Reversion" refers to the transformation of Unicode back into these heritage systems. It may seem superfluous, but especially for a culture such as Tibet, with poor technical infrastructure and limited financial resources, it in fact plays an important role. As the transition to Unicode takes place, for a period of up to a decade it may well be that many Tibetan users don't have access to computers with the necessary power, operating system and software to use Unicode. For those users, it will be important that Unicode Tibetan documents can be "reverted" into the heritage systems they do have access to. Another compelling reason for reverters is that we anticipate that some heritage software systems which provide specialized functionality, especially in publishing, will be used in the Tibetan world for some years to come. If reverters exist, text can be generated in Unicode Tibetan and then reverted to a legacy encoding for use in a specific context.

Conversion from one system to another and back (e.g., legacy -> Unicode -> legacy) is generally known as "round-trip conversion". Ideally there should be no data loss when a full round-trip conversion (A > B > A) is made. Problems can arise when there are characters supported by one system but not the other (e.g., Tony Duff's system supports a few characters not yet encoded in Unicode, and Unicode supports combinations of characters unsupported by other systems).
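A round-trip check can be sketched roughly as follows. This is only an illustration: the "legacy" table is a made-up toy mapping, not the encoding of any real Tibetan font, and the function names are hypothetical. The point is that a character with no counterpart on the other side is reported instead of passing silently.

```python
# Toy mapping from an invented legacy encoding to Unicode Tibetan.
# Real legacy fonts use different and far larger tables.
LEGACY_TO_UNICODE = {
    "k": "\u0f40",  # TIBETAN LETTER KA
    "g": "\u0f42",  # TIBETAN LETTER GA
    " ": "\u0f0b",  # tsheg
    "/": "\u0f0d",  # shad
}
UNICODE_TO_LEGACY = {v: k for k, v in LEGACY_TO_UNICODE.items()}


def convert(text, table):
    """Map each character through table; also return unmapped characters."""
    out, unmapped = [], []
    for ch in text:
        if ch in table:
            out.append(table[ch])
        else:
            out.append(ch)  # pass through unchanged, but remember it
            unmapped.append(ch)
    return "".join(out), unmapped


def round_trip_ok(legacy_text):
    """True only if A > B > A reproduces the input with nothing unmapped."""
    unicode_text, lost_forward = convert(legacy_text, LEGACY_TO_UNICODE)
    back, lost_backward = convert(unicode_text, UNICODE_TO_LEGACY)
    return back == legacy_text and not lost_forward and not lost_backward
```

Here `round_trip_ok("k g/")` succeeds, while a text containing a character missing from the table fails the check.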

Thus the ultimate need is a comprehensive suite of converters to convert Tibetan script text data stored in legacy encodings to Unicode, and also revert Unicode Tibetan back to a select range of those legacy encodings. In addition, it is crucial that both preserve text formatting from different file formats. However, the most pressing need is to offer a migration path whereby fonts of all types can be migrated to Unicode. Increasingly, there is absolutely no point to using non-Unicode fonts. However, there may be a software package that currently doesn't support Unicode, and in that context, some may desire to migrate materials out of Unicode into a non-Unicode system for use within those systems.

Overview Of How To Do A Conversion From Non-Unicode Tibetan Font

1. Assess system: If your text is in a Tibetan font, the first step is to determine what font has been used. It is especially important in either context to watch out for mixed fonts/systems in a single text. Such hybrids will play havoc with any converter. Mixing is often not obvious, and you can waste a lot of time trying to figure out conversion problems that result from various Tibetan fonts being mixed up in a single document. To check this, do a visual scan to see if you can spot font differences, and click at various places in the text to see what font is in use. Alternatively, you can skip this check initially and come back to it as a potential cause if you run into conversion problems.
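If the font information can be extracted from the document, the check for mixed fonts can be automated. The sketch below assumes you already have the text as a list of (font name, text) runs; how to obtain those runs depends on the file format and is not shown. The function names and the run structure are assumptions made for illustration.

```python
from collections import Counter


def font_inventory(runs):
    """Count characters per font name across (font_name, text) runs."""
    counts = Counter()
    for font_name, text in runs:
        counts[font_name] += len(text)
    return counts


def is_mixed(runs):
    """True if more than one font is used anywhere in the runs."""
    return len(font_inventory(runs)) > 1


# Hypothetical runs; the font labels are just the names used on this page.
runs = [("Sambhota", "abc"), ("Tibet Machine", "de"), ("Sambhota", "f")]
print(font_inventory(runs))
print(is_mixed(runs))
```

A font that accounts for only a handful of characters is a good place to start looking when a conversion goes wrong.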

2. Assess special formatting: The second step is to determine whether the text has been input using styles in a word processor, whether it has footnotes, and whether it has parenthetical insertions indicating variant readings, translations, or notes of some type. This will affect the converter you choose, and may also require pre-conversion cleanup or deletions. For example, if someone has applied styles in Microsoft Word that mark parts of the text as verse or as citations, or that mark place names, most converters will strip these out when they do conversions. If the original text has important formatting that you want to maintain, pay special attention to which converters preserve formatting during conversion.

3. Assess available converters: Check what converters are available from this font to Unicode. If there is no direct converter available, work out the most efficient chain of conversions from the font in question to some intermediary and then to Unicode. Also keep in mind whether the text has any special formatting (styles, footnotes, etc.) that must be kept, and whether the conversions in question will preserve such formatting. If not, you may have to strip that formatting out before you run the conversion.
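When no direct converter exists, finding the shortest chain of conversions is a small graph search. The following is a minimal sketch using breadth-first search; the converter inventory is entirely hypothetical (the format names are placeholders, not a claim about which converters exist), and it counts only the number of steps, not output quality or formatting preservation.

```python
from collections import deque


def shortest_conversion_path(converters, source, target="Unicode"):
    """Breadth-first search over a {source_format: [target_formats]} map."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in converters.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of converters reaches the target


# Hypothetical inventory of available converters (placeholder names).
CONVERTERS = {
    "LegacyA": ["LegacyB", "Wylie"],
    "LegacyB": ["Unicode"],
    "Wylie": ["Unicode"],
    "LegacyC": ["LegacyA"],
}

print(shortest_conversion_path(CONVERTERS, "LegacyC"))
```

In practice a shorter chain is not automatically the better one: a longer route through converters that preserve formatting may save more work.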

4. Convert: Run the conversion(s).

5. Review results: Carefully scan the output to check for any problems. Watch out especially for problems with Sanskrit stacks and less common punctuation. If the converter has problems, contact its creators for advice and possible fixes.
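Part of this review can be automated. One simple heuristic, sketched below, is to list every character in the converted output that lies outside the Unicode Tibetan block (U+0F00 to U+0FFF), since leftover Latin letters or symbols often indicate characters the converter failed to map. The function name and the set of allowed characters are assumptions; a text that legitimately mixes Tibetan with other scripts will need a wider allowance, and this does not replace a visual check of stacks.

```python
import unicodedata

TIBETAN_BLOCK = range(0x0F00, 0x1000)  # U+0F00..U+0FFF


def suspicious_characters(text, allowed=" \n\t"):
    """Return (index, char, name) for characters outside the Tibetan block."""
    hits = []
    for i, ch in enumerate(text):
        if ord(ch) in TIBETAN_BLOCK or ch in allowed:
            continue
        hits.append((i, ch, unicodedata.name(ch, "UNNAMED")))
    return hits


# KA, tsheg, a stray Latin "A", shad
sample = "\u0f40\u0f0bA\u0f0d"
for index, char, name in suspicious_characters(sample):
    print(index, repr(char), name)
```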

6. Standardize Unicode: Fix the resultant problems and standardize the Unicode Tibetan text.

Overview Of How To Do A Conversion From Romanized Tibetan Text

1. Assess system: If the text has been input in transliteration, you have to determine what system of transliteration was used. Often the system of transliteration will not be regular or standard, so you have to take careful note of how special cases are handled (such as gy vs. g.y, a chen (some use % in an otherwise Wylie system), a chung, Sanskrit stacks, special punctuation marks, etc.). For example, "Wylie" covers only the core characters, and many Wylie texts do not follow a coherent system for other characters.
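A quick way to spot idiosyncratic conventions is to count the characters that fall outside the inventory you expect. The sketch below assumes a plain Wylie-style text made up of lowercase letters plus a few punctuation characters; that inventory is an assumption for illustration and should be adjusted to whatever your chosen converter documents.

```python
from collections import Counter
import string

# Assumed character inventory for a plain Wylie-style text (illustrative).
EXPECTED = set(string.ascii_lowercase) | set("'./_ \n")


def unexpected_characters(text):
    """Count characters that fall outside the assumed inventory."""
    return Counter(ch for ch in text if ch not in EXPECTED)


print(unexpected_characters("bla ma%i g.yag"))
```

Any character that shows up in the count, such as % here, is a convention you need to identify before choosing a converter.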

3. Assess available converters: Check what converters are available from this form of transliteration to Unicode and choose the most appropriate one. If there is no direct converter available, work out the most efficient chain of conversions from the transliteration in question to some intermediary and then to Unicode. Also keep in mind whether the text has any special formatting (styles, footnotes, etc.) that must be kept, and whether the conversions in question will preserve such formatting. If not, you may have to strip that formatting out before you run the conversion.

4. Standardize transliteration: Go through and standardize the transliteration so that it accords perfectly with the transliteration conventions expected by the converter. Scan the text for problems that can be cleaned up, such as omissions, irregular elements and the like. Fix what is programmatically fixable, and document clearly anything that remains to be fixed manually. Texts input in romanization often use a rough Wylie. Issues to watch for:

Shad: a common conversion issue when starting from a transliterated Tibetan text is that periods or commas may have been used for shad. All of these must first be changed to "/", which converters will turn into a Unicode shad. In many cases, shads are missing altogether from the input text, and you have to experiment with various global search-and-replace operations to insert them. For example, if the text has been input with carriage returns at the end of each shad-delimited line, you can do a global search for carriage returns and replace them with /_ plus a carriage return. If they are verse lines, then it would be /_ plus carriage returns. One thing to watch for is that a final "g" is followed only by white space and no shad, while a final "ng" has a tsheg between it and the shad. Thus, after fixing the shad in general, first run a global search and replace of ng/ with ng /, and then replace the remaining g/ with g. The order matters, because a replacement of g/ would otherwise also match the end of ng/. If the text has footnotes between the final syllable and the shad, however, this will cause problems, since the search won't find those cases of g/ or ng/ that have a footnote before the shad. Those can only be fixed manually.
White space: white space should be converted to "_" in Wylie.
"gy" and "g.y": these have often not been properly differentiated.
a chen: treated idiosyncratically. It should be "a", but people use % and other symbols.
a chung: treated idiosyncratically. It should be " ' ", but people use "a" and other symbols.
Sanskrit stacks: treated idiosyncratically.
Turn on the display of tabs, spaces and paragraph marks. It helps you see problems.
Footnotes: footnotes can be a problem in non-Unicode texts that need to be converted.
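The shad fixes described in this list can be scripted with regular expressions. The following is a minimal sketch, assuming a rough Wylie input in which periods and commas at the end of a word stand for shad; the function name is hypothetical. It applies the "ng" rule before the "g" rule and does not handle missing shads, footnotes, or periods that are not shads (such as abbreviations), all of which still need manual review.

```python
import re


def normalize_shad(text):
    """Apply the shad clean-up rules described above to rough Wylie."""
    # 1. Periods or commas used as shad become "/". A period between
    #    letters (as in "g.yag") is a Wylie disambiguator and is kept.
    text = re.sub(r"[.,](?=\s|$)", "/", text)
    # 2. Final "ng" takes a tsheg (a space in Wylie) before the shad.
    text = re.sub(r"ng/", "ng /", text)
    # 3. Final "g" takes no shad; "(?<!n)" keeps "ng" out of this rule.
    text = re.sub(r"(?<!n)g/", "g", text)
    return text


for line in ["bla ma,", "rang.", "dag, g.yag."]:
    print(repr(normalize_shad(line)))
```

Running the rules on a copy of the text and comparing the result against the original is a safer workflow than replacing in place.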

5. Convert: Run the conversion(s).

6. Standardize Unicode: Fix the resultant problems and standardize the Unicode Tibetan text.
