Sorting Tibetan

THL Toolbox > Tibetan Scripts, Fonts & Related Issues > Sorting Tibetan

Collation of Tibetan

Collation refers to sorting Tibetan words or phrases in an order that reflects the standard sequence of Tibetan words as found in Tibetan dictionaries. A standard use for sorting is in arranging search results by Tibetan “alphabetical order”, or arranging the entries in a dictionary into such an order for browsing. Unfortunately, this is far more complex than sorting English because of the structure of a Tibetan syllable. It is not sufficient simply to sort on the Tibetan order (ka, kha, ga, nga, etc.) of the first letter in each syllable, and then the second letter, and so on. Rather, the sorting routine has to analyze the syllable and recognize which letter in the syllable is the “root letter” (e.g. “g” is the root letter in bsgrubs), and then furthermore consider what prefix, superscript, subscript, suffix, and post-suffix the syllable may have in order to determine its precise sorting location.
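As an illustration of the difference this makes, here is a minimal sketch (assuming the ICU4J library is available, and using the Dzongkha locale "dz" on the assumption that its tailoring, which follows Tibetan dictionary order, is present in your ICU version) contrasting naive code-point sorting with locale-aware collation:

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.util.ULocale;
    import java.util.Arrays;

    public class TibetanSortDemo {
        public static void main(String[] args) {
            // dam, dge, dbang all begin with DA (U+0F51), but their root
            // letters are da, ga and ba respectively, so dictionary order
            // is dge, dam, dbang (ga < da < ba in the Tibetan alphabet).
            String dam   = "\u0F51\u0F58";       // da + ma
            String dge   = "\u0F51\u0F42\u0F7A"; // da + ga + vowel e
            String dbang = "\u0F51\u0F56\u0F44"; // da + ba + nga

            // Naive code-point sorting compares the second code points
            // (ga 0F42 < ba 0F56 < ma 0F58) and yields dge, dbang, dam - wrong.
            String[] naive = { dam, dge, dbang };
            Arrays.sort(naive);
            System.out.println(Arrays.toString(naive));

            // A locale-aware collator; "dz" (Dzongkha) is assumed here because
            // its tailoring follows Tibetan dictionary order, while a plain
            // "bo" locale may lack a tailoring in a given CLDR/ICU version.
            String[] collated = { dam, dge, dbang };
            Arrays.sort(collated, Collator.getInstance(new ULocale("dz")));
            System.out.println(Arrays.toString(collated)); // expect dge, dam, dbang
        }
    }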

In addition, what tool to use in sorting depends on the context in which you are working. Are you trying to do a one-time sort for your own purposes on a local computer, or are you trying to sort dynamically in a server-side database? If the former, then obviously you can use offline tools, while the latter will require tools suited to online use. However, if you are having trouble getting the latter tools to work, one solution is to export the data, sort it offline, and then reimport the data. You can then use the row ids as the sort key. This should work fine, though of course any new items added will not be sorted, and eventually the process will have to be manually repeated to incorporate new items.
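A minimal sketch of that offline workflow (again assuming ICU4J and a locale whose tailoring sorts Tibetan correctly; the file names are placeholders):

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.util.ULocale;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class OfflineSort {
        public static void main(String[] args) throws Exception {
            // Read the exported entries, one per line, in UTF-8
            List<String> terms = Files.readAllLines(
                    Path.of("export.txt"), StandardCharsets.UTF_8);

            // Sort with a locale-aware collator ("dz" assumed, as discussed below)
            terms.sort(Collator.getInstance(new ULocale("dz")));

            // Write "rank<TAB>term"; after re-import the rank (or the new
            // row ids) can serve as the database's sort key
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < terms.size(); i++) {
                out.append(i + 1).append('\t').append(terms.get(i)).append('\n');
            }
            Files.writeString(Path.of("sorted.txt"), out.toString());
        }
    }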

Unicode Collation Charts & other Related Issues

The current collation chart for Tibetan <external link: http://www.unicode.org/charts/collation/chart_Tibetan.html> looks *completely* broken as it does not handle prefixes, rago, lago, sago, etc. Someone needs to get to the bottom of this.

Tibetan collation should be almost identical to Dzongkha; see e.g. <external link: http://developer.mimer.com/charts/dzongkha.htm>, which also gives a correct collation for Tibetan. It is based upon the Bod rgya tshig mdzod chen mo dictionary.

An urgent priority is getting someone to contribute Tibetan locale data to CLDR. The lack of such data is one of the main reasons there is no proper Tibetan sorting etc. in applications. CLDR is the Common Locale Data Repository maintained by the Unicode consortium (external link: http://unicode.org/cldr/), and offers the largest and most extensive standard repository of locale data. This data is used by a wide spectrum of companies for their software internationalization and localization: adapting software to the conventions of different languages for such common software tasks as formatting of dates, times, time zones, numbers, and currency values; sorting text; choosing languages or countries by name; and many others.

For Tibetan you cannot use ordering based on a locale's LC_COLLATE. Last time I looked, glibc locales had a limitation of 256 collation elements, which is insufficient for Tibetan.

Testing Whether Tibetan Sorting is Working in a Given Application

Once it is working on one system, the application - such as the MySQL Dzongkha & Tibetan collation - should be tested on the various operating systems as relevant - Mac, Windows, Linux, OpenSolaris, and FreeBSD to start with. At the very least there may be configuration differences between platforms which need documenting.

In addition, the version of the application used has to be verified; it should not be presumed that because sorting works in one version, it will work in later or earlier versions. For example, Dzongkha sorting worked in OpenOffice 2.1 but *not* in 2.2 or 2.3; after this was reported, it worked again in 2.4 and 3.0.

The first step in assessing whether sorting is working in a specific application version in a specific operating system environment is to see if it is properly sorting complex syllables. In other words, it may sort on the first letter of syllables without properly analyzing the root letters at all. Thus we would get the incorrect dam, dge, dbang instead of the correct dge, dam, dbang. However, even if the basic syllables are sorting correctly, to make sure Tibetan collation is *really* working properly on any one platform, there are many other things which may not be obvious at first that need testing. For a start:

To make sure that the normalization part is working correctly, it needs to be tested with the following (a minimal check of the first two items is sketched after the list):

  • a list with strings that contain pre-composed Sanskrit vowels (0F73, 0F75, 0F76, 0F77, 0F78, 0F79, 0F81) and similar strings with the same vowels in their de-composed forms (0F71 0F72, 0F71 0F74, 0FB2 0F80, 0FB2 0F71 0F80, 0FB3 0F80, 0FB3 0F71 0F80, 0F71 0F80)
  • the letters gha (pre-composed 0F43 and subjoined 0F93, vs. de-composed 0F42 0FB7 and 0F92 0FB7), Dha, dha, bha, dzha, kSha - and U+0F00
  • strings that use 0FBA instead of 0FAD, 0FBB instead of 0FB1 and 0FBC instead of 0FB2
  • strings that use 0F7E instead of final nga
  • strings that use 0F4A instead of final _gs
  • strings with 0F7B vs. 0F7A 0F7A, and 0F7D vs. 0F7C 0F7C
  • strings with characters like 0F35 and 0F37 in the middle vs. those without these characters.

Only if all these things are working as expected is Tibetan collation working properly.
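As a concrete example, here is a minimal ICU4J sketch (same assumptions as above) that checks whether pre-composed and de-composed spellings compare as equal:

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.util.ULocale;

    public class NormalizationCheck {
        public static void main(String[] args) {
            Collator collator = Collator.getInstance(new ULocale("dz"));
            // Ask the collator to normalize input before comparing
            collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

            // ka + pre-composed Sanskrit vowel II (0F73)
            // vs. ka + the same vowel de-composed (0F71 0F72)
            System.out.println(
                    collator.compare("\u0F40\u0F73", "\u0F40\u0F71\u0F72") == 0);

            // pre-composed gha (0F43) vs. de-composed ga + subjoined ha (0F42 0FB7)
            System.out.println(
                    collator.compare("\u0F43", "\u0F42\u0FB7") == 0);
        }
    }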

Andrew West's BabelPad is useful for creating lists mixing normalized and un-normalized strings and so on.

Specific Desktop Applications

Office 2007 sorts Tibetan in its various applications.

OpenOffice uses the Dzongkha collation data in ICU - and the Dzongkha collation works.

There is a small command line tool for Windows (external link: Dzongkha-collate) that does Unicode sorting.

Mimer SQL also kindly offers, free of charge, an offline collation engine (external link: MimerSQL offline collation engine). The Windows command line unidrv -ct tibetan.dat inputFile outputFile will perform a Tibetan sort according to the rules in tibetan.dat.

You can also use the THL Jskad tool, which includes Java classes for Tibetan sorting.

ICU

ICU stands for “International Components for Unicode” (external link: http://www.icu-project.org/). It is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. It comes in two versions - ICU4C for C and C++ languages and ICU4J for the Java language. ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software. It provides Code Page Conversion to convert text data to or from Unicode and nearly any other character set or encoding; it also provides Collation for various languages based upon the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository (external link: http://www.unicode.org/cldr/).
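Since, as noted above, standard Tibetan locale data may be missing, ICU also lets you supply your own collation rules (a "tailoring"). Here is a deliberately tiny, hypothetical ICU4J sketch; a real Tibetan tailoring would have to order all the stacks, prefixes, vowels and so on:

    import com.ibm.icu.text.RuleBasedCollator;

    public class CustomTailoring {
        public static void main(String[] args) throws Exception {
            // Toy rule set covering only ka < kha < ga < nga
            // (0F40, 0F41, 0F42, 0F44)
            String rules = "&\u0F40 < \u0F41 < \u0F42 < \u0F44";
            RuleBasedCollator collator = new RuleBasedCollator(rules);

            System.out.println(collator.compare("\u0F41", "\u0F42") < 0); // true: kha < ga
        }
    }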

ICU is currently in release 4.0 (December 2008) and has supported Tibetan since version 3.6 (see external link: http://www.icu-project.org/download/3.6.html). It is currently used by:

  • OpenOffice - the Dzongkha collation works.
  • MySQL - MySQL & ICU now also works, at least in theory (see the MySQL section below).

The best place to ask questions is the icu-support list: external link: https://lists.sourceforge.net/lists/listinfo/icu-support.

A “locale” identifies a specific user community - a group of users who have similar culture and language expectations for human-computer interaction (and the kinds of data they process). A community is usually understood as the intersection of all users speaking the same language and living in the same country. Furthermore, a sub-community within that community can use more specific conventions. A program should be localized according to the rules specific for the target locale. Many ICU services rely on the proper locale identification in their function.

In specifying your locale, you begin with the two letter or three letter lowercase language code specified in the ISO-639 standard (external link: http://www.loc.gov/standards/iso639-2/); the optional four-letter script code follows the language code as specified in the Unicode ISO 15924 Registry (external link: http://www.unicode.org/iso15924/iso15924-codes.html); and finally the specific country where the language in question is used is specified by two-letter, uppercase codes as specified in the ISO-3166 standard (external link: http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm).
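For example, Tibetan is "bo", optionally with the script code "Tibt" and the country "CN"; Dzongkha in Bhutan is "dz" with country "BT". A minimal ICU4J sketch of constructing such locales:

    import com.ibm.icu.util.ULocale;

    public class LocaleExample {
        public static void main(String[] args) {
            // language "bo" (Tibetan), script "Tibt", country "CN"
            ULocale tibetan = new ULocale("bo_Tibt_CN");
            // language "dz" (Dzongkha), country "BT"
            ULocale dzongkha = new ULocale("dz_BT");

            System.out.println(tibetan.getDisplayName());  // e.g. "Tibetan (Tibetan, China)"
            System.out.println(dzongkha.getDisplayName()); // e.g. "Dzongkha (Bhutan)"
        }
    }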

Sorting with MYSQL Databases

MySQL & ICU works theoretically, but so far no one has indicated they have actually gotten it to work in practice.

For details on sorting Dzongkha and Tibetan Unicode in MySQL through a Custom Collation, see MYSQL Tibetan Collation. This seems a better method than dragging in the whole ICU library and is easy to modify.

MS SQL Server running on Vista should also have proper Tibetan collation - but that limits you to using an MS Windows server.

Mimer SQL

Mimer SQL from Sweden has had excellent integrated Unicode support for years and has versions which run on nearly every operating system. They have supported Tibetan since 2004.

Non-commercial projects receive a free license for Mimer SQL. It should also be noted that it is said to work fine with Ruby on Rails applications (see external link: http://rubyforge.org/projects/ar-mimer).

Sorting within POSTGRES Databases

It is disconcerting that you still need an add-on or a patch to get Postgres to provide proper language-specific collation; some other database management systems have supported this without any add-ons for ten years. This functionality should really be part of the normal build.

There is a wrapper (pg_collkey) for PostgreSQL that allows you to apply UCA/ICU collation to UTF-8 strings; see: external link: http://www.flexiguided.de/publications.pgcollkey.en.html. Its creator is Jan Behrens, who can be contacted by using the CONTACT form on the web site.

UCA is the Unicode Collation Algorithm, which provides a specification for how to compare two Unicode strings while remaining conformant to the requirements of The Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.

This should also work for Dzongkha and Tibetan. However, at THL we have so far been unable to get it to work, and have not yet spoken with anyone who has been successful in making it work. The following details are from the README.

This is a wrapper to use the collation functions of the ICU library with a PostgreSQL database server. Using this wrapper, you can specify the desired locale for sorting UTF-8 strings directly in the SQL query, rather than setting it during database installation. Default Unicode collation (DUCET) is supported. You can select whether punctuation should be a primary collation attribute or not. The level of comparison can be limited (in order to ignore accents for example). Numeric sequences of strings can be recognized, so that 'test2' is placed before 'test10'.

Requirements: PostgreSQL version 8.1.4, external link: http://www.postgresql.org/ (might run with older versions too, but this is not tested); ICU library version 3.4, external link: http://icu.sourceforge.net/ (might run with older versions too, but this is not tested)

The function will only work if the database encoding is set to UTF-8.

In setting parameters, the second parameter asks you to define a locale, and offers “root” or the example of “de_DE” (the first, lowercase two-letter code is the language, and the second, uppercase one is the country). Since the default "root" sorting order is simply that of the Default Unicode Collation Element Table (DUCET), you probably need to use a table that has the Tibetan or Dzongkha "tailoring" (not the default), i.e. a Tibetan (bo) or Dzongkha (dz) locale. Thus one would expect to use bo_CN for Tibetan in China, or dz_BT for Dzongkha in Bhutan.

The third parameter, if set to true, causes certain characters (punctuation, etc.) to be processed at the 4th level (after processing of the case), instead of being processed at the 1st level. The fourth parameter then selects the level of collation:

  • 0: default level (usually 3rd level)
  • 1: 1st level, only base characters are compared, accents, upper/lowercases (and punctuation, if the third parameter is true) are ignored
  • 2: 2nd level, accents and modifications of characters are taken into account
  • 3: 3rd level, upper/lowercase is evaluated
  • 4: 4th level, other differences are evaluated
  • 5: 5th level, strings are only equal if they contain the same codepoints (after normalization)

The fifth parameter enables numeric sorting, if it is set to true.
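Putting the five parameters together, a query ordering a hypothetical glossary table by Tibetan collation might look like the following JDBC sketch (table and column names are invented for illustration; this assumes pg_collkey is installed and that the bo_CN locale actually carries a Tibetan tailoring, which, per the caveats above, may not be the case):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PgCollkeyExample {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "password");

            // Order rows by the ICU sort key: locale bo_CN, punctuation not
            // shifted (3rd param false), full collation level (4th param 4),
            // numeric sorting off (5th param false). The table "glossary" and
            // column "term" are hypothetical.
            String sql = "SELECT term FROM glossary "
                    + "ORDER BY collkey(term, 'bo_CN', false, 4, false)";

            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("term"));
                }
            }
            conn.close();
        }
    }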

It is not clear what one does with the third and fourth parameters for Tibetan, and how that might impact the results. Could this be necessary for Tibetan syllable-based sorting, which does require multilevel collation?

Pg_collkey basically uses the following C functions of the ICU library to do its job (a rough ICU4J equivalent is sketched after the list):

  • ucol_setAttribute with UCOL_NORMALIZATION_MODE set to UCOL_ON (normalize strings first)
  • ucol_setAttribute with UCOL_ALTERNATE_HANDLING set to
    • UCOL_NON_IGNORABLE (if 3rd param of the collkey func. call is false)
    • or UCOL_SHIFTED (if 3rd param of the collkey func. call is true)
  • ucol_setAttribute with UCOL_STRENGTH set to UCOL_DEFAULT, UCOL_PRIMARY, UCOL_SECONDARY, UCOL_TERTIARY, UCOL_QUATERNARY or UCOL_IDENTICAL, depending on the 4th parameter of the collkey function call being 0, 1, 2, 3, 4 or 5 respectively
  • ucol_setAttribute with UCOL_NUMERIC_COLLATION set to
    • UCOL_OFF (if 5th param of the collkey function call is false)
    • or UCOL_ON (if 5th param of the collkey function call is true)
  • ucol_getSortKey, which actually creates the sort key
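For reference, a rough ICU4J equivalent of these attribute settings (a sketch of the same logic, not code from pg_collkey itself):

    import com.ibm.icu.text.CollationKey;
    import com.ibm.icu.text.Collator;
    import com.ibm.icu.text.RuleBasedCollator;
    import com.ibm.icu.util.ULocale;

    public class CollkeyEquivalent {
        // Mirrors the ucol_setAttribute calls listed above, in ICU4J terms
        static byte[] sortKey(String text, ULocale locale,
                              boolean shiftPunctuation, int level, boolean numeric) {
            // getInstance returns a RuleBasedCollator in practice
            RuleBasedCollator col =
                    (RuleBasedCollator) Collator.getInstance(locale);

            // UCOL_NORMALIZATION_MODE = UCOL_ON (normalize strings first)
            col.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

            // UCOL_ALTERNATE_HANDLING = UCOL_SHIFTED or UCOL_NON_IGNORABLE
            col.setAlternateHandlingShifted(shiftPunctuation);

            // UCOL_STRENGTH, per the 4th parameter (0 = default)
            switch (level) {
                case 1: col.setStrength(Collator.PRIMARY); break;
                case 2: col.setStrength(Collator.SECONDARY); break;
                case 3: col.setStrength(Collator.TERTIARY); break;
                case 4: col.setStrength(Collator.QUATERNARY); break;
                case 5: col.setStrength(Collator.IDENTICAL); break;
                default: break; // keep the collator's default strength
            }

            // UCOL_NUMERIC_COLLATION, per the 5th parameter
            col.setNumericCollation(numeric);

            // ucol_getSortKey: binary keys whose byte-wise order is the
            // collation order
            CollationKey key = col.getCollationKey(text);
            return key.toByteArray();
        }
    }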

The attributes for collations in the ICU library are described here: external link: http://icu-project.org/apiref/icu4c/ucol_8h.html#583fbe7fc4a850e2fcc692e766d2826c

You should try to select the UCOL_QUATERNARY or UCOL_IDENTICAL collation level. This is done by passing the level number 4 or 5 as the fourth parameter to the collkey function call, as described in the README file of the pg_collkey package:

The fourth parameter selects the level of collation; if you select level 4, it will ask the ICU library to use all known collation rules to determine the sorting order. Selecting level 5 will make the ICU library use even differences for which there are no explicit collation rules defined (necessary if you want two strings to be considered equal only when they consist of the same code-points after normalization).

pg_collkey v0.2 is not compatible with PostgreSQL 8.3, as the API changed slightly. The creator will try to publish an update fixing that issue soon (December 2008).

Discussion of ICU & Postgres

external link: http://wiki.postgresql.org/wiki/Todo:ICU has a discussion of using ICU with Postgres for collation, last modified in April 2008.

Currently PostgreSQL relies on the underlying operating system to provide collation support. This means that each platform has a slightly different way of doing collation. Some, like the BSDs, do not support UTF-8 collation. Some, like glibc, have fairly complete collation support. A patch for ICU support has existed for several years in FreeBSD ports (external link: http://people.freebsd.org/~girgen/postgresql-icu/readme.html). It only covers a small part of what is necessary.

Basically, the collation support provided by ICU needs to be used either instead of the current system support, or in addition to it. Given that client programs will probably be using the system collation, at the very least it should be an option.

ICU supports a very flexible scheme for collation. In addition to providing standard collations for many languages, it also allows users to customize collations and create their own. This raises the question of how to represent the collation to the server.

The native format of ICU is UTF-16 (not to be confused with UCS-2 which doesn't support all of Unicode). Now, PostgreSQL doesn't support this encoding at all (for various reasons) and we do want to avoid the overhead of converting every string to UTF-16 before comparison.

Fortunately ICU provides an alternative: iterators. An iterator is configured over a string and returns one character at a time. What will need to happen for non-Unicode encodings is that they will have an iterator that converts the characters one at a time for the collation. PostgreSQL already contains all the necessary tables to do this.

For large-scale sorting ICU recommends converting the strings to sort keys, which can be compared with just memcmp(). This is the same mechanism as in POSIX with strxfrm(). However, currently PostgreSQL has no way to store these sort keys.
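In ICU4J terms, the sort-key technique looks roughly like this (a sketch; the byte arrays are compared byte-wise, the equivalent of memcmp()):

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.util.ULocale;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    public class SortKeyExample {
        public static void main(String[] args) {
            Collator col = Collator.getInstance(new ULocale("dz"));

            List<String> words = Arrays.asList(
                    "\u0F51\u0F58", "\u0F51\u0F42\u0F7A", "\u0F51\u0F56\u0F44");

            // Derive a binary sort key per string and compare the keys
            // byte-wise - the strxfrm()/memcmp() pattern. For truly
            // large-scale sorting the keys would be computed once and stored.
            words.sort(Comparator.comparing(
                    (String w) -> col.getCollationKey(w).toByteArray(),
                    Arrays::compare));

            System.out.println(words);
        }
    }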

This topic has been discussed on and off for many years, most recently in the attached JS-Kit discussion. The objections appear to be:

  • ICU is a large library
  • ICU prefers to use Unicode internally, though it does not require strings to be stored in Unicode
  • It won't match the results returned by sort on the command-line

The pros are:

  • ICU is cross-platform; it supports at least the platforms PostgreSQL does.
  • ICU is generally faster than the system collation support (glibc)
  • ICU is more flexible, allowing users to create custom collators
  • Sorting will match Java sorting

No research has been done on the effect of collation for non-UTF-8 locales. In general any encoding will require some iterator for collation; only UTF-8 has a built-in iterator.

It should be noted that this is not a blocker for SQL COLLATE support. Using just setlocale() and strcoll_l (supported on Windows, glibc and MacOSX) we can get excellent performance on all the commonly used platforms. However, this does leave BSD/MacOSX users requiring Unicode collation out in the cold. (MacOSX has support for Unicode collation internally, but does not support it in any encoding other than UTF-16. Hence it isn't available in the C library.)