Assessment of Abbyy software for OCRing Romanized Tibetan with Diacritics

Contributor(s): THL Staff

Assessment of Abbyy software for OCRing romanized Tibetan text with diacritics:

It is hard to say without knowing the internals of Abbyy's OCR engine. If text recognition uses a language model to make decisions about, or "smooth over", unusual strings of letters, then it is unlikely to do well with Romanized Tibetan, since none of the unusual Romanized Tibetan words will be part of the model. In that case, training Abbyy with a special-purpose list of Romanized Tibetan words may be useful, assuming that is possible. If instead the OCR recognizes words letter by letter and doesn't attempt to coerce Romanized Tibetan syllables into English words, it should do OK. The problem then is making sure Abbyy is aware of the diacritic letters you are using.
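One way to find out which diacritic letters the OCR needs to know about is to count them in a sample of already-typed Romanized Tibetan. The following is only a rough sketch in Python and has nothing to do with Abbyy itself: the function name `diacritic_inventory` and the sample string are made up for illustration. It normalizes the text to Unicode NFC first, so that a base letter plus a combining mark is counted as the single precomposed letter where one exists.

```python
import unicodedata
from collections import Counter


def diacritic_inventory(text):
    """Count every non-ASCII letter or combining mark in the text.

    The text is normalized to NFC so that "n" + COMBINING TILDE is
    counted as the single letter "ñ" rather than as two characters.
    """
    normalized = unicodedata.normalize("NFC", text)
    return Counter(
        ch
        for ch in normalized
        if ord(ch) > 127
        and unicodedata.category(ch).startswith(("L", "M"))
    )


if __name__ == "__main__":
    # Hypothetical sample; replace with text from your own project.
    sample = "saṅs rgyas ñi ma źi ba śes rab"
    for ch, count in sorted(diacritic_inventory(sample).items()):
        name = unicodedata.name(ch, "UNKNOWN")
        print(f"{ch}  U+{ord(ch):04X}  {name}  x{count}")
```

The resulting list can then be compared by hand against whatever alphabet the chosen recognition language covers.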

Unfortunately, I don't have much experience with how Abbyy OCR works. It appears to be pretty customizable and could likely accommodate Romanized Tibetan with a little help.

I would suggest that Jeremy try to present Abbyy with a specialized Romanized Tibetan vocabulary, complete with diacritics, if he can. (THL doesn't have anything like this, right?) Beyond this, Jeremy may want to check whether Abbyy has different language settings. It may be possible to run recognition using the settings for another language that uses diacritics.
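If no such vocabulary exists, one could probably be assembled from Romanized Tibetan text that has already been typed in. The sketch below is a guess at how that might be done in Python. The names `build_vocabulary` and `min_count` are hypothetical, and whether Abbyy can import a word list, and in what format, is an assumption that would need to be checked. Apostrophes are not treated as part of a word here, so that would need adjusting for romanizations that use them.

```python
import re
import unicodedata
from collections import Counter

# A word is a run of Unicode letters, each optionally followed by
# combining marks that have no precomposed form after NFC.
WORD_RE = re.compile(r"(?:[^\W\d_][\u0300-\u036F]*)+")


def build_vocabulary(text, min_count=1):
    """Return the sorted, lowercased unique words seen at least min_count times."""
    normalized = unicodedata.normalize("NFC", text)
    counts = Counter(m.group(0).lower() for m in WORD_RE.finditer(normalized))
    return sorted(word for word, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    # Hypothetical sample; in practice, read a larger UTF-8 text file.
    sample = "Saṅs rgyas saṅs rgyas ñi ma"
    for word in build_vocabulary(sample):
        print(word)
```

Raising `min_count` drops words that occur only once or twice, which may help keep typing mistakes in the source text out of the list.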

Provided for unrestricted use by the Tibetan and Himalayan Library