Overview Of Tibetan Text Processing

Contributor(s): David Germano.

This page provides an overview of the issues involved in working with Tibetan texts in a digital context. It is discursive and is not intended to be a manual. For manuals, go to Tibetan Texts and explore the specific documentation. This page is intended instead to orient people to the issues pertaining to working with Tibetan texts. The sections follow the typical sequence in which these tasks would be addressed if one were dealing comprehensively with a given text or body of texts.

Cataloging a Tibetan Text

The first step is to catalog a Tibetan text, creating a catalog or bibliographical record for it. This entails documenting its various titles, the various agents involved with the text (author, translator, editor, etc.), its pagination, and so forth. Such cataloging can be done at various levels: shallow, deep, or somewhere in between. The most basic or shallow cataloging records, for example, might simply record a title, an author, and pagination/volume information; the most detailed cataloging records will record all variants of titles, input the colophon and comprehensively analyze it, and also input all chapter titles, among other things. Another important aspect of cataloging is the assignment of a unique identification code, or "siglum" (plural: sigla), to the text for shorthand reference purposes. Such cataloging is especially helpful for anthologies, compilations, and canons, which involve multiple texts in a single volume, often with different authors, and for which available catalogs may only provide a brief reference for the collection as a whole.
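The difference between shallow and deep records, and the role of the siglum as a unique key, can be illustrated with a small data structure. This is only a minimal sketch: the class names, field names, and the rule used to call a record "deep" are assumptions made for illustration, not THDL's actual catalog schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Agent:
    """A person involved with the text, with a role such as 'author'."""
    name: str
    role: str


@dataclass
class CatalogRecord:
    """Hypothetical catalog record; a shallow record fills only the first fields."""
    siglum: str                      # unique shorthand identifier for the text
    title: str
    pagination: str                  # free-form volume/page information
    agents: List[Agent] = field(default_factory=list)
    variant_titles: List[str] = field(default_factory=list)
    colophon: Optional[str] = None
    chapter_titles: List[str] = field(default_factory=list)

    def is_deep(self) -> bool:
        """Assumed rule: a record is 'deep' once it has a colophon and chapter titles."""
        return self.colophon is not None and bool(self.chapter_titles)


class Catalog:
    """Holds records keyed by siglum and refuses duplicate sigla."""

    def __init__(self) -> None:
        self._records: Dict[str, CatalogRecord] = {}

    def add(self, record: CatalogRecord) -> None:
        if record.siglum in self._records:
            raise ValueError(f"duplicate siglum: {record.siglum}")
        self._records[record.siglum] = record

    def get(self, siglum: str) -> CatalogRecord:
        return self._records[siglum]

    def search_title(self, text: str) -> List[CatalogRecord]:
        """Case-insensitive substring match over main and variant titles."""
        needle = text.lower()
        return [
            r for r in self._records.values()
            if any(needle in t.lower() for t in [r.title, *r.variant_titles])
        ]
```

Keying the catalog on the siglum is what makes shorthand references unambiguous: two texts in the same anthology can share a title, but not a siglum.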

Once a text has been cataloged, others can easily search a larger catalog to find the text, and from there locate physical copies and so forth. In addition, within THDL, a catalog provides the basis for establishing a "thematic research collection" devoted to publishing all relevant materials about this text. The catalog thus becomes a front end for accessing editions of the text, translations, summaries, scholarship, scans, and much more.

Documentation:

Outlining a Tibetan Text

Tibetan Buddhist and Bönpo texts are often characterized by a very clear and detailed internal outline (sa bcad), which constitutes a hierarchical scheme of sections, subsections, sub-subsections and so on that can descend ten or more subordinate levels. These outlines are in effect a table of contents, but with far more levels than most contemporary books possess. Scholars often extract these outlines as a preliminary step to working on a text, since viewing the outline by itself can be extremely useful in understanding the text's scope and structure, as well as in quickly locating items of interest. Such outlines can be generated as freestanding documents to share with others, and are also well suited for inclusion in a deep cataloging record. If the full Tibetan text is being input, the outline is also used to generate a structural markup of that text, which can then be used to display the input text with a navigable outline.
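Because an outline of this kind is a tree of arbitrary depth, it maps naturally onto a recursive data structure. The following is a minimal sketch, assuming a hypothetical `Section` class and a decimal numbering scheme chosen only for display; neither is taken from THDL's own tools.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """One node of an outline; the root node stands for the whole text."""
    title: str
    children: List["Section"] = field(default_factory=list)

    def add(self, title: str) -> "Section":
        """Append a subsection and return it so that it can be nested further."""
        child = Section(title)
        self.children.append(child)
        return child

    def depth(self) -> int:
        """Number of levels at and below this node, counting the node itself."""
        return 1 + max((c.depth() for c in self.children), default=0)


def render(section: Section, prefix: str = "") -> List[str]:
    """Return the subsections as hierarchically numbered lines, e.g. '2.1.1 Title'."""
    lines: List[str] = []
    for i, child in enumerate(section.children, start=1):
        number = f"{prefix}{i}"
        lines.append(f"{number} {child.title}")
        lines.extend(render(child, prefix=f"{number}."))
    return lines


# Placeholder outline; the section titles are illustrative only.
root = Section("Whole text")
root.add("Homage")
body = root.add("Main body")
topic = body.add("First topic")
topic.add("Sub-point")

for line in render(root):
    print(line)
```

The same tree could serve both uses described above: rendered on its own as a freestanding outline, or walked to produce nested structural markup around the input text.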

Documentation:

Outlining a Tibetan Text

Scanning a Tibetan Text

Scanning involves taking an image of a Tibetan text with a camera or scanner. The advantage of scanning is that it is far more rapid than inputting a text, and one has complete assurance as to what the original manuscript actually reads, whereas with an input text one never knows whether one might be reading typos. The disadvantage, of course, is that one cannot easily search such texts, nor format them in flexible ways. The most useful way to publish scans is to prepare a detailed table of contents of the text, so that users can access the scanned manuscript images while reading the table of contents and easily jump to particular chapters or pages within the scans. Otherwise, a scan with no such system can be cumbersome to consult, other than by printing it out to obtain one's own print copy of the text.

Documentation:

Inputting and Marking up a Tibetan Text

The most time-consuming yet powerful way to reproduce a Tibetan text is to type it into the computer word by word. One of the most important aspects of this work is proofing, since a badly input edition full of typos is a great disservice to everyone.

A separate issue is creating critical editions, in which one compares various readings and makes critical decisions as to which of a given set of variant readings is correct in each case.

In addition to inputting a text, one might also choose to add extensive markup which identifies the different structural components of the text - this is an homage, this is chapter two, this is a colophon, this is verse, this is a citation, etc. - as well as different thematic types of words - this is a text title, this is a person's name, etc. Such markup can be just as time-consuming as the original input, but it adds greatly to the utility of input texts in terms of publication formatting and searching/analysis.
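To make the distinction between structural and thematic markup concrete, here is a minimal sketch of a marked-up fragment and two simple queries over it. The element names are loosely modeled on TEI conventions and are an assumption for illustration; THDL's actual markup scheme may differ, and the sample content is placeholder text.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical sample: <div> elements carry structural markup,
# while <persName> and <title> carry thematic markup.
SAMPLE = """
<text>
  <div type="homage"><p>Homage text here.</p></div>
  <div type="chapter" n="2">
    <p>Prose mentioning <persName>Person A</persName> and
       <title>Text Title B</title>.</p>
    <lg><l>A line of verse.</l></lg>
  </div>
  <div type="colophon"><p>Composed by <persName>Person C</persName>.</p></div>
</text>
"""


def structural_types(root: ET.Element) -> List[Optional[str]]:
    """List the type of each structural division, in document order."""
    return [div.get("type") for div in root.iter("div")]


def thematic_terms(root: ET.Element, tag: str) -> List[Optional[str]]:
    """Collect the text of every element with the given thematic tag."""
    return [el.text for el in root.iter(tag)]


root = ET.fromstring(SAMPLE.strip())
print(structural_types(root))
print(thematic_terms(root, "persName"))
```

Queries like these are what the extra markup effort buys: once names and titles are tagged, an index of persons or cited texts can be generated rather than compiled by hand.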

Documentation:

Converting Input Tibetan Texts into XML

Once a volume of texts has been input, it can be converted into XML. Preferably, the texts should be marked up with Word styles prior to conversion, but this is not always required. The only required "mark-up" for conversion is the metadata table and the page and line numbers. A single text or a whole volume of texts may be converted in the same batch manner using the latest Word to XML converter. This process requires the use of a Windows machine.

Documentation:

Batch Converting Input Tibetan Texts

Summarizing a Tibetan Text

Long before anyone translates a text or analyzes it in great detail, it is relatively easy and straightforward for a knowledgeable scholar to summarize the text's contents, perhaps chapter by chapter. Such quick summaries are of great use to others and can radically open up entire corpora to scholars and non-scholars alike. In other words, an expert on a given body of texts could easily spend a few days generating reliable and clear summaries of texts, or even of chapters within texts, and these summaries can then be published and indexed from the catalog records for those texts.

Documentation:

Summaries
