Creating An XML Document In Tibetan

Contributor(s):

Marking up a Tibetan Text

While one can create a new XML document from a template in the XML editor itself, Tibetan texts often have to go through several steps to convert an .rtf or text document into an XML file. The instructions that follow detail how to convert a Tibetan text in a Word document, set in the Tibetan Machine Web font, into a valid XML file using the THDL mark-up scheme. These instructions are written in terms of our primary, recommended editor, XMLmind. However, the process would be similar with most XML editors. This process has three parts: 1) converting the Tibetan script to Wylie, 2) adding lineation mark-up and removing paragraph breaks, and 3) creating the XML file.

Converting Files from Tibetan Script to Wylie

One needs to have the Jskad Java program installed on one’s machine; alternatively, it can be run on the web.

  • First, the Tibetan script text must be in rich text format (rtf). If it was created in Jskad, it should already be in rtf. If, however, it was created in Word, it is by default in Microsoft's DOC format. In that case, open it in Word, choose "Save as" from the File menu, and specify the file type as "rtf". That will yield an rtf document that Jskad can open. Unfortunately, it appears that Microsoft has altered the rtf format across versions of Word. Thus, depending on which version you are using (Word 2002 is one of the problematic versions), Jskad may have trouble with the resulting rtf. We are still trying to document exactly which versions of Word produce problematic rtf. Conversion of Tibetan script to Wylie cannot be done within Word, since our Wylie-Word program only converts Wylie to Tibetan script, not Tibetan script to Wylie.
  • Once you have an rtf version, open it from within Jskad. Jskad, in contrast to Wylie Word, converts in both directions, at least in theory: from Wylie to Tibetan and from Tibetan to Wylie. Once you open the document, examine it for potential problems. Unfortunately, the Microsoft rtf problem may result in curly braces ({ and }) and backslashes in the Tahoma font scattered throughout the text. This is a result of Jskad misinterpreting the font codes for these characters, possibly because of the idiosyncratic nature of the font. When converted to TibetanMachineWeb, the braces become what they should be, respectively the dreng bu (e) and na ro (o). The backslash, when converted to TibetanMachineWeb1, becomes the stack sgy. If you see this problem, you must fix it before proceeding with the Jskad conversion.
  • The problem with curly braces and back-slashes is dealt with in the following way:

  1. Select the whole text in Jskad and copy it into a new Word document.
  2. Find out the name of the font that the curly braces and back-slashes are appearing in. On my computer, it is “Tahoma.”
  3. Press Ctrl+H, or choose “Replace” from the Edit menu.
  4. Press the “More” button to display more search/replace options.
  5. Type “{” in the “Find what:” box.
  6. At the bottom, click on Format. A pop-up menu will appear. Choose Font.
  7. Choose the font of the curly braces, e.g., “Tahoma”. The font name will be displayed under the text box.
  8. Type the same character “{” in the “Replace with” box.
  9. Click on Format > Font again.
  10. Choose TibetanMachineWeb.
  11. Click “Replace all”.
  12. Repeat steps 2-11 with the close brace “}”.
  13. Repeat steps 2-11 with the backslash “\”, except choose “TibetanMachineWeb1” for the “Replace with” font.
  14. Cut all the text and paste it back into Jskad.
  15. Convert the Tibetan to Wylie by selecting all and choosing Tools > Convert Tibetan to Wylie.

Using the lineator macro

For accurate referencing of the digital Tibetan text, there needs to be lineation. However, in a digital context where pages can scroll almost infinitely and line length (i.e., screen width) is variable, determining a standard measurement for a line is difficult. We have decided to consider a line to be a shad-delimited line with 100 shad-delimited lines per digital page. A macro found in Lineator.doc has been created to automatically insert this mark-up prior to pasting the Wylie into an XML document.

The details of lineation

To mark up and number the shad-delimited lines in a fixed manner, the <seg> element has been chosen. It must have a type attribute set to "shad". The TEI guidelines say, “The <seg> element may be used at the encoder's discretion to mark any segments of the text of interest for processing.” These elements provide the most flexibility, because they can have as their children any of the following: sentence-level (<s>) elements, phrase-level elements (<cl>, <phr>), quotations (<q>, <quote>), and other <seg> elements (of a different type). The <seg> elements will also be numbered using their n attribute. The value of the n attribute will be "page.line", a page being 100 lines long (i.e., containing lines 1 to 100). Thus, the 253rd shad-delimited line would have <seg n="2.53" type="shad"> … </seg> for its mark-up.
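
The numbering rule can be sketched in a few lines of Python. This is only an illustration, not the Lineator macro itself: the function name and the first_page parameter are hypothetical. The text above does not say whether the first page is numbered 0 or 1; the default of 0 is an assumption chosen because it reproduces the example of the 253rd line receiving n="2.53".

```python
def seg_n(line_index: int, lines_per_page: int = 100, first_page: int = 0) -> str:
    """Return the "page.line" value for a 1-based shad-delimited line index.

    first_page=0 is an assumption: it is the numbering under which the
    253rd line comes out as "2.53", matching the example in the text.
    """
    if line_index < 1:
        raise ValueError("line_index is 1-based")
    # Lines on a page run from 1 to lines_per_page, so shift to 0-based first.
    page, line = divmod(line_index - 1, lines_per_page)
    return f"{page + first_page}.{line + 1}"


def seg_open_tag(line_index: int) -> str:
    """Build the opening <seg> tag for a given line."""
    return f'<seg n="{seg_n(line_index)}" type="shad">'


print(seg_n(253))         # 2.53
print(seg_open_tag(253))  # <seg n="2.53" type="shad">
```

If the pages turn out to be numbered from 1, passing first_page=1 gives "3.53" for the same line.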

The Lineator.doc Word document contains a macro for automatically marking up straight Wylie input with the <seg type="shad"></seg> element. It does so based on punctuation, including whitespace. Each shad-delimited line begins with the first letter of that line and ends with the whole string of punctuation that follows it, up to the first letter of the next line.
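
For readers without Word, the segmentation rule just described might be approximated as follows. This is a rough sketch and not a port of the macro: it assumes that the shad is transliterated as "/" in the Wylie text, it ignores any other punctuation marks the real macro may recognize, and it leaves out the n attribute. The function names are made up for the example.

```python
import re
from xml.sax.saxutils import escape

# A line is a run of non-shad characters, one shad, and then every
# following shad or whitespace character up to the next letter.
# The second alternative catches trailing text that has no closing shad.
SHAD_LINE = re.compile(r"[^/]+/[/\s]*|[^/]+$")


def split_shad_lines(wylie: str) -> list[str]:
    """Split Wylie text into shad-delimited lines, collapsing extra whitespace."""
    text = " ".join(wylie.split())  # also removes paragraph breaks
    return [m.group() for m in SHAD_LINE.finditer(text) if m.group().strip(" /")]


def mark_up(wylie: str) -> str:
    """Wrap each shad-delimited line in a <seg type="shad"> element."""
    return "".join(
        f'<seg type="shad">{escape(line)}</seg>' for line in split_shad_lines(wylie)
    )


sample = "bla ma la phyag 'tshal lo//  sangs rgyas la/\nchos la/"
for line in split_shad_lines(sample):
    print(repr(line))
print(mark_up("ka/ kha/"))
```

The trailing space after each shad is kept inside the element so that joining the elements back together does not run the lines into each other.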

How to use the macro

To use the Lineator macro, follow these steps:

  1. Open both the original Wylie text document and Lineator.doc, making sure that macros are enabled (under Tools > Macros > Security, the level must be either medium or low).
  2. Copy the Wylie text and paste it into Lineator.doc.
  3. Press Alt+C or run the “FindWylieClauses” macro associated with Lineator.doc.
  4. The macro will add the lineation elements and delete any extra white-space.

This file can be saved as a separate file, or one can proceed directly to the third part of the process, creating the XML document.

Creating XML document

Once the lineation is done, creating the XML document requires a few steps. However, each step is relatively straightforward. The following is described in terms of XMLmind.

  1. Open XMLmind
  2. Choose New
  3. Under Tibetan Text, choose Template
  4. Save the new document in the desired folder, under the name of the document you are creating.
  5. Close the document and open it in a text editor such as Notepad.
  6. Copy the lineated text from the step above and paste it into the new XML document that has been opened in the text editor. It should go between the paragraph tags (<p> and </p>) in the body section, replacing the comment there. That section is: <body> <div> <p> <!-- Body portion of text (required) --> </p> </div> </body>
  7. Save the document and close the text editor.
  8. Open the document in XMLmind and adjust the view as per its Use instructions.
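
The pasting step could also be scripted. The following is a minimal sketch under stated assumptions: TEMPLATE is a made-up stand-in, not the actual THDL template (which contains much more than a body section), and the placeholder is assumed to be a comment reading "Body portion of text (required)". The script swaps the comment for the lineated text and checks only that the result is well-formed XML; it does not validate against the THDL mark-up scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the template; the real file has more around it.
TEMPLATE = """<text>
  <body>
    <div>
      <p><!-- Body portion of text (required) --></p>
    </div>
  </body>
</text>"""

# Assumed wording of the comment that marks where the text belongs.
PLACEHOLDER = "<!-- Body portion of text (required) -->"


def insert_lineated_text(template: str, lineated: str) -> str:
    """Replace the body placeholder comment with the lineated <seg> mark-up."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("expected exactly one body placeholder comment")
    document = template.replace(PLACEHOLDER, lineated)
    ET.fromstring(document)  # raises ET.ParseError if not well-formed
    return document


# Sample lineated text; the n attributes are left out for brevity.
lineated = '<seg type="shad">ka/ </seg><seg type="shad">kha/</seg>'
print(insert_lineated_text(TEMPLATE, lineated))
```

Checking well-formedness before opening the file in XMLmind catches a stray or unclosed tag early, when it is still easy to see which pasted line caused it.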

Provided for unrestricted use by the Tibetan and Himalayan Digital Library