Language Markup In XML Documents

All elements in THL XML documents have the universal "lang" attribute. The lang attribute can be set to one of the standard language codes defined by ISO 639. (The easiest place to look up these codes is the SIL site, using the codes in its 639-2/639-5 column.) The definitive standard is found at the U.S. Library of Congress (though at the time of this writing, the LOC site is down). Within this standard, we use the three-letter code, and when two such codes exist for a language we use the latter, which is the code set for bibliographic applications (ISO 639-2/B). Thus, for Tibetan we use "tib" instead of "bod" or "bo", and for Burmese we use "bur" instead of "mya" or "my". The permitted values for the lang attribute under the xtib3.dtd and xtibbibl3.dtd are fixed in the XML markup and in the DTD, respectively. The presently accepted values can be seen at the ISO Language Codes Used in THL XML Markup page.

The default language when the lang attribute is not set is English. The "eng" value is used when English is inserted within a larger string of another language.

These codes are used in any element in XML markup to set the language for that element and all children of that element, both text and sub-elements. A simple example of this markup is for the place name “Choné”:

<placeName corresp="entry1" lang="tib" n="Choné">co ne</placeName>

Since this standard lang="tib" markup is used for Wylie, an additional attribute, rend="tib", is added to mark up Unicode Tibetan, as in:

<title lang="tib" rend="tib" level="a">སྐྱིད་ཤོད་སྡེ་པའི་སྐོར།</title>

A more complicated example is when a larger, “container” element is marked as one language and contains children that are marked as a different language. This occurs frequently in Tibetan passages that have English headers inserted by the editor, such as in the transliteration in Tony Huber’s article “A Critical Edition of the Guidebook to Lapchi”:

<div1 id="b2" lang="tib" rend="tib">
  <milestone unit="page" ed="jiats"/>
  <head lang="eng">Critical Edition</head>
  <p>
    <milestone unit="page" n="[1a]" ed="laphyi"/>།། གསང་ལམ་སྒྲུབ་པའི་<note><phr lang="eng">R omits</phr>
    གསང་ལམ་སྒྲུབ་པའི་</note>གནས་ཆེན་ཉེར་བཞིའི་ཡ་གྱལ་གཽ་དཱ་ཝ་རི་འམ།འབྲོག་ལ་ཕྱི་གངས་ཀྱི་ར་བའི་སྔོན་བྱུང་གི་ཚུལ་ལས་བརྩམས་<note><phr lang="eng">R:</phr>
    རིམ་<phr lang="eng">, D:</phr> ཙམ་</note>པའི་གཏམ་གྱི་རབ་ཏུ་ཕྱེད་<note><phr lang="eng">G: </phr>བྱེད་</note>པ་ཉུང་ངུ་<note>
    <phr lang="eng">R: </phr>དུ་</note>རྣམ་གསལ་ཞེ་བྱ་བ་བཞུགས་སོ། ། </p>
  ....
</div1>

Here, the “container” element, div1, contains Tibetan in Unicode script, so its lang and rend attributes are set to "tib". Phrases in English (in this case within critical notes) then have their lang attribute set to "eng" to override the language (lang="tib" rend="tib") inherited from the parent <div1> element.
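
The inheritance rule described here can be sketched in a few lines of Python. This is only an illustration, not part of the THL toolchain: the function name effective_langs and the simplified sample (with placeholder text in place of the Tibetan passage) are assumptions made for the example, and the English default is taken from the statement above that an unset lang attribute means English.

```python
import xml.etree.ElementTree as ET

# Assumption from the text above: English applies when lang is not set.
DEFAULT_LANG = "eng"


def effective_langs(elem, inherited=DEFAULT_LANG):
    """Yield (tag, effective language) for elem and its descendants.

    An element's own lang attribute wins; otherwise it inherits the
    language of its parent.
    """
    lang = elem.get("lang", inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_langs(child, lang)


# Simplified, hypothetical stand-in for the <div1> example.
SAMPLE = """<div1 id="b2" lang="tib" rend="tib">
  <head lang="eng">Critical Edition</head>
  <p>placeholder <note><phr lang="eng">R omits</phr> placeholder</note></p>
</div1>"""

resolved = list(effective_langs(ET.fromstring(SAMPLE)))
for tag, lang in resolved:
    print(tag, lang)
```

In this sample, head and phr resolve to "eng" because they set the attribute themselves, while p and note resolve to "tib" because they inherit it from div1.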

Language Definition In THL Markup

Defining Language Codes through Markup: Language Definition in XTIB3.DTD

The language codes used in THL markup come from a “fixed” list of values defined differently for the xtib3.dtd for general textual markup and the xtibbibl3.dtd for catalog records. The definition of the language codes, in fact, is the key thing that differentiates those two forms of the DTD. Except for the way the language codes are defined, the two DTDs are identical. Any changes made to one should be made to the other, including the adding of language codes.

When a code is added to the list of usable codes in THL markup, the first step is to look up the proper code at the SIL page on ISO codes. The code to use is the three-letter code in the “639-2/639-5” column, and if two of those codes are present, the second one, which is for bibliographic references, is used. Thus, for Tibetan it is "tib" instead of "bod".
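
As a rough illustration of this rule, the following Python sketch maps a few terminologic (ISO 639-2/T) and two-letter (ISO 639-1) codes to the bibliographic codes used in THL markup. The table is deliberately partial and the function name to_thl_code is hypothetical; the SIL page remains the place to look up a code.

```python
# Partial table of ISO 639-2 terminologic (T) codes and their
# bibliographic (B) equivalents. Only languages that have two
# three-letter codes need an entry.
T_TO_B = {
    "bod": "tib",  # Tibetan
    "mya": "bur",  # Burmese
    "zho": "chi",  # Chinese
    "fra": "fre",  # French
    "deu": "ger",  # German
    "fas": "per",  # Persian
}

# The two-letter codes mentioned in the text above.
TWO_LETTER_TO_B = {
    "bo": "tib",
    "my": "bur",
}


def to_thl_code(code):
    """Return the bibliographic three-letter code for a known variant.

    Codes that are not in either table are returned unchanged, so this
    helper does not check whether a code is actually accepted by THL.
    """
    code = code.strip().lower()
    if code in T_TO_B:
        return T_TO_B[code]
    return TWO_LETTER_TO_B.get(code, code)


print(to_thl_code("bod"))
print(to_thl_code("my"))
print(to_thl_code("eng"))
```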

For “standard” TEI-type documents marked up with the xtib3.dtd (or later versions of that), the language codes are defined in the TEI header section of the document. The TEI header contains an element called profileDesc within which there is a langUsage element. All the languages used in the document are then listed in the langUsage element. The resulting markup in abbreviated form looks like:

<profileDesc>
	<langUsage>
		<language id="chi">Chinese</language>
		<language id="eng">English</language>
		<language id="tib">Tibetan</language>
	</langUsage>
</profileDesc>

The language definition consists of a <language> tag whose text contains the name of the language (usually in English, but in systems based on other languages it could be in the native language) and whose id attribute is the three-letter ISO 639-2 code.
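
One way to see how the declared codes relate to the markup is to compare the lang values used in a document with the ids listed in its langUsage element. The Python sketch below does this for a small, simplified document; the function names are hypothetical, the sample is not a complete TEI file, and it assumes the langUsage list is written inline in the header rather than pulled in through an entity.

```python
import xml.etree.ElementTree as ET


def declared_codes(root):
    """Collect the id of every <language> element in the document."""
    return {el.get("id") for el in root.iter("language") if el.get("id")}


def used_codes(root):
    """Collect every lang attribute value that appears in the document."""
    return {el.get("lang") for el in root.iter() if el.get("lang")}


def undeclared(root):
    """Return the lang values that have no matching <language id=...>."""
    return sorted(used_codes(root) - declared_codes(root))


# Simplified, hypothetical document for illustration only.
SAMPLE = """<TEI.2>
  <teiHeader>
    <profileDesc>
      <langUsage>
        <language id="chi">Chinese</language>
        <language id="eng">English</language>
        <language id="tib">Tibetan</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <text lang="tib">
    <head lang="eng">Heading</head>
    <p>co ne <phr lang="san">placeholder</phr></p>
  </text>
</TEI.2>"""

missing = undeclared(ET.fromstring(SAMPLE))
print(missing)
```

Here "san" is reported because the sample uses it on a phr element without listing it in langUsage.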

In THL XML documents the list is longer and is defined separately from the XML document. This allows there to be a single, all-encompassing list of languages used within THL in one or two places that can be used by all THL documents. Thus, the DOCTYPE declaration of every THL document contains an entity that includes the DTD definitions for external links, as follows for this JIATS article:

<!DOCTYPE TEI.2 SYSTEM "../../../xml/dtds/xtib3.dtd" [
	<!ENTITY % extlinks SYSTEM "../../../essays/xml/external-links.dtd" >
		%extlinks;
  	<!ENTITY % intlinks SYSTEM "../../../essays/xml/internal-links.dtd" >
		%intlinks;
	<!ENTITY glossary SYSTEM "../glossaries/sweet-gloss.xml">
]>

This <!ENTITY …> declaration is the definition, and the %extlinks; reference includes the file at that location. Within that file (/texts/essays/xml/external-links.dtd), there is another entity declaration, for the THDL profile description:

<!ENTITY thdlprofiledesc SYSTEM "thdlprofiledesc.xml">

which is found at the same location (/texts/essays/xml/thdlprofiledesc.xml). Within that file the full list of languages is defined. This list is included in each THL XML document through calling the entity. It should be called at the very end of the <teiHeader> element:

  </fileDesc>
  &thdlprofiledesc;
</teiHeader>

To completely override the THL predefined languages and codes, one could write a custom <profileDesc> element and place it in this location instead of the entity call "&thdlprofiledesc;".

When a new language is added, it should be added to that list and recorded on the ISO Language Codes Used in THL XML Markup page. The file at /texts/essays/xml/thdlprofiledesc.xml can also be viewed online through the thlprofiledesc.xml short-cut at the root level of the texts cocoon app, such as http://dev.texts.thlib.org/thlprofiledesc.xml. This points to the same file, so the change only needs to be made once.

Defining Language Codes through the DTD: Language Definition in XTIBBIBL3.DTD

The only difference between the xtib3.dtd and the xtibbibl3.dtd is how the language codes are defined. In the xtibbibl3.dtd, instead of defining the codes in the teiHeader using the <langUsage> element, the codes are defined in the DTD itself. In DTDs, elements are defined first with an <!ELEMENT statement and their attributes are then defined with a matching <!ATTLIST statement. Within the latter there is a definition for the lang attribute, and this contains the list of language codes. Thus for the root “TEI.2” element, the definition is:

<!ELEMENT TEI.2 
	(teiHeader, text) >


<!ATTLIST TEI.2 
	corresp IDREFS #IMPLIED
	synch IDREFS #IMPLIED
	sameAs IDREF #IMPLIED
	copyOf IDREF #IMPLIED
	next IDREF #IMPLIED
	prev IDREF #IMPLIED
	exclude IDREFS #IMPLIED
	select IDREFS #IMPLIED
	ana IDREFS #IMPLIED
	id ID #IMPLIED
	n CDATA #IMPLIED
	lang (ara | bur | chi | dzo | eng | fre | ger | hin | ind | ita | jpn | kor | lao | lat | mal | mon | nep | pli | pan | per | pol | rus | san | sin | spa | tam | tel | tha | tib | tur | urd | vie) #IMPLIED
	rend CDATA #IMPLIED
	TEIform CDATA "TEI.2" >
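
To make the effect of this enumeration concrete, the following Python sketch reads the list of codes out of an ATTLIST fragment and reports any lang values in a record that are not in it. It is a plain text check, not DTD validation, and it is not part of THL's tooling; the function names and the sample record are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Shortened ATTLIST fragment using the lang enumeration shown above.
DTD_FRAGMENT = """<!ATTLIST TEI.2
	n CDATA #IMPLIED
	lang (ara | bur | chi | dzo | eng | fre | ger | hin | ind | ita | jpn | kor | lao | lat | mal | mon | nep | pli | pan | per | pol | rus | san | sin | spa | tam | tel | tha | tib | tur | urd | vie) #IMPLIED
	rend CDATA #IMPLIED >"""


def allowed_lang_codes(dtd_text):
    """Return the codes in the first lang (...) enumeration found."""
    match = re.search(r"\blang\s*\(([^)]*)\)", dtd_text)
    if match is None:
        return set()
    return {code.strip() for code in match.group(1).split("|")}


def invalid_langs(xml_text, allowed):
    """Return lang values used in xml_text that are not in allowed."""
    root = ET.fromstring(xml_text)
    used = {el.get("lang") for el in root.iter() if el.get("lang")}
    return sorted(used - allowed)


# Hypothetical, minimal record for illustration only.
RECORD = (
    '<tibbibl lang="tib">'
    '<title lang="eng">A title</title>'
    '<note lang="bod">a note</note>'
    "</tibbibl>"
)

allowed = allowed_lang_codes(DTD_FRAGMENT)
problems = invalid_langs(RECORD, allowed)
print(problems)
```

The record's use of "bod" is flagged because only the bibliographic code "tib" appears in the enumeration.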

So, to add a language code one needs to edit the DTD, which is found at /texts/xml/dtds/xtibbibl3.dtd. Unless one is also adding or modifying other elements or attributes, there is no need to change the xtib3.dtd. (Note: The xtib2.dtd and xtibbibl2.dtd are older versions of the DTD and should not be changed. If you find a document that uses one of these DTDs, it should be updated to use either the xtib3.dtd or xtibbibl3.dtd.) To add a language:

  1. Open the /texts/xml/dtds/xtibbibl3.dtd file in Oxygen, JEdit, or a comparable editor.
  2. Copy the "lang (ara | bur | chi | … urd | vie) #IMPLIED" line and paste it into the "Search for:" text box of the Search/Replace window.
  3. Paste the same line into the "Replace with:" text box of the Search/Replace window.
  4. Edit the line in the "Replace with:" text box to add the new language code.
  5. Click the "Replace All" button and confirm any alert boxes.
  6. Save the file.
  7. Test the file:
    1. Open a local version of a tibbibl catalog record.
    2. Uncomment the DocType Declaration statement: <!DOCTYPE tibbibl SYSTEM "http://texts.thlib.org/catalogs/xtibbibl3.dtd">
    3. Change the system location to the local location of the newly edited DTD, such as: <!DOCTYPE tibbibl SYSTEM "C:\wamp\tomcat\webapps\cocoon\texts\xml\dtds\xtibbibl3.dtd">
    4. Validate the document. If it validates, the change is OK.
    5. Revert the DocType Declaration to its original form and comment it back out. (This can be done by using the Revert option in Subversion, which will just re-download the file from the repository.)
  8. Commit the file to the repository.
  9. Do this first on Development, then propagate to staging and production.
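
The search-and-replace in steps 2 to 5 could also be scripted. The Python sketch below is an alternative to the editor workflow and not the procedure THL uses: add_lang_code is a hypothetical helper, the DTD text is a shortened stand-in for xtibbibl3.dtd, and "kas" (the ISO 639-2 code for Kashmiri) is chosen purely as an example. It appends the new code to every lang enumeration it finds and leaves enumerations that already contain the code as they are.

```python
import re

# Matches "lang (a | b | c) #IMPLIED" and captures the list of codes.
LANG_ENUM = re.compile(r"(\blang\s*\()([^)]*)(\)\s*#IMPLIED)")


def add_lang_code(dtd_text, new_code):
    """Append new_code to every lang enumeration in dtd_text.

    Returns (updated_text, number_of_enumerations_found).
    """

    def extend(match):
        codes = [code.strip() for code in match.group(2).split("|")]
        if new_code not in codes:
            codes.append(new_code)
        return match.group(1) + " | ".join(codes) + match.group(3)

    return LANG_ENUM.subn(extend, dtd_text)


# Shortened, hypothetical DTD text with two attribute lists.
DTD_TEXT = """<!ATTLIST TEI.2
	n CDATA #IMPLIED
	lang (eng | tib) #IMPLIED
	rend CDATA #IMPLIED >
<!ATTLIST p
	lang (eng | tib) #IMPLIED >"""

updated, found = add_lang_code(DTD_TEXT, "kas")
print(found)
print(updated)
```

Running the function a second time on its own output changes nothing, which makes it safe to re-run. The edited DTD would still need to be validated against a catalog record as in step 7.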

Provided for unrestricted use by the Tibetan and Himalayan Library