Introduction To XML

Contributor(s): Than Grove, Steven Weinberger

The first section of this document describes the basic principles behind the Tibetan and Himalayan Library's implementation of XML for displaying articles, essays, and monographs on the web. The final section concerns the conversion of documents from text format into XML or the creation of new XML documents. For a description of the particular markup used in the THL, please see our page describing the THL Markup Scheme: XML Markup Manual for THL.

We are just beginning to implement the technology for marking up and displaying Tibetan texts online. These texts must be marked up in XML using THL's xtib3.dtd. Furthermore, the text itself must be in Extended Wylie transliteration (see our Jskad tool, "Download Tested Version", for its built-in converter), and it must have been lineated using the Lineator.doc template and macro, which inserts sequentially numbered clause (<seg>) elements around each shad-delimited line, with 100 such lines per digital page. The section on Using the Lineator Macro, within the section on creating an XML document, describes this process in detail.

How XML Works

XML, like HTML and RTF, is a text-based mark-up system. It uses underlying "tags", or "elements", to mark different parts of the texts in order to allow specific operations to occur on those parts: formatting, searching and so forth. Thus in a word processor, some text is marked as "bold" so it will appear in a darker font, and so forth. The advantage of XML is precisely that it is "eXtensible", namely that, unlike HTML, one can define the names and use of the elements to customize one's mark up scheme. This allows greater precision in the mark up that enhances usability of the digital document: names can be cross-referenced to a database, titles can be searched on, and so forth.

Hence, XML is far more customizable and powerful than HTML, because it allows users to define their own tags and thereby customize the markup to the type of texts one is working with and the type of tasks one is interested in. However, the use of XML is not as simple as HTML, where one simply creates a text document (.html) and posts it to a server. The extended functionality of XML requires three types of documents:

  1. DTD or Schema: the rules defining the elements, or tags, one has available for marking up text.
  2. XML Document: the actual data, i.e. the text marked up with XML tags and rendered visually by the stylesheet.
  3. XSLT Stylesheet: the rules defining how the XML document is to be transformed into HTML or another XML document.

To use XML one must first create a type of toolbox consisting of elements, attributes and rules for their application; this toolbox is called a DTD or schema. A DTD is a set of rules that defines the format, use, and scope of elements and their attributes for one's own XML-based language. These elements and attributes form the essential tags with which one marks different elements of the texts: this is a title, this is a person's name, and so forth. An alternative to the DTD is the newer XML Schema, which serves the same function but is more structured than a DTD. A DTD is a linear series of element definitions, whereas a schema is a hierarchically structured document that uses nested elements, just like an XML document, to define elements for an XML-based mark-up language. Presently, DTDs are in wider use than XML schemas. However, it is likely that in the future schemas will supplant DTDs. Both DTDs and schemas define the names of the elements, which element is the root element (1), how the elements can be nested, what attributes each element has, and what kind of data can be placed within the elements and their attributes. For more information on DTDs, see the W3Schools tutorial on DTDs.
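To make the idea of a DTD as a "linear series of element definitions" more concrete, the following minimal sketch reads the declarations of a tiny DTD using Python's standard expat bindings. The three-element DTD here is hypothetical and is not xtib3.dtd; it is written inside the document (as an internal subset) only so that the example is self-contained. Note that expat reads the declarations but does not validate the document against them.

```python
import xml.parsers.expat

# Hypothetical three-element DTD, embedded as an internal subset for brevity.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE text [
  <!ELEMENT text (title, persName*)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT persName (#PCDATA)>
  <!ATTLIST persName type CDATA #IMPLIED
                     lang CDATA #REQUIRED>
]>
<text><title>gsang bdag zhal lung</title><persName lang="san">Amitabha</persName></text>
"""


def read_declarations(xml_text):
    """Return (element names, {element name: [attribute names]}) declared in the DTD."""
    elements = []
    attributes = {}

    def on_element(name, model):
        # Called once per <!ELEMENT ...> declaration, in document order.
        elements.append(name)

    def on_attlist(element, attribute, attr_type, default, required):
        # Called once per attribute declared in an <!ATTLIST ...>.
        attributes.setdefault(element, []).append(attribute)

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(xml_text, True)
    return elements, attributes


if __name__ == "__main__":
    names, attrs = read_declarations(DOCUMENT)
    print(names)
    print(attrs)
```

The declarations come back in the order in which they were written, which reflects the flat, list-like character of a DTD described above.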

Once one has a DTD or schema suited for one's chosen task, one can begin to create XML documents. XML documents are simple text documents, since the elements used for markup are plain text. (2) As in HTML, the elements, or tags, take the form of a name within angle brackets (e.g., <title>). This is matched by a closing tag that takes the same form but with a forward slash between the opening angle bracket and the element name (e.g., </title>). The content of the element is placed between the opening and closing tags. Thus, one might find: <title>gsang bdag zhal lung</title>. Attributes of the element are listed in the opening tag after the element's name and take the form of name=value pairs, as in:

<persName type="Buddha" lang="san" rend="italic">Amitabha</persName>.
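As a small illustration of how software sees such an element, the sketch below parses the example with Python's standard xml.etree.ElementTree module and separates it into the three parts just described: the element name, the attributes, and the content. This is only one way of reading the markup, and it is not part of the THL toolchain.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring(
    '<persName type="Buddha" lang="san" rend="italic">Amitabha</persName>'
)

name = element.tag            # element name, taken from the opening tag
attributes = element.attrib   # name=value pairs listed in the opening tag
content = element.text        # text between the opening and closing tags

if __name__ == "__main__":
    print(name)
    print(attributes)
    print(content)
```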

Attribute values must be enclosed within quotation marks. (For more information on XML, see the W3Schools tutorial on XML.) Since all the elements and content of an XML document are just text, an XML document can be created in any text editor, such as TextPad, NoteTab, etc. However, such programs do not have a way to validate the XML document against its DTD. A program that validates an XML document checks the document against the DTD (its specific XML rules) to make sure it stays within the guidelines set by the DTD. If the DTD dictates that <div2> elements can only be nested within <div1> elements, a document that had <p><div2>…</div2></p> would not be valid. One reason to use XML editors is that they provide the means for validating a document against its DTD as one is writing it. Generally, XML editors also provide different views of the XML document, showing a view of its hierarchical structure, a view of highlighted elements, or a plain text view. Furthermore, these editors usually provide windows showing which elements are permissible for insertion at the cursor and the available attributes for the currently selected element. These features make the use of XML editors preferable to the text editor method of entry.
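The difference between a document that merely parses and one that obeys a nesting rule can be sketched in a few lines of Python. The ALLOWED_PARENTS table below is a hypothetical, greatly simplified stand-in for a DTD content model (it encodes only the <div2>-inside-<div1> rule from the example above), and the check is a toy, not a real DTD validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical nesting rule standing in for a DTD content model:
# each listed element may appear only directly inside one of these parents.
ALLOWED_PARENTS = {"div2": {"div1"}}


def is_well_formed(xml_text):
    """True if the text parses as XML at all (tags balanced, attributes quoted)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def nesting_errors(xml_text):
    """List every child element found under a parent the rule table forbids."""
    root = ET.fromstring(xml_text)
    errors = []
    for parent in root.iter():
        for child in parent:
            allowed = ALLOWED_PARENTS.get(child.tag)
            if allowed is not None and parent.tag not in allowed:
                errors.append(f"<{child.tag}> is not allowed inside <{parent.tag}>")
    return errors


if __name__ == "__main__":
    print(nesting_errors("<body><div1><div2>text</div2></div1></body>"))
    print(nesting_errors("<body><p><div2>text</div2></p></body>"))
    print(is_well_formed("<p><div2>text</p></div2>"))
```

The second document is well-formed XML yet breaks the nesting rule, which is the situation a validating editor is designed to catch.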

In addition to a DTD and an XML document, one requires an XSL stylesheet to specify how these various tagged items should visually appear: for example, render TITLE as underlined, render PERSON'S NAME in a red font, and so forth. These stylesheets are written in an XML transformation language called XSLT. This language is itself XML-based and is used to create a set of "templates" that are applied to the XML document according to its hierarchical order. Each stylesheet must contain a template that matches the root element of the document. That template will provide instructions on what to output, e.g. HTML tags, and in turn will call other templates to be applied to the children of the root element and so forth. In this way, the stylesheet instructs the processor to descend the hierarchy of the XML document and transform its tags into HTML tags. (3) While the XSLT stylesheet is important for the ultimate display of an XML document or its transformation to another data format, an understanding of the XSLT language is not necessary for the creation and editing of XML documents. The instructions contained in the document THL XML Markup of Texts provide all the information necessary to mark up a text in compliance with the continuing development of our XSLT stylesheets. For further information on the function of XSLT, see the section below on Displaying an XML Document.
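The descent described above, in which a template handles one element and then hands its children to further templates, can be imitated in Python. The sketch below is a stand-in for the idea and is not XSLT itself; the TEMPLATES mapping and the sample document are hypothetical and do not reflect THL's actual stylesheets.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical "templates": each source element is mapped to one HTML tag.
TEMPLATES = {
    "text": "div",
    "title": "h1",
    "p": "p",
    "persName": "em",
}

SOURCE = (
    "<text><title>gsang bdag zhal lung</title>"
    "<p>Praise to <persName>Amitabha</persName>.</p></text>"
)


def apply_templates(element):
    """Transform one element, then descend to its children in document order."""
    html_tag = TEMPLATES.get(element.tag, "span")  # fallback for unmapped elements
    parts = [f"<{html_tag}>", escape(element.text or "")]
    for child in element:
        parts.append(apply_templates(child))
        parts.append(escape(child.tail or ""))  # text that follows the child
    parts.append(f"</{html_tag}>")
    return "".join(parts)


if __name__ == "__main__":
    print(apply_templates(ET.fromstring(SOURCE)))
```

Starting from the root element, each call emits an HTML tag and recurses into the children, so the output mirrors the hierarchy of the source document.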

Using an XML Editor

XML editors serve several functions. They allow one to navigate easily through an XML document. They allow one to edit the text of the document and the mark-up by adding, deleting, or changing the elements and their attributes. They also provide, to varying degrees, some sort of WYSIWYG ("what you see is what you get") view of the document. Finally, they validate documents against their DTD (Document Type Definition) and make sure they are well-formed. To provide this functionality, the editor needs to be able to locate the DTD for each document edited. The DTD is a set of rules that describe where and how elements can be used within a document, essentially defining the mark-up language. A valid document is one that follows all the rules of its DTD. A well-formed document is one that follows the general rules of XML: every opening tag must have a corresponding closing tag, all attribute values must be in quotes, and so forth. Whereas HTML and SGML documents do not need to be well-formed, all XML documents must be. However, in XML the structure, the elements used, their attributes and so forth can vary from document to document, as defined by the DTD used. In THL Literature Collections, there are two primary DTDs, xtib3.dtd (for marking up texts) and xtibbibl3.dtd (for creating catalog records). Both are based on the TEI (Text Encoding Initiative) P4 DTD, with certain additions for cataloging and markup of Tibetan texts. Because there are various DTDs, each XML document is required to declare its DTD in its opening lines, through what is called a Document Type Declaration, or doctype declaration. The doctype declaration contains the name of the root element of the document, connected with a public identifier and/or a system identifier for locating the corresponding DTD. For instance, the doctype declaration for this document at present is:

<!DOCTYPE TEI.2 PUBLIC "-//THDL//DTD TibetanText//EN" "xtib3.dtd">

As the DTD is based on TEI, the root element for documents using this DTD is TEI.2. This is followed by the word "PUBLIC" and the public identifier for the DTD in quotes: "-//THDL//DTD TibetanText//EN". The public identifier is used by catalog files to locate the DTD. Following this and also in quotes is the system identifier, or "xtib3.dtd". This is the actual file name of the DTD, which can also include the path information. It is called a system identifier because it locates the DTD on one's specific system, or computer. The inclusion of a doctype declaration is the first step in connecting an XML document to its DTD.
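The three parts of the doctype declaration can be pulled apart programmatically. The following is a minimal sketch using Python's standard expat bindings; the tiny document body wrapped around the declaration is a placeholder invented for the example, and expat does not fetch or apply the external DTD here.

```python
import xml.parsers.expat

# The doctype declaration from the text, followed by a placeholder document body.
DOCUMENT = (
    '<!DOCTYPE TEI.2 PUBLIC "-//THDL//DTD TibetanText//EN" "xtib3.dtd">'
    "<TEI.2><text><body/></text></TEI.2>"
)


def read_doctype(xml_text):
    """Return (root element name, public identifier, system identifier)."""
    result = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        result["doctype"] = (name, public_id, system_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return result.get("doctype")


if __name__ == "__main__":
    print(read_doctype(DOCUMENT))
```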

However, the use of an XML editor requires some setup prior to opening and editing XML documents. XML editors need a way to connect the public identifier of the doctype declaration with the actual file of the DTD so that they can read the rules of the language and validate the document to be edited against those rules. This is done with a catalog file, which is usually a list of doctypes, such as TEI.2 and so forth, and the specific path of their DTD relative to the editor. Such catalog files need to be registered with the editor prior to opening a document. As each editor differs in how it deals with doctypes and catalogs, the details will be explained separately for each of the main editors below.
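In essence, a catalog is a lookup table from public identifiers to file locations. The sketch below shows that idea only; the dictionary layout, the dtds/ folder, and the /opt/editor installation directory are all hypothetical, and real editors each use their own catalog file format.

```python
from pathlib import PurePosixPath

# Hypothetical catalog: public identifier -> DTD path relative to the editor.
CATALOG = {
    "-//THDL//DTD TibetanText//EN": "dtds/xtib3.dtd",
}


def resolve_dtd(public_id, system_id, editor_dir="/opt/editor"):
    """Prefer the catalog entry; otherwise fall back to the system identifier."""
    relative = CATALOG.get(public_id, system_id)
    return str(PurePosixPath(editor_dir) / relative)


if __name__ == "__main__":
    print(resolve_dtd("-//THDL//DTD TibetanText//EN", "xtib3.dtd"))
```

If the public identifier is not listed, this sketch simply falls back to the system identifier, which is one plausible behavior rather than a rule every editor follows.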

Many XML editors use Cascading Style Sheets (CSS) to format the display of the XML document in the editor window. By formatting font, character, and paragraph specifics, the editors can provide something of a WYSIWYG view that emulates the display of the document on the web. These CSS styles are generally created within the editor itself, and the recommended editor, Morphon, provides a useful interface for editing and changing the CSS styles. For Morphon, the basic styles have been written and can be obtained in a package provided by THL for those working on XML documents. These styles can be modified by the user, if so desired. See the W3Schools tutorial on CSS Stylesheets for more information on the CSS language.

Displaying an XML Document

XML is a relatively recent technology, and because of its extensibility, it is more difficult for browsers to display XML than the simpler and static HTML. An XML-derived form of HTML known as XHTML, which must necessarily be well-formed, has been codified and is gradually taking over as the standard. More recent versions of web browsers will display an XML document with its tags visible, allowing the user to collapse and expand sections much as in an XML editor. However, this is of little practical use to the average user unless they wish to learn about XML. Instead, for web display, another XML-based language has been created that allows one to transform an XML document in any number of ways. This language is known as "eXtensible Stylesheet Language Transformations", or XSLT.

In simple terms, XSLT starts from the root element and moves down the hierarchy of the document, applying "templates" (transformation instructions) to each element it finds. In more technical terms, XSLT uses the XPath language to select and match elements within an XML document and can copy them, replace them with different elements, or simply ignore them. In this way, XSLT is true to its name: it allows the user to transform an XML document from one state to another. An output element (xsl:output) at the beginning of an XSLT stylesheet determines the form of its output, generally XML, HTML, or plain text. XSLT is itself an XML-defined language, so an XSLT stylesheet has the same hierarchical structure as an XML document, except that it uses a namespace, marked by a prefix on the element names, that ensures their uniqueness. XSLT provides many of the enhanced capabilities of XML and significantly augments its "extensibility". However, for our present usage, XSLT is used to transform an XML document into HTML in order to be displayed on the web. Thus, once a document is marked up in XML, it can be presented in different views through the use of different stylesheets. This has the advantage of requiring only one source document, so that information does not need to be repeated over multiple web pages.
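To give a feel for the kind of selecting and matching that XPath performs, here is a short sketch using Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The sample document, its element names, and its n attributes are hypothetical, and the code illustrates path-based selection in general rather than an XSLT stylesheet.

```python
import xml.etree.ElementTree as ET

SOURCE = """<text>
  <body>
    <div1 n="1"><head>Introduction</head><p>First paragraph.</p></div1>
    <div1 n="2"><head>Translation</head><p>Second paragraph.</p><note>Editorial note.</note></div1>
  </body>
</text>"""

root = ET.fromstring(SOURCE)

# Select every <head> that is a direct child of a <div1>, anywhere in the document.
headings = [head.text for head in root.findall(".//div1/head")]

# Select by attribute value: the <div1> whose n attribute is "2".
second = root.find(".//div1[@n='2']")

# "Ignore" an element: collect only the <p> children, leaving <note> out.
paragraphs = [p.text for p in second.findall("p")]

if __name__ == "__main__":
    print(headings)
    print(paragraphs)
```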

Perhaps one of the best examples of the usefulness of XML with XSLT stylesheets is found in the THL web pages, such as our Literature Technical Documentation page or Jose Cabezon's essay on the Space of Sera Monastery. In these pages, a table of contents of the whole document appears in a highlight box to the left of the body of the text. As one navigates through the document, the table of contents changes to indicate where one is in the document as a whole. All this is done through the use of XSLT stylesheets. The source is always the same XML document, but the stylesheet first goes through the structure of the document to make the TOC, using a passed parameter to determine one's present location within it, and then iterates again through the desired portion of the document to display the text. For the purposes of this document, an in-depth discussion of the workings of XSLT is not necessary. One should look at the W3Schools tutorial on XSL for a discussion of the specific commands and so forth.
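The two-pass approach described above (build the table of contents from the whole document, then output only the current section) can be sketched in Python as follows. This is an illustration of the idea and not the THL stylesheet; the id attributes, section titles, and placeholder text are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document with placeholder content.
SOURCE = """<body>
  <div1 id="one"><head>Section One</head><p>Text of section one.</p></div1>
  <div1 id="two"><head>Section Two</head><p>Text of section two.</p></div1>
  <div1 id="three"><head>Section Three</head><p>Text of section three.</p></div1>
</body>"""


def render_page(xml_text, current):
    """Return (table-of-contents lines, paragraphs of the current section)."""
    root = ET.fromstring(xml_text)
    toc = []
    section_text = []
    # First pass: walk the whole structure to build the table of contents,
    # marking the entry that matches the passed-in location parameter.
    for div in root.findall("div1"):
        marker = ">" if div.get("id") == current else " "
        toc.append(f"{marker} {div.findtext('head')}")
    # Second pass: output only the requested portion of the document.
    for div in root.findall("div1"):
        if div.get("id") == current:
            section_text = [p.text for p in div.findall("p")]
    return toc, section_text


if __name__ == "__main__":
    toc, text = render_page(SOURCE, "two")
    print("\n".join(toc))
    print(text)
```

Calling render_page with a different current value yields a different view, while the source document stays the same.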

The use of an XML document thus requires the coordination of three items. Each (1) XML document has (2) a specified DTD or Schema that governs its markup and (3) an XSLT stylesheet dictating how its marked-up text should appear in HTML. Because of this extensible nature of XML documents, each of which may have its own set of rules and tags, it is impossible to provide a simple set of menus and keyboard shortcuts that will serve the needs of all users. For this reason, XML editors have traditionally been cumbersome and time-consuming to use, with considerable manual entry of tags rather than the simple streamlined work we are accustomed to in word processors. In addition, tags are generally visible, producing a tag-cluttered view of the document that is far from the rich, WYSIWYG ("what you see is what you get") environment of the typical word processor.

Though XML files can be created with a simple text editor, XML is easier to implement when using an editor specifically designed for XML mark-up. Software development for XML is several years behind that of HTML. New generations of XML editors are emerging with flexible abilities to switch back and forth between tag-filled and tag-free views, allowing one to edit directly in a stylesheet-facilitated WYSIWYG view. These remain expensive, however, and still require more than a bit of setup. There are also free editors that can be used in a fairly streamlined, semi-WYSIWYG mode if they have been prepared properly for the desired tasks.

THL has thus created the necessary DTD, style sheet and manual for doing markup of both academic essays and classical Tibetan literature. In addition, we have customized a free XML editor and created a User's guide to facilitate THL collaborators using it for THL work. We plan to extend this work to cover a second free XML editor (Cooktop) as well as an expensive yet very powerful software package called Xmetal. As new XML editors become available and are investigated, they will be added here, if they seem to meet our needs. The section for each piece of software describes how to obtain the program, how to install it, and how to use it.

A select number of XML editors are described and reviewed to the best of our abilities in XML Editors.

Notes

(1) The root element is the highest level element of an XML document. It is the ancestor element for all other elements in that document, which are nested within it.

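(2) However, the "plain text" must be in a character encoding that XML processors can read. The XML specification requires every processor to support the UTF-8 and UTF-16 Unicode encodings, and UTF-8 is assumed when no other encoding is declared.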

(3) Stylesheets are not limited to outputting HTML. They can also transform a document into plain text, another form of XML, PDF, RTF, and so forth. In the context of the THL, however, discussion of XSLT transformations and stylesheets will focus solely on transforming XML into HTML.

(4) For instance, as this is being typed, Morphon has the following displayed in its bottom status bar:

/TEI.2/text/body/div1(2)/div2(1)/div3(3)/div4(1)/list/item(1)/text()(2)

The numbers in the brackets indicate the sequential order of that element among its siblings. "../div1(2)" means the second <div1> element and so forth.

(5) If the CSS editor is already open but is minimized or hidden when this option is chosen, nothing will happen. You will have to bring it to the front by choosing it from the taskbar at the bottom of your window.

(6) Some of the options in this menu are rarely or never available and others are redundant or of little use. These options are not listed here.

(7) An XML document consists of a root element that contains child elements, which in turn can contain children, thus creating the tree-like structure. All elements can have as their children some mixture of other elements, blocks of text, comments, or processing instructions. Any parent or child from the root on down is called a node. Thus, a text node is a block of text found anywhere within an XML document.

(8) At present, the conversion routines have generally been created for Microsoft Word, but the basic principle could be applied to other word processors, should an adventurous mind choose to take on such a task.

(9) The situation is somewhat easier for Chinese and Japanese because of the extensive support they have received from the operating system manufacturers and the advanced status of their Unicode fonts. For these East Asian languages, the “non-Tibetan” method could be employed provided that one’s IME will work in conjunction with the XML editor.

(10) Character styles are those that apply only to a run of one or more characters. Paragraph styles apply to the whole paragraph.

Provided for unrestricted use by the Tibetan and Himalayan Library