XML

“Markup” languages have become the standard means of both structuring documents and presenting them over the World Wide Web. The most widely known of these languages is HTML (HyperText Markup Language), which is used for creating web pages. Now in its fourth version, HTML is actually a specific application of SGML (Standard Generalized Markup Language). SGML is described as a markup meta-language because it is used to define the rules of other markup languages such as HTML.

SGML is an international standard for the description of marked-up electronic text. It is a meta-language: a language for formally describing markup languages. In this context, “markup” (or more precisely, “encoding”) signifies any means of making explicit an interpretation of a text, so that the user of the text is told how its content should be understood. A “markup language” is thus a set of such conventions for encoding texts. It specifies what markup is allowed, what markup is required, and how markup is to be distinguished from the actual primary text. SGML is the parent language of HTML, the basic language generally used to encode Web pages; HTML is a set of descriptive tags defined according to the tag-set building rules contained within SGML. eXtensible Markup Language (XML) is a meta-language like SGML and is both derived from and compatible with SGML. XML is designed for easier delivery on the Internet than SGML, and for easier implementation by software developers. XML is beginning to supplant HTML as the language of choice for the Web, though this shift will take several years to complete.

SGML grew out of GML (Generalized Markup Language), which Goldfarb, Mosher, and Lorie developed at IBM in 1969 as a markup language for legal documents; SGML itself was first standardized by the ISO (International Organization for Standardization) in 1986. It was eventually discovered, however, that SGML was too general to be easily implemented by the average user. Then, beginning in 1989, Tim Berners-Lee, building on SGML work done at CERN by Anders Berglund, developed HTML. Because HTML was a much-simplified, fixed application of SGML, it became the de facto standard markup language of the internet. While HTML is convenient because of its simplicity, it is not adaptable to diverse types of information; it is essentially a formatting language. The flexibility of SGML, on the other hand, lies in the ability to create user-defined tags for structural encoding (delimiting the structure of a document) and descriptive encoding (describing the nature of a document’s content). This is done through a Document Type Definition (DTD) that determines where tags can be used, what attributes they have, and what content they can hold.

HTML is one language defined using an SGML DTD. Its simplicity led the fledgling internet community to adopt it quickly, so that it is now the standard for presenting web documents. However, structural and descriptive markup is almost completely absent from HTML, which deals mainly with how the document is to be displayed. For this reason, SGML has continued to be used despite its unwieldiness. In an effort to make SGML more manageable and accessible, a simplified version of it known as XML (eXtensible Markup Language) has been developed. Whereas SGML can generally be displayed only with expensive, proprietary software, XML is paired with a styling language, XSL (eXtensible Stylesheet Language), that allows XML documents to be transformed in any number of ways. The theory is that XML will be transformed using XSL into formatting objects that browsers will someday be able to read and display. However, because XML is still relatively new, we are only now seeing its initial implementations in the most common browsers. Until browsers are fully XML-compatible, XML must first be transformed into HTML before it can be presented on the web.
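The XML-to-HTML step described above can be illustrated with a minimal sketch. It is not XSL: Python's standard library has no XSL processor, so an ordinary function stands in for the stylesheet. The element names (text, title, author), the sample title, and the function name to_html are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical catalog record; element names are illustrative only.
SOURCE = """<text>
  <title>Example Title</title>
  <author>Padmasambhava</author>
</text>"""


def to_html(xml_source: str) -> str:
    """Map descriptive XML elements onto presentational HTML elements."""
    root = ET.fromstring(xml_source)
    title = root.findtext("title", default="")
    author = root.findtext("author", default="")
    # The XML says *what* each piece is; only here do we decide how it looks.
    return (
        "<html><body>"
        f"<h1>{escape(title)}</h1>"
        f"<p><i>{escape(author)}</i></p>"
        "</body></html>"
    )


html_page = to_html(SOURCE)
print(html_page)
```

The point of the design is that the display decisions live entirely in to_html, so the same source record could be given a different presentation without being edited.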

When the Samantabhadra Project first approached IATH for technical assistance in 1997, the W3C (World Wide Web Consortium), which developed XML, had yet to release XML 1.0, which came out in 1998. Even now XML is still in its infancy. Thus, the information we have documented and stored has been encoded in SGML, in which it will remain until XML becomes sufficiently stable and can be implemented easily enough to make the conversion practical.

The general value of SGML can be understood via three points. The first is that SGML allows one to mark up the intellectual content and structure of a document descriptively. In contrast, HTML is chiefly concerned with superficial display issues. SGML uses descriptive rather than procedural markup, meaning that it simply identifies portions of the document rather than specifying what processing should be carried out at particular points in it. The data structure of an SGML file is thus geared towards an intellectual analysis of the document. Its tags allow one to create richly structured documents by encoding structural information (title page, main body of text, scene, stanza, section, date, author, etc.) in addition to information about renditional and typographical features (changes in typeface, line breaks, etc.). This enables sophisticated searches and other manipulation of the data.
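A small sketch may make the link between descriptive markup and searching concrete. It uses XML syntax rather than SGML, because Python's standard library parses XML but not SGML; the principle is the same. The catalog structure, the titles, the second author, and the function name titles_by_author are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record catalog; element names are illustrative only.
DOC = """<catalog>
  <text>
    <title>First Example Title</title>
    <author>Padmasambhava</author>
  </text>
  <text>
    <title>Second Example Title</title>
    <author>Another Author</author>
  </text>
</catalog>"""


def titles_by_author(xml_source: str, author: str) -> list:
    """Return the titles of all texts whose <author> matches exactly."""
    root = ET.fromstring(xml_source)
    return [
        record.findtext("title")
        for record in root.findall("text")
        if record.findtext("author") == author
    ]


matches = titles_by_author(DOC, "Padmasambhava")
print(matches)
```

Because the markup identifies which string is an author and which is a title, the search can ask for "titles by this author" directly. With purely typographical markup (bold, italic, line breaks), the same question could only be answered by guessing from appearance.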

The second advantage is SGML’s universality: as long as the underlying data structure is accurate, it can be understood and used by any program or tool that understands SGML. The markup itself does not say how a document should be processed; separate instructions are needed for that. Since the instructions for processing an SGML document for a particular purpose (such as formatting) are strictly separated from the descriptive markup within the document, they are usually collected outside the document in separate procedures or programs. This means that the same SGML document can be readily processed by entirely different pieces of software in quite distinct ways, and even a single part of an SGML file can be processed in a variety of ways. In addition, an SGML text consists of plain ASCII text with special tags contained in angle brackets (e.g., <author> Padmasambhava </author>). These tags mark the beginning and end of an “element” and identify the type of information contained between them. Since the tags are composed of plain ASCII characters, no special software or proprietary binary format is necessary to create an SGML file. This ensures long-term viability and easy delivery of files across networks and platforms.
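The separation of markup from processing can be sketched as two independent routines applied to one unchanged document. This is an illustration only: it uses XML syntax (Python's standard library does not parse SGML), and the element names, the sample title, and the function names render_plain and build_index are hypothetical.

```python
import xml.etree.ElementTree as ET

# One plain-text document; the markup says nothing about how to process it.
DOC = "<text><title>Example Title</title><author> Padmasambhava </author></text>"


def render_plain(root: ET.Element) -> str:
    """Formatter: produce a one-line citation for display."""
    author = root.findtext("author", default="").strip()
    title = root.findtext("title", default="").strip()
    return f"{author}, {title}"


def build_index(root: ET.Element) -> dict:
    """Indexer: map each element name to its text content."""
    return {child.tag: (child.text or "").strip() for child in root}


root = ET.fromstring(DOC)
citation = render_plain(root)   # one kind of processing
index = build_index(root)       # a different kind, same document
is_plain_ascii = DOC.isascii()  # the file is ordinary text, not a binary format

print(citation)
print(index)
```

Neither routine knows about the other, and neither required any change to the document. That independence is what the paragraph above means by processing instructions being kept outside the document.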

The third point is the “Document Type Definition” (DTD), which defines the identity and functionality of the tagging elements that can be used. For a detailed discussion of tagging, see the separate document on “SGML-tagging in The Collected Tantras of the Ancients (rNying ma rgyud ’bum)”. Documents are thus understood as having “types”, each of which is formally defined by its constituent parts and their structure. Documents of the same type can therefore be processed in a uniform way. SGML is used in many fields, such as the airline and defense industries, and for each field there tend to be standard sets of elements or tags (that is, DTDs) suited to the concerns and documents that dominate that field. Most pertinently for our purposes, most serious digital library initiatives use the tag set or DTD defined by the TEI (Text Encoding Initiative) Guidelines, which were developed specifically for humanities computing projects. A subset of this is referred to as TEI-Lite. TEI is an attempt to set standards for the use and manipulation of texts. HTML versions of TEI-encoded texts are then created, often on the fly, for web delivery. The downside of relying on TEI is that it can overly restrict one’s flexibility; the upside is that it conforms to standards being used in other humanities projects. TEI will often lack elements with the precise functionality one wants, at which point one uses a program to create those specialized elements in a TEI-compliant DTD that defines them. In addition to creating new elements, a DTD can also modify the attributes of an existing element (making them required fields, etc.).
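To show what a DTD declaration looks like and how a required attribute might be enforced, here is a minimal sketch under several assumptions. The DTD is a hypothetical four-line example, not TEI; it is written as an internal subset of an XML document so that everything fits in one string. Python's standard library does not validate documents against a DTD, so the sketch only reads the attribute declarations with the expat parser and checks one rule (required attributes) by hand. The function names are hypothetical.

```python
import xml.parsers.expat
import xml.etree.ElementTree as ET

# Hypothetical DTD and document. The DTD says <title> must carry a "lang"
# attribute, but the <title> in the document body omits it.
DOC = """<!DOCTYPE text [
  <!ELEMENT text (title, author)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ATTLIST title lang CDATA #REQUIRED>
]>
<text>
  <title>Example Title</title>
  <author>Padmasambhava</author>
</text>"""


def required_attributes(source: str) -> set:
    """Collect (element, attribute) pairs that the DTD marks #REQUIRED."""
    required = set()

    def on_attlist(elname, attname, att_type, default, is_required):
        if is_required:
            required.add((elname, attname))

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(source, True)  # expat reads the DTD but does not validate
    return required


def missing_attributes(source: str) -> list:
    """List required attributes that are absent from the document body."""
    required = required_attributes(source)
    root = ET.fromstring(source)
    return sorted(
        (element.tag, attname)
        for element in root.iter()
        for (elname, attname) in required
        if element.tag == elname and attname not in element.attrib
    )


problems = missing_attributes(DOC)
print(problems)
```

A real validating parser checks far more than this, including which elements may appear where and in what order. The sketch covers only the "making them required fields" case mentioned above.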

To facilitate data entry for large projects that require many files of a similar structure, it is possible to create SGML-based templates: sets of SGML tags arranged in a precise hierarchy, into which the data is entered. The associated DTD then defines the nature of the tagging elements used in that template.
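The template idea can be sketched as a function that builds an empty tag hierarchy and another that fills it in. The field names, the root element name, and the function names below are hypothetical, and XML syntax is used in place of SGML because that is what Python's standard library can produce.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: the fields every record is expected to contain,
# in the order they should appear.
TEMPLATE_FIELDS = ["title", "author", "date"]


def blank_record() -> ET.Element:
    """Build an empty record with every template field present, in order."""
    record = ET.Element("text")
    for name in TEMPLATE_FIELDS:
        ET.SubElement(record, name)
    return record


def fill_record(values: dict) -> str:
    """Enter data into a fresh template; fields without data stay empty."""
    record = blank_record()
    for child in record:
        child.text = values.get(child.tag, "")
    return ET.tostring(record, encoding="unicode")


filled = fill_record({"title": "Example Title", "author": "Padmasambhava"})
print(filled)
```

Every file produced this way has the same elements in the same order, which is what lets a single DTD and a single set of processing programs serve the whole collection.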

The following sections of this chapter deal with the SGML markup of our information and its display. Because of the unique issues involved in cataloging and representing Tibetan texts electronically, we developed a specific SGML-defined language to accommodate these peculiarities. Working for over a year in weekly meetings with Daniel Pitti from IATH, we developed our TIBBIBL (TIBetan BIBLiography) DTD for the encoding of The Collected Tantras of the Ancients (rNying ma rgyud ’bum) catalog. The TIBBIBL DTD is based on the TEI (Text Encoding Initiative) DTD, with significant additions to account for information specific to Tibetan literature, such as a text’s “revealer” and “concealer”, multiple titles for a single text, and so forth. The TIBBIBL DTD is the topic of what follows.

Provided for unrestricted use by the Tibetan and Himalayan Digital Library