Contributor(s): THL Staff.
The conversion routine is a work in progress. Ideally, each style available in TextToXMLConverter.doc would be marked up in the appropriate XML when converted. We are presently working on the document that lists the correspondences between the styles and the XML markup, while at the same time implementing each correspondence in the conversion routine. At present, however, some of the more specific, less-used styles are not converted. The styles most essential for structuring the XML document and for basic formatting of the text are available. (This discussion focuses specifically on the conversion process. For a more general discussion of the use of styles in Word, see the Using Microsoft Word Styles manual.)
The initial step in the process is to paste the original document into a copy of TextToXMLConverter.doc. Rename the copy appropriately; doing so keeps both the style information and the Visual Basic macros with the new document. Then go through the document and apply the appropriate styles to the paragraphs and words as necessary. There are two kinds of styles available in Microsoft Word. Paragraph styles apply to a whole paragraph of text, while character styles apply only to a single character or run of characters anywhere within a paragraph. A header style is a paragraph style. Italics are a form of character style. Both kinds of styles are used in the markup process.
The basic principle of the conversion is that the metadata table at the top contains all the metadata, while the structure of the text is represented by the nested headers. Heading 1 represents the major divisions of the text, while Heading 2 represents the sub-divisions of those divisions. Heading 3 represents the sub-divisions of the sub-divisions, and so forth. Other than that, tables and lists should be formatted and styled appropriately, and, if desired, one can apply specific styles to indicate personal names, place names, titles, and so forth. There is also a function within the conversion routine that goes through all italics in the document and allows the user to apply more specific styles in order to differentiate between titles, foreign languages, names, and other uses of italics. The description of the style/XML correspondences is found at:
http://www.thdl.org/xml/showEssay.php?xml=/tools/scholartools/word_to_xml.xml.
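The way nested headings express the structure of the text can be illustrated with a minimal Python sketch. It is not the converter's actual Visual Basic code: the function name build_tree and the element names body, div, head, and p are assumptions chosen for illustration, and the real output markup is the one described at the address above.

```python
import xml.etree.ElementTree as ET

def build_tree(paragraphs):
    """Nest (style, text) pairs into <div> elements by heading level.

    Hypothetical helper: the element names used here are illustrative
    assumptions, not the markup the real converter emits.
    """
    root = ET.Element("body")
    # Each stack entry is (heading level, element); the root sits at level 0.
    stack = [(0, root)]
    for style, text in paragraphs:
        if style.startswith("Heading "):
            level = int(style.split()[1])
            # Close every open division at this level or deeper.
            while stack[-1][0] >= level:
                stack.pop()
            div = ET.SubElement(stack[-1][1], "div", {"n": str(level)})
            ET.SubElement(div, "head").text = text
            stack.append((level, div))
        else:
            # Any non-heading paragraph goes into the innermost open division.
            ET.SubElement(stack[-1][1], "p").text = text
    return root

paragraphs = [
    ("Heading 1", "Introduction"),
    ("Normal", "Opening paragraph."),
    ("Heading 2", "Background"),
    ("Normal", "Details."),
    ("Heading 1", "Conclusion"),
]
root = build_tree(paragraphs)
print(ET.tostring(root, encoding="unicode"))
```

In this sketch the Heading 2 division ends up inside the first Heading 1 division, and the second Heading 1 closes both and opens a new top-level division.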
A summary of the style-to-markup principles follows:
Note: One does not need to apply a particular style to tables. Just insert the table through the Table menu and the converter will recognize it as such. However, one must remember to include the metadata table at the beginning, because the converter takes the first table in the document to be the metadata table.
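The "first table is the metadata table" rule can be sketched in Python as follows. The representation of a document as a list of (kind, payload) pairs and the function name split_metadata are assumptions made for this illustration only; the converter itself works on the Word document directly.

```python
def split_metadata(blocks):
    """Return (metadata_rows, remaining_blocks).

    Hypothetical helper: the first block of kind "table" is treated as the
    metadata table, whatever it actually contains.
    """
    for i, (kind, payload) in enumerate(blocks):
        if kind == "table":
            return payload, blocks[:i] + blocks[i + 1:]
    raise ValueError("no table found, so there is no metadata table")

blocks = [
    ("table", [("Title", "Sample Essay"), ("Author", "THL Staff")]),
    ("paragraph", "Body text."),
    ("table", [("Term", "Gloss")]),
]
metadata_rows, content = split_metadata(blocks)
metadata = dict(metadata_rows)
```

If the metadata table were left out of this example, the sketch would pick up the Term/Gloss table as metadata instead, which is the mistake the note above warns against.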
Note: These are the regular footnotes entered into a document either by pressing Ctrl+Alt+F or through the Insert menu (Insert, Reference, Footnote).
For the markup applied to each of the character styles and for a quick way of differentiating indiscriminate use of italics, see the section on the Italics Conversion Macro (section 4.A) below. One should not spend too much time applying styles for conversion, since not all styles are functional yet and the converter does not always convert them properly. Furthermore, particularly long documents will often cause the converter to lock up and fail to finish the conversion. If this is a problem, break the document into smaller parts and combine them later as XML. The editor should instead focus on creating the structure of the document (headings, paragraphs, tables, notes, and lists) and applying basic character styles (titles, names, etc.). More complicated markup should be done in the XML editor.
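Combining separately converted parts can be done by hand in the XML editor; for many parts, a short script may be convenient. The following Python sketch assumes that each converted part has a root element containing a body element (the names text and body are assumptions here, as is the function name combine_parts), and it ignores the metadata, which would normally be kept from one part only.

```python
import xml.etree.ElementTree as ET

def combine_parts(xml_parts):
    """Merge the <body> children of several XML strings into one document.

    Hypothetical helper: assumes every part looks like
    <text><body>...</body></text>, which may not match the real output.
    """
    combined = ET.Element("text")
    body = ET.SubElement(combined, "body")
    for part in xml_parts:
        part_root = ET.fromstring(part)
        part_body = part_root.find("body")
        if part_body is None:
            raise ValueError("part has no <body> element")
        # Append this part's top-level divisions in their original order.
        body.extend(list(part_body))
    return combined

parts = [
    "<text><body><div><head>Part One</head><p>First half.</p></div></body></text>",
    "<text><body><div><head>Part Two</head><p>Second half.</p></div></body></text>",
]
merged = combine_parts(parts)
print(ET.tostring(merged, encoding="unicode"))
```

The parts must be passed in document order, since the sketch simply appends the divisions of each part after those of the previous one.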