ITIG ACRL/NEC Information Technology Interest Group Homepage


ITIG Tech Corner
"XML: Promise and Problems"

Charity Hope, UMass Amherst Libraries
April/May 2000

 

Overview of XML

Most librarians interested in technology will have heard of XML (Extensible Markup Language), which is widely predicted to be the future publishing format of choice for structured information on the Web. But what is XML, and how can librarians begin learning about and using this new technology? This article provides a brief overview of XML and an introduction to XHTML, the latest recommendation from the World Wide Web Consortium, which promises to help Web developers make the transition from the familiar HTML to the more powerful and flexible XML.

 

What is XML?

Unlike HTML, which defines a fixed set of tags that structure and format information for viewing in a browser, XML (Extensible Markup Language) outlines a set of rules that are used to create different tagging vocabularies. (This may sound familiar to people who have worked with Standard Generalized Markup Language: XML is in fact a simplified subset of SGML.) This ability to create new vocabularies is what is meant by extensible; a developer can define and use elements and attributes that suit the content of the information. For example, if you're marking up a bibliographic list in XML, you can define a set of semantically meaningful tags that accurately describe that data:

<bibl>
  <author>Hetzer, T. </author>
  <date>1935, 1948</date>
  <title>Titian Geschichte seiner Farben</title>
  <pubPlace>Frankfurt-a-M</pubPlace>
</bibl>
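
To suggest why semantically meaningful tags matter, here is a minimal Python sketch that reads the record above with the standard library's xml.etree.ElementTree module and retrieves fields by name. The variable names (record, bibl, fields) are illustrative assumptions, not part of any XML standard.

```python
import xml.etree.ElementTree as ET

# The bibliographic record from the example above, as a string.
record = (
    "<bibl><author>Hetzer, T. </author> <date>1935, 1948</date> "
    "<title>Titian Geschichte seiner Farben</title> "
    "<pubPlace>Frankfurt-a-M</pubPlace></bibl>"
)

bibl = ET.fromstring(record)

# Because each tag names what it contains, a program can ask for a
# field directly instead of guessing from layout or position.
fields = {child.tag: child.text.strip() for child in bibl}

print(fields["author"])    # Hetzer, T.
print(fields["pubPlace"])  # Frankfurt-a-M
```

With HTML markup such as bold or italic tags, the same program would have no reliable way to tell the author from the place of publication.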

Although it is possible, it is not likely that individual Web developers will each separately define an XML tagging vocabulary. Rather, different tagging vocabularies for different kinds of documents and data are being developed collaboratively for shared use. Some of the first XML markup vocabularies to emerge have been from the groups most hampered by the limitations of HTML. For example, MusicML allows musicians to describe musical notation, and MathML allows the markup of mathematical formulas and equations. Each content-specific markup vocabulary follows the rules for XML, but uses different tags.

The tagging vocabulary can be defined within the XML document itself or separately in a special definitions file, which can then be shared among multiple XML documents. Currently, the most common type of definitions file is a DTD, or Document Type Definition (another aspect of XML inherited from SGML). Another system for defining XML vocabularies---XML Schemas---is being developed by the World Wide Web Consortium as the potential successor to DTDs. XML-aware applications---including but not limited to browsers---don't need to understand in advance the tags that may be used. Instead, they learn about the document's tagging structure as they parse the document and its associated DTD or Schema.
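
As a rough illustration of an application learning a document's structure as it reads, the Python sketch below parses a document that carries a small DTD in its internal subset and then lists the elements it finds, with no prior knowledge of the vocabulary. The DTD shown is a hypothetical one written for this example, and the standard library parser used here reads the document without validating it against that DTD.

```python
import xml.etree.ElementTree as ET

# A document carrying its own (hypothetical) vocabulary definition in
# an internal DTD subset. ElementTree does not validate against the
# DTD; it simply reads whatever elements appear.
document = """<?xml version="1.0"?>
<!DOCTYPE bibl [
  <!ELEMENT bibl (author, date, title, pubPlace)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT date (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT pubPlace (#PCDATA)>
]>
<bibl>
  <author>Hetzer, T.</author>
  <date>1935, 1948</date>
  <title>Titian Geschichte seiner Farben</title>
  <pubPlace>Frankfurt-a-M</pubPlace>
</bibl>"""

root = ET.fromstring(document)


def outline(element, depth=0):
    """Return (depth, tag) pairs discovered by walking the tree."""
    found = [(depth, element.tag)]
    for child in element:
        found.extend(outline(child, depth + 1))
    return found


structure = outline(root)
for depth, tag in structure:
    print("  " * depth + tag)
```

A validating parser would go one step further and report an error if the document used a tag, or an ordering of tags, that the DTD does not allow.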

Tags in XML define only the content of information, not the way it should be formatted in a browser. Formatting instructions are described in a separate style file, which can be either CSS (Cascading Style Sheets) or XSL (Extensible Stylesheet Language). Web developers currently using strict HTML 4.0 with CSS are familiar with the advantages of separating content from style. The same content can be formatted differently for different audiences and uses, and style alterations are easier to manage because changes to multiple documents can be effected with a single edit to the style sheet.

The XML document, the vocabulary definition file, and the style file are the three basic parts to the XML puzzle, separately defining XML content, structure, and formatting, as illustrated below:

[Figure: Basic XML Components: Document, Vocabulary Definition File, and Style File]

These are the basics, but XML also works with (or will work with) a whole suite of technologies and standards including: XLink, which allows for enhanced linking capabilities; XPointer, which allows you to target specific parts of remote documents without embedded target tags; XQL, which allows for the querying of XML files; and XSL-T (XSL for Transformations), a syntax that allows you to transform XML documents from one XML vocabulary to another. XSL-T is particularly relevant for current XML projects, as it can be used to transform XML documents (for which browser support is still incomplete) into HTML files for rendering on the Web.
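
The idea of transforming an XML document into HTML can be sketched in a few lines of Python. This is not XSL-T itself: it is a hand-written stand-in that uses the standard library's xml.etree.ElementTree, and the TAG_MAP table and to_html function are hypothetical names invented for this example. A real XSL-T stylesheet would express similar rules as templates.

```python
import xml.etree.ElementTree as ET

record = ET.fromstring(
    "<bibl><author>Hetzer, T.</author><date>1935, 1948</date>"
    "<title>Titian Geschichte seiner Farben</title>"
    "<pubPlace>Frankfurt-a-M</pubPlace></bibl>"
)

# Hypothetical mapping from the source vocabulary to HTML elements.
TAG_MAP = {
    "bibl": "p",
    "author": "span",
    "date": "span",
    "title": "cite",
    "pubPlace": "span",
}


def to_html(source):
    """Build an HTML element tree that mirrors the source element."""
    # Keep the original tag name as a class so CSS can still style it.
    target = ET.Element(TAG_MAP[source.tag], {"class": source.tag})
    target.text = source.text
    for child in source:
        new_child = to_html(child)
        # Preserve text after the child, or add a space between fields.
        new_child.tail = child.tail or " "
        target.append(new_child)
    return target


html = ET.tostring(to_html(record), encoding="unicode", method="html")
print(html)
```

The output is ordinary HTML that any current browser can display, while the original XML file remains the authoritative copy of the content.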

 

The Promise of XML

XML was designed to be simpler to use and implement than its parent, SGML, and it promises several advantages over HTML. As an "extensible language," XML allows you to define your own tags and attributes---the markup structure can be customized to fit the data. Eventually, the more semantically meaningful tags possible with XML could greatly enhance searching capabilities.

In addition, XML's separation of content, structure, and style allows the developer to reuse each component. For example, the same structure and style can be applied to many content documents, and the same content can be styled multiple times in different ways for different viewing, rendering, or printing devices. Separation of content, structure, and style also has the potential to allow a logical division of labor between people with different skill sets: authors can create marked-up content using simple WYSIWYG tools (yet to be developed), technically minded developers can define the tagging vocabulary to be used in the editing tools, and designers can create attractive styles for different purposes and audiences.

XLink, though not implemented in any current commercial browser, will allow multiple types of links (bi-directional links, externally managed links that provide access to a ring of sites or let the user open multiple windows, links with multiple sources, links with attributes, etc.).

In addition, the content-specific, textual tags make XML files easy for people to read and write; XML's specific, precise, and strict rules make documents more comprehensible and manageable for computers as well. In HTML, an application needs to do a lot more work to interpret a document because some sloppiness in the tagging is allowed. XML is more precise: for example, tag names are case-sensitive, so an element's opening and closing tags must match in case, and every element must be "closed": <br></br> or <br/>, but not <br>.

Furthermore, because XML is an open, non-vendor-specific standard, and because XML-tagged content is text-based and self-describing, XML allows for easier transfer of information between different applications and platforms. Finally, although commercial tools have not yet been developed to meet developers' enthusiasm, XML does have a growing base of tool support (or promises of support) among producers of software for Web development, browsing, and searching.
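
The strictness of XML's rules is easy to observe with any XML parser. The short Python sketch below uses the standard library's xml.etree.ElementTree; the helper name is_well_formed is an assumption made for this example. It shows that closed tags are accepted, while an unclosed tag or a mismatch in case is rejected.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, else False."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<p>one<br/>two</p>"))      # True: empty-element tag
print(is_well_formed("<p>one<br></br>two</p>"))  # True: explicit close
print(is_well_formed("<p>one<br>two</p>"))       # False: <br> never closed
print(is_well_formed("<P>one</p>"))              # False: case mismatch
```

An HTML browser would display all four fragments without complaint, which is exactly the leniency that makes HTML harder for other software to process reliably.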

 

Problems with XML

The current reality for library Web developers, however, does not yet match up to the ideal, for several reasons. First of all, although one of the primary design goals for the creators of XML was that XML documents should be easy to create, ease of use is a problem. The tagged XML document itself is fairly easy to create, and the availability of ready-to-use or easily adaptable DTDs is good. But the complexity of the growing family of changing XML standards could be a significant barrier to implementation. Although XSL-T, XML Schema, XQL, and XPointer are not programming languages, they do make use of programming concepts and processes (recursive processing, objects, etc.) that will be unfamiliar territory to most librarians. They also utilize complex---and different---syntactic rules for development.

In addition, with the exception of XSL-T, none of the XML-associated technologies discussed above has reached the World Wide Web Consortium's recommendation stage: they are all working documents that are not yet stable and not yet well supported by commercial software vendors. For example, the use of XSL-T to transform XML into HTML is something of a "hack" that is implemented because neither of the formatting mechanisms described above is quite ready for prime time. XSL is still in development and not currently fully implemented in any commercial browser; CSS styling of XML works, but only in later versions of Microsoft's Internet Explorer. Hopefully, these and similar problems are temporary; when the technologies become standard and full-featured XML tools are created (which most people believe is only a matter of time), the complexity of XML technologies could be hidden behind friendly, easy-to-use interfaces.

 

XHTML (Extensible HTML): A reformulation of HTML in XML

If you want to begin learning about XML but don't have time to tackle changeable, complex technologies, you might want to start with XHTML, released by the World Wide Web Consortium as a recommendation in late January 2000. XHTML consists of three XML vocabularies that correspond to the three DTDs outlined in HTML 4.0---Strict, Transitional, and Frameset. In other words, XHTML is a tagging language that follows the rules for XML, but uses the same basic HTML tags with which Web developers are already familiar. In addition, XHTML documents can be read by current browsers, so you don't need to compromise access as you experiment with new technologies. XHTML offers several advantages over HTML:

XHTML developers will discover that XML's precise rules require some changes from common HTML practice (or malpractice), although these are fairly minor:

 

To learn more about the potential of XML and XHTML...

Start with the World Wide Web Consortium's "XML in 10 Points" and the XML FAQ. Then, explore Alan Richmond's excellent "Introduction to XHTML, with eXamples". Many additional tutorials exist online, linked from the World Wide Web Consortium's homepage. For a more in-depth treatment, Elliotte Harold's book, XML Bible, is excellent. Also, ITIG's own Norman Desmarais has just published a book targeted specifically to librarians, ABCs of XML: The Librarian's Guide to the Extensible Markup Language.

 

About the Author

Charity Hope is a Librarian at the UMass Amherst Libraries, where she is nearing completion of a project to publish two scholarly monographs online, using XML and XSL-T. Contact her through August 1, 2000, at chope@library.umass.edu.

 

Charity Hope
Librarian
W.E.B. Du Bois Library
UMass Amherst Libraries
Amherst, MA 01003
chope@libraries.umass.edu

Comments Welcome!

 




© Copyright 1999-2001, ITIG ACRL/NEC Information Technology Interest Group. All Rights Reserved.
Website currently maintained by ITIG Webmaster, Olga Verbeek
ITIG URL: http://www.acrlnec.org/sigs/itig/
Last updated: Monday, April 30, 2001