Reviewing XML Options

23 April 2011

XML, XHTML5, HTML5, IGP:FoundationXHTML

I have been involved in a remorseless LinkedIn discussion on the merits of XHTML vs. XML DTDs (again). I take the position that the big three DTDs, DocBook, TEI and NLM, are holes to sink money into. Most others say they see the point, but still think all publishers should use DB/TEI/NLM.

These XML options have a place and application, but they are not a suitable option for general publisher content.

I am planning a guideline for creating an XML strategy using XHTML in my next article. Before we start on the XHTML strategy, let's look at the big three, which are tirelessly quoted as being suitable for "any" XML strategy. This gives some context. It may be more information than you wanted, but here goes.

National Library of Medicine - NLM

Their statement. "The National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM) created the Journal Archiving and Interchange Tag Suite with the intent of providing a common format in which publishers and archives can exchange journal content. The Suite provides a set of XML schema modules that define elements and attributes for describing the textual and graphical content of journal articles as well as some non-article material such as letters, editorials, and book and product reviews."  URL: http://dtd.nlm.nih.gov/

It's a journal article XML strategy. Since journal articles cover just about every subject under the sun, it is comprehensive. There are four variants: for backlist, authoring, publishing and archiving. NLM was first released in 2003 (making it a relatively late starter); the current version is 3.0, last updated in 2008. There is a major strategy in hand to rework the schemas to allow customization (the inevitable end of any committee-based XML). If you are an academic publisher, evaluate it closely. If you are not, ignore it and any advice to use it.

Text Encoding Initiative - TEI

Their statement. "The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation." URL: http://www.tei-c.org/index.xml

So humanities, social sciences and linguistics is a pretty broad brush. It started as SGML and moved to XML as TEI Lite. There are around 70 members, mostly universities and government organizations. If you are a national government, or a university wanting to take an academic approach to your XML, this is the one for you. If you are not, ignore it and any advice to use it.

DocBook - DB

Their statement. "DocBook is a schema (available in several languages including RELAX NG, SGML and XML DTDs, and W3C XML Schema) maintained by the DocBook Technical Committee of OASIS. It is particularly well suited to books and papers about computer hardware and software (though it is by no means limited to these applications).

Because it is a large and robust schema, and because its main structures correspond to the general notion of what constitutes a “book,” DocBook has been adopted by a large and growing community of authors writing books of all kinds. DocBook is supported “out of the box” by a number of commercial tools, and there is rapidly expanding support for it in a number of free software environments. These features have combined to make DocBook a generally easy to understand, widely useful, and very popular schema. Dozens of organizations are using DocBook for millions of pages of documentation, in various print and online formats, worldwide." URL: http://www.docbook.org/whatis

OK, so they are selling a bit in their statement. But its origins are computer books, and a very large subset of the elements are computer-code specific. So if you produce highly structured, controlled, linear technical books without significant content variation, go for DocBook. If you do not, ignore it and any advice to use it.
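
For a feel of what that looks like, here is a minimal sketch of DocBook tagging, based on my reading of the DocBook 5 documentation. The content is invented for illustration and this is not a complete or authoritative sample:

  <!-- Minimal DocBook 5 sketch; element names per the DocBook 5.0 schema, content invented -->
  <book xmlns="http://docbook.org/ns/docbook" version="5.0">
    <title>An Example Manual</title>
    <chapter>
      <title>Getting Started</title>
      <para>A paragraph of running text.</para>
      <programlisting language="c">printf("hello\n");</programlisting>
    </chapter>
  </book>

Even in this tiny sample the computer-documentation heritage shows through in elements like programlisting, and the full element vocabulary runs to several hundred elements.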

Summary of the "Big Three"

If you take a ramble through these sites, there is one common theme: we are really complicated, with large vocabularies that are hard to learn and use, so we provide all these "resources" to get you off to a quick start. In addition, they all proudly provide some sort of mini-version with trainer wheels.

What does this mean to any publisher? It is not hard to work out.

  1. They are all specialist DTDs/schemas, or specialist DTDs trying to become general through updates and customization modules.
  2. They are difficult to learn, and therefore difficult to check and test.
  3. They can all be extended, at a cost.
  4. By implication of that extensibility, they all have tagging limitations.
  5. They all claim to be processable to XHTML (Internet) and PDF (print).
  6. They are all very old and have their roots in SGML, in the days when IT strategies were being formulated and, with the exception of NLM, long before the Internet was real.
  7. They all cost a lot of money to get started with, use and maintain.

Now take this statement from TEI, which applies to all the above XML strategies to some extent (NLM less so): "...there is no one correct way to encode any given text...". This is a massive problem for the future use of any XML and is an identifiable failure of DocBook and TEI in particular. It alone destroys the concept of interchange and predictable future value.

This statement is reinforced by the following, "Some information from DocBook on 'Future-proofing'", extracted from the DocBook site (in a fair-use manner, I hope):

"Whether you’re just getting started with DocBook, or curating a collection of tens of thousands of DocBook documents, one question that you have to consider is “how stable is DocBook?” Will the documents that you write today still be useful tomorrow, or next year, or in the next century?

This question may seem particularly pertinent if you’re in the process of converting a collection of DocBook 4.x documents to DocBook V5.0 because we introduced a number of backward-incompatible changes in V5.0.

The DocBook Technical Committee understands that the community benefits from the long-term stability of the DocBook family of schemas. We also understand that DocBook must continue to adapt and change in order to remain relevant in a changing world.

All changes, and especially changes that are backward incompatible (changes that make a currently valid document no longer valid under a new version of the schema), have a cost associated with them. The technical committee must balance those costs against the need to remain responsive to the community’s desire to see DocBook grow to cover the new use cases that inevitably arise in documentation.

With that in mind, the DocBook Technical Committee has adopted the following policy on backward-incompatible changes. This policy spells out when backward-incompatible changes can occur and how much notice the technical committee must provide before adopting a schema that is backward incompatible with the current release.

This policy allows DocBook to continue to change and adapt while simultaneously guaranteeing that existing users will have sufficient advance notice to develop reasonable migration plans"

If you are planning to use DocBook (or any of the others), plan big for the migration, where big means money. Also take note that they are not complete strategies. With real publisher content it is easy to run an XSL transform or other processes across the content in an ETL (Extract, Transform, Load) operation. It is significantly different to know that thousands of documents/books/articles, or whatever your content is, have been processed correctly.
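
To make the ETL point concrete, here is a sketch of the kind of XSL transform that gets run across content in such an operation, mapping a few assumed DocBook elements to XHTML. The mapping is illustrative only, not a complete or production stylesheet:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Illustrative sketch: maps a few DocBook 5 elements to XHTML equivalents -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:db="http://docbook.org/ns/docbook"
      xmlns="http://www.w3.org/1999/xhtml">
    <xsl:template match="db:chapter">
      <div class="chapter"><xsl:apply-templates/></div>
    </xsl:template>
    <xsl:template match="db:chapter/db:title">
      <h1><xsl:apply-templates/></h1>
    </xsl:template>
    <xsl:template match="db:para">
      <p><xsl:apply-templates/></p>
    </xsl:template>
  </xsl:stylesheet>

Writing the transform is the easy part; the expensive part is proving that it produced correct, consistent output across an entire backlist.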

If you are planning to use TEI, it is probably going into a database or website, so it probably doesn't need much real future-proof value; but remember it is primarily designed for machine reading, not reuse, extraction and future-proofing. I could give a dozen tagging examples of how TEI encoding destroys content value, and production budgets.
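
As one illustration of the "no one correct way" problem, here is a sketch using standard TEI elements as I understand them; the same italicised book title can legitimately be encoded in at least three ways, and downstream processes have to handle all of them:

  <!-- Three legitimate TEI encodings of the same italicised phrase -->
  <p>The <hi rend="italic">Origin of Species</hi> appeared in 1859.</p>
  <p>The <emph>Origin of Species</emph> appeared in 1859.</p>
  <p>The <title rend="italic">Origin of Species</title> appeared in 1859.</p>

Each is schema-valid, and each captures a different editorial judgement, which is exactly what makes consistent extraction and reuse so expensive later.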

High-quality XML for publisher content is ultimately about strictly controlled tagging patterns that are used consistently, always, and that remain fluid for presentation; it is this that generates current and future value. None of the above XML strategies directly introduces standards for the correctness of tagging patterns on content. They are technical, and somewhat abstract, definitions of the element and attribute vocabularies they support. Their tagging-model extension methods are expensive and usually an ill fit both for wide genres of content and for the explicit detail that individual publisher content needs for real reuse.
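
By contrast, here is a sketch of what a strictly controlled tagging pattern can look like in XHTML: a small, fixed element set with a documented class vocabulary applied the same way every time. The class names here are invented for illustration, not taken from any published scheme:

  <!-- A controlled XHTML tagging pattern: ordinary elements, a fixed class vocabulary -->
  <div class="chapter">
    <h1 class="chapter-title">Chapter One</h1>
    <p class="para-first">It was the best of times, it was the worst of times.</p>
    <p class="para">The opening continues in ordinary paragraphs.</p>
    <blockquote class="extract">
      <p class="para">A quoted extract, tagged the same way every time.</p>
    </blockquote>
  </div>

The elements are plain XHTML, so every browser, XSL process and e-book toolchain already understands the structure; the publisher-specific semantics live entirely in the controlled class vocabulary, which is where the consistency, and therefore the future value, comes from.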

XHTML Power

Hugely missing from the stark simplicity of XML, and from the complex XML strategies built on it, is the core underlying nature of content. It just isn't there in the definitions. Read any of the "list of elements" for any of the above DTDs. You will get the drift from some of the grammar, but much will be opaque. You cannot tell how, or more importantly where, an element is to be used from its grammar. So these XML strategies start with a massive learning cost. OK, you can get the outsource crowd to do the tagging, but will you know if it is applied well and correctly when your deliverable is an XML file and an ePub? Will you look at every XML file delivered?

Each element in HTML falls into zero or more categories that group together elements with similar characteristics. The categories are:

  1. Metadata content
  2. Flow content
  3. Sectioning content
  4. Heading content
  5. Phrasing content
  6. Embedded content
  7. Interactive content

This was stated in the previous post, but it is so important to grasp that it doesn't hurt to repeat it here. This is massively missing in most XML strategies, except by reference to the DTD; it is not an explicit content strategy there. These categories define the primary purpose of each element and its relation to all the others. And this XML strategy is extremely well known and understood across the world.
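
As a rough illustration of those categories in use (file names and text invented; category assignments per my reading of the HTML5 specification):

  <!-- sectioning content -->
  <section>
    <!-- heading content -->
    <h1>Chapter One</h1>
    <!-- flow content (p) containing phrasing content (em) -->
    <p>It was a <em>dark</em> and stormy night.</p>
    <!-- embedded content -->
    <img src="storm.jpg" alt="Storm at sea"/>
    <!-- interactive content -->
    <a href="chapter-two.xhtml">Next chapter</a>
  </section>

Metadata content (title, meta, link and so on) lives in the document head and is not shown here.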

Conclusion

This post is not a criticism of these DTDs for what they are. They are all technical tours de force. It is to make people think seriously about their relevance and applicability to the production of the full gamut of publisher digital content in 2011, and to the possible future uses of that content. DB/TEI/NLM will only deliver in extremely narrow publisher genres where the content is for specific objectives.

The main purpose of this discussion is to bring XHTML forward as a real, valuable digital content strategy that fits a wider range of content than specialist XML strategies ever will.

XML consultants always fall back on "the three". Understandably, they don't want to reinvent a custom XML vocabulary, which is an incredibly difficult and expensive undertaking. But these three, for all their qualities and features, fall far short of the requirements of 2011 digital content strategies. An approximate or close fit just doesn't do it. It has to be an exact fit.

Meanwhile, XHTML stands patiently waiting to be used, but is ignored because of the "Internet web page taint". It gets dismissed instantly as unusable with boilerplate statements like:

  1. It is not a good archive format because ('n' boilerplate statements).
  2. There is no future value.
  3. It is about structure, not purpose.
  4. While I understand it can be used, I would still recommend DB/TEI/NLM.

Every one of these statements is wrong and exhibits a serious lack of knowledge of what is, and will be, expected of general publisher content from 2011 forward. Shoving content into semi-relevant 1990s containers and saying it is now format neutral and therefore more valuable is stupid. XHTML is an incredibly powerful, internationally supported, highly controlled XML content schema. The core structures must be valid, and the tagging patterns must be well formed but can be highly amorphous. Just what is needed for future-value reuse.

Publisher choice boils down to: 1) "one of the big three", with all the problems highlighted above; 2) a fully custom XML DTD/Schema; or 3) XHTML with controlled vocabularies.

My next few posts will show any publisher, of any size, how to exploit the massive power of XHTML to build low-cost, sophisticated, custom XHTML strategies instantly.

If you are thinking about more use, reuse and extension of your digital content than ePub and Kindle, then XHTML is the only option.

Of course, our objective is that smart publishers will use IGP:Digital Publisher, a multi-format XHTML web services production environment, and the free, maintained IGP:FoundationXHTML tagging pattern library to produce future-value XML, print PDF, e-books, and highly remixable, processor-ready content.
