AVOID XML-FIRST LIKE THE PLAGUE

10 February 2013

XML, IGP:FoundationXHTML, XHTML5, ePub3, Digital Content Strategies

Getting the structure of any digital content correct is core to any real digital content strategy. To start this discussion in the right tone: Publishers! Avoid XML with vehemence and violence. Implement an XHTML strategy with a controlled vocabulary.

O'Reilly have always been strong proponents of "XML First" and have promoted their cleverness with this approach. Of course, they have mainly mono-subject content with a half-life of only a few months. With the O'Reilly TOC conference starting in a few days, it seems like the right time to bring this up.

Here is one of the saddest and most typical ePub2-3 pain stories, from O'Reilly themselves:

http://toc.oreilly.com/2013/02/oreillys-journey-to-epub-3.html

There are a LOT of reasons for selecting XHTML over XML; probably too many for this post, so this fireside chat may roll over into more posts.

Why Such a Strong anti-XML Pitch?

Easy. Because it doesn't work, it is not maintainable, it is not sustainable, it is not extensible, it is not flexible or agile, it costs a lot of money, it never delivers what it promises and it is not ready for the type of future content is now facing.

These are of course generalizations. Sometimes it does work. But just having a cupboard full of XML tagged content is not a real 2013 digital content strategy.

The same way that HTML5 has eliminated Flash and redefined what the Internet is becoming, it redefines publisher content management strategies. XML consultants either don't get it, or don't want to get it. From the O'Reilly article:

"Over the past year and a half, O’Reilly has sponsored the DocBook project’s development of open source XSL stylesheets for transforming DocBook XML content to EPUB 3, which we’ve used to update our own toolchain to produce EPUB 3 output. With the release of iBooks 3.0 in late 2012, a critical mass of O’Reilly’s readers had devices that supported EPUB 3 content. We felt it was time to upgrade our content to EPUB 3 to provide people using 3.0-compliant platforms the best quality reading experience."

Yes folks. It took them only a year and a half. IGP:Digital Publisher was creating ePub3 from five-year-old stored content in December 2011, just 15 days after the spec was released. No XSL stylesheet to transform DocBook crap to XHTML. All the special handling they describe in the article was already implicit in the content.

It only has to be hard if you use XML first!

IGP:FoundationXHTML

Around 18 months ago I started a series of articles on the philosophy and approach of IGP:FoundationXHTML. For various reasons that fizzled to a halt as I went on a blogging sabbatical to focus on the job.

Since then we have put the full IGP:FoundationXHTML Specification documents online, so a set of articles is not really required. Those interested in these deep things can read and absorb them at their leisure. However, it does seem to be time to restart the dialogue now that the resources are available to talk against.

The argument with the "XML crowd" boils down to one word: "semantics". It is a load of nonsense. On our IGP:FoundationXHTML Guiding Principles page we define SEVEN properties of equal importance. These are:

Structure. In FX structure is king. (Yes XML people; structure is The One). This is the core value which defines the content stack and conditional grouping. It is best understood as the core accessibility value of the content when no styling or processing is applied. The core structure elements are XHTML/5. There is no reinvention or extension of the HTML; and the available HTML elements must be used with thought. Not all HTML web oriented elements and attributes apply for high future value digital content. A Title Page, Chapter, Title, paragraph and list are structural components.
Semantics. Semantic names should only be used if the value is explicit and associated with a structural element. Semantics are always applied as qualifiers to structure. Using HTML as the base ensures it is not possible to create a tagging pattern where structure has to be implied from semantics. This approach prevents digital content "death by semantics".
Styling. Styling is the layout, decoration or prettifying with CSS for increased understanding, user engagement, custom presentation or branding. It is important to understand that CSS is a very powerful tool that works with the XHTML on multiple dimensions. Styling is just one of those dimensions.
Presentation. In FX terms presentation means the "format" context from which a content presentation instance will be delivered, e.g. PDF, e-books, other formats, static sites, fixed and flow layouts, interactivity, CDP/ACO and remixed content. To the extent possible, FX tagged content must always be available for any presentation context. Where content or tagging is created for a single presentation context or a set of them, the tagging must be explicit and obvious for those presentation contexts to both humans and processors through controlled grammars.
Behaviour. Modern digital content is often required to exhibit behavioural characteristics in various contexts. FX must allow and enable required behaviour without reducing, siloing or hiding core content value. This includes the ability to be used with CSS modifiers such as transforms, transitions and animation; JavaScript-assisted interactivity; and flow/fixed/variable layout for a range of digital content reading devices, platforms and reuse environments.
Processing. Digital content of worth requires processing for many purposes. Defined machine processing instructions must be consistent to reduce current and future processing costs to a minimum. FX should be created to ensure tagging patterns provide clarity to allow explicit processing to achieve any required result. I.e., processing is never a "leave it until later" option. FX must implicitly state what and how content is to be processed without ambiguity. Ideally every FX tag is a processing target.
Metadata. Metadata in the FX context is data about content. This can include descriptive, fixity, provenance, rights, third-party vocabularies and processing instructions. Comprehensive and correct metadata is required for all formats, but is essential to allow content to be processed and used correctly in multiple advanced delivery contexts such as SCORM, web pages, extraction and remixing environments, even little ePub generation. FX states metadata is more important than semantic tagging for the correct use and reuse of content, although there are structures (such as references) that can be tagged for metadata extraction directly from semantic selectors. However, this is inevitably more costly than providing straightforward metadata constructions.
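
To make the Structure, Semantics and Processing properties a little more concrete, here is a minimal sketch in Python. It assumes semantics are carried as class attribute values on plain XHTML elements. The class names (chapter, chapter-title, first-para, reference-list, reference) and the helper functions are invented for illustration; they are not taken from the FX specification. The point is only that the structure can be read from the elements alone, and that a semantic qualifier gives a processor something explicit to target.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The class names are illustrative assumptions,
# not the real IGP:FoundationXHTML vocabulary.
FRAGMENT = """\
<section xmlns="http://www.w3.org/1999/xhtml" class="chapter">
  <h1 class="chapter-title">Getting Started</h1>
  <p class="first-para">Structure comes from the XHTML elements.</p>
  <ul class="reference-list">
    <li class="reference">Smith, J. (2012). An Example Title.</li>
    <li class="reference">Jones, A. (2011). Another Example.</li>
  </ul>
</section>
"""


def local_name(element):
    """Return the element name without its namespace part."""
    return element.tag.split("}", 1)[-1]


def outline(root):
    """List (element name, class value) pairs in document order."""
    return [(local_name(el), el.get("class", "")) for el in root.iter()]


def select_by_class(root, name):
    """Return the text of every element whose class list contains `name`."""
    return [
        "".join(el.itertext()).strip()
        for el in root.iter()
        if name in el.get("class", "").split()
    ]


root = ET.fromstring(FRAGMENT)

# Structure is readable with every class attribute ignored.
structure = [tag for tag, _ in outline(root)]

# The semantic qualifier becomes an explicit processing target.
references = select_by_class(root, "reference")

print(structure)
print(references)
```

If the class attributes were stripped from the fragment, the outline of section, heading, paragraph and list would still be intact, which is the "structure first, semantics as qualifier" idea in the list above.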

You can read more about the arguments of controlled vocabularies using XML "validation" vs. just controlling them with tools here and throughout the FX specification.
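
As a rough illustration of what "controlling a vocabulary with tools" could look like, the following Python sketch checks class values against an allowed list instead of relying on schema validation. The ALLOWED table, the check_vocabulary function and the sample markup are all assumptions made up for this example; the FX specification defines its own vocabulary and tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary: the class values each element may carry.
ALLOWED = {
    "section": {"title-page", "chapter"},
    "h1": {"chapter-title"},
    "p": {"first-para", "epigraph"},
}


def check_vocabulary(xhtml_text, allowed=ALLOWED):
    """Return (element, class) pairs that fall outside the vocabulary."""
    problems = []
    for el in ET.fromstring(xhtml_text).iter():
        tag = el.tag.split("}", 1)[-1]  # drop the namespace part
        for cls in el.get("class", "").split():
            if cls not in allowed.get(tag, set()):
                problems.append((tag, cls))
    return problems


SAMPLE = """\
<section xmlns="http://www.w3.org/1999/xhtml" class="chapter">
  <h1 class="chapter-title">One</h1>
  <p class="first-para">Fine.</p>
  <p class="fancy-box">Not in the vocabulary.</p>
</section>
"""

problems = check_vocabulary(SAMPLE)
print(problems)
```

A check like this can run inside the authoring or production tool, so an out-of-vocabulary class is flagged when it is written rather than discovered later during format generation.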

With the exception of NLM (and a lot of that is tagged really badly) there is not an XML system out there that delivers the goods for any publisher of any content whatsoever. You hear that DocBook has a large vocabulary (it's actually rather weak and misses details), but it does not even get content structure close to correct. It is very sad.

Disclaimer

We used the NLM Bibliography semantic tagging patterns in IGP:FoundationXHTML because they are excellent, created by experts and complete, and because bibliographies are one of the areas in content where semantics really is an important property for academic discovery.

You have to respect, admire and use professional excellence in XML. It doesn't happen as much as many people think.

Of course the problem of digital content ownership and production is bigger than atrocious XML strategies.

So you are a small or medium publisher. If you read that article by O'Reilly you should be running screaming for the XML exit door.

What can be worse than XML?

As it happens, there are worse digital content strategies than worthless XML. Even worse than XML is trying to produce multiple formats in desktop environments such as InDesign, Sigil and Calibre, or their ilk. These can eventually produce an ePub format with massive effort.

Now you need an ePub3. Blast. Gotta go through the same thing again. If you need complex notes, indexes, references, and image positioning, it just can't be done sensibly.

People who use these systems think: because the PDF was made in InDesign, I can use the same tool to make an ePub. Adobe will get it right. Yeah!

The fact is Adobe can't get it right. You only have to read the IDML document to see why they can never get it right without a total refactoring of the core software (it's an XML garbage can). From a digital content production perspective, tools such as InDesign (great for PDF) and Apple's iBooks Publish are the digital content gutter, where money runs away and content goes to die. It is incredibly expensive to create that content in the first instance, and it cannot be reused by publishers to make more money.

Back to the plot

The XML crowd (those who think DocBook, TEI, NLM, or some custom XML monstrosity is a content strategy) have got all this wrong.

The O'Reilly article highlights the stupidity of an XML repository in mindless DocBook. If your "XML consultant" is recommending any of these approaches, chances are they are going to cost you, the publisher, a lot of money without much in the way of deliverables, except delivering the consultant more money.

Yesterday's XML solutions do not address the business dynamics required by publishers today. What do you as a publisher need from your digital content? Try this list:

Make money as fast as possible.
Make money from as many channels in the digital content e-retail diaspora as possible.
Adapt instantly to changes in the content delivery landscape.
Turn the same content you made your ePub2 formats with a year ago into ePub3, but with advanced changes and modifications to exploit new features.
Address reflowable and fixed layout instantly, even on the same digital content if necessary.
Create hardcover, large-print, paperback and mass-market print PDF editions from the same digital content source.
Instantly handle font subsetting, obfuscation, SMIL processing and rich-media optimization without having to do any work.
Make new and original products that match today's emerging readers.
Create your own forward looking tablet/desktop/device of tomorrow content engagement experiences.
Constantly and consistently reduce costs, increase profits and improve reader engagement experiences.

Conclusion

OK. If you are writing and publishing a novel, a text editor is fine. If you are a hobbyist production person who enjoys the incompatibility problems between Kindle, iBooks, Nook, Kobo, etc., that's fine.

But if you have to deliver education, academic, trade non-fiction, self-help, travel, cooking, magazines and just about every other type of content out there on a schedule and a budget, the tool-set that originated in print and web pages doesn't cut it.

All the format and channel delivery problems are addressed very easily if you start with your content in the right format. In 2013 that means XHTML5. Don't compromise.