HTML5 Is As Good As It Gets for Publisher Digital Content

07 July 2014

HTML5, XHTML, Digital Content Strategies, IGP:FoundationXHTML, IGP:Digital Publisher


2014 has had some strange and negative commentary about publishing with HTML5. The comments appear to be focused on HTML for trade fiction books, and the requirements of publishing genres beyond simple narratives seem to be ignored.

An entity with infinite properties such as HTML5 cannot be understood without considering all viewpoints. In publishing production that means all book and document genres. So let's address this commentary imbalance and try to see the full digital content production reality, and how HTML5 really is as good as it gets, and is getting better.

Let's look at the other sides of this HTML thing.

There's An Elephant In The Room!

From Charles Maurice Stebbins & Mary H. Coolidge, Golden Treasury Readers: Primer, American Book Co. (New York), p. 89 Wikimedia Commons

Of course there are many different dialogs going on in digital content production. It sometimes seems to be like The Blind Men And An Elephant story, except there are different publishers in separate cubicles, each reading different book genres as they describe digital content:

One person is looking at a novel on a Kindle and says how digital content needs to focus on the design, flow and typography for readability.
Another is looking at academic papers and journals on a workstation and says how important searching, linking, indexes and references are for understanding.
Yet another looks at K-12 textbooks in an ePub3 reader and says how the most important thing is interactivity, engagement and dynamic feedback.
Then from another cubicle a person looking at corporate and government documentation says it has to be multi-author enabled and dynamically reviewable and available instantly in all formats.
Then there is the cubicle where books just have to be apps, and one where only fixed layout will work. There are many, many more cubicles.

So to be clear this post is about digital content production and ownership rather than formats. It is about what you need to do to ensure your production content satisfies all of today's cubicles and any cubicles that may emerge in the future.

The problem is not what you do.
The problem is how you do it.

How to Use HTML for Digital Publishing

(X)HTML(5), CSS(3) and Javascript provide all of the digital content production, engagement and delivery requirements for all publisher genres right now. All of the components are in place, are relatively mature and well understood. It will continue to be the distribution heart of digital content for the foreseeable future. Why not make it the production soul as well!

Back in the old days all genres had just one presentation method. Print books are edited and produced in a production machine that has spent 400 years learning how to do XY content presentation on paper; and there are still lousy looking print books produced. The digital content space in 2014 has many more challenges. We probably need to stop making ebooks a print designer's version of what they think ebooks should be.

Check out our sample book: "Questions and Answers (Reflowing)"

The same XHTML5 content generated:

An interactive ePub3 book with Q&A evaluation (best viewed using the AZARDI ePub3 Reading System)
A printed test question PDF
A PDF printable Answer Sheet for students
A PDF marking sheet for instructors/teachers

HTML5, semantics and metadata are not show-stopping problems. The common tools used to create them are the show-stopping problems. HTML production issues with valuable digital content are detailed rather than complex. HTML has been around and working for a long time; and it has all the tools needed for the most sophisticated 2014 publishing.

Publishers need to keep their managed content structurally complete and always ready for business. You can then consistently apply values to the structural components using class, data-*, title, lang, translate and dir for semantic, processing, presentation, layout and any other requirements. While this is relatively straightforward for fiction and even academic text, it is game changing for more complex, interactive, reusable, extensible and dynamic content.

Use the HTML(5) elements you choose carefully and consistently

The hardcore historical elements of HTML are <div />, <h1-h6 />, <p />, <ul />, <ol />, <li />, <img />, <table />, <span />, <a /> and a few others. Between them these define the no-fail, no argument structure of any document. They set the no-CSS fallback content pattern so core accessibility is addressed up-front and not as an after-thought.

The simple structural heart of HTML is its strength. XML consultants will pitch it as a weakness. They are wrong! The other HTML attributes deliver all the components required for the most sophisticated digital content publishing. HTML/5 in all its versions was designed for the world wide web. That is not a perfect fit for managed publisher content. For example:

HTML5 introduces a handful of new elements to help us define the structure of a given web page, such as <section>, <article>, <nav>, <aside>, <header> and <footer>.

We shouldn’t use them... .

If that’s all you needed to know, great. Keep using divs with meaningful class and ID names, and appropriate <h1>-<h6> headings. They’ll be valid forever (more or less), and you’re not missing out on anything.

However, I suggest using some non-HTML5 features when marking up documents, such as ARIA attributes for blind and sight-impaired users and microdata schemas (when appropriate) for search engine results.

Chapter 3 of The Truth About HTML5 by Luke Stevens.

This is good advice and is substantially the approach IGP:FoundationXHTML takes. Process to arbitrary "with-it" HTML5 from IGP:FoundationXHTML if that is what is required. We do that for ePub3 packaging where:

becomes

It happens effortlessly and automatically because the class tagging pattern semantics are correct right at the beginning. The tools just make it happen.

The <section> element is no more semantic than <div> except it has been given a lot of meaningless explanatory luggage in various HTML5 dialogues and debates. Getting picky, <div> or division means some form of separation and <section> means inclusion. Separation is the strength of <div>. As a structural component it neatly divides complex content into boxes which can then be qualified by semantic class statements and targeted by processors. <section> is rigid.

Our other "best practice" is to keep all nesting as shallow as possible. Definitely no <section> inside <section> craziness. These core HTML elements define a SIMPLE, consistent framework for reliable, reusable, processable digital content tagging patterns.

A short book example of the Foundation Tagging Patterns

This short example shows how all sections are packaged when combined into a single file. Each section can also be packaged individually. The Chapter-rw section has internal structures to show how IGP:FoundationXHTML keeps the content nesting as flat as possible.

FX Section Detail Tagging Specification Available Here

<div class="galley-rw">
  <div class="metadata-rw MetadataWork-rw"> .... </div>
  <div class="frontmatter-rw BookTitlePage-rw"> .... </div>
  <div class="body-rw Chapter-rw">
    <div class="title-block-rw">
      <p class="title-num-rw">Section One</p>
      <h1>Section Title</h1>
      <p class="title-sub-rw">Section Sub-title</p>
      <p class="title-author-rw">Author's Name</p>
    </div>
    <h2>First section heading</h2>
    <p>The narrative starts</p>
    <h3>Second section heading</h3>
    <p>The narrative continues</p>
    <div class="text-block-rw extract-rw">
      <p>Extract text here</p>
      <p class="attribution-rw">Extract attribution</p>
    </div>
    <p>The narrative continues</p>
  </div>
  <div class="backmatter-rw Index-rw"> .... </div>
  <div class="specials-rw Advertisement-rw"> .... </div>
  <div class="processor-rw ConfigurationFixedLayout-rw"> .... </div>
</div>

Format processors can simply convert <div> to the <section> or <article> elements based on the class attribute values if that is really required for any particular package requirement.
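
To make that concrete, here is a minimal Python sketch of such a conversion. It assumes the content is well-formed XHTML with no namespace declarations, so the standard library XML parser can read it. The mapping table and the function name promote_divs are hypothetical, invented for this illustration; they are not the IGP:Digital Publisher processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from class tokens to HTML5 sectioning elements.
# A real processor would drive this from its own controlled vocabulary.
CLASS_TO_ELEMENT = {
    "frontmatter-rw": "section",
    "body-rw": "section",
    "backmatter-rw": "section",
    "Article-rw": "article",  # assumed token, not taken from the example above
}


def promote_divs(xhtml):
    """Rename <div> elements whose class tokens appear in the mapping."""
    root = ET.fromstring(xhtml)
    # Collect the divs first so renaming never interferes with traversal.
    for el in list(root.iter("div")):
        for token in (el.get("class") or "").split():
            if token in CLASS_TO_ELEMENT:
                el.tag = CLASS_TO_ELEMENT[token]
                break  # the first matching token decides the element
    return ET.tostring(root, encoding="unicode")


sample = (
    '<div class="galley-rw">'
    '<div class="body-rw Chapter-rw"><h1>Section Title</h1>'
    "<p>The narrative starts</p></div>"
    "</div>"
)
print(promote_divs(sample))
```

The class values are left untouched, so the same semantics remain available to whatever processes the output next. Only the element name changes.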

RULE No 1. Never use XML namespaces in production content.

In production we use well formed (X)HTML tagging patterns with HTML5 declarations. Especially never use epub:type in production content (Making ePub Play Nicely With HTML5. Brad Neuberg). This gives us the flexibility of MathML and SVG in the HTML without namespace declarations and the content is well-formed XSL processor ready. The HTML tagging is not done for the Internet and especially not any particular e-book format. The content tagging is done to address:

Complete and accurate structure, semantics, processing readiness, presentation rules, styling and metadata.
A digital expression of the current commercial value of the content
The processing and generation of all output formats including print PDF, all e-books, static sites, LMS packages and anything else required
Preservation of the value of the content for the future
Readiness to process the content for new requirements as they emerge.

We call this the "tag up, process down" approach. The (X)HTML5 applied to the content stored in IGP:Digital Publisher is rich, semantic, processor ready and complete. If any generated format doesn't need or can't use any particular elements or attributes they are processed out or replaced with suitable simplifications at format generation time.
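
A small Python sketch of the "process down" half may help. It is an illustration under assumptions, not our production code: the function name process_down, the choice to drop processor-rw blocks, and the data-review attribute in the sample are all hypothetical. The only thing it relies on is that the stored content is well-formed.

```python
import xml.etree.ElementTree as ET


def process_down(xhtml, drop_classes=("processor-rw",), drop_attr_prefix="data-"):
    """Return a simplified copy of rich content for a less capable format."""
    root = ET.fromstring(xhtml)

    # ElementTree has no parent pointers, so visit each parent and
    # remove the children carrying a class the target format cannot use.
    # Note: remove() also discards any tail text following that child.
    for parent in list(root.iter()):
        for child in list(parent):
            tokens = (child.get("class") or "").split()
            if any(token in drop_classes for token in tokens):
                parent.remove(child)

    # Strip attributes the target format does not need (data-* here).
    for el in root.iter():
        for name in [n for n in el.attrib if n.startswith(drop_attr_prefix)]:
            del el.attrib[name]

    return ET.tostring(root, encoding="unicode")


rich = (
    '<div class="galley-rw">'
    '<p data-review="open" class="attribution-rw">Extract attribution</p>'
    '<div class="processor-rw ConfigurationFixedLayout-rw"><p>config</p></div>'
    "</div>"
)
print(process_down(rich))
```

The stored content is never modified. Each output format gets its own simplified copy at generation time.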

RULE No 2. Drop website best practice rules. They do not apply to book publishing HTML.

Tagging Patterns

In IGP:FoundationXHTML tagging patterns replace the concept of XML validation. If the content tools consistently apply the correct well-formed patterns, the content is always ready for processing.
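
As a rough idea of what checking a pattern can look like without a Schema, here is a minimal Python sketch. The rule it enforces (a title-block-rw holds exactly one h1 and otherwise only p elements) is an assumption made for this example and is not quoted from the IGP:FoundationXHTML specification. The function name check_title_blocks is likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def check_title_blocks(xhtml):
    """Return a list of problems found in title-block-rw patterns."""
    problems = []
    # fromstring raises ParseError if the content is not well-formed,
    # which is the first and cheapest check of all.
    root = ET.fromstring(xhtml)
    for block in root.iter("div"):
        if "title-block-rw" not in (block.get("class") or "").split():
            continue
        tags = [child.tag for child in block]
        if tags.count("h1") != 1:
            problems.append("title block needs exactly one <h1>")
        for tag in tags:
            if tag not in ("h1", "p"):
                problems.append(f"unexpected <{tag}> in title block")
    return problems


chapter = (
    '<div class="body-rw Chapter-rw">'
    '<div class="title-block-rw">'
    '<p class="title-num-rw">Section One</p>'
    "<h1>Section Title</h1>"
    "</div>"
    "<p>The narrative starts</p>"
    "</div>"
)
print(check_title_blocks(chapter))
```

Extending the check for a new pattern means adding a few lines like these, rather than revising and redeploying a Schema.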

Tagging patterns beat the pants off XML DTDs and Schemas because they are complete and can be very easily modified and extended. XML Schemas and custom extensions require constant and expensive maintenance. At some stage, version maintenance results, at worst, in content processing failure.

Because IGP:Digital Publisher is a publishing content production and management application it addresses publisher content directly and has hundreds of tagging patterns for dozens of document genres built in. That means publisher oriented semantic tagging is automatic and easy.

To the extent possible, editors using IGP:Digital Publisher do not have to engage with the HTML; they work with a familiar structural and semantic vocabulary of block, paragraph and inline tagging patterns.

To understand more about XHTML tagging patterns the full IGP:FoundationXHTML specification is available online here. These structural, semantic and metadata driven tagging patterns address and resolve all the complaints being made about HTML. They are all built into IGP:FoundationXHTML, are all proven and at work in hundreds of thousands of documents and books.

HTML and CSS

We have to explicitly address the subject of the heavily maligned HTML class attribute. This attribute is the secret to maintainable publisher XHTML.

RULE No 3. The Class Attribute is King for creating content that is useful now and into the future.

For some reason many people want to deride the HTML class attribute. I have never understood why. It is close to infinitely powerful when used well.

This is compounded by the angle-brackets mess of HTML. By completely separating design (CSS), behaviour (JS), and structure (HTML) the specification gods have taken away the context that would make it easier for us mere mortals to give our documents a meaningful structure.

Baldur Bjarnason HTML is too complex

Baldur is possibly a Markdown enthusiast and that is fine for manual production of the occasional text-only fiction book. But Markdown is also more fragmented than nearly any other text content production system out there!

I don't think the breakdown here is correct. CSS can control presentation attributes (and much more) but core document design is actually in the HTML structure. As discussed above the HTML core gives the meaningful structure.

Whatever original designs there might have been for the class attribute, it’s now the de facto standard for apply CSS styling, and no matter how you try to tweak microformats to be semantic and not just class groupings I always see problems. How can you answer in any reliable way which class values are semantic indicators and which are presentational groupings, for example, especially when the semantics are intended to be extensible? The stopgaps, like prefixing semantic values to make them unique, make for a poor man’s metadata framework (and that’s not touching on the lack of a clear processing model).

Matt Garrish Semantic Overload.

The italics are mine. "Defacto standard for CSS styling?" And... so what!

"How can you answer...?" This is the easiest question to answer. Through a controlled, maintained vocabulary with or without prefixing, suffixing or case-rules. Just like XML without the policing cost of a Schema. It is also inappropriate to give the working solution a sneer-down label like "poor man's metadata framework" and raise vague objections such as "lack of clear processing model" if you haven't done it. If there is a controlled class vocabulary where is the lack of a clear processing model? I would love to understand what a "rich man's metadata framework" looks like.

CSS is only styling. This is at least a five year old viewpoint. Manipulation of class values is fundamental to core Javascript/jQuery interactivity and DOM manipulation, as there are neither many alternatives nor any need for alternatives. HTML5 introduces the data-* attribute which is highly appropriate for processing with Javascript.
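
The controlled vocabulary answer given above is also easy to process in any language, not only Javascript. The Python sketch below sorts the tokens of a class attribute into semantic and presentational groups. It rests on two assumptions made for illustration: that the vocabulary is a simple registered set, and that a "-rw" suffix marks tokens intended as vocabulary terms. The set contents, the function name sort_class_tokens and the sample tokens indent-left and extrat-rw are hypothetical.

```python
# Hypothetical registered vocabulary; a real one would be maintained
# alongside the tagging pattern specification.
VOCABULARY = frozenset({
    "galley-rw", "body-rw", "Chapter-rw", "title-block-rw",
    "text-block-rw", "extract-rw", "attribution-rw",
})


def sort_class_tokens(class_value, vocabulary=VOCABULARY, suffix="-rw"):
    """Split a class attribute value into three groups of tokens."""
    result = {"semantic": [], "unregistered": [], "presentational": []}
    for token in class_value.split():
        if token in vocabulary:
            result["semantic"].append(token)
        elif token.endswith(suffix):
            # Looks like a vocabulary term but is not registered:
            # most likely a typo or a term awaiting approval.
            result["unregistered"].append(token)
        else:
            result["presentational"].append(token)
    return result


print(sort_class_tokens("text-block-rw extract-rw indent-left extrat-rw"))
```

Which class values are semantic indicators is then a lookup, and the unregistered group is the policing step a Schema would otherwise provide.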

Just because a commentator thinks of the class attribute as "web CSS styling" doesn't mean book production professionals should put their heads into that dark hole. It's called class because of the object-orientation, not a social structure statement. This is why it is so important to take a publisher's-eye view of HTML5 rather than that of a web designer. To further reinforce this point:

The class attribute, on the other hand, assigns one or more class names to an element; the element may be said to belong to these classes. A class name may be shared by several element instances. The class attribute has several roles in HTML:

As a style sheet selector (when an author wishes to assign style information to a set of elements).
For general purpose processing by user agents.

W3C HTML 4.01 Specification

3.2.5.7 The class attribute

Every HTML element may have a class attribute specified.

The attribute, if specified, must have a value that is a set of space-separated tokens representing the various classes that the element belongs to.

The classes that an HTML element has assigned to it consists of all the classes returned when the value of the class attribute is split on spaces. (Duplicates are ignored.)

Assigning classes to an element affects class matching in selectors in CSS, the getElementsByClassName() method in the DOM, and other such features.

There are no additional restrictions on the tokens authors can use in the class attribute, but authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content.

W3C HTML5 DOM.html

The statement "authors are encouraged to use values that describe the nature of the content" makes it clear that semantic use of class is recommended. How that is done is up to the end user. Drive on the wrong side of the road and you will have an accident. This is where mediocre namespace constructs like the dead-on-arrival epub:type attribute fail on delivery. Yes you can apply them (we do because it is easy). What do they actually do? Some of the definitions are relatively OK. Many are a digital content compromise disaster.

Every Publisher Needs To Address Their Content

Publishers do not need to, and should not, tag to some specification-agreed content standard. For a start that specification can never be written given the infinite variety and complexity of content. Publishers need to maintain their content tagging so they control their content. That means a very different approach from the XML Schema camp.

RULE No 4. Never let your content be tagged with an arbitrary XML Schema.

Our customers have been using IGP:FoundationXHTML since 2007. That is hundreds of publishers and tens of thousands of trade, academic and textbooks; plus magazines, academic articles, exam papers, government reports and many other document genres. In that time the following e-book formats have been introduced: Kindle KF-7, Kindle KF-8, ePub2, ePub3, Google Search inside PDF, WebApps and Apps. We have seen the development of the HTML5 WebApp and now Apps.

The same (X)HTML tagging patterns have seamlessly generated all of these formats as they emerged, including specific platform quirk treatments.

The job is to build off a well designed (X)HTML(5) foundation that makes everything just work rather than complain because you have the wrong tools or approach. A foundation that is applied consistently can be maintained, extended and optimized into and for the future.

There is absolutely no value for a publisher to adopt some common XML Schema. They should be using (X)HTML; the primary, current and future delivery method for all content.

It is absolutely clear that publishers cannot afford to build a valuable digital content strategy on a format package like ePub or proprietary formats. These are delivery formats only and have no valid or valuable properties for long-term digital content ownership. They encapsulate thinking and technologies of yesterday in a world that is moving fast into tomorrow.

It is a fact that standards and specification bodies fiddle with their things. They constantly make mistakes and wrong decisions. Even worse, they make unforgivable compromises. No publisher should ever buy into those limitations.

Get on the HTML5 Wagon

As publishers and publisher solution providers we have to accept that the content that is made ready for print PDF, e-book or other forms of digital content distribution today must be ready instantly for any format package, use online, WebApp, App, or export and processing by any other system. If nothing stated here convinces you that HTML5 is the real answer for valuable future ready unstructured content then this article by Robin Berjon of the W3C might. Web 2024. Waiting for the singularity.

By using (X)HTML5 with well formed tagging patterns as the production method "small things" like serialization into JSON for secure delivery are just part of the process. We use JSON today for the delivery of content to AZARDI mobile reading systems.
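
To show why this is a small thing when the content is well-formed, here is a minimal Python sketch that turns an XHTML fragment into JSON. The JSON shape (tag, attrs, content) and the function names are assumptions chosen for this example. They do not describe the format actually delivered to AZARDI, and the sketch does not cover the security side of delivery.

```python
import json
import xml.etree.ElementTree as ET


def element_to_node(el):
    """Convert one element into a JSON-ready dict, keeping mixed content."""
    content = []
    if el.text:
        content.append(el.text)
    for child in el:
        content.append(element_to_node(child))
        if child.tail:
            # Text that follows a child element belongs to the parent.
            content.append(child.tail)
    return {"tag": el.tag, "attrs": dict(el.attrib), "content": content}


def xhtml_to_json(xhtml):
    """Serialize a well-formed XHTML fragment as a JSON string."""
    return json.dumps(element_to_node(ET.fromstring(xhtml)), ensure_ascii=False)


fragment = (
    '<div class="text-block-rw extract-rw">'
    "<p>Extract text here</p>"
    '<p class="attribution-rw">Extract attribution</p>'
    "</div>"
)
print(xhtml_to_json(fragment))
```

Because the class values travel with every node, the receiving reading system keeps the full semantics of the tagging patterns.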

There has to be a significant change in the way publishers think about the tools they use for digital content production and the value of the content that is generated by those tools.

Posted by Richard Pipe