MOD51 Analysing PDF Typography

24 April 2013


The question we are answering is "How do I get a complete typography specification conformance report from a PDF?"

A focus in our never-ending adventures in digital-content-land is to work out how many ways there are to slice 'n' dice content. There is a saying that "any finite set of data has an infinite number of interpretations". Typography analysis of a PDF is today's content sub-adventure on the march towards infinity.

One of the big features of IGP:Digital Publisher is the ability to fully typeset multiple editions of a book, under human control, directly from XHTML without going through any desktop programs.

The benefits of this are relatively obvious if your content has any future value:

  • Valuable publisher digital content is managed in a proper manner, off the desktop.
  • You can create multiple print AND e-book editions instantly from the same content source. 
  • Your content is instantly ready for any new business requirements, new formats or new platforms.
  • You can remix and do all sorts of wonderful things with appropriate content and still deliver any format.

We achieve this through IGP:Typography In the Browser. It's a radically different approach to typesetting but delivers substantially the same result as desktop programs with some things being easier, some more difficult.

Content as Numbers

Now about typography layout analysis. Try interpreting content as nothing but metrics! That is what the development team did with our new MOD51 Typography analysis tool for IGP:Typography In the Browser.

On the inside a PDF is just a big "X-Y" layout number map with coordinates for the position and size of every character. We discussed the horror of this in a previous post. It's a confusing stack of numbers.

MOD51 takes any PDF and creates a content layout metric map right down to the full stops. Then our top-secret analysis algorithm kicks into life and labels all the bits. Next, an interpretation algorithm turns the analysis data into pass/fail results against the typesetting rules. Easy!
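To make the "content as numbers" idea a little more concrete, here is a minimal Python sketch of turning positioned characters into text lines. It is not the MOD51 algorithm, which is not described in this post. The `Glyph` record, the `group_into_lines` function and the `y_tolerance` value are hypothetical names chosen for illustration, and the sketch assumes the characters have already been pulled out of the PDF with their coordinates.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character (hypothetical shape, for illustration only)."""
    char: str
    x: float  # left edge, in points
    y: float  # baseline, in points
    page: int


def group_into_lines(glyphs, y_tolerance=1.0):
    """Group glyphs into text lines by page and baseline.

    Assumption: glyphs whose baseline is within y_tolerance points of the
    first glyph placed on a line belong to that line. Real PDFs need more
    care (superscripts, columns, rotated text).
    """
    lines = []  # each entry: (page, baseline_y, [glyphs])
    # PDF y coordinates grow upwards, so sort by descending y to read top-down.
    for g in sorted(glyphs, key=lambda g: (g.page, -g.y, g.x)):
        if lines and lines[-1][0] == g.page and abs(lines[-1][1] - g.y) <= y_tolerance:
            lines[-1][2].append(g)
        else:
            lines.append((g.page, g.y, [g]))
    # Within each line, order the characters left to right.
    return [
        (page, "".join(g.char for g in sorted(row, key=lambda g: g.x)))
        for page, _, row in lines
    ]


sample = [
    Glyph("H", 72.0, 700.0, 1), Glyph("i", 80.0, 700.2, 1),
    Glyph("o", 72.0, 686.0, 1), Glyph("k", 79.0, 686.0, 1),
]
lines = group_into_lines(sample)
```

Once characters are grouped into lines like this, counts per line, per paragraph and per page are just arithmetic on the map.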

MOD51 Typography Exposed

MOD51 is our generic PDF analyser previously discussed here. Not only does MOD51 now faultlessly extract content from a PDF to create awesome production-ready XHTML, it has also been extended to analyse typesetting properties using the magic layout numbers hidden in every PDF.


MOD51

We have a growing library of content processors that do just about everything you could want to every type of content. When we started this project the next module number just fell on 51. It was fate. This project was so challenging at the outset that the alien connotations were reminiscent of the legendary Area 51. It seemed wrong not to immortalize that number in the product.

Moving on

Any typesetting application tries to get the defined typography rules right. These rules boil down to something like this:

  • Section turns: line count on the last page, sometimes character count on the last line.
  • Page turns: widows and orphans, page-turn hyphens, character count on a widow.
  • Paragraph turns: last-line character count and hyphenation.
  • Line turns: hyphenation, character counts.
  • Ladders: stacks of hyphens or similar word endings.
  • Bad tracking: loose lines, tight lines, etc.
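Of these rules, the ladder check is probably the easiest to express in code. The Python sketch below assumes a paragraph is already available as a list of line strings. `find_ladders` and `max_run` are hypothetical names, the threshold of two is an assumption, and only trailing hyphens are considered, not repeated word endings.

```python
def find_ladders(lines, max_run=2):
    """Return (start_index, run_length) for each run of consecutive
    hyphen-ended lines that is longer than max_run. Indexes are 0-based."""
    ladders = []
    run_start, run_len = None, 0
    # The extra empty line at the end flushes a run that reaches the last line.
    for i, line in enumerate(list(lines) + [""]):
        if line.rstrip().endswith("-"):
            if run_len == 0:
                run_start = i
            run_len += 1
        else:
            if run_len > max_run:
                ladders.append((run_start, run_len))
            run_len = 0
    return ladders


para = ["a long para-", "graph with con-", "secutive hy-", "phens in it", "and a normal line"]
ladders = find_ladders(para)
```

The other rules follow the same pattern: walk the lines, count something, compare it with a threshold.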

IGP:Typography In the Browser uses PrinceXML for the primary PDF generation. PrinceXML is a highly competent layout application. But like all print layout software there are things it cannot reasonably do because of the infinite variety of digital content. We humans are not going to be out of a job any time soon fixing up the fuzzy edge decisions which the layout programs cannot handle!

MOD51 Typography Analysis works like this

Set up your type-spec

You define your core typesetting rules on a universal, imprint or book basis in a pretty standard sort of settings interface.

Setting up a type-spec is relatively easy. Getting the content to match it the whole way is a little more challenging.
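As a rough picture of what a type-spec holds, here is a hedged Python sketch. The `TypeSpec` class, its field names and the `resolve` helper are all hypothetical and are not the IGP:Digital Publisher data model. Most default values are taken from the "should be at least" and "Required is" figures in the sample report later in this post; the two marked as assumptions are guesses from the errors that report flags. The idea that book settings override imprint settings, which override universal ones, is also an assumption.

```python
from dataclasses import dataclass, replace
from fractions import Fraction


@dataclass(frozen=True)
class TypeSpec:
    """Hypothetical type-spec record; field names are illustrative only."""
    section_last_page_min_lines: int = 5
    section_last_line_min_words: int = 3
    widow_min_lines: int = 2
    widow_line_min_fraction: Fraction = Fraction(2, 3)
    para_last_line_min_chars: int = 6
    max_consecutive_hyphens: int = 2    # assumption: the report flags runs of 3
    trailing_hyphen_min_chars: int = 3  # assumption: the report flags 2 characters


def resolve(universal, imprint_overrides=None, book_overrides=None):
    """Apply imprint overrides, then book overrides, on top of the universal spec."""
    spec = universal
    for overrides in (imprint_overrides, book_overrides):
        if overrides:
            spec = replace(spec, **overrides)
    return spec


universal = TypeSpec()
book_spec = resolve(
    universal,
    imprint_overrides={"para_last_line_min_chars": 8},
    book_overrides={"widow_min_lines": 3},
)
```

Keeping the spec immutable and layering overrides on top means a single book can tighten one rule without touching the universal settings.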

The Document Report

The PDF is generated from the IGP:FoundationXHTML for the applicable Design Profile. It is then submitted to MOD51 Typography, which generates a human-readable report highlighting all the errors it finds in the PDF. The report is presented in descending order of document impact. (The errors have been shortened in this example to keep the post under 10,000 lines!)

SECTION TURN REPORT
0201: Page 55. Line count short on section last page. 2 lines, should be at least 5 lines.
0210: Page 98. Last line word count short, 1 words, should be at least 3 words.
0210: Page 112. Last line word count short, 1 words, should be at least 3 words.
Total Number of Section Errors : 3
WIDOWS AND ORPHANS
Total Number of Page Orphan Errors : 0
0302: Page 7. Widow count is 1. Required is 2.
0302: Page 8. Widow count is 1. Required is 2.
0302: Page 16. Widow count is 1. Required is 2.
0303: Page 16. Widow line character count is short. It must be at least two-thirds of a line.
0302: Page 18. Widow count is 1. Required is 2.
... shortened....
Total Number of Page Widow Errors : 35
PARAGRAPH REPORT
0401: Page 6. Para 30. Last line is 4 characters. Required is 6 minimum.
0401: Page 10. Para 63. Last line is 4 characters. Required is 6 minimum.
0401: Page 10. Para 67. Last line is 5 characters. Required is 6 minimum.
0401: Page 16. Para 99. Last line is 5 characters. Required is 6 minimum.
... shortened ...
Total Number of Last Line Char Count Errors : 35
EOL HYPHENATION REPORT
Total Number of Leading Hyphenation Errors : 0
0502: Page 4. Para 23. Para Line 6. 2 character on trailing hyphenation.
0502: Page 4. Para 23. Para Line 11. 2 character on trailing hyphenation.
0502: Page 7. Para 38. Para Line 3. 2 character on trailing hyphenation.
0502: Page 12. Para 79. Para Line 2. 2 character on trailing hyphenation.
... shortened ...
Total Number of Trailing Hyphenation Errors : 18
0510: Page 60. Para 383 has 3 consecutive hyphenated lines at Para Line 1.
0510: Page 121. Para 859 has 3 consecutive hyphenated lines at Para Line 4.
Total Number of Consecutive Hyphenation Errors : 2
LOOSE LINE REPORT
Total Number of Loose Line Errors : 0
SUMMARY TIMING [A little bit of technical timing stuff]
total time for Section Report : 0.0391ms
total time for Page Orphan Report : 0.0362ms
total time for Page Widow Report : 0.1370ms
total time for Last Line Char Count Report : 0.0420ms
total time for Leading Hyphenation Report : 0.0282ms
total time for Trailing Hyphenation Report : 0.0162ms
total time for Consecutive Hyphenation Report : 0.0282ms
total time for Loose Line Report : 0.0284ms
Analysis Time : 36.3291sec
Processing Time : 0.3568sec 

So MOD51 took around 37 seconds to fully analyse and process around 150 print PDF pages. Notice it is the analysis that takes the time. That's not bad and a lot faster than turning the pages.
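The split between slow analysis and fast processing makes sense: once the layout map exists, each rule is a cheap pass over it. As an illustration, here is a minimal Python sketch of the paragraph last-line check that produces lines in the same shape as the 0401 entries above. The function name, the input structure, the single-line exemption and the way characters are counted (everything on the line after trimming whitespace) are assumptions, not the MOD51 implementation.

```python
def check_last_lines(paragraphs, min_chars=6):
    """Return report lines for paragraphs whose last line is too short.

    paragraphs: iterable of (page, para_number, [line strings]).
    """
    errors = []
    for page, para, lines in paragraphs:
        if len(lines) < 2:
            continue  # assumption: single-line paragraphs are exempt
        last_len = len(lines[-1].strip())
        if last_len < min_chars:
            errors.append(
                f"0401: Page {page}. Para {para}. "
                f"Last line is {last_len} characters. Required is {min_chars} minimum."
            )
    return errors


sample = [
    (6, 30, ["The quick brown fox jumps over the lazy", "dog."]),
    (6, 31, ["A paragraph that ends", "comfortably."]),
]
report_lines = check_last_lines(sample)
```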

Dynamic Section Report

Once a production editor starts the actual typesetting work and the page extent is pretty much set, they work on a single chapter at a time to achieve that typographical perfection. If a chapter is longish, the widow corrections you make on Page 2 may well cause two new widows or some other problem further along in the chapter!

The Dynamic Section Report gives you a page by page look-ahead report for just the section that is being worked on. This time it is sequenced by generated section page numbers so you know exactly which pages to look at within the active section. This report is available at any time within seconds.

Page 2.
Widow count is 1.
Para 17. Para Line 1 is loose line.
Page 4.
Widow count is 1.
Page 8.
Widow count is 1.
Widow line character count is short.
Page 9.
Para 88. Last line is 4 characters.

This report is brief and easy to navigate through while inspecting and making corrections with the TIB tracking and flow-control tools. Having this look-ahead map of typographical layout problems created by any action makes the job a lot faster.
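The regrouping behind a page-by-page report is simple once the errors exist. Here is a minimal Python sketch, assuming each error is a hypothetical `(page, message)` pair; `section_report` is an illustrative name, not a MOD51 function.

```python
from collections import defaultdict


def section_report(errors):
    """Format (page, message) pairs as a page-by-page report.

    Pages appear in ascending order; messages keep their input order
    within a page.
    """
    by_page = defaultdict(list)
    for page, message in errors:
        by_page[page].append(message)
    out = []
    for page in sorted(by_page):
        out.append(f"Page {page}.")
        out.extend(by_page[page])
    return "\n".join(out)


errors = [
    (4, "Widow count is 1."),
    (2, "Widow count is 1."),
    (2, "Para 17. Para Line 1 is loose line."),
    (9, "Para 88. Last line is 4 characters."),
]
report = section_report(errors)
```

Because the grouping is cheap, rerunning it after every correction is what makes a look-ahead report practical.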

Here is the TIB interface with XHTML on the left, PDF and section report on the right. The temporary paragraph numbers allow correlation between the FX, PDF and report.

And On Into Infinity...

You can pretty much throw any PDF at MOD51 Typography and it will grind away and tell you how it doesn't conform to your typesetting specification.

With IGP:Digital Publisher and IGP:Typography In the Browser, typesetting is an e-production issue as much as a print production issue. There isn't a knife between them.

Typographical settings are stored independently for each print edition away from the primary XHTML. This means there is no effect on the core content for e-Books or other format content. It significantly changes the way production people engage with content and turns print into "Just another format", which is where it should probably be in 2013.

Next, our cloud of MOD51 metrics will be used to give recommendations for shortening or lengthening paragraphs to solve vexatious widow and para-turn problems. If it can't, it will just state bluntly that it can't do it and call for human help. For example: "Widow on Page 3 can be pulled back by shortening Para 14".

The final part of this journey is a multi-pass processor that has an honest shot at reflowing a book before calling for human assistance, substantially reducing the human effort. However, that may just be getting too close to the real Area 51!

   

Posted by Richard Pipe

 
