24 April 2013
The question we are answering is "How do I get a complete typography specification conformance report from a PDF?"
A focus in our never-ending adventures in digital-content-land is to work out how many ways there are to slice 'n' dice content. There is a saying that "any finite set of data has an infinite number of interpretations". Typography analysis of a PDF is today's content sub-adventure on the march towards infinity.
One of the big features of IGP:Digital Publisher is the ability to fully typeset multiple editions of a book, under human control, directly from XHTML without going through any desktop programs.
The benefits of this are relatively obvious if your content has any future value:
We achieve this through IGP:Typography In the Browser. It's a radically different approach to typesetting but delivers substantially the same result as desktop programs, with some things being easier and some more difficult.
Now about typography layout analysis. Try interpreting content as nothing but metrics! That is what the development team did with our new MOD51 Typography analysis tool for IGP:Typography In the Browser.
On the inside a PDF is just a big "X-Y" layout number map with coordinates for the position and size of every character. We discussed the horror of this in a previous post. It's a confusing stack of numbers.
MOD51 takes any PDF and creates a content layout metric map right down to the full stops. Then our top-secret analysis algorithm kicks into life and labels all the bits. Next the analysis data is run through an interpretation algorithm that applies the pass/fail typesetting rules. Easy!
MOD51 is our generic PDF analyser previously discussed here. Not only does MOD51 now faultlessly extract content from a PDF to create awesome production-ready XHTML, it has also been extended to analyse typesetting properties using the magic layout numbers hidden in every PDF.
We have a growing library of content processors that do just about everything you could want to every type of content. When we started this project the next Module number just fell on 51. It was fate. The project was so challenging at the outset that it had alien connotations reminiscent of the legendary Area 51. It seemed wrong not to immortalize that number in the product.
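To make the "metric map" idea a little more concrete, here is a minimal sketch of one early step: turning positioned characters back into text lines by clustering on the baseline coordinate. This is not the MOD51 algorithm. The `Glyph` record, the `group_into_lines` function and the 0.5 point tolerance are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character from a PDF page."""
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF y grows upwards)
    width: float


def group_into_lines(glyphs, tolerance=0.5):
    """Cluster glyphs whose baselines agree within `tolerance` points.

    Returns the text of each line, top of page first.
    """
    lines = []
    # Sort top-to-bottom first so glyphs on the same baseline are adjacent.
    for glyph in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0].y - glyph.y) <= tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    # Within a line, read left to right.
    return ["".join(g.char for g in sorted(line, key=lambda g: g.x))
            for line in lines]


page = [
    Glyph("i", 10, 700.1, 3), Glyph("H", 0, 700.0, 8),
    Glyph(".", 20, 686.0, 3), Glyph("o", 0, 686.0, 6), Glyph("k", 8, 686.2, 6),
]
text_lines = group_into_lines(page)
```

Once characters are back in lines, the counts that the typesetting rules need (characters per line, lines per page, trailing hyphens) become simple to compute.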
Any typesetting application tries to get the defined typography rules right. These rules boil down to something like this:
Section turns: line count on the last page, sometimes character count on the last line
Page turns: widows and orphans, page-turn hyphens, character count on a widow
Paragraph turns: last-line character count and hyphenation
Line turns: hyphenation, character counts
Ladders: stacks of hyphens or similar word endings
Bad tracking: loose lines, tight lines, etc.
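Two of these rules are simple enough to sketch once a paragraph is available as a list of text lines. The following is an illustration only, not MOD51's code: the function names and the default thresholds are assumptions, and counting every character of the stripped last line (spaces and punctuation included) is a guess at how such a count might be made.

```python
def hyphen_ladders(lines, max_run=2):
    """Find runs of consecutive hyphen-ended lines longer than `max_run`.

    Returns a list of (first_line_number, run_length), numbering from 1.
    """
    ladders = []
    run_start, run_len = None, 0
    for number, text in enumerate(lines, start=1):
        if text.rstrip().endswith("-"):
            if run_len == 0:
                run_start = number
            run_len += 1
        else:
            if run_len > max_run:
                ladders.append((run_start, run_len))
            run_len = 0
    # A ladder may run to the very end of the paragraph.
    if run_len > max_run:
        ladders.append((run_start, run_len))
    return ladders


def last_line_too_short(lines, min_chars=6):
    """True when the paragraph's last line has fewer than `min_chars` characters."""
    return len(lines[-1].strip()) < min_chars


para = ["the typo-", "graphy ana-", "lysis re-", "port is done", "now."]
ladders = hyphen_ladders(para)
short = last_line_too_short(para)
```

Here `ladders` reports one run of three hyphenated lines starting at line 1, and `short` is true because the last line has only 4 characters.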
IGP:Typography in the Browser uses PrinceXML for the primary PDF generation. PrinceXML is a highly competent layout application. But like all print layout software, there are things it cannot reasonably do because of the infinite variety of digital content. We humans are not going to be out of a job any time soon fixing up the fuzzy edge decisions which the layout programs cannot handle!
You define your core typesetting rules on a universal, imprint or book basis in a pretty standard sort of settings interface.
The PDF is generated from the IGP:FoundationXHTML for the applicable Design Profile. It is then submitted to MOD51 Typography, which generates a human-readable report for the PDF highlighting all the discovered errors. This is presented in descending document impact order. (The errors have been shortened in this example to keep the post under 10,000 lines!)
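One way to picture rules defined at three levels is as layered settings, where a book-level value overrides an imprint-level value, which in turn overrides the universal default. That override order is an assumption, as are the setting names and most of the values below (the widow and last-line minimums echo the sample report). The sketch uses Python's standard `collections.ChainMap`.

```python
from collections import ChainMap

# Hypothetical rule sets; the keys are invented for illustration.
universal = {
    "min_widow_lines": 2,
    "min_last_line_chars": 6,
    "max_consecutive_hyphens": 2,
}
imprint = {"min_last_line_chars": 8}      # an imprint tightens one rule
book = {"max_consecutive_hyphens": 3}     # one book relaxes another

# ChainMap searches left to right, so the narrowest scope wins.
rules = ChainMap(book, imprint, universal)
effective = dict(rules)
```

With these values `effective` holds 2 widow lines from the universal level, 8 last-line characters from the imprint, and 3 consecutive hyphens from the book.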
SECTION TURN REPORT
0201: Page 55. Line count short on section last page. 2 lines, should be at least 5 lines.
0210: Page 98. Last line word count short, 1 words, should be at least 3 words.
0210: Page 112. Last line word count short, 1 words, should be at least 3 words.
Total Number of Section Errors : 3

WIDOWS AND ORPHANS
Total Number of Page Orphan Errors : 0
0302: Page 7. Widow count is 1. Required is 2.
0302: Page 8. Widow count is 1. Required is 2.
0302: Page 16. Widow count is 1. Required is 2.
0303: Page 16. Widow line character count is short. It must be at least two-thirds of a line.
0302: Page 18. Widow count is 1. Required is 2.
... shortened....
Total Number of Page Widow Errors : 35

PARAGRAPH REPORT
0401: Page 6. Para 30. Last line is 4 characters. Required is 6 minimum.
0401: Page 10. Para 63. Last line is 4 characters. Required is 6 minimum.
0401: Page 10. Para 67. Last line is 5 characters. Required is 6 minimum.
0401: Page 16. Para 99. Last line is 5 characters. Required is 6 minimum.
... shortened ...
Total Number of Last Line Char Count Errors : 35

EOL HYPHENATION REPORT
Total Number of Leading Hyphenation Errors : 0
0502: Page 4. Para 23. Para Line 6. 2 character on trailing hyphenation.
0502: Page 4. Para 23. Para Line 11. 2 character on trailing hyphenation.
0502: Page 7. Para 38. Para Line 3. 2 character on trailing hyphenation.
0502: Page 12. Para 79. Para Line 2. 2 character on trailing hyphenation.
... shortened ...
Total Number of Trailing Hyphenation Errors : 18
0510: Page 60. Para 383 has 3 consecutive hyphenated lines at Para Line 1.
0510: Page 121. Para 859 has 3 consecutive hyphenated lines at Para Line 4.
Total Number of Consecutive Hyphenation Errors : 2

LOOSE LINE REPORT
Total Number of Loose Line Errors : 0

SUMMARY TIMING [A little bit of technical timing stuff]
total time for Section Report : 0.0391ms
total time for Page Orphan Report : 0.0362ms
total time for Page Widow Report : 0.1370ms
total time for Last Line Char Count Report : 0.0420ms
total time for Leading Hyphenation Report : 0.0282ms
total time for Trailing Hyphenation Report : 0.0162ms
total time for Consecutive Hyphenation Report : 0.0282ms
total time for Loose Line Report : 0.0284ms
Analysis Time : 36.3291sec
Processing Time : 0.3568sec
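Because every error line in the report follows the same "code: Page N. message" shape, the report is also easy to read back by machine. The sketch below assumes only that shape, as seen in the sample above. The regular expression, the `parse_report` function and the record layout are illustrative assumptions, not part of MOD51.

```python
import re
from collections import Counter

# Matches error lines such as "0302: Page 7. Widow count is 1. Required is 2."
ERROR_LINE = re.compile(r"^(?P<code>\d{4}): Page (?P<page>\d+)\. (?P<message>.+)$")


def parse_report(text):
    """Return one dict per error line; headings and totals are skipped."""
    records = []
    for raw in text.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match:
            records.append({
                "code": match["code"],
                "page": int(match["page"]),
                "message": match["message"],
            })
    return records


sample = """SECTION TURN REPORT
0201: Page 55. Line count short on section last page. 2 lines, should be at least 5 lines.
Total Number of Section Errors : 3
0302: Page 7. Widow count is 1. Required is 2.
0302: Page 8. Widow count is 1. Required is 2.
"""
records = parse_report(sample)
errors_per_code = Counter(record["code"] for record in records)
```

From there the records can be counted per error code, filtered by page, or fed into another tool.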
So MOD51 took around 37 seconds to fully analyse and process around 150 print PDF pages. Notice it is the analysis that takes the time. That's not bad and a lot faster than turning the pages.
Once a production editor starts the actual typesetting work and the page extent is pretty much set, you are working with a single chapter at a time to achieve that typographical perfection. If a chapter is longish, the widow corrections you make for Page 2 may well cause two new widows or some other problem further along in the chapter!
The Dynamic Section Report gives you a page-by-page look-ahead report for just the section that is being worked on. This time it is sequenced by generated section page numbers so you know exactly which pages to look at within the active section. This report is available at any time within seconds.
Page 2. Widow count is 1. Para 17. Para Line 1 is loose line.
Page 4. Widow count is 1.
Page 8. Widow count is 1. Widow line character count is short.
Page 9. Para 88. Last line is 4 characters.
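The difference from the full report is mostly one of ordering: errors are grouped by page rather than by rule. A minimal sketch of that regrouping follows, assuming errors arrive as hypothetical `(page, message)` pairs in any order. The `section_report` function is invented for illustration.

```python
from collections import defaultdict


def section_report(errors):
    """Group (page, message) pairs into one report line per page, in page order."""
    by_page = defaultdict(list)
    for page, message in errors:
        by_page[page].append(message)
    return [f"Page {page}. " + " ".join(by_page[page]) for page in sorted(by_page)]


# Errors as the rule checks might emit them, grouped by rule rather than page.
errors = [
    (2, "Widow count is 1."),
    (4, "Widow count is 1."),
    (8, "Widow count is 1."),
    (8, "Widow line character count is short."),
    (2, "Para 17. Para Line 1 is loose line."),
    (9, "Para 88. Last line is 4 characters."),
]
report_lines = section_report(errors)
```

Run on these pairs, the function reproduces the four lines of the sample report above.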
This report is brief and easy to navigate through while inspecting and making corrections with the TIB tracking and flow-control tools. Having this look-ahead map of typographical layout problems created by any action makes the job a lot faster.
You can pretty much throw any PDF at MOD51 Typography and it will grind away and tell you how it doesn't conform to your typesetting specification.
With IGP:Digital Publisher and IGP:Typography In the Browser, typesetting is an e-production issue as much as a print production issue. There isn't a knife between them.
Typographical settings are stored independently for each print edition away from the primary XHTML. This means there is no effect on the core content for e-Books or other format content. It significantly changes the way production people engage with content and turns print into "Just another format", which is where it should probably be in 2013.
Next our cloud of MOD51 metrics will be used to give recommendations for shortening or lengthening paragraphs to solve vexatious widow and para-turn problems. If it can't, it will just state bluntly that it can't do it and call for human help. For example, "Widow on Page 3 can be pulled back by shortening Para 14".
The final part of this journey is a multi-pass processor that has an honest shot at reflowing a book before calling for human assistance, to substantially reduce the human effort. However, that just may be getting too close to the real Area 51!
Posted by Richard Pipe