Culture clashes II: PDF, XML and what's in it for me?

When I wrote this post, I left out a whole second "trigger" because of time and energy.

That trigger--once again, wondering whether my humanities background (rhetoric major, math minor) leaves me simply unable to cope with the true Scientific Mind--regarded the format used for publication.

Or, to put it another way, the widespread and vehemently expressed view that PDF sucks (to use a polite version).

What I saw, in several conversations, was a seeming demand from text-miners that everything must be in HTML (or, better, XML) so it was easy to mine, with a complete disdain for layout and typography as irrelevant. (I can only imagine Donald Knuth's response to the concept that typography and layout don't matter...)

Why some of us humanists use PDF

Because we care about typography. Because we care about the presentation of what we've written. Because PDF--and, of portable formats, only PDF--can assure us that the typefaces and layouts we've chosen will be rendered properly for the reader.

And because it's easy--pretty much automatic on the Mac, and not difficult on the PC (there's a free Office download to define a PDF printer; I use Acrobat because it produces much smaller PDF files and because it can combine many PDFs into a single file, but for 95% of users, the free download's good enough).

Getting from there to HTML

So you want HTML? Make it easy. Actually, for Word 2007, it isn't bad: Save as Web page (filtered), and you get not-too-ugly HTML. (Since .docx is actually an XML package, it probably should be better than it is.) But you have to tune an HTML-version stylesheet if you really want to do both well--one that only uses "easy" typefaces, for example. It won't be elegant HTML, but it will work.

But, even here, what's in it for me? Can you demonstrate that I'll get more money, more fame, or even significantly more readers by taking those small steps?
"It makes it easier for me to plunder your text for my own purposes" is not, I hate to say, a terribly convincing reason. It might be for you, but it isn't for me.

Still...after years of doing only PDF for my own peculiar ejournal, I started doing Word's filtered HTML for most essays, because it did seem to serve some subset of readers--and it didn't add substantially to the production task. But whenever I read one of the HTML versions, I wince a little: It's just not as good as the PDF.

Going beyond HTML

But, you know, I think you want more than HTML. I think you want semantics--XML or better.

Provision of good-quality HTML from a regular writing-and-layout stream is at least plausible, with no real extra effort on the part of the writers and editors.

Provision of semantics, though--that's a huge additional effort, and I don't believe it's one that's readily automatable for non-trivial instances.

Which magnifies the question: What's in it for me?

I'm honestly interested in the answers. "Some neato research down the line that will earn someone else grants and tenure" may not be a wonderful answer. Just sayin'.


Update, June 25, 2009:
Based on one comment (not here--ah, the multifarious conversational channels!) I should stress that, when I say "What's in it for me?" I'm not suggesting that there are no reasons to use HTML. Of course there are. (Hmm. I'm writing this in HTML, because it suits blogging--and, unlike WordPress' editor, this editor is pretty much raw HTML, other than automatic paragraph breaks.)

I'm suggesting that there are also legitimate reasons to use PDF.

Really, "what's in it for me?" (a phrase I rarely use) has more to do with demands for HTML--not for readability, but for text-mining--and pressures to do more than HTML. And the constant "PDF sucks!" refrain.

As noted above, I do provide HTML versions of (most) Cites & Insights essays (except for a small number that just don't work well that way and one "print bonus" feature that appears sometimes)--because some people asked me nicely to do so as an alternative for those who really want to read online, and because it had been a while since people were demanding that my free publication should be revamped to suit their own preferences.

(Yes, I do mean demanding, in at least one case with fairly strong language. My standard response, after the unmailed two-word/seven-letter one, was that there are lots of other things to read on the web...)

Funny you should mention Donald Knuth while at the same time insisting that only PDF can provide true full-fidelity typography and presentation.

By D. C. Sessions on 24 Jun 2009

Well, here's what I said:

"Because PDF--and, of portable formats, only PDF--can assure us that the typefaces and layouts we've chosen will be rendered properly for the reader."

So can you point me to a free TeX download that anyone can figure out, that plugs into Word or OpenOffice, and produces files that everyone else can look at with full fidelity using trivially-downloadable freeware? When I do a little naive searching, I find recommendations to send people TeX files...converted to PDFs. And, for that matter, variants that turn TeX into PDFs directly.

I may be dead wrong here (and I have the greatest respect for Knuth's work), but when I see direct advice not to send .dvi files and to use TeX-to-PDF converters, I wonder just how "portable" TeX is, in the "available to pretty much anybody with a computer even if they're not techies" sense I have in mind. (Hmm...I see that the LaTeX newsletter is distributed in PDF form...)

A few somewhat random thoughts:

It is all text, and anyone with a reasonable set of tools can mine your PDFs or convert them to HTML if the layout is fairly linear (and even if it is not; a linear layout just makes it easier).

It is not true that PDF renders identical and perfect results everywhere and all the time. That is something of a fallacy. It can depend on the installed fonts, for instance.

One of the beauties of the typical Linux distribution is that saving as PDF is routine in most software, and translating among various formats is run of the mill and the tools and support readily available.

It is easy to use HTML as a quick and dirty formatting technique. It is easy to go from HTML or XML to PDF. Proper HTML can stand in for XML. It is easy to obtain or write your own filter to convert a personalized markdown language so that text can be converted to XML/HTML quickly.
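
As a rough illustration of the "write your own filter" idea, here is a minimal sketch in Python. The markup rules (a leading "# " for a heading, *word* for emphasis, **word** for strong, blank lines between paragraphs) and the names `inline` and `to_html` are assumptions made up for this example, not any standard markdown dialect or existing tool.

```python
import html
import re


def inline(text):
    """Escape HTML-special characters, then convert **strong** and *emphasis*."""
    text = html.escape(text, quote=False)
    # Handle the two-asterisk form first so it is not mistaken for emphasis.
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text


def to_html(source):
    """Convert the toy markup (hypothetical rules, see above) to HTML."""
    out = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", source.strip()):
        lines = [line.strip() for line in block.splitlines()]
        if not lines:
            continue
        joined = " ".join(lines)
        if joined.startswith("# "):
            out.append("<h1>" + inline(joined[2:]) + "</h1>")
        else:
            out.append("<p>" + inline(joined) + "</p>")
    return "\n".join(out)


if __name__ == "__main__":
    print(to_html("# Title\n\nSome *plain* and **bold** text\nacross lines."))
```

A filter like this covers only the simplest cases; anything with tables, footnotes, or nested markup would need a real converter.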

In other words, you don't have to choose. Just use mainly text ... content-oriented, not format- or layout-oriented, efforts. Then you can play around with format, typography, etc. using many tools and produce many different kinds of products, and have fun and fill up your hard drive and the internet with only a few keystrokes! The average "text editor" now does fancy-dancy spell checking (this has been true for years) and markdown works, so why you ever need a "word processor" anyway is kind of beyond me, except at the final stages, to make that XML/PDF/HTML/DOC product. If then.

HTML/XML does not produce the exact same thing on every viewing platform, but it does produce consistent results that will always *act* if not always *look* the same, unless the viewer is screwing around a lot, and then that is the viewer's problem.

The problem with .pdfs is that they either open in another window or open within Acrobat in the browser. Having multiple windows is often inconvenient, while Acrobat within Firefox breaks standard key combinations (such as ctrl+t to open a new tab). I would much rather have the option of reading something a bit uglier without having to deal with either of these situations, if the link is primarily text-based; I agree that .pdfs are good for maps and similar.

By thiotimoline on 24 Jun 2009

Greg: You can set PDF to incorporate all typefaces--you have to do that to do books through Lulu, for example--in which case the results should be identical in all cases.

Otherwise--well, you have your preferred set of tools. "Why you ever need a 'word processor' anyway" may be fine for you, but not for me. Different people, different purposes, different tools. Fine--unless/until you're saying that it's *inappropriate* for me to use the tools I prefer. Or that what's easy for you is automatically easy for a typical writer.

I think we're seeing some of the culture clash going on here.

Your HTML versions of C&I (plus the Readability plugin for Firefox, which makes the margins short enough for easy reading) mean that I can read it without printing it out, which I sometimes prefer. The PDFs are gorgeous, and I do appreciate your attention to typography, but sometimes I just don't want to print things. But, you know, I deal. If the content is good enough, a reasonable person will generally find some way to deal with the form.

"But, you know, I think you want more than HTML. I think you want semantics--XML or better. "

The trouble with the "semantic web" is that the first day it starts gaining traction, people will start gaming the semantic tags to gain better exposure (since exposure means money and fame), the same way they are gaming the search engine hint keywords, and this will turn into an ugly arms race that will destroy the whole idea. (Getting significant, non-sponsored results in Google is already difficult on a number of subjects, and Google is, AFAIK, mostly discarding the search engine hint keywords of HTML.)

folbec: You raise a slightly different but cogent point. My problem with the Semantic Web is that, except for specialized cases, it requires far too much work during writing/creation to be likely to have good data. But yes, it also assumes honest and non-gamed data.

I've avoided it on this new site so far, but I've been, um, skeptical of the SW since first hearing about it--and, on the one occasion when I met Sir Tim B-L (we were both speaking on the same program), I said so. A cordial discussion that certainly didn't change his mind--but didn't convince me either. (Hey, I'll never be famous, but I've had brushes...consistently demonstrating that I'm no good at being an Impressed Follower.)

[If you happened by here when there was a comment #10 echoing exactly the same issues as comment #4: It's not gone because it's raising disadvantages of PDF or because it's redundant. It's gone because it's spam: The content was taken directly from another comment in order to legitimate the link behind the poster's name, thus gaining link love for the spam link. Didn't work for long; too bad the blog software didn't flag it immediately!]

I think you're right, although "two cultures" is probably an oversimplification. (Aren't most dichotomies?)