rdf:about="Shakespeare"

Dorothea has written a typically good post challenging the role of RDF in the linked data web, and in particular, its necessity as a common data format.

I was struck by how many of her analyses were spot on, though my conclusions are different from hers. But she nails it when she says:

First, HTML was hardly the only part of the web stack necessary to its explosion. TCP/IP, anyone?

I'm on about this all the time. The idea that we are in web-1995-land for data astounds me. I'd be happy to be proven wrong - trust me, thrilled - but I don't see the core infrastructure in place for a data web to explode. I see an exploding capability to generate data, and the computational capacity to process it. I don't see the technical standards in place that would enable the concurrent explosion of distributed, decentralized data networks and distributed innovation on data by users.

The Web sits on a massive stack of technical standards that pre-dated it, but that were perfectly suited to a massive pile of hypertext. The way the domain name system gave human-readable domains to dotted quads lent itself easily to nested trees of documents linked to each other, and those documents didn't need any more machine-readable context than some instructions to the computer about how to display the text. It's vital to remember that we as humans were already socially wired to use documents, and that this was deeply enabling to the explosion of the Web: all we had to do was standardize what those documents looked like and where they were located.

On top of that, at exactly the moment that the information on the web started to scale, a key piece of software emerged - the web browser - that made the web, and in many ways the computer itself, easier to use. The graphical web browser wasn't an obvious invention. We don't have anything like it for data.

My instinct is that it's going to take at least ten years' worth of technical development - especially around drudgery like provenance, naming, and versioning of data, but also around things like storage and federated query processing - before the data web is ready to explode. I just don't see those problems being quick problems, because they aren't actually technical problems. They're social problems that have to be addressed in technology. And them's the worst.

We simply aren't yet wired socially for massive data. We've had documents for hundreds of years. We have only had truly monstrous-scale data for a couple of decades.

Take climate. Climate science data used to be traded on 9-track tapes - as recently as the 1980s. Each 9-track tape maxes out at 140MB. For comparison's sake, I am shopping for a 2TB backup drive at home. 2TB in 9-tracks is a stack of tapes taller than the Washington Monument. We made that jump in less than 30 years, which is less than a full career-generation for a working scientist. The move to petabyte-scale computing is having to be wedged into a system of scientific training, reward, incentives, and daily practice for which it is not well suited. No standard fixes that.
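If you want to check that claim, the back-of-the-envelope arithmetic is quick; here's a sketch in Python (the one-inch-of-stack-per-reel figure is my assumption, not a measured one):

    # Rough check of the tape-stack claim. Assumes ~1 inch of stack
    # height per 9-track reel; the Washington Monument is 555 feet.
    TAPE_MB = 140              # capacity of one 9-track tape
    TARGET_MB = 2_000_000      # 2TB in decimal megabytes
    tapes = TARGET_MB / TAPE_MB    # roughly 14,300 reels
    stack_feet = tapes / 12        # 1 inch per reel -> ~1,190 feet
    print(f"{tapes:,.0f} tapes, stack of {stack_feet:,.0f} ft vs. 555 ft")

The stack comes out to roughly twice the monument's height, give or take your assumption about reel thickness.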

Documents were easy. We have a hundreds-of-years-old system of citing others' work that makes it easy, or easier, to give credit and reward achievement. We have a culture for how to name the documents, and an industry based on making them "trusted" and organized by discipline. You can and should argue about whether or not these systems need to change on the web, but I don't think you can deny that the document culture is a lot more robust than the data culture.

I think we need to mandate data literacy the way we mandate language literacy, but I'm not holding my breath that it's going to happen. 'Til then, the web will get better and better for scientists, the way the internet makes logistics easier for Wal-Mart. We'll get simple mashups, especially of data that can be connected to a map. But the really complicated stuff, like oceanic carbon - that stuff won't be usable for a long time by anyone not trained in the black arts of data curation, interpretation, and model building.

Dorothea raises another point I want to address:

"not all data are assertions" seems to escape some of the die-hardiest RDF devotees. I keep telling them to express Hamlet in RDF and then we can talk.

This "express Hamlet in RDF" argument is a Macguffin, in my opinion - it will be forgotten by the third act of the data web. But damn if it's not a popular argument to make. Clay Shirky did it best.

But it's irrelevant. We don't need to express Hamlet in RDF to make expressing data in RDF useful. It's like getting mad at a car because it's not an apple. There are absolute boatloads of data out there that absolutely need to be expressed in a common format. Doing climate science or biology means hundreds of databases, filling at rates unimaginable even a few years ago. I'm talking terabytes a day, soon to be petabytes a day. That's what RDF is for.
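For the unfamiliar, here's a minimal sketch of what "expressing data in RDF" looks like in practice, using Python's rdflib library; the ex: vocabulary, the station, and the numbers are invented for illustration:

    # A minimal sketch: one measurement expressed as RDF triples with
    # rdflib. The ex: namespace and all the values are hypothetical.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/climate/")
    g = Graph()
    g.bind("ex", EX)

    obs = URIRef("http://example.org/climate/obs/42")
    g.add((obs, RDF.type, EX.Observation))
    g.add((obs, EX.station, Literal("Mauna Loa")))
    g.add((obs, EX.co2ppm, Literal("389.78", datatype=XSD.decimal)))

    print(g.serialize(format="turtle"))  # the same graph, as Turtle text

Nothing magical: just subject-predicate-object statements that any other RDF-aware tool can merge with statements from anywhere else.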

It's not for great literature. I'll keep to the document format for The Bard, and so will everyone else. But he does have something to remind us about the only route to the data web:

Tomorrow and tomorrow and tomorrow,
Creeps in this petty pace from day to day

It's going to be a long race, but it will be won by patience and day-by-day advances. It must be won that way, because otherwise we won't get the scale we need. Mangy approaches that work for Google Maps mashups won't cut it. RDF might not be able to capture love, or literature, and it may be a total pain in the butt, but it does really well on problems like "how do I make these 49 data sources mix together so I can run a prediction of when we should start building desalination plants along the Pacific Northwest seacoast due to lower snowfall in the Cascade Mountains?"

That's the kind of problem that has to be modelable, and the model has to run against every piece of data possible. It's an important question to understand as completely as possible. The inconvenience RDF imposes is a small price to pay for the data interoperability it brings in this context, to this class of problem.
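To make that concrete, here's a toy sketch of mixing two such data sources, again with Python's rdflib; the file names and the ex: vocabulary are hypothetical:

    # RDF as database duct tape: load two datasets that were published
    # separately into one graph, then run a SPARQL query across both.
    # The file names and the ex: vocabulary are hypothetical.
    from rdflib import Graph

    g = Graph()
    g.parse("snowpack_cascades.ttl", format="turtle")
    g.parse("reservoir_levels.ttl", format="turtle")

    query = """
    PREFIX ex: <http://example.org/hydro/>
    SELECT ?station ?snow ?level WHERE {
      ?obs ex:station ?station ;
           ex:snowWaterEquivalent ?snow .
      ?res ex:feedsFrom ?station ;
           ex:level ?level .
    }
    """
    for row in g.query(query):
        print(row.station, row.snow, row.level)

True federation - shipping the query out to remote endpoints rather than shipping the data to you - is what SPARQL 1.1's SERVICE keyword is for; same idea, bigger scale.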

As more and more infrastructure does emerge to solve this class of problem, we'll get the benefits of rapid incremental advances on making that infrastructure usable to the Google Maps hacker. We'll get whatever key piece, or pieces, of data software that we need to make massive scale data more useful. We'll solve some of those social problems with some technology. We'll get a stack that embeds a lot of that stuff down into something the average user never has to see.

RDF will be one of the key standards in the data stack, one piece of the puzzle. It's basically a technology that duct-tapes databases to one another and allows for federated queries to be run. Obviously there needs to be more in the stack. SPARQL is another key piece. We need to get the names right. But we'll get there, tomorrow and tomorrow and tomorrow...


It's not a MacGuffin, but only because some RDFfers honestly do seem to believe that everything worth expressing can be expressed in RDF. The idea that one might still need, say, TEI or MathML just horrifies them.

Shut them up, and I'll quit harping on Hamlet. :)

I still despise RDF, though. I wish the data web had settled on topic maps instead. Them I get.



One additional thing: "a small price to pay" may still, pragmatically, be too high for many people to pay. This is the point I am working hard to make: we have to bring that price down.

I agree with you on how that will happen, if it does. I just think too many of us are underestimating the price.

I'll note that I've not dealt with the Semantic Web professionally for about six years, but I've yet to get any impression that anything is any different now than it was when I left that field.

Having switched gears from Semantic Web work for DARPA (where everything is fit for RDF/OWL/etc., because everything they'll use will be on an exclusive, secure network) to the REAL professional world of big corporations (in this case, the telecom industry), I see - in fact, almost feel smacked in my face - the REAL resistance to the Semantic Web: corporations don't want to exchange any data at all.

Seriously. They won't do it. It is all they have left. In fact, consider the people who control access to the heavy, detailed data: if someone else comes in with an analysis of the data they're responsible for and says they've found something (perhaps a few million dollars in potential savings) that the holder has not, well, the holder of the data is likely to be fired.

A corporation as an entity may gain in the long run by opening up, but the people who run the corporation will lose in the short term, either by being seen as a security risk, or being seen as useless when someone else finds something better through the data mining. So they keep it to themselves.

This attitude will be true everywhere in the corporate world - the short-term risks (usually to one's job, in a crappy job market like we've had for the last few years now) will never be perceived to be less than the long-term gains to be made through the data mining and cross-integration that the Semantic Web stands for.

And when working within the corporation itself, with the massive amounts of data involved, a data-warehouse SQL engine will always be faster (by a huge factor) than RDF interpretation, even if the RDF queries are easier for someone to read or write. When you're working with 0.2 terabytes of new data being added to the system every day, the text processing of RDF just to read the data simply can't keep up.

So, right tool for the right job, but RDF (in fact, any XML syntax) is simply not the right tool for such large scale internal corporate work, and sharing data from such corporations is simply a non-issue to start with.

By Joe Shelby (not verified) on 11 Jul 2010 #permalink

Dorothea - we're in loud agreement about RDF, actually. I don't like it. We're not supposed to like it. Machines are supposed to like it, and they do. What we need is ten years of development of intermediaries, UI, automation, and more that makes it easy for humans and RDF to co-exist.

The only people who are going to drive the transaction cost of using RDF down are the people who desperately need data interoperability, and they're already using it, so it's going to happen. If that's not you, then that's fine (I don't use RDF day to day, for example). But it is happening and will continue to happen, and it's only going to get better. The problem is that we have super-short attention spans at this point, and we don't like systems that have a legitimate 20-year growth requirement before they begin to drive massive economic value. RDF's ten years old now, give or take. It's got ten more to go, I think.

Joe, to your comment, thanks. But I'm not sure why you made it. There's nothing in the post about RDF as a tool inside companies, or about companies deciding to share data. FWIW, my belief is that public funds will drive RDF as a way to connect public data, and that companies will adopt RDF if - and only if - they need to use the connected public data. I didn't imply anywhere that a company would flip to RDF behind the firewall and then magically start sharing.

What is "public data", then? How much public data is only public because someone didn't think they should have secured it? (Facebook's consistent privacy changes, anyone?)

Just how much information is out there not protected by somebody's copyright? Certainly a newspaper article is protected, and the courts still haven't quite decided whether the "facts" that a newspaper or non-fiction book publishes are copyright-distinguishable from the written text one reads, or the video one sees. When I say the Padres beat the Mets 5 to 3, is that a fact, or is that data really the property of Major League Baseball?

While they have no way to enforce their copyright at the office water cooler, they certainly will be wary of someone mapping their data into a data warehouse and mining it for statistics to analyze without their express written consent. ;)

Dan Brown could just as easily have lost the lawsuit brought against him by the writers of Holy Blood and Holy Grail.

What could be considered public in an age where everybody knows everything, but where it is highly likely that a corporation paid for that knowledge at some point and expects to be compensated?

As for the real public information, the reports from or for the governments, again you've got privacy issues and then you've got politics, both of which can be a hindrance to use or publication of that data (in spite of the likely legal rights to do so). Plus in this age of ever shrinking budgets, there's no money for putting real metadata around those databases (and given the likelihood of corruption inherent in them, no state-level incentive either).

Of the facts that are left after all that legal wrangling of personal privacy, corporate copyright, and government security, what does semantic technology get that is worth knowing?

By Joe Shelby (not verified) on 11 Jul 2010 #permalink

If you live in a corporate world, you've already done most of the rationalization of your internal data systems, so your pain shouldn't be bad enough to make RDF attractive. But the public data situation is one in which thousands of one-off databases need to be federated, and there are no meaningful choices other than RDF out there to use.

I really think your conception of public data is not only incomplete, but inaccurate.

There is $30B a year in federal research money invested via the NIH alone. All in, I think the number tops $100B across the federal government. That generates vast swaths of data about our genomes, our climate, our weather, our educational system, and on and on and on. That's "public data," and there's boatloads of it.

There are no legal or privacy issues with the vast majority of that data. The polar data is public. The human genome and its children are public. Weather data is public. Data pouring in off sensor networks is public (see the link to the CDIAC website in the blog post for a very good example), and the people running these data centers are working their data into RDF formats even as we speak. Pharma likes RDF as well, as it provides a scalable way to connect the 1000+ public biological databases that were all written in different formats and naming systems. Copyright does not attach to the contents of these databases, at least in the United States (assuming the data is federal - US government works do not receive copyright). It's data, and it's public.

Public money is funding ontology work, because those ontologies need to exist in order to wire together disparate public databases and increase returns on investment. It goes on and on, and it cuts across disciplines.

RDF is duct tape for databases. It does that job extremely well, and there is a real demand for that job with billions and billions of dollars of real funding out there. If you don't like it, don't use it - like all standards, it's a choice to use it. I understand why people don't want to use it in their own world; what I don't understand is the persistent need to attack people who make the choice to use it.

RDF works. It sucks, it's ugly, it's not Turing complete - but it works. And that, in the end, is what will win the day. Because scalable connected data is tremendously economically valuable to the funders who pay for its creation.