Because people have been discussing Google ngrams a lot, and because there are always major caveats to new data-mining methodologies, I have to link Natalie Binder's excellent series of posts urging caution, not only about the methodology, but also about assuming too much about ngrams' utility in social research.
Binder says,
The value of the Ngrams Viewer rests on a bold conceit: that the number of times a word is used at certain periods of time has some kind of relationship to the culture of the time. For example, the fact that the word "slavery" peaks around 1860 suggests that people in 1860 had a lot to say about slavery. Another spike around the 1970s meshes nicely with the Civil Rights Movement.
Well, that's sort of interesting. However, I didn't need ngrams to tell me that a lot of people were writing about slavery in 1860. These data are broad but not deep, which makes them relatively useless to most humanities majors interested in intensive study. To understand the futility of trying to understand history this way, pretend that you've never heard of slavery, the Civil War or civil rights. Now take another look at the chart above. If Ngrams was your first encounter with the word "slavery," could you deduce that Americans owned slaves in the 1860s? Could you say anything other than, "slavery was a pretty big deal back then"? Probably not. But that is what the Google-Harvard team is suggesting we attempt to do, not necessarily with "slavery," but with many other words and ideas.
Binder is, as you can tell, extremely skeptical of ngrams' potential for analyzing trends in literature, even once the various OCR and metadata issues that produce false positives are cleaned up. Maybe she's a bit too skeptical. I have some sympathy for the position that we just don't need more heaps of imperfectly mined data. The internet and related technologies have already given us orders of magnitude more data than we have PhDs to dissertate it. And since I don't think graduate students should be treated as mere instrumentalities to embiggen our national knowledge base, I think we should train fewer PhDs, not more.
On the other hand, the relevant questions are how to ensure the data being mined is high-quality, how to filter out systematic errors, and how to devise questions that maximize the strengths of the data rather than just flailing at it with naive curiosity. And those, really, have always been the important questions about large datasets. They're the same questions I asked when I was 19 years old, counting several thousand mutant fruit flies, and realized halfway through that I'd scored a poorly penetrant allele of achaete as forked (or something like that) and had to start all over. The data is only as good as the filter - person or technology - reading it.
So it's not enough to have piles of data, as intoxicating as the prospect may be. You have to know that the data contains what you're looking for, and you have to figure out how to work around its weaknesses (which means knowing what those weaknesses are). It may be obvious that something is haywire when the word "internet" spikes in the 1920s, but it won't be so obvious with most artifacts. Nobody said science was easy... and slapping a "Google" on it certainly doesn't make it so.
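To make the "filter" point concrete, here is a minimal sketch of the kind of sanity check I have in mind, written against the tab-separated layout Google uses for its downloadable ngram files (ngram, year, match count, volume count). The cutoff years, the term list, and the file name are illustrative assumptions, not a vetted resource; a real filter would need far more care.

```python
# A minimal sketch of an anachronism filter for ngram counts.
# Assumes the tab-separated layout of the downloadable ngram files:
# ngram <TAB> year <TAB> match_count <TAB> volume_count
import csv

# Hypothetical earliest plausible years for a few terms. Hits before these
# dates are more likely OCR or metadata errors than genuine usage.
EARLIEST_PLAUSIBLE = {
    "internet": 1970,
    "television": 1900,
    "software": 1940,
}

def flag_anachronisms(path):
    """Yield (ngram, year, match_count) rows that predate the term's cutoff."""
    with open(path, newline="") as f:
        for ngram, year, match_count, _volume_count in csv.reader(f, delimiter="\t"):
            cutoff = EARLIEST_PLAUSIBLE.get(ngram.lower())
            if cutoff is not None and int(year) < cutoff:
                yield ngram, int(year), int(match_count)

if __name__ == "__main__":
    # "googlebooks-eng-all-1gram-sample.tsv" is a placeholder file name.
    for row in flag_anachronisms("googlebooks-eng-all-1gram-sample.tsv"):
        print("suspicious hit:", row)
```

Nothing fancy: it just encodes, in a dozen lines, the judgment that "internet" in 1923 is noise. The hard part, as always, is knowing which judgments to encode.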