What can search predict?

You've all heard about how you can predict all sorts of things, from movie grosses to flu trends, using search results. I earlier blogged about the research of Yahoo's Sharad Goel, Jake Hofman, Sebastien Lahaie, David Pennock, and Duncan Watts in this area. Since then, they've written a research article.

Here's a picture:

sharadsearch.png

And here's their story:

We [Goel et al.] investigate the degree to which search behavior predicts the commercial success of cultural products, namely movies, video games, and songs. In contrast with previous work that has focused on realtime reporting of current trends, we emphasize that here our objective is to predict future activity, typically days to weeks in advance. Specifically, we use query volume to forecast opening weekend box-office revenue for feature films, first month sales of video games, and the rank of songs on the Billboard Hot 100 chart. In all cases that we consider, we find that search counts are indicative of future outcomes, but when compared with baseline models trained on publicly available data, the performance boost associated with search counts is generally modest--a pattern that, as we show, also applies to previous work on tracking flu trends.

The punchline:

We [Goel et al.] conclude that in the absence of other data sources, or where small improvements in predictive performance are material, search queries may provide a useful guide to the near future.

I like how they put this. My first reaction upon seeing the paper (having flipped through the graphs and not read the abstract in detail) was that it was somewhat of a debunking exercise: Search volume has been hyped as the greatest thing since sliced bread, but really it's no big whoop, it adds almost no information beyond a simple forecast. But then my thought was that, no, this is a big whoop, because, in an automatic computing environment, it could be a lot easier to gather/analyze search volume than to build those baseline models.

Sharad's paper is cool. My only suggestion is that, in addition to fitting the separate models and comparing, they do the comparison on a case-by-case basis. That is, what percentage of the individual cases are predicted better by model 1, model 2, or model 3, and what is the distribution of the difference in performance. I think they're losing something by only doing the comparisons in aggregate.

It also might be good if they could set up some sort of dynamic tracker that could perform the analysis in this paper automatically, for thousands of outcomes. Then in a year or so they'd have tons and tons of data. That would take this from an interesting project to something really cool.

Categories

More like this

This morning, I was made aware (by my better half) of the existence of Google Flu Trends. This is a project by Google to use search terms to create a model of flu activity across the United States. Indeed, the results have been good enough that they were reported in a Letter in Nature [1] back…
Global cooling again; via a comment on this rather dodgy page I find Stockton, C. W. and W. R. Boggess, Geohydrological implications of climate change on water resource development, Contract Report DACW 72-78-C-0031, for U. S. Army Coastal Engineering Res. Center. That isn't P-R, but contains on p…
Should scientists place bets on how well they can predict the future? I'm not talking about inter-lab wagers on the outcome of some arcane experiment. Such games are commonplace, and usually involve ersatz currencies such as beers or domestic services. But what about real money on something as…
Ah, yes, a real game (kidding, Scrabble people). If you've watched many baseball games or baseball movies, you know that one of the things that makes for a successful hitter is the ability to predict what the next pitch will be. Is it going to be inside or outside? Will it be a fastball or a…