Iterative Hockey Stick Analysis? Gimme a break!

This past weekend, my friend Orac sent me a link to an interesting piece
of bad math. One of Orac's big interests is vaccination and
anti-vaccinationists. The piece is a newsletter by a group calling itself the "Sound Choice
Pharmaceutical Institute" (SCPI), which purports to show a link
between vaccinations and autism. But instead of the usual anti-vac rubbish about
thimerosal, they claim that "residual human DNA contaminants from aborted human fetal cells"
cause autism.

Orac, among others, has already covered the nonsense of that claim from a
biological/medical perspective. What he didn't cover, and the reason he
forwarded the newsletter to me, is the mathematical basis of their argument:
they claim to have discovered key change points in the
autism rate that correlate perfectly with the introduction of various vaccines.

In fact, they claim to have discovered three different inflection points:

  1. 1979, the year that the MMR 2 vaccine was approved in the US;
  2. 1988, the year that a 2nd dose of the MMR 2 was added to the recommended vaccination
    schedule; and
  3. 1995, the year that the chickenpox vaccine was approved in the US.

They claim to have discovered these inflection points using "iterative hockey stick analysis".

First of all, "hockey stick analysis" isn't exactly a standard
mathematical term. So we're on shaky ground right away. They describe
hockey-stick analysis as a kind of "computational line fitting analysis". But
they never identify what the actual method is, and there's no literature on
exactly what "iterative hockey stick analysis" is. So I'm working from a best
guess. Typically, when you try to fit a line to a set of data points,
you use a technique called linear regression. The most common linear regression method is
called the "least squares" method, and their graphs look roughly like least-squares
fitting, so I'm going to assume that that's what they use.

What least squares linear regression does is pretty simple - but it takes a
bit of explanation. Suppose you've got a set of data points where you've got
good reason to believe that you've got one independent variable, and one
dependent variable. Then you can plot those points on a standard graph, with
the independent variable on the x axis, and the dependent variable on the y
axis. That gives you a scattering of points. If there really is a linear
relationship between the dependent and independent variable, and your
measurements were all perfect, with no confounding factors, then the points would
fall on the line defined by that linear relationship.

But nothing in the real world is ever perfect. Our measurements always
have some amount of error, and there are always confounding factors. So
the points never fall perfectly along a line. So we want some way of
defining the best fit to a set of data: given that there's noise in the data,
what's the line that comes closest to describing the underlying linear relationship?

Least squares is one simple way of describing that. The idea is that the
best fit line is the line where, for each data point, you take the difference
between the predicted line and the actual measurement. You square that
difference, and then you add up all of those squared differences. The line
where that sum is smallest is the best fit. I'll avoid going into detail about
why you square it - if you're interested, say so in the comments, and maybe I'll write a basics
post about linear regression.
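
In the meantime, here's a minimal sketch of what a least-squares line fit looks like in practice. It's just an illustration of the technique, written in Python with numpy and some made-up numbers - it has nothing to do with the SCPI data:

    import numpy as np

    # Made-up example data: an independent variable (say, years) and a noisy
    # dependent variable that roughly follows a line.
    x = np.array([1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977], dtype=float)
    y = np.array([1.0, 1.4, 1.7, 2.3, 2.4, 3.1, 3.2, 3.8])

    # Degree-1 polyfit is exactly the least-squares line fit described above:
    # it picks the slope and intercept that minimize the sum of squared differences.
    slope, intercept = np.polyfit(x, y, 1)

    # The quantity being minimized: the sum of squared differences between the
    # measurements and the line's predictions.
    predicted = slope * x + intercept
    sum_sq_error = np.sum((y - predicted) ** 2)

    print(slope, intercept, sum_sq_error)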

One big catch here is that least-squares linear regression produces a good result
if the data really has a linear relationship. If it doesn't, then least squares
will produce a lousy fit. There are lots of other curve fitting techniques, which work in
different ways. If you want to treat your data as perfect, you can use different techniques to
progressively fit the data better and better until you have a polynomial curve which
precisely includes every datum in your data set. You can start with fitting a line to two points; for
every two points, there's a line connecting them. Then for three points, you can fit them precisely
with a quadratic curve. For four points, you can fit them with a cubic curve. And so on.
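
Here's a quick illustration of that effect, again with made-up numbers. A cubic has four coefficients, so numpy will happily push one exactly through four arbitrary points:

    import numpy as np

    # Four made-up points.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.0, 2.7, 2.1, 4.5])

    # A cubic has four coefficients, so it can be forced exactly through four points.
    coeffs = np.polyfit(x, y, 3)
    fitted = np.polyval(coeffs, x)

    # The residuals are zero (up to floating-point rounding): a "perfect" fit
    # that says nothing about any underlying relationship.
    print(np.max(np.abs(fitted - y)))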

Similarly, unless your data is perfectly linear, you can always improve a fit by
partitioning the data. Just as we could get a progressively closer fit by raising the
degree of the polynomial, we can fit two lines to a two-way partition
of the data and get a closer match; then get closer still with three lines in a three-way partition,
and four lines in a four-way partition, and so on, until there's a separate line for every pair of adjacent
points.

The key takeaway is that no matter what your data looks like, if
it's not perfectly linear, then you can always improve the fit by
creating a partition.
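
Here's a small demonstration of that takeaway, using deliberately non-linear, made-up data: a single least-squares line leaves a certain amount of squared error behind, and splitting the data at the best point and fitting a separate line to each side always does at least as well, and almost always does better:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(20, dtype=float)
    # Deliberately non-linear made-up data: a gentle curve plus noise.
    y = 0.05 * x**2 + rng.normal(0.0, 0.5, size=x.size)

    def line_fit_error(xs, ys):
        # Sum of squared residuals for a single least-squares line.
        slope, intercept = np.polyfit(xs, ys, 1)
        return np.sum((ys - (slope * xs + intercept)) ** 2)

    one_line = line_fit_error(x, y)

    # Try every two-way partition (keeping at least two points on each side),
    # fit a separate line to each side, and keep the best total error.
    two_lines = min(
        line_fit_error(x[:i], y[:i]) + line_fit_error(x[i:], y[i:])
        for i in range(2, len(x) - 1)
    )

    print(one_line, two_lines)  # two_lines is never larger than one_line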

For "hockey stick analysis", what they're doing is looking for a good
place to put a partition. That's a reasonable thing to try to do, but you need
to be really careful about it - because, as I described above, you can
always find a partition. You need to make sure that you're actually
finding a genuine change in the basic relationship between the dependent and
independent variable, and not just noticing a random correlation.

Identifying change points like that is extremely tricky, and doing it properly
takes a lot of work. In particular, you need to create a large number
of partitions of the data, in order to show that there is one specific
partition that produces a better result than any of the others. And that's not
enough: you can't just select one point that looks good and check whether you get a
better match by splitting there. That's only a start: you need to show that the
inflection point you chose really is the best inflection point.
But you also need to go Bayesian, and estimate the chance
of the inflection being an illusion: show that the quality of the partition
that you found is better than what you would expect by chance.
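
To give a flavor of what that kind of check looks like, here's a rough sketch of one possible approach - my own construction, not anything SCPI describes doing. Scan every candidate breakpoint, record how much the best one improves the fit, and then compare that improvement against what you'd get from data that has no real break at all:

    import numpy as np

    def line_fit_error(xs, ys):
        slope, intercept = np.polyfit(xs, ys, 1)
        return np.sum((ys - (slope * xs + intercept)) ** 2)

    def best_split_improvement(xs, ys):
        # Fractional reduction in squared error from the best single partition,
        # compared to one line through all the data.
        whole = line_fit_error(xs, ys)
        best = min(
            line_fit_error(xs[:i], ys[:i]) + line_fit_error(xs[i:], ys[i:])
            for i in range(2, len(xs) - 1)
        )
        return (whole - best) / whole

    rng = np.random.default_rng(1)
    x = np.arange(35, dtype=float)
    y = 0.3 * x + rng.normal(0.0, 1.0, size=x.size)   # made-up data, no real break

    observed = best_split_improvement(x, y)

    # Null distribution: fit one line, shuffle its residuals, and see how big an
    # "improvement" the best split gives when there is no real change point.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    null = [
        best_split_improvement(x, slope * x + intercept + rng.permutation(residuals))
        for _ in range(500)
    ]

    # Fraction of pure-noise datasets whose best split looks at least this good.
    # With no real break in the data, this typically won't be small.
    p_value = np.mean([imp >= observed for imp in null])
    print(observed, p_value)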

Finding a partition point like that is, as you can see, not a simple
thing to do. You need a good supply of data: for small datasets, the
probability of finding a good-looking partition purely by chance is quite high.
And you need to do a lot of careful analysis.

In general, trying to find multiple partition points is simply not
feasible unless you have a really huge quantity of data, and the slope changes
are really dramatic. I'm not going to go into the details - but it's basically
just more Bayesian analysis. You know that there's a high probability
that adding partitions to your data will increase the match quality. You need
to determine, given the expected improvement from partitioning based on the
distribution of your data, how much better a fit you'd need to find after
partitioning in order to be reasonably certain that the change wasn't an
artifact.
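
As a crude stand-in for that kind of analysis - not a real Bayesian treatment, just a sketch that uses an information criterion to charge a price for the extra parameters that each breakpoint adds - you could compare piecewise fits like this:

    import numpy as np

    def piecewise_sse(x, y, breaks):
        # Total squared error from fitting an independent line to each segment.
        # `breaks` is a sorted list of indices where a new segment starts.
        edges = [0] + list(breaks) + [len(x)]
        total = 0.0
        for a, b in zip(edges[:-1], edges[1:]):
            slope, intercept = np.polyfit(x[a:b], y[a:b], 1)
            total += np.sum((y[a:b] - (slope * x[a:b] + intercept)) ** 2)
        return total

    def bic(x, y, breaks):
        # Crude BIC under a Gaussian-error assumption: every extra breakpoint
        # reduces the squared error, but it also costs parameters (a new slope,
        # a new intercept, and the break location itself).
        n = len(x)
        k = 2 * (len(breaks) + 1) + len(breaks)
        return n * np.log(piecewise_sse(x, y, breaks) / n) + k * np.log(n)

    rng = np.random.default_rng(2)
    x = np.arange(35, dtype=float)
    y = 0.3 * x + rng.normal(0.0, 1.0, size=x.size)   # made-up data, no real break

    # Lower BIC is better; with no real break, the extra partitions shouldn't win.
    print(bic(x, y, []), bic(x, y, [18]), bic(x, y, [12, 24]))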

Just to show that there's one genuine partition point, you need to show a
pretty significant change. (Exactly how much depends on how much data you
have, what kind of distribution it has, and how well it correlates to the line
match.) But you can't do it for small changes. To show two genuine change points
requires an extremely robust change at both points, along with showing that
non-linear matches aren't better than the multiple slope changes. To show
three inflection points is close to impossible; if the slope is
shifting that often, it's almost certainly not a linear relationship.

To get down to specifics, the data set purportedly analyzed by SCPI
consists of autism rates measured over 35 years. That's just thirty-five
data points. The chance of being able to reliably identify
one slope change in a set of 35 data points is slim at best. Two?
Ridiculous. Three? Beyond ridiculous. There's just nowhere near
enough data to be able to claim that you've found three different inflection
points in 35 data points.

To make matters worse: the earliest data in their analysis comes from a
different source than the latest data. They've got some data from the
US Department of Education (1970->1987), and some data from the California
Department of Developmental Services (1973->1997). And those two are measuring
different things; the US DOE statistic is based on a count of the number of 19
year olds who have a diagnosis of autism (so it was data collected in 1989 through 2006);
the California DDS statistic is based on the autism diagnosis rate for children living in
California.

So - guess where one of their slope changes occurs? Go on, guess.

1988.

The slope changed in the year when they switched from mixed data to
California DDS data exclusively. Gosh, you don't think that that might be a
confounding factor, do you? And gosh, it's by far the largest (and therefore
the most likely to be real) of the three slope changes they claim to
have identified.

For the third slope change, they don't even show it on the same graph. In
fact, to get it, they needed to use an entirely different dataset from
either of the other two. Which is an interesting choice, given that the CA DDS
statistic that they used for the second slope change actually appears
to show a decrease occurring around 1995. But when they switch datasets,
ignoring the one that they were using before, they find a third slope change
in 1995 - right when their other data set shows a decrease.

So... Let's summarize the problems here.

  1. They're using an iterative line-matching technique which is, at
    best, questionable.
  2. They're applying it to a dataset that is orders of
    magnitude too small to be able to generate a meaningful result for a
    single slope change, but they use it to identify three
    different slope changes.
  3. They use mixed datasets that measure different things in different ways,
    without any sort of meta-analysis to reconcile them.
  4. One of the supposed changes occurs at the point of changeover in the datasets.
  5. When one of their datasets shows a decrease in the slope, but another
    shows an increase, they arbitrarily choose the one that shows an increase.


Hi Mark,

Thanks for taking the time to analyze those pseudo-scientific claims, I do appreciate the effort you spend on it.

Can you please elaborate the reason for squaring the difference? I've always been curious why it's more advantageous than taking the absolute value.

Thank you,
Mario.

PS Apologies for using a disposable email for the comment, I only have one email address now.

I call it "iterative whole cloth analysis".

I can see three possible advantages to squaring instead of taking absolute values:

1) Squaring gives a smooth function, so all the tools of calculus can be used.
2) Absolute values may not produce a unique answer. (Example: Suppose you restrict it to horizontal lines. Least squares produces a unique answer (the average). Absolute difference with an odd number of points will produce a unique answer through the median point that ignores all but one of the points; absolute difference with an even number of points allows any line that passes between the two middle points, also ignoring most of the points.)
3) Absolute difference seems to ignore much of the data.

I'm curious about what they found when polio vaccine, diphtheria (or DTaP), hep B, etc. were introduced as well. How about when these vaccinations were introduced in other countries?

You know, this nonsense sounds like the "analysis" the Geier brothers did a few years back where they arbitrarily fit two least squares lines to autism incidence around the time California banned thiomersal from vaccines. The two lines arbitrarily overlapped a year, didn't take into account the time series structure of the data (I wonder if these guys you mentioned did), and pretty much seemed like they worked back from their conclusion.

Squaring the difference is also popular because it will produce a maximum likelihood estimate if the errors on the data are truly independent and Gaussian.

I always thought that you square the difference because it always results in a positive value, and so a best fit can be arrived at algebraically.

To put it another way - when you add parabolas together, you always get another parabola. Adding absolute values together gives you weird-shaped graphs that are analytically difficult.

@Mario (No. 1)

The reason for squaring the difference is mostly just for technical calculus reasons---the math is much, much nicer, and the algorithm's results are more predictable and stable if you do.

The most intuitively obvious reason is that absolute value has a non-differentiable point where the deviation is zero, while the square of the difference is smooth at zero. So you can apply more calculus techniques near zero than with absolute values.

This was more important before computers. Absolute rather than squared deviation fits are sometimes used now.

Technically, you can use least squares fitting to fit an arbitrary function to your data. For instance, power laws (equivalent to fitting a line to the log of both sides), exponentials (equivalent to fitting a line after taking the log of one side), and low-degree polynomials are often fit this way. But the technique has its pitfalls. One is that if you are going to fit anything other than a line (or something that can be easily transformed into fitting a line), you need to have a good argument for why it should have that dependence, since as in the linear case if you choose the wrong fitting function you can get a formally rigorous but actually meaningless result. Another is that high-order polynomial fits frequently produce stiff results, sticking close to the given data points at the expense of doing strange things between data points.

And of course you need to have many more data points than parameters if you want anybody to actually believe your fit. In this case, they are trying to fit a function with at least 11 parameters (four lines, each with a slope and intercept, plus three breakpoints) to 35 points of data. Yes, you can get an answer, but it's not going to mean a whole lot.

By Eric Lund (not verified) on 30 Apr 2010 #permalink

My expertise lies in the solution of inverse and ill-posed problems, which makes me wonder if their "hockey stick" technique is derived from the legitimate technique called L-curve analysis in regularization?

In order to solve ill-posed problems, you have to regularize the data, meaning you have to damp out the noise. This is done usually in a least-squares situation like this by solving the problem using the singular value decomposition and truncating at a certain point -- i.e. just ignore all singular vectors corresponding to singular values below lambda. You determine this cutoff using the L-curve analysis -- plot error versus the magnitude of the singular value at which you cut off, and it looks like an L-shaped curve. The turning point of the L-shaped curve is where your cutoff should be. The L-curve analysis technique was developed by Per Christian Hansen about 10-15 years ago. With a data set this small, regularization would be pretty much useless, but that probably wouldn't stop cranks from trying it.

Squaring originates in the fact that the normal (Gaussian) model originally ruled all - in the larger scheme of things, least squares regression gives estimates that are the orthogonal projections of the observed Y values onto the space spanned by the design matrix.

@3: absolute values doesn't ignore the data at all. fitting by least squares, absolute values, or any other measure of dispersion, leads to estimates that minimize some function of the residuals. in your comment, least squares leads to the mean, minimizing with absolute values leads to the median. different results, but same philosophy.

The newsletter mentions an EPA report, so their "hockey stick" analysis may be similar to the analysis used in the EPA report which can be found here (David Kirby trumpeted it in HuffPo):

http://www.all.org/pdf/McDonaldPaul2010.pdf

In it they describe the Hockey stick method:

"We examined each of the three studies and the assembled worldwide data set for a changepoint in the time sequence of AD cumulative incidence by birth year cohorts in each study by fitting a hockey-stick model (44) to the data in each case. This approach uses ordered data and piecewise linear regression to split the response variable into two groups, generating a linear regression for each group.

where (44) is:
Qian, S. S. Environmental and Ecological Statistics with R.,
Chapman & Hall/CRC: Boca Raton, FL, 2010.

I did not find a good explanation for why the assumption of linearity in the data is valid.

So, what happens if you apply this type of analysis to an exponential or other nonlinear set of data? Will you be guaranteed to get a change point somewhere?

Is there a mathematical formula that you can use to compare the fit of different partitions that share all but one inflection point?

For example, for a given set of data, 2 specific partitions give a result P2. You then add a third partition (while leaving the first two in place) and get a new fit P3 which is better than P2. Is there a way to compare P2 and P3, mathematically, given that P3 is the better fit but requires an additional partition?

oops, that was not the question I meant to ask. How unique is the change point you calculate? Does the second best change point produce a significantly worse fit, or will there be multiple options that could produce decent fits, with one being marginally better than the others (assuming low-noise data)?

The way that you measure the fit of a partitioned line is the same way that you measure a single line fit: least squares. You take the sum of the squared differences for the fit with the partition, and compare it to the same sum for the fit without the partition.

In real data, there generally won't be exactly one value which is the ideal inflection point; there'll be a narrow range of values. Real data is noisy, which will tend to produce some scatter that makes it hard to pinpoint the exact change point, and real-world changes rarely have an instantaneous impact. In the example in the paper, approving a new vaccine doesn't instantaneously change the vaccination practices of every doctor and every patient. The addition of a new vaccine/the change to a new vaccine occurs gradually over a few years. So there won't be exactly one ideal inflection point.

In fact, that's one of the problems with their "analysis" which I didn't discuss. Even if you accept the inflection points that they found, they don't match reality. MMR-II wasn't adopted overnight by every pediatrician in the country. Doctors didn't start giving the chicken-pox vaccine to every child overnight. (In fact, a lot of doctors still don't give chicken-pox at all. The uptake of that vaccine is still increasing.) So if the bozos were right, you wouldn't expect to be able to find a specific inflection point; you'd instead see a gradual increase in slope starting at the key years and increasing for some number of years after. But they claim to have found single points where things instantaneously changed.

At any rate: there's not usually a single point, but rather a narrow range. If it's real, doing a partition within that narrow range will produce a significantly better fit than any single line, or any single partition outside of that range - better by a large enough factor that a bayesian analysis shows it to be very unlikely to be an artifact.

@13:

If you try to apply linear regression to an exponential curve, you'll get a lousy fit. Virtually any partition will improve it significantly. That's one of the reasons that you need to show that your proposed partition is uniquely good. Exponentials will show the fit improvement from adding partitions particularly dramatically, since the curve diverges from whatever lines you fit so quickly.

They're fitting what's called a change-point model (sometimes also called a broken stick model). One important point is that the lines before and after the changepoint have to meet. These are, if not standard, then fairly well known in statistics. TBH, if I was fitting a curve with 3 changepoints, I'd switch to using splines instead.

When one of their datasets shows a decrease in the slope, but another shows an increase, they arbitrarily choose the one that shows an increase.

And here it is worth mentioning the importance of defining your success criteria in advance rather than scouring the data for trends with p < 0.05. If you look for enough possible relationships (or look for your desired relationship in enough datasets) you will identify illusory "trends" that have a p < 0.05 just as a result of sheer coincidence. 0.05 is not that small a number, after all...

In fact, a lot of doctors still don't give chicken-pox at all. The uptake of that vaccine is still increasing.

And yet another reason why it's so infuriating when the anti-vaxers claim that the medical establishment throws caution to the wind and just says "Vaccinate! Vaccinate! Vaccinate!" without looking at cost/benefit. The varicella (chicken pox) vaccine doesn't have that great of a protection rate compared to many other vaccines, and add to that the fact that varicella is rarely fatal (before the vaccine, <100 deaths annually in the US despite millions of cases), and you can understand why physicians might initially be wary.

And as you say, the mainstream has been wary. Probably too wary -- there's no doubt anymore that the vaccine has saved lives, not to mention a heaping helping of pain and suffering.

But it certainly ought to put to rest the idea that mainstream medicine is not taking the cost-benefit of vaccines into account. Here we have a vaccine where the cost-benefit is a clear positive based on all available information, but since the cost-benefit wasn't as hugely positive as some other vaccines, the adoption has been cautious and slow.

Bah, stupid me, I typed < instead of &lt;... I don't feel like retyping what got cut off, so in a much less wordy version: while trends with p < 0.05 seem impressive on the surface, if you comb through enough data and/or look for enough different possible trends in your data, you're bound to hit an illusory "trend" with p < 0.05 sooner or later...

@20:

Yes, that's exactly why it's so important to be Bayesian about this kind of thing. When you start looking for the kind of trend you want to find, you're bound to find something eventually. So you need to do that additional layer of analysis to provide evidence that you found something real.

(I'd actually argue that you should always do the Bayesian thing anyway; an awful lot of what passes for null-hypothesis testing is utter crap. I'm not a hard-core bayesian, and I wouldn't throw away the entire idea of NHT, but the Bayesians are right that there's a lot of sloppy work out there that could be caught by some simple analysis.)

You can use least squares to estimate non-linear relationships provided that the model is linear in the parameters. So if I want to estimate say a quadratic increase over time I could just do something like:
y = β₀ + β₁x + β₂x²
This would work just fine. Least squares is a linear regression model because it is linear in the βs, not in the xs.

Of course, when working with data over time, there is always the issue of spurious regression, which is where least squares will tell you there is a relationship (even if you get the correct functional form) when there isn't one. This is due to a confounding effect from an omitted variable: time. If both variables increase over time then of course you will get significant regression results when you're not accounting for time in your analysis. In order to compensate for this, you need to do much more sophisticated analysis (like testing for integration among regressors, deterministic time trends, etc.), which I don't see any evidence of in their article.
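
(To make the "linear in the parameters" point concrete, here's a minimal sketch with made-up data: put 1, x, and x² in as separate columns of a design matrix, and solving for the βs is an ordinary least-squares problem.)

    import numpy as np

    # Made-up data with a genuinely quadratic trend plus noise.
    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 10.0, 30)
    y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0.0, 1.0, size=x.size)

    # "Linear in the parameters": 1, x, and x^2 are just columns of the
    # design matrix, so ordinary least squares recovers the betas.
    X = np.column_stack([np.ones_like(x), x, x**2])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(betas)   # roughly [1.0, 0.5, 0.2]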

Even if all of this data was from a single study, I'd have a problem with the simple fact that there is no definitive test that absolutely states "yes, you have autism". Also, does this data include people who supposedly have the so-called Asperger's Syndrome?

I think when most people read studies like this, they assume that cases of autism represent essentially an army of Rain Men joining society. Most of the people I've met who have "autism" are basically wonderfully normal kids who happen to have interests other than playing basketball with their neighbors.

I'm not saying autism doesn't exist - clearly there is a set of characteristics that are severely debilitating and prevent any semblance of a "normal life" as characterized by the movie Rain Man. I just wonder how much of the rise in autism cases is due instead to people's desire to classify difference and weirdness into convenient little containers rather than embrace diversity.

Oh, and by the way, the link between autism and abortion doesn't surprise me. I've long suspected that the three primary factors that lead to autism rates are abortion, a lack of regular church attendance, and masturbation.

@23:

That's an incredibly important point. I've mentioned it before, and other folks have covered it in detail.

In fact, it appears that most, if not all, of the supposed increase in autism is really a diagnostic change. There used to be a general categorization of "mental retardation", where most kids with autism, Downs, and a half dozen other things were all lumped together. And even after autism started to be recognized as a distinct diagnosis, what we now consider "autism spectrum disorder" used to be multiple distinct diagnoses. So we now include Asperger's as an ASD, whereas the original diagnosis of autism wouldn't have included it.

So now, doctors and government organizations recognize and track multiple distinct diagnoses for different neurological/cognitive impairments; and they also recognize that several things that used to be considered distinct illnesses are actually just different degrees of the same thing.

With all of those changes, just plotting a raw count on a graph with no further information about diagnostic categorization is hopelessly naive at best, and deliberately deceptive at worst.

I'm willing to bet that a few years from now, folks like these are going to be pointing at 2010 as another inflection-point year. Because this year, the new DSM is removing Asperger's as a distinct diagnosis, and moving it under ASD. So if you count diagnoses of ASD in 2010 and 2011, you'll be counting different things.

Please! They're not called "points of inflection". They are "break points" or "change points". (And probably other names too.)

Their data looks remarkably piecewise linear. But since they don't say one bit about what they actually did, nor identify what their data actually is, there is no earthly reason to give them credence.

By william e emba (not verified) on 02 May 2010 #permalink

Several people have posted correct reasons why RMS (Root-Mean-Square) errors and not absolute value based errors are preferred. But it's worth knowing the secret, underlying, reason to use squaring: it corresponds to a Hilbert space.

In essence, the functions defined on a given interval form an infinite-dimensional space. The RMS error is simply the "Pythagorean theorem" distance between the (linear) subspace of all functions that interpolate the given data, and the (linear) subspace of all linear functions defined on the given interval.

You can use absolute value, or cube-root-mean-cube error, or maximum error, or the like, but these only give more general Banach spaces, with much weaker geometric structure. Most glaring is the lack of a satisfactory notion of perpendicularity, and of a satisfactory notion of linear algebra.

The result is you can prove endless amounts of theorems about RMS errors and generate endless algorithms, but other notions of error terms are quite poor in comparison.

By william e emba (not verified) on 03 May 2010 #permalink

just a minor note to discourage cranks from misinterpreting:

linear regression demonstrates (subject to probability) a linear correlation between two variables...it doesn't demonstrate any kind of causation. "correlation is not causation" is a major mantra of statisticians, right?

so saying that the variables are "dependent" and "independent" is a little misleading. you'll find an unlikely-to-be-random correlation (if there is one) regardless of which you put on the x and which you put on the y-axes.