Some More Thoughts About Value-Added Teacher Evaluation

Tuesday, I criticized the LA Times' use of the 'value-added' approach for teacher evaluation. There were many good comments, which I'll get to tomorrow, but Jason Felch of the LA Times pointed me to the paper describing the methodology. I'm not happy with the method used.

First, I was right to have concerns about the linearity of test scores. Consider the mean score for each quartile:

highest = 852
second highest = 768
third highest = 730
fourth highest = 682

What this means is that an increase from the 40th percentile to the 50th does not correspond to the same raw-score gain as an increase from the 50th to the 60th. Now, as far as I can tell, the paper's authors are using the raw scores, but the model they use assumes linearity. In light of this, using something like proficiency (a gross cutoff) would seem to be a more accurate, if less precise, measure (i.e., something like the net percent increase or decrease in proficiency per class).
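To put some numbers on that, here's a quick sketch (in Python) using the quartile means quoted above. For a roughly symmetric score distribution, the gap between the top two quartile means and the gap between the bottom two would be about the same size; here the top gap is nearly twice the bottom gap, which is exactly the sort of nonlinearity a model fit on raw scores glosses over.

    # Gaps between adjacent quartile means, highest to lowest.
    quartile_means = [852, 768, 730, 682]
    gaps = [hi - lo for hi, lo in zip(quartile_means, quartile_means[1:])]
    print(gaps)  # [84, 38, 48] -- one "quartile" of movement costs very
                 # different numbers of raw points depending on where you start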

Second, the authors include lots of effects in their model (and determining the significance of these effects isn't trivial), but there's one glaring omission:

The model was simplified by assuming that the student heterogeneity term (αi) was zero.

In other words, intrinsic student differences are removed from the model. The authors claim this is warranted:

This assumption was consistent with initial data runs that indicated that student heterogeneity was statistically insignificant after controlling for prior year test score and observed student characteristics. More importantly, recent research has shown that this type of model performs well in predicting teacher performance from year to year in both experimental and non-experimental settings (Kane and Staiger, 2008; McCaffrey et al., 2009).
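For the non-modelers, a generic value-added specification of the sort being described looks something like this (my notation; the paper's exact setup may differ in its details):

    y_it = β·y_i,t-1 + γ·X_it + θ_j(i,t) + α_i + ε_it

where y_it is student i's score in year t, X_it are the observed student characteristics, θ_j(i,t) is the effect of whichever teacher the student is assigned to, α_i is the student heterogeneity term, and ε_it is noise. Assuming α_i = 0 means any persistent, unmeasured differences among students have to be soaked up by the prior-year score and the observed characteristics (or else they wind up somewhere else in the model).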

Oddly enough, the Kane and Staiger paper claims that teacher effects disappear after two years, so, well, I'm not sure what the fuss is about. But the larger issue is that this is a really screwy population of students. Here's how the percentage of students who qualify for free lunch (an indicator of poverty) breaks down by quartile:

highest = 55
second highest = 89
third highest = 94
fourth highest = 97

This is an incredibly monomorphic population. To give you some idea of what that means, if a class has 25 students, about half of the classes in the schools belonging to the lowest quartile will have every single student qualify for a free lunch (I'm assuming students are distributed randomly, which at 97% is probably a reasonable, if not entirely accurate, approximation; the arithmetic is sketched out after the quote below). It is difficult to tease out the effects of poverty because so many students are poor. The reason student variation can be ignored--and has little effect in the analysis--is that the environment of the students is rather invariable, albeit for a shameful reason. In other words, this study primarily deals with a population that is homogeneous for poverty. Thus, we can't say very much about how poverty affects scores in general. It also means that teacher effects will be magnified relative to other student populations. Related to this, Matthew Yglesias, looking at LA's NAEP scores, concludes:

We see that LA's black kids do worse than the average big city black kid. LA's Latino kids do worse than the average big city Latino kid. And LA's poor kids do worse than the average big city poor kid. LA's non-poor kids, its white kids, and its Asian kids are average for kids in big city public school systems. Relative to the national average LA's 8th grade math scores are below average for blacks, Hispanics, and Asians. They're below average for poor kids and they're below average for non-poor kids. But LA's non-Hispanic whites do right in line with the national average.
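Here's that free-lunch arithmetic, under the idealized assumption that students land in classes independently:

    # If 97% of students qualify for a free lunch and a class of 25 is drawn
    # independently from that population, the chance that every student in
    # the class qualifies is:
    p_qualify = 0.97
    class_size = 25
    print(p_qualify ** class_size)  # ~0.47, i.e., roughly half of all classes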

Finally, according to their analysis, teacher quality accounts for 19% of the effect in English and 27% in math (for the stats mavens, the effect sizes are 0.19 and 0.27 respectively). It should be noted that the correlation between years is 0.87, so the greatest contribution to test scores is what the student walked into the classroom with (i.e., students who did well last year will do well this year). If we take the effect sizes at face value, and I think there are other methodological issues, along with what I've raised here, that make that a dubious assumption, we're still talking about ~75% of the effect not being due to teacher quality.
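One rough way to see why the prior-year score dominates (this is my back-of-the-envelope reading, not a calculation from the paper): in a simple linear model, the share of variation a single predictor explains is its squared correlation, so a year-to-year correlation of 0.87 means last year's score alone accounts for roughly three quarters of the variation in this year's score.

    # Back-of-the-envelope: variance explained by a single predictor in a
    # simple linear regression is the squared correlation.
    r_year_to_year = 0.87
    print(r_year_to_year ** 2)  # ~0.76 -- most of this year's score is last year's score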

I'll have some final thoughts and discuss reader comments tomorrow.

The cases that feature in the article all look consistent with regression to the mean -- an award-winning teacher with high scoring students has a slight decline, one with very low scoring students has a marked increase. If those were selected for the story because they represent the extremes, there's no story there.

I've never seen evidence that these statistics are actually showing anything other than the effect of sample size. They keep saying that "having the right teacher makes more difference than the right school", but a classroom is a much smaller sample than a school, so naturally there's more statistical variance among classrooms.

If students are shuffled year-to-year, the classrooms ought to start with something nearer the mean, and the variance among classrooms ought to increase by year end, if the teachers are actually doing anything at all. Can they show that baseline effect?
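That sample-size point is easy to illustrate with a toy simulation (hypothetical numbers: classes of 25 versus schools of 500, and no true teacher or school effects at all):

    # Even with NO real teacher or school effects, class averages (n=25) spread
    # out far more than school averages (n=500), purely because of sample size.
    import random
    random.seed(1)

    def group_means(group_size, n_groups=1000):
        return [sum(random.gauss(0, 1) for _ in range(group_size)) / group_size
                for _ in range(n_groups)]

    def spread(means):
        m = sum(means) / len(means)
        return (sum((x - m) ** 2 for x in means) / len(means)) ** 0.5

    print("class-level spread: ", round(spread(group_means(25)), 3))   # ~0.20
    print("school-level spread:", round(spread(group_means(500)), 3))  # ~0.045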

I think you're right that teacher effect is a very small percentage of the entirety of things that can influence a particular student's scores in a given year.

Still, that is different from influencing many students over many years, on average. As an aggregate effect, some teachers could be doing a lot more good than others. Wasn't there just a story on NPR about how a good kindergarten teacher gave you a significant income boost long term? *googles* Ah, here it is, an economic analysis of project STAR http://obs.rc.fas.harvard.edu/chetty/STAR_slides.pdf
Ah, and here's a key quote from a NYT article: "All else equal, they were making about an extra $100 a year at age 27 for every percentile they had moved up the test-score distribution over the course of kindergarten."
For one student, not much. For many students, it could add up.
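Just to put an illustrative number on "it could add up" (my hypothetical figures, not the study's): using the $100-per-percentile estimate from that quote, a teacher who moved a class of 25 students up 10 percentiles each would be worth something like $25,000 a year in later earnings, spread across those students.

    # Illustrative arithmetic only, using the $100/percentile figure quoted above.
    dollars_per_percentile = 100   # per student, per year of adult earnings
    class_size = 25                # hypothetical class
    percentile_gain = 10           # hypothetical average gain per student
    print(dollars_per_percentile * class_size * percentile_gain)  # $25,000 per year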
So some parents might look at value-added rankings and freak out that their kid didn't get a good teacher that year, which wouldn't be hugely helpful.
On the other hand, if you want the quality of your entire educational system to improve, and if we can find out WHY some teachers do better than others (assuming it's truly beyond the noise intrinsic in the zero-sum game of testing), it might well be worth investing in.

(of course, that study has an alternative explanation--they looked at income via tax records. If kids who do badly in kindergarten are more likely to grow up to be people who distort their taxable income downward, there could be no difference between the groups at all)