State of the Statistics: A Nonlinear Non-Diebold Effect?

UPDATE: Diebold effect explained?

Marc has an excellent summary of a flurry of Diebold-related discussions among "T", Marc, Sean, and me.

Sean also has a network model of the apparent Diebold effect.

I think we'll soon hear from Brian Mingus (who's running a meta-classifier) and Steve Freeman (an expert on machine effects in elections) as well.

At bottom is a disagreement over how to infer causality in observational data, and how to diagnose the functional form of a data set.

The good news is twofold: there may not be a large "Diebold effect" when nonlinear methods are used, and reason suggests that the apparent Diebold effect will be explained by demographics.

The "bad news" is also two-fold: not everyone agrees those nonlinear methods are appropriate, and there's an alarmingly persistent, consistent, and large Diebold effect when simple - but traditional - inferential statistics are used.

It's still not clear exactly which demographic feature produces such discrepant results between the nonlinear and linear models. (Edit 1/21: An important but previously unconsidered variable is how each precinct voted in the 2004 Democratic primaries.)
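
For context, one example of the kind of simple, traditional inferential test at issue is a two-sample comparison of Clinton's precinct-level vote share between hand-counted and Diebold-counted precincts. The sketch below is illustrative only; the file name and column names are hypothetical stand-ins, not the actual dataset anyone in this discussion used.

import pandas as pd
from scipy import stats

# Hypothetical precinct-level data: one row per precinct, with the counting
# method and Clinton's share of the Democratic primary vote.
df = pd.read_csv("nh_primary_precincts.csv")
diebold = df.loc[df["method"] == "Diebold", "clinton_share"]
hand = df.loc[df["method"] == "Hand", "clinton_share"]

# Welch's two-sample t-test of mean Clinton share by counting method.
t, p = stats.ttest_ind(diebold, hand, equal_var=False)
print(f"Diebold mean = {diebold.mean():.3f}, Hand mean = {hand.mean():.3f}")
print(f"t = {t:.2f}, p = {p:.4f}")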

I ran a few dozen experiments with a variety of classifiers and meta-classifiers and found that J48, which generates a C4.5 decision tree, and AdaBoost (pdf) achieve the best results on my feature set. Random forests performed poorly across a variety of parameters. I didn't experiment much with feature selection (e.g., by PCA, which takes too long), and I'm not sure which features others are using. Send me your datasets and I'll run them.

Features used
Town,Sqmiles,Votes,Municipalwater,Municipalsewer,
Totalhousingunits,Singlefamilyhomes,Multifamilyunits,
Manufacturedhomes,Totalpopulation,Medianage,Percenthighschoolgraduates,
Percentholdingbachelorsdegree,Totallaborforce,Totalemployed,
Totalunemployed,Percapitaincome,Medianhouseholdincome,Age5andunder,
Age5to19,Age20to34,Age35to54,Age55to64,Age65andup,Employeesinlargestbusiness,
MarkObama,MarkClinton,IncomeDev,EducationDev,Contested,PopDensity

AdaBoostM1
Relation: diebold

Correctly Classified Instances 204 88.3117 %
Incorrectly Classified Instances 27 11.6883 %
Kappa statistic 0.7647
Mean absolute error 0.1565
Root mean squared error 0.3046
Relative absolute error 31.5858 %
Root relative squared error 61.2877 %
Total Number of Instances 231

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.868 0.098 0.918 0.868 0.892 Hand
0.902 0.132 0.844 0.902 0.872 Diebold

=== Confusion Matrix ===

a b <-- classified as
112 17 | a = Hand
10 92 | b = Diebold

J48
Relation: diebold

Correctly Classified Instances 206 89.1775 %
Incorrectly Classified Instances 25 10.8225 %
Kappa statistic 0.7812
Mean absolute error 0.1804
Root mean squared error 0.3088
Relative absolute error 36.4002 %
Root relative squared error 62.1423 %
Total Number of Instances 231

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.891 0.108 0.913 0.891 0.902 Hand
0.892 0.109 0.867 0.892 0.879 Diebold

=== Confusion Matrix ===

a b <-- classified as
115 14 | a = Hand
11 91 | b = Diebold

I hate to add to all this noise, and I have neither fancy statistics nor the time and energy to produce any, but I can't resist making two quick points:

1. This may be a good example of why Killeen (2005; "An alternative to null-hypothesis significance tests," Psychological Science, 16(5), 345-353) has suggested that null-hypothesis significance tests can be very misleading. If you don't know the prior probability distributions under the null hypothesis (which you usually can't, and definitely not in this case), then, as Fisher himself said, "Such a test of significance does not authorize us to make any statement about the hypothesis in question in terms of mathematical probability" (Fisher, 1959, p. 35). This is a problem we have all tried really hard to ignore because it calls into question much of what we do with statistics, but it seems like a particularly big problem for complex and uncontrolled (in the sense that people were not randomly assigned to conditions, not in the sense that you have not tried to control for confounding variables) correlational studies like this one, where we really can't know the priors. Killeen's proposed solution is to use the probability of replication (p-rep) instead, but since this is currently just a mathematical transformation of p (see the short sketch after this list), I don't think it really solves the priors problem here, although it seems to have other advantages.
2. This (http://www.nytimes.com/2008/01/06/magazine/06Vote-t.html?ref=magazine) is a really interesting article in the NY Times about voting machines: basically, everyone worries about them, but the public and the experts worry for very different reasons. The public tends to believe in deliberate fraud, while the experts seem to agree that the real problem is random error and votes lost to crashes, which could cast very tight races into doubt. Optical scanning of paper ballots is actually the preferred solution, since it enables hand recounts.
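
For concreteness on point 1, the p-rep transformation the comment mentions can be written directly as a function of a one-tailed p-value. This is just a sketch of Killeen's (2005) formula, not an endorsement of applying it to the NH data.

from math import sqrt
from scipy.stats import norm

def p_rep(p_one_tailed):
    # Killeen (2005): probability that a replication yields an effect in the
    # same direction, p_rep = Phi( Phi^{-1}(1 - p) / sqrt(2) ).
    z = norm.ppf(1.0 - p_one_tailed)
    return norm.cdf(z / sqrt(2.0))

print(round(p_rep(0.025), 3))  # roughly 0.917 for a one-tailed p of .025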

I was about ready for the stats to show no problem. But then the Nashua Ward 5 result came out.

Somehow a systematic (human) error in that ward is supposed to explain why every candidate BUT Obama was scaled up by about 9% in the original count.

Hillary, Kucinich, Edwards, and Richardson all lost around 9% in the recount, while Obama went up, but by well under 1%.

Question: to what extent were the models looking for a Clinton effect? I wonder whether comparing the Obama vote against the combined vote for everyone else might reveal a stronger signal, if one is there.

I have faith in the stats to pick up a problem, and they seemed to, but now they don't; still, this just doesn't seem reasonable against the explanation being provided:
1030 x 0.93 [HILLARY]
405 x 0.93 [EDWARDS]
9 x 0.88 [BIDEN]
72 x 0.96 [RICHARDSON]
673 x 1.01 [OBAMA]
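
Working through the commenter's figures, and assuming each line is (original machine tally) x (recount / original ratio), which is my reading rather than something stated explicitly, the implied per-candidate changes are easy to tabulate:

# Figures as given in the comment above; the interpretation of the second
# number as recount/original is an assumption.
ward5 = {
    "Clinton":    (1030, 0.93),
    "Edwards":    (405, 0.93),
    "Biden":      (9, 0.88),
    "Richardson": (72, 0.96),
    "Obama":      (673, 1.01),
}
for name, (original, ratio) in ward5.items():
    recount = original * ratio
    print(f"{name:10s}: {original:4d} -> {recount:7.1f} ({(ratio - 1) * 100:+.0f}%)")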