Saturday, June 14, 2008

We don't know as much as we think (Big Change #1)

There are two major changes to my methodology, which you already see reflected in the new charts and graphics that are presently on the site. This is the easier of the two to explain, so let's handle it first.

Andrew Gelman of Columbia University was kind enough to share some of his old national polling data with me. His dataset runs from 1952 through 1992. I took his data from 1988 and 1992 (before 1988, there are only a limited number of polls available), then combined it with the data I already had for 2000 and 2004, and tracked down some 1996 data in a magical place on the internet.

If you had looked back at the polls in June in the five previous election cycles, what would you have found?

In 1988, Michael Dukakis was ahead by an average of 8.2 points in 5 June polls. In November, George Bush won by 7.8 points.

In 1992, George Bush was ahead by an average of 4.9 points in 14 June polls. In November, Bill Clinton won by 5.6 points.

I don't actually have any June polls for 1996 (if anybody's sitting on a big stash of Clinton-Dole data, you know where to find me). But in Gallup's July poll, Bill Clinton led by 17 points. In November, Clinton won by 8.5 points.

In 2000, George W. Bush was ahead by an average of 4.7 points in 14 June polls. In November, Al Gore won the popular vote by 0.5 points.

In 2004, John Kerry was ahead by an average of 0.9 points in 16 June polls (this was pretty much his high-water mark all year). In November, George W. Bush won by 2.4 points.

So in four out of the last five elections, an average of June polls would have incorrectly picked the winner of the popular vote. That's kind of a problem for anybody who is overly confident about how this election is going to turn out.
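For anyone who wants to check that tally, here is a minimal Python sketch using only the numbers quoted above. The sign convention (Democratic lead positive) and the names `CYCLES` and `summarize` are my own choices for illustration, not part of the model, and 1996 uses the July Gallup poll as a stand-in for June.

```python
# Margins in points, Democratic lead positive, taken from the figures above.
# Each entry is (early-summer poll margin, November popular-vote margin).
# 1996 uses Gallup's July poll because no June polls were available.
CYCLES = {
    1988: (8.2, -7.8),   # Dukakis +8.2 in June, Bush won by 7.8
    1992: (-4.9, 5.6),   # Bush +4.9 in June, Clinton won by 5.6
    1996: (17.0, 8.5),   # Clinton +17 in July, Clinton won by 8.5
    2000: (-4.7, 0.5),   # Bush +4.7 in June, Gore won the popular vote by 0.5
    2004: (0.9, -2.4),   # Kerry +0.9 in June, Bush won by 2.4
}


def summarize(cycles):
    """Return the years the polls picked the wrong winner, the absolute
    miss on the margin for each year, and the mean absolute miss."""
    wrong_years = [
        year for year, (poll, result) in cycles.items()
        if (poll > 0) != (result > 0)
    ]
    misses = {year: abs(poll - result) for year, (poll, result) in cycles.items()}
    mean_miss = sum(misses.values()) / len(misses)
    return wrong_years, misses, mean_miss


wrong_years, misses, mean_miss = summarize(CYCLES)
print("Wrong winner in:", wrong_years)
for year, miss in misses.items():
    print(f"{year}: margin missed by {miss:.1f} points")
print(f"Mean absolute miss: {mean_miss:.1f} points")
```

With these five data points the polling average misses the final margin by several points in every cycle, which is the real issue; picking the wrong winner is just the most visible symptom.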

Previously, I had modeled the error in our polling averages based on 2004 data (simply because that's the data I had access to). The issue with that is that the polls were unusually stable in 2004. From April onward, John Kerry never held a lead of more than about 2 points in the Real Clear Politics national average, and George W. Bush never held a lead of more than 6 or 7 points. Those numbers pretty well framed the actual result of Bush +2.4. But as the Gelman data reveals, there was much more fluidity in previous years, and so modeling the error based on 2004 data alone would lead one to underestimate the degree of uncertainty inherent in a general election.

My error estimates are now modeled instead on the 1988-2004 dataset. I do give somewhat more weight to more recent cycles, as in general, the polls have tended to get closer to the actual margin more quickly in recent years. There are a few reasons to think this might not be an accident. For example, (i) the country has tended to get more partisan over time, meaning that there may be fewer true undecided voters than there used to be; (ii) with the proliferation of the Internet and cable news, voters now have more information about the candidates sooner than they used to; and (iii) the science of polling has probably improved over time. Nevertheless, we are accounting for quite a bit more error than we had been before.

If I look at the total miss for each poll based on the number of days until the election, I get the following, very pretty graph:



There is quite a lot of noise there, but the error can be modeled reasonably well as a function of the square root of the number of days until the election. Specifically, the curve I use looks like this:



Presently, with about 145 days to go until Election Day, we would anticipate that a typical national poll will be off by around 6-7 points. We do not know, unfortunately, which direction that miss is likely to be in. But there is reason to believe that the range of possible outcomes -- including scenarios where the election doesn't turn out to be especially close -- is wider than we had been assuming before.
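To make the square-root relationship concrete, here is a rough Python sketch. It assumes the simplest possible form, miss = coefficient × sqrt(days), with no intercept; the function name `expected_miss` and the coefficient of 0.54 are hypothetical, back-solved so that 145 days gives a miss in the 6-7 point range quoted above. It is not the fitted curve itself, which may well include a constant term (polls are not perfect even on Election Day).

```python
import math


def expected_miss(days_until_election, coefficient=0.54):
    """Typical miss, in points, of a national poll taken this many days out.

    Assumes miss = coefficient * sqrt(days). The default coefficient is an
    illustrative guess chosen to match "about 6-7 points at 145 days";
    it is not a fitted value.
    """
    if days_until_election < 0:
        raise ValueError("days_until_election must be non-negative")
    return coefficient * math.sqrt(days_until_election)


for days in (145, 100, 60, 30, 7):
    print(f"{days:>3} days out: roughly {expected_miss(days):.1f} points")
```

The shape matters more than the exact coefficient: because of the square root, cutting the time to the election in half only shrinks the expected miss by about 30 percent, so the uncertainty falls off slowly until the last few weeks.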

38 comments

theotherjosh said...

I am guessing the person leading in June tends to win in November?

theotherjosh said...

lol, woops. I didn't actually read the rest of the post. How embarrassing.

My Music Eats Faces said...

I agree with you that we don't know as much as we like to think we do. There's 145 days to go (like you say) and a helluva lot can happen in that much time. I think this could be a very close election or it could be a landslide. I do think if it is a landslide it'll be in Obama's favor. The only way I see McCain winning by a large margin is if something truly insane "comes out" regarding Obama (factually based of course not just lame rumors.)

Anonymous said...

Yes. This makes the "Electoral Vote Distribution" graph look a lot more sensible. Previously, it was predicting the near certainty of a very close election. But we are VERY early in the general election cycle, and this could certainly break one or the other, making 340-200 and 440-100 splits very much in the realm of the possible.

Anonymous said...

The big move in national party ID numbers also gives some reason to assume that the near-tied elections of 2000 and 2004 are unlikely to be repeated. Super-close elections (in the electoral count) are an historical anomaly rather than the rule.

Icyclemort said...

That explains the more spread out "electoral vote distribution".

It doesn't explain why the Obama vs McCain comparisons/stats jumped so suddenly.

kubla000 said...

I'm guessing that the Big Update #2 then has to do with adding Partisan Index back into the mix, as it didn't seem to be in the Regression before.

If you add Partisan Index + higher chance of variability, then certainly Obama's high side is more accurately reflected.

Still, all the discussion of June-November turnarounds has me worried. I hope this is a precedent-breaking year.

Anonymous said...

So your story in the NY Post that asserted that the 2008 election would be a nailbiter...is that still true in your opinion or no longer so?

Anonymous said...

Nate,

I just sent you an email with an explanation for this sort of "square root" behavior. In short, it's what you get if you model the evolution of voter preferences as a sort of random walk (or diffusion).

Slack said...

Ah - that makes complete sense. Do the error rates also apply to state polls though?

It seems a little odd to extrapolate the national errors down to individual states.

Also, you should probably change the Win% tag in your state by state bit so that it describes either McCain or Obama wins - the way it is right now doesn't serve you too well.

lilnev said...

"My error estimates are now modeled instead on the 1988-2004 dataset."

I'm not sure I understand. Does this mean that the amount of randomness added to each of the 10,000 simulation runs has been increased?

I'm guessing big update #2 is including the national tracking polls as a factor in the 538 regression.

Anonymous said...

Which explains the cluster of blowouts on either edge of the distribution graph.

Wagster said...

I'm confused. I never took your analysis to be a prediction of the winner of the election... I took it to be an indicator of the current state of play. How does this days-from-the-election margin of error affect the numbers you show?

Slack said...

Wagster-

Though I'm not certain, it's probable that the change means that all/national polls are not rated as strongly as they were - such that the regression is given greater weight.

The regression was also changed, which will probably be explained shortly.

Stephen C. Rose said...

Have you done anything with the Republican and Democratic votes in the primaries?

I based my predictions on states where Obama's vote was twice the total of McCain's and the Democratic TOTAL was twice the Republican total.

Best, S

Rabbit said...

The new numbers seem more intuitive. The only problem I see is Florida. Obama at 38% and McCain at 63% doesn't add up to 100%.

Anonymous said...

Rabbit... it probably rounds up.

Aranae said...

Nate,
I agree with your change in method and in the application of it to the data. The real difference that needs to be pointed out is that you have essentially gone from predicting what the election would look like if it took place today, to an attempt at predicting the November outcome based on what is currently available. Ultimately, as you state, the electoral vote distribution chart is now much flatter, reflecting the mathematical uncertainty. What hasn't changed, and maybe should, are your pie charts and state by state breakdowns. These are still showing only the results with the highest probability (or is it mean results?), but they do not display any indication of error. I recommend that you use light blue and pink to show 95% (or whatever your cutoff) CI (or is it posterior probability - what is your model anyway, a Markov chain?) on the Obama/McCain scores in the pie chart. That way it will be clear that although Obama wins according to the mean (or highest probability) prediction, your analysis cannot rule out a McCain victory with any confidence.

Basically the mean/highest scored result may be that Obama gets 308 electoral votes, but I think the pie charts should show that you can only predict with any mathematical certainty that he will get at least 100ish. The remaining 200ish in his score are not predicted to be present with any real confidence.