Saturday, June 14, 2008

We know more than we think (Big Change #2)

The other major change to our methodology (which I am surprised nobody guessed in the teaser thread) is that we are now making adjustments to the results of all states based on a time trend.

One of the problems with our previous way of doing things is that polling data tends to roll in at different times in different states. Both state and national polls conducted since the conclusion of the Democratic nomination process have reflected a bounce of a few points for Barack Obama. For example, we know that Barack Obama has experienced a bounce in his polling results in states like Wisconsin, Michigan and New Jersey, as well as in both the Rasmussen and Gallup national tracking polls. It would be naive to assume that Obama won't also experience a bounce in other states like Pennsylvania and Ohio where new polling data has yet to come out. However, we've had no way to account for these changes in states where the polling data is not fresh.

Our objective, then, is to infer what is likely to happen in states where we don't have fresh polling data based on those states where we do. In order to make such an inference, I apply a four-step process. A version of this process was suggested to me by Professor Robert Erikson of Columbia University, who has spent his lifetime studying polling and public opinion, and who is also a family friend.

Step 1: All polls are placed into groups based on (i) the week of the election; and (ii) the state-pollster unit. A state-pollster unit is a combination of a particular state and a particular pollster; for example "Alabama-SurveyUSA" or "New York-Quinnipiac". The current week is defined as having begun seven days before the current date, with weeks progressing backward from there to the start of the calendar year 2008. One very important note: we treat national polls as a "state". For example, there are units for "USA-Rasmussen Tracker" and "USA-Gallup Tracker". One of the most useful elements of national polls, and particularly national tracking polls, is that they provide a robust baseline for measuring changes in candidate support. We do not include national polls directly in our averages. We do use them, however, to help infer trends, which in turn can inform our state-by-state projections.
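A minimal sketch of the Step 1 grouping, not the site's actual code. It assumes that "weeks" are counted backward from the current date in 7-day blocks (so a poll 0 to 6 days old falls in week 0), and that margins are stored as Obama minus McCain. The function names and the Gallup numbers are hypothetical; only the Quinnipiac Florida figure comes from this post.

```python
from collections import defaultdict
from datetime import date


def weeks_back(poll_date, current_date):
    """Number of whole weeks between a poll and the current date (0 = current week)."""
    days_old = (current_date - poll_date).days
    if days_old < 0:
        raise ValueError("poll is dated after the current date")
    return days_old // 7


def group_polls(polls, current_date):
    """Bucket poll margins by (weeks back, state-pollster unit)."""
    groups = defaultdict(list)
    for state, pollster, poll_date, margin in polls:
        unit = f"{state}-{pollster}"
        groups[(weeks_back(poll_date, current_date), unit)].append(margin)
    return dict(groups)


# Illustrative polls: (state, pollster, date, Obama margin in points).
# National trackers are filed under the pseudo-state "USA".
# The Gallup margins below are made up for the example.
polls = [
    ("USA", "Gallup Tracker", date(2008, 6, 12), 6.0),
    ("USA", "Gallup Tracker", date(2008, 6, 10), 4.0),
    ("Florida", "Quinnipiac", date(2008, 5, 17), -4.0),
]
groups = group_polls(polls, current_date=date(2008, 6, 14))
```

Under this convention the week-4 bucket covers 5/11 through 5/17, which lines up with a "week ending 5/17".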

Step 2: We run a linear regression with a large number of dummy variables. Specifically, we include one dummy variable for each week, and one dummy variable for each state-pollster unit. The coefficients of the weekly dummy variables give us an inkling of a time trend. Specifically, the time trend looks like this:



Let me explain exactly what is going on here. Suppose that in Week 15, Rasmussen shows Barack Obama 6 points ahead in Minnesota. Then, in Week 22, it shows him 9 points ahead in Minnesota. This is a piece of information implying that Obama's standing was 3 points better in Week 22 than in Week 15. If we apply this process to all state-pollster units, we get quite a lot of information about which way the polls are changing. That's all that this process is doing. It's taking the changes that we see in each poll where we have a baseline for comparison, and inferring an overall time trend based on those changes.
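The dummy-variable regression in Step 2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the model actually used here: it fits one coefficient per state-pollster unit and one per week by ordinary least squares, drops one week as the baseline to avoid collinearity, and ignores any poll weighting. The function names and the Ohio numbers are hypothetical; the Minnesota numbers are the ones from the example above.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("weeks are not linked by shared state-pollster units")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_week_effects(polls, baseline_week):
    """polls: (unit, week, margin) tuples. Returns {week: effect vs. baseline_week}."""
    units = sorted({u for u, _, _ in polls})
    weeks = sorted({w for _, w, _ in polls if w != baseline_week})
    cols = [("unit", u) for u in units] + [("week", w) for w in weeks]
    index = {c: i for i, c in enumerate(cols)}
    k = len(cols)
    # Accumulate the normal equations (X'X) beta = X'y one poll at a time.
    XtX = [[0.0] * k for _ in range(k)]
    Xty = [0.0] * k
    for unit, week, margin in polls:
        active = [index[("unit", unit)]]
        if week != baseline_week:
            active.append(index[("week", week)])
        for i in active:
            Xty[i] += margin
            for j in active:
                XtX[i][j] += 1.0
    beta = solve(XtX, Xty)
    effects = {baseline_week: 0.0}
    for w in weeks:
        effects[w] = beta[index[("week", w)]]
    return effects


# Minnesota numbers are from the example above; the Ohio rows are made up.
polls = [
    ("Minnesota-Rasmussen", 15, 6.0),
    ("Minnesota-Rasmussen", 22, 9.0),
    ("Ohio-SurveyUSA", 15, -2.0),
    ("Ohio-SurveyUSA", 22, 2.0),
]
effects = fit_week_effects(polls, baseline_week=15)
```

Because each unit gets its own dummy, a pollster's house effect or a state's partisan lean is absorbed by the unit coefficient, and the week coefficients pick up only the within-unit changes.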

Step 3: The time trend is smoothed by means of a LOESS regression. You probably don't think you know what a LOESS regression is, but if you've ever been over to Pollster.com, you have seen one. A LOESS regression is a way to create smooth curves through time series data. In our case, that curve looks like this:



When running a LOESS regression, one may choose a "smoothing parameter" that determines how sensitive the regression line is to changes in the data. I use a fairly conservative smoothing parameter, tending toward a smoother rather than a jerkier curve. Nevertheless, we can make out a few fairly clear trends. Obama's numbers surged in February, when he was winning one primary after another. They slumped in March and early April, as stories like bittergate and Jeremiah Wright dominated the landscape. They have since been gradually improving, but particularly so in the last two weeks since he wrapped up the nomination.
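For readers who want to see the mechanics, here is a bare-bones LOESS sketch: a local linear fit with tricube weights, where `span` plays the role of the smoothing parameter (the fraction of points used in each local fit). It is a simplified illustration, not the implementation or the parameter value used for the chart above, and the weekly numbers are hypothetical.

```python
import math


def loess(xs, ys, span=0.75):
    """Smooth ys over xs with locally weighted linear regression."""
    if not 0 < span <= 1:
        raise ValueError("span must be in (0, 1]")
    n = len(xs)
    k = min(n, max(2, math.ceil(span * n)))  # points in each local window
    smoothed = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
        # Tricube weights: near points count most, points at or beyond h count zero.
        w = []
        for x in xs:
            u = abs(x - x0) / h
            w.append((1 - u ** 3) ** 3 if u < 1 else 0.0)
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Hypothetical raw weekly coefficients (week number, points), not real data.
weeks = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
raw = [0.0, 1.8, 0.9, 2.6, 1.7, 3.4, 2.9, 4.1]
smooth_conservative = loess(weeks, raw, span=0.9)  # smoother curve
smooth_sensitive = loess(weeks, raw, span=0.4)     # follows the data more closely
```

A larger `span` pulls more weeks into each local fit, which corresponds to the more conservative, less jerky curve described above.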

Step 4: Polls from previous weeks are adjusted to match the LOESS estimate from the current week. For example, our LOESS regression line tells us that an average poll in the current week has been about 2.5 points stronger for Barack Obama than a poll in the week ending 5/17. Thus, the Quinnipiac poll of Florida taken on 5/17, which showed John McCain ahead by 4 points, is treated as though it had shown McCain ahead by 1.5 points (i.e. 2.5 points better for Obama). The idea, simply put, is to make all old data match the current polling landscape.
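The arithmetic of Step 4 fits in a few lines. This sketch assumes margins are expressed as Obama minus McCain and that the smoothed trend is stored as a dictionary keyed by weeks back from today (0 = current week); both conventions, and the function name, are assumptions made for illustration. Only the 2.5-point gap between the two weeks comes from the post; the absolute trend levels are arbitrary.

```python
def trend_adjust(margin, poll_week, current_week, trend):
    """Shift an old poll by the change in the smoothed trend since it was taken."""
    return margin + (trend[current_week] - trend[poll_week])


# Smoothed trend levels by weeks back; only the 2.5-point difference matters.
trend = {0: 2.5, 4: 0.0}

# Quinnipiac Florida, 5/17: McCain +4, i.e. an Obama margin of -4.
adjusted = trend_adjust(-4.0, poll_week=4, current_week=0, trend=trend)
print(adjusted)  # -1.5, i.e. McCain ahead by 1.5
```

A poll taken in the current week is left unchanged, since the trend difference is zero.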

* * *

From there, everything proceeds as it always has. We still run a demographic regression, although it is based on the trend-adjusted polls rather than the original ones. (Also, I am now referring to our result in each state as a "projection" rather than an "average", as that nomenclature is more consistent with our process.)

This adjustment presently results in an increase of about 2 points in Barack Obama's projected popular vote margin. Because a large number of states in this election are very close, this results in a somewhat dramatic-seeming change in Obama's win percentage and electoral vote projection. Interestingly, Obama's current win percentage of 64.7 percent almost exactly matches the price of Democratic contracts on Intrade, which also has the Democrats with a 64 percent chance of winning the election.

166 comments

Brandon said...

I don't think your statistics should be based on assumptions though. For example, Obama lost 6 points in his Rasmussen poll in Oregon after he locked up the nomination.

Slack said...

What establishes the 0/0 baseline in the super tracker? An average of all polls, or something like that?

Nate said...

He did lose 6 points in Rasmussen's Oregon poll. And that result is accounted for in our extraction of his time trend.

Anonymous said...

Very interesting - I looked at this for the first time since yesterday and wondered what the hell had happened.

If you get a chance, one thing that would be interesting to see is how this new methodology would affect the old Clinton-McCain stats. I ask because it seems to me (not that I know a lot about this) that none of this inherently helps the Democrat, and it would say something about the relative chances both have vs McCain (for people still arguing about it, including myself).

Benjamin Johnstone-Anderson said...

Nate,

Interesting changes, although as a purist, I almost wish there were a "classical version." Any way to account for primary-season related bumps, clearly visible in OR & IN polling (in different ways)? Or are you just waiting for those to "roll off"?

Anonymous said...

Brandon, from what I understand, I don't think that's quite what Nate is doing.

He's basically trying to account for periods like the middle of Wrightgate when Obama's numbers decreased dramatically. The goal is for all polling to correlate to the current environment. If there was another crisis for Obama's campaign, the model would reflect THAT environment as well.

All in all, the change IS rather dramatic. I will be curious to see if future state polls reflect the map as Nate currently has it.

Anonymous said...

Could you maybe add a column for the LOESS adjustment factor for each poll in the state-by-state polling detail?

Preyanka said...

Nate, this might be a stupid question for people who are much more knowledgeable about what you're doing, but for someone who just happens upon your site and doesn't understand the methodology, are the numbers at the upper left (pie charts) a snapshot of the current race, or are they predicting the outcome?

Thanks:)

Silifi said...

I liked the old methodology better. This just ruins the fact that it is a state-by-state analysis.

How do you know that Obama's uptrend is going to be the same in all states? When we move forward into a general election mode, are you telling me that if Obama campaigned really hard in one state, that would somehow uptick his ratings in other states?

For example, if Obama spent a lot of time in Virginia, we would expect that he goes up in the Virginia polls. However, under this new formula, you're assuming that because he went up in Virginia, he's also going to go up in Ohio. When, in fact, the voters in Ohio don't care that he's campaigned in Virginia.

A state trend is not a national trend, and it shouldn't be treated like one.

I would prefer it if you returned to the old projection formula.

Icyclemort said...

Hi Nate,

You use both state and national polls to infer the time trend (placing national polls as "their own state").

What you didn't elaborate on is how much weight you assign state vs national polls for your time trend?


-----------

It is also likely that states don't react identically to "mood changes" in the whole country.

a) This does introduce another element of uncertainty for individual state projections.

b) Do you plan to adjust for that? E.g. observing during the GE that Florida is not nearly as volatile as (e.g.) Ohio or Virginia?


Regards,
Icyclemort

Anonymous said...

I tend to agree with Silifi, you are assuming that all states are equal in demographics, exposure to ads, local races, etc. In making the projections more sophisticated, you risk actually oversimplifying the model.

Josh said...

Great Work! Do you think that you will put others in your data? Barr could change the results in states like AK, MT, ND, and SD.

Anonymous said...

If Silifi's objection is valid (I myself got lost in the ins and outs of the model, so I can't say for sure) then the globalizing of state-specific trends robs the model of much of its resolution.

james said...

Hmm, interesting. I think this is a good change. There just aren't enough polls for a lot of states now. However, closer to the election, that might change. If it does, will you drop factoring in the Super Tracker?

Slack said...

Echoing thoughts of others, I think this model is a simplification, but one that's very strongly shown in the results.

If Obama does something that's perceived as rejecting African-Americans over the coming months, your model will show the hit to his numbers as a small one in all states, when in fact the change will vary tremendously from state to state.

In other words, the changes you model have reasons behind them, and those reasons cannot be equally applied nationally.

I think the data you have is useful, but I'm not sure that messing with poll results from