More interestingly (?), I've also given the regression model a makeover. The number of candidate variables that the model considers has been expanded from 8 to 16, and these are:
1. Kerry. John Kerry's vote share in 2004. Note that an adjustment is made in Massachusetts and Texas, the home states of Kerry and George W. Bush respectively, based on Al Gore's results in Massachusetts in 2000, and Bob Dole's results in Texas in 1996.
2. $_Obama (Obama model only) and $_Clinton (Clinton model only). The ratio of the amount of funds raised by Barack Obama and Hillary Clinton, respectively, to those raised by John Kerry in each state (once again, an adjustment is made in Massachusetts). This turns out to have a little bit more 'juice' than the way that I had been applying the fundraising data before.
4. $_McCain. This is the counterpart to #2 above: the ratio of John McCain's fundraising to George W. Bush's fundraising. An adjustment is made in Texas.
4. Partisan ID index. Per 2004 exit polls, the number of self-identified Democrats less the number of self-identified Republicans.
5. Evangelical. The proportion of white evangelical Protestants in each state.
6. Catholic. The proportion of Catholics in each state. Yes, Barack Obama is somewhat underperforming John Kerry's numbers among Catholics, while Hillary Clinton is slightly overperforming them.
7. Mormon. The proportion of LDS voters in each state, a.k.a. the Utah reality check (which presently isn't working for Barack Obama, since the only Utah poll showed him performing relatively well there).
Ethnic and Racial Identity
8. African-American. The proportion of African-Americans in each state. Somewhat predictably, Barack Obama is overperforming John Kerry's numbers among African-Americans while Hillary Clinton is underperforming them.
9. Hispanic. The number of Latino voters in each state as a proportion of overall voter turnout in 2004, as estimated by the Census Bureau. I use data based on turnout rather than data based on the underlying population of Latinos because Latino registration and turnout vary significantly from state to state. They are much higher in New Mexico, for instance, which has many Hispanics who have been in the country for generations, than in Nevada, where many Hispanics are new migrants and are not yet registered.
10. "American". The proportion of residents who report their ancestry as "American" in each state, which tends to be highest in the Appalachians. See discussion here. Barack Obama performs very badly in states with significant numbers of "Americans", whereas Hillary Clinton outperforms John Kerry among this group.
11. PCI. Per capita income in each state.
12. Manufacturing. The proportion of jobs in each state that are in the manufacturing sector. Interestingly, both Democrats are outperforming John Kerry's numbers in these states, perhaps because they have been hit hardest by the recession. That is why a state like Indiana (which has the highest proportion of manufacturing jobs in the country) might be in play in November.
13. Senior. The proportion of the white population aged 65 or older in each state. Because life expectancy varies significantly among different ethnic groups, this version had more explanatory power than a version based on the entire (white and non-white) population.
14. Twenty. The proportion of residents aged 18-29 in each state, as a fraction of the overall adult population. The relationship between 'Senior' and 'Twenty' actually isn't all that strong -- there are some states like Idaho that have both relatively high numbers of older voters and relatively high youth turnouts -- so it helped to look at each variable independently.
15. Education. Average number of years of schooling completed for adults aged 25 and older in each state.
16. Suburban. The proportion of voters in each state that live in suburban environments, per 2004 exit polls.
Note that not all of these variables survive into the final model; the model drops variables that are not statistically significant via a stepwise process. However, I came to the conclusion that it was important for the model to evaluate a wider range of variables than it had before, because otherwise I'd have been cherry-picking based on my preconceived notions about what the electorate should look like. I hope you'll agree that this is an interesting and fairly representative set of variables. They were selected with an eye toward avoiding multicollinearity where possible, although in some cases (such as 'Kerry' and 'Partisan') this was impossible to avoid.
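For readers who want to see what a stepwise process of this kind looks like, here is a minimal sketch of backward elimination on ordinary least squares. It is an illustration under assumptions, not the procedure actually used for these models: the function names, the toy data, and the cutoff t_min=2.0 are all hypothetical. (The post does not state its cutoff, and the final models retain variables with t-scores as low as 1.32, so the real threshold is evidently looser than the one used here.)

```python
import numpy as np

def t_scores(X, y):
    """Return OLS coefficients and t-scores for design matrix X.

    X must already contain an intercept column.
    """
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))   # standard errors
    return beta, beta / se

def backward_stepwise(columns, y, t_min=2.0):
    """Drop the weakest variable until every |t| is at least t_min.

    columns: dict mapping variable name -> 1-D array of state-level values.
    t_min:   hypothetical cutoff, chosen only for illustration.
    Returns (kept, dropped) as lists of variable names.
    """
    kept = list(columns)
    dropped = []
    while kept:
        X = np.column_stack([np.ones(len(y))] + [columns[name] for name in kept])
        _, t = t_scores(X, y)
        abs_t = np.abs(t[1:])                 # ignore the intercept
        weakest = int(np.argmin(abs_t))
        if abs_t[weakest] >= t_min:
            break                             # everything left is significant
        dropped.append(kept.pop(weakest))
    return kept, dropped

# Toy data (not real election data): y depends on x1 only, and the noise
# is constructed to be uncorrelated with both candidate variables.
x1 = np.arange(1.0, 9.0)
x2 = np.array([1, -1, -1, 1, 1, -1, -1, 1], dtype=float)
noise = 0.5 * np.array([1, -1, 1, -1, -1, 1, -1, 1], dtype=float)
y = 3 + 2 * x1 + noise

kept, dropped = backward_stepwise({"x1": x1, "x2": x2}, y)
print("kept:", kept, "dropped:", dropped)
```

In the toy data the irrelevant variable x2 is eliminated on the first pass and x1 survives. A real run would pass in the 16 candidate variables with one value per state.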
The results have a fairly neutral effect overall. Barack Obama has lost ground in states like Pennsylvania, Florida, and Missouri, while gaining ground in other areas like Michigan, Wisconsin and Nevada -- generally forming a better match for the polling data we have on hand. There are fewer changes for Hillary Clinton, although the regression model now has her winning Ohio, which it had her losing before.
Also, I'm now using the model to take its best guess at the results in the District of Columbia, rather than applying John Kerry's numbers. That it projects Clinton to win DC by "only" 39 points is interesting, and reflects some of the issues that Clinton is having with black voters.
The model is presently specified as follows:
Obama model:
Variable           Coeff  t-score
Kerry              +.510     8.38
Evangelical        -.683    -6.13
$_Obama           +8.269     4.62
$_McCain         -15.987    -4.03
AfricanAmerican    +.409     3.78
Manufacturing      +.583     3.21
Senior             -.877    -3.03
Suburban           -.108    -2.73
Catholic           -.182    -2.40
"American"         -.609    -2.32
Hispanic           +.254     2.08
Education         +4.419     1.39
Dropped: PCI, Mormon, Partisan, Twenty
Clinton model:
Variable           Coeff  t-score
Kerry              +.726    14.19
$_McCain         -14.190    -4.83
"American"         +.869     3.62
$_Clinton         +3.895     2.80
AfricanAmerican    -.231    -2.74
Suburban           +.092     2.70
Education         -5.779    -2.44
Mormon             -.291    -2.19
Evangelical        -.214    -1.99
Twenty            +1.099     1.96
Catholic           +.102     1.59
Senior             +.502     1.49
Manufacturing      +.172     1.32
Dropped: PCI, Hispanic, Partisan
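As a reading aid for the tables above: a t-score is the coefficient divided by its standard error, so the standard error of each estimate can be recovered as coefficient / t-score. The sketch below does this for the first table (the one containing $_Obama) and attaches a rough 95% interval. The 1.96 multiplier is my assumption (a normal approximation); the post does not report degrees of freedom, and the exact critical value would be slightly larger.

```python
# Coefficient and t-score pairs copied from the first table above.
obama_model = {
    "Kerry": (0.510, 8.38),
    "Evangelical": (-0.683, -6.13),
    "$_Obama": (8.269, 4.62),
    "$_McCain": (-15.987, -4.03),
    "AfricanAmerican": (0.409, 3.78),
    "Manufacturing": (0.583, 3.21),
    "Senior": (-0.877, -3.03),
    "Suburban": (-0.108, -2.73),
    "Catholic": (-0.182, -2.40),
    '"American"': (-0.609, -2.32),
    "Hispanic": (0.254, 2.08),
    "Education": (4.419, 1.39),
}

Z_95 = 1.96  # assumption: normal approximation to the t distribution

def implied_interval(coeff, t_score, z=Z_95):
    """Return (standard error, lower bound, upper bound) implied by coeff and t."""
    se = coeff / t_score  # positive whenever coeff and t share a sign
    return se, coeff - z * se, coeff + z * se

summary = {name: implied_interval(c, t) for name, (c, t) in obama_model.items()}

for name, (se, low, high) in summary.items():
    flag = "" if low > 0 or high < 0 else "  (interval spans zero)"
    print(f"{name:<16} se={se:7.3f}  [{low:8.3f}, {high:8.3f}]{flag}")
```

Under this approximation only Education has an interval that spans zero, which matches its position at the bottom of the table with the smallest t-score.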