There have been some very good attempts to predict the outcome of the Pennsylvania primary on a district-by-district basis: see this one, or this one, or this one. That isn't going to stop me from trying to approach the problem on my own, however. The difference is that mine will be done entirely by the numbers: no human intervention, no judgment calls. This is not, by any means, inherently a good thing. But, well, it should be ... fun.
I'd been playing around with some state-by-state data in recent days, sort of trying to recreate my experiment back in February to predict the results of the remaining primaries -- which frankly was something of a failure. Then I realized that if I was going to try and predict things by Congressional District, I had better analyze data on a Congressional District level, and so I pulled up a whole bunch of data from American Fact Finder and away I went.
I looked at the results of all states that have held primaries so far on a CD-by-CD basis, with the exceptions of Louisiana, New Jersey, Tennessee, and Texas, which don't yet have results available on a Congressional District basis. I did include Florida, but I didn't include Michigan. I didn't include any caucuses -- except New Mexico, which has a caucus in name only -- nor did I include the beauty contest primary in Washington. I did include the District of Columbia.
And then I just went looking about for relationships in the data. There were nine variables, out of about twice as many candidates, that turned out to have a statistically significant relationship with Barack Obama's two-way vote share against Hillary Clinton in my regression model. Those variables are as follows:
1. Partisan Voting Index. Obama does somewhat better, all else being equal, in CDs with more Republicans. It does not seem to matter whether the state has an open or a closed primary; the effect is the same either way.
2. Percentage of Adults with Bachelor's Degrees or higher. Overedumucated folks like Obama!
3. Percentage of Seniors (Age 65+ Adults), out of all Adults. Old folks don't like Obama!
4. Percentage of Young Voters (18-29), out of all Adults. Young folks do like Obama! It proved to be helpful to have both the "seniors" and the "young voters" variables included, because Obama's vote share by age group resembles something of an S-curve. He does significantly better with young voters and significantly worse with older voters, but everything in between is pretty flat: a 34-year-old isn't that much more inclined to support Obama than a 52-year-old, for instance. It's just on the tails of the distribution where you see the effects.
5. Percentage of African-American voters. No surprise on this one, and far and away the strongest relationship in the dataset. Once again, I did not find any relationship between Hispanic voters and Obama's vote share. Latinos haven't voted for Obama in big numbers, but it appears that this can be entirely explained by other variables, like education levels.
6. Percentage of Urban Population. Obama actually does slightly worse in urbanized districts, all else being equal, although this is usually obscured by the fact that highly urbanized districts tend to have a lot of African-American voters. What we may be left with here is some of those white ethnics -- including Jewish voters -- that Chris Matthews likes to talk about. The Census Bureau also has a separate category for "urbanized clusters" -- its term for small towns -- and I looked at that too, but it didn't make any difference.
These next two are pretty interesting.
7. Percentage of Women of Working Age in the Active Workforce. I looked at the percentage of women who are employed out of all women aged 18-64 in the district. Obama does better when a higher percentage of women in the district are employed outside the home. This is arguably somewhat counterintuitive, since working women are supposed to be one of Hillary's main cohorts. But it seems like Hillary's real strength is with stay-at-home women, and not working women. Or it could be that areas in which a lot of women tend to work might have different attitudes about gender or other social norms in ways that tend to work to Obama's benefit. Either way, the variable is highly statistically significant.
8. Percentage of Residents who Identify themselves as "American". Recently, the Census Bureau has begun to ask for an ethnic classification in addition to a racial one (e.g. "Cuban", "Lithuanian"). However, about seven percent of Americans decline to check any of the boxes that the Census Bureau provides, and instead write in that they are simply "American". As you can see, this practice tends to be highly concentrated in certain parts of the country, especially the Appalachian/Highlands region:
To be perfectly blunt, this variable seems to serve as a pretty good proxy for folks that a lot of us elitists would usually describe as "rednecks". And for whatever reason, these "American" voters do not like Barack Obama. That is why he's getting killed in the polls in Kentucky and West Virginia, for instance, where there are high concentrations of them.
9. Home-State Variables. The last variables were dummies indicating the home state(s) of Hillary Clinton (she got credit for both New York and Arkansas) and Barack Obama (Illinois; I would also have given him credit for Hawaii but they held a caucus). The model seems to think that a primary candidate can expect about a 20-25 point bonus from campaigning in his home state.
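For anyone who wants to see the mechanics, here is a minimal sketch of this kind of regression: Obama's two-way vote share regressed on a few district-level variables plus a home-state dummy. Everything in it is an assumption for illustration. The data are synthetic, the coefficients in TRUE_BETA are made up, only three of the demographic variables appear, and the helper names (two_way_share, solve, ols) are hypothetical. It is not the model described above, just the general technique.

```python
import random

def two_way_share(obama_votes, clinton_votes):
    """Obama's share of the combined Obama + Clinton vote (other candidates ignored)."""
    return obama_votes / (obama_votes + clinton_votes)

def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up coefficients, used only to generate fake districts:
# intercept, pct bachelor's, pct seniors, pct African-American, Obama home state.
TRUE_BETA = [0.25, 0.20, -0.20, 0.30, 0.20]

random.seed(0)
rows, y = [], []
for i in range(200):  # 200 synthetic districts, not the real sample
    home = 1.0 if i < 10 else 0.0  # pretend the first 10 are in a home state
    row = [1.0, random.random(), random.random(), random.random(), home]
    rows.append(row)
    signal = sum(b * x for b, x in zip(TRUE_BETA, row))
    y.append(signal + random.gauss(0, 0.03))  # fake two-way vote share

beta_hat = ols(rows, y)

# R-squared: share of the variance in vote share explained by the fit.
fitted = [sum(b * x for b, x in zip(beta_hat, r)) for r in rows]
mean_y = sum(y) / len(y)
ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
ss_tot = sum((obs - mean_y) ** 2 for obs in y)
r_squared = 1 - ss_res / ss_tot
```

The dummy's coefficient in beta_hat plays the role of the home-state bonus: it is the shift in vote share for home-state districts after the other variables are accounted for.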
Other variables I looked at but that did not make the cut:
(i) a whole bunch of things related to income, poverty levels and economics, including some broad occupational categories like manufacturing workers and service employees. It appears that education drives the differences in support between Clinton and Obama, rather than economic class or income levels.
(ii) As I mentioned, the Hispanic variable had no significant impact; neither did a variable for Asians.
(iii) The number of college students in the area was not relevant, probably because it's very redundant with our twentysomethings variable.
(iv) Open versus closed primary status appeared to make no difference whatsoever.
Turning back to the case of Pennsylvania, here is what each of those variables looks like in the 19 Congressional Districts in the state:
And here is that map again, since I know this is a long post:
And now we can get into numbers. Keep in mind that all of these predictions you see are what my model tells me; I have not fiddled with them in any way.
CD 1: South Philadelphia, and some burbs (7 delegates). A very working class section of Philadelphia -- just 15.5% of adults have advanced degrees -- which otherwise would not be particularly favorable to Obama, but for the high concentration of African Americans. This district will almost certainly be split 4-3 in Obama's favor. Obama 58-42 popular vote, 4-3 delegates.
*CD 2: West Philadelphia, and some burbs (9 delegates). This is sort of that Maryland region of Philadelphia -- pretty much everyone is black or well-educated, and many are both. Obama should get very close to the 72.5% threshold he needs to get 7 delegates, but the model has him falling just a bit short. Obama 71-29 popular vote, 6-3 delegates.
CD 3: Northwest/Erie (5 delegates). Maybe not as bad for Obama as it's been made out to be. It's working class, but the electorate tilts slightly young and slightly Republican, both of which are favorable to Obama. He'll lose, but is not in much danger of a 1-4 split. Clinton 59-41 popular vote, 3-2 delegates.
CD 4: NW Pittsburgh Suburbs (5 delegates). This is actually a highly educated district -- not part of "Pennsyltucky" -- and Obama might even make a run of it, if not for the fact that the district tilts very old; just 15% of the electorate is under age 29. Clinton 59-41 popular vote, 3-2 delegates.
CD 5: North Central -- State College (5 delegates). The extremely high concentration of young voters around Penn State University should hold Clinton to a 3-2 delegate split. Neither an Obama win nor a 4-1 split for Clinton is very likely; this is not a swing district. Clinton 57-43 popular vote, 3-2 delegates.
CD 6: Southeast -- Berks and Chester Counties (6 delegates). This would be a swing district, except that there are an even number of delegates. It's among the most highly educated districts in the state, but the rest of the demographics tend to favor Clinton. Those two things will roughly balance out. Whoever wins here might get some bragging rights in the exit polls, however. Obama 51-49 popular vote, 3-3 delegate split.
*CD 7: Western Philadelphia Suburbs (7 delegates). This is the Joe Sestak district, and it should be extremely close. It has the highest education levels in the state, and lots of two-income households. But this is counterbalanced by a slightly older electorate, and a lot of white ethnic voters. Clinton 50.3-49.7 popular vote, 4-3 delegates.
CD 8: Southeast -- Bucks County (7 delegates). Slightly more working-class than the other suburban districts, with slightly fewer African-Americans and young voters. All the differences are subtle, but they add up to project a somewhat comfortable win for Clinton. Clinton 55-45 popular vote, 4-3 delegates.
CD 9: South Central -- Altoona (3 delegates). The least educated district in the state and otherwise a mess for Obama. The one saving grace for Obama is that it's also the most Republican district in the state, so Clinton's institutional support will be less effective here. There are just 3 delegates in play; Clinton won't get the huge margins she'd need to win all 3. Clinton 61-39 popular vote, 2-1 delegates.
*CD 10: Northeast -- Susquehanna Valley (4 delegates). The low education levels are a problem for Obama, but it also has some genuinely rural areas, and Obama tends to fare OK among farming and agricultural populations. That's not much for him to hang his hat on, however, and Clinton could conceivably get a 3-1 split here. Clinton 59-41 popular vote, 2-2 delegate split.
CD 11: Northeast -- Scranton (5 delegates). There's not any one thing that really stands out in this district -- it's just that a whole bunch of little things point toward Clinton, including the upbringing she claims in the region. It's a heterogeneous enough area, however, that Clinton is unlikely to do better than to win 3 out of the 5 delegates. Clinton 63-37 popular vote, 3-2 delegates.
*CD 12: Southwest -- Johnstown (5 delegates). The worst district in the state for Obama, and the one where he does need to be worried about a 4-1 split. Lots of things work out badly for him; it's among the least educated districts in the state, but also has the highest share of seniors. Still, the model says the tall order of a 4-1 split will not quite come through for Clinton. Clinton 67-33 popular vote, 3-2 delegates.
CD 13: Southeast -- Montgomery County (7 delegates). Nearly identical demographically to CD 8 (Bucks County), and we should see a similar result. Clinton 56-44 popular vote, 4-3 delegates.
*CD 14: Pittsburgh, and some suburbs (7 delegates). One very much to watch on election night. Obama will probably win the city of Pittsburgh itself, where the African-American population is high, but the outlying regions convince the model to tip it slightly toward Clinton. Clinton 52-48 popular vote, 4-3 delegates.
CD 15: East -- Allentown (5 delegates). This is basically a prototypical Pennsylvania district, and Obama is liable to lose prototypical Pennsylvania districts when they don't have many black voters. Given the math, it will almost certainly be a 3-2 split for Clinton. Clinton 59-41 popular vote, 3-2 delegates.
CD 16: Southeast -- Lancaster (4 delegates). A mixed bag for both candidates: the district tilts somewhat young, but it also has a high percentage of "Americans". It also has a large Amish population, who presumably are tough to get included in surveys. With an even number of delegates, the result is very likely to be a split. Clinton 55-45 popular vote, 2-2 delegate split.
CD 17: East Central -- Harrisburg (4 delegates). Another very typical, if somewhat conservative, district. With an even number of delegates, it is not worth a lot of attention. Clinton 56-44 popular vote, 2-2 delegate split.
CD 18: Western Pittsburgh Suburbs (5 delegates). Like the other Pittsburgh suburban district, it should not be mistaken for Hicksville -- the population is quite educated. But also like CD-4, the demographics are otherwise favorable to Clinton. This is almost definitely locked in to a 3-2 split. Clinton 60-40 popular vote, 3-2 delegates.
CD 19: South Central -- York (4 delegates). An interesting mix: the district definitely has some Pennsyltucky regions, but education levels are about average, and it has the state's highest percentage of women in the workforce. Again, the even number of delegates is likely to rob us of any drama. Clinton 53-47 popular vote, 2-2 delegate split.
CD-level total: Clinton 55 delegates, Obama 48 (Clinton +7).
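The delegate arithmetic running through the district notes can be sketched in a few lines. This assumes a plain round-to-nearest proportional split between two candidates who both clear any viability threshold; that rule is my simplification, not something spelled out above, and the function names are hypothetical. Exact fractions are used so that a share sitting right on a cutoff is not thrown off by floating-point error; an exact tie on a cutoff goes to Clinton here, which is an arbitrary choice.

```python
from fractions import Fraction
import math

def split_delegates(clinton_pct, n_delegates):
    """Return (clinton, obama) delegates for a two-way race.

    Assumes simple proportional allocation rounded to the nearest whole
    delegate, with both candidates viable. An exact half rounds up for Clinton.
    """
    share = Fraction(str(clinton_pct)) / 100
    clinton = math.floor(share * n_delegates + Fraction(1, 2))
    return clinton, n_delegates - clinton

def threshold_for(k, n_delegates):
    """Smallest vote share (in percent) that rounds up to k of n delegates."""
    return float(Fraction(2 * k - 1, 2 * n_delegates) * 100)

# A 59-41 win is worth 3-2 in a 5-delegate district...
five_way = split_delegates(59, 5)
# ...but the same margin only produces a 2-2 split where there are 4 delegates.
four_way = split_delegates(59, 4)

# Cutoffs for the lopsided splits mentioned in the district notes.
need_for_4_of_5 = threshold_for(4, 5)
need_for_3_of_4 = threshold_for(3, 4)
```

Under these assumptions a 4-1 split in a 5-delegate district takes 70% of the two-way vote and a 3-1 split in a 4-delegate district takes 62.5%, which is why so many of the even-numbered districts end up as a wash.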
Overall, the model is spitting out some very sensible results: it expects Clinton to win the state by 7-8 percentage points. That would translate to a pickup of about 120,000 popular votes. She also picks up a net of 5 delegates at the statewide level, for a 12-delegate win overall:
How did I get those popular vote figures, you might be wondering? I ran a regression on those numbers too -- it turns out that turnout is really quite predictable (the R-squared on my turnout model is close to .9, whereas it's .8 for the vote share model). While I won't go into too much detail on the specifics just now, one interesting (if intuitive) finding is that turnout depends on how close the polls are. If Pennsylvania does tighten further, we can expect more people to go to the voting booth next Tuesday. These turnout figures might look a little low as compared with Ohio, but keep in mind that Pennsylvania has a closed primary -- which the model thinks may reduce turnout by as much as 30-40% over what it would be otherwise.
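As a back-of-the-envelope check on how a margin and a turnout figure combine into a vote pickup, here is a small sketch that works backward from the two numbers quoted above (a 7-8 point win and a pickup of about 120,000 votes). The function name is hypothetical, and the implied totals are only what those two rounded figures suggest together, not the output of the turnout model.

```python
def implied_turnout(net_votes, margin_points):
    """Total two-way votes implied by a net vote gain and a margin in percentage points."""
    return net_votes * 100 / margin_points

# An 8-point win worth 120,000 votes implies a smaller electorate
# than a 7-point win worth the same number of votes.
turnout_at_8 = implied_turnout(120_000, 8)
turnout_at_7 = implied_turnout(120_000, 7)
```

That puts the implied two-way turnout at roughly 1.5 to 1.7 million votes.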
by Nate Silver @ 12:58 PM