
There have been some very good attempts to predict the outcome of the Pennsylvania primary on a district-by-district basis: see this one, or this one, or this one. That isn't going to stop me from trying to approach the problem on my own, however. The difference is that mine will be done entirely by the numbers: no human intervention, no judgment calls. This is not, by any means, inherently a good thing. But, well, it should be ... fun.
I'd been playing around with some state-by-state data in recent days, sort of trying to recreate my experiment back in February to predict the results of the remaining primaries -- which frankly was something of a failure. Then I realized that if I was going to try and predict things by Congressional District, I had better analyze data on a Congressional District level, and so I pulled up a whole bunch of data from American Fact Finder and away I went.
I looked at the results of all states that have held primaries so far on a CD-by-CD basis, with the exceptions of Louisiana, New Jersey, Tennessee, and Texas, which don't yet have results available on a Congressional District basis. I did include Florida, but I didn't include Michigan. I didn't include any caucuses -- except New Mexico, which has a caucus in name only -- nor did I include the beauty contest primary in Washington. I did include the District of Columbia.
And then I just went looking about for relationships in the data. There were nine variables, out of about twice as many candidates, that turned out to have a statistically significant relationship with Barack Obama's two-way vote share against Hillary Clinton in my regression model. Those variables are as follows:
1. Partisan Voting Index. Obama does somewhat better, all else being equal, in CDs with more Republicans. It does not seem to matter whether the state has an open or a closed primary; the effect is the same either way.
2. Percentage of Adults with Bachelor's Degrees or higher. Overedumucated folks like Obama!
3. Percentage of Seniors (Age 65+ Adults), out of all Adults. Old folks don't like Obama!
4. Percentage of Young Voters (18-29), out of all Adults. Young folks do like Obama! It proved to be helpful to have both the "seniors" and the "young voters" variables included, because Obama's vote share by age group resembles something of an S-curve. He does significantly better with young voters and significantly worse with older voters, but everything in between is pretty flat: a 34-year-old isn't that much more inclined to support Obama than a 52-year-old, for instance. It's just on the tails of the distribution where you see the effects.
5. Percentage of African-American voters. No surprise on this one, and far and away the strongest relationship in the dataset. Once again, I did not find any relationship between Hispanic voters and Obama's vote share. Latinos haven't voted for Obama in big numbers, but it appears that this can be entirely explained by other variables, like education levels.
6. Percentage of Urban Population. Obama actually does slightly worse in urbanized districts, all else being equal, although this is usually obscured by the fact that highly urbanized districts tend to have a lot of African-American voters. What we may be left with here is some of those white ethnics -- including Jewish voters -- that Chris Matthews likes to talk about. The Census Bureau also has a separate category for "urbanized clusters" -- its term for small towns -- and I looked at that too, but it didn't make any difference.
These next two are pretty interesting.
7. Percentage of Women of Working Age in the Active Workforce. I looked at the percentage of women who are employed out of all women aged 18-64 in the district. Obama does better when a higher percentage of women in the district are employed outside the home. This is arguably somewhat counterintuitive, since working women are supposed to be one of Hillary's main cohorts. But it seems like Hillary's real strength is with stay-at-home women, and not working women. Or it could be that areas in which a lot of women tend to work might have different attitudes about gender or other social norms in ways that tend to work to Obama's benefit. Either way, the variable is highly statistically significant.
8. Percentage of Residents who Identify themselves as "American". Recently, the Census Bureau has begun to ask for an ethnic classification in addition to a racial one (e.g. "Cuban", "Lithuanian"). However, about seven percent of Americans decline to check any of the boxes that the Census Bureau provides, and instead write in that they are simply "American". As you can see, this practice tends to be highly concentrated in certain parts of the country, especially the Appalachian/Highlands region:
To be perfectly blunt, this variable seems to serve as a pretty good proxy for folks that a lot of us elitists would usually describe as "rednecks". And for whatever reason, these "American" voters do not like Barack Obama. That is why he's getting killed in the polls in Kentucky and West Virginia, for instance, where there are high concentrations of them.
9. Home-State Variables. The last variables were dummies indicating the home state(s) of Hillary Clinton (she got credit for both New York and Arkansas) and Barack Obama (Illinois; I would also have given him credit for Hawaii but they held a caucus). The model seems to think that a primary candidate can expect about a 20-25 point bonus from campaigning in his home state.
Other variables I looked at but that did not make the cut:
(i) a whole bunch of things related to income, poverty levels and economics, including some broad occupational categories like manufacturing workers and service employees. It appears that education drives the differences in support between Clinton and Obama, rather than economic class or income levels.
(ii) As I mentioned, the Hispanic variable had no significant impact; neither did a variable for Asians.
(iii) The number of college students in the area was not relevant, probably because it's very redundant with our twentysomethings variable.
(iv) Open versus closed primary status appeared to make no difference whatsoever.
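For readers who want to see the mechanics, here is a minimal sketch of the kind of regression described above: ordinary least squares on district-level features, with a dummy variable for a home state. Everything in it is an assumption made for illustration. The function name `fit_ols`, the two features, and the toy rows are all made up; they are not the model, the variable list, or the Census data used for this post. The toy shares are generated from known coefficients with no noise, so the fit can be checked by eye.

```python
def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows:    list of feature lists, one per district (no intercept column).
    targets: list of two-way vote shares, one per district.
    Returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; model is not identified")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Toy rows, NOT real district data: [share of black voters, home-state dummy].
toy_rows = [[0.10, 0], [0.50, 0], [0.30, 1], [0.20, 0], [0.60, 1]]
# Shares built as 0.35 + 0.5 * black_share + 0.2 * home_state, with no noise.
toy_shares = [0.40, 0.60, 0.70, 0.45, 0.85]

intercept, slope_black, home_bonus = fit_ols(toy_rows, toy_shares)
print(round(intercept, 3), round(slope_black, 3), round(home_bonus, 3))
```

The coefficient on the dummy is what gets read as a "home-state bonus"; in a real fit it would come with a standard error, which this sketch does not compute.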
Turning back to the case of Pennsylvania, here is what each of those variables looks like in the 19 Congressional Districts in the state:
And here is that map again, since I know this is a long post:
And now we can get into numbers. Keep in mind that all of these predictions you see are what my model tells me; I have not fiddled with them in any way.
CD 1: South Philadelphia, and some burbs (7 delegates). A very working class section of Philadelphia -- just 15.5% of adults have advanced degrees -- which otherwise would not be particularly favorable to Obama, but for the high concentration of African Americans. This district will almost certainly be split 4-3 in Obama's favor. Obama 58-42 popular vote, 4-3 delegates.
*CD 2: West Philadelphia, and some burbs (9 delegates). This is sort of that Maryland region of Philadelphia -- pretty much everyone is black or well-educated, and many are both. Obama should get very close to the 72.5% threshold he needs to get 7 delegates, but the model has him falling just a bit short. Obama 71-29 popular vote, 6-3 delegates.
CD 3: Northwest/Erie (5 delegates). Maybe not as bad for Obama as it's been made out to be. It's working class, but the electorate tilts slightly young and slightly Republican, both of which are favorable to Obama. He'll lose, but is not in much danger of a 1-4 split. Clinton 59-41 popular vote, 3-2 delegates.
CD 4: NW Pittsburgh Suburbs (5 delegates). This is actually a highly educated district -- not part of "Pennsyltucky" -- and Obama might even make a run of it, if not for the fact that the district tilts very old; just 15% of the electorate is under age 29. Clinton 59-41 popular vote, 3-2 delegates.
CD 5: North Central -- State College (5 delegates). The extremely high concentration of young voters around Penn State University should hold Clinton to a 3-2 delegate split. Neither an Obama win nor a 4-1 split for Clinton is very likely; this is not a swing district. Clinton 57-43 popular vote, 3-2 delegates.
CD 6: Southeast - Berks and Chester Counties (6 delegates). This would be a swing district, except that there are an even number of delegates. It's among the most highly educated districts in the state, but the rest of the demographics tend to favor Clinton. Those two things will roughly balance out. Whoever wins here might get some bragging rights in the exit polls, however. Obama 51-49 popular vote, 3-3 delegate split.
*CD 7: Western Philadelphia Suburbs (7 delegates). This is the Joe Sestak district, and it should be extremely close. It has the highest education levels in the state, and lots of two-income households. But this is counterbalanced by a slightly older electorate, and a lot of white ethnic voters. Clinton 50.3-49.7 popular vote, 4-3 delegates.
CD 8: Southeast -- Bucks County (7 delegates). Slightly more working-class than the other suburban districts, with slightly fewer African-Americans and young voters. All the differences are subtle, but they add up to project a somewhat comfortable win for Clinton. Clinton 55-45 popular vote, 4-3 delegates.
CD 9: South Central -- Altoona (3 delegates). The least educated district in the state and otherwise a mess for Obama. The one saving grace for Obama is that it's also the most Republican district in the state, so Clinton's institutional support will be less effective here. There are just 3 delegates in play; Clinton won't get the huge margins she'd need to win all 3. Clinton 61-39 popular vote, 2-1 delegates.
*CD 10: Northeast -- Susquehanna Valley (4 delegates). The low education levels are a problem for Obama, but it also has some genuinely rural areas, and Obama tends to fare OK among farming and agricultural populations. That's not much for him to hang his hat on, however, and Clinton could conceivably get a 3-1 split here. Clinton 59-41 popular vote, 2-2 delegate split.
CD 11: Northeast -- Scranton (5 delegates). There's not any one thing that really stands out in this district -- it's just that a whole bunch of little things point toward Clinton, including the upbringing she claims in the region. It's a heterogeneous enough area however that Clinton is unlikely to do better than to win 3 out of the 5 delegates. Clinton 63-37 popular vote, 3-2 delegates.
*CD 12: Southwest -- Johnstown (5 delegates). The worst district in the state for Obama, and the one where he does need to be worried about a 4-1 split. Lots of things work out badly for him; it's among the least educated districts in the state, but also has the highest share of seniors. Still, the model says the tall order of a 4-1 split will not quite come through for Clinton. Clinton 67-33 popular vote, 3-2 delegates.
CD 13: Southeast -- Montgomery County (7 delegates). Nearly identical demographically to CD 8 (Bucks County), and we should see a similar result. Clinton 56-44 popular vote, 4-3 delegates.
*CD 14: Pittsburgh, and some suburbs (7 delegates). One very much to watch on election night. Obama will probably win the city of Pittsburgh itself, where the African-American population is high, but the outlying regions convince the model to tip it slightly toward Clinton. Clinton 52-48 popular vote, 4-3 delegates.
CD 15: East -- Allentown (5 delegates). This is basically a prototypical Pennsylvania district, and Obama is liable to lose prototypical Pennsylvania districts when they don't have many black voters. Given the math, will almost certainly be a 3-2 split for Clinton. Clinton 59-41 popular vote, 3-2 delegates.
CD 16: Southeast -- Lancaster (4 delegates). A mixed bag for both candidates: the district tilts somewhat young, but it also has a high percentage of "Americans". It also has a large Amish population, who presumably are tough to get included in surveys. With an even number of delegates, the result is very likely to be a split. Clinton 55-45 popular vote, 2-2 delegate split.
CD 17: East Central -- Harrisburg (4 delegates). Another very typical, if somewhat conservative district. With an even number of delegates, it is not worth a lot of attention. Clinton 56-44 popular vote, 2-2 delegate split.
CD 18: Western Pittsburgh Suburbs (5 delegates). Like the other Pittsburgh suburban district, it should not be mistaken for Hicksville -- the population is quite educated. But also like CD-4, the demographics are otherwise favorable to Clinton. This is almost definitely locked in to a 3-2 split. Clinton 60-40 popular vote, 3-2 delegates.
CD 19: South Central -- York (4 delegates). An interesting mix: the district definitely has some Pennsyltucky regions, but education levels are about average, and it has the state's highest percentage of women in the workforce. Again, the even number of delegates is likely to rob us of any drama. Clinton 53-47 popular vote, 2-2 delegate split.
CD-level total: Clinton 55 delegates, Obama 48 (Clinton +7).
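The district splits above all follow from the same bit of arithmetic, which can be sketched in a few lines. This is a simplified two-candidate version that assumes both candidates clear the 15% viability threshold and that a district's delegates are handed out in proportion to the two-way vote, rounded to the nearest whole delegate. The function names are mine, and an exact half is rounded toward Obama purely as a convention.

```python
import math


def split_delegates(obama_share, delegates):
    """Return (obama, clinton) delegates for a two-candidate district.

    obama_share is Obama's two-way vote share as a fraction (0.58 = 58%).
    """
    obama = math.floor(obama_share * delegates + 0.5)  # round half up
    return obama, delegates - obama


def share_needed(target, delegates):
    """Smallest two-way share that rounds up to `target` delegates."""
    return (target - 0.5) / delegates


print(split_delegates(0.58, 7))    # CD 1: Obama 58% of 7 delegates
print(split_delegates(0.33, 5))    # CD 12: Obama 33% of 5 delegates
print(split_delegates(0.497, 7))   # CD 7: Obama 49.7% of 7 delegates
print(round(share_needed(7, 9), 3))  # share needed for 7 of 9 in CD 2
print(round(share_needed(4, 5), 3))  # share Clinton needs for a 4-1 split
```

This is also why the even-numbered districts are so dull: in a 4-delegate district a candidate needs 62.5% of the two-way vote to get a 3-1 split, so anything from 37.5% to 62.5% comes out 2-2. By this simplified rule the bar for 7 of 9 delegates comes out at about 72%, in the neighborhood of the threshold quoted for CD 2.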
Overall, the model is spitting out some very sensible results: it expects Clinton to win the state by 7-8 percentage points. That would translate to about a 120,000 pickup in popular votes. She also picks up a net of 5 delegates at the statewide level, for a 12-delegate win overall:
How did I get those popular vote figures, you might be wondering? I ran a regression on those numbers too -- it turns out that turnout is really quite predictable (the R-squared on my turnout model is close to .9, whereas it's .8 for the vote share model). While I won't go into too much detail on the specifics just now, one interesting (if intuitive) finding is that turnout depends on how close the polls are. If Pennsylvania does tighten further, we can expect more people to go to the voting booth next Tuesday. These turnout figures might look a little low as compared with Ohio, but keep in mind that Pennsylvania has a closed primary -- which the model thinks may reduce turnout by as much as 30-40% over what it would be otherwise.
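As a back-of-the-envelope check on how the margin and the turnout fit together, here is a small sketch. It only rearranges two numbers already quoted above (a 7-8 point win and a pickup of about 120,000 votes). The function names are mine, and the implied turnout is derived from those two figures rather than taken from the turnout model.

```python
def vote_margin(turnout, margin_points):
    """Net votes gained by the winner, given total two-way turnout
    and the winning margin in percentage points."""
    return turnout * margin_points / 100.0


def implied_turnout(net_votes, margin_points):
    """Total two-way turnout implied by a net vote pickup and a margin."""
    return net_votes / (margin_points / 100.0)


# A 120,000-vote pickup at margins of 7, 7.5 and 8 points.
for points in (7.0, 7.5, 8.0):
    print(points, round(implied_turnout(120_000, points)))
```

In other words, a 120K pickup on a 7-8 point margin corresponds to somewhere around 1.5 to 1.7 million two-way votes cast.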
Wednesday, April 16, 2008
Pennsylvania Prediction: Clinton to net 12 delegates, 120K popular votes
-- Nate at 11:58 AM
Labels: demographics, pennsylvania, primaries
39 comments
So, she would net just 3 more delegates than she netted in Ohio. Hi Hi Hi Hi Hi Hi! That would be a huge victory for Obama.
Could you go one level deeper and tell us what the confidence interval around your prediction is? I would be interested in seeing the best case and worst case scenarios. I think that would give us an idea of how much wiggle room there is in these numbers.
Well, I tend to be of the feeling that a win is a win in politics. Expectations tend to matter more beforehand, than afterward when you see the numbers on the scoreboard and get to give a victory speech and point toward whatever exit poll findings look good for you, because the exit polls always look good for you when you win.
At the same time, a win of this magnitude would probably not be enough to change the inertia in the campaign, which probably leads toward a small Clinton win in Indiana, and a large Obama win in North Carolina, which collectively might be enough to end the primaries.
The standard error of the forecast at the CD level is roughly +/- 6 points. I actually have no idea how to translate this to a standard error at the *state* level.
I'd put it like this, however: something relatively fundamental would have to change in the way that voters are viewing the election for Obama to win Pennsylvania. Likewise, something fundamental would have to change for him to get blown out in Pennsylvania. But sometimes, those fundamentals *do* change, and we've got a big debate tonight that could be the catalyst.
Poblano,
Any geographic pattern to your residuals? More specifically, how well does your model match the Ohio districts?
Also, what does it predict for Florida and Michigan, and the caucus states?
Nice work as usual! :)
The model thinks that Obama underperformed by about 4 points in Ohio relative to what he "should" have. Could that be the Limbaugh Effect?
I notice that your model does not include any variable for religion, while in the past you've found a correlation between Obama's vote share and statewide percent of Baptists, particularly Southern Baptists. Why the change? Lack of data at the CD level to test it, or not independently significant? Seems like the "percent of residents who identify themselves as Americans" variable has a significant overlap with percent who are Baptists.
Yeah, there's no data available at the CD level on religion. Actually the Census doesn't publish data on religion, period. With that said, there is a fairly strong relationship between this "American" variable and evangelical populations, which is one reason I thought to pick it.
Regarding Ohio, could that have been the NAFTA-gate effect? Good timing there by the Clinton campaign.
A four-point underperformance in Ohio seems pretty darn close to the actual results, considering that not all elections were held on the same day. Could be partially the Limbaugh effect, although comparing exit polls in IL and OH suggests the Limbaugh effect was not much more than a point or two in OH. Could also be the NAFTA thing or various other bits of the kitchen sink.
Remember that Obama campaign spreadsheet? They had then expected to lose Pennsylvania by 5%, 83-75 (-8 delegates). So it sounds like their prediction was within the MoE of your prediction. On the other hand, they have managed to outperform some of those predictions in the past; this wouldn't be as good an outcome as they've hoped for, but I don't think it'd be a surprising one, and it certainly wouldn't be the blowout victory that the Clinton campaign surely wants and needs right now.
You're on to something with the "Americans." Sociologist-demographer Stanley Lieberson first called serious attention to the growth of what he called "unhyphenated Americans" in the U.S. census. These were individuals who were racially "white" but could not identify any particular national origin (German, Irish, etc.) in their background in response to the census questions of 1980. Generally their ancestors had immigrated two or more generations earlier. I am not able to access the original paper that he published on this in Ethnic and Racial Studies (1985), but I do recall one conversation with him when he was first investigating the phenomenon and calling this group "pure Americans." He did not stick with that term in his publications.
On the "Americans" thing...
The map shown here has a striking correlation with maps of Scots/Irish settlement in the U.S. In fact, when I first saw the map, before reading the description, I assumed it had something to do with Scots/Irish ethnicity. (Although to be fair, the S/I extend further up into NY and down into TX.)
I should add regarding the unhyphenated Americans identified by Lieberson that the unhyphenated white “American” identity was seen as emerging mainly among those of European origin and most frequently among less educated, rural, Southern whites. This is the same population known from other studies as the most racist element of American society (which isn't saying they are mostly racist, btw).
"the most racist element of American society"
I would shy away from these sort of analyses, as 'racism' is highly subjective, and exceedingly difficult to quantify.
For example, Barack Obama's white grandmother: racist, or no? There's simply too much gray area here to get bogged down in 1960s-era racial hand-wringing.
I guess the reason I want to know what the model would say on the standard deviations is not whether Obama can win, which I doubt, but whether Obama could possibly come out even in the delegate count. If Obama could tie the delegate count (not counting PLEOs in that) that would be a huge victory.
A quick google resulted in this map of Irish ancestry:
http://www.mnplan.state.mn.us/maps/ancestry/us/irish.gif
Correlates strongly with the unhyphenated "American" regions.
Great site. What is your occupation that affords you the time to do these insanely detailed projections?
citizen grim, the map you found looks like it may have combined Irish and Scots-Irish.
Here's some links to ancestry maps at the census website (I hope they work). You can zoom in and customize to your heart's content. Scotch-Irish and American overlap in many places as you noted, but not so well Irish and American. Take care to look at the legend on the left though as the colors mean different percents in each map.
Scotch-Irish: http://tinyurl.com/3gav7t
Irish: http://tinyurl.com/5zkykx
American: http://tinyurl.com/58as49
Citizen Grim: It's hard to quantify racism because most racists are sensitive enough about that to avoid expressing their racism overtly in surveys. (There's a large literature on "symbolic racism" that addresses the measurement aspect.) But this doesn't mean there is no racism or that we cannot find consequences of it such as in voting patterns (in elections and initiatives) or residential and school segregation.
That said, I also did not want to convey the idea that racism is only or primarily a "southern" issue. Gunnar Myrdal made the point 60 years ago that it was as much or more a "northern" issue because northerners often repressed the very idea that they might be racist. And as someone who lives in a northern, midwestern state that has the most segregated schools in the country, I'm not about to point fingers at people from other states. But I do think there are some correlates of racism in individual characteristics, and that's what I was pointing to in my earlier post.
BTW/ in Lieberson's analysis it was people of Scotch/Irish /English ancestry -- 3 generations removed -- who were most likely to become unhyphenated whites. And it was rural, less educated who were perhaps less likely to preserve a memory or link to their ethnic ancestry. In any case, however, we're not talking about a large percentage who become unhyphenated whites. But its growth was something that Lieberson noticed and wrote about.
My impression is also that American is a good proxy for Scotch-Irish, though most such people have been in the US long enough that they have mixed ancestry that is plurality Scotch-Irish. The thing is that it is hard to get at who is Scotch-Irish since (a) many such people do not know their ancestry, (b) most also have some other English or German ancestry, (c) many might say that they are "mostly English" or "mostly Scottish" or even say that folks were from Ireland but none of this is a good indication of whether their ancestors were Scotch-Irish. Basically, most "Scotch-Irish" folks were lowland Scots or English who lived just south of the border who moved to Northern Ireland before leaving for the New World soon thereafter (their cousins who stayed in Ireland are Ulster Protestants). A lot of folks also moved to America directly from the Scotland-England border area - these folks were never Irish, but are culturally the same as the Scotch-Irish and moved to the same communities in the New World.
The map posted basically confirms the above theory. Since Scotch-Irish folks are often a swing constituency, this is not a good thing for Obama. On the other hand, it is very interesting that Obama nevertheless does so well among whites in the interior west since a whole lot of these folks are also of Scotch-Irish ancestry. Though in these areas, the regional culture has a lot more influences that are less present in the upper south and lower midwest . . .
Though it could be the case that Scotch-Irish areas tend to be more racist or at least that voting behavior in these areas is more racist (or that Democrats there are more likely to be racist), I think that it is not a good idea to jump to conclusions about racial animosity being the main causal factor here. Scotch-Irish areas are culturally different from other parts of the US (especially Obama friendly Yankee areas) in lots of ways that influence voting behavior and political views. I can think of a number of reasons why Obama might not be a great cultural fit in these areas, though it is hard to say which of them are causally significant.
Finally, I think that it is important to point out that ancestry is probably not an important causal factor - heavy Scotch-Irish ancestry in a population is probably just a proxy for the sort of places that have Scotch-Irish influenced political cultures that make all the locals, regardless of their own particular ancestry, less likely to be Obama voters.
The presumptive Democratic nominee is going to lose a primary in a major state to one of the most disliked people in politics by 120,000 votes. Ouch.
Can't win 'em all. Let's see whether the 120K figure holds up. I'll take the under.
This is really strange for me, as I spent the day doing almost exactly the same thing, projecting delegates at a CD-level using regression!
And I thought I was so clever and original. I used the same idea, and a bunch of the same variables, and unsurprisingly got a very similar result. My total was identical, but I had the Pittsburgh CD 4-3 Obama, and the 2nd Philly one only 5-4 Obama. I'd trust yours over mine, though, as my model had major trouble projecting blowouts properly.
Great analysis and very interesting.
I know that you are not using poverty as a variable, but it may be worth noting the CD with the lowest poverty levels is 8, Bucks Co. (if I remember correctly). I suspected it might go 4-3 for Obama on that basis. Similarly for Montgomery Co.
I live in CD 2 and a 7-2 split there would not surprise me in the least. If Hillary has a presence in this district it is very muted.
I also think you're exactly right about CDs 7 and 14. It will be very exciting to see what happens there.
So if Obama has a very good day, he would have 5 more than you predict. I'm keeping my fingers crossed.
USA Today has results by CD, e.g., for Texas.
Great analysis as always, I love your site.
Just 2 minor glitches in the text. CD2 should be 6-3 not 5-2. CD5 only got 4 delegates.
CD 6 is "Berks" not "Burks". Also, the Amish do not vote except in very small numbers.
I'm a bit surprised at the high margins you're predicting for Clinton in certain areas of eastern PA. In Lancaster, Obama seems to have a lot of support. In Allentown, although it could swing that heavily to Clinton, esp. now that Mayor Pawlowski has endorsed Clinton, I could also see Obama eking out a win (depending on which voters turn out). I don't see why Bucks should go that heavily to Clinton, and in fact I would have predicted an Obama win there. I'd be even more inclined to predict Obama wins Montgomery.
What does this model say about the statewide results in NC and IN? The IN polls seem a bit erratic. My sense is that the state should lean Clinton, but I'd be curious what the numbers say.