For most of you who haven't followed my baseball work, I am best known for inventing a forecasting system called PECOTA, which generates predictions by comparing baseball players with a large database of historical peers and identifying the most similar ones. This same technology -- which is really just a variant of nearest neighbor analysis -- can be applied to virtually anything, including identifying the similarity of any two states along a number of dimensions of political salience. In fact, that's exactly what I've done in the chart below, with each state listed along with its three most similar states.
What factors go into the similarity score? There are quite a few, which are weighted in rough proportion to their importance in determining the Kerry-Bush result in 2004 and the McCain-Obama polling this year according to an analysis of variance.
Specifically, those variables are: (1) Partisan ID index; (2) Likert liberal-conservative score; (3) average years of completed schooling per adult; (4) per capita income; (5) 18-29 year old population; (6) senior population; (7) African-American population; (8) Hispanic population; (9) percentage of white evangelicals; (10) Catholic population; (11) Mormon/LDS population; (12) percentage of military veterans; (13) percentage of same-sex partner households; (14) gun ownership rate; (15) percentage of adults identifying ancestry as 'American'; (16) percentage suburban; (17) percentage of state jobs in manufacturing sector; (18) current unemployment rate; and (19) latitude and longitude (i.e. geographic distance).
The highest score theoretically achievable is 100, for two states that are exactly identical along each of these 19 dimensions. The highest score in practice is 71, between North and South Carolina. A score of 0 represents states that are as dissimilar as they are similar, and negative scores are both possible and quite common (though I list them as zeroes in the table above).
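To make the scale concrete, here is a minimal sketch of one way a score with these properties could be built. It is not the actual formula behind the chart: the weighted cosine measure, the function names, the four placeholder states, and every number below are assumptions made up for illustration. Each variable is standardized across states, then two states are compared with a weighted cosine, so identical profiles score 100, unrelated ones score near 0, and opposite ones go negative.

```python
import math

def zscores(column):
    """Standardize one variable across all states (mean 0, std 1)."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # constant column: avoid dividing by zero
    return [(x - mean) / std for x in column]

def similarity(a, b, weights):
    """Weighted cosine of two standardized profiles, scaled so identical = 100."""
    dot = sum(w * x * y for w, x, y in zip(weights, a, b))
    norm_a = math.sqrt(sum(w * x * x for w, x in zip(weights, a)))
    norm_b = math.sqrt(sum(w * y * y for w, y in zip(weights, b)))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a perfectly average state has no direction to compare
    return 100.0 * dot / (norm_a * norm_b)

def similarity_table(raw, weights):
    """Score every ordered pair of distinct states."""
    states = list(raw)
    columns = [zscores([raw[s][j] for s in states]) for j in range(len(weights))]
    profiles = {s: [col[i] for col in columns] for i, s in enumerate(states)}
    return {(s, t): similarity(profiles[s], profiles[t], weights)
            for s in states for t in states if s != t}

def most_similar(table, state, k=3):
    """Return the k highest-scoring neighbors of one state."""
    scores = [(other, score) for (s, other), score in table.items() if s == state]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical data: three made-up variables for four placeholder states.
raw = {
    "A": [0.40, 30.0, 12.0],
    "B": [0.42, 31.0, 12.5],
    "C": [0.60, 45.0, 14.0],
    "D": [0.58, 44.0, 13.5],
}
weights = [3.0, 1.0, 2.0]  # assumed importance weights, not the real ones

table = similarity_table(raw, weights)
print(most_similar(table, "A"))
```

In this toy data, A and B sit on the same side of the average on every variable, so B is A's nearest neighbor, while A and C sit on opposite sides and score below zero.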
Note that some states really aren't like any other states at all, including big ones like Florida and Texas and small ones like Alaska, Utah, and New Mexico. Then there are other states that are sort of within the main sequence but need to pull from different regions -- like Indiana, whose three most similar states span the Midwest (Ohio), South (North Carolina) and the Prairies (Kansas).
And yes, this does have implications for our model, which will become clear at some point soon.
Monday, July 7, 2008
State Similarity Scores
-- Nate Silver at 7:18 AM
Labels: indiana, meta, portfolio theory
82 comments
So states like Utah and (DC) are so unique that nothing is anything like them? and HI is almost like that, but if i wanted to know at least with 10% accuracy what HI is like i ought to go to CO?
hm... interesting. i wonder how we could visualize this in a compelling way.
As far as I can see, all of the states with similarity scores greater than 60 are projected to go the same way, with one exception: Missouri. Its closest neighbor is Ohio, but Ohio is projected for McCain, and MO for Obama. I wonder if the old saying "as Missouri goes, so goes the nation" will be wrong in 2008. Anyone know the last time MO and OH went for different candidates in a presidential election?
Oops. Above I meant OH for Obama, and MO for McCain.
Florida's set is peculiar. I would have had a hard time guessing one of those states as part of its group.
None of Florida's comparables are any good. Pennsylvania and Arizona show up because they have a lot of old people (Arizona also has a lot of Hispanics, though diverges from FL politically). I'm not really sure what Delaware's doing there.
Pretty cool, Nate. Did these have predictive value in past elections to a significant degree? It may prove insightful to predict outcomes based on state patterns. Look forward to your manipulation of it.
I love this, Nate. One question: is there any data on percentage of people born in the state v. transplants? I'm very interested in this particular set of people (being one of them, to the seventh degree).
I'm assuming that people will see a connection between education level and inter-state moves, but it's more than that. Living in Texas, you could even say that an intra-state move voters are more similar than those who have stayed within their region.
Just wondering if that's where you'd find the differences that are so obvious to those who live in North and South Carolina. (How could they possibly have edged out the Dakotas?)
Nate, do you think you can figure out what sort of similarity scores are significantly similar (or different)? I know that with the Jamesian creation, similarity scores of over 950 are very significant, 900 are "pretty" significant, and ones under 850 are not really that significant at all. Just like when you have Barry Bonds's best comparable to be Willie Mays at around 760, that means that Barry is in a league of his own...
What similarity score puts something in a league of its own? Obviously, Alaska and Hawaii have their own eccentricities, and the Carolinas and the Dakotas swing pretty much together - but where's the (fuzzy) line in between them?
I know I'm asking more questions than posing answers, but I hope that they're at least good questions! =)
Nate said "None of Florida's comparables are any good. Pennsylvania and Arizona show up because they have a lot of old people (Arizona also has a lot of Hispanics, though diverges from FL politically). I'm not really sure what Delaware's doing there."
I can offer some reasons why Delaware may be a "nearest neighbor" to Florida.
The party IDs of both states are probably very similar. The Wilmington/Newark area has a relatively high percentage of Republican identification, though many of the Republicans in this metropolitan area are more in the "Rockefeller Republican" mold, and are much more liberal on social issues.
Given the fierce identification of many Delaware Republicans with the party (despite the disparity in beliefs about social issues), the Likert scores for Delaware are probably over-ranked towards Republican identification.
Also, there is a tremendous disparity in Delaware between the Wilmington/Newark area and "downstate." Wilmington/Newark is probably similar in many respects to south Florida (other than average age), while downstate is comparable to northern Florida and the panhandle.
All in all, there are some good reasons why Delaware might be identified as a nearest neighbor to Florida, but it is a conclusion which will not permit any further inferences to be drawn, since Delaware's score is very distant from Florida's score, especially for being a nearest neighbor.
As someone who was a reader at Baseball Prospectus long before FiveThirtyEight existed, let me state for the record that one person being responsible for the creation of either PECOTA or 538 would be truly impressive; that one person is responsible for both is nothing short of remarkable.
And yes, Nate, this new tool feels quite familiar. When can we expect a post evaluating McCain's potential VP picks by Popularity Above Replacement Governor?
Nate,
Could you put a 'national' score on this chart? I've heard several folks say that a Florida election is like a mini-national election. It might be nice to know which state looks most like the national average.
Once again, very cool. Another great way of quantifying conventional wisdom.
I have a few questions: (surprise!)
1) I'm not sure I understand why you have introduced the discontinuity by forbidding negative numbers. By setting all negative numbers to zero, are you effectively saying that DC is as similar to Maryland as it is to Utah?
2) In a nearest-neighbor analysis, where do the negative numbers come from? Any sort of measure of distance should give positive numbers, and it should not be too hard to translate distance into a scale of (1 = identical, 0 = infinitely far apart).
3) How do you determine that states are "as dissimilar as they are similar?" This makes sense in the context of a correlation-like metric, but such a metric is dependent on the sample mean. Does this mean that in your metric, the similarity between two states may change as a result of changes in the demographics of the other states?
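Questions 2 and 3 can be illustrated with a small sketch. It assumes nothing about how the chart was actually computed: the two functions, the toy two-variable states, and the outlier are all hypothetical. A correlation-like score is measured relative to the average of the states in the sample, so it can be negative and can shift when another state is added, whereas a score derived from plain distance stays positive and ignores the other states.

```python
import math

def centered_cosine(a, b, population):
    """Correlation-like score: cosine of a and b after subtracting the sample mean."""
    n = len(population)
    means = [sum(row[j] for row in population) / n for j in range(len(a))]
    ca = [x - m for x, m in zip(a, means)]
    cb = [y - m for y, m in zip(b, means)]
    dot = sum(x * y for x, y in zip(ca, cb))
    norms = math.hypot(*ca) * math.hypot(*cb)
    return dot / norms if norms else 0.0

def distance_similarity(a, b):
    """Distance-based score: 1 for identical points, approaching 0 far apart."""
    return 1.0 / (1.0 + math.dist(a, b))

# Hypothetical two-variable profiles for two states.
x = [1.0, 2.0]
y = [2.0, 1.0]

small = [x, y, [3.0, 3.0]]          # sample of three states
large = small + [[-10.0, -10.0]]    # same sample plus one outlier state

before = centered_cosine(x, y, small)  # relative to the small sample's mean
after = centered_cosine(x, y, large)   # the outlier moves the mean
fixed = distance_similarity(x, y)      # does not depend on the sample

print(before, after, fixed)
```

Here the score between the same two states moves from 0 to roughly 0.92 purely because an outlier joined the sample, which is the dependence question 3 asks about.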
Off-topic
I couldn't help but think of this site when I saw this map on Sullivan. Is Obama's real weakness among fat white people? Perhaps he should pack on the pounds. :-)
Hi Nate,
This is fascinating!
Seeing your chart, the gears in my mind start turning on a dozen fronts, and it sounds to me like your state similarity scores could have a bearing on a million other applications -
When it's ready, will you consider representing the data geographically, instead of in table form? I think a color-coded or shaded/graded US state map highlighting these relationships would be very enlightening and useful.
Secondly, what is your stance on sharing these kinds of datasets themselves? I start drooling when I think of a spreadsheet with (your) "Likert liberal-conservative scale" by state (something I've dug around for on the net and have come up dry). I know a lot of this data is publicly available or in the census, but it is damn hard to track down. I'm doing some GIS work related to state progressive politics and environmental policy, and this is the kind of thing that could be very helpful.
Have you thought about making certain sets of data available to the public through fivethirtyeight? If not, is there anything you'd be willing to share?
Either way, hope to see a state-similarity map sometime soon!
-Jeremy
Link
Here's another place where your analytical & baseball expertise can help answer a question.
Nate,
your data isn't symmetric! For example, in the row labeled "Idaho" you have "Wyoming 47"; in the row labeled "Wyoming" you have "Idaho 40".
But nice work! It's possible to derive a sort of "map" of the states from this, which I've made a first stab at and which is interesting in ways I'm having trouble articulating; I'll probably post it to my own blog soon.
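A check like the one described in this comment is easy to script. The sketch below assumes the chart has been typed into a dictionary keyed by (row state, listed state); that layout and the function name are hypothetical, and the only figures used are the two Idaho and Wyoming scores quoted above.

```python
def asymmetric_pairs(table, tolerance=0):
    """List pairs whose score differs depending on which state's row it appears in."""
    found = []
    for (a, b), score in table.items():
        reverse = table.get((b, a))
        # a < b reports each pair once; pairs listed in only one row are skipped
        if reverse is not None and a < b and abs(score - reverse) > tolerance:
            found.append((a, b, score, reverse))
    return found

# The two entries quoted in the comment: row "Idaho" and row "Wyoming".
table = {("Idaho", "Wyoming"): 47, ("Wyoming", "Idaho"): 40}

mismatches = asymmetric_pairs(table)
print(mismatches)
```

Because each row lists only three neighbors, many pairs will appear in one direction only, so a full check would need the complete score matrix.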
According to the secretary of state's office, 55,560 more Democrats than Republicans are on the active voter rolls in Nevada, as of the end of June. The gap widened from 50,020 in May and represents 5 percent of the 1,031,984 active voter