Using R to Solve a Geography Puzzle
Posted on September 25, 2013 by Todd Schneider in R bloggers | 0 Comments
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The puzzle: find two points inside the United States such that
- Both points are in the same state
- The straight line segment (shortest great circle) connecting them crosses the largest number of distinct states
This came up during a recent road trip through Pennsylvania, Maryland, West Virginia, and Virginia, where I noticed that it’s possible to draw a straight line that has both endpoints in West Virginia, and along the way crosses each of PA, MD, and VA, for a total of 4 states:
Are there any states where you can do better than 4 total? To help find the answer, I wrote an R script that analyzes state-level spatial data to determine where else we could draw line segments that start and end in the same state, and might cross more than 4 states. It turns out that there are two such states, although one of them gets an asterisk…
How does the algorithm find the best line segments?
To start, I make the claim that we only have to check line segments that start and end on a state border. If we have a segment that starts or ends in the interior of a state, we can always extend the segment in each direction until it reaches a border, and we can be guaranteed that this extended segment touches at least as many states as the unextended segment, because the points along the unextended segment are a subset of the points along the extended segment.
With that insight, we can perform a simple brute force search for each state that checks every segment connecting two boundary points of that state. We enumerate every possible pair of boundary points defined by the spatial data, calculate the great circle that connects those two points, and count up the number of states that include at least one point along the great circle. Here’s an animation of the calculation for West Virginia (this animation doesn’t enumerate every pair of border points, but you can see the process):
It turns out that it’s computationally impractical to do a complete brute force search, so we simplify the problem a bit by removing boundary points that are sufficiently close together. New York, for example, is defined by just over 40,000 boundary points. There would be more than (40,000 * 39,999) / 2 = 799.98 million pairs of points to check if we wanted to check all of them. Applying a little common sense suggests that this would be overkill: most of the boundary points are along the jagged coast of Long Island, where every little nook and cranny requires its own definition. If we simplify the boundaries of New York by checking only boundary points that are at least, say, 10 kilometers from the previous point we checked, then we cut down the total number of points to 226, leaving us a far more manageable 35,000 pairs to check – a speed improvement of more than 20,000 times!
One downside of this simplification is that we increase the chance that we’ll miss the globally optimal line segment, but after experimenting with various boundary-simplification schemes, 10 kilometers seemed like a reasonable approach. It’s worth noting that even without simplifying the boundaries, we still might miss the global optimum, because the globally optimal segment might have an endpoint that falls on the interpolated segment connecting two of the defined boundary points (imagine a state with relatively simple borders like Colorado: it requires far fewer boundary points than New York!).
The Results
With the R script in hand, we can let it crunch through the various states and see what it finds.
There are many states that contain line segments crossing 4 states, including some surprises (at least to me!).
The Kentucky Bend plays a role for all of Kentucky, Missouri, and Tennessee.
Kentucky:
Missouri:
Tennessee:
Arkansas sneaks in 4, touching Missouri, Tennessee, and Mississippi:
Nebraska’s border along the Missouri River touches South Dakota, Iowa, and Missouri:
Illinois benefits from having its borders officially include large portions of the Great Lakes:
West Virginia, the original inspiration for this exploration:
Virginia (including Washington, DC):
And the winner is…
Of course, even before I wrote the program, both my fellow roadtripper and I guessed that New York would be the answer, traveling from the eastern end of Long Island to the northeastern corner bordering Vermont and Quebec, but we weren’t sure exactly which states we’d cross in between. Connecticut, Massachusetts, and Vermont seemed like a given, but would the line segment catch any of Rhode Island? What about New Hampshire?
It turns out the best you can do is 5 states, including New Hampshire but excluding Rhode Island:
And so New York is the top dog with a total of five states crossed: NY, CT, MA, NH, and VT, but there’s another 5-spot out there: Maryland also gets 5, but with an asterisk because one of them is Washington, D.C., and well there is that whole “taxation without representation” thing.
In Conclusion
If we had to summarize the results into something a bit more general, we could say:
- Look for states with borders that are defined by bodies of water. The Mississippi River in particular accounts for many of the surprising cases in the midwestern states, largely owing to the Kentucky Bend. The Missouri and Potomac Rivers, Long Island Sound, and Chesapeake Bay all have some contribution as well. This is pretty intuitive in retrospect, since the irregular, meandering borders defined by bodies of water have more opportunities for crossings compared to simpler borders defined along parallels and meridians
- The smaller eastern states fare better than their larger western counterparts. That might be partially due to bodies of water, but it also helps that smaller states are closer together, so even though New York is among the largest of the eastern states, it is surrounded by many smaller states
Where’s the code?
You can run all the of the code for yourself; the software and data are both freely available. Some suggested ideas that might be fun to explore: what about the same problem applied to countries of the world, instead of US states? How could we make an algorithm faster/more intelligent than a brute force search? Please feel free to comment with any questions and (especially) suggested improvements!
If you just want the highlights, the innermost loop of the code is only 3 lines. Given two points p1
and p2
, calculate the number of states crossed by the connecting great circle:
line = gcIntermediate(p1, p2, n = gcPoints, addStartEnd = FALSE) points = SpatialPoints(data.frame(x = line[,1], y = line[,2]), proj4string=CRS(" +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")) states = unique(over(points, gadm)$NAME_1)
Related
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.