Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In the US we think of Minneapolis and St. Paul as the Twin Cities, but Ben Olin, author of Math with Bad Drawings, https://mathwithbaddrawings.com/, posed this data science question: Which U.S. cities are the true twin cities? He imposed three conditions:
1. the cities must be at most 10 miles apart,
2. each city must have at least 200,000 people, and
3. the populations have to be within a ratio of 2:1.
This seemed like a nice data analysis problem, so I searched for a dataset containing both population and location. Interestingly, each of population and location has its nuances, and I learned a lot more from this problem than I expected.

I found https://simplemaps.com/data/us-cities has a dataset of 30,844 cities containing latitude, longitude, and population. However, when Minneapolis’ population came out as 2.9 million, Ben recognized that the population was shown for the broader metropolitan area, not the city proper. I got a second dataset of populations of city propers from https://www.census.gov/data/tables/time-series/demo/popest/2020s-total-cities-and-towns.html, I joined the two datasets, and I used the populations from the second database.

How do you measure the distance between two cities? This is not as simple a question as it sounds.

As a start, I used https://www.distancecalculator.net/ and entered two cities I am familiar with, New York, NY and Hoboken, NJ. That website calculated a distance of 2.39 miles, and it provided a map. The site further clarified that it used the great circle distance formula. So this raises two questions: How does it decide which two points to measure from, and what is a great circle distance?

Hoboken has an area of 1.97 square miles, so it probably doesn’t matter too much which point in Hoboken to use. New York City has an area of 472.43 square miles, so it does matter which point to use. It is not obvious which point it used, and it certainly did not use the closest point, but from other work, I believe it used City Hall or something close.

Although some sites will measure distance between two cities as driving distance, it is more common to use great circle distance, which is the shortest distance along the surface of a sphere. The earth is not exactly a sphere, but it is approximately a sphere.

Latitude and longitude is a coordinate system to describe any point on the earth’s surface. Lines of latitude are horizontal lines parallel to the Equator, and represent how far north or south a point is from the Equator. Lines of longitude represent how far a point is east or west from a vertical line called the Prime Meridian that runs through Greenwich, England. Both latitude and longitude are measured in degrees, which are broken down into smaller units called minutes and seconds. For convenience they are also expressed in decimal degrees. If D is degrees, M is minutes, and S is seconds, then the conversion to decimal degree uses D + M/60 + S/3600.

When we use trig functions to calculate distances, we need to convert decimal degrees to radians by multiplying by π / 180. We also need to know the radius of the earth which is 3963.0 miles.

If point A is (lat1, long1) in decimals and point B is (lat2, long2) in decimals, then the distance formula d is the great-circle distance on a perfect sphere using the haversine distance formula, which is derived from principles of three-dimensional spherical trigonometry including the spherical law of cosines. A haversine of an angle θ is defined as hav(θ) = sin2(θ/2), and this concept is used in the derivation.

d = ACOS(SIN(PI()*[Lat_start]/180.0)*SIN(PI()*[Lat_end]/180.0)+COS(PI()*[Lat_start]/180.0) *COS(PI()*[Lat_end]/180.0)*COS(PI()*[Long_start]/180.0-PI()*[Long_end]/180.0))*3963

As I mentioned, I used https://simplemaps.com/data/us-cities which contains cities with their latitude and longitude, and I applied the above distance formulas to pairs of cities. But I was still curious about the choice of a latitude and longitude for a particular city. That file lists New York City as (40.6943, -73.9249). Another website that finds a street address from a decimal latitude and longitude, https://www.mapdevelopers.com/reverse_geocode_tool.php , lists the address of (40.6943, -73.9249) as 871 Bushwick Avenue, Brooklyn, which is some distance from City Hall, but does not appear to be the centroid of New York City either. Wikipedia’s choice of latitude and longitude for New York City is 42 Park Row which is close to City Hall.

The following map from https://www.mapdevelopers.com/reverse_geocode_tool.php?lat=40.694300&lng=-73.924900&zoom=12 shows the approximate location of 871 Bushwick Avenue, Brooklyn.

I joined the dataset of locations with the populations of city propers from the second dataset, and I applied Ben’s three conditions. This produced 8 pairs of cities, and of course this list uses the first dataset’s choice of a city’s latitude and longitude and the distances resulting from that. Different choices of latitude and longitude produce a different list.

Of these pairs, I actually like Hialeah and Miami as the true twin cities. Besides meeting the original three criteria, they both share the same large ethnic population and they share a public transportation system.

Wikipedia has a much larger list of twin cities https://en.wikipedia.org/wiki/Twin_cities, but they did not necessarily use Ben Olin’s three criteria. Also, Ben’s problem is for US cities only, and Wikipedia has several pairs of Canada-US and Mexico-US cities that I had not thought about.

Here is the R code I used:

df1 <- subset(df1, select = c(city, lat, lng, state_name))
n1 <- nrow(df1) # 30844

df2 <- read_excel("https://raw.githubusercontent.com/fcas80/jt_files/main/censuspop.xlsx", mode = "wb", skip = 3)
df2 <- df2[ -c(1,4:6) ]
colnames(df2) <- c("city", "pop")
# city appears as format Los Angeles city, California
df2\$state <- gsub(".*\\, ", "", df2\$city) # extract state: everything after comma blank
df2\$city <- gsub("\\,.*", "", df2\$city) # extract everything before comma
df2\$city <- gsub(" city*", "", df2\$city) # delete: blank city

df = merge(x=df1, y=df2, by=”city”,all=TRUE)
df <- na.omit(df)
df <- df[df\$pop >= 200000, ]
df <- df[df\$state_name == df\$state, ] # delete improper merge same city in 2 states
df <- subset(df, select = -c(state_name, state))
n <- nrow(df) # 112

kount <- 1
df11 <- data.frame()
for (i in 1:n){
Lat_start <- df[i,2]
Long_start <- df[i,3]
for (j in 1:n){
Lat_end <- df[j,2]
Long_end <- df[j,3]
dist_miles <- acos(sin(pi*(Lat_start)/180.0)*sin(pi*(Lat_end)/180.0)+
cos(pi*(Lat_start)/180.0)*cos(pi*(Lat_end)/180.0)*cos(pi*
(Long_start)/180.0-pi*(Long_end)/180.0))*3963
cos(pi*(Lat_start)/180.0)*cos(pi*(Lat_end)/180.0)*cos(pi*(Long_start)/180.0-pi*
(Long_end)/180.0))*3963
dist_miles <- round(dist_miles, 0)
pop_ratio <- round(max(df[i,4]/df[j,4], df[j,4]/df[i,4]),1)
if (df[i,1] != df[j,1] & dist_miles > 0 & dist_miles <= 10 & pop_ratio <= 2){
df11[kount,1] <- df[i,1]
df11[kount,2] <- df[j,1]
df11[kount,3] <- dist_miles
df11[kount,4] <- df[i,4]
df11[kount,5] <- df[j,4]
df11[kount,6] <- pop_ratio
df11[kount,7] <- df[i,4] + df[j,4]
kount <- kount + 1
}
}
}
colnames(df11) <- c("City1", "City2", "Dist", "Pop1", "Pop2", "Ratio", "TotPop")
df11 <- df11[!duplicated(df11\$TotPop), ] # remove duplicates
df11 <- df11[ -c(7) ]
df11 <- data.frame(df11, row.names = NULL) # renumber rows consecutively
df11