Merge by City and State in R

February 20, 2014
By

(This article was first published on Mollie's Research Blog, and kindly contributed to R-bloggers)

Often, you'll need to merge two data frames based on multiple variables. For this example, we'll use the common case of needing to merge by city and state.

First, you need to read in both your data sets:

# import city coordinate data:
coords <- read.csv("cities-coords.csv",
header = TRUE,
sep = ",")

# import population data:
data <- read.csv("cities-data.csv",
header = TRUE,
sep = ",")

Next comes the merge. You can use by.x and by.y to declare which variables the merge will be based on. If the variables have exactly the same name in both data sets, you can use by instead of by.x and by.y.

x and y represent the two data sets you are merging, in that order.

You also want to state whether you want to include all data from either data set, using all or all.x and all.y. In this case, we want to make sure we hold onto all our city data, even data for the cities we do not have coordinates for.

# merge data & coords by city & state:
dataCoords <- merge(coords, data,
by.x = c("City", "State"),
by.y = c("city", "state"),
all.x = FALSE,
all.y = TRUE)

Running that code shows what we would expect. Houston is included in the final data set even though there are no coordinates for it, while Dallas is not included since it has coordinates but no data:

            City State Latitude  Longitude year population
1 Chicago IL 41.85003 -87.65005 2012 2714856
2 Columbus GA 32.46098 -84.98771 2012 198413
3 Columbus OH 39.96118 -82.99879 2012 809798
4 Columbus OH 39.96118 -82.99879 2010 787033
5 Los Angeles CA 34.05223 -118.24368 2012 3857799
6 New York NY 40.71427 -74.00597 2012 8336697
7 New York NY 40.71427 -74.00597 2010 8175133
8 San Francisco CA 37.77823 -122.44250 2012 825863
9 San Francisco CA 37.77823 -122.44250 2010 805235
10 Houston TX NA NA 2012 2160821


Bonus

If you'd like to get a list of which cases got merged in but lack coordinate data, there's a simple line of code to do that:

> dataCoords[!complete.cases(dataCoords[,c(3,4)]),]
City State Latitude Longitude year population
10 Houston TX NA NA 2012 2160821

Also, you might want to tidy up the names of your variables, if they followed different conventions in their respective initial data sets:


> names(dataCoords) <- c("City", "State", "Latitude", "Longitude", "Year", "Population")
> dataCoords
City State Latitude Longitude Year Population
1 Chicago IL 41.85003 -87.65005 2012 2714856
2 Columbus GA 32.46098 -84.98771 2012 198413
3 Columbus OH 39.96118 -82.99879 2012 809798
4 Columbus OH 39.96118 -82.99879 2010 787033
5 Los Angeles CA 34.05223 -118.24368 2012 3857799
6 New York NY 40.71427 -74.00597 2012 8336697
7 New York NY 40.71427 -74.00597 2010 8175133
8 San Francisco CA 37.77823 -122.44250 2012 825863
9 San Francisco CA 37.77823 -122.44250 2010 805235
10 Houston TX NA NA 2012 2160821


The full sample code is available as a gist.

References


gist

Change R code tabs to 2 spaces.

bold package names
italicize functions

http://hilite.me/
manni

for R posts, use: sessionInfo()

To leave a comment for the author, please follow the link and comment on his blog: Mollie's Research Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.