Merging data: A tutorial

Posted on May 7, 2006 by dataninja in R bloggers | 0 Comments

[This article was first published on Dataninja » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The situation: you have two datasets with a common variable, and you want to incorporate both into one large dataset containing all of the variables. This is called merging data, and it’s easy to do in any standard statistical package. In these examples, I assume that there is only one variable between any datasets to be merged that has the same name in each file.

Stata

Before you can merge data in Stata, you must do two things:

Read each dataset into Stata and sort it by the merging variable (ex: insheet using file.csv , comma; sort id)
Save the dataset as a dta file (ex: save file.dta)

With the data properly formatted, you can merge two or more datasets by the same variable using the merge command:

use file1.dta
merge using id file2.dta

where id is the common variable between datasets file1.dta and file2.dta. If desired, you can merge arbitrarily many datasets using merge as long as they all contain and are sorted by one common variable.

Tip: After performing a merge in Stata, a new variable named _merge is created. This variable can be tabulated to analyze the results of the merge. For more information, type help merge into the Stata command line.

In R, the process is is similar. To begin, read each datafile in as a dataframe (ex: file1<-read.csv(file=”file1.csv”,header=TRUE)). Then type:

new.dataframe<-merge(file1, file2, by=”id”)

One difference between R’s merging function and Stata’s is that, in R, you can only merge two dataframes at a time. So, to merge a third dataset, working from the example above, you would use:

new.dataframe<-merge(new.dataframe, file3, by=”id”)

SAS

The process is essentially the same in SAS: first create two SAS datasets (named file1 and file2 for this example) and sort them by id (ex: proc sort data=file1; by id; run;). Then, to create a third, merged dataset:

data newdataset;
merge file1 file2;
by id;
run;

Like Stata, SAS lets you merge arbitrarily many files at once, as long as they are property sorted and imported into SAS.

For more

The examples above show how to merge two (or more) datasets in such a way that each value of the variable id, no matter whether that value is missing on any of the input datasets, will be included in the final combined file. In some cases, this may not be optimal. You may want to only use values of id that are present in one particular dataset (for example, trying to match up national population data with another dataset specific to the northeast, you would want to exclude any states not in the northeast). Stata, R and SAS all have the capability to do this kind of selective merging, but describing all of the options for each package would take up too much space for this short tutorial.

To lean about advanced features, try the following: