**Dataninja » R**, and kindly contributed to R-bloggers)

The situation: you have two datasets with a common variable, and you want to incorporate both into one large dataset containing all of the variables. This is called merging data, and it’s easy to do in any standard statistical package. In these examples, I assume that there is only one variable between any datasets to be merged that has the same name in each file.

**Stata**

Before you can merge data in Stata, you must do two things:

- Read each dataset into Stata and sort it by the merging variable (ex: insheet using file.csv , comma; sort id)
- Save the dataset as a dta file (ex: save file.dta)

With the data properly formatted, you can merge two or more datasets by the same variable using the *merge* command:

use file1.dta

merge using id file2.dta

where *id* is the common variable between datasets *file1.dta *and *file2.dta*. If desired, you can merge arbitrarily many datasets using *merge* as long as they all contain and are sorted by one common variable.

**Tip:** After performing a merge in Stata, a new variable named *_merge* is created. This variable can be tabulated to analyze the results of the merge. For more information, type *help merge* into the Stata command line.

**R**

In R, the process is is similar. To begin, read each datafile in as a dataframe (ex: file1<-read.csv(file=”file1.csv”,header=TRUE)). Then type:

new.dataframe<-merge(file1, file2, by=”id”)

One difference between R’s merging function and Stata’s is that, in R, you can only merge two dataframes at a time. So, to merge a third dataset, working from the example above, you would use:

new.dataframe<-merge(new.dataframe, file3, by=”id”)

**SAS**

The process is essentially the same in SAS: first create two SAS datasets (named file1 and file2 for this example) and sort them by *id* (ex: proc sort data=file1; by id; run;). Then, to create a third, merged dataset:

data newdataset;

merge file1 file2;

by id;

run;

Like Stata, SAS lets you merge arbitrarily many files at once, as long as they are property sorted and imported into SAS.

**For more**

The examples above show how to merge two (or more) datasets in such a way that each value of the variable *id*, no matter whether that value is missing on any of the input datasets, will be included in the final combined file. In some cases, this may not be optimal. You may want to only use values of *id* that are present in one particular dataset (for example, trying to match up national population data with another dataset specific to the northeast, you would want to exclude any states not in the northeast). Stata, R and SAS all have the capability to do this kind of selective merging, but describing all of the options for each package would take up too much space for this short tutorial.

To lean about advanced features, try the following:

- For Stata, see this page, or type
*help merge*into the Stata command window - For R, see this page, or type
*help(merge)*into the R console - For SAS, see this page, or consult your SAS documentation, wherever it may be

**leave a comment**for the author, please follow the link and comment on his blog:

**Dataninja » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...