Finding out repeated variables in multiple datasets

January 28, 2014
By Daniel Marcelino

(This article was first published on Daniel Marcelino » R, and kindly contributed to R-bloggers)

A few days ago I posted about a smart way to import several similar data files from a directory. Today I want to return to this topic and stretch it a bit further by adding some complexity. I want a snapshot of the datasets before I even start working with them; that is, I want to know beforehand which variables appear across multiple files. This post might be of particular interest to those working with data from survey waves, since surveys tend to repeat some questions (variables) but change others across time, or across places, as the research interest changes.

In the R package "SciencesPo" there is a function named "detail" that describes a whole dataset in a nice way: variables as rows and descriptive statistics as columns. I like this style because it doesn't really matter how many variables one has: the output of "detail" may become long, but never too wide to fit on the screen. My intention, then, is to obtain a similar layout, but with variable names as rows and file names as columns. With the resulting table it is possible to quickly identify which variables appear in multiple files.

Finally, I'll show how to get similar results using both R and Stata. The code is divided into two parts. In the first part I provide data for replication (the seniors data that ships with Stata). In the second, I run the example itself; so if you already have some data, only the second part of the code may matter to you.

Doing it in R:
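What follows is a minimal sketch of the idea in base R, not necessarily the code from the original post. It follows the same two-part layout. The first part writes three small CSV files to a temporary directory; the file names (wave1.csv and so on) and the variables in them are invented for illustration and stand in for the replication data. The second part reads only the header of each file and builds a table with variable names as rows and file names as columns.

```r
# Part 1: create a few small example files (hypothetical survey waves).
# Skip this part if you already have your own data files in a directory.
data_dir <- file.path(tempdir(), "waves")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, age = c(34, 51, 27), vote = c("A", "B", "A")),
          file.path(data_dir, "wave1.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3, age = c(35, 52, 28), income = c(10, 20, 30)),
          file.path(data_dir, "wave2.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3, vote = c("B", "B", "A"), region = c("N", "S", "N")),
          file.path(data_dir, "wave3.csv"), row.names = FALSE)

# Part 2: read the header (plus one row) of each file and keep the variable names.
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
var_names <- lapply(files, function(f) names(read.csv(f, nrows = 1)))
names(var_names) <- basename(files)

# Build a logical table: variable names as rows, file names as columns.
all_vars <- sort(unique(unlist(var_names)))
presence <- vapply(var_names,
                   function(v) all_vars %in% v,
                   logical(length(all_vars)))
rownames(presence) <- all_vars

# Count in how many files each variable appears and keep the repeated ones.
n_files <- rowSums(presence)
repeated <- names(n_files[n_files > 1])

print(presence)
print(repeated)
```

With these toy files, "age", "id" and "vote" are flagged as repeated, while "income" and "region" each appear in a single file. Reading with nrows = 1 keeps the snapshot cheap, since only the column names are needed at this stage.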

[Figure: R output]

And here is the Stata version:

[Figure: Stata output]

