Using the wakefield package to easily generate reproducible sample data

November 5, 2015

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrie de Vries

Back in 2011, I asked a question on Stack Overflow: "How to make a great R reproducible example?"

This question attracted some great answers, including answers by Hadley Wickham and Joris Meys (co-author of R for Dummies).

In June of this year, Tyler Rinker added a new answer introducing his wakefield package. In his own words:

I am developing the wakefield package to address this need to quickly share reproducible data, sometimes dput() works fine for smaller data sets but many of the problems we deal with are much larger, sharing such a large data set via dput() is impractical.

I think it is a brilliant idea to create a package that allows you to easily create data with a specified structure.

The package has some very clever ideas. It contains functions that "know" about certain data types, e.g. age() generates ages within a range and coin() generates a Bernoulli sample, to name just two. You can also specify correlation between variables, a helpful feature if you want to demonstrate a specific statistical model.
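As a minimal sketch of how these variable functions can be called on their own (assuming wakefield is installed, and that each function takes the number of values to generate as its first argument, which is how I understand the package to work):

```r
library(wakefield)

set.seed(42)  # fix the seed so the sample is the same on every run

# Ten randomly drawn ages (the default age range depends on the package version)
ages <- age(10)

# Ten coin flips, i.e. a Bernoulli sample
flips <- coin(10)

print(ages)
print(flips)
```

The seed value and the object names `ages` and `flips` are arbitrary choices for illustration.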

The package is not yet on CRAN, but is extensively documented on GitHub.

wakefield is designed to quickly generate random data sets. The user passes n (number of rows) and predefined vectors to the r_data_frame function to produce a dplyr::tbl_df object.
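A small sketch of what such a call might look like, assuming wakefield is installed. The choice of columns and the seed are mine, for illustration only; this is not the example from the package documentation:

```r
library(wakefield)

set.seed(2015)  # make the random data reproducible

# n is the number of rows; the remaining arguments are wakefield's
# predefined variable functions, passed by name without parentheses
dat <- r_data_frame(
  n = 10,
  id,
  age,
  sex,
  coin
)

print(dat)
```

Because the seed is fixed, anyone running the same code should get the same data set, which is exactly what a reproducible example needs.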

Example

Here is an example from the documentation (modified only very slightly):

 

This produces the following plot. Notice the correlation in the data – people with high initial grades tend to maintain high grades over time, and vice versa.

[Figure: plot of the generated wakefield data]

Installation and more examples

To install the package, uncomment the first two lines of code and try the examples:
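A hedged sketch of a typical installation snippet, assuming the package is installed from its GitHub repository (trinker/wakefield) with the devtools package. This is an illustration, not necessarily the author's original listing:

```r
# Uncomment the next two lines to install wakefield from GitHub
# install.packages("devtools")
# devtools::install_github("trinker/wakefield")

# Load the package once it is installed
library(wakefield)
```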
