Writing a Minimal Working Example (MWE) in R
How to Ask for Help using R
The key to getting good help with an R problem is to provide a minimal working reproducible example (MWRE). Making an MWRE is really easy with R, and it helps ensure that those helping you can identify the source of the error and, ideally, send you back corrected code instead of sending you hunting for code that works. To have an MWRE you need the following items:
- a minimal dataset that produces the error
- the minimal runnable code necessary to reproduce the error, which can be run on the dataset provided
- the necessary information on the packages used, the R version, and the system
- a seed value (set with set.seed()), if random properties are part of the code
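To illustrate the last item, here is a minimal sketch of how a seed makes random output repeatable. The seed value 2013 and the variable names are arbitrary choices for illustration.

```r
# Fix the random number generator so the draws can be repeated
set.seed(2013)          # any fixed integer works; 2013 is arbitrary
first_draw <- rnorm(5)

set.seed(2013)          # resetting the seed replays the same random stream
second_draw <- rnorm(5)

# TRUE when both draws match exactly
identical(first_draw, second_draw)
```

Anyone who runs your example with the same seed should then see the same simulated values you did, at least on a comparable R version.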
Let's look at the tools available in R to help us create each of these components quickly and easily.
Producing a Minimal Dataset
There are three distinct options here:
- Use a built in R dataset
- Create a new vector / data.frame from scratch
- Output the data you are currently working on in a shareable way
Let's look at each of these in turn and see the tools R has to help us do this.
Built in Datasets
There are a few canonical built-in R datasets that are really attractive for use in help requests.
- mtcars
- diamonds (from ggplot2)
- iris
To see all the available datasets in R, simply type data(). To load any of these datasets, simply use the following:
    data(mtcars)
    head(mtcars)  # to look at the data
                       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
This option works great for a problem where you know you are having trouble with a command in R. It is not a great option if you are having trouble understanding why a command you are familiar with won't work on your data.
Note that for education data that is fairly “realistic”, there are built-in simulated datasets in the eeptools package, created by Jared Knowles.
    library(eeptools)
    data(stulevel)
    names(stulevel)
     [1] "X"           "school"      "stuid"       "grade"       "schid"
     [6] "dist"        "white"       "black"       "hisp"        "indian"
    [11] "asian"       "econ"        "female"      "ell"         "disab"
    [16] "sch_fay"     "dist_fay"    "luck"        "ability"     "measerr"
    [21] "teachq"      "year"        "attday"      "schoolscore" "district"
    [26] "schoolhigh"  "schoolavg"   "schoollow"   "readSS"      "mathSS"
    [31] "proflvl"     "race"
Create Your Own Data
Inputting data into R and sharing it back out with others is really easy. Part of the power of R is the ability to create diverse data structures very easily. Let's create a simulated data frame of student test scores and demographics.
    Data <- data.frame(id = seq(1, 1000),
                       gender = sample(c("male", "female"), 1000, replace = TRUE),
                       mathSS = rnorm(1000, mean = 400, sd = 60),
                       readSS = rnorm(1000, mean = 370, sd = 58.3),
                       race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE))
    head(Data)
      id gender mathSS readSS race
    1  1 female  396.6  349.2    H
    2  2   male  369.5  330.7    W
    3  3 female  423.3  354.3    B
    4  4   male  348.7  333.1    W
    5  5   male  299.7  353.4    H
    6  6 female  338.0  422.1    I
And, just like that, we have simulated student data. This is a great way to evaluate problems with plotting data or with large datasets, since we can ask R to generate a random dataset that is incredibly large if necessary. However, let's look at the relationship among our variables using a quick plot:
    library(ggplot2)
    qplot(mathSS, readSS, data = Data, color = race) + theme_bw()
It looks like race is pretty evenly distributed and there is no relationship between mathSS and readSS. For some applications this data is sufficient, but for others we may wish for data that is more realistic.
    table(Data$race)

      A   B   H   I   W
    192 195 202 203 208
    cor(Data$mathSS, Data$readSS)
    [1] -0.01236
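If more realistic data is needed, one option is to build the relationship between the two scores in by hand. The sketch below is only one way to do this: the target correlation of 0.7, the seed, and the names target_r and Realistic are assumptions made for illustration, while the means and standard deviations are the ones used above.

```r
set.seed(123)                       # arbitrary seed so the example repeats
n <- 1000
target_r <- 0.7                     # assumed correlation between the scores

# Math scores drawn as before
mathSS <- rnorm(n, mean = 400, sd = 60)

# Reading scores mix the standardized math score with independent noise,
# weighted so the population correlation equals target_r
noise <- rnorm(n)
readSS <- 370 + 58.3 * (target_r * (mathSS - 400) / 60 +
                        sqrt(1 - target_r^2) * noise)

Realistic <- data.frame(id = seq_len(n), mathSS = mathSS, readSS = readSS)

# The sample correlation should land close to target_r
cor(Realistic$mathSS, Realistic$readSS)
```

The same idea extends to more variables, although at that point a multivariate generator such as MASS::mvrnorm() is probably easier to manage.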
Output Your Current Data
Sometimes you just want to show others the data you are using and see why your code won't work on it. The best practice here is to make a subset of the data you are working on, and then output it using the dput command.
    dput(head(stulevel, 5))
    structure(list(X = c(44L, 53L, 116L, 244L, 274L),
        school = c(1L, 1L, 1L, 1L, 1L),
        stuid = c(149995L, 13495L, 106495L, 45205L, 142705L),
        grade = c(3L, 3L, 3L, 3L, 3L),
        schid = c(495L, 495L, 495L, 205L, 205L),
        dist = c(105L, 45L, 45L, 15L, 75L),
        white = c(0L, 0L, 0L, 0L, 0L),
        black = c(1L, 1L, 1L, 1L, 1L),
        hisp = c(0L, 0L, 0L, 0L, 0L),
        indian = c(0L, 0L, 0L, 0L, 0L),
        asian = c(0L, 0L, 0L, 0L, 0L),
        econ = c(0L, 1L, 1L, 1L, 1L),
        female = c(0L, 0L, 0L, 0L, 0L),
        ell = c(0L, 0L, 0L, 0L, 0L),
        disab = c(0L, 0L, 0L, 0L, 0L),
        sch_fay = c(0L, 0L, 0L, 0L, 0L),
        dist_fay = c(0L, 0L, 0L, 0L, 0L),
        luck = c(0L, 1L, 0L, 1L, 0L),
        ability = c(87.8540493076978, 97.7875614875502, 104.493033823157,
            111.671512686787, 81.9253913501755),
        measerr = c(11.1332639734731, 6.8223938284885, -7.85615858883968,
            -17.5741522573204, 52.9833376218976),
        teachq = c(39.0902471213577, 0.0984819168655733, 39.5388526976972,
            24.1161227728637, 56.6806130368238),
        year = c(2000L, 2000L, 2000L, 2000L, 2000L),
        attday = c(180L, 180L, 160L, 168L, 156L),
        schoolscore = c(29.2242722609726, 55.9632592971956, 55.9632592971956,
            55.9632592971956, 55.9632592971956),
        district = c(3L, 3L, 3L, 3L, 3L),
        schoolhigh = c(0L, 0L, 0L, 0L, 0L),
        schoolavg = c(1L, 1L, 1L, 1L, 1L),
        schoollow = c(0L, 0L, 0L, 0L, 0L),
        readSS = c(357.286464546893, 263.904581222636, 369.672179143784,
            346.595665384202, 373.125445669888),
        mathSS = c(387.280282915207, 302.572371332695, 365.461432571883,
            344.496386434725, 441.15810279391),
        proflvl = structure(c(2L, 3L, 2L, 2L, 2L), .Label = c("advanced",
            "basic", "below basic", "proficient"), class = "factor"),
        race = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("A", "B", "H",
            "I", "W"), class = "factor")),
        .Names = c("X", "school", "stuid", "grade", "schid", "dist", "white",
            "black", "hisp", "indian", "asian", "econ", "female", "ell",
            "disab", "sch_fay", "dist_fay", "luck", "ability", "measerr",
            "teachq", "year", "attday", "schoolscore", "district",
            "schoolhigh", "schoolavg", "schoollow", "readSS", "mathSS",
            "proflvl", "race"),
        row.names = c(NA, 5L), class = "data.frame")
The resulting code can be copied and pasted into an R terminal and it will automatically rebuild the dataset exactly as described. Note that in the above example, it might have been better if I had first cut out all the variables that are unnecessary for my problem before executing the dput command. The goal is to make available only the data necessary to reproduce your problem.
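As a rough sketch of that workflow, the example below trims a data frame down to a few rows and columns before calling dput, then rebuilds it from the printed text the way a helper would after pasting it. It uses the built-in mtcars data so that it runs without extra packages; the column choice and the names small, dput_text, and rebuilt are illustrative only.

```r
# Keep only the rows and columns needed to show the problem
small <- head(mtcars[, c("mpg", "cyl")], 5)

# dput() prints R code that rebuilds the object; capture it as text
dput_text <- paste(capture.output(dput(small)), collapse = "\n")
cat(dput_text, "\n")

# Simulate the helper's side: evaluate the pasted text to rebuild the data
rebuilt <- eval(parse(text = dput_text))

# Check that the rebuilt copy matches the original subset
all.equal(small, rebuilt)
```

The printed structure(...) call is what you would paste into your help request.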
Also note that we never send student-level data from LDS over e-mail, as this is insecure. For work on student-level data, it is better either to simulate the data or to use the built-in simulated data from the eeptools package to run your examples.
Anonymizing Your Data
It may also be the case that you want to dput your data, but you want to keep the contents of your data anonymous. A Google search came up with a decent-looking function to carry this out:
    anonym <- function(df) {
      # Extend LETTERS with doubled letters ("AA", "BB", ...) for wide data
      if (length(df) > 26) {
        LETTERS <- replicate(floor(length(df)/26), {
          LETTERS <- c(LETTERS, paste(LETTERS, LETTERS, sep = ""))
        })
      }
      # Replace the column names with letters
      names(df) <- paste(LETTERS[1:length(df)])
      level.id.df <- function(df) {
        level.id <- function(i) {
          if (class(df[, i]) == "factor" | class(df[, i]) == "character") {
            # Recode categories as <column letter>.<level number>
            column <- paste(names(df)[i], as.numeric(as.factor(df[, i])), sep = ".")
          } else if (is.numeric(df[, i])) {
            # Rescale numeric columns by their mean
            column <- df[, i]/mean(df[, i], na.rm = TRUE)
          } else {
            column <- df[, i]
          }
          return(column)
        }
        DF <- data.frame(sapply(seq_along(df), level.id))
        names(DF) <- names(df)
        return(DF)
      }
      df <- level.id.df(df)
      return(df)
    }
    test <- anonym(stulevel)
    head(test[, c(2:6, 28:32)])
                        B                 C                 D
    1 0.00217632592657076  1.51160611230132 0.551020408163265
    2 0.00217632592657076 0.135998696526593 0.551020408163265
    3 0.00217632592657076  1.07322572705443 0.551020408163265
    4 0.00217632592657076 0.455562880806568 0.551020408163265
    5 0.00217632592657076  1.43813960635994 0.551020408163265
    6 0.00217632592657076 0.151115261535106 0.551020408163265
                      E                 F BB                CC
    1   1.3475499092559  2.01923076923077  0 0.720073808281278
    2   1.3475499092559 0.865384615384615  0 0.531872308862454
    3   1.3475499092559 0.865384615384615  0 0.745035931291952
    4 0.558076225045372 0.288461538461538  0 0.698527611516136
    5 0.558076225045372  1.44230769230769  0 0.751995631770993
    6   1.3475499092559  2.01923076923077  0 0.880245964840198
                     DD   EE   FF
    1 0.801153708902007 EE.2 FF.2
    2 0.625921298341795 EE.3 FF.2
    3 0.756017786295901 EE.2 FF.2
    4 0.712648099763826 EE.2 FF.2
    5 0.912608944625505 EE.2 FF.2
    6 0.958626895492888 EE.4 FF.2
That looks pretty generic and anonymized to me!
Notes
- Most of these solutions do not include missing data (NAs) which are often the source of problems in R. That limits their usefulness.
- So, always check for NA values.
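If missing values are part of your problem, a simulated dataset can be given some on purpose. The following is a minimal sketch, assuming ten missing math scores in a 100-row data frame; the name NaData, the seed, and the counts are made up for illustration.

```r
set.seed(99)                                   # arbitrary seed
NaData <- data.frame(id = 1:100,
                     mathSS = rnorm(100, mean = 400, sd = 60),
                     race = sample(c("H", "B", "W", "I", "A"), 100,
                                   replace = TRUE))

# Blank out ten randomly chosen math scores
missing_rows <- sample(nrow(NaData), 10)
NaData$mathSS[missing_rows] <- NA

# Count the missing values in each column
colSums(is.na(NaData))

# Many functions return NA unless told to drop missing values
mean(NaData$mathSS)
mean(NaData$mathSS, na.rm = TRUE)
```

Running the same NA count on your real data before sharing tells you whether your minimal dataset needs missing values to reproduce the error.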
Creating the Example
Once we have our minimal dataset, we need to reproduce our error on that dataset. This part is critical. If the error goes away when you apply your code to the minimal dataset, then it will be very hard for others to diagnose the problem remotely, and it might be time to get some “at your desk” help.
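One way to confirm that an error still reproduces on the minimal data is to wrap the failing call in tryCatch() and look at the message. This is only a sketch: the data frame small and the deliberately misspelled variable mathss are hypothetical stand-ins for your own data and your own failing call.

```r
set.seed(42)                                   # arbitrary seed
# Hypothetical minimal data frame; the names are illustrative only
small <- data.frame(race = sample(c("H", "B", "W"), 20, replace = TRUE),
                    mathSS = rnorm(20, mean = 400, sd = 60))

# A deliberately failing call: 'mathss' is misspelled, so R cannot find it
result <- tryCatch(
  with(small, mean(mathss)),
  error = function(e) conditionMessage(e)      # keep the error text
)

# The captured message is what you would paste into a help request
print(result)
```

If the call succeeds on the minimal data, result holds the computed value instead, which is a sign that the subset no longer triggers the problem.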
Let's look at an example where we have an error aggregating data. Let's assume I am creating a new data frame for my example, and trying to aggregate that data by race.
    Data <- data.frame(id = seq(1, 1000),
                       gender = sample(c("male", "female"), 1000, replace = TRUE),
                       mathSS = rnorm(1000, mean = 400, sd = 60),
                       readSS = rnorm(1000, mean = 370, sd = 58.3),
                       race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE))
    myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
    Error: unused argument(s) (by = race)
    head(myAgg)
    Error: object 'myAgg' not found
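Why do I get an error? If you sent the above code to someone, they could quickly run it and spot the mistake, provided they knew you were attempting to use the data.table package: Data is still a plain data.frame, so the data.table syntax fails. Loading the package and converting Data to a data.table fixes it.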
    library(data.table)
    Data <- data.frame(id = seq(1, 1000),
                       gender = sample(c("male", "female"), 1000, replace = TRUE),
                       mathSS = rnorm(1000, mean = 400, sd = 60),
                       readSS = rnorm(1000, mean = 370, sd = 58.3),
                       race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE))
    Data <- data.table(Data)
    myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
    head(myAgg)
       race meanM
    1:    H 398.6
    2:    B 405.1
    3:    A 397.8
    4:    W 395.7
    5:    I 399.1
Session Info
However, they might not know this, so we need to provide one final piece of information. This is known as the sessionInfo for our R session. To diagnose the error it is necessary to know what system you are running on, what packages are loaded in your workspace, and what version of R and of a given package you are using.
Thankfully, R makes this incredibly easy. Just tack on the output from the sessionInfo() function. This is easy enough to copy and paste or include in a knitr document.
    sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252
    [2] LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] data.table_1.8.8 eeptools_0.2     ggplot2_0.9.3.1  knitr_1.2

    loaded via a namespace (and not attached):
     [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3
     [4] evaluate_0.4.3     formatR_0.7        grid_2.15.2
     [7] gtable_0.1.2       labeling_0.1       MASS_7.3-23
    [10] munsell_0.4        plyr_1.8           proto_0.3-10
    [13] RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3
    [16] stringr_0.6.2      tools_2.15.2
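If you only need part of this information, or want it as text you can paste elsewhere, something like the following sketch may help. The variable names are illustrative, and "stats" is used only because it ships with every R installation; substitute the package your problem involves.

```r
# Capture the printed session details as plain text lines
session_lines <- capture.output(sessionInfo())
writeLines(session_lines)

# Version of R and of one specific package
r_version <- R.version.string
pkg_version <- as.character(packageVersion("stats"))
cat(r_version, "\n")
cat("stats", pkg_version, "\n")
```

The lines held in session_lines can be written to a file with writeLines(session_lines, "session.txt") and attached to a help request.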
Resources
For more information, visit:
- http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
- https://github.com/hadley/devtools/wiki/Reproducibility
- http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688
Get the source code for this blogpost in a Gist here: https://gist.github.com/jknowles/5659390