Writing a Minimal Working Example (MWE) in R

May 27, 2013

(This article was first published on Data, Evidence, and Policy - Jared Knowles, and kindly contributed to R-bloggers)

How to Ask for Help using R

The key to getting good help with an R problem is to provide a minimal working reproducible example (MWRE). Making an MWRE is really easy with R, and it helps ensure that those helping you can identify the source of the error and, ideally, send you back corrected code instead of sending you hunting for code that works. An MWRE needs the following items:

  • a minimal dataset that produces the error
  • the minimal runnable code necessary to reproduce the error, run on the dataset provided
  • the necessary information on the used packages, R version, and system
  • a seed value, if random properties are part of the code

Let's look at the tools available in R to help us create each of these components quickly and easily.

Producing a Minimal Dataset

There are three distinct options here:

  1. Use a built-in R dataset
  2. Create a new vector / data.frame from scratch
  3. Output the data you are currently working on in a shareable way

Let's look at each of these in turn and see the tools R has to help us do this.

Built-in Datasets

There are a few canonical built-in R datasets that are really attractive for use in help requests.

  • mtcars
  • diamonds (from ggplot2)
  • iris

To see all the available datasets in R, simply type: data(). To load any of these datasets, simply use the following:

data(mtcars)
head(mtcars)  # to look at the data
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

This option works great for a problem where you know you are having trouble with a command in R. It is not a great option if you are having trouble understanding why a command you are familiar with won't work on your data.
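
As a rough sketch of how to browse what is available, the snippet below lists the datasets in the base datasets package and then loads one from a contributed package without attaching it. The ggplot2 step is guarded because it assumes ggplot2 is installed, and the variable name available is only for illustration.

```r
# List the names of the datasets shipped with the base 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Confirm that the canonical examples are there
c("mtcars", "iris") %in% available

# Load a dataset from another package without attaching that package
# (assumes ggplot2 is installed, so the step is guarded)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  data(diamonds, package = "ggplot2")
  str(diamonds[, 1:4])
}
```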

Note that for education data that is fairly “realistic”, there are built-in simulated datasets in the eeptools package, created by Jared Knowles.

library(eeptools)
data(stulevel)
names(stulevel)
 [1] "X"           "school"      "stuid"       "grade"       "schid"      
 [6] "dist"        "white"       "black"       "hisp"        "indian"     
[11] "asian"       "econ"        "female"      "ell"         "disab"      
[16] "sch_fay"     "dist_fay"    "luck"        "ability"     "measerr"    
[21] "teachq"      "year"        "attday"      "schoolscore" "district"   
[26] "schoolhigh"  "schoolavg"   "schoollow"   "readSS"      "mathSS"     
[31] "proflvl"     "race"       

Create Your Own Data

Inputting data into R and sharing it back out with others is really easy. Part of the power of R is the ability to create diverse data structures very easily. Let's create a simulated data frame of student test scores and demographics.

Data <- data.frame(id = seq(1, 1000), gender = sample(c("male", "female"), 1000, 
    replace = TRUE), mathSS = rnorm(1000, mean = 400, sd = 60), readSS = rnorm(1000, 
    mean = 370, sd = 58.3), race = sample(c("H", "B", "W", "I", "A"), 1000, 
    replace = TRUE))

head(Data)
  id gender mathSS readSS race
1  1 female  396.6  349.2    H
2  2   male  369.5  330.7    W
3  3 female  423.3  354.3    B
4  4   male  348.7  333.1    W
5  5   male  299.7  353.4    H
6  6 female  338.0  422.1    I

And, just like that, we have simulated student data. This is a great way to evaluate problems with plotting data or with large datasets, since we can ask R to generate a random dataset that is incredibly large if necessary. However, let's look at the relationship among our variables using a quick plot:

library(ggplot2)
qplot(mathSS, readSS, data = Data, color = race) + theme_bw()

[Figure: scatterplot of readSS against mathSS, with points colored by race]

It looks like race is pretty evenly distributed and there is no relationship between mathSS and readSS. For some applications this data is sufficient, but for others we may wish for data that is more realistic.

table(Data$race)

  A   B   H   I   W 
192 195 202 203 208 
cor(Data$mathSS, Data$readSS)
[1] -0.01236
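
Two things are worth sketching here. Because the data above are drawn at random, the table and the correlation will differ on every run unless a seed is set first, which is why a seed value is on the MWRE checklist. And if the problem needs test scores that are related to each other, one simple option is to build readSS from mathSS plus noise. The seed 2013, the slope 0.7, the noise level, and the name Data2 are arbitrary choices for illustration, not values from the original post.

```r
set.seed(2013)  # arbitrary seed; any fixed integer makes the draws repeatable

n <- 1000
mathSS <- rnorm(n, mean = 400, sd = 60)
# readSS tracks mathSS with slope 0.7, plus independent noise
readSS <- 370 + 0.7 * (mathSS - 400) + rnorm(n, mean = 0, sd = 40)

Data2 <- data.frame(
  id     = seq_len(n),
  gender = sample(c("male", "female"), n, replace = TRUE),
  mathSS = mathSS,
  readSS = readSS,
  race   = sample(c("H", "B", "W", "I", "A"), n, replace = TRUE)
)

cor(Data2$mathSS, Data2$readSS)  # clearly positive now, roughly 0.7

# Resetting the same seed reproduces the same first draw
set.seed(2013)
identical(rnorm(n, mean = 400, sd = 60), Data2$mathSS)  # TRUE
```

Anyone who runs this with the same seed should get the same data frame, so the numbers in a help request match what the helper sees.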

Output Your Current Data

Sometimes you just want to show others the data you are using and find out why your code won't work on it. The best practice here is to make a subset of the data you are working on, and then output it using the dput command.

dput(head(stulevel, 5))
structure(list(X = c(44L, 53L, 116L, 244L, 274L), school = c(1L, 
1L, 1L, 1L, 1L), stuid = c(149995L, 13495L, 106495L, 45205L, 
142705L), grade = c(3L, 3L, 3L, 3L, 3L), schid = c(495L, 495L, 
495L, 205L, 205L), dist = c(105L, 45L, 45L, 15L, 75L), white = c(0L, 
0L, 0L, 0L, 0L), black = c(1L, 1L, 1L, 1L, 1L), hisp = c(0L, 
0L, 0L, 0L, 0L), indian = c(0L, 0L, 0L, 0L, 0L), asian = c(0L, 
0L, 0L, 0L, 0L), econ = c(0L, 1L, 1L, 1L, 1L), female = c(0L, 
0L, 0L, 0L, 0L), ell = c(0L, 0L, 0L, 0L, 0L), disab = c(0L, 0L, 
0L, 0L, 0L), sch_fay = c(0L, 0L, 0L, 0L, 0L), dist_fay = c(0L, 
0L, 0L, 0L, 0L), luck = c(0L, 1L, 0L, 1L, 0L), ability = c(87.8540493076978, 
97.7875614875502, 104.493033823157, 111.671512686787, 81.9253913501755
), measerr = c(11.1332639734731, 6.8223938284885, -7.85615858883968, 
-17.5741522573204, 52.9833376218976), teachq = c(39.0902471213577, 
0.0984819168655733, 39.5388526976972, 24.1161227728637, 56.6806130368238
), year = c(2000L, 2000L, 2000L, 2000L, 2000L), attday = c(180L, 
180L, 160L, 168L, 156L), schoolscore = c(29.2242722609726, 55.9632592971956, 
55.9632592971956, 55.9632592971956, 55.9632592971956), district = c(3L, 
3L, 3L, 3L, 3L), schoolhigh = c(0L, 0L, 0L, 0L, 0L), schoolavg = c(1L, 
1L, 1L, 1L, 1L), schoollow = c(0L, 0L, 0L, 0L, 0L), readSS = c(357.286464546893, 
263.904581222636, 369.672179143784, 346.595665384202, 373.125445669888
), mathSS = c(387.280282915207, 302.572371332695, 365.461432571883, 
344.496386434725, 441.15810279391), proflvl = structure(c(2L, 
3L, 2L, 2L, 2L), .Label = c("advanced", "basic", "below basic", 
"proficient"), class = "factor"), race = structure(c(2L, 2L, 
2L, 2L, 2L), .Label = c("A", "B", "H", "I", "W"), class = "factor")), .Names = c("X", 
"school", "stuid", "grade", "schid", "dist", "white", "black", 
"hisp", "indian", "asian", "econ", "female", "ell", "disab", 
"sch_fay", "dist_fay", "luck", "ability", "measerr", "teachq", 
"year", "attday", "schoolscore", "district", "schoolhigh", "schoolavg", 
"schoollow", "readSS", "mathSS", "proflvl", "race"), row.names = c(NA, 
5L), class = "data.frame")

The resulting code can be copied and pasted into an R terminal, and it will rebuild the dataset exactly as shown. Note that in the above example it would have been better to first drop all the variables not needed for my problem before running the dput command. The goal is to share only the data necessary to reproduce your problem.
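
A minimal sketch of that trimming step, using mtcars so it runs anywhere: keep only the columns and rows the problem needs, write the dput output to a file, and read it back with dget to confirm that it rebuilds the same object. The column choice and the temporary file are illustrative assumptions.

```r
# Keep only the variables and rows needed to show the problem
small <- head(mtcars[, c("mpg", "cyl", "wt")], 5)

# Print the structure() call that can be pasted into a help request
dput(small)

# Round trip: write the same text to a file and rebuild the object from it
tmp <- tempfile(fileext = ".R")
dput(small, file = tmp)
rebuilt <- dget(tmp)

# Check that the rebuilt object matches the original
all.equal(small, rebuilt)
```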

Also note that we never send student-level data from LDS over e-mail, as this is insecure. For work on student-level data, it is better either to simulate the data or to use the built-in simulated data from the eeptools package to run your examples.

Anonymizing Your Data

It may also be the case that you want to dput your data, but you want to keep the contents of your data anonymous. A Google search came up with a decent looking function to carry this out:

anonym <- function(df) {
    if (length(df) > 26) {
        LETTERS <- replicate(floor(length(df)/26), {
            LETTERS <- c(LETTERS, paste(LETTERS, LETTERS, sep = ""))
        })
    }
    names(df) <- paste(LETTERS[1:length(df)])

    level.id.df <- function(df) {
        level.id <- function(i) {
            if (class(df[, i]) == "factor" | class(df[, i]) == "character") {
                column <- paste(names(df)[i], as.numeric(as.factor(df[, i])), 
                  sep = ".")
            } else if (is.numeric(df[, i])) {
                column <- df[, i]/mean(df[, i], na.rm = T)
            } else {
                column <- df[, i]
            }
            return(column)
        }
        DF <- data.frame(sapply(seq_along(df), level.id))
        names(DF) <- names(df)
        return(DF)
    }
    df <- level.id.df(df)
    return(df)
}

test <- anonym(stulevel)
head(test[, c(2:6, 28:32)])
                    B                 C                 D
1 0.00217632592657076  1.51160611230132 0.551020408163265
2 0.00217632592657076 0.135998696526593 0.551020408163265
3 0.00217632592657076  1.07322572705443 0.551020408163265
4 0.00217632592657076 0.455562880806568 0.551020408163265
5 0.00217632592657076  1.43813960635994 0.551020408163265
6 0.00217632592657076 0.151115261535106 0.551020408163265
                  E                 F BB                CC
1   1.3475499092559  2.01923076923077  0 0.720073808281278
2   1.3475499092559 0.865384615384615  0 0.531872308862454
3   1.3475499092559 0.865384615384615  0 0.745035931291952
4 0.558076225045372 0.288461538461538  0 0.698527611516136
5 0.558076225045372  1.44230769230769  0 0.751995631770993
6   1.3475499092559  2.01923076923077  0 0.880245964840198
                 DD   EE   FF
1 0.801153708902007 EE.2 FF.2
2 0.625921298341795 EE.3 FF.2
3 0.756017786295901 EE.2 FF.2
4 0.712648099763826 EE.2 FF.2
5 0.912608944625505 EE.2 FF.2
6 0.958626895492888 EE.4 FF.2

That looks pretty generic and anonymized to me!

Notes

  • Most of these solutions do not include missing data (NAs), which are often the source of problems in R. That limits their usefulness.
  • So, always check for NA values.
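
One hedged way to act on this is to plant a few NAs in simulated data on purpose and see whether the problem changes. The sketch below uses a small made-up data frame; the name DataNA, the seed, the size, and the number of missing values are all arbitrary.

```r
set.seed(42)  # arbitrary seed so the same rows go missing on each run

DataNA <- data.frame(id = 1:20, mathSS = rnorm(20, mean = 400, sd = 60))

# Blank out three randomly chosen scores
DataNA$mathSS[sample(nrow(DataNA), 3)] <- NA

colSums(is.na(DataNA))             # count of NAs per column
mean(DataNA$mathSS)                # NA: missing values propagate by default
mean(DataNA$mathSS, na.rm = TRUE)  # drops the NAs before averaging
```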

Creating the Example

Once we have our minimal dataset, we need to reproduce our error on that dataset. This part is critical. If the error goes away when you apply your code to the minimal dataset, then it will be very hard for others to diagnose the problem remotely, and it might be time to get some “at your desk” help.

Let's look at an example where we have an error aggregating data. Let's assume I am creating a new data frame for my example, and trying to aggregate that data by race.

Data <- data.frame(id = seq(1, 1000), gender = sample(c("male", "female"), 1000, 
    replace = TRUE), mathSS = rnorm(1000, mean = 400, sd = 60), readSS = rnorm(1000, 
    mean = 370, sd = 58.3), race = sample(c("H", "B", "W", "I", "A"), 1000, 
    replace = TRUE))

myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
Error: unused argument(s) (by = race)
head(myAgg)
Error: object 'myAgg' not found

Why do I get an error? Well, if you sent the above code to someone, they could quickly run it and spot the mistake, provided they knew you were attempting to use the data.table package: the [, j, by = ] syntax only works on a data.table, not on a plain data.frame.

library(data.table)
Data <- data.frame(id = seq(1, 1000), gender = sample(c("male", "female"), 1000, 
    replace = TRUE), mathSS = rnorm(1000, mean = 400, sd = 60), readSS = rnorm(1000, 
    mean = 370, sd = 58.3), race = sample(c("H", "B", "W", "I", "A"), 1000, 
    replace = TRUE))

Data <- data.table(Data)
myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
head(myAgg)
   race meanM
1:    H 398.6
2:    B 405.1
3:    A 397.8
4:    W 395.7
5:    I 399.1
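
When an example like this fails, a quick look at the class of the object often points to the cause. The sketch below uses a smaller made-up data frame (the names Small and baseAgg, the seed, and the size are arbitrary) and shows a base R aggregation that does not depend on data.table. It is offered as an alternative for comparison, not as the approach used above.

```r
set.seed(1)  # arbitrary seed

Small <- data.frame(
  race   = sample(c("H", "B", "W", "I", "A"), 100, replace = TRUE),
  mathSS = rnorm(100, mean = 400, sd = 60)
)

# A plain data.frame does not understand the [, j, by = ] syntax
class(Small)

# Base R equivalent: mean mathSS within each race
baseAgg <- aggregate(mathSS ~ race, data = Small, FUN = mean)
baseAgg
```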

Session Info

However, they might not know this, so we need to provide one final piece of information. This is known as the sessionInfo for our R session. To diagnose the error, it is necessary to know what system you are running on, what packages are loaded in your workspace, and what version of R and of a given package you are using.

Thankfully, R makes this incredibly easy. Just tack on the output from the sessionInfo() function. This is easy enough to copy and paste or include in a knitr document.

sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.8 eeptools_0.2     ggplot2_0.9.3.1  knitr_1.2       

loaded via a namespace (and not attached):
 [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
 [4] evaluate_0.4.3     formatR_0.7        grid_2.15.2       
 [7] gtable_0.1.2       labeling_0.1       MASS_7.3-23       
[10] munsell_0.4        plyr_1.8           proto_0.3-10      
[13] RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3      
[16] stringr_0.6.2      tools_2.15.2      
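
If only part of that output is wanted, or it needs to go into a file or an e-mail, something like the following sketch can help. The file name is hypothetical, and stats is used for the package version only because it is always installed.

```r
# Capture the sessionInfo() printout as plain text lines
info <- capture.output(sessionInfo())
writeLines(info)

# Save it alongside the example (a temporary directory is used as a placeholder)
out <- file.path(tempdir(), "sessionInfo.txt")
writeLines(info, out)

# Individual pieces, if a full dump is too long
R.version.string
packageVersion("stats")
```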

Resources

Get the source code for this blogpost in a Gist here: https://gist.github.com/jknowles/5659390
