Three tips for posting good questions to R-help and Stack Overflow

January 18, 2011
(This article was first published on sigmafield - R, and kindly contributed to R-bloggers)

As the number of R users and contributed packages increases, activity on the R-help mailing list and the R tag on Stack Overflow also continues to rise. Users with the knowledge to help those asking questions naturally have limited time to assist on these forums. To get the best answers in the shortest amount of time, there are concrete steps you can take as a poster to ask high-quality questions.

The R-help Posting Guide is an excellent reference to help you ask good questions. This article is not a substitute for reading the guide, but rather presents a few important points and gives examples of how to use them. I will introduce three simple suggestions that, if followed, should lead to better feedback on your questions.

Before we start: the Golden Rule

As someone who often tries to respond to questions on R-help, I asked myself what qualities make a good question. I am most likely to be able to offer help when the question is clear and the poster includes R code that can be cut and pasted directly into my session to demonstrate the problem, complete with a small sample dataset if appropriate. The following tips should get you most of the way to this ideal.

Tip 1: Make your data reproducible

You want those who are trying to help you to be able to see the same problem as you. The first step in making that happen is to guarantee that you and the list are working on the exact same data. There are two primary ways of making that happen.

First, you can use R to generate your own random data, and post the code in your question.

The following R code generates a small sample data.frame with variables for id, gender, and age. The set.seed function makes sure that the random values sampled will be identical, no matter who runs the code.

set.seed(3)
sampleData <- data.frame(id = 1:10, 
                         gender = sample(c("Male", "Female"), 10, replace = TRUE),
                         age = rnorm(10, 40, 10))
summary(sampleData)
       id           gender       age       
 Min.   : 1.00   Female:5   Min.   :27.81  
 1st Qu.: 3.25   Male  :5   1st Qu.:32.62  
 Median : 5.50              Median :40.58  
 Mean   : 5.50              Mean   :39.09  
 3rd Qu.: 7.75              3rd Qu.:42.28  
 Max.   :10.00              Max.   :52.67

Try it out for yourself. Even though we're generating random data, your results should match mine if you use the same integer (3) in the set.seed function. If you want to use this method, just copy the statements you used to generate the data into your question. If you want to read more about the sample function used in that code block, see my article demonstrating its use.
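As a quick check of the claim above, here is a minimal sketch that resets the seed to the same integer before each draw and confirms that the two draws match. The names firstDraw and secondDraw are made up for this illustration. One caveat: the values produced for a given seed can differ between R versions (the default algorithm behind sample changed in R 3.6.0), so it may be worth stating your R version when you post.

```r
## reset the seed to the same integer before each draw
set.seed(3)
firstDraw <- rnorm(5, 40, 10)

set.seed(3)
secondDraw <- rnorm(5, 40, 10)

## TRUE when both draws contain exactly the same values
identical(firstDraw, secondDraw)
```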

Generating random data with the set.seed function is one way to make sure two people are working on the same dataset. Often, though, it is more convenient to share the actual data that you are working with. R has the dput function to help you out with this. It writes out a plain text representation of an R object, which lets others create an exact copy of that object by simply copying and pasting the output. If the object is large, you can take a small sample of it first and then call dput on the sample before pasting the result into your message.

The following code block demonstrates two simple ideas. First, if your dataset (e.g., largeData below) has a lot of rows or columns, you probably do not want to share the whole thing with everyone to make your point. Instead, you can take a subset of the rows using the technique below to create the sampleData object. Second, the dput function is called, which writes out R code that represents the object. Others can paste this code to create the same object you are using.

## we will generate a data.frame for this example, but
## this object represents your "real" data
largeData <- data.frame(id = 1:1000, age = rnorm(1000, 40, 10))

## posting the dput output of a data.frame with 1000 observations
## is probably not necessary, so we will take a small subset 
sampleData <- largeData[sample(nrow(largeData), 10), ]

## use dput to write out a text representation of the R object
dput(sampleData)
 structure(list(id = c(39L, 471L, 497L, 927L, 663L, 525L, 580L, 
622L, 48L, 727L), age = c(22.6273628946641, 35.237619316895, 
29.6406734238401, 49.7885287820185, 42.6482063541433, 35.8383991257624, 
33.1517001030015, 36.814031442543, 42.5628727298572, 48.4957262906764
)), .Names = c("id", "age"), row.names = c(39L, 471L, 497L, 927L, 
663L, 525L, 580L, 622L, 48L, 727L), class = "data.frame")

See what happened? The dput function has created the code necessary to reproduce the object! You just have to copy and paste the result (the lines beginning with the structure function) and assign the resulting object to a variable name, and readers of your question will have the same data you do. This goes a long way in helping you solve your problems. See the sample question below for an example of assigning the dput output to a variable name.
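To see the round trip from the reader's side, the sketch below captures the text that dput prints and then rebuilds the object from that text. The parse and eval calls only imitate a reader pasting the text after an assignment arrow, and the object names (original, dputText, copy) are hypothetical. The comparison uses all.equal, which tolerates tiny floating-point differences, because I am not certain the default dput output preserves every digit of every number.

```r
## a tiny stand-in for your real data
original <- data.frame(id = 1:3, age = c(22.5, 35.25, 29))

## capture what dput would print, as you would paste it into a message
dputText <- capture.output(dput(original))

## a reader pastes that text after "<-"; parse and eval imitate the paste
copy <- eval(parse(text = dputText))

## TRUE when the rebuilt object matches the original
isTRUE(all.equal(original, copy))
```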

If you have a large set that must be used to recreate your issue, consider putting a CSV file on a public web space, such as Dropbox, and then providing the line of code to read in the data in your message. See read.csv and write.csv.
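A minimal sketch of that CSV round trip follows. To keep the example self-contained it writes to a temporary file, so csvPath is only a stand-in for wherever you actually host the file. In a real message you would show the read.csv line pointing at the shared location.

```r
sampleData <- data.frame(id = 1:3, age = c(31, 42, 27))

## tempfile() is a stand-in here for a file you would upload somewhere public
csvPath <- tempfile(fileext = ".csv")
write.csv(sampleData, csvPath, row.names = FALSE)

## in your message, this line would point at the shared location instead
roundTrip <- read.csv(csvPath)
str(roundTrip)
```

Passing row.names = FALSE keeps write.csv from adding an extra column of row names, which would otherwise come back as a separate variable when the file is read in.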

Takeaway: Make your data accessible to others. Either give code that generates random data or use dput on small objects and paste the result in your question. If you must use a large data set, post it to Dropbox and provide the code to read it in.

Tip 2: Make your code reproducible

This builds on the first tip, which shows how to make your data reproducible for a reader of the post. This tip says that your code should also be reproducible, and concise.

Give the reader the minimum number of lines of R code needed to reproduce either the error you are receiving or the point where you get stuck with your program, and no more. Remember, the goal is to leave nothing to the reader's imagination as to what you did. We want to be able to copy and paste the R code and have it run in any R session, not just yours. Do not forget to include any library calls that load packages you might be using in your examples.
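As an illustration, a self-contained snippet for a message might be laid out roughly like this. It is only a sketch: the lm call stands in for whatever code is giving you trouble, exampleData and fit are made-up names, and the stats package is loaded purely to show where library calls belong (it is attached by default anyway).

```r
## load every package the example relies on
## (stats ships with R and is used here only for illustration)
library(stats)

## small, reproducible data
set.seed(1)
exampleData <- data.frame(x = 1:5, y = rnorm(5))

## the one call that shows the problem or where you are stuck
fit <- lm(y ~ x, data = exampleData)
coef(fit)

## R version and loaded packages, which often matter for the answer
sessionInfo()
```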

Takeaway: Any reader of your question should be able to copy and paste all the R code in your message to see the same output as you do. Before you send your message, run the R code in your message in a fresh R session to make sure you have met this criterion.

Tip 3: Use proper R class names in your post

R has several different classes that can be used to store data. These include, but are not limited to: data.frame, matrix, table, and various vector classes. It helps readers when you use the proper class names to describe the data you have, since functions that are appropriate for data.frames may not be appropriate for objects of type matrix, and vice-versa.

You can always get the class of your object by simply using the class function.

class(sampleData)
[1] "data.frame"

Besides helping the readers of your question, knowing which class R uses to represent your data is very useful for you as a data analyst.
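The short sketch below calls class on a few common kinds of objects and then shows one case where the class changes what works. All object names are invented for the example, and I have left out the printed results because some of them depend on your R version.

```r
exampleFrame  <- data.frame(a = 1:2, b = c("x", "y"))
exampleMatrix <- matrix(1:4, nrow = 2)
exampleTable  <- table(c("x", "y", "x"))

class(exampleFrame)
class(exampleMatrix)
class(exampleTable)
class(1:3)          # integer vector
class(c(1.5, 2))    # numeric vector

## the same operation can behave differently by class:
## $ extracts a column from a data.frame but fails on a matrix
exampleFrame$a
tryCatch(exampleMatrix$a, error = function(e) conditionMessage(e))
```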

Takeaway: Be specific about which R classes your programs use, as this can affect the recommendations given to you and helps you think more clearly about your analyses.

Examples of bad and good questions

Here is an example of a typical bad question.

Hello, I have a data table and am trying to subtract two dates, but
it's not working.  

I tried

> myData$Date1 - myData$Date2

but it doesn't work. Can anyone help? 

What's wrong with a question like this? It breaks every rule I talk about above. There is no sample data object, and the code cannot be run by copying and pasting it into a new R session. Also, the vague term "data table" is used instead of the more accurate "data.frame", which is the actual class of the R object. Essentially, the reader is left to guess what is wrong and cannot reproduce the error that the poster is receiving.

A much better version of this question follows.

Hello, 

I have a data.frame and am trying to subtract two components of it,
but it's not working.

I am pasting the dput output of my data.frame object. 

## dput output assigned to the myData variable
myData <- structure(list(id = 1:10, Date1 = structure(c(3L, 8L, 1L, 10L, 
6L, 4L, 5L, 9L, 7L, 2L), .Label = c("1997-07-14", "1997-10-24", 
"1997-10-26", "1998-08-21", "1998-12-31", "1999-01-31", "1999-05-09", 
"1999-11-03", "2001-11-04", "2002-06-23"), class = "factor"), 
    Date2 = structure(c(1L, 7L, 5L, 6L, 2L, 9L, 8L, 4L, 3L, 10L
    ), .Label = c("1997-07-29", "1997-09-21", "1998-05-06", "1998-07-24", 
    "1999-10-22", "2000-03-10", "2001-04-03", "2001-08-07", "2001-09-10", 
    "2002-05-07"), class = "factor")), .Names = c("id", "Date1", 
"Date2"), row.names = c(NA, -10L), class = "data.frame")

Now, when I run the command I think should work, I see the following: 

> myData$Date1 - myData$Date2
[1] NA NA NA NA NA NA NA NA NA NA
Warning message:
In Ops.factor(myData$Date1, myData$Date2) : - not meaningful for factors

What is going on?  

Now everyone reading the question can create the same data.frame and run the code that generates the warning message, exactly as the poster sees it. The answer quickly reveals itself: the Date variables are not really Date objects, but rather factors. Readers can verify this by running str on the object created from the code exactly as it appears in the message.

str(myData)
'data.frame':	10 obs. of  3 variables:
 $ id   : int  1 2 3 4 5 6 7 8 9 10
 $ Date1: Factor w/ 10 levels "1997-07-14","1997-10-24",..: 3 8 1 10 6 4 5 9 7 2
 $ Date2: Factor w/ 10 levels "1997-07-29","1997-09-21",..: 1 7 5 6 2 9 8 4 3 10
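For completeness, here is one way a reader might answer the question once the factors are spotted. This is a sketch rather than part of the original exchange. It rebuilds only the first two rows of the poster's data so that it runs on its own, converts each factor to character and then to Date, and subtracts. The name dayDifference is made up for the example.

```r
## first two rows of the poster's data, with dates stored as factors
myData <- data.frame(
  Date1 = factor(c("1997-10-26", "1999-11-03")),
  Date2 = factor(c("1997-07-29", "2001-04-03"))
)

## convert factor -> character -> Date so that subtraction is meaningful
myData$Date1 <- as.Date(as.character(myData$Date1))
myData$Date2 <- as.Date(as.character(myData$Date2))

## the result is a difftime object measured in days
dayDifference <- myData$Date1 - myData$Date2
dayDifference
```

Going through as.character first makes the conversion work from the date labels instead of the factor's underlying integer codes.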

Conclusion

I hope this has given you some good ideas for asking clearer questions on R-help and Stack Overflow. Remember, the bottom of every R-help message states: "PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code." This article should help explain what that request means. The gold standard is that anyone can cut and paste your data and code to reproduce the exact issue that is confusing you. If you can do that, you will be far more likely to receive quality answers to your questions quickly.
