Three tips for posting good questions to R-help and Stack Overflow

January 18, 2011

(This article was first published on sigmafield - R, and kindly contributed to R-bloggers)

As the number of R users and contributed packages increases, activity
on the R-help mailing list and the R tag on Stack Overflow also
continues to rise. Users with the knowledge to help those asking
questions naturally have limited time to assist on these forums. In
order to get the best answers in the shortest amount of time, there
are definite steps you can take as a poster to ask high-quality
questions.

The R-help Posting Guide is an excellent reference to help you ask
good questions. This article is not a substitute for reading the
guide, but rather presents a few important points and gives examples
of how to apply them. I will introduce three simple suggestions that, if
followed, should help you start receiving better feedback on your
questions.

Before we start, The Golden Rule

As someone who often tries to answer questions on R-help, I asked
myself what qualities make a good question. I am most likely to be
able to help when the question is clear and the poster includes R code
that can be copied and pasted directly into my session to
demonstrate the problem, complete with a small sample dataset if
appropriate. The following tips should get you most of the way to this
ideal.

Tip 1: Make your data reproducible

You want those who are trying to help you to be able to see the same
problem as you. The first step in making that happen is to guarantee
that you and the list are working on the exact same data. There are
two primary ways of making that happen.

First, you can use R to generate your own random data, and post the
code in your question.

The following R code generates a small sample data.frame
with variables for id, gender, and age. The set.seed function makes
sure that the random values sampled will be identical, no matter who
runs the code.

set.seed(3)
sampleData <- data.frame(id = 1:10, 
                         gender = sample(c("Male", "Female"), 10, replace = TRUE),
                         age = rnorm(10, 40, 10))
summary(sampleData)
       id           gender       age       
 Min.   : 1.00   Female:5   Min.   :27.81  
 1st Qu.: 3.25   Male  :5   1st Qu.:32.62  
 Median : 5.50              Median :40.58  
 Mean   : 5.50              Mean   :39.09  
 3rd Qu.: 7.75              3rd Qu.:42.28  
 Max.   :10.00              Max.   :52.67

Try it out for yourself. Even though we’re generating random data,
your results should match mine if you use the same integer (3) in the
set.seed function, provided your version of R uses the same default
random number generator settings as mine. If you want to use this
method, just copy the statements you used to generate the data into
your question. If you want to read more about the sample function used
in that code block, see my article demonstrating its use.
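
A quick way to convince yourself that set.seed makes random draws repeatable is to draw twice from the same seed and compare the results. This is only a minimal sketch; the object names firstDraw and secondDraw are made up for illustration.

```r
## reset the seed before each draw so both draws start from the same state
set.seed(3)
firstDraw <- rnorm(5, mean = 40, sd = 10)

set.seed(3)
secondDraw <- rnorm(5, mean = 40, sd = 10)

## TRUE only when the two vectors are exactly the same
identical(firstDraw, secondDraw)
```

If you forget the second set.seed call, the two draws will almost certainly differ, which is exactly the situation a reader of your question is in when no seed is given.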

Generating random data with the set.seed function is one way to
make sure two people are working on the same dataset. Instead of
creating random data, it is often more convenient to share the actual
data that you are working with. R has the dput function to help you
with this. This function writes out a plain text representation of
an R object that lets others create an exact copy of the object you
are working with, simply by copying and pasting the output. If the
object is large, you can take a small sample of it and then call dput
on the sample to create the text for pasting into your message.

The following code block demonstrates two simple ideas. First, if your
dataset (e.g., largeData below) has a lot of rows or columns, you
probably do not want to share the whole thing with everyone to make
your point. So, you can take a subset of the rows using the technique
below to create the sampleData object. Second, the dput function is
called, which writes out R code that represents the object. Others can
paste this code to create the same object you are using.

## we will generate a data.frame for this example, but
## this object represents your "real" data
largeData <- data.frame(id = 1:1000, age = rnorm(1000, 40, 10))

## posting the dput output of a data.frame with 1000 observations
## is probably not necessary, so we will take a small subset 
sampleData <- largeData[sample(nrow(largeData), 10), ]

## use dput to write out a text representation of the R object
dput(sampleData)
 structure(list(id = c(39L, 471L, 497L, 927L, 663L, 525L, 580L, 
622L, 48L, 727L), age = c(22.6273628946641, 35.237619316895, 
29.6406734238401, 49.7885287820185, 42.6482063541433, 35.8383991257624, 
33.1517001030015, 36.814031442543, 42.5628727298572, 48.4957262906764
)), .Names = c("id", "age"), row.names = c(39L, 471L, 497L, 927L, 
663L, 525L, 580L, 622L, 48L, 727L), class = "data.frame")

See what happened? The dput function has created the code necessary
to reproduce the object! You just have to copy and paste the result
(the lines beginning with the structure function) and assign the
resulting object to a variable name, and readers of your question will
have the same data you do. This goes a long way in helping you solve
your problems. See the sample question below for an example of
assigning the dput output to a variable name.
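
To check that a pasted dput result really does rebuild your object, you can simulate the copy-and-paste step inside a single session. The sketch below is an illustration under assumptions, not part of the original workflow: the data values and the names original, dputText, and pasted are made up.

```r
## a small data.frame standing in for your real data (made-up values)
original <- data.frame(id = 1:3, age = c(31.5, 42, 27.25))

## capture the text that dput prints, as a reader would copy it
dputText <- paste(capture.output(dput(original)), collapse = "\n")

## evaluating that text is what happens when a reader pastes it into R
pasted <- eval(parse(text = dputText))

## compare the rebuilt copy with the original
all.equal(original, pasted)
```

Running a check like this before posting catches the case where you pasted only part of the dput output into your message.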

If you need a large data set to recreate your issue,
consider putting a CSV file on a public web space, such as Dropbox,
and then providing the line of code that reads in the data in your
message. See read.csv and write.csv.
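
As a rough sketch of the CSV route, the code below writes a data.frame to a file and reads it back. The file path is hypothetical and points to a temporary directory so the example runs anywhere; in a real question you would upload the file and give readers the read.csv line with the public address of your file in place of csvPath.

```r
## made-up data standing in for your large data set
set.seed(3)
largeData <- data.frame(id = 1:1000, age = rnorm(1000, 40, 10))

## hypothetical location; in practice this is the file you upload
csvPath <- file.path(tempdir(), "largeData.csv")
write.csv(largeData, csvPath, row.names = FALSE)

## the one line readers need in order to load your data
fromCsv <- read.csv(csvPath)
str(fromCsv)
```

Setting row.names = FALSE keeps write.csv from adding an extra column of row names that readers would otherwise have to drop.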

Takeaway: Make your data accessible to others. Either give code that
generates random data or use dput on small objects and paste the
result in your question. If you must use a large data set, post it to
Dropbox and provide the code to read it in.

Tip 2: Make your code reproducible

This builds on the first tip, which shows how to make your data
reproducible to a reader of the post. This tip says that your
code should also be reproducible and concise.

Give the reader the minimum number of lines of R code needed to
reproduce either the error you are receiving or the point where you
get stuck in your program, and no more. Remember, the goal is to leave
nothing to the reader’s imagination as to what you did. We want to be
able to copy and paste the R code and have it run in any R session,
not just yours. Do not forget to include any library calls
that load packages you might be using in your examples.

Takeaway: Any reader of your question should be able to copy and paste
all the R code in your message to see the same output as you
do. Before you send your message, run the R code in your message in a
fresh R session to make sure you have met this criterion.
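
One possible shape for a self-contained example is sketched below: the library call first, then a small reproducible data set, then only the lines that show the problem. The data and the model are made up for illustration, and the splines package is used only because it ships with R but is not loaded by default.

```r
## load every package the example needs; splines ships with R
library(splines)

## small, reproducible data (made-up for illustration)
set.seed(3)
exampleData <- data.frame(x = 1:20, y = rnorm(20, 40, 10))

## the few lines that show the problem or the point where you are stuck
fit <- lm(y ~ bs(x, df = 3), data = exampleData)
coef(fit)
```

Without the library call, a reader pasting this into a fresh session would get an error about the bs function instead of seeing the issue you are actually asking about.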

Tip 3: Use proper R class names in your post

R has several different classes that can be used to store data. These
include, but are not limited to: data.frame, matrix, table, and
various vector classes. It helps readers when you use the proper class
names to describe the data you have, since functions that are
appropriate for a data.frame may not be appropriate for a matrix,
and vice versa.

You can always get the class of your object by simply using the
class function.

class(sampleData)
[1] "data.frame"

Besides helping the readers of your question, knowing which class R
uses to represent your data is very useful knowledge for you as a data
analyst.
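
To illustrate why the class matters, the sketch below stores the same made-up values as a data.frame and as a matrix, then applies the same operation to both. The object names are hypothetical; the point is only that code which works for one class can fail for the other.

```r
## the same made-up values stored in two different classes
asDataFrame <- data.frame(a = 1:3, b = 4:6)
asMatrix <- as.matrix(asDataFrame)

is.data.frame(asDataFrame)
is.matrix(asMatrix)

## the $ operator extracts a column from a data.frame ...
asDataFrame$a

## ... but fails on a matrix; try() lets the script continue past the error
result <- try(asMatrix$a, silent = TRUE)
inherits(result, "try-error")

## matrix columns are selected by name or position instead
asMatrix[, "a"]
```

A reader who is told only that you have a "table" cannot know which of these two behaviours you are seeing.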

Takeaway: Be specific about which R classes your programs use, as this
can affect the recommendations given to you and helps you think more
clearly about your analyses.

Examples of bad and good questions

Here is an example of a typical bad question.

Hello, I have a data table and am trying to subtract two dates, but
it's not working.  

I tried

> myData$Date1 - myData$Date2

but it doesn't work. Can anyone help? 

What’s wrong with a question like this? It breaks every rule I talk
about above. There is no sample data object, and the code cannot be run
by copying and pasting it into a new R session. Also, the vague term
“data table” is used instead of the more accurate “data.frame”,
which is the actual class of the R object. Essentially, the reader is
left to guess what is wrong and cannot reproduce the error that
the poster is receiving.

A much better version of this question follows.

Hello, 

I have a data.frame and am trying to subtract two components of it,
but it's not working.

I am pasting the dput output of my data.frame object. 

## dput output assigned to the myData variable
myData <- structure(list(id = 1:10, Date1 = structure(c(3L, 8L, 1L, 10L, 
6L, 4L, 5L, 9L, 7L, 2L), .Label = c("1997-07-14", "1997-10-24", 
"1997-10-26", "1998-08-21", "1998-12-31", "1999-01-31", "1999-05-09", 
"1999-11-03", "2001-11-04", "2002-06-23"), class = "factor"), 
    Date2 = structure(c(1L, 7L, 5L, 6L, 2L, 9L, 8L, 4L, 3L, 10L
    ), .Label = c("1997-07-29", "1997-09-21", "1998-05-06", "1998-07-24", 
    "1999-10-22", "2000-03-10", "2001-04-03", "2001-08-07", "2001-09-10", 
    "2002-05-07"), class = "factor")), .Names = c("id", "Date1", 
"Date2"), row.names = c(NA, -10L), class = "data.frame")

Now, when I run the command I think should work, I see the following: 

> myData$Date1 - myData$Date2
[1] NA NA NA NA NA NA NA NA NA NA
Warning message:
In Ops.factor(myData$Date1, myData$Date2) : - not meaningful for factors

What is going on?  

Now, everyone reading the question can create the same data.frame and
run the code generating the warning message exactly as the poster sees
it. The answer quickly reveals itself: the Date variables are not
really Date objects, but rather factors. Readers can verify this by
running the code exactly as it appears in the message.

str(myData)
'data.frame':	10 obs. of  3 variables:
 $ id   : int  1 2 3 4 5 6 7 8 9 10
 $ Date1: Factor w/ 10 levels "1997-07-14","1997-10-24",..: 3 8 1 10 6 4 5 9 7 2
 $ Date2: Factor w/ 10 levels "1997-07-29","1997-09-21",..: 1 7 5 6 2 9 8 4 3 10
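
Once the cause is known, one common remedy is to convert the factors to Date objects before subtracting. The sketch below is not part of the sample question. It rebuilds only the first three rows of the data above, with Date1 and Date2 stored as factors, and assumes the dates are in year-month-day format; the name dayDiff is made up for illustration.

```r
## first three rows of the example data, with dates stored as factors
myData <- data.frame(
  id = 1:3,
  Date1 = factor(c("1997-10-26", "1999-11-03", "1997-07-14")),
  Date2 = factor(c("1997-07-29", "2001-04-03", "1999-10-22"))
)

## convert factor -> character -> Date so that arithmetic is meaningful
myData$Date1 <- as.Date(as.character(myData$Date1))
myData$Date2 <- as.Date(as.character(myData$Date2))

## subtraction now returns a difftime object measured in days
dayDiff <- myData$Date1 - myData$Date2
dayDiff
```

Going through as.character first makes the conversion use the date labels rather than the factor's underlying integer codes.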

Conclusion

I hope this has given you some good ideas for asking clearer questions
on R-help and Stack Overflow. Remember, the bottom of every R-help
message states: “PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html and provide commented,
minimal, self-contained, reproducible code.” This article should help
explain what that request means. The gold standard is
that anyone can copy and paste your data and code to reproduce the
exact same issue that is confusing you. If you can do that, you will
be far more likely to receive quality answers to your questions
quickly.
