Function to Generate a Random Data Set

[This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers].

I often need data sets for trying out functions and code or for teaching.  I have a few stand-bys, such as the mtcars and CO2 data sets that ship with R, but sometimes I need a long format data set, or one with many categorical or many numeric variables, or repeated measures, or missing values, and I spend valuable time searching for the right one.  About a year ago my answer was to keep a file of several data sets that I knew fit various situations.  Eventually I grew tired of loading a data set each time I needed to test something, so I wrote a function that randomly generates a data set with categorical, numeric, interval, and repeated measures data.  I recently extended the function to offer optional missing values, long or wide format, and proportion data, and tried to speed it up for creating larger data sets.  It generally suits my needs and I think it can serve others too.

The main function, DFgen, relies on two helper functions, props and NAins.  I do not place these helper functions inside of DFgen itself as they have useful properties in and of themselves.  I’ll briefly explain each function, provide the code, and give a few tests to try it out.

The props Function

The props function generates a data frame of proportions whose rows sum to 1.  It takes two required arguments, ncol and nrow, which set the dimensions of the data frame, and an optional var.names argument that names the columns; otherwise they are named X1..Xn.  One note: this function is a poorer choice when there are many columns, because each value is sampled from whatever remains of 1 after the earlier columns, so later columns tend toward 0.  For numerous columns, Dason of talkstats.com provides an alternative that is slower but better suited (LINK).

#############################################################
# function to generate random proportions whose rowSums = 1 #
#############################################################
props <- function(ncol, nrow, var.names=NULL){
    if (ncol < 2) stop("ncol must be greater than 1")
    p <- function(n){
        y <- 0
        z <- sapply(seq_len(n-1), function(i) {
                x <- sample(seq(0, 1-y, by=.01), 1)
                y <<- y + x
                return(x)
            }
        )
        w <- c(z , 1-sum(z))
        return(w)
    }
    DF <- data.frame(t(replicate(nrow, p(n=ncol))))
    if (!is.null(var.names)) colnames(DF) <- var.names
    return(DF)
}
##############
# TRY IT OUT #
##############
props(ncol=5, nrow=5)                                      
props(ncol=3, nrow=25)                                     
props(ncol=3, nrow=5, var.names=c("red", "blue", "green"))
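
For comparison, here is a minimal sketch of one common way to get rows that sum to 1 without the later columns shrinking: draw positive random numbers and divide each row by its total.  This is not Dason's alternative (that code is not shown here), and the name props_norm is made up for illustration.  Unlike props, the values are not restricted to multiples of .01.

```r
# Hypothetical alternative to props(): normalize uniform draws by row.
# NOT Dason's version; the name props_norm is an assumption for illustration.
props_norm <- function(ncol, nrow, var.names = NULL) {
    if (ncol < 2) stop("ncol must be greater than 1")
    # nrow x ncol matrix of positive random draws
    raw <- matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol)
    # divide every row by its own total so each row sums to 1
    DF <- data.frame(raw / rowSums(raw))
    if (!is.null(var.names)) colnames(DF) <- var.names
    DF
}

set.seed(1)
p <- props_norm(ncol = 10, nrow = 5)
rowSums(p)   # each row total should be 1, up to floating point error
```

Because every column is drawn the same way before normalizing, no column is systematically smaller than the others.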

The NAins Function

The NAins function takes a data frame and randomly inserts a certain proportion of missing (NA) values.  The function has two arguments: df, which is the data frame, and prop, which is the proportion of NA values to be inserted into the data frame (default is .1).

Special thanks again to Dason of talkstats.com for helping with a speed boost for this function.  NAins accounts for a considerable share of the time DFgen takes, and he provided the code that really gains some speed.

################################################################
# RANDOMLY INSERT A CERTAIN PROPORTION OF NAs INTO A DATAFRAME #
################################################################
NAins <-  NAinsert <- function(df, prop = .1){
    n <- nrow(df)
    m <- ncol(df)
    num.to.na <- ceiling(prop*n*m)
    id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
    rows <- id %/% m + 1
    cols <- id %% m + 1
    sapply(seq_len(num.to.na), function(x){
            df[rows[x], cols[x]] <<- NA
        }
    )
    return(df)
}
##############
# TRY IT OUT #
##############
NAins(mtcars, .1)
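
As a point of comparison, the same idea can be sketched without a loop by building a logical mask and assigning NA to all selected cells at once.  This is only an illustrative variant, not the code Dason provided, and the name NAins_mask is made up for this example.

```r
# Hypothetical loop-free variant of NAins(); the name NAins_mask is an
# assumption for illustration and is not part of the original post.
NAins_mask <- function(df, prop = .1) {
    n <- nrow(df)
    m <- ncol(df)
    num.to.na <- ceiling(prop * n * m)
    # logical matrix with the same dimensions as df, TRUE where NA goes
    mask <- matrix(FALSE, nrow = n, ncol = m)
    mask[sample(n * m, num.to.na)] <- TRUE
    # logical-matrix assignment sets every selected cell to NA in one step
    df[mask] <- NA
    df
}

set.seed(1)
out <- NAins_mask(mtcars, .1)
sum(is.na(out))   # number of cells set to NA
```

The count of NA cells equals ceiling(prop * nrow(df) * ncol(df)), the same target NAins uses.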

The DFgen Function

The DFgen function randomly generates a data set with n rows and a predefined set of variables.  Calling DFgen() with no arguments uses the default of n=10 and produces a data set like this one:

> set.seed(10)
> DFgen()
      id   group hs.grad  race gender age m.status   political n.kids income score time1 time2 time3
1   ID.1   treat     yes white   male  19    never  republican      1 111000 -1.24 51.39 52.15 53.76
2   ID.2 control     yes black   male  30 divorced independent      0 122000 -0.46 32.21 35.07 33.10
3   ID.3 control     yes white   male  32  married  republican      1   2000 -0.83 43.36 45.46 46.22
4   ID.4   treat      no white   male  30 divorced  republican      1  65000  0.34 71.63 72.06 74.49
5   ID.5 control     yes white female  18  married  republican      3  96000  1.07  9.26 12.24 11.02
6   ID.6   treat     yes asian female  30  married independent      3 135000  1.22 24.10 26.45 24.74
7   ID.7   treat     yes white female  26    never    democrat      5  16000  0.74 28.76 31.72 31.39
8   ID.8   treat     yes white   male  40  married  republican      1 113000 -0.48 28.24 29.10 37.12
9   ID.9   treat     yes white   male  23  married independent      2  80000  0.56 62.99 65.09 67.72
10 ID.10   treat      no asian   male  22  married    democrat      1  96000 -1.25 43.74 46.79 44.04

The function also takes these optional arguments:

  • type: “wide” (the default) or “long”
  • na.rate: a decimal value between 0 and 1 (default is 0) giving the proportion of missing values to insert at random (great for teaching demos and testing corner cases)
  • proportion: TRUE or FALSE (the default); TRUE adds food, housing, and other columns of budget proportions (can be abbreviated to prop)
  • digits: the number of decimal places numeric columns are rounded to (default is 2)
############################################################
# GENERATE A RANDOM DATA SET.  CAN BE SET TO LONG OR WIDE. #
# DATA SET HAS FACTORS AND NUMERIC VARIABLES AND CAN       #
# OPTIONALLY GIVE BUDGET EXPENDITURES AS A PROPORTION.     #
# CAN ALSO TELL A PROPORTION OF CELLS TO BE MISSING VALUES #
############################################################
# NOTE RELIES ON THE props FUNCTION AND THE NAins FUNCTION #
############################################################
DFgen <- DFmaker <- function(n=10, type=wide, digits=2, 
    proportion=FALSE, na.rate=0) {

    rownamer <- function(dataframe){
        x <- as.data.frame(dataframe)
        rownames(x) <- NULL
        return(x)
    }

    dfround <- function(dataframe, digits = 0){
      df <- dataframe
      df[,sapply(df, is.numeric)] <-round(df[,sapply(df, is.numeric)], digits) 
      return(df)
    }

    TYPE <- as.character(substitute(type))
    time1 <- sample(1:100, n, replace = TRUE) + abs(rnorm(n))
    DF <- data.frame(id = paste0("ID.", 1:n), 
        group= sample(c("control", "treat"), n, replace = TRUE),
        hs.grad = sample(c("yes", "no"), n, replace = TRUE), 
        race = sample(c("black", "white", "asian"), n, 
            replace = TRUE, prob=c(.25, .5, .25)), 
        gender = sample(c("male", "female"), n, replace = TRUE), 
        age = sample(18:40, n, replace = TRUE),
        m.status = sample(c("never", "married", "divorced", "widowed"), 
            n, replace = TRUE, prob=c(.25, .4, .3, .05)), 
        political = sample(c("democrat", "republican", 
            "independent", "other"), n, replace= TRUE, 
            prob=c(.35, .35, .20, .1)),
        n.kids = rpois(n, 1.5), 
        income = sample(c(seq(0, 30000, by=1000), 
            seq(0, 150000, by=1000)), n, replace=TRUE),
        score = rnorm(n), 
        time1, 
        time2 = c(time1 + 2 * abs(rnorm(n))), 
        time3 = c(time1 + (4 * abs(rnorm(n)))))
    if (proportion) {
        DF <- cbind (DF[, 1:10], 
            props(ncol=3, nrow=n, var.names=c("food", 
                "housing", "other")),
            DF[, 11:14])
    }
    if (na.rate!=0) {  
        DF <- cbind(DF[, 1, drop=FALSE], NAins(DF[, -1], 
            prop=na.rate))
    }
    DF <- switch(TYPE, 
        wide = DF, 
        long = {DF <- reshape(DF, direction = "long", idvar = "id",
                varying = c("time1","time2", "time3"),
                v.names = c("value"),
                timevar = "time", times = c("time1", "time2", "time3"))
            rownamer(DF)}, 
        stop("Invalid Data \"type\""))
    return(dfround(DF, digits=digits))
}
##############
# TRY IT OUT #
##############
DFgen()            
DFgen(type="long") 
DFmaker(20000)     
DFgen(proportion=TRUE)
DFgen(na.rate=.3)
NOTE: This function relies on paste0, which was added in R 2.15.0.  If you don’t want to update an older version of R, you must include a paste0 function, found in the link below.
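
For readers on an older R, a stand-in for paste0 can be sketched in a couple of lines.  This is an assumption about what such a helper might look like, not necessarily the version in the linked file; the name paste0_compat is made up for illustration.

```r
# Hypothetical stand-in for paste0() on R versions older than 2.15.0.
# paste0_compat is an illustrative name, not the helper from the linked file.
paste0_compat <- function(..., collapse = NULL) {
    paste(..., sep = "", collapse = collapse)
}

# Only define paste0 when the running R does not already provide it
if (!exists("paste0")) paste0 <- paste0_compat

paste0_compat("ID.", 1:3)   # "ID.1" "ID.2" "ID.3"
```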



Click here for a .txt version of this demonstration

