May 16, 2014

(This article was first published on R snippets, and kindly contributed to R-bloggers)

It's time for some fun today, because it's Friday, as David Smith says :).

There are many code golf sites, and some even support R. However, most of them are algorithm-oriented. A true RGolf competition should involve transforming a source data frame into a data frame in some target format.

So the challenge today is to write the shortest R code that performs a required data transformation.

Let’s start with the data transformation task (actually the problem was taken from a real data set I have recently analyzed).

We are running a survey. Each respondent is asked some subset of possible questions (labelled by letters) and answers the question positively (1) or negatively (0). As input we are given a data frame with two columns: labels of questions asked (as letters) and sequence of answers given to them (string of 0’s and 1’s). A good R example is better than 1000 words :):

> set.seed(1)
> questions <- replicate(1000, paste(sample(letters[1:10],
      sample.int(4) + 2), collapse = ""))
> answers <- sapply(questions, function(x) {
      paste(as.character(rbinom(nchar(x), 1, 0.5)),
      collapse = "") })
> dataset <- data.frame(questions, answers,
      stringsAsFactors = FALSE)
> head(dataset)
  questions answers
1      cihe    1100
2     gdjie   01100
3    cfbhja  001000
4      febj    1110
5     ehfid   01101
6     hgdic   10010

We will want to transform dataset to the following wide format (stored in dataset2):

> head(dataset2)
   a  b  c  d  e  f  g  h  i  j
1 NA NA  1 NA  0 NA NA  0  1 NA
2 NA NA NA  1  0 NA  0 NA  0  1
3  0  1  0 NA NA  0 NA  0 NA  0
4 NA  1 NA NA  1  1 NA NA NA  0
5 NA NA NA  1  0  1 NA  1  0 NA
6 NA NA  0  0 NA NA  0  1  1 NA
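To make the mapping concrete, here is a minimal sketch of how the first respondent's row could be built by hand. It assumes the ten question labels a to j are known in advance (the challenge itself does not allow that assumption), and the variable names q, a and wide_row are only illustrative.

```r
# First respondent from the example: questions "cihe", answers "1100"
q <- strsplit("cihe", "")[[1]]              # "c" "i" "h" "e"
a <- as.integer(strsplit("1100", "")[[1]])  # 1 1 0 0

# Start with NA for every possible question (assumed here to be a..j)
wide_row <- setNames(rep(NA_integer_, 10), letters[1:10])

# Assign each answer to the column named after its question
wide_row[q] <- a
wide_row
```

Questions c and i get 1, questions h and e get 0, and every question that was not asked stays NA, which matches the first row of dataset2 above.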

The challenge is to transform dataset into dataset2 in as few keystrokes as possible, assuming that the number of questions and the number of respondents (10 and 1000, respectively, in the example data set) are unknown. The constraints are that no line of code may be longer than 80 characters and that the solution must use base R only (no package loading is allowed).
Here is my attempt:
It has 284 characters (including 3 newline characters). If you take the challenge and have a shorter solution that produces exactly the same dataset2 for a given input, post a comment ;). For a comment to be accepted, the solution must be robust to changes in the generated data set (a different number of possible questions and answers).
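If you want to score your own entry the same way, something like the following could work. This is only a sketch: golf_score is a hypothetical helper, not part of the challenge, and it assumes that the total counts every character including newlines, as in the count quoted above.

```r
# Hypothetical helper: score a candidate solution passed as one string
golf_score <- function(code) {
  lines <- strsplit(code, "\n", fixed = TRUE)[[1]]
  list(
    chars    = nchar(code),            # total length, newlines included
    max_line = max(nchar(lines)),      # longest single line
    ok       = all(nchar(lines) <= 80) # 80-character line limit respected?
  )
}

# Usage with a toy two-line snippet
golf_score("a<-1\nb<-2")
```

For the toy snippet this reports 9 characters in total (4 + 1 newline + 4) and a longest line of 4.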
Before I quit, here is the same code in a slightly more readable format, with comments:
# extract all classes that exist in dataset$questions
# and sort them
classes <- sort(unique(strsplit(paste(dataset$questions,
    collapse = ""), "")[[1]]))

# change one pair of questions and answers into
# a full vector containing all classes sorted
process.qa <- function(q, a) {
    res <- rep(NA, length(classes)) # initially no classes are set
    qs <- strsplit(q, split = "")[[1]] # extract question classes
    # extract answers and sort them in order of question classes
    as <- as.numeric(strsplit(a, split = "")[[1]][order(qs)])
    # update result with answers for existing questions
    res[grepl(paste("[", q, "]", sep = ""), classes)] <- as
    names(res) <- classes
    res # return the filled, named vector
}

dataset2 <- data.frame(t(mapply(process.qa,
    dataset$questions, dataset$answers, USE.NAMES = FALSE)))
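For comparison, here is a different, non-golfed way to get the same layout, using matrix indexing instead of a per-row function. It is a sketch rather than a competition entry: it runs on the six example rows printed earlier so that the result can be checked by eye, it produces integer rather than numeric columns, and the names q_list, a_list, m, idx and wide are my own.

```r
# First six rows of the example data, copied from the listing above
dataset <- data.frame(
  questions = c("cihe", "gdjie", "cfbhja", "febj", "ehfid", "hgdic"),
  answers   = c("1100", "01100", "001000", "1110", "01101", "10010"),
  stringsAsFactors = FALSE)

q_list <- strsplit(dataset$questions, "")  # question letters per respondent
a_list <- strsplit(dataset$answers, "")    # answer digits per respondent
classes <- sort(unique(unlist(q_list)))    # all question labels, sorted

# Matrix of NAs: one row per respondent, one column per question
m <- matrix(NA_integer_, nrow(dataset), length(classes),
            dimnames = list(NULL, classes))

# Two-column index: (respondent row, question column) for every answer
idx <- cbind(rep(seq_along(q_list), lengths(q_list)),
             match(unlist(q_list), classes))
m[idx] <- as.integer(unlist(a_list))

wide <- data.frame(m)
wide
```

Because the question labels are collected from the data, nothing here depends on there being exactly 10 questions or any particular number of respondents.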
