It's time for some fun today, because it's Friday, as David Smith says :).

There are many code golf sites, and some even support R. However, most of them are algorithm-oriented. A true RGolf competition should involve transforming a source data frame into a data frame in some target format.

So today's challenge is to write the shortest R code that performs a required data transformation.

Let’s start with the data transformation task (actually the problem was taken from a real data set I have recently analyzed).

We are running a survey. Each respondent is asked some subset of the possible questions (labelled by letters) and answers each question positively (1) or negatively (0). As input we are given a data frame with two columns: the labels of the questions asked (as letters) and the sequence of answers given to them (a string of 0's and 1's). A good R example is better than 1000 words :):

> set.seed(1)
> questions <- replicate(1000, paste(sample(letters[1:10],
      sample.int(4) + 2), collapse = ""))
> answers <- sapply(questions, function(x) {
      paste(as.character(rbinom(nchar(x), 1, 0.5)),
          collapse = "") })
> dataset <- data.frame(questions, answers,
      stringsAsFactors = FALSE)
> head(dataset)
  questions answers
1      cihe    1100
2     gdjie   01100
3    cfbhja  001000
4      febj    1110
5     ehfid   01101
6     hgdic   10010

We want to transform dataset into the following wide format (stored in dataset2):

   a  b  c  d  e  f  g  h  i  j
1 NA NA  1 NA  0 NA NA  0  1 NA
2 NA NA NA  1  0 NA  0 NA  0  1
3  0  1  0 NA NA  0 NA  0 NA  0
4 NA  1 NA NA  1  1 NA NA NA  0
5 NA NA NA  1  0  1 NA  1  0 NA
6 NA NA  0  0 NA NA  0  1  1 NA

The challenge is to transform dataset into dataset2 in as few keystrokes as possible, assuming that the number of questions and the number of respondents (10 and 1000, respectively, in the example data set) are unknown. The constraints are that no line of code may be longer than 80 characters and that the solution must use base R only (no package loading is allowed).
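To make the scoring rule concrete, here is a minimal sketch of how an entry could be measured. The helper name golf_score and its max_width parameter are hypothetical (they are not part of the challenge), and the entry is assumed to be passed as a single string with lines separated by newline characters:

```r
# Hypothetical helper, not part of the original challenge:
# scores a golf entry supplied as one string.
golf_score <- function(code, max_width = 80) {
  # Split the entry into its individual lines
  lines <- strsplit(code, "\n", fixed = TRUE)[[1]]
  list(
    chars  = nchar(code),                     # total length, newlines included
    widths = nchar(lines),                    # length of each line
    valid  = all(nchar(lines) <= max_width)   # the 80-character constraint
  )
}

# Toy two-line entry, used only to illustrate the counting
entry <- "x<-1:3\nsum(x)"
score <- golf_score(entry)
```

Counting newlines as characters follows the way the reference attempt is scored; the base-R-only constraint is not checked by this sketch.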

Here is my attempt:

d<-dataset;y<-sort(unique(strsplit(paste(d[[1]],collapse=""),"")[[1]]))
d2<-data.frame(t(mapply(function(q,a){r<-rep(NA,length(y))
r[grepl(paste("[",q,"]",sep=""),y)]<-as.numeric(strsplit(a,split="")[[1]][
order(strsplit(q,split="")[[1]])]);names(r)<-y;r},d[[1]],d[[2]],USE.NAMES=F)))

It has 284 characters (including 3 newline characters). If you take the challenge and have a shorter solution that produces exactly the same dataset2 for a given input, post a comment ;). For the comment to be accepted, the solution must be robust to changes in the generated data set (a different number of possible questions and answers).
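One way to probe the robustness requirement is to generate inputs with a different number of possible questions and respondents. The sketch below is only an illustration: make_dataset and its arguments n, k and seed are hypothetical names, and it draws the number of questions per respondent differently from the example above, so it does not reproduce the data shown in this post.

```r
# Hypothetical generator, not from the original post:
# n respondents, k possible questions labelled "a", "b", ...
make_dataset <- function(n, k, seed = 1) {
  stopifnot(k >= 1, k <= 26)   # labels come from the 26 lowercase letters
  set.seed(seed)               # note: resets the global RNG state
  # Each respondent gets between 1 and k distinct questions in random order
  questions <- replicate(n, paste(sample(letters[1:k], sample.int(k, 1)),
                                  collapse = ""))
  # One random 0/1 answer per question asked
  answers <- vapply(questions,
                    function(q) paste(rbinom(nchar(q), 1, 0.5), collapse = ""),
                    character(1), USE.NAMES = FALSE)
  data.frame(questions, answers, stringsAsFactors = FALSE)
}

small <- make_dataset(5, 3)     # few respondents, three possible questions
big   <- make_dataset(50, 26)   # more respondents, all 26 letters
```

A candidate solution can then be run on such inputs and its result compared with that of a trusted implementation, for example with identical().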

Before I quit, here is the same code in a slightly more readable, commented format:

# extract all classes that exist in dataset$questions
# and sort them
classes <- sort(unique(strsplit(paste(dataset$questions,
    collapse = ""), "")[[1]]))

# change one pair of questions and answers into
# a full vector containing all classes sorted
process.qa <- function(q, a) {
    res <- rep(NA, length(classes)) # initially no classes are set
    qs <- strsplit(q, split = "")[[1]] # extract question classes
    # extract answers and sort them in order of question classes
    as <- as.numeric(strsplit(a, split = "")[[1]][order(qs)])
    # update result with answers for existing questions
    res[grepl(paste("[", q, "]", sep = ""), classes)] <- as
    names(res) <- classes
    res
}

dataset2 <- data.frame(t(mapply(process.qa,
    dataset$questions, dataset$answers, USE.NAMES = FALSE)))
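The step that is easiest to miss is why the answers are reordered with order(). The trace below walks through the first row of the example data ("cihe", "1100") by hand. It is a sketch that assumes the ten question classes a to j from the example and uses illustrative variable names (ord, ans_sorted, hit) that do not appear in the code above.

```r
# Assumption: the ten classes from the example data set
classes <- letters[1:10]
q <- "cihe"   # questions asked, in the order they were asked
a <- "1100"   # answers, in the same order

qs  <- strsplit(q, split = "")[[1]]   # "c" "i" "h" "e"
ord <- order(qs)                      # positions of c, e, h, i within qs

# Answers rearranged so that they follow the alphabetical order of the labels
ans_sorted <- as.numeric(strsplit(a, split = "")[[1]][ord])

# TRUE for every class that appears in q; grepl() visits the classes in
# alphabetical order, which is why the answers had to be sorted first
hit <- grepl(paste("[", q, "]", sep = ""), classes)

res <- rep(NA, length(classes))
res[hit] <- ans_sorted
names(res) <- classes
res
```

The result has c = 1, e = 0, h = 0 and i = 1, with NA everywhere else, which agrees with the first row of dataset2 shown earlier.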