Your Data is Never the Right Shape

[This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your data.

There is no final “right shape.” In fact even your data is never right. You will always be called to re-do your analysis (new variables, new data, corrections) so you should always understand you are on your “penultimate analysis” (always one more to come). This is why we insist on using general methods and scripted techniques, as these methods are much much easier to reliably reapply on new data than GUI/WYSWYG techniques.

In this article we will work a small example and call out some R tools that make reshaping your data much easier. The idea is to think in terms of “relational algebra” (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner).Take a simple example where you are designing a new score called “score2” to predict or track an already known value called “score1.” The typical situation is score1 is a future outcome (such as the number of dollars profit on a transaction) and score2 is a prediction (such as the estimated profit before the transaction is attempted). Training data is usually assembled by performing a large number of transactions, recording what was known before the transaction and then aligning or joining this data with measured results when they become available. For this example we are not interested in the inputs driving the model (a rare situation, but we are trying to make our example as simple as possible) but only examining the quality of score2 (which is defined as how well it tracks score1).

All of this example will be in R, but the principles are chosen apply more generally. First let us enter some example data:

> d <- data.frame(id=c(1,2,3,1,2,3),score1=c(17,5,6,10,13,7),score2=c(13,10,5,13,10,5)) > d

This gives us our example data. Each row is numbered (1 through 6) has an id and both our scores:

  id score1 score2
1  1     17     13
2  2      5     10
3  3      6      5
4  1     10     13
5  2     13     10
6  3      7      5

We said our only task was to characterize how well score2 works at predicting score1 (or how good a substitute score2 is for score1). We could compute correlation, RMS error, info-gain or some such. But instead lets look at this graphically. We will prepare a graph showing how well score1 is represented by score2. For this we choose to place score1 on the y-axis (as it is the outcome) and score2 on the x-axis (as it is the driver).

> library(ggplot2) > ggplot(d) + geom_point(aes(x=score2,y=score1))


Figure 1: score1 as a function of score2.

This does not look good. We would liked to have seen all of the dots falling on the line “y=x.” This plot shows score2 is not predicting score1 very well. Part of this is that we missed an important feature of the data (and because we missed it the feature becomes a problem): the ids repeat. First we re-order by id to make this more obvious.

> dsort <- d[order(d$id),] > dsort

  id score1 score2
1  1     17     13
4  1     10     13
2  2      5     10
5  2     13     10
3  3      6      5
6  3      7      5

This is a very common situation. The original score is not completely a function of the known inputs. We are using “id” to abstract represent all of the inputs, two rows in our example have the same id if and only if all known inputs are exactly the same. The repeating ids are the same experiment run at different times (a good idea) and the variation in score1 could be the effect of an un-modeled input that changed value or something simple like a “noise term” (a random un-modeled effect). Notice that score2 is behaving as a function of id– all rows with the same id have the same value for score2. If score2 is a model then it has to be a function of the inputs (or more precisely if it is not a function of the inputs you have done something wrong). So any variation of score1 between rows with identical id is “unexplainable variation” (unexplainable from the point of view of currently tracked inputs). You should know about, characterize and report this variation (why it is good to have some repeated experiments). But this variation is not the model’s fault, if we want to know how good a job we did constructing the model (which we now see can be a slightly different question than how well the model works at prediction) we need to see how much of the explainable variation the model accounts for.

If we assume (as is traditional) the unexplained variation is from a “unbiased noise source” then we can lessen the impact of the noise source by replacing score1 with a value averaged over rows with the same id. This assumption is traditional because an unbiased noise source is present in many problems and assuming anything more requires more research into the problem domain. You would eventually fold such research into your model- so your goal is always have all effects or biases in your model and hope what is left over is unbiased. This is usually not strictly true, but not accounting for the unexplained variation at all is in many cases even worse than modeling the unexplained variation as being bias-free.

And now we find our data is the “wrong shape.” To replace score1 with the appropriate averages we need to do some significant data manipulation. We need to group sets of rows and add new columns. We could do this imperatively (write some loops and design some variables to track and manipulate state) or declaratively (find a path of operations from what you have to what you need through R’s data manipulation algebra). Even though the declarative method is more work the first time (you could often write the code in less time than it takes to read the manuals) it is the right way to go (as it is more versatile and powerful in the end).

Luckily we don’t have to use raw R. There are a number of remarkable packages (all by Hadley Wickham who is also the author of the ggplot2 package we use to prepare our figures) that really improve R’s ability to coherently manage data. The easiest (on us) way do fix up our data is to make the computer work hard and use the powerful melt/cast technique. These functions are found in the libraries reshape and plyr (which were automatically loaded with we loaded ggplot2 library).

melt is a bit abstract. What it does convert your data into a “narrow” format where rows are split into many rows each carrying just one result column of the original row. For example we can melt our data by id as follows:

> dmelt <- melt(d,id.vars=c('id')) > dmelt

Which yields the following:

   id variable value
1   1   score1    17
2   1   score1    10
3   2   score1     5
4   2   score1    13
5   3   score1     6
6   3   score1     7
7   1   score2    13
8   1   score2    13
9   2   score2    10
10  2   score2    10
11  3   score2     5
12  3   score2     5

Each of the two facts (score1, score2) from our original row is split into its own row. The id column plus the new variable column are now considered to be keys. This format is not used directly but used because it is easy to express important data transformations in terms of it. For instance we wanted our table to have duplicate rows collected and score1 replaced by its average (to attempt to remove the unexplainable variation). That is now easy:

> dmean <- cast(dmelt,fun.aggregate=mean) > dmean

  id score1 score2
1  1   13.5     13
2  2    9.0     10
3  3    6.5      5

We used cast() in its default mode, where it assumes all columns not equal to “value” are the keyset. It then collects all rows with identical keying and combines them back into wide rows using mean or average as the function to deal with duplicates. Notice score1 is now the desired average, and score2 is as before (as it was a function of the keys or inputs, so it is not affected by averaging). With this new smaller data set we can re-try our original graph:

> ggplot(dmean) + geom_point(aes(x=score2,y=score1))


Figure 2: mean(score1) as a function of score2.

This doesn’t look so bad. A lot of the error or variation in the first plot was unexplainable variation. score2 isn’t bad given its inputs. If you wanted to do better than score2 you would be advised to find more modeling inputs (versus trying more exotic modeling techniques).

Of course a client or user is not interested if score2 is “best possible.” They want to know if it is any good. To do this we should show them (either by graph or by quantitative summary statistics like we mentioned earlier) at least 3 things:

  1. How well the model predicts overall (the very first graph we presented).
  2. How much of the explainable variation the model predicts (the second graph).
  3. The nature of the unexplained variation (which we will explore next).

We said earlier we are hoping the unexplained variation is noise (or if it is not noise it would be nice if it is a clue to new important modeling features). So the unexplained variation must not go unexamined. We will finish by showing how to characterize the unexplained variation. As before will will just make a graph, but the data preparation steps would be exactly the same if we were using a quantitive summary (like correlation, or any other). And, of course, our data is still not the right shape for this step. Luckily there is another tool ready to fix this: join().

> djoin <- join(dsort,dsort,'id') > fixnames <- function(cn) { n <- length(cn); for(i in 2:((n+1)/2)) { cn[i] <- paste('a',cn[i],sep='') }; for(i in ((n+3)/2):n) { cn[i] <- paste('b',cn[i],sep='') }; cn } > colnames(djoin) <- fixnames(colnames(djoin)) > djoin

which produces:

   id ascore1 ascore2 bscore1 bscore2
1   1      17      13      17      13
2   1      17      13      10      13
3   1      10      13      17      13
4   1      10      13      10      13
5   2       5      10       5      10
6   2       5      10      13      10
7   2      13      10       5      10
8   2      13      10      13      10
9   3       6       5       6       5
10  3       6       5       7       5
11  3       7       5       6       5
12  3       7       5       7       5

All of the work was done by the single line “djoin <- join(dsort,dsort,'id')” the rest was just fixing the column names (as self-join is not the central use case of join). What we have now is a table that is exactly right for studying unexplained variation. For each id we have each row with the same id matched. This blows every id from having 2 rows in dsort to 4 rows in djoin. Notices this gives us every pair of score1 values seen for the same id (which will let us examine unexplained variation) and score2 is still constant over all rows with the same id (as it has always been throughout our analysis). With this table we can now plot how score1 varies for rows with the same id:

> ggplot(djoin) + geom_point(aes(x=ascore1,y=bscore1))


Figure 3: score1 as a function of score1.

And we can see, as we expected, the unexplained variation in score1 is about as large as the mismatch between score1 and score2 in our original plot. The important thing is this is all about score1 (score2 is now literally out of the picture). The analyst’s job would now be to try and tie bits of the unexplained variation to new inputs (that can be folded into a new score2) and/or characterize the noise term (so the customer knows how close they should expect repeated experiments to be).

What we are trying to encourage with the use of “big hammer tools” is an ability and willingness to look at and transform your data in meaningful steps. It often seems easier and more efficient to build one more piece of data tubing, but a lot of data tubes become an unmanageable collection of spaghetti code. The analyst should, in some sense, always be looking at data and not looking at coding details. For these sort of analyses we encourage analysts to think in terms of “data shape” and transforms. This discipline leaves more of the analysts energy and attention to think productively about the data and actual problem domain.


For the third plot showing the variation of score1 across different rows (but same ids) it may be appropriate to use a slightly more complicated join() procedure than we showed. The join shown produced rows of artificial agreement where both values of score1 came from the same row (thus had no chance of being different, so in some sense deserve no credit). This is also the only way any non-duplicated evaluations could make it to the plot. To eliminate these uninteresting agreements from the plot do the following:

> d$rowNumber <- 1:(dim(d)[1]) > djoin <- join(d,d,'id') > colnames(djoin) <- fixnames(colnames(djoin)) > djoin <- djoin[djoin$arowNumber!=djoin$browNumber,] > djoin

This gives us a table that shows only values of score1 from different rows:

   id ascore1 ascore2 arowNumber bscore1 bscore2 browNumber
2   1      17      13          1      10      13          4
4   2       5      10          2      13      10          5
6   3       6       5          3       7       5          6
7   1      10      13          4      17      13          1
9   2      13      10          5       5      10          2
11  3       7       5          6       6       5          3

And only plots points on the diagonal if “you have really earned them”:


So while the direct join() may not be the immediate perfect answer it is still a good intermediate to form as what you want is only simple data transformation away from it.

Related posts:

  1. A Demonstration of Data Mining
  2. The Data Enrichment Method
  3. R examine objects tutorial

To leave a comment for the author, please follow the link and comment on their blog: Win-Vector Blog » R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)