The other day I wrote about the R functions by, apply and friends, which allow me to operate on subsets of data. All those functions work nicely if the data is given in the right format. More often than not it isn't, and I have to reshape the data beforehand. Thus, time to discuss the reshape function. I will focus on the reshape function in base R, and not the package of the same name.
As an example I use the iris data set; head(iris) shows its first six rows:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
I would like to create a box-whisker plot showing the measurements for each of the species, as in the chart below.
I know that if I had all measurements in one column and the dimension in another column, I could produce a graph like this in one line with
bwplot(Measurement ~ Species | Dimension, data=reshaped.iris)
The reshape function is what I need. From the help file I learn that I want to transform my data from a wide format into a long format (direction="long"). In the long format I would like a variable with the measurements (v.names="Measurement"), which I get by running through the first four columns (varying=1:4). I know which measurement I am reading by looking at the column names (times=names(iris)[1:4]), and I capture the dimension names in a new variable (timevar="Dimension"). This gives me the following statement:
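reshaped.iris <- reshape(iris, direction="long", varying=1:4, v.names="Measurement", times=names(iris)[1:4], timevar="Dimension")
That's it, I can now create the lattice box-whisker plot.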
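To see what the long format actually looks like, here is a minimal sketch. It assumes the statement uses only the arguments discussed above and no idvar, in which case reshape adds an id column of its own. The object name p is just for illustration.

```r
library(lattice)  # recommended package shipped with R; provides bwplot()

# Wide to long: stack the four measurement columns into one column.
reshaped.iris <- reshape(iris,
                         direction = "long",
                         varying   = 1:4,
                         v.names   = "Measurement",
                         times     = names(iris)[1:4],
                         timevar   = "Dimension")

# 150 flowers x 4 measurements = 600 rows.
dim(reshaped.iris)

# The columns are Species, Dimension, Measurement and id.
# No idvar was supplied, so reshape() added "id" to number the original rows.
names(reshaped.iris)
head(reshaped.iris, 3)

# Build the plot object; print() draws it on the active graphics device.
p <- bwplot(Measurement ~ Species | Dimension, data = reshaped.iris)
print(p)
```

Passing an existing identifier column via idvar would avoid the extra id column; iris has none, so the generated one is harmless here.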
In my next example I would like the measurements of length and width in separate columns and capture the flower part in a new variable, so I can create scatterplots of length against width. Tweaking the reshape statement slightly gives me:
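A minimal sketch of such a statement, under a few assumptions: the new columns are called Length and Width and the new variable Part (the names used in the xyplot call further down), the part labels are "Sepal" and "Petal", and the result is stored in iris.parts, a hypothetical name. The varying argument is now a list with one vector per target column.

```r
# Wide to long with two value columns: pair up the length columns and
# the width columns, one pair per flower part.
iris.parts <- reshape(iris,
                      direction = "long",
                      varying   = list(c("Sepal.Length", "Petal.Length"),
                                       c("Sepal.Width",  "Petal.Width")),
                      v.names   = c("Length", "Width"),
                      times     = c("Sepal", "Petal"),
                      timevar   = "Part")

# 150 flowers x 2 parts = 300 rows.
dim(iris.parts)
head(iris.parts, 3)
```

The order inside each vector of varying has to match the order of times, otherwise sepal and petal values end up under the wrong label.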
Let's swap Part and Species, conditioning on Part and grouping by Species.
xyplot(Length ~ Width | Part, groups=Species,
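A self-contained version of this plot might look like the sketch below. The data frame name iris.parts, the part labels and the auto.key legend are assumptions for illustration, not necessarily the arguments of the original call.

```r
library(lattice)  # provides xyplot()

# Long format with Length and Width columns and a Part variable
# (assumed names; see the description of the second example).
iris.parts <- reshape(iris,
                      direction = "long",
                      varying   = list(c("Sepal.Length", "Petal.Length"),
                                       c("Sepal.Width",  "Petal.Width")),
                      v.names   = c("Length", "Width"),
                      times     = c("Sepal", "Petal"),
                      timevar   = "Part")

# One panel per flower part, points coloured by species.
p <- xyplot(Length ~ Width | Part, groups = Species,
            data = iris.parts,
            auto.key = list(space = "right"))  # legend: an assumed extra
print(p)
```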
I think the charts illustrate quite nicely why the iris data set has become a typical test case for many classification techniques in machine learning.