Taking a Ride on the Wild Function – Introducing the dostats package

February 20, 2012
(This article was first published on R Blog, and kindly contributed to R-bloggers)

Lately I have been rather productive in my programming and frustrated at the same time. Trying to create a demographics summary table proved to be a lesson in frustration with R. Since I love R, this was disheartening. I did eventually find the reporttools package, which makes a great LaTeX table, but only in LaTeX. The tables package also looks great, but it was not entirely what I was looking for either. So I did the first logical thing for an R user faced with this sort of problem: I created a package to fill in the missing functionality.

The dostats package/function

The new package is dostats. The package serves two purposes:

  1. Create summaries of vectors through the dostats function.
  2. Manipulate functions.

The package started out with the dostats function for creating more informative summary tables. It works much like tabular from the tables package, but it is designed to work with plyr functions. The idea is to pass in a vector as the first argument; the remaining arguments are functions that compute statistics on that vector. For example:

library(dostats)
set.seed(20120220)
dostats(rnorm(100), mean, sd, N = length)
##     mean     sd   N
## 1 0.0775 0.8975 100
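
To make the idea concrete, here is a minimal base R sketch of the same pattern. It is not the package's implementation; simple_stats is a hypothetical stand-in that assumes every function returns a single value, and it is only meant for direct calls like the one shown.

```r
# Hypothetical stand-in for illustration only; not how dostats is implemented.
simple_stats <- function(x, ...) {
  funs <- list(...)
  # Default column names come from the argument expressions (mean, sd, ...).
  labels <- vapply(as.list(substitute(list(...)))[-1],
                   function(e) deparse(e)[1], character(1))
  given <- names(funs)
  if (is.null(given)) given <- rep("", length(funs))
  # An explicit name such as N = length overrides the default label.
  names(funs) <- ifelse(given == "", labels, given)
  # Apply each function to the vector and collect one column per statistic.
  as.data.frame(lapply(funs, function(f) f(x)))
}

res <- simple_stats(c(1, 2, 3, 4), mean, sd, N = length)
print(res)
```

The result is a one-row data frame with the columns mean, sd and N, which is the shape that lets this kind of function slot into ldply.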

A renaming construct is also built in: a named argument such as N = length sets the name of the resulting column. This design is convenient because dostats can be passed directly as an argument to ldply, for example:

library(plyr)
ldply(mtcars, dostats, mean, sd, IQR)
##     .id     mean       sd     IQR
## 1   mpg  20.0906   6.0269   7.375
## 2   cyl   6.1875   1.7859   4.000
## 3  disp 230.7219 123.9387 205.175
## 4    hp 146.6875  68.5629  83.500
## 5  drat   3.5966   0.5347   0.840
## 6    wt   3.2172   0.9785   1.029
## 7  qsec  17.8487   1.7869   2.008
## 8    vs   0.4375   0.5040   1.000
## 9    am   0.4062   0.4990   1.000
## 10 gear   3.6875   0.7378   1.000
## 11 carb   2.8125   1.6152   2.000

This makes for a more logical summary data.frame object that has usable columns, each with the same data type. Unfortunately this does not work for every data set. The above example only has numerical data. With a data frame that contains categorical data, the same functions would be applied to the categorical variables as well. Another limitation is that the results of each function must have the same dimension for each variable. For this reason I introduced functions that filter by the variable class.

  • class.stats creates a dostats function for a given class, tested by inherits.
  • integer.stats is a predefined class.stats function for integer variables. It is defined as class.stats('integer').
  • numeric.stats for numeric variables, which would also include integer variables.
  • factor.stats for factors.

When a class.stats function is passed to ldply, variables not matching that class are silently removed.

ldply(iris, numeric.stats, mean, sd)
##            .id  mean     sd
## 1 Sepal.Length 5.843 0.8281
## 2  Sepal.Width 3.057 0.4359
## 3 Petal.Length 3.758 1.7653
## 4  Petal.Width 1.199 0.7622
ldply(iris, factor.stats, N = length)
##       .id   N
## 1 Species 150
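
The filtering idea can be sketched in base R as follows. This is an assumption about the general approach (test with inherits, return NULL for non-matching variables), not the package's code; only_class and factor_levels are hypothetical names.

```r
# Hypothetical sketch: wrap a function so it only runs on variables of one class.
only_class <- function(.class, f) {
  function(x, ...) {
    # Non-matching variables yield NULL, which is dropped afterwards.
    if (inherits(x, .class)) f(x, ...) else NULL
  }
}

factor_levels <- only_class("factor", nlevels)

# Apply to every column of iris, then discard the NULL results.
res <- Filter(Negate(is.null), lapply(iris, factor_levels))
print(res)
```

Only Species survives the filter, since it is the only factor column in iris.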

You can also chain together arguments to compute on subsets using ddply and ldply.

ddply(iris, .(Species), ldply, numeric.stats,
    mean, median, sd)
##       Species          .id  mean median     sd
## 1      setosa Sepal.Length 5.006   5.00 0.3525
## 2      setosa  Sepal.Width 3.428   3.40 0.3791
## 3      setosa Petal.Length 1.462   1.50 0.1737
## 4      setosa  Petal.Width 0.246   0.20 0.1054
## 5  versicolor Sepal.Length 5.936   5.90 0.5162
## 6  versicolor  Sepal.Width 2.770   2.80 0.3138
## 7  versicolor Petal.Length 4.260   4.35 0.4699
## 8  versicolor  Petal.Width 1.326   1.30 0.1978
## 9   virginica Sepal.Length 6.588   6.50 0.6359
## 10  virginica  Sepal.Width 2.974   3.00 0.3225
## 11  virginica Petal.Length 5.552   5.55 0.5519
## 12  virginica  Petal.Width 2.026   2.00 0.2747
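
For comparison, the group means in the table above can be reproduced with base R alone. This sketch only covers the mean column and uses no dostats or plyr functions; by_species and group_means are names chosen for this example.

```r
# Keep the numeric columns, then split the rows by species.
by_species <- split(iris[sapply(iris, is.numeric)], iris$Species)

# One row per species, one column per numeric variable.
group_means <- t(sapply(by_species, colMeans))
print(round(group_means, 3))
```

The ddply and ldply chain gives the same numbers in long form, with any number of statistics side by side.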

Function manipulations

Passing all these functions around also requires some extra function manipulation functions. Now that is a mouthful, but something we do with R.

Composition

R lacks a function composition function, so I created one. function(x)any(is.na(x)) is just too long to type, and I find myself doing things like this far too often. The word “function” alone takes up a lot of space. It is much easier to write any%.%is.na or compose(any, is.na), either of which results in a new function that tests whether there are any missing values. The two forms are

  1. compose(...)
  2. fun1%.%fun2

compose takes any number of arguments and nests them, with the rightmost being the innermost and the leftmost being the outermost. An easy way to remember the order is that the functions read the same as they would in the nested call: compose(any, is.na) corresponds to any(is.na(x)).
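
The nesting order can be illustrated with a small base R sketch. compose_sketch is a hypothetical stand-in built on Reduce, not the package's compose, and it assumes each function takes a single argument.

```r
# Hypothetical sketch of right-to-left composition using base R only.
compose_sketch <- function(...) {
  funs <- list(...)
  function(x) {
    # rev() puts the rightmost (innermost) function first in the pipeline.
    Reduce(function(value, f) f(value), rev(funs), x)
  }
}

any_missing <- compose_sketch(any, is.na)
print(any_missing(c(1, NA, 3)))   # same as any(is.na(c(1, NA, 3)))
print(any_missing(c(1, 2, 3)))
```

The first call returns TRUE and the second FALSE, matching what the hand-written function(x)any(is.na(x)) would give.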

Argument Manipulations

Composition and dostats only operate on the first argument, which necessitates functions for manipulating arguments.

  1. wargs: creates a new function with changed defaults. For example, wargs(mean, na.rm=TRUE) creates a new function that automatically removes missing values.
  2. onarg: specifies which argument of the function should come first. For example, onarg(rep, 'times') makes the number of times to repeat the first argument.
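
Both ideas can be approximated in base R. The helpers below, with_args and on_arg, are hypothetical stand-ins and not the package's functions. In particular, with_args fixes the given arguments rather than changing defaults, so passing na.rm again to the resulting function would fail.

```r
# Hypothetical sketch: preset some arguments of a function.
with_args <- function(f, ...) {
  preset <- list(...)
  function(...) do.call(f, c(list(...), preset))
}

# Hypothetical sketch: route the first positional argument to a named argument.
on_arg <- function(f, arg) {
  function(x, ...) {
    args <- c(setNames(list(x), arg), list(...))
    do.call(f, args)
  }
}

mean_no_na <- with_args(mean, na.rm = TRUE)
print(mean_no_na(c(1, NA, 3)))

times_first <- on_arg(rep, "times")
print(times_first(3, x = "a"))
```

Here mean_no_na ignores the missing value, and times_first(3, x = "a") is equivalent to rep("a", times = 3).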

One example of this that is included in dostats is contains, with its operator form %contains%, which takes its arguments in the reverse order of %in%.
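
As a rough illustration of reversing the argument order, here is a base R sketch. swap_args, contains_sketch and %has% are hypothetical names, and the sketch assumes that "reverse order" means the set comes first and the candidate elements second.

```r
# Hypothetical sketch: build a two-argument function with swapped arguments.
swap_args <- function(f) function(x, y) f(y, x)

# The set is on the left, the elements to look up are on the right.
contains_sketch <- swap_args(`%in%`)
`%has%` <- contains_sketch

print(c("a", "b", "c") %has% "b")
print(contains_sketch(c("a", "b", "c"), c("b", "z")))
```

The first call returns TRUE, and the second returns one logical value per element looked up.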

Conclusion

There will likely be more functions as I come across the need for them. If you have an idea for something that should be included, submit it to the issue tracker.
