Normalising data within groups

June 21, 2012
(This article was first published on Insights of a PhD student » R, and kindly contributed to R-bloggers)

Occasionally it proves useful to normalise data. By this I mean scaling it so that the values lie between zero and one (for non-negative data). Admittedly, most people frown on this, but there are papers out there that use this method*.

How do we go about this? It's a very simple formula to calculate:

y'[i] = y[i]/sqrt(sum(y^2))

So we square all of the ys, add them up and take the square root (call it the denominator). Then we divide each individual y value by the denominator.
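
The calculation just described can be sketched in base R as follows. The values in y are made up for illustration, and the variable names are my own.

```r
# Example values (made up for illustration)
y <- c(3, 4, 12)

# Denominator: square root of the sum of the squared values
denominator <- sqrt(sum(y^2))

# Divide each individual value by the denominator
y_norm <- y / denominator

# After normalising, the squared values sum to one
sum(y_norm^2)
```

Because the squared values sum to one, no single normalised value can exceed one in absolute size.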

In R this is simple – for instance, decostand in the vegan package does exactly this with method = "normalize" (plus a whole heap of other standardisations).

But what I couldn't find was a function to take it a step further, a function that normalised within groups:

y'[ij] = y[ij]/sqrt(sum(y^2[j]))

The difference here is the js, of course: the sum of squares is taken within each group j rather than over all of the data. Or to go a step further still:

y'[ijk] = y[ijk]/sqrt(sum(y^2[jk]))

where the ks represent subgroups of j.
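
For comparison, the same within-group scaling can be sketched in base R with ave(), which applies a function separately to each group and returns the results in the original row order. This is only an illustration of the formulas above, not the function described in this post; the toy data and the helper name unit_scale are assumptions.

```r
# Toy data (made up for illustration)
d <- data.frame(
  sex    = c("f", "f", "f", "f", "m", "m", "m", "m"),
  age    = c("young", "young", "old", "old", "young", "young", "old", "old"),
  weight = c(3, 4, 6, 8, 5, 12, 8, 15)
)

# Scale a numeric vector by the square root of its sum of squares
unit_scale <- function(v) v / sqrt(sum(v^2))

# Normalise within one grouping variable (the j index)
d$norm_by_sex <- ave(d$weight, d$sex, FUN = unit_scale)

# Normalise within combinations of two grouping variables (the jk index)
d$norm_by_sex_age <- ave(d$weight, d$sex, d$age, FUN = unit_scale)

# Check: the squared normalised values sum to one within each group
tapply(d$norm_by_sex^2, d$sex, sum)
```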

I needed to do just this, so I wrote a function to do it!

You can get hold of it by running

source("http://db.tt/22hmSliJ")

in R. This provides you with a function called normalise with the following arguments:

dataframe – self explanatory

columns – a quoted variable name (e.g. “weight”). It currently only works on a single column, so the name is a bit of a misnomer, but it's easy enough to loop over several columns**

by – one or two grouping factors, again quoted and enclosed in c() if there are two

na.rm – logical, remove any NAs? Defaults to TRUE

data <- normalise(data, "weight", by="sex")

to normalise weight according to sex, or

data <- normalise(data, "weight", by=c("age", "sex"))

to normalise weight by age and sex.

The function adds a column to the original dataframe with the original name preceded by “norm.”, so in this case it would be “norm.weight”.
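
If the link above is unavailable, the interface described here can be approximated with a short base R sketch. This is a hypothetical stand-in written for illustration, not the original function: the name normalise_sketch, the handling of na.rm (passed to sum(), so NA values stay NA in the output) and the toy data are all assumptions.

```r
# Hypothetical stand-in for illustration; not the original normalise()
normalise_sketch <- function(dataframe, column, by, na.rm = TRUE) {
  # Grouping columns as an unnamed list, one element per grouping variable
  groups <- unname(as.list(dataframe[by]))

  # Scale a vector by the square root of its sum of squares
  unit_scale <- function(v) v / sqrt(sum(v^2, na.rm = na.rm))

  # ave() applies unit_scale within each combination of the grouping variables
  args <- c(list(dataframe[[column]]), groups, list(FUN = unit_scale))
  dataframe[[paste0("norm.", column)]] <- do.call(ave, args)
  dataframe
}

# Toy data (made up for illustration)
d <- data.frame(sex = c("f", "f", "m", "m"), weight = c(3, 4, 6, 8))
d <- normalise_sketch(d, "weight", by = "sex")
d$norm.weight  # 0.6 0.8 0.6 0.8
```

Since ave() returns values in the original row order and accepts character vectors as well as factors for grouping, this sketch leaves the rows of the dataframe where they were.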

Currently it only works if the by argument is a factor, but I shall change that at some point and update this post. It might also change the row order of the dataframe, but I don't think that's much of a big deal.

Hope it helps!

* e.g. Risch AC, Jurgensen MF, Frank DA (2007) Effects of grazing and soil micro-climate on decomposition rates in a spatio-temporally heterogeneous grassland. Plant and Soil 298:191-201

**

# Normalise several numeric columns in turn, each within the same grouping factor
for (i in c("height", "weight")) {
  data <- normalise(data, i, by = "sex")
}
