# Normalising data within groups

June 21, 2012
(This article was first published on Insights of a PhD student » R, and kindly contributed to R-bloggers)

Occasionally it proves useful to normalise data. By this I mean scaling it so that the values lie between zero and one (for non-negative data, at least). Admittedly, most people frown on this, but there are papers out there that use the method*.

`y'[i] = y[i]/sqrt(sum(y^2))`

So we square all of the ys, add them up and take the square root (call it the denominator). Then we divide each individual y value by the denominator.
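As a minimal sketch of those two steps in base R (the values in `y` are made up for illustration and are not from any real data set):

```r
# Toy data: hypothetical values chosen so the arithmetic is easy to follow
y <- c(3, 4)

# Denominator: square all the ys, add them up, take the square root
denom <- sqrt(sum(y^2))   # sqrt(9 + 16) = 5

# Divide each individual y value by the denominator
y_norm <- y / denom
y_norm                    # 0.6 0.8
```

A handy check is that the squared normalised values sum to one, i.e. `sum(y_norm^2)` is 1 up to floating-point error.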

In R this is simple – for instance `decostand` in the vegan package does exactly this with `method = "normalize"` (plus a whole heap of other standardisations).

But what I couldn't find was a function to take it a step further, a function that normalises within groups:

`y'[ij] = y[ij]/sqrt(sum(y^2[j]))`

The difference here is the js, of course: the sum of squares is now taken within each group j rather than over all of the data. Or to go a step further still:

`y'[ijk] = y[ijk]/sqrt(sum(y^2[jk]))`

where the ks represent subgroups of j.
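For a rough idea of what the grouped versions amount to, here is a sketch using base R's `ave()`, which applies a function within each level (or combination of levels) of its grouping arguments. The data frame `d`, its columns and the helper `unit_norm` are all hypothetical names made up for this example, not part of my function.

```r
# Hypothetical toy data: two sexes (j), each with two age classes (k)
d <- data.frame(
  sex    = factor(rep(c("f", "m"), each = 4)),
  age    = factor(rep(c("young", "young", "old", "old"), times = 2)),
  weight = c(3, 4, 6, 8, 5, 12, 8, 15)
)

# Unit-norm scaling of a numeric vector: v / sqrt(sum(v^2))
unit_norm <- function(v) v / sqrt(sum(v^2))

# Normalise within each level of j (here: sex)
d$norm_by_sex <- ave(d$weight, d$sex, FUN = unit_norm)

# Normalise within each combination of j and k (here: sex and age)
d$norm_by_sex_age <- ave(d$weight, d$sex, d$age, FUN = unit_norm)

# Check: the squared normalised values sum to one within every group
tapply(d$norm_by_sex^2, d$sex, sum)
tapply(d$norm_by_sex_age^2, interaction(d$sex, d$age), sum)
```

Note that `ave()` returns the values in the original row order, so nothing gets shuffled.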

I needed to do just this, so I wrote a function to do it!

You can get hold of it by running

`source("http://db.tt/22hmSliJ")`

in R. This provides you with a function called normalise with the following arguments

dataframe – self explanatory

columns – a quoted variable name (e.g. “weight”). It currently only works on a single column, so the name is a bit of a misnomer, but it's easy enough to loop over several columns**

by – one or two grouping factors, again quoted and enclosed in c() if there are two

na.rm – logical, remove any NAs? Defaults to TRUE

So you can run

`data <- normalise(data, "weight", by="sex")`

to normalise weight according to sex, or

`data <- normalise(data, "weight", by=c("age", "sex"))`

to normalise weight by age and sex.

The function adds a column to the original dataframe with the original name preceded by “norm.”, so in this case it would be “norm.weight”.
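In case the link above is unavailable, here is a minimal sketch of a function that behaves the way I have described. It is a hypothetical re-implementation under my own assumptions, not the code behind the `source()` call: the name `normalise_sketch`, its internals and the toy data are all made up for illustration.

```r
# Hypothetical re-implementation; NOT the code behind the source() link
normalise_sketch <- function(dataframe, column, by, na.rm = TRUE) {
  stopifnot(is.data.frame(dataframe),
            length(column) == 1,
            is.numeric(dataframe[[column]]),
            length(by) %in% 1:2)

  # One group label per row, built from the one or two grouping columns
  groups <- interaction(dataframe[by], drop = TRUE)

  # Divide each value by the root sum of squares of its group
  unit_norm <- function(v) v / sqrt(sum(v^2, na.rm = na.rm))

  # New column named "norm.<column>"; the row order is left unchanged
  dataframe[[paste0("norm.", column)]] <-
    ave(dataframe[[column]], groups, FUN = unit_norm)
  dataframe
}

# Usage on made-up data
d <- data.frame(sex = c("f", "f", "m", "m"), weight = c(3, 4, 6, 8))
d <- normalise_sketch(d, "weight", by = "sex")
d$norm.weight   # 0.6 0.8 0.6 0.8
```

Because `interaction()` converts its inputs to factors, this sketch does not need the grouping columns to be factors beforehand, and using `ave()` keeps the rows in their original order.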

Currently it only works if the variables given in the by argument are factors, but I shall change that at some point and update this post. It might also change the row order of the dataframe, but I don't think that's much of a big deal.

Hope it helps!

* e.g. Risch AC, Jurgensen MF, Frank DA (2007) Effects of grazing and soil micro-climate on decomposition rates in a spatio-temporally heterogeneous grassland. Plant and Soil 298:191-201

**

```
# Normalise several numeric columns in turn, each within the levels of sex
for (i in c("height", "weight")) {
  data <- normalise(data, i, by = "sex")
}
```
