Summarize content of a vector or data.frame every n entries

August 8, 2013
By

(This article was first published on Fabio Marroni's Blog » R, and kindly contributed to R-bloggers)

I imagine that the same result can be achieved by a proper use of quantile, but I like to have an easy way to obtain summary statistics every n entries of my dataset be it a vector or data.frame.

The function takes three parameters: the R object on which we need to obtain statistics (x), how many entries should each summary contain (step, defaulting to 1000), and the function we want to apply (fun, defaulting to “mean”).

Then, it’s all about using aggregate.

summarize.by<-function(x,step=1000,fun="mean")
{
	
	if(is.data.frame(x))
	{
		group<-sort(rep(seq(1,ceiling(nrow(x)/step)),step)[1:nrow(x)])
	}
	if(is.vector(x))
	{
		group<-sort(rep(seq(1,ceiling(length(x)/step)),step)[1:length(x)])
	}
	x<-data.frame(group,x)
	x<-aggregate(x,by=list(x$group),FUN=fun)
	x<-x[,-c(1,2)]
	return(x)
}

Example application and result for a data.frame:

dummy<-data.frame(matrix(runif(100000,0,1),ncol=10))
summarize.by(dummy)
          X1        X2        X3        X4        X5        X6        X7
1  0.5081756 0.5206011 0.4972622 0.5060707 0.4907807 0.5063138 0.4982252
2  0.5014300 0.5093051 0.5015310 0.4718058 0.4931249 0.4882382 0.5084970
3  0.4994759 0.4979546 0.4964157 0.5138695 0.5018427 0.5228862 0.4980824
4  0.4970300 0.4953163 0.4954068 0.5157935 0.4770471 0.5000562 0.4960250
5  0.5118221 0.4967686 0.5114420 0.4945936 0.5016019 0.5003544 0.5016693
6  0.5026323 0.4995367 0.5003587 0.4970245 0.4992188 0.4993896 0.4873300
7  0.4911944 0.5081578 0.4858666 0.4974576 0.4864710 0.5022401 0.5058064
8  0.5050684 0.5021456 0.4970707 0.4829222 0.4980984 0.4901941 0.5053296
9  0.4910359 0.4883865 0.4915000 0.4984415 0.4941274 0.4933778 0.4964306
10 0.4832396 0.4986647 0.5017873 0.5008766 0.4952849 0.5036030 0.5084799
          X8        X9       X10
1  0.5052379 0.4906292 0.4916262
2  0.5074966 0.5117570 0.5183119
3  0.4988349 0.5029704 0.5077726
4  0.4889516 0.5066026 0.5078195
5  0.5068717 0.4988389 0.5018225
6  0.5010366 0.4870614 0.4827767
7  0.5148197 0.5083662 0.5037901
8  0.4979452 0.5273463 0.4944513
9  0.5130718 0.5061075 0.5058208
10 0.4896030 0.4911127 0.4956848

And for a vector


dummy<-runif(10000,0,1)
summarize.by(dummy)
 [1] 0.4914789 0.4908839 0.4951939 0.4928015 0.4911908 0.4994735 0.4947729
 [8] 0.5058204 0.5026956 0.5018375

To leave a comment for the author, please follow the link and comment on his blog: Fabio Marroni's Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.