Summarize content of a vector or data.frame every n entries

[This article was first published on Fabio Marroni's Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I imagine that the same result can be achieved by a proper use of quantile, but I like to have an easy way to obtain summary statistics every n entries of my dataset be it a vector or data.frame.

The function takes three parameters: the R object on which we need to obtain statistics (x), how many entries should each summary contain (step, defaulting to 1000), and the function we want to apply (fun, defaulting to “mean”).

Then, it’s all about using aggregate.

summarize.by<-function(x,step=1000,fun="mean")
{
	
	if(is.data.frame(x))
	{
		group<-sort(rep(seq(1,ceiling(nrow(x)/step)),step)[1:nrow(x)])
	}
	if(is.vector(x))
	{
		group<-sort(rep(seq(1,ceiling(length(x)/step)),step)[1:length(x)])
	}
	x<-data.frame(group,x)
	x<-aggregate(x,by=list(x$group),FUN=fun)
	x<-x[,-c(1,2)]
	return(x)
}

Example application and result for a data.frame:

dummy<-data.frame(matrix(runif(100000,0,1),ncol=10))
summarize.by(dummy)
          X1        X2        X3        X4        X5        X6        X7
1  0.5081756 0.5206011 0.4972622 0.5060707 0.4907807 0.5063138 0.4982252
2  0.5014300 0.5093051 0.5015310 0.4718058 0.4931249 0.4882382 0.5084970
3  0.4994759 0.4979546 0.4964157 0.5138695 0.5018427 0.5228862 0.4980824
4  0.4970300 0.4953163 0.4954068 0.5157935 0.4770471 0.5000562 0.4960250
5  0.5118221 0.4967686 0.5114420 0.4945936 0.5016019 0.5003544 0.5016693
6  0.5026323 0.4995367 0.5003587 0.4970245 0.4992188 0.4993896 0.4873300
7  0.4911944 0.5081578 0.4858666 0.4974576 0.4864710 0.5022401 0.5058064
8  0.5050684 0.5021456 0.4970707 0.4829222 0.4980984 0.4901941 0.5053296
9  0.4910359 0.4883865 0.4915000 0.4984415 0.4941274 0.4933778 0.4964306
10 0.4832396 0.4986647 0.5017873 0.5008766 0.4952849 0.5036030 0.5084799
          X8        X9       X10
1  0.5052379 0.4906292 0.4916262
2  0.5074966 0.5117570 0.5183119
3  0.4988349 0.5029704 0.5077726
4  0.4889516 0.5066026 0.5078195
5  0.5068717 0.4988389 0.5018225
6  0.5010366 0.4870614 0.4827767
7  0.5148197 0.5083662 0.5037901
8  0.4979452 0.5273463 0.4944513
9  0.5130718 0.5061075 0.5058208
10 0.4896030 0.4911127 0.4956848

And for a vector


dummy<-runif(10000,0,1)
summarize.by(dummy)
 [1] 0.4914789 0.4908839 0.4951939 0.4928015 0.4911908 0.4994735 0.4947729
 [8] 0.5058204 0.5026956 0.5018375

To leave a comment for the author, please follow the link and comment on their blog: Fabio Marroni's Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)