RProtoBuf & HistogramTools: Statistical Analysis Tools for Large Data Sets

Posted on October 10, 2013 by Stephanie Taylor in R bloggers

[This article was first published on Google Open Source Blog, and kindly contributed to R-bloggers.]


At Google, building, managing, and securing some of the world’s largest storage systems requires complex analysis of filesystem metadata. This analysis is an important part of making sure that the information stored within those systems is quickly accessible and always secure. We’re always looking for ways to make our data storage systems more efficient, which often requires understanding the age, size, and access patterns of the stored data, the failure rates of servers and disks, and more. You can imagine how complex this becomes with each new data center added.

Given the number of files and servers that are relevant for this performance analysis, we bin the metadata into a compact histogram form. We use these output histograms for many purposes, such as (i) building Markov models of data availability, (ii) statistical forecasting of resource usage, and (iii) formulating and solving optimization problems to determine optimal allocation of flash devices.
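As a rough illustration of the binning step, the base-R sketch below turns simulated file ages into histograms with shared bin edges and then combines them by adding counts. The data, the two "servers", and the 30-day bin width are all assumptions made up for this example; they are not taken from our systems.

```r
# Simulated stand-in for filesystem metadata: ages (in days) of 100,000
# files on each of two hypothetical servers. Real inputs would come from
# a metadata scan; these values are assumptions for illustration only.
set.seed(42)
ages_a <- rexp(1e5, rate = 1 / 90)
ages_b <- rexp(1e5, rate = 1 / 30)

# Shared 30-day bin edges that span both data sets.
upper <- ceiling(max(ages_a, ages_b) / 30) * 30
breaks <- seq(0, upper, by = 30)

# Bin each server's data without plotting.
h_a <- hist(ages_a, breaks = breaks, plot = FALSE)
h_b <- hist(ages_b, breaks = breaks, plot = FALSE)

# Histograms with identical edges combine by adding their counts.
combined <- h_a$counts + h_b$counts

# The binned form is far smaller than the raw observations.
print(object.size(ages_a), units = "Kb")
print(object.size(h_a), units = "Kb")
```

Because the bin edges are fixed up front, each server only needs to report a short vector of counts, and the combined histogram never requires the raw observations to be in one place.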

We rely on several open source tools to make our work easier. The most common tool we use for statistical analysis of the performance, availability, and resource needs of our internal systems is the R programming language. We’ve released two package updates that make R particularly suitable for interacting with other distributed systems.

RProtoBuf is an R interface to Google’s Protocol Buffers library that allows one to define simple data structures with intuitive getter and setter methods. These data structures can be serialized into an extremely compact format for sending to other distributed systems. Recent releases include improved support for 64-bit integers, protocol buffer extensions, and more.
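A minimal sketch of the getter, setter, and serialization workflow might look like the following. It assumes RProtoBuf is installed and relies on the tutorial.Person example message type that ships with the package; the field values are placeholders, and the block is skipped entirely when the package is unavailable.

```r
roundtrip_ok <- NA  # stays NA when RProtoBuf is not installed

if (requireNamespace("RProtoBuf", quietly = TRUE)) {
  library(RProtoBuf)

  # Look up the example message type bundled with the package.
  Person <- P("tutorial.Person")

  # Create a message, then use the $ getter and setter.
  p <- new(Person, name = "Example User", id = 1)
  p$email <- "user@example.com"
  print(p$name)

  # Serialize to a compact raw vector and parse it back.
  bytes <- p$serialize(NULL)
  print(length(bytes))
  q <- Person$read(bytes)

  # Check that the fields survived the round trip.
  roundtrip_ok <- identical(q$name, p$name) &&
    identical(q$email, p$email)
}
```

The raw vector in `bytes` is what would be written to a file or sent to another system, which can parse it with the same message definition in any language that has Protocol Buffers support.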

HistogramTools is a new R package I have released that uses RProtoBuf to read in a compact protocol buffer representation of binned data. It includes a number of helpful functions for manipulating and plotting histograms, and for measuring the statistical information loss due to the binning. In addition to protocol buffers, it also supports importing aggregate performance data directly from DTrace output.
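To give a sense of what information loss due to binning can mean, here is a base-R sketch that does not use HistogramTools itself. It compares the two most extreme cumulative distributions consistent with a histogram (every point at the left edge of its bin versus every point at the right edge) and reports the largest vertical gap and the normalized area between them. The helper name `bin_loss`, the simulated data, and the bin widths are assumptions for illustration.

```r
# Simulated observations; an assumption for illustration only.
set.seed(1)
x <- rexp(1000, rate = 1 / 90)

# Hypothetical helper: bin x with equal-width bins, then measure how far
# apart the extreme cumulative curves implied by the histogram can be.
bin_loss <- function(x, width) {
  breaks <- seq(0, ceiling(max(x) / width) * width, by = width)
  h <- hist(x, breaks = breaks, plot = FALSE)
  cum <- cumsum(h$counts) / sum(h$counts)

  # Within bin i the curve with all mass at left edges equals cum[i];
  # the curve with all mass at right edges still equals cum[i - 1].
  upper <- cum
  lower <- c(0, head(cum, -1))
  gap <- upper - lower

  c(ks_gap = max(gap),
    area_gap = sum(gap * diff(h$breaks)) / diff(range(h$breaks)))
}

coarse <- bin_loss(x, width = 100)
fine <- bin_loss(x, width = 10)
print(coarse)
print(fine)
```

Narrower bins leave less room between the two bounding curves, so both measures shrink as the histogram becomes finer, at the cost of a larger representation.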

Both packages are available on CRAN and include extensive documentation and examples.

If you’re interested in learning more, we have shared some of our research findings at conferences such as OSDI, USENIX ATC, and JSM.

By Murray Stokely, Storage Analytics Team Lead

