Site icon R-bloggers

RProtoBuf & HistogramTools: Statistical Analysis Tools for Large Data Sets

[This article was first published on Google Open Source Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
At Google, building, managing and safely securing some of the world’s largest storage systems requires complex analysis of filesystem metadata. This is an important part of making sure that the information stored within those systems is quickly accessible and always secure. We’re always looking for ways to make our data storage systems more efficient, and often times, this requires understanding the age, size and access patterns of the data stored, the failure rates of servers and disks, and more. You can imagine how complex this becomes with each new data center added.

Given the number of files and servers that are relevant for this performance analysis, we bin the metadata into a compact histogram form. We use these output histograms for many purposes, such as (i) building Markov models of data availability, (ii) statistical forecasting of resource usage, and (iii) formulating and solving optimization problems to determine optimal allocation of flash devices.

We rely on several open source tools to make our work easier. The most common tool we use for statistical analysis of the performance, availability, and resource needs of our internal systems is the R programming language. We’ve released two package updates that make R particularly suitable for interacting with other distributed systems.
Both packages are available on CRAN and include extensive documentation and examples.

If you’re interested to learn more, we have shared some of our research findings at conferences such as OSDI, USENIX ATC, and JSM.

By Murray Stokely, Storage Analytics Team Lead

To leave a comment for the author, please follow the link and comment on their blog: Google Open Source Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.