Summarizing big data in R

May 30, 2017

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Our next "R and big data tip" is: summarizing big data.

We always say "if you are not looking at the data, you are not doing science", and with big data you depend heavily on summaries (as you can’t actually look at everything).

Simple question: is there an easy way to summarize big data in R?

The answer is: yes, but we suggest you use the replyr package to do so.

Let’s set up a trivial example.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.5.0'
library("sparklyr")
packageVersion("sparklyr")
## [1] '0.5.5'
library("replyr")
packageVersion("replyr")
## [1] '0.3.902'
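# connect to a local Spark instance
sc <- sparklyr::spark_connect(version='2.0.2', 
                              master = "local")
# copy iris into Spark; diris is a remote handle, not a local data.frame
diris <- copy_to(sc, iris, 'diris')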

The usual S3 summary() summarizes the handle, not the data.

summary(diris)
##     Length Class          Mode
## src 1      src_spark      list
## ops 3      op_base_remote list
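
Since summary() only describes the handle, one common way to look at remote data is to request a few rows or a few aggregates through dplyr verbs and bring back only that small result. The following is a minimal sketch, not taken from the original post: it uses a local data.frame as a stand-in for the Spark handle (an assumption made so the example runs without a Spark connection), and the names d, peek, and agg are illustrative.

```r
library("dplyr")

# Local stand-in for the remote handle (assumption: no Spark connection here).
# Column names are rewritten to match the underscore names shown for diris.
d <- iris
names(d) <- gsub(".", "_", names(d), fixed = TRUE)

# Look at a handful of rows; on a remote handle collect() brings them into R.
peek <- d %>% head(5) %>% collect()

# Ask for a few aggregates; only this one-row result needs to come back.
agg <- d %>%
  summarise(n = n(),
            mean_sepal_length = mean(Sepal_Length, na.rm = TRUE),
            sd_sepal_length = sd(Sepal_Length, na.rm = TRUE)) %>%
  collect()

print(peek)
print(agg)
```

This works column by column, which gets tedious for wide tables; that is the gap a whole-table summary is meant to fill.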

tibble::glimpse() throws.

packageVersion("tibble")
## [1] '1.3.3'
# errors-out
glimpse(diris)
## Observations: 150
## Variables: 5

## Error in if (width[i] <= max_width[i]) next: missing value where TRUE/FALSE needed

broom::glance() throws.

packageVersion("broom")
## [1] '0.4.2'
broom::glance(diris)
## Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl

replyr_summary() works, and returns results in a data.frame.

replyr_summary(diris) %>%
  select(-nunique, -index, -nrows)
##         column     class nna min max     mean        sd lexmin    lexmax
## 1 Sepal_Length   numeric   0 4.3 7.9 5.843333 0.8280661         
## 2  Sepal_Width   numeric   0 2.0 4.4 3.057333 0.4358663         
## 3 Petal_Length   numeric   0 1.0 6.9 3.758000 1.7652982         
## 4  Petal_Width   numeric   0 0.1 2.5 1.199333 0.7622377         
## 5      Species character   0  NA  NA       NA        NA setosa virginica
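
To make the columns of this result concrete, here is a rough base R sketch of the same kind of per-column summary computed on a local data.frame. It is not replyr's implementation: summarize_columns is a hypothetical helper written for illustration, and it assumes the data fits in memory, which is exactly what a remote summary avoids.

```r
# Hypothetical helper (not part of replyr): per-column summary of a
# local, in-memory data.frame, mirroring the columns shown above.
summarize_columns <- function(d) {
  rows <- lapply(names(d), function(col) {
    v <- d[[col]]
    is_num <- is.numeric(v)
    s <- as.character(v)  # used for lexical min/max of non-numeric columns
    data.frame(
      column = col,
      class  = class(v)[1],
      nna    = sum(is.na(v)),
      min    = if (is_num) min(v, na.rm = TRUE) else NA_real_,
      max    = if (is_num) max(v, na.rm = TRUE) else NA_real_,
      mean   = if (is_num) mean(v, na.rm = TRUE) else NA_real_,
      sd     = if (is_num) stats::sd(v, na.rm = TRUE) else NA_real_,
      lexmin = if (is_num) NA_character_ else min(s, na.rm = TRUE),
      lexmax = if (is_num) NA_character_ else max(s, na.rm = TRUE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

local_summary <- summarize_columns(iris)
print(local_summary)
```

On local iris the column names keep their dots (Sepal.Length rather than Sepal_Length) and Species is a factor rather than a character column, so the class column differs from the Spark result above.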

sparklyr::spark_disconnect(sc)
rm(list=ls())
gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  762515 40.8    1442291 77.1  1168576 62.5
## Vcells 1394407 10.7    2552219 19.5  1820135 13.9
