Summarizing big data in R
[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].
Our next “R and big data tip” is: summarizing big data.
We always say “if you are not looking at the data, you are not doing science”, and for big data you are very dependent on summaries (as you can’t actually look at everything).
Simple question: is there an easy way to summarize big data in R?
The answer is: yes, but we suggest you use the replyr package to do so.
Let’s set up a trivial example.
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.5.0'
library("sparklyr")
packageVersion("sparklyr")
## [1] '0.5.5'
library("replyr")
packageVersion("replyr")
## [1] '0.3.902'
sc <- sparklyr::spark_connect(version='2.0.2', master = "local")
diris <- copy_to(sc, iris, 'diris')
The usual S3 summary() summarizes the handle, not the data.
summary(diris)
##     Length Class          Mode
## src 1      src_spark      list
## ops 3      op_base_remote list
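To see why the output above describes the handle rather than the rows, here is a minimal base-R sketch. It assumes a hypothetical stand-in object (the class names `tbl_demo` and `src_demo` and the field contents are made up for illustration and are not the real sparklyr internals). Because no `summary()` method is registered for the stand-in's class, R falls back to `summary.default()`, which treats the object as a plain list and reports the length, class, and mode of each element.

```r
# Hypothetical stand-in for a remote table handle; the field contents are
# illustrative assumptions, not the real sparklyr internals.
handle <- structure(
  list(
    src = structure(list(con = "demo-connection"), class = "src_demo"),
    ops = structure(list(x = "diris", vars = c("a", "b"), n = 150L),
                    class = "op_base_remote")
  ),
  class = "tbl_demo"
)

# No summary.tbl_demo method exists, so summary.default() is dispatched.
# It describes the list elements (src, ops), not any data they point to.
handle_summary <- summary(handle)
print(handle_summary)
```

The result has one row per list element and the columns Length, Class, and Mode, which is the same shape as the `summary(diris)` output shown above.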
tibble::glimpse() throws.
packageVersion("tibble")
## [1] '1.3.3'
# errors-out
glimpse(diris)
## Observations: 150
## Variables: 5
## Error in if (width[i] <= max_width[i]) next: missing value where TRUE/FALSE needed
broom::glance() throws.
packageVersion("broom")
## [1] '0.4.2'
broom::glance(diris)
## Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl
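When probing several summarizers against a remote handle, it can help to capture the error instead of letting it abort the script. The following is only a sketch under stated assumptions: `handle`, `strict_summary()`, and `try_summarizer()` are hypothetical names invented for this example, and the handle is a plain list standing in for a remote table, not a real Spark connection.

```r
# Hypothetical stand-in for a remote table handle (not a real sparklyr object).
handle <- structure(list(src = "demo"), class = c("tbl_demo_remote", "tbl_lazy"))

# Hypothetical summarizer that only understands local data.frames.
strict_summary <- function(x) {
  if (!is.data.frame(x)) {
    stop("don't know how to deal with data of class ",
         paste(class(x), collapse = ""))
  }
  summary(x)
}

# Run a summarizer and capture any error message instead of aborting.
try_summarizer <- function(f, x) {
  tryCatch(
    list(ok = TRUE, value = f(x)),
    error = function(e) list(ok = FALSE, value = conditionMessage(e))
  )
}

res_remote <- try_summarizer(strict_summary, handle)  # fails, message captured
res_local  <- try_summarizer(strict_summary, iris)    # succeeds on local data
print(res_remote$value)
```

The same wrapper could be pointed at any function that takes the table as its single argument.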
replyr_summary() works, and returns results in a data.frame.
replyr_summary(diris) %>% select(-nunique, -index, -nrows)
##         column     class nna min max     mean        sd lexmin    lexmax
## 1 Sepal_Length   numeric   0 4.3 7.9 5.843333 0.8280661   <NA>      <NA>
## 2  Sepal_Width   numeric   0 2.0 4.4 3.057333 0.4358663   <NA>      <NA>
## 3 Petal_Length   numeric   0 1.0 6.9 3.758000 1.7652982   <NA>      <NA>
## 4  Petal_Width   numeric   0 0.1 2.5 1.199333 0.7622377   <NA>      <NA>
## 5      Species character   0  NA  NA       NA        NA setosa virginica
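For readers who want a feel for what each column of that result means, here is a local, base-R sketch that builds a similarly shaped data.frame from the in-memory `iris` data. It is not replyr's implementation: `local_summary()` is a hypothetical helper written for this illustration, and it does not touch Spark. Note that local `iris` keeps the dotted column names and stores `Species` as a factor, so the class column differs from the Spark result above.

```r
# Hypothetical helper: per-column summary of a local data.frame.
# This is NOT replyr's implementation, only an illustration of the output shape.
local_summary <- function(d) {
  rows <- lapply(names(d), function(nm) {
    x <- d[[nm]]
    is_num <- is.numeric(x)
    chr <- as.character(x)
    data.frame(
      column = nm,
      class  = class(x)[1],
      nna    = sum(is.na(x)),                                  # count of missing values
      min    = if (is_num) min(x, na.rm = TRUE) else NA_real_,
      max    = if (is_num) max(x, na.rm = TRUE) else NA_real_,
      mean   = if (is_num) mean(x, na.rm = TRUE) else NA_real_,
      sd     = if (is_num) sd(x, na.rm = TRUE) else NA_real_,
      # lexical min/max only for non-numeric columns
      lexmin = if (is_num) NA_character_ else min(chr, na.rm = TRUE),
      lexmax = if (is_num) NA_character_ else max(chr, na.rm = TRUE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

res <- local_summary(iris)
print(res)
```

The point of `replyr_summary()` is that the equivalent work is done against the remote table, so only the small summary comes back to R; the sketch above only mirrors the shape of that result on local data.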
sparklyr::spark_disconnect(sc)
rm(list=ls())
gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  762515 40.8    1442291 77.1  1168576 62.5
## Vcells 1394407 10.7    2552219 19.5  1820135 13.9
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.