(This article was first published on **R – Win-Vector Blog**, and kindly contributed to R-bloggers)

Our next "R and big data tip" is: summarizing big data.

We always say "if you are not looking at the data, you are not doing science", and for big data you depend heavily on summaries (as you can’t actually look at everything).

Simple question: is there an easy way to summarize big data in `R`?

The answer is: yes, but we suggest you use the `replyr` package to do so.

Let’s set up a trivial example.

```
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
```

`## [1] '0.5.0'`

```
library("sparklyr")
packageVersion("sparklyr")
```

`## [1] '0.5.5'`

```
library("replyr")
packageVersion("replyr")
```

`## [1] '0.3.902'`

```
sc <- sparklyr::spark_connect(version = '2.0.2',
                              master = "local")
diris <- copy_to(sc, iris, 'diris')
```

The usual `S3` generic `summary()` summarizes the handle, not the data.

`summary(diris)`

```
##     Length Class          Mode
## src 1      src_spark      list
## ops 3      op_base_remote list
```
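The `Length` / `Class` / `Mode` layout above is what base R's `summary.default()` produces for any list-like object that has no `summary()` method of its own. The following is a minimal sketch of that behavior, assuming the remote handle is essentially a classed list; `fake_handle`, `tbl_demo`, `src_demo`, and `op_demo` are hypothetical names invented for illustration and are not part of `sparklyr` or `dplyr`.

```r
# Hypothetical stand-in for a remote table handle: a small classed list that
# holds a connection description and a query plan, but no data rows.
fake_handle <- structure(
  list(
    src = structure(list(con = "spark://example"), class = "src_demo"),
    ops = structure(list(x = "diris", vars = character(0), dots = list()),
                    class = "op_demo")
  ),
  class = "tbl_demo"
)

# No summary.tbl_demo() method exists, so dispatch falls through to
# summary.default(), which describes each list element, not any table rows.
handle_summary <- summary(fake_handle)
print(handle_summary)
```

The result has one row per list element (`src`, `ops`), which is why the output says nothing about the 150 rows of data behind the handle.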

`tibble::glimpse()` throws.

`packageVersion("tibble")`

`## [1] '1.3.3'`

```
# errors out
glimpse(diris)
```

```
## Observations: 150
## Variables: 5
## Error in if (width[i] <= max_width[i]) next: missing value where TRUE/FALSE needed
```

`broom::glance()` throws.

`packageVersion("broom")`

`## [1] '0.4.2'`

`broom::glance(diris)`

`## Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl`

`replyr_summary()` works, and returns results in a `data.frame`.

```
replyr_summary(diris) %>%
  select(-nunique, -index, -nrows)
```

```
##         column     class nna min max     mean        sd lexmin    lexmax
## 1 Sepal_Length   numeric   0 4.3 7.9 5.843333 0.8280661
## 2  Sepal_Width   numeric   0 2.0 4.4 3.057333 0.4358663
## 3 Petal_Length   numeric   0 1.0 6.9 3.758000 1.7652982
## 4  Petal_Width   numeric   0 0.1 2.5 1.199333 0.7622377
## 5      Species character   0  NA  NA       NA        NA setosa virginica
```
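Because `iris` is small enough to hold in memory, the figures in this summary can be cross-checked locally. The sketch below uses base R only and is not how `replyr_summary()` works internally (it operates on the remote table); `summarize_column` and `local_summary` are hypothetical names introduced here for illustration.

```r
# Local cross-check of the summary columns, using the in-memory iris data.
# Note: local iris uses dots in column names (Sepal.Length); the Spark copy
# shown above has underscores (Sepal_Length), and Species is a factor locally.
summarize_column <- function(d, col) {
  x <- d[[col]]
  is_num <- is.numeric(x)
  chr <- as.character(x)
  data.frame(
    column = col,
    class  = class(x)[1],
    nna    = sum(is.na(x)),
    min    = if (is_num) min(x, na.rm = TRUE) else NA_real_,
    max    = if (is_num) max(x, na.rm = TRUE) else NA_real_,
    mean   = if (is_num) mean(x, na.rm = TRUE) else NA_real_,
    sd     = if (is_num) sd(x, na.rm = TRUE) else NA_real_,
    # Lexical (string-order) extremes are only reported for non-numeric columns.
    lexmin = if (is_num) NA_character_ else min(chr, na.rm = TRUE),
    lexmax = if (is_num) NA_character_ else max(chr, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}

# One row per column, stacked into a single data.frame.
local_summary <- do.call(
  rbind,
  lapply(names(iris), function(col) summarize_column(iris, col))
)
print(local_summary)
```

This kind of check only works while the data still fits in memory, which is exactly the situation a remote summary is meant to avoid; it is useful mainly for validating the remote numbers on a small sample.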

```
sparklyr::spark_disconnect(sc)
rm(list=ls())
gc()
```

```
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  762515 40.8    1442291 77.1  1168576 62.5
## Vcells 1394407 10.7    2552219 19.5  1820135 13.9
```
