# DataExplorer: Fast Data Exploration With Minimum Code

This article was first published on **Revolutions** and kindly contributed to R-bloggers.


*by Boxuan Cui, Data Scientist at Smarter Travel*

Once upon a time, there was a joke:

In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.

— Big Data Borat (@BigDataBorat) February 27, 2013

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. DataExplorer is one of the tools built for this problem: its sole mission is to shrink that 80% and make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible about input data classes, so you should be able to throw in any `data.frame`-like object. However, certain functions require a `data.table` object as input because they rely on its update-by-reference feature, which I will cover later in the post.

Enough said. Let's look at some code, shall we?

Take the `BostonHousing` dataset from the `mlbench` package:

```r
library(mlbench)
data("BostonHousing", package = "mlbench")
```

### Initial Visualization

Without knowing anything about the data, my first 3 tasks are almost always:

```r
library(DataExplorer)
plot_missing(BostonHousing)   ## Are there missing values, and what is the missing data profile?
plot_bar(BostonHousing)       ## What does the categorical frequency of each discrete variable look like?
plot_histogram(BostonHousing) ## What is the distribution of each continuous variable?
```

While there are not many interesting insights from `plot_missing` and `plot_bar`, below is the output from `plot_histogram`.
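The same three questions can also be answered numerically before plotting. The sketch below uses only base R and the built-in `airquality` dataset, an assumption made here because it contains missing values; it is not part of the walkthrough. The split into discrete and continuous columns is a simplification and may not match how DataExplorer classifies them.

```r
# Built-in dataset with some missing values; stands in for your own data.
dat <- airquality
dat$Month <- factor(dat$Month)   # treat month as a discrete variable

# 1. Share of missing values per column (the question plot_missing answers).
missing_share <- colMeans(is.na(dat))
print(sort(missing_share, decreasing = TRUE))

# 2. Simplified split: factors and characters count as discrete,
#    everything else as continuous.
is_discrete <- vapply(
  dat,
  function(col) is.factor(col) || is.character(col),
  logical(1)
)

# 3. Frequency tables for discrete columns, summaries for continuous ones.
discrete_freq <- lapply(dat[is_discrete], table)
continuous_summary <- lapply(dat[!is_discrete], summary)
print(discrete_freq$Month)
print(continuous_summary$Ozone)
```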

Upon scrutiny, the variable **rad** looks discrete, and I want to group **crim**, **zn**, **indus** and **b** into bins as well. Let's do so:

```r
## Set `rad` to factor
BostonHousing$rad <- as.factor(BostonHousing$rad)

## Create new discrete variables
for (col in c("crim", "zn", "indus", "b")) {
  BostonHousing[[paste0(col, "_d")]] <- as.factor(ggplot2::cut_interval(BostonHousing[[col]], 2))
}

## Plot bar chart for all discrete variables
plot_bar(BostonHousing)
```
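For readers curious what `cut_interval(x, 2)` does to a column: it splits the observed range into two intervals of equal width. A rough base R equivalent, run on a made-up vector rather than the Boston data, might look like this (the helper name `equal_width_bins` is hypothetical):

```r
# Hypothetical helper: n bins of equal width across the range of x.
equal_width_bins <- function(x, n) {
  breaks <- seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n + 1)
  cut(x, breaks = breaks, include.lowest = TRUE)
}

x <- c(0, 1, 2, 10, 40, 90, 100)   # made-up, skewed values
bins <- equal_width_bins(x, 2)
print(table(bins))
```

Equal width does not mean equal counts, so a skewed variable will typically end up with one crowded bin and one sparse bin.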

At this point, we have a much better understanding of the data distribution. Now assume we are interested in **medv** (median value of owner-occupied homes, in thousands of USD), and would like to build a model to predict it. Let's plot it against all other variables:

```r
plot_boxplot(BostonHousing, by = "medv")

plot_scatterplot(
  subset(BostonHousing, select = -c(crim, zn, indus, b)),
  by = "medv", size = 0.5
)

plot_correlation(BostonHousing)
```

And this is how you slice & dice your data, and analyze correlation with merely 3 lines of code.
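To read the correlation numbers themselves rather than a heatmap, base R's `cor()` covers the numeric part. This is only a sketch on the built-in `mtcars` data, with `mpg` standing in for the response (both are assumptions for illustration), and it ignores discrete columns entirely.

```r
dat <- mtcars      # built-in, all-numeric dataset used for illustration
target <- "mpg"    # stand-in for a response such as medv

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(dat, use = "pairwise.complete.obs")

# Correlation of every other variable with the target, strongest first.
with_target <- cor_matrix[, target]
with_target <- with_target[names(with_target) != target]
ranked <- with_target[order(abs(with_target), decreasing = TRUE)]
print(round(ranked, 2))
```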

### Feature Engineering

Feature engineering is a crucial step in building better models. DataExplorer provides a couple of functions to ease the process. All of them require a `data.table` as the input object, because it is lightning fast. However, if you don't feel like coding in `data.table` syntax, you may adopt the following process:

```r
library(data.table)

## Set your data to `data.table` first
your_data <- data.table(your_data)

## Apply DataExplorer functions
group_category(your_data, ...)
drop_columns(your_data, ...)
set_missing(your_data, ...)

## Set data back to the original class
## (replace the placeholder with the class name, e.g. "data.frame")
class(your_data) <- "original_object_name"
```
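A concrete version of that round trip, using only `data.table` calls so that it does not depend on any particular DataExplorer version, could look like the sketch below. The toy data frame and the column name `junk` are made up for illustration.

```r
library(data.table)

df <- data.frame(id = 1:3, junk = c("x", "y", "z"))   # made-up data
original_class <- class(df)

# Convert a copy to data.table; df itself is left untouched.
dt <- as.data.table(df)

# In-place work happens here (this is where DataExplorer's helpers would go).
dt[, junk := NULL]

# Convert back to a plain data.frame by reference and check the class.
setDF(dt)
print(class(dt))
```

Here `setDF()` takes the place of the manual class assignment; for input classes other than `data.frame`, a dedicated converter is probably the safer route.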

Let's return to the `BostonHousing` dataset. For the rest of this section, we'll assume the data has been converted to a `data.table` already.

```r
library(data.table)
BostonHousingDT <- data.table(BostonHousing)
```

Remember those transformed continuous variables? Let's drop them:

```r
drop_columns(BostonHousingDT, c("crim", "zn", "indus", "b"))
```

Note: Because `data.table` updates by reference, the original object is updated without the need to re-assign a returned object.
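To see what update by reference means in practice, compare a function that modifies a `data.table` with one that modifies a plain `data.frame`. The helper names below are hypothetical; `:=` is the standard data.table operator for in-place assignment.

```r
library(data.table)

# Hypothetical helpers that each try to remove column `b`.
drop_b_dt <- function(x) invisible(x[, b := NULL])   # modifies the caller's object
drop_b_df <- function(x) {
  x$b <- NULL                                        # modifies a local copy only
  invisible(x)
}

dt <- data.table(a = 1:3, b = 4:6)
df <- data.frame(a = 1:3, b = 4:6)

drop_b_dt(dt)   # no re-assignment needed
drop_b_df(df)   # result discarded, so df is unchanged

print(names(dt))   # column b has been removed
print(names(df))   # column b is still present
```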

Let's take a look at the discrete variable **rad**:

```r
plot_bar(BostonHousingDT$rad)
```

I think categories other than 4, 5 and 24 are too sparse, and might skew my model fit. How could I group all the sparse categories together?

```r
group_category(BostonHousingDT, "rad", 0.25, update = FALSE)
#    rad cnt       pct   cum_pct
# 1:  24 132 0.2608696 0.2608696
# 2:   5 115 0.2272727 0.4881423
# 3:   4 110 0.2173913 0.7055336
```

It looks like grouping the bottom 25% of **rad** would give me what I need. Let's do so:

```r
group_category(BostonHousingDT, "rad", 0.25, update = TRUE)
plot_bar(BostonHousingDT$rad)
```

In addition to categorical frequency, you may also play with the `measure` argument to group by the sum of a different variable. See `?group_category` for more example use cases.
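The cumulative percentages printed by `group_category(..., update = FALSE)` suggest the underlying idea: sort categories by frequency, keep the most frequent ones until the cumulative share would exceed one minus the threshold, and lump the rest together. The base R sketch below illustrates that idea on made-up data. The helper `lump_sparse`, the `"OTHER"` label, and the exact cutoff rule are assumptions inferred from the output above, not the package's implementation.

```r
# Hypothetical helper, not part of DataExplorer: lump together the sparsest
# categories that make up roughly the bottom `threshold` share of all rows.
lump_sparse <- function(x, threshold, other = "OTHER") {
  x <- as.character(x)
  freq <- sort(table(x), decreasing = TRUE)
  cum_pct <- cumsum(as.vector(freq)) / length(x)
  keep <- names(freq)[cum_pct <= 1 - threshold]   # assumed cutoff rule
  ifelse(x %in% keep, x, other)
}

# Made-up categorical variable: two common levels and three sparse ones.
x <- rep(c("a", "b", "c", "d", "e"), times = c(40, 30, 20, 6, 4))
grouped <- lump_sparse(x, threshold = 0.25)
print(table(grouped))
```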

### Data Report

To generate a report of your data:

```r
create_report(BostonHousing)
```

Currently, there is not much to do with this, but it is my plan to support customization of the generated report, so stay tuned for more features!

I hope you enjoyed exploring the Boston housing data with me, and finally here are some additional resources about the DataExplorer package:
