(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

*by Boxuan Cui, Data Scientist at Smarter Travel*

Once upon a time, there was a joke:

In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.

— Big Data Borat (@BigDataBorat) February 27, 2013

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. Among the many resources that address this, DataExplorer has a single mission: to shrink that 80% and make it more enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible about input classes, so you should be able to throw in any `data.frame`-like object. However, certain functions require a `data.table` object as input because they rely on its update-by-reference feature, which I will cover later in the post.

Now enough said and let's look at some code, shall we?

Take the `BostonHousing` dataset from the `mlbench` library:

```
library(mlbench)
data("BostonHousing", package = "mlbench")
```

### Initial Visualization

Without knowing anything about the data, my first 3 tasks are almost always:

```
library(DataExplorer)
plot_missing(BostonHousing) ## Are there missing values, and what is the missing data profile?
plot_bar(BostonHousing) ## What does the categorical frequency of each discrete variable look like?
plot_histogram(BostonHousing) ## What is the distribution of each continuous variable?
```
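If you also want numbers behind these plots, here is a minimal base R sketch of the same three questions. It uses a made-up `toy` data frame (an assumption for illustration, not part of `BostonHousing`), and it is not how DataExplorer builds its plots internally.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  price = c(1, NA, 3, 4),
  town  = c("x", "y", NA, NA),
  stringsAsFactors = FALSE
)

# Share of missing values per column
missing_share <- colMeans(is.na(toy))

# Which columns are continuous (numeric) and which are discrete
is_continuous <- sapply(toy, is.numeric)

# Frequency of each category in the discrete columns
lapply(toy[!is_continuous], table)

# Summary statistics of the continuous columns
lapply(toy[is_continuous], summary)
```

Swapping `toy` for your own data frame should give a quick textual counterpart to the three plots.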

While there are not many interesting insights from `plot_missing` and `plot_bar`, below is the output from `plot_histogram`.

Upon scrutiny, the variable **rad** looks discrete, and I want to group **crim**, **zn**, **indus** and **b** into bins as well. Let's do so:

```
## Set `rad` to factor
BostonHousing$rad <- as.factor(BostonHousing$rad)
## Create new discrete variables
for (col in c("crim", "zn", "indus", "b")) {
  BostonHousing[[paste0(col, "_d")]] <- as.factor(ggplot2::cut_interval(BostonHousing[[col]], 2))
}
## Plot bar chart for all discrete variables
plot_bar(BostonHousing)
```
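
To make the binning step less of a black box: `cut_interval(x, 2)` splits a variable into two bins of equal width. The base R sketch below does roughly the same thing on invented `x` values (an assumption for illustration). It is an approximation of the idea, not ggplot2's actual implementation, and the bin labels may be formatted differently.

```r
x <- c(0, 1, 4, 9, 10)  # toy values, not taken from BostonHousing

# n equal-width bins need n + 1 evenly spaced break points
n_bins <- 2
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# include.lowest = TRUE keeps the minimum value inside the first bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

table(bins)
```

Equal-width bins are easy to read, but with a heavily skewed variable most observations can land in a single bin.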

At this point, we have a much better understanding of the data distribution. Now assume we are interested in **medv** (median value of owner-occupied homes in USD 1000's), and would like to build a model to predict it. Let's plot it against all other variables:

`plot_boxplot(BostonHousing, by = "medv")`

`plot_scatterplot(subset(BostonHousing, select = -c(crim, zn, indus, b)), by = "medv", size = 0.5)`

`plot_correlation(BostonHousing)`

And this is how you slice & dice your data, and analyze correlation with merely 3 lines of code.
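
A numeric ranking can complement the correlation plot. The base R sketch below runs on an invented `toy` data frame, with a hypothetical `target` column standing in for **medv**, and ranks the numeric columns by the absolute value of their Pearson correlation with the target. It only handles numeric columns and is not what `plot_correlation` does under the hood.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  target = c(1, 2, 3, 4, 5),
  up     = c(2, 4, 6, 8, 10),
  down   = c(5, 4, 3, 2, 1),
  noisy  = c(2, 1, 4, 3, 5),
  label  = c("a", "b", "a", "b", "a")
)

# Keep numeric columns only; `label` is dropped here
numeric_cols <- toy[sapply(toy, is.numeric)]

# Correlation of every numeric column with the target
cors <- cor(numeric_cols)[, "target"]
cors <- cors[names(cors) != "target"]

# Strongest relationships first, regardless of sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```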

### Feature Engineering

Feature engineering is a crucial step in building better models. DataExplorer provides a couple of functions to ease the process. All of them require a `data.table` as the input object, because it is lightning fast. However, if you don't feel like coding in `data.table` syntax, you may adopt the following process:

```
## Set your data to `data.table` first
library(data.table)
your_data <- data.table(your_data)
## Apply DataExplorer functions
group_category(your_data, ...)
drop_columns(your_data, ...)
set_missing(your_data, ...)
## Set data back to the original class (replace the placeholder with its name)
class(your_data) <- "original_object_name"
```
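
If your original object is a plain `data.frame`, one possible alternative to resetting the class by hand is data.table's own `setDT()` and `setDF()`, which convert in place. The sketch below assumes a toy `your_data` and uses a plain `:=` update as a stand-in for the DataExplorer calls; these are data.table functions, not part of DataExplorer.

```r
library(data.table)

# Toy data frame standing in for `your_data` (invented for illustration)
your_data <- data.frame(id = 1:3, value = c(10, NA, 30))

# Convert to data.table in place, without copying the data
setDT(your_data)

# Any by-reference step goes here; this one fills missing values with 0
your_data[is.na(value), value := 0]

# Convert back to a plain data.frame, again in place
setDF(your_data)

class(your_data)
```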

Let's return to the `BostonHousing` dataset. For the rest of this section, we'll assume the data has already been converted to a `data.table`.

```
library(data.table)
BostonHousingDT <- data.table(BostonHousing)
```

Remember those transformed continuous variables? Let's drop them:

`drop_columns(BostonHousingDT, c("crim", "zn", "indus", "b"))`

Note: Because `data.table` updates by reference, the original object is updated without the need to re-assign a returned object.
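
To see what update-by-reference means in practice, here is a small sketch using data.table's `:=` operator directly on an invented table. The names `dt`, `alias` and `snapshot` are assumptions for illustration. A second name bound to the same table should see the change too, while an explicit `copy()` should not.

```r
library(data.table)

# Toy table, invented for illustration only
dt <- data.table(x = 1:3, y = letters[1:3], z = c(0.1, 0.2, 0.3))

alias <- dt           # a second name for the same object, not a copy
snapshot <- copy(dt)  # an explicit, independent copy

# Drop column `z` by reference; nothing is re-assigned
dt[, z := NULL]

names(dt)
names(alias)
names(snapshot)
```

This is why it is worth calling `copy()` first whenever you want to keep the untouched data around.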

Let's take a look at the discrete variable **rad**:

`plot_bar(BostonHousingDT$rad)`

I think categories other than 4, 5 and 24 are too sparse, and might skew my model fit. How could I group all the sparse categories together?

```
group_category(BostonHousingDT, "rad", 0.25, update = FALSE)
#    rad cnt       pct   cum_pct
# 1:  24 132 0.2608696 0.2608696
# 2:   5 115 0.2272727 0.4881423
# 3:   4 110 0.2173913 0.7055336
```

It looks like grouping the bottom 25% of **rad** would give me what I need. Let's do so:

```
group_category(BostonHousingDT, "rad", 0.25, update = TRUE)
plot_bar(BostonHousingDT$rad)
```
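
The threshold idea used above can be sketched in base R. The output shown earlier suggests that categories are sorted by frequency and kept while their cumulative share stays within `1 - threshold`, with everything else lumped together. The hypothetical helper `lump_sparse()` below follows that reading on toy data. It is an illustration, not DataExplorer's actual code, and the `"OTHER"` label is an assumption.

```r
# Hypothetical helper, not part of DataExplorer
lump_sparse <- function(x, threshold, other = "OTHER") {
  x <- as.character(x)
  # Category counts, most frequent first
  counts <- sort(table(x), decreasing = TRUE)
  # Cumulative share of observations covered so far
  cum_pct <- cumsum(as.vector(counts)) / length(x)
  # Keep categories until the cumulative share passes 1 - threshold
  keep <- names(counts)[cum_pct <= 1 - threshold]
  ifelse(x %in% keep, x, other)
}

# Toy data, invented for illustration only
toy <- c(rep("a", 5), rep("b", 3), "c", "d")
lumped <- lump_sparse(toy, threshold = 0.25)

table(lumped)
```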

In addition to categorical frequency, you may also play with the `measure` argument to group by the sum of a different variable. See `?group_category` for more example use cases.

### Data Report

To generate a report of your data:

`create_report(BostonHousing)`

Currently, there is not much to do with this, but it is my plan to support customization of the generated report, so stay tuned for more features!

I hope you enjoyed exploring the Boston housing data with me.
