DataExplorer: Fast Data Exploration With Minimum Code

February 8, 2018

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Boxuan Cui, Data Scientist at Smarter Travel

Once upon a time, there was a joke:

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. DataExplorer is one of the many resources out there, and its sole mission is to minimize the 80% and make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible about input data classes, so you should be able to throw in any data.frame-like object. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover in a later part of the post.

Enough said. Let's look at some code, shall we?


Take the BostonHousing dataset from the mlbench library:

library(mlbench)
data("BostonHousing", package = "mlbench")

Initial Visualization

Without knowing anything about the data, my first 3 tasks are almost always:

library(DataExplorer)
plot_missing(BostonHousing) ## Are there missing values, and what is the missing data profile?
plot_bar(BostonHousing) ## What does the categorical frequency for each discrete variable look like?
plot_histogram(BostonHousing) ## What is the distribution of each continuous variable?
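To make the first question concrete, here is a minimal base R sketch of the kind of per-column summary a missing-data profile is built on. The data frame `df` and its values are made up for illustration, and this is not how plot_missing is implemented internally.

```r
# Toy data frame (made up for illustration; not the BostonHousing data)
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, NA),
  c = 1:4
)

# is.na() returns a logical matrix; the column means are the share of
# missing values in each column
missing_share <- colMeans(is.na(df))
print(missing_share)
```

A column with a share of 0 has no missing values, which is the situation in the BostonHousing data.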

While there are not many interesting insights from plot_missing and plot_bar, below is the output from plot_histogram.

Histogram

Upon scrutiny, the variable rad looks discrete, and I want to group crim, zn, indus and b into bins as well. Let's do so:

## Set `rad` to factor
BostonHousing$rad <- as.factor(BostonHousing$rad)

## Create new discrete variables
for (col in c("crim", "zn", "indus", "b")) {
  BostonHousing[[paste0(col, "_d")]] <- as.factor(ggplot2::cut_interval(BostonHousing[[col]], 2))
}

## Plot bar chart for all discrete variables
plot_bar(BostonHousing)

Bar
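For readers unfamiliar with ggplot2::cut_interval, it splits the range of a numeric vector into a given number of equal-width intervals and returns a factor. A small sketch on a made-up vector `x` (not taken from the dataset):

```r
# Made-up values spanning the range 0 to 10
x <- c(0, 1, 2, 9, 10)

# Split the range of x into 2 intervals of equal width
bins <- ggplot2::cut_interval(x, 2)

# Count how many values fall into each interval
print(table(bins))
```

With two bins, each new `_d` variable simply records whether an observation sits in the lower or the upper half of that variable's range.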

At this point, we have a much better understanding of the data distribution. Now assume we are interested in medv (median value of owner-occupied homes in USD 1000's), and would like to build a model to predict it. Let's plot it against all other variables:

plot_boxplot(BostonHousing, by = "medv")    

Boxplot

plot_scatterplot(
  subset(BostonHousing, select = -c(crim, zn, indus, b)),
  by = "medv", size = 0.5
)

Scatterplot_1
Scatterplot_2

plot_correlation(BostonHousing)

Correlation
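If you want the numbers behind a correlation heatmap, base R's cor() gives a correlation matrix directly. The sketch below uses a made-up data frame and only covers numeric columns; it is an illustration of the idea, not a description of what plot_correlation does internally.

```r
# Toy data (made up): y rises with x, z falls with x
df <- data.frame(x = 1:10)
df$y <- 2 * df$x + 1
df$z <- rev(df$x)
df$label <- factor(rep(c("a", "b"), 5))

# Keep numeric columns only, then compute pairwise Pearson correlations
num_cols <- vapply(df, is.numeric, logical(1))
cor_mat <- cor(df[, num_cols])
print(round(cor_mat, 2))
```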

And this is how you slice & dice your data, and analyze correlation with merely 3 lines of code.

Feature Engineering

Feature engineering is a crucial step in building better models. DataExplorer provides a couple of functions to ease the process. All of them require a data.table as the input object, because it is lightning fast. However, if you don't feel like coding in data.table syntax, you may adopt the following process:

## Set your data to `data.table` first
your_data <- data.table(your_data)

## Apply DataExplorer functions
group_category(your_data, ...)
drop_columns(your_data, ...)
set_missing(your_data, ...)

## Set data back to the original object
class(your_data) <- "original_object_name"
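As an alternative to resetting the class by hand, data.table ships setDT() and setDF(), which convert in place between data.frame and data.table. A minimal sketch of the round trip, using a made-up data frame `df`:

```r
library(data.table)

# Toy data frame (made up for illustration)
df <- data.frame(id = 1:3, value = c(10, 20, 30))

# Convert to data.table in place (no copy is made)
setDT(df)
print(is.data.table(df))

# Any by-reference operation can go here; this one adds a column
df[, value_doubled := value * 2]

# Convert back to a plain data.frame in place
setDF(df)
print(class(df))
```

Note that setDT() changes the object you pass in, whereas data.table(your_data) creates a new object and leaves the original untouched.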

Let's return to the BostonHousing dataset. For the rest of this section, we'll assume the data has been converted to a data.table already.

library(data.table)
BostonHousingDT <- data.table(BostonHousing)

Remember those transformed continuous variables? Let's drop them:

drop_columns(BostonHousingDT, c("crim", "zn", "indus", "b"))

Note: Because data.table updates by reference, the original object is updated without the need to re-assign a returned object.
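The following sketch illustrates update by reference with plain data.table. The helper `drop_by_reference` is a hypothetical stand-in written for this example, not DataExplorer's drop_columns; it shows why the caller's object changes, and how copy() keeps an untouched version around.

```r
library(data.table)

# Toy table (made up for illustration)
dt <- data.table(a = 1:3, b = 4:6, c = 7:9)

# Hypothetical helper: removes columns from the table it is given,
# modifying that table in place rather than returning a new one
drop_by_reference <- function(x, cols) {
  x[, (cols) := NULL]
  invisible(x)
}

# Take an explicit copy first if the original must be preserved
backup <- copy(dt)

# No re-assignment needed: dt itself loses column "c"
drop_by_reference(dt, "c")

print(names(dt))
print(names(backup))
```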

Let's take a look at the discrete variable rad:

plot_bar(BostonHousingDT$rad)

Rad_bar

I think categories other than 4, 5 and 24 are too sparse, and might skew my model fit. How could I group all the sparse categories together?

group_category(BostonHousingDT, "rad", 0.25, update = FALSE)

#    rad cnt       pct   cum_pct
# 1:  24 132 0.2608696 0.2608696
# 2:   5 115 0.2272727 0.4881423
# 3:   4 110 0.2173913 0.7055336

Looks like grouping the bottom 25% of rad would give me what I need. Let's do so:

group_category(BostonHousingDT, "rad", 0.25, update = TRUE)
plot_bar(BostonHousingDT$rad)

Grouped_rad_bar
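The grouping logic can be sketched with plain data.table. This is an approximation inferred from the cnt, pct and cum_pct output shown earlier, on made-up data; the column name `grp` and the label "OTHER" are assumptions for this example, and the package's actual implementation may differ.

```r
library(data.table)

# Toy data (made up): "a" is frequent, "c" and "d" are sparse
dt <- data.table(grp = c(rep("a", 5), rep("b", 3), "c", "d"))
threshold <- 0.25

# Frequency, share and cumulative share per category, largest first
stats <- dt[, .(cnt = .N), by = grp][order(-cnt)]
stats[, pct := cnt / sum(cnt)]
stats[, cum_pct := cumsum(pct)]
print(stats)

# Keep categories whose cumulative share stays within (1 - threshold)
keep <- stats[cum_pct <= 1 - threshold, grp]

# Collapse everything else into a single label, updating dt by reference
dt[!(grp %in% keep), grp := "OTHER"]
print(dt[, .N, by = grp])
```

Here only "a" stays within the 75% cumulative share, so "b", "c" and "d" end up under one label.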

In addition to categorical frequency, you may also play with the measure argument to group by the sum of a different variable. See ?group_category for more example use cases.

Data Report

To generate a report of your data:

create_report(BostonHousing)

Currently, there is not much to do with this, but it is my plan to support customization of the generated report, so stay tuned for more features!


I hope you enjoyed exploring the Boston housing data with me, and finally here are some additional resources about the DataExplorer package:
