DataExplorer: Fast Data Exploration With Minimum Code

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Boxuan Cui, Data Scientist at Smarter Travel

Once upon a time, there was a joke:

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. Among the many resources that tackle this problem, DataExplorer has one sole mission: to minimize the 80% and make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible about input data classes, so you should be able to pass in any data.frame-like object. However, certain functions require a data.table object as input because they rely on the update-by-reference feature, which I will cover later in the post.

Enough said. Let's look at some code, shall we?


Take the BostonHousing dataset from the mlbench library:

library(mlbench)
data("BostonHousing", package = "mlbench")

Initial Visualization

Without knowing anything about the data, my first 3 tasks are almost always:

library(DataExplorer)
plot_missing(BostonHousing) ## Are there missing values, and what is the missing data profile?
plot_bar(BostonHousing) ## What does the categorical frequency for each discrete variable look like?
plot_histogram(BostonHousing) ## What is the distribution of each continuous variable?

While there are not many interesting insights from plot_missing and plot_bar, below is the output from plot_histogram.

[Figure: plot_histogram output]

Upon scrutiny, the variable rad looks discrete, and I want to group crim, zn, indus and b into bins as well. Let's do so:

## Set `rad` to factor
BostonHousing$rad <- as.factor(BostonHousing$rad)

## Create new discrete variables
for (col in c("crim", "zn", "indus", "b")) 
  BostonHousing[[paste0(col, "_d")]] <- as.factor(ggplot2::cut_interval(BostonHousing[[col]], 2))

## Plot bar chart for all discrete variables
plot_bar(BostonHousing)

[Figure: plot_bar output]
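
For readers unfamiliar with ggplot2::cut_interval, which builds the *_d columns above, here is a minimal sketch on a made-up vector (the values in x are purely illustrative and not taken from BostonHousing). It contrasts cut_interval, which splits the range into bins of equal width, with cut_number, which aims for bins holding roughly equal numbers of observations.

```r
library(ggplot2)

# Hypothetical values, chosen only to make the difference visible
x <- c(1, 2, 3, 4, 10)

# Two bins of equal width over the range of x
equal_width <- cut_interval(x, 2)
table(equal_width)

# Two bins with (approximately) equal numbers of observations
equal_count <- cut_number(x, 2)
table(equal_count)
```

With skewed variables, equal-width bins can leave one bin nearly empty, so it is worth checking the resulting bar chart before relying on the new factor.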

At this point, we have a much better understanding of the data distribution. Now assume we are interested in medv (median value of owner-occupied homes in USD 1000's), and would like to build a model to predict it. Let's plot it against all other variables:

plot_boxplot(BostonHousing, by = "medv")    

[Figure: plot_boxplot output]

plot_scatterplot(
  subset(BostonHousing, select = -c(crim, zn, indus, b)), 
  by = "medv", size = 0.5)

[Figures: plot_scatterplot output, parts 1 and 2]

plot_correlation(BostonHousing)

[Figure: plot_correlation output]

And this is how you slice & dice your data, and analyze correlation with merely 3 lines of code.
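
To give a feel for what a correlation heatmap over mixed data involves, here is a rough base R sketch. It assumes, as far as I understand the package, that discrete features are dummy-coded before correlations are computed; it is not DataExplorer's actual implementation. It uses the built-in iris data so that it runs without extra packages.

```r
# Dummy-code the factor column; "- 1" drops the intercept so that
# every level of Species gets its own 0/1 column
X <- model.matrix(~ . - 1, data = iris)

# Pairwise Pearson correlations between all numeric and dummy columns
cor_mat <- cor(X)

# Correlation of every column with Sepal.Length
round(cor_mat[, "Sepal.Length"], 2)
```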

Feature Engineering

Feature engineering is a crucial step in building better models. DataExplorer provides a couple of functions to ease the process. All of them require a data.table as the input object, because it is lightning fast. However, if you don't feel like coding in data.table syntax, you may adopt the following process:

## Set your data to `data.table` first
library(data.table)
your_data <- data.table(your_data)

## Apply DataExplorer functions
group_category(your_data, ...)
drop_columns(your_data, ...)
set_missing(your_data, ...)

## Set data back to the original class
## (placeholder: substitute the actual class name, e.g. "data.frame")
class(your_data) <- "original_class_name"
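
If the original object is a plain data.frame, the same round trip can also be sketched with data.table's own setDT() and setDF() helpers, which convert in place instead of reassigning the class. The object df and its columns are made up for illustration.

```r
library(data.table)

# Hypothetical input: a plain data.frame
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

setDT(df)          # convert to data.table by reference (no copy)
df[, z := x * 2]   # any by-reference work goes here
setDF(df)          # convert back to a plain data.frame by reference

class(df)
```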

Let's return to the BostonHousing dataset. For the rest of this section, we'll assume the data has been converted to a data.table already.

library(data.table)
BostonHousingDT <- data.table(BostonHousing)

Remember those transformed continuous variables? Let's drop them:

drop_columns(BostonHousingDT, c("crim", "zn", "indus", "b"))

Note: Because data.table updates by reference, the original object is updated without the need to re-assign a returned object.
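
A small sketch of what update by reference means in practice, using a toy table (dt, alias and backup are hypothetical names). Plain assignment does not copy a data.table, so changes made through one name show up under the other; use copy() when you need to keep the original intact.

```r
library(data.table)

dt <- data.table(id = 1:3, tmp = c("x", "y", "z"))

alias <- dt                  # plain assignment does NOT copy a data.table
alias[, tmp := NULL]         # delete a column by reference
names(dt)                    # `dt` has lost `tmp` as well

backup <- copy(dt)           # copy() creates an independent table
backup[, extra := id * 10L]  # modify the copy only
"extra" %in% names(dt)       # FALSE: `dt` is unaffected
```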

Let's take a look at the discrete variable rad:

plot_bar(BostonHousingDT$rad)

[Figure: plot_bar output for rad]

I think categories other than 4, 5 and 24 are too sparse, and might skew my model fit. How could I group all the sparse categories together?

group_category(BostonHousingDT, "rad", 0.25, update = FALSE)

#    rad cnt       pct   cum_pct
# 1:  24 132 0.2608696 0.2608696
# 2:   5 115 0.2272727 0.4881423
# 3:   4 110 0.2173913 0.7055336

Looks like grouping the bottom 25% of rad would give me what I need. Let's do so:

group_category(BostonHousingDT, "rad", 0.25, update = TRUE)
plot_bar(BostonHousingDT$rad)

[Figure: plot_bar output for rad after grouping]
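
To make the threshold logic more concrete, here is a hand-rolled data.table sketch on made-up data. The rule used (keep the most frequent categories while their cumulative share stays at or below 1 - threshold, and relabel the rest) and the "OTHER" label are assumptions inferred from the output shown earlier, not a copy of the package's code.

```r
library(data.table)

# Made-up categorical vector: 40 "a", 30 "b", 20 "c", 10 "d"
category <- rep(c("a", "b", "c", "d"), times = c(40, 30, 20, 10))

# Frequency table sorted from most to least common
freq <- data.table(category)[, .(cnt = .N), by = category][order(-cnt)]
freq[, pct := cnt / sum(cnt)]
freq[, cum_pct := cumsum(pct)]

# Categories whose cumulative share stays within the top 75% are kept
threshold <- 0.25
keep <- freq[cum_pct <= 1 - threshold, category]

# Everything else is collapsed into a single label
grouped <- ifelse(category %in% keep, category, "OTHER")
table(grouped)
```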

In addition to categorical frequency, you may also play with the measure argument to group by the sum of a different variable. See ?group_category for more example use cases.

Data Report

To generate a report of your data:

create_report(BostonHousing)

Currently, there is not much to do with this, but it is my plan to support customization of the generated report, so stay tuned for more features!


I hope you enjoyed exploring the Boston housing data with me, and finally here are some additional resources about the DataExplorer package:
