Exploratory Data Analysis in R (introduction)

August 1, 2018
By

(This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers)

Exploratory Data Analysis in R (introduction)

Hi there!

tl;dr: Exploratory data analysis (EDA) the very first step in a data project. We will create a code-template to achieve this with one function.

Introduction

EDA consists of univariate (1-variable) and bivariate (2-variables) analysis.
In this post we will review some functions that lead us to the analysis of the first case.

  • Step 1 – First approach to data
  • Step 2 – Analyzing categorical variables
  • Step 3 – Analyzing numerical variables
  • Step 4 – Analyzing numerical and categorical at the same time

Covering some key points in a basic EDA:

  • Data types
  • Outliers
  • Missing values
  • Distributions (numerically and graphically) for both, numerical and categorical variables.

Type of analysis results

They can be two: informative or operative.

Informative – For example plots, or any long variable summary. We cannot filter data from it, but give us a lot of information at once. Most used on the EDA stage.

Operative – The results can be used to take an action directly on the data workflow (for example, selecting any variables whose percentage of missing values are below 20%). Most used in the Data Preparation stage.

Setting-up

Uncoment in case you don’t have any of these libraries:

# install.packages("tidyverse")
# install.packages("funModeling")
# install.packages("Hmisc")

A newer version of funModeling has been released on Ago-1, please update 😉

Now load the needed libraries…

library(funModeling) 
library(tidyverse) 
library(Hmisc)

tl;dr (code)

Run all the functions in this post in one-shot with the following function:

basic_eda <- function(data)
{
  glimpse(data)
  df_status(data)
  freq(data) 
  profiling_num(data)
  plot_num(data)
  describe(data)
}

Replace data with your data, and that’s it!:

basic_eda(my_amazing_data)


Creating the data for this example

Using the heart_disease data (from funModeling package). We will take only 4 variables for legibility.

data=heart_disease %>% select(age, max_heart_rate, thal, has_heart_disease)

Step 1 – First approach to data

Number of observations (rows) and variables, and a head of the first cases.

glimpse(data)
## Observations: 303
## Variables: 4
## $ age                63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, ...
## $ max_heart_rate     150, 108, 129, 187, 172, 178, 160, 163, 147,...
## $ thal               6, 3, 7, 3, 3, 3, 3, 3, 7, 7, 6, 3, 6, 7, 7,...
## $ has_heart_disease  no, yes, yes, no, no, no, yes, no, yes, yes,...

Getting the metrics about data types, zeros, infinite numbers, and missing values:

df_status(data)
##            variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1               age       0       0    0 0.00     0     0 integer     41
## 2    max_heart_rate       0       0    0 0.00     0     0 integer     91
## 3              thal       0       0    2 0.66     0     0  factor      3
## 4 has_heart_disease       0       0    0 0.00     0     0  factor      2

df_status returns a table, so it is easy to keep with variables that match certain conditions like:
+ Having at least 80% of non-NA values (p_na < 20)
+ Having less than 50 unique values (unique <= 50)

💡 TIPS:

  • Are all the variables in the correct data type?
  • Variables with lots of zeros or NAs?
  • Any high cardinality variable?

[🔎 Read more here.]

Step 2 – Analyzing categorical variables

freq function runs for all factor or character variables automatically:

freq(data)

Exploratory Data Analysis in R (introduction)

##   thal frequency percentage cumulative_perc
## 1    3       166      54.79              55
## 2    7       117      38.61              93
## 3    6        18       5.94              99
## 4          2       0.66             100

Exploratory Data Analysis in R (introduction)

##   has_heart_disease frequency percentage cumulative_perc
## 1                no       164         54              54
## 2               yes       139         46             100
## [1] "Variables processed: thal, has_heart_disease"

💡 TIPS:

  • If freq receives one variable –freq(data$variable)– it retruns a table. Useful to treat high cardinality variables (like zip code).
  • Export the plots to jpeg into current directory: freq(data, path_out = ".")
  • Does all the categories make sense?
  • Lots of missing values?
  • Always check absolute and relative values

[🔎 Read more here.]

Step 3 – Analyzing numerical variables

We will see: plot_num and profiling_num. Both run automatically for all numerical/integer variables:

Graphically

plot_num(data)

Exploratory Data Analysis in R (introduction)

Export the plot to jpeg: plot_num(data, path_out = ".")

💡 TIPS:

  • Try to identify high-unbalanced variables
  • Visually check any variable with outliers

[🔎 Read more here.]

Quantitatively

profiling_num runs for all numerical/integer variables automatically:

data_prof=profiling_num(data)
##         variable mean std_dev variation_coef p_01 p_05 p_25 p_50 p_75 p_95
## 1            age   54       9           0.17   35   40   48   56   61   68
## 2 max_heart_rate  150      23           0.15   95  108  134  153  166  182
##   p_99 skewness kurtosis iqr        range_98     range_80
## 1   71    -0.21      2.5  13        [35, 71]     [42, 66]
## 2  192    -0.53      2.9  32 [95.02, 191.96] [116, 176.6]

💡 TIPS:

  • Try to describe each variable based on its distribution (also useful for reporting)
  • Pay attention to variables with high standard deviation.
  • Select the metrics that you are most familiar with: data_prof %>% select(variable, variation_coef, range_98): A high value in variation_coef may indictate outliers. range_98 indicates where most of the values are.

[🔎 Read more here.]

Step 4 – Analyzing numerical and categorical at the same time

describe from Hmisc package.

library(Hmisc)
describe(data)
## data 
## 
##  4  Variables      303  Observations
## ---------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       41    0.999    54.44     10.3       40       42 
##      .25      .50      .75      .90      .95 
##       48       56       61       66       68 
## 
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## ---------------------------------------------------------------------------
## max_heart_rate 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       91        1    149.6    25.73    108.1    116.0 
##      .25      .50      .75      .90      .95 
##    133.5    153.0    166.0    176.6    181.9 
## 
## lowest :  71  88  90  95  96, highest: 190 192 194 195 202
## ---------------------------------------------------------------------------
## thal 
##        n  missing distinct 
##      301        2        3 
##                          
## Value         3    6    7
## Frequency   166   18  117
## Proportion 0.55 0.06 0.39
## ---------------------------------------------------------------------------
## has_heart_disease 
##        n  missing distinct 
##      303        0        2 
##                     
## Value        no  yes
## Frequency   164  139
## Proportion 0.54 0.46
## ---------------------------------------------------------------------------

Really useful to have a quick picture for all the variables. But is not as operative as freq and profiling_num when we want to use its results to change our data workflow.

💡 TIPS:

  • Check min and max values (outliers)
  • Check Distributions (same as before)

[🔎 Read more here.]


PS: Does anyone remember the function that creates a single-page with a data summary? Wanted to mention here…

That’s all by now! 🙂

PC.

Twitter

Linkedin


Other posts you might like:

To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)