Predictive Power Score vs CorrelationFunnel

Posted on February 15, 2021 by Business Science in R bloggers | 0 Comments

[This article was first published on business-science.io, and kindly contributed to R-bloggers].

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks.

The links to get set up are in the Setup R-Tips Weekly Project section at the end of this post.

Battle of the EDA Packages

Exploratory Data Analysis (EDA) is what every data scientist does to uncover actionable insights in the data. This process used to take forever. Not anymore. We’re kicking the tires on 2 EDA packages in a Battle Royale to determine which one reigns supreme.

ppsr – An implementation of the Predictive Power Score.
correlationfunnel – My R package that leverages a “binning trick” for inference using correlation.

As you follow along, you can use my Ultimate R Cheatsheet. It consolidates the most important R packages (ones I use every day) into 1 cheatsheet. We’ll also be using the tidyverse and tidymodels / xgboost from the cheatsheet as the tie-breaker.

So let’s get started. You’re making these data visualizations today: a PPScore heatmap and a correlation funnel plot.

The Drawbacks of Correlation
Did you know that you miss important insights with correlation?

Correlation plots are must-know plots. Everyone understands correlations (even non-technical people). Correlations are great for explaining insights in simple terms:

Magnitude (between zero and one): Two features have a stronger relationship the closer the magnitude gets to one.
Sign (Positive/Negative): A positive sign indicates a positive relationship (both go up together). A negative sign indicates an inverse relationship (as one goes up, the other goes down).

But Correlation has serious drawbacks.

Doesn’t work well with Non-Linear Relationships
Doesn’t work on Categorical Data
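
Both drawbacks are easy to reproduce in base R. The snippet below is a minimal sketch on made-up data (the variables x, y and plan are illustrative and do not come from the churn dataset): y depends entirely on x, yet the Pearson correlation is essentially zero, and cor() refuses a categorical input.

```r
# Non-linear case: y is fully determined by x, but the relationship is a parabola
x <- seq(-1, 1, length.out = 201)
y <- x^2
pearson_r <- cor(x, y)
print(round(pearson_r, 3))   # essentially zero despite the perfect dependence

# Categorical case: cor() only accepts numeric (or logical) input
plan <- factor(rep(c("Monthly", "Yearly", "Two-Year"), length.out = length(x)))
categorical_fails <- inherits(try(cor(plan, y), silent = TRUE), "try-error")
print(categorical_fails)     # TRUE when cor() raised an error
```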

Enter the Predictive Power Score, a new technique for finding relationships.

The Contender: Predictive Power Score
An enhanced version of the traditional correlation that scores feature importance

Predictive Power Score fixes the issues. It works with:

Non-Linear Relationships
Categorical Data

Here’s the implementation in the ppsr package.
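
The original code is not reproduced on this page, so here is a minimal sketch of what a ppsr analysis might look like. It assumes the Telco customer churn data that ships with correlationfunnel (customer_churn_tbl) stands in for the data used in the tutorial; the object names churn_tbl, pps_tbl and pps_plot are illustrative.

```r
library(dplyr)
library(ppsr)

# Assumption: use the churn data bundled with the correlationfunnel package
data("customer_churn_tbl", package = "correlationfunnel")

churn_tbl <- customer_churn_tbl %>%
    select(-customerID) %>%                          # an ID column carries no signal
    mutate(across(where(is.character), as.factor))   # treat text columns as categories

# Score every predictor against the target and sort strongest first
pps_tbl <- score_predictors(df = churn_tbl, y = "Churn") %>%
    arrange(desc(pps))
print(head(pps_tbl[, c("x", "pps")]))

# Plot the predictive power score of each feature for Churn
pps_plot <- visualize_pps(df = churn_tbl, y = "Churn")
print(pps_plot)
```

Calling visualize_pps() without y scores every pair of columns instead, which takes noticeably longer on a dataset of this width.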

This outputs a nice “PPScore Heatmap”, which looks very similar to a Correlation Matrix Heatmap. We can see that Tenure and TotalCharges are the top features identified with relationship to Churn.

Awesome. In a few lines of code, we got some insights telling us to look closer at TotalCharges and Tenure.

But did the PPScore miss anything?

Let’s try another method: My Correlation Funnel R Package.

The Defender: Correlation Funnel
Uses the “binning trick” to fix the issues of traditional correlation

Correlation Funnel fixes the issues with traditional correlation using a clever binning strategy (i.e. the “binning trick”) that allows categorical data to be one-hot encoded and numeric features to be binned into binary groups.

First binarize() the data – Note this changes the columns in your data using the “binning trick”
Then correlate() – Make sure to use the binarized target variable name (Churn__Yes)
Then visualize with plot_correlation_funnel()
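
The three steps above can be sketched as follows. This is a hedged example rather than the exact code from the video: it assumes the customer_churn_tbl data bundled with correlationfunnel, and the way missing TotalCharges values are filled in is just one possible choice (binarize() does not accept missing values).

```r
library(dplyr)
library(correlationfunnel)

data("customer_churn_tbl", package = "correlationfunnel")

# Step 1: binarize. Numeric columns are binned, categorical columns are one-hot encoded
churn_binarized_tbl <- customer_churn_tbl %>%
    select(-customerID) %>%
    # Assumption: fill the few missing TotalCharges with MonthlyCharges
    mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
    binarize(n_bins = 5, thresh_infreq = 0.01)

# Step 2: correlate every binary column with the binarized target
churn_corr_tbl <- churn_binarized_tbl %>%
    correlate(target = Churn__Yes)

# Step 3: visualize the correlation funnel
funnel_plot <- plot_correlation_funnel(churn_corr_tbl)
print(funnel_plot)
```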

This creates an easy-to-interpret plot that shows not only the magnitude of each relationship, but also which parts of a feature correlate most with the segment of interest (e.g. Contract = Month-to-Month correlates with Churn = Yes).

We see new features that PPScore didn’t have:

Contract
Online Security
Tech Support

Which is right? We need a tie-breaker.

Tie-Breaker: XGBoost + VIP
Use feature importance from a model designed for non-linear relationships

To help us decide which to believe, we can use a 3rd method – XGBoost:

XGBoost works well with non-linear data (tree-based, very accurate)
We can get variable importance from XGBoost
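
For readers who want something to run before watching, here is a rough sketch of the idea using parsnip (part of tidymodels) and the vip package. It is an assumption about how such a tie-breaker could be set up, not the code from the video: the model uses default hyperparameters, no train/test split, and the churn data bundled with correlationfunnel.

```r
library(dplyr)
library(parsnip)
library(vip)

data("customer_churn_tbl", package = "correlationfunnel")

churn_tbl <- customer_churn_tbl %>%
    select(-customerID) %>%
    mutate(across(where(is.character), as.factor))   # Churn must be a factor

# Fit a default XGBoost classifier on all features
set.seed(123)
xgb_fit <- boost_tree(mode = "classification") %>%
    set_engine("xgboost") %>%
    fit(Churn ~ ., data = churn_tbl)

# Pull out the underlying booster and rank features by model-based importance
xgb_booster    <- extract_fit_engine(xgb_fit)
importance_tbl <- vi(xgb_booster)
print(head(importance_tbl, 10))

# Bar chart of the ten most important features
vip(xgb_booster, num_features = 10)
```

Because the formula interface dummy-encodes factors, categorical features show up as one row per level, so they need to be read as a group when comparing against the other two methods.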

Watch the YouTube Video for this one.

And the results are in. The top features for each method are:

Interpretation of Results

Each EDA method produced different results.
My interpretation – Experiment! Use multiple types of EDA techniques.
You learned 3 EDA Techniques in this tutorial.

SETUP R-TIPS WEEKLY PROJECT

Get the Code
Check out the R-Tips Setup Video

Once you take these actions, you’ll be set up to receive R-Tips with Code every week. =)
