As data scientists, we can find it downright impossible to drill into messy data. Fortunately, there’s a new R package that helps us focus on a “high-density region”, which is simply the smallest area in a scatter plot that contains a given percentage of the data points (for example, 50% or 90%). It’s called ggdensity.
High Density Regions on a Scatter Plot
In this R-tip, I’m going to show you how to home in on high-density regions in under 5 minutes:
- Learn how to make high-density scatter plots with ggdensity
- BONUS: Make faceted density plots to drill into over-plotted high-density region data
This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks. Pretty cool, right?
Here are the links to get set up. 👇
I have a companion video tutorial that shows even more secrets (plus mistakes to avoid).
What you make in this R-Tip
By the end of this tutorial, you’ll use high-density regions to draw insights from groups within your data. For example, here we can see how each vehicle class compares in terms of engine displacement (displ) and highway fuel economy (hwy), answering questions like:
- Is vehicle class a good way to describe vehicle clusters?
- Which vehicle classes have the greatest variation in highway fuel economy versus displacement?
- Which vehicle classes have the highest / lowest highway fuel economy?
Do you see how powerful ggdensity is?
Uncover insights with ggdensity
Thank You Developers.
Before we move on, please recognize that ggdensity was developed by James Otto, Doctoral Candidate at the Department of Statistical Science, Baylor University. Thank you for everything you do! Also, the full documentation for ggdensity can be accessed here.
Before we get started, get the R Cheat Sheet
ggdensity is great for extending ggplot2 with advanced features. But, you’ll need to learn ggplot2 to take full advantage. For these topics, I’ll use the Ultimate R Cheat Sheet to refer to ggplot2 code in my workflow.
Download the Ultimate R Cheat Sheet. Then click the “CS” hyperlink to “ggplot2”.
Now you’re ready to quickly reference the ggplot2 cheat sheet. This shows you the core plotting functions available in the ggplot2 package.
Onto the tutorial.
Let’s dive into using ggdensity so we can show you how to make high-density regions on your scatter plots.
Important: All of the data and code shown can be accessed through our Business Science R-Tips Project.
Plus I have a surprise at the end (for everyone)!
💡 Step 1: Load the Libraries and Data
First, run this code to load the R libraries:
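A minimal version, assuming ggplot2 and ggdensity are already installed from CRAN (the tutorial project may load the full tidyverse instead, which also works):

```r
# Assumes: install.packages(c("ggplot2", "ggdensity")) has been run once
library(ggplot2)    # plotting, plus the mpg data set
library(ggdensity)  # adds geom_hdr() for high-density regions
```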
Next, run this code to pull in the data.
We’ll read in the mpg data set that comes with ggplot2.
We want to understand how highway fuel economy relates to engine size (displacement) and to see if there are clusters by vehicle class.
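Here is a quick sketch for inspecting the data, assuming only that ggplot2 is installed. It looks at the three columns this tutorial relies on (displ, hwy, and class) and counts the vehicles in each class:

```r
library(ggplot2)  # mpg ships with ggplot2

# Peek at the three columns used in this tutorial
head(mpg[, c("displ", "hwy", "class")])

# How many vehicles fall into each class?
table(mpg$class)
```

Classes with only a handful of vehicles will produce less reliable density regions, so the counts are worth a glance before plotting.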
💡 Step 2: Make a basic ggplot
Next, make a basic ggplot using the following code. This creates a scatter plot with the colors that change by vehicle class. I won’t go into all of the mechanics, but you can download my R cheat sheet to learn more about ggplot and the grammar of graphics.
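A basic scatter plot along these lines can be sketched as below. The object name p_basic is my own choice for illustration, and the exact styling in the project code may differ:

```r
library(ggplot2)

# Scatter plot of highway fuel economy vs. engine displacement,
# with point color mapped to vehicle class
p_basic <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

p_basic
```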
Here’s what the plot looks like. Do you see how tough it is to pull out the clusters in there? The points overlap, which makes the group structure in the data very hard to see.
💡 Step 3: Add High Density Regions
Ok, now that we have a basic scatter plot, we can make a quick alteration by adding high density regions that capture 90% and 50% of the data. We use geom_hdr(probs = c(0.9, 0.5), alpha = 0.35) to accomplish the next plot.
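Put together, the plot can be sketched like this. The geom_hdr() call is the one from the text; mapping fill to class and drawing the points with shape = 21 are my assumptions about how the regions and points are colored, and p_hdr is an illustrative name:

```r
library(ggplot2)
library(ggdensity)

# fill = class gives each vehicle class its own set of regions
p_hdr <- ggplot(mpg, aes(x = displ, y = hwy, fill = class)) +
  # Regions estimated to contain 90% and 50% of each class's points,
  # drawn at a fixed transparency so overlaps stay readable
  geom_hdr(probs = c(0.9, 0.5), alpha = 0.35) +
  # shape = 21 is a filled circle, so the points pick up the fill color
  geom_point(shape = 21)

p_hdr
```

Because alpha is set to a constant here, both regions share the same transparency, and the 50% region looks darker only because it sits on top of the 90% region.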
Let’s see what we have here.
We can now see where the clusters have the highest density. But there’s still a problem called “overplotting”, which is when too many graphics get plotted on top of each other.
💡 BONUS: Overplotting solved!
Here’s the problem we’re facing: overplotting. We simply have too many groups that are too close together. Let’s see how to fix this.
The fix is pretty simple. Just use faceting from ggplot2.
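One way to do this is with facet_wrap(), which gives each vehicle class its own panel. This is a sketch under the same assumptions as the previous plot (fill mapped to class, points drawn with shape = 21); p_facets is an illustrative name:

```r
library(ggplot2)
library(ggdensity)

p_facets <- ggplot(mpg, aes(x = displ, y = hwy, fill = class)) +
  geom_hdr(probs = c(0.9, 0.5), alpha = 0.35) +
  geom_point(shape = 21) +
  # One panel per vehicle class; axes are shared by default,
  # so the panels stay directly comparable
  facet_wrap(~ class)

p_facets
```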
And, voila! We can easily inspect the clusters by vehicle class.
You learned how to use the ggdensity package to create high-density regions that help us understand the clusters within our data. Great work! But, there’s a lot more to becoming a Business Scientist.
If you’d like to become a Business Scientist (and have an awesome career, improve your quality of life, enjoy your job, and all the fun that comes along), then I can help with that.
Step 1: Watch my Free 40-Minute Webinar
Learning data science on your own is hard. I know because IT TOOK ME 5 YEARS to feel confident.
AND, I don’t want it to take that long for you.
So, I put together a FREE 40-minute webinar (a masterclass) that provides a roadmap for what worked for me.
Literally 5 years of learning, consolidated into 40 minutes. It’s jam-packed with value. I wish I had seen this when I was starting… It would have made a huge difference.
Step 2: Take action
For my action-takers, if you are ready to become a Business Scientist, then read on.
If you need to take your skills to the next level and DON’T want to wait 5 years to learn data science for business, AND you want a career you love that earns you a $100,000+ salary (plus bonuses), AND you’d like someone to help you do this in UNDER 6 MONTHS….
Then I can help with that too.
There’s a link in the FREE 40-minute webinar for a special price (because you are special!) and taking that action will kickstart your journey with me in your corner.
Get ready. The ride is wild. And the destination is AMAZING!
👇 Top R-Tips Tutorials you might like:
- mmtable2: ggplot2 for tables
- ggdist: Make a Raincloud Plot to Visualize Distribution in ggplot2
- ggside: Plot linear regression with marginal distributions
- DataEditR: Interactive Data Editing in R
- openxlsx: How to Automate Excel in R
- officer: How to Automate PowerPoint in R
- DataExplorer: Fast EDA in R
- esquisse: Interactive ggplot2 builder
- gghalves: Half-plots with ggplot2
- rmarkdown: How to Automate PDF Reporting
- patchwork: How to combine multiple ggplots
- Geospatial Map Visualizations in R
Want these tips every week? Join R-Tips Weekly.