
Business Ready Plots with R

[This article was first published on Exploring Data, and kindly contributed to R-bloggers.]

Quick Overview

Exploring-Data is a place where I share easily digestible content aimed at making the wrangling and exploration of data more efficient (+fun).

Business Science

Recently, I completed the Data Science for Business 101 course over at Business Science University. In the course, Matt Dancho teaches students the fundamentals of data science for business with the tidyverse.

The course is jam-packed with material, from basic data wrangling all the way to applied machine learning. I highly recommend it to anyone looking to advance their skills.

Click this link to access the course.

I’ve been tracking down data and applying the techniques from the course to help solidify the concepts. One of my favorite parts of Week 1 is turning a generic ggplot() into something that is Business-Ready.

In this post I’ll show you how to upgrade your plots in R so that they are Business-Ready.

The Final Plot

This is the plot that we will recreate in the post – it’s crisp, clean, and Business-Ready.

Let’s get started.

Load our Libraries

library(tidyverse)  # Work-Horse Package
library(tidyquant)  # Business Ready Plots 
library(scales)     # Scaling Data for Plots

Let’s Get Some Data

These are Census data that I got here: link to data.

The original data had 4M+ rows, so I’ve already filtered it down a bit.

# Import Data
edu_census_data_raw_tbl <- read_csv("../../static/01_data/edu_census_data.csv")

# Glimpse Data
edu_census_data_raw_tbl %>% glimpse()
## Rows: 228,737
## Columns: 5
## $ name     <chr> "United States", "United States", "United States", "United S…
## $ type     <chr> "nation", "nation", "nation", "nation", "nation", "nation", …
## $ year     <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ variable <chr> "percent_bachelors_degree_or_higher_rank", "percent_graduate…
## $ value    <dbl> 1.0, 10.8, 28.8, 11.1, 1.0, 86.7, 5.8, 18.0, 27.8, 28.1, 8.5…

Filter Data for Plotting

We want to compare educational attainment statistics for Los Angeles County against the overall national figures. First, let’s do a bit of filtering to get just the data needed for our plot.

# Setup variables: for filter + for using in plot later
year_f <- 2018
nation <- "United States"
county <- "Los Angeles County"

# Data Prep
edu_census_filtered_tbl <- edu_census_data_raw_tbl %>% 
    
    # Filter data to year and areas of interest 
    filter(year == year_f,
           name == nation | # OR
           str_detect(name, county))

# View data
edu_census_filtered_tbl
## # A tibble: 10 x 5
##    name                      type    year variable                         value
##    <chr>                     <chr>  <dbl> <chr>                            <dbl>
##  1 United States             nation  2018 percent_less_than_9th_grade        5.3
##  2 United States             nation  2018 percent_high_school_graduate_or…  87.7
##  3 United States             nation  2018 percent_bachelors_degree_or_hig…  31.5
##  4 United States             nation  2018 percent_associates_degree          8.4
##  5 United States             nation  2018 percent_graduate_or_professiona…  12.1
##  6 Los Angeles County, Cali… county  2018 percent_less_than_9th_grade       12.6
##  7 Los Angeles County, Cali… county  2018 percent_high_school_graduate_or…  78.7
##  8 Los Angeles County, Cali… county  2018 percent_bachelors_degree_or_hig…  31.8
##  9 Los Angeles County, Cali… county  2018 percent_associates_degree          7  
## 10 Los Angeles County, Cali… county  2018 percent_graduate_or_professiona…  11.1

The 10 x 5 table is exactly what we need to create our first plot.
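
If the filter logic is new to you, here is a minimal sketch on a hypothetical toy table (toy_tbl and its rows are made up for illustration and are not the census file). Conditions separated by a comma are combined with AND, the | operator means OR, and str_detect() matches on part of the name, which is why "Los Angeles County" also picks up "Los Angeles County, California".

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, only to illustrate the filter logic
toy_tbl <- tibble(
    name = c("United States",
             "Los Angeles County, California",
             "Orange County, California",
             "United States"),
    year = c(2018, 2018, 2018, 2017)
)

toy_filtered_tbl <- toy_tbl %>%
    # The comma acts as AND; | acts as OR
    filter(year == 2018,
           name == "United States" |
           str_detect(name, "Los Angeles County"))

# Keeps the 2018 rows for the nation and for Los Angeles County only
toy_filtered_tbl
```

The 2017 row and the Orange County row are dropped, so two rows remain.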

Making a visualization is a great way to pick up a few insights while getting to know your data.

Generic ggplot()

The awesomeness of ggplot() is that we can rapidly produce a plot with just a couple of lines of code. That means we can quickly get insights that will help determine the next steps in exploring these data further.

The stacked bar-chart below is a great starting place.

edu_census_filtered_tbl %>% 
    ggplot(aes(x = variable, y = value, fill = name)) +
    geom_col() 

We can immediately see that ‘some’ differences exist but it’s difficult to get a sense of the magnitude. It’s also difficult to make out the names of the variables on the x-axis.
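
As a quick aside, one way to make the magnitudes easier to compare before any polishing is to place the bars side by side instead of stacking them. This is only a sketch of an alternative, not the approach used in the rest of this post (which facets the plot instead). It uses a small hand-typed data frame (toy_tbl, a hypothetical name) holding four of the values from the table above.

```r
library(ggplot2)

# Four values copied from the filtered table above; toy_tbl is a made-up name
toy_tbl <- data.frame(
    variable = rep(c("associates_degree", "bachelors_degree_or_higher"), times = 2),
    name     = rep(c("United States", "Los Angeles County"), each = 2),
    value    = c(8.4, 31.5, 7.0, 31.8)
)

dodged_plot <- ggplot(toy_tbl, aes(x = variable, y = value, fill = name)) +
    # position = "dodge" puts the bars next to each other instead of stacking
    geom_col(position = "dodge") +
    # Flip the axes so the long variable names are readable
    coord_flip()

dodged_plot
```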

Making Business-Ready plots can be time-consuming. Thankfully, we have the tidyquant package to help expedite the process.

Business Ready Plots

To get those plots business-ready, it’s helpful (+best-practice) to break things up into two steps:

  1. Data Manipulation (Wrangling)
  2. Data Visualization

The data manipulation step will pay off immensely once we get to the data visualization step. This was a key learning from Matt in the 101 course, and it keeps your code nice and tidy too.

1) Data Manipulation

# Step 1 - Manipulate Data
data_manipulated_tbl <- edu_census_filtered_tbl %>% 
    
    # Selecting columns to focus on
    select(name, variable, value) %>% 
    
    # Tidy up variable names
    mutate(variable = str_replace(variable, "percent_", ""),
           variable = str_replace_all(variable, "_", " "),
           variable = str_to_title(variable)) %>% 
    
    # Convert value to a pct (ratio)
    mutate(pct = value / 100) %>% 
    
    # Format % Text 
    mutate(pct_text = scales::percent(pct, accuracy = 0.1)) %>% 
    
    # Select final columns for plotting
    select(name, variable, contains("pct"))

Now that we’ve wrangled + manipulated our data, let’s take a peek at it before diving into the generation of our visualization.

data_manipulated_tbl
## # A tibble: 10 x 4
##    name                           variable                          pct pct_text
##    <chr>                          <chr>                           <dbl> <chr>   
##  1 United States                  Less Than 9th Grade             0.053 5.3%    
##  2 United States                  High School Graduate Or Higher  0.877 87.7%   
##  3 United States                  Bachelors Degree Or Higher      0.315 31.5%   
##  4 United States                  Associates Degree               0.084 8.4%    
##  5 United States                  Graduate Or Professional Degree 0.121 12.1%   
##  6 Los Angeles County, California Less Than 9th Grade             0.126 12.6%   
##  7 Los Angeles County, California High School Graduate Or Higher  0.787 78.7%   
##  8 Los Angeles County, California Bachelors Degree Or Higher      0.318 31.8%   
##  9 Los Angeles County, California Associates Degree               0.07  7.0%    
## 10 Los Angeles County, California Graduate Or Professional Degree 0.111 11.1%

Creating the pct_text column will come in handy for adding clean labels to our plot – this will be a nice touch that will help the audience quickly make sense of the chart.
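
To see what the text-handling steps do in isolation, here is a small sketch that runs them on a single variable name and a few ratios. The intermediate names (raw_name, step_1, step_2, clean_name) are just for illustration. The inputs are taken from the tables above.

```r
library(stringr)
library(scales)

# One raw variable name from the census data
raw_name <- "percent_bachelors_degree_or_higher"

# Same three steps as in the mutate() call, one at a time
step_1     <- str_replace(raw_name, "percent_", "")  # drop the prefix
step_2     <- str_replace_all(step_1, "_", " ")      # underscores to spaces
clean_name <- str_to_title(step_2)                   # capitalize each word

clean_name
# "Bachelors Degree Or Higher"

# accuracy = 0.1 keeps one decimal place, so 0.07 becomes "7.0%"
pct_text <- scales::percent(c(0.053, 0.07, 0.877), accuracy = 0.1)

pct_text
# "5.3%" "7.0%" "87.7%"
```

Because pct_text is a character column, it is only meant for labels. The numeric pct column is still what gets mapped to the axis.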

2) Data Visualization

# Step 2 - Visualize Data
data_visualized_plot <- data_manipulated_tbl %>% 
    
    # Setup ggplot() canvas for plotting
    ggplot(aes(x = variable, y = pct, fill = name)) +
    
    # Geometries
    geom_col() +
    geom_label(aes(label = pct_text), fill = "white", hjust = "center") +
    
    # Facet: splits plot into multiple plots by a categorical feature
    facet_wrap(~ name) +
    
    # Flip coordinates for readable variable names
    coord_flip() +
    
    # Formatting
    theme_tq() +
    scale_fill_tq() +
    scale_y_continuous(labels = scales::percent, limits = c(0, 1.0)) +
    theme(legend.position = "none",
          plot.title = element_text(face = "bold")) +
    labs(title = str_glue("Comparison of Educational Attainment ({year_f})"),
         subtitle = str_glue("{county} vs. Overall National Statistics"),
         caption  = "Census Data",
         x = "", y = "") 

We now have the two steps completed and our code is nicely commented for readability (+reproducibility).
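
One detail worth calling out is how the title and subtitle are built. str_glue() fills anything inside curly braces with the value of that variable, so changing year_f or county at the top of the script updates the plot text as well. A minimal sketch, using the same variable values as earlier (plot_title and plot_subtitle are names made up for this example):

```r
library(stringr)

year_f <- 2018
county <- "Los Angeles County"

# Anything inside {} is replaced with the value of that variable
plot_title    <- str_glue("Comparison of Educational Attainment ({year_f})")
plot_subtitle <- str_glue("{county} vs. Overall National Statistics")

plot_title
# Comparison of Educational Attainment (2018)

plot_subtitle
# Los Angeles County vs. Overall National Statistics
```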

Display Plot

Let’s take a look at our awesome plot.

data_visualized_plot

Wrap Up

That’s it for today!

You learned how to turn a generic ggplot() into one that is Business-Ready.

Get the code here: Github Repo.


PS: Be Kind and Tidy your Data.
