X is for scale_x

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

These next two posts will deal with formatting scales in ggplot2 – x-axis, y-axis – so I’ll try to limit the amount of overlap and repetition.

Let’s say I wanted to plot my reading over time, specifically as a cumulative sum of pages across the year. My x-axis will be a date. Since my reads2019 file initially formats my dates as character, I’ll need to use my mutate code to turn them into dates, plus compute my cumulative sum of pages read.

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## <U+2713> ggplot2 3.2.1     <U+2713> purrr   0.3.3
## <U+2713> tibble  2.1.3     <U+2713> dplyr   0.8.3
## <U+2713> tidyr   1.0.0     <U+2713> stringr 1.4.0
## <U+2713> readr   1.3.1     <U+2713> forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

reads2019 <- reads2019 %>%
  mutate(date_started = as.Date(reads2019$date_started, format = '%m/%d/%Y'),
         date_read = as.Date(date_read, format = '%m/%d/%Y'),
         PagesRead = order_by(date_read, cumsum(Pages)))
This gives me the variables I need to plot my pages read over time.
reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +

ggplot2 did a fine job of creating this plot using default settings. Since my date_read variable is a date, the plot automatically ordered date_read, formatted as “Month Year”, and used quarters as breaks. But we can still use the scale_x functions to make this plot look even better.

One way could be to format years as 2-digit instead of 4. We could also have month breaks instead of quarters.
reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point() +
  scale_x_date(date_labels = "%b %y",
               date_breaks = "1 month")

Of course, we could drop year completely and just show month, since all of this data is for 2019. We could then note that in the title instead.
reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point() +
  scale_x_date(date_labels = "%B",
               date_breaks = "1 month") +
  labs(title = "Cumulative Pages Read Over 2019") +
  theme(plot.title = element_text(hjust = 0.5))
Tomorrow, I’ll show some tricks for how we can format the y-axis of this plot. But let’s see what else we can do to the x-axis. Let’s create a bar graph with my genre data. I’ll use the genre names I created for my summarized data last week.
genres <- reads2019 %>%
  group_by(Fiction, Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n())

genres <- genres %>%
  bind_cols(Genre = c("Non-Fiction",
           "General Fiction",
           "Science Fiction",
           "Fantasy Sci-Fi",
           "Children's Fiction",
           "Children's Fantasy"))

genres %>%
  ggplot(aes(Genre, Books)) +

Unfortunately, my new genre names are a bit long, and overlap each other unless I make my plot really wide. There are a few ways I can deal with that. First, I could ask ggplot2 to abbreviate the names.
genres %>%
  ggplot(aes(Genre, Books)) +
  geom_col() +
  scale_x_discrete(labels = abbreviate)

These abbreviations were generated automatically by R, and I’m not a huge fan. A better way might be to add line breaks to any two-word genres. This Stack Overflow post gave me a function I can add to my scale_x_discrete to do just that.
genres %>%
  ggplot(aes(Genre, Books)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)})

MUCH better!

As you can see, the scale_x function you use depends on the type of data you’re working with. For dates, scale_x_date; for categories, scale_x_discrete. Tomorrow, we’ll show some ways to format continuous data, since that’s often what you see on the y-axis. See you then!

By the way, this is my 1000th post on my blog!

To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)