P is for percent

Posted on April 18, 2020 by Unknown in R bloggers | 0 Comments

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].

We’ve used ggplots throughout this blog series, but today I want to introduce another package that helps you customize the scales on your ggplots: the scales package. I use this package most often to format scales as percentages. There aren’t many natural uses for percentages in my dataset, but one example is the percentage each book contributes to the total pages I read in 2019.

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

reads2019 <- reads2019 %>%
  mutate(perpage = Pages/sum(Pages))
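
As a quick sanity check on that mutate() step, here is a minimal sketch on a toy tibble. The titles and page counts are made up for illustration and are not taken from reads2019. Because each value is a share of the total, the new column should sum to 1.

```r
library(dplyr)
library(tibble)

# Hypothetical page counts, made up for illustration (not from reads2019)
toy <- tibble(Title = c("A", "B", "C", "D"),
              Pages = c(100, 200, 300, 400))

# Same calculation as above: each book's share of all pages
toy <- toy %>%
  mutate(perpage = Pages / sum(Pages))

toy$perpage       # shares of 0.1, 0.2, 0.3 and 0.4
sum(toy$perpage)  # the shares add up to 1
```

If a Pages column contained missing values, sum(Pages) would return NA and every share would be NA too; sum(Pages, na.rm = TRUE) is one way around that, assuming dropping the missing books is acceptable.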

The new variable, perpage, is a proportion. But if I display those data in a figure, I want them shown as percentages instead. Here’s how to do that. (If you don’t already have the scales package, run install.packages("scales") before this code.)

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

reads2019 %>%
  ggplot(aes(perpage)) +
  geom_histogram() +
  scale_x_continuous(labels = percent, breaks = seq(0,.05,.005)) +
  xlab("Percentage of Total Pages Read") +
  ylab("Books")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You need to make sure you load the scales package before you use the labels = percent argument, or you’ll get an error message. Alternatively, you can tell R to use the scales package just for this argument by adding scales:: before percent. This trick becomes useful when you have lots of packages loaded that share function names, because R uses the version from the most recently loaded package and masks the same-named functions from packages loaded earlier.
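
Here is a minimal sketch of the scales:: approach, using a small made-up data frame rather than reads2019. It assumes the scales package is installed but never attaches it with library().

```r
library(ggplot2)  # scales is deliberately NOT attached here

# Hypothetical proportions, made up for illustration
toy <- data.frame(share = c(0.01, 0.02, 0.02, 0.03, 0.05))

# The scales:: prefix tells R exactly where to find percent
p <- ggplot(toy, aes(share)) +
  geom_histogram(bins = 5) +
  scale_x_continuous(labels = scales::percent)

# percent is an ordinary function, so it can also be called directly
scales::percent(0.25, accuracy = 1)
## [1] "25%"
```

The same prefix works for any function that gets masked, such as stats::filter() after dplyr is loaded.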

This post also seems like a great opportunity to hop on my statistical high horse and talk about the difference between a histogram and a bar chart. Why is this important? With everything going on in the world – pandemics, political elections, etc. – I’ve seen lots of comments on others’ intelligence, many of which show a misunderstanding of the most well-known histogram: the standard normal curve. You see, raw data, even from a huge number of people and even on a standardized test, like a cognitive ability (aka: IQ) test, is never as clean or pretty as it appears in a histogram.

Histograms use a process called “binning”, where ranges of scores are combined to form one of the bars. The bins can be made bigger (including a larger range of scores) or smaller, and smaller bins will start showing the jagged nature of most data, even so-called normally distributed data.
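
To make binning concrete, here is a rough sketch that counts the same simulated values with wide and narrow bins. It uses base R's hist() with plot = FALSE to get at the counts, rather than the geom_histogram() used elsewhere in this post, and the break points are arbitrary choices for illustration.

```r
set.seed(42)
x <- rnorm(1000)  # simulated scores, not real test data

# Same data, two bin widths; plot = FALSE returns the counts without drawing
wide   <- hist(x, breaks = seq(-6, 6, by = 1),   plot = FALSE)
narrow <- hist(x, breaks = seq(-6, 6, by = 0.1), plot = FALSE)

length(wide$counts)    # 12 bins
length(narrow$counts)  # 120 bins

# Every observation lands in exactly one bin either way
sum(wide$counts)
sum(narrow$counts)

# Neighboring narrow bins can differ a lot, which is the jaggedness
head(narrow$counts[narrow$counts > 0], 20)
```

The wide bins pool many scores into each bar, which smooths the shape; the narrow bins hold only a handful of scores each, so chance variation from one bin to the next becomes visible.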

As one example, let’s show what my percent figure would look like as a bar chart instead of a histogram (like the one above).

reads2019 %>%
  ggplot(aes(perpage)) +
  geom_bar() +
  scale_x_continuous(labels = percent, breaks = seq(0,.05,.005)) +
  xlab("Percentage of Total Pages Read") +
  ylab("Books")

set.seed(42)

test <- tibble(ID = c(1:10000),
               value = rnorm(10000))

test %>%
  ggplot(aes(value)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

test %$%
  n_distinct(value)

## [1] 10000

test %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 10000)

CogAbil <- tibble(Person = c(1:10000),
                  Ability = rnorm(10000, mean = 100, sd = 15))

CogAbil <- CogAbil %>%
  mutate(Ability = round(Ability, digits = 0))

CogAbil %$%
  n_distinct(Ability)

## [1] 103

CogAbil %>%
  ggplot(aes(Ability)) +
  geom_histogram() +
  labs(title = "With 30 bins") +
  theme(plot.title = element_text(hjust = 0.5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.