L is for Log Transformation

Posted on April 14, 2020 by Unknown in R bloggers

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].

When visualizing data, outliers and skewed data can have a huge impact, potentially making your visualization difficult to understand. We can use many of the tricks covered so far to deal with those issues, such as using filters to remove extreme values. But what if you want to display all values, even extreme ones? A log transformation is a great option for displaying skewed data.
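To see why a log scale helps with skew, here is a minimal sketch using made-up numbers (the `days` vector is hypothetical, not taken from my reading data). On the raw scale the gaps between values keep growing; on the log2 scale every doubling takes up the same amount of space.

```r
# Hypothetical read times in days (made up for illustration)
days <- c(1, 2, 4, 8, 16, 32)

# On the raw scale, the gap between neighbours doubles each step
raw_gaps <- diff(days)
raw_gaps

# On the log2 scale, each doubling is exactly one unit,
# so the same values end up evenly spaced
log_gaps <- diff(log2(days))
log_gaps
```

That even spacing is what keeps a handful of large values from squashing everything else into the bottom of the plot.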

One of the more skewed variables in my reading dataset is read_time. I was able to read many books in a pretty short amount of time (a few days), but others took longer, either because they were long books or because I was busy with other things and didn’t have as much time to read. Let’s take a quick look.

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv", col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

reads2019 %$%
  range(read_time)

## [1]  0 25

Read time ranges from 0 (finished on the same day) to 25 days, almost a month. If I created box-plots of reading time, I'd likely have some outliers. I'll use my Fantasy genre flag to generate 2 box-plots, one for fantasy books and one for everything else. To make these data a bit easier to visualize, I'll also change my Fantasy flag into a labeled factor.

reads2019 <- reads2019 %>%
  mutate(Fantasy = factor(Fantasy, labels = c("Non-Fantasy",
                                              "Fantasy"),
                          ordered = TRUE))
reads2019 %>%
  ggplot(aes(Fantasy, read_time)) +
  geom_boxplot()
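By default, geom_boxplot() draws any value more than 1.5 times the interquartile range beyond the quartiles as a separate point. Here is a minimal sketch of that rule in base R; the `toy_read_time` values are made up for illustration and are not my actual data.

```r
# Hypothetical read times in days (made up for illustration)
toy_read_time <- c(0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 25)

# First and third quartiles, and the interquartile range between them
q <- unname(quantile(toy_read_time, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences sit 1.5 * IQR beyond each quartile
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Anything outside the fences would be drawn as its own point
outliers <- toy_read_time[toy_read_time < lower_fence |
                            toy_read_time > upper_fence]
outliers
```

In this toy example only the 25-day read falls outside the fences.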

Most of the books were finished within a couple of weeks, but one fantasy book I read took longer. I could drop that value for this figure, since it does appear to be an outlier. But if I'd prefer not to drop an outlier, or if I had multiple long reads mixed in, I could keep all values and use a log transformation to create this display. I can easily make that transformation for my figure with the scales package (run install.packages("scales") first if you don't already have that package installed).

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

reads2019 %>%
  ggplot(aes(Fantasy, read_time)) +
  geom_boxplot() +
  scale_y_continuous(trans = log2_trans()) +
  ylab("Read Time (in days)") +
  labs(caption = "Because reading time was skewed, data have been log-transformed.")

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
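
The two warnings come from the books with a read_time of 0: log2(0) is -Inf, so those two rows are dropped from the box-plots. If the goal is to keep every value on the plot, one option is a log(1 + x) scale, which scales provides as log1p_trans(). The sketch below uses a small made-up data frame (`toy_reads`, hypothetical values) so it runs on its own; swapping in reads2019 should work the same way, assuming the Fantasy factor from earlier.

```r
library(ggplot2)
library(scales)

# Made-up stand-in for reads2019 (hypothetical values, not my real data)
toy_reads <- data.frame(
  Fantasy = factor(rep(c("Non-Fantasy", "Fantasy"), each = 3),
                   levels = c("Non-Fantasy", "Fantasy")),
  read_time = c(0, 2, 6, 0, 4, 25)
)

# The source of the warnings: zero has no finite logarithm
log2(0)

# log1p_trans() plots log(1 + x), so 0 maps to 0 and stays on the plot
p <- ggplot(toy_reads, aes(Fantasy, read_time)) +
  geom_boxplot() +
  scale_y_continuous(trans = log1p_trans()) +
  ylab("Read Time (in days)") +
  labs(caption = "Because reading time was skewed, the y-axis uses a log(1 + x) scale.")
p
```

The trade-off is that the axis is no longer a pure log scale, so equal distances only approximate equal ratios near zero. pseudo_log_trans(), also in scales, is a similar option that accepts a base argument.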