L is for Log Transformation
When visualizing data, outliers and skewed data can have a huge impact, potentially making your visualization difficult to understand. We can use many of the tricks covered so far to deal with those issues, such as using filters to remove extreme values. But what if you want to display all values, even extreme ones? A log transformation is a great option for displaying skewed data.
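To see why a log scale helps with skew, here is a minimal base R sketch. The `days` vector is made up for illustration and is not taken from the reading dataset. It shows that values which double each time are unevenly spaced on the raw scale but evenly spaced after a base-2 log.

```r
# Hypothetical right-skewed values (not from the reading dataset)
days <- c(1, 2, 4, 8, 16, 32)

# On the raw scale, the gap between neighbors keeps growing
raw_gaps <- diff(days)
print(raw_gaps)   # 1 2 4 8 16

# On the log2 scale, every doubling is exactly one unit apart
log_days <- log2(days)
log_gaps <- diff(log_days)
print(log_days)   # 0 1 2 3 4 5
print(log_gaps)   # 1 1 1 1 1
```

The large values are pulled in toward the rest of the data, which is what keeps a few long reads from squashing everything else on a plot.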
One of the more skewed variables in my reading dataset is read_time. I was able to read many books in a pretty short amount of time (a few days), but others took longer, either because they were long or because I was busy with other things and didn't have as much time to read. Let's take a quick look.
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
##     set_names
## The following object is masked from 'package:tidyr':
##
##     extract

reads2019 %$% range(read_time)
## [1]  0 25

Read time ranges from 0 (finished in the same day) to almost a month. If I created box-plots of reading time, I'd likely have some outliers. I'll use my Fantasy genre to generate 2 box-plots. To make these data a bit easier to visualize, I'll also change my Fantasy flag into a labeled factor.
reads2019 <- reads2019 %>%
  mutate(Fantasy = factor(Fantasy,
                          labels = c("Non-Fantasy", "Fantasy"),
                          ordered = TRUE))

reads2019 %>%
  ggplot(aes(Fantasy, read_time)) +
  geom_boxplot()
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
##     discard
## The following object is masked from 'package:readr':
##
##     col_factor

reads2019 %>%
  ggplot(aes(Fantasy, read_time)) +
  geom_boxplot() +
  scale_y_continuous(trans = log2_trans()) +
  ylab("Read Time (in days)") +
  labs(caption = "Because reading time was skewed, data have been log-transformed.")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
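The two warnings most likely come from the books with a read_time of 0: the log of 0 is negative infinity, so those rows are dropped before the box-plots are drawn. If you want to keep them, one option (a sketch, not what the post itself does) is a log(1 + x) transformation, which maps 0 to 0. The `toy` data frame below is a hypothetical stand-in for reads2019, and note that "log1p" uses the natural log rather than base 2. Newer ggplot2 releases call the `trans` argument `transform`.

```r
library(ggplot2)
library(scales)

# Hypothetical stand-in for reads2019, including two same-day reads (0 days)
toy <- data.frame(
  Fantasy   = factor(c("Non-Fantasy", "Non-Fantasy", "Non-Fantasy",
                       "Fantasy", "Fantasy", "Fantasy")),
  read_time = c(0, 2, 25, 0, 3, 10)
)

# A plain log turns zeros into -Inf, which ggplot2 then removes
zero_logged <- log2(0)
print(zero_logged)   # -Inf

# log1p(x) is log(1 + x), so zeros stay finite
shifted <- log1p(toy$read_time)
print(all(is.finite(shifted)))   # TRUE

# Same box-plot as in the post, but with a log(1 + x) y-axis
p <- ggplot(toy, aes(Fantasy, read_time)) +
  geom_boxplot() +
  scale_y_continuous(trans = "log1p") +
  ylab("Read Time (in days)")
print(p)
```

The axis labels still show days on the original scale; only the spacing changes. The trade-off is that log(1 + x) compresses small values a little differently than a pure log, so it is worth saying which transformation was used in the caption.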