The Power to Normalize

Posted on August 18, 2023 by Matias Andina in R bloggers | 0 Comments

[This article was first published on Matias Andina, and kindly contributed to R-bloggers.]

I started participating in the TidyTuesday project to practice my visualization skills while working with datasets from sources I'm not used to. In addition, I enjoy checking what other people do with the same dataset. I find that others are way more creative than I am, and I borrow heavily!

The challenge for Week 33 of 2023 was to visualize the spam dataset.

When PCA fails

The spam dataset presents heavily skewed distributions for the variables that serve as predictors of spam email. Because the dataset has 6 numeric variables and a single binary outcome (spam or not), my initial idea was to perform a quick and dirty PCA.

Code

library(tidyverse, warn.conflicts = FALSE)
library(tidytuesdayR)
library(paletteer)
library(FactoMineR)
library(factoextra)
library(scales, warn.conflicts = FALSE)

# load the data
spam <- tt_load(2023, week=33)$spam

    Downloading file 1 of 1: `spam.csv`

Code

spam$yesno <- dplyr::if_else(spam$yesno == "y", "spam", "email")
pc <- prcomp(spam[, 1:6], center = TRUE, scale. = TRUE)
# make it a tibble for ggplot plotting
pc_data <- pc$x[, 1:2] %>% as_tibble()
pc_data$yesno <- spam$yesno

pc_ori_plot <- ggplot(pc_data, 
       aes(PC1, PC2, color = yesno)) +
  geom_point() +
  coord_equal() +
  scale_color_paletteer_d("awtools::a_palette") +
  ggthemes::theme_base()+
  theme(legend.position = "bottom",
        plot.background =  element_rect(color = NA),
        legend.background = element_rect(fill = "gray90"),
        legend.key = element_rect(fill = "gray90"),
        panel.background = element_rect(fill="#81AE5C")) +
  labs(color = element_blank())
pc_ori_plot

If you are so inclined, you can check that fviz_screeplot(pc) gives you a horrible scree plot, with very little variance explained by the first components. You can also use fviz_contrib(pc, choice = "var"), the current replacement for the deprecated fviz_pca_contrib(), to check that the contributions of the different variables look close to random.
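The numbers behind a scree plot can also be read directly off a prcomp object, without factoextra. The sketch below is a minimal base R illustration on simulated data (the `toy` data frame and the `variance_explained()` helper are assumptions for this example, not part of the original analysis):

```r
# Proportion of variance explained by each principal component.
# `pc` is any object returned by prcomp(); pc$sdev^2 are the component variances.
variance_explained <- function(pc) {
  v <- pc$sdev^2
  v / sum(v)
}

# Simulated, right-skewed toy data (NOT the spam dataset)
set.seed(1)
toy <- data.frame(a = rlnorm(200), b = rlnorm(200), c = rlnorm(200))
toy_pc <- prcomp(toy, center = TRUE, scale. = TRUE)

# Per-component and cumulative proportions; the values sum to 1
round(variance_explained(toy_pc), 3)
cumsum(variance_explained(toy_pc))
```

Calling `variance_explained(pc)` on the PCA from the post should return the same proportions that fviz_screeplot(pc) draws as bars.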

Skewed Data Distributions

The vanilla PCA does nothing to help us visualize a separation between spam and regular email. Why is that the case?

Upon closer inspection of the raw variables, which I should have done before diving into the PCA, we see that we are dealing with heavily skewed distributions.

Code

spam %>% 
  pivot_longer(-yesno) %>% 
  ggplot(aes(yesno, value, fill = yesno)) +
  geom_violin() +
  facet_wrap(~name, scales = "free", nrow=3) +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))+
  scale_fill_paletteer_d("awtools::a_palette") +
  ggthemes::theme_base() +
  theme(legend.position = "bottom",
        plot.background =  element_rect(color = NA),
        legend.background = element_rect(fill = "gray90"),
        legend.key = element_rect(fill = "gray90"),
        panel.background = element_rect(fill="#81AE5C")) +
  labs(fill = element_blank(), x = element_blank(), y = element_blank())

The distributions are so skewed we can barely see them.
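Skewness can also be put into a number rather than judged by eye. The following is a minimal base R sketch using the moment-based definition (third central moment divided by the cubed standard deviation). The `skewness()` helper and the simulated `toy` data are assumptions for illustration, not part of the original post:

```r
# Moment-based sample skewness: 0 for symmetric data,
# positive when the right tail is longer.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated toy data (NOT the spam dataset)
set.seed(1)
toy <- data.frame(
  symmetric = rnorm(1000),
  skewed    = rlnorm(1000, sdlog = 1.5)
)

# One skewness value per column
sapply(toy, skewness)
```

On the real data, something like `sapply(spam[, 1:6], skewness)` would give one value per predictor.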

Transform

Enter the Yeo–Johnson transformation, a type of Power Transform¹ that will come in handy to normalize the data.
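For reference, the transformation is usually written as a piecewise function of a value y and a single power parameter λ (the symbols below follow the common textbook notation and do not come from the original post):

```latex
\psi(\lambda, y) =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\ y \geq 0 \\[6pt]
\log(y + 1) & \text{if } \lambda = 0,\ y \geq 0 \\[6pt]
-\dfrac{(1 - y)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\ y < 0 \\[6pt]
-\log(1 - y) & \text{if } \lambda = 2,\ y < 0
\end{cases}
```

Unlike the Box–Cox transformation, this definition accepts zero and negative values, which is convenient for columns that contain zeros. With λ = 1 the transform is the identity.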

As a side note, I had a bit of trouble running this using the more conventional caret or recipes packages. You can read my StackOverflow question, and the nice answer about estimating parameters properly, on the original post. For this post, I will be using bestNormalize::yeojohnson to normalize all columns in the dataset.

Code

# quickly apply the transformation to every numeric column
# (across() replaces the superseded mutate_all())
df_transformed <- select(spam, where(is.numeric)) %>% 
  mutate(across(everything(),
                function(x) predict(bestNormalize::yeojohnson(x), newdata = x)))
# check the new distributions
df_transformed$yesno <- spam$yesno
df_transformed %>% 
  pivot_longer(-yesno) %>% 
  ggplot(aes(yesno, value, fill = yesno)) +
  geom_violin() +
  facet_wrap(~name, scales = "free", nrow=3) +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))+
  scale_fill_paletteer_d("awtools::a_palette") +
  ggthemes::theme_base() +
  theme(legend.position = "bottom",
        plot.background =  element_rect(color = NA),
        legend.background = element_rect(fill = "gray90"),
        legend.key = element_rect(fill = "gray90"),
        panel.background = element_rect(fill="#81AE5C")) +
  labs(fill = element_blank(), x = element_blank(), y = element_blank())

I am not a huge fan of data transformations, but that is a very nice transformation. We often deal with skewed data, which makes both visualization and analysis difficult. Having a tool like this power transform comes in really handy².
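To make "check the parameters" more concrete: the transform has one power parameter, lambda, which is typically chosen by maximum likelihood. Below is a minimal base R sketch of that idea on simulated data. It is not the bestNormalize implementation. The function names (`yeo_johnson()`, `yj_loglik()`, `estimate_lambda()`) and the search interval `c(-5, 5)` are assumptions made for this illustration.

```r
# Yeo-Johnson transform for a fixed lambda (vectorised over y, no NA handling).
yeo_johnson <- function(y, lambda, eps = 1e-8) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > eps) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])
  }
  if (abs(lambda - 2) > eps) {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-y[!pos])
  }
  out
}

# Profile log-likelihood of lambda, assuming the transformed data are normal.
yj_loglik <- function(lambda, y) {
  z <- yeo_johnson(y, lambda)
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) +
    (lambda - 1) * sum(sign(y) * log1p(abs(y)))
}

# Choose lambda by maximising the likelihood over an assumed search interval.
estimate_lambda <- function(y, interval = c(-5, 5)) {
  optimize(yj_loglik, interval = interval, y = y, maximum = TRUE)$maximum
}

set.seed(1)
x <- rlnorm(500)                     # right-skewed toy data, NOT the spam dataset
lambda_hat <- estimate_lambda(x)     # lambda below 1 pulls in a long right tail
x_t <- yeo_johnson(x, lambda_hat)    # transformed values (not standardized)
lambda_hat
```

In the post's workflow, the estimated value should be available as the `lambda` element of the object returned by bestNormalize::yeojohnson(x). That function also standardizes its output by default (`standardize = TRUE`), which this sketch does not do.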

Second PCA

We can now check what the second PCA looks like. It’s not a panacea, but we have made large improvements. Check the side-by-side comparison:

Code

pc <- prcomp(df_transformed[, 1:6])
pc_data <- pc$x[, 1:2] %>% as_tibble()
pc_data$yesno <- spam$yesno

pc_second_plot <- ggplot(pc_data, 
       aes(PC1, PC2, color = yesno)) +
  geom_point() +
  coord_equal() +
  scale_color_paletteer_d("awtools::a_palette") +
  ggthemes::theme_base()+
  theme(legend.position = "bottom",
        plot.background =  element_rect(color = NA),
        legend.background = element_rect(fill = "gray90"),
        legend.key = element_rect(fill = "gray90"),
        panel.background = element_rect(fill="#81AE5C")) +
  labs(color = element_blank())
library(patchwork)
pc_ori_plot + pc_second_plot

In terms of separating the two classes, the second PCA is not the best PCA in the world. We can still see that a large group of points from both classes is clustered together:

Code

p1 <- ggplot(pc_data, 
       aes(PC1, PC2, color = yesno)) +
  geom_point(color = 'gray50', alpha = 0.5)  + 
  labs(title = "All Data") + 
  coord_equal()+
  ggthemes::theme_few(base_family = "Ubuntu")
spam_color <- paletteer::paletteer_d("awtools::a_palette")[2]
email_color <- paletteer::paletteer_d("awtools::a_palette")[1]
p2 <- ggplot(pc_data, 
       aes(PC1, PC2, color = yesno)) +
  geom_point(color = 'gray50', alpha = 0.5)  + 
  geom_point(data=filter(pc_data, yesno=="spam"),
             color = spam_color, alpha = 0.5)  + 
  labs(title = "Spam") + 
  coord_equal()+
  ggthemes::theme_few(base_family = "Ubuntu")
p3 <- ggplot(pc_data, 
       aes(PC1, PC2, color = yesno)) +
  geom_point(color = 'gray50', alpha = 0.5)  + 
  geom_point(data=filter(pc_data, yesno=="email"),
             color = email_color, alpha = 0.5)  + 
  labs(title = "Emails") + 
  coord_equal() +
  ggthemes::theme_few(base_family = "Ubuntu")
p1 + p2 + p3

However, I encourage you to check fviz_screeplot(pc) to see how dramatically better this second PCA is.

Ending remarks

Regardless of the final separation that we could achieve in this particular dataset, the normalization performed with the Yeo–Johnson transform was quite powerful. You have been given the Power to Normalize. I hope you try it on your own skewed datasets!

Footnotes

1. Yes, the title of this post is 100% pun intended.
2. The devil is in the details. Always check the parameters and be careful with data interpretation when transforming your data!

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{andina2023,
  author = {Andina, Matias},
  title = {The {Power} to {Normalize}},
  date = {2023-08-19},
  url = {https://matiasandina.com/posts/2023-08-19-the-power-to-normalize},
  langid = {en}
}

For attribution, please cite this work as:

Andina, Matias. 2023. “The Power to Normalize.” August 19, 2023. https://matiasandina.com/posts/2023-08-19-the-power-to-normalize.
