Reading conditional formatting in XLSX files in a Tidy way with readxl

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Because of delays with my scholarship payment, if this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee or GitHub Sponsorships. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor. If you play the electric guitar, the same scholarship chaos led me to turn my guitar pedals and DIY kits hobby into a business, and you can check those here.

Continuing with the previous post, I found an additional data organization issue: Using conditional format as information.

Like all tools (R included), spreadsheets are good when used properly. Conditional formats provide visual guides but there are better ways to express categories such as “above the mean” (e.g. using “VLOOKUP” to clearly add categories to the tables).

Think about the gapminder dataset. If instead of this:

You get this:

Then without the “good” sheet you need extra information not present in the file to know what the colours in “lifeExp” mean. In this case, the colour is used to express if the country in question has a life expectancy above or below the continent’s mean. This could be guessed dedicating time to explore the data, which can be time consuming and lead to wrong conclusions.

The changes I made to the readxl package previously cannot read the colours dynamically created with conditional formatting, and I had to expand my Pull Request.

The new change alters the C++ code such that the extra extract_colors parameter in read_excel(), and now allows to do this:

file <- readxl_example("gapminder-2007.xlsx")

# using the new functionality
bad_data <- read_excel(file, sheet = "bad", extract_colors = TRUE)

> bad_data                                                                    
# A tibble: 142 × 7
   country                   year lifeExp    pop gdpPercap country_bg lifeExp_bg
   <chr>                    <dbl>   <dbl>  <dbl>     <dbl> <chr>      <chr>     
 1 Algeria                   2007    72.3 3.33e7     6223. #FF6600    #CCFFCC   
 2 Angola                    2007    42.7 1.24e7     4797. #FF6600    #FFCCCC   
 3 Benin                     2007    56.7 8.08e6     1441. #FF6600    #FFCCCC   
 4 Botswana                  2007    50.7 1.64e6    12570. #FF6600    #FFCCCC   
 5 Burkina Faso              2007    52.3 1.43e7     1217. #FF6600    #FFCCCC   
 6 Burundi                   2007    49.6 8.39e6      430. #FF6600    #FFCCCC   
 7 Cameroon                  2007    50.4 1.77e7     2042. #FF6600    #FFCCCC   
 8 Central African Republic  2007    44.7 4.37e6      706. #FF6600    #FFCCCC   
 9 Chad                      2007    50.7 1.02e7     1704. #FF6600    #FFCCCC   
10 Comoros                   2007    65.2 7.11e5      986. #FF6600    #CCFFCC   
# ℹ 132 more rows
# ℹ Use `print(n = ...)` to see more rows

In this way, we can extract some information from the conditional formatting, such as:

> unique(bad_data$lifeExp_bg)
[1] "#CCFFCC" "#FFCCCC"

and compare the unique values in the “good” and “bad” sheets:

> good_data <- read_excel(file, sheet = "good")
> unique(good_data$lifeExpOverContinentAvg)
[1] "Above Continent Average" "Below Continent Average"

This is an example of some trime consuming tasks as without the “good” sheet we would need to guess this with the “bad” sheet:

> library(dplyr)

bad_data %>%
  group_by(country_bg) %>%
  mutate(lifeExpAvg = mean(lifeExp, na.rm = T)) %>%
  mutate(lifeExpOverContinentAvg = case_when(
    lifeExp > lifeExpAvg ~ "Above Continent Average",
    TRUE ~ "Below Continent Average"
  ))

# A tibble: 142 × 9
# Groups:   country_bg [5]
   country        year lifeExp    pop gdpPercap country_bg lifeExp_bg lifeExpAvg
   <chr>         <dbl>   <dbl>  <dbl>     <dbl> <chr>      <chr>           <dbl>
 1 Algeria        2007    72.3 3.33e7     6223. #FF6600    #CCFFCC          54.8
 2 Angola         2007    42.7 1.24e7     4797. #FF6600    #FFCCCC          54.8
 3 Benin          2007    56.7 8.08e6     1441. #FF6600    #FFCCCC          54.8
 4 Botswana       2007    50.7 1.64e6    12570. #FF6600    #FFCCCC          54.8
 5 Burkina Faso   2007    52.3 1.43e7     1217. #FF6600    #FFCCCC          54.8
 6 Burundi        2007    49.6 8.39e6      430. #FF6600    #FFCCCC          54.8
 7 Cameroon       2007    50.4 1.77e7     2042. #FF6600    #FFCCCC          54.8
 8 Central Afri…  2007    44.7 4.37e6      706. #FF6600    #FFCCCC          54.8
 9 Chad           2007    50.7 1.02e7     1704. #FF6600    #FFCCCC          54.8
10 Comoros        2007    65.2 7.11e5      986. #FF6600    #CCFFCC          54.8
# ℹ 132 more rows
# ℹ 1 more variable: lifeExpOverContinentAvg <chr>

I hope this is useful!

To leave a comment for the author, please follow the link and comment on their blog: pacha.dev/blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)