Reading the background colour of an XLSX file in a Tidy way with readxl

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Because of delays with my scholarship payment, if this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor. If you play the electric guitar, the same scholarship chaos led me to turn my guitar pedals and DIY kits hobby into a business, and you can check those here.

I had to read some XLSX (Microsoft Excel) files for a consultancy project I am leading now in Chile, when I found a severe data organization problem. Instead of coding the municipalities in a district as:

Distrito    Circunscripción Provincial  Comuna
14        Talagante                     Talagante
14        Talagante                     El Monte
14        Talagante                     Isla de Maipo
14        Talagante                     Padre Hurtado
14        Talagante                     Peñaflor
...

Some sheets in the files used multiple background colours to indicate the district that each municipality (“comuna”) belongs to when it comes to define who is the congressperson for that municipality (the map congressperson-municipality is a 1-to-many when the map mayor-municipality is 1-to-1).

This way of representing information is bad for everybody. It is not clear for the reader and it is more time consumer for whom created the file. A good resource about proper data organization with spreadsheets is Data organization in spreadsheets by Karl Broman and Kara Woo that I suggest to my students each semester.

Think about the gapminder dataset. If instead of this:

You get this:

Then you need extra information not present in the file to know the continent for each country. This is not so bad but it is for more specific information (e.g., think about the international phone code).

I searched online and found a few cumbersome solutions from nearly ten years ago using the xlsx package. Not happy with those, I used my C++ knowledge obtained in ECE244 and made some changes to the readxl package.

The change adds an extra extract_colors parameter to read_excel(), and allows to do this:

# extract_colors = FALSE (default)
good_data <- read_excel(file, sheet = "good")

> good_data                                                                   
# A tibble: 142 × 6
   country                  continent  year lifeExp      pop gdpPercap
   <chr>                    <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Algeria                  Africa     2007    72.3 33333216     6223.
 2 Angola                   Africa     2007    42.7 12420476     4797.
 3 Benin                    Africa     2007    56.7  8078314     1441.
 4 Botswana                 Africa     2007    50.7  1639131    12570.
 5 Burkina Faso             Africa     2007    52.3 14326203     1217.
 6 Burundi                  Africa     2007    49.6  8390505      430.
 7 Cameroon                 Africa     2007    50.4 17696293     2042.
 8 Central African Republic Africa     2007    44.7  4369038      706.
 9 Chad                     Africa     2007    50.7 10238807     1704.
10 Comoros                  Africa     2007    65.2   710960      986.
# ℹ 132 more rows
# ℹ Use `print(n = ...)` to see more rows
bad_data <- read_excel(file, sheet = "bad", extract_colors = TRUE)

> bad_data                                                                    
# A tibble: 142 × 6
   country                   year lifeExp      pop gdpPercap country_bg
   <chr>                    <dbl>   <dbl>    <dbl>     <dbl> <chr>     
 1 Algeria                   2007    72.3 33333216     6223. #FF6600   
 2 Angola                    2007    42.7 12420476     4797. #FF6600   
 3 Benin                     2007    56.7  8078314     1441. #FF6600   
 4 Botswana                  2007    50.7  1639131    12570. #FF6600   
 5 Burkina Faso              2007    52.3 14326203     1217. #FF6600   
 6 Burundi                   2007    49.6  8390505      430. #FF6600   
 7 Cameroon                  2007    50.4 17696293     2042. #FF6600   
 8 Central African Republic  2007    44.7  4369038      706. #FF6600   
 9 Chad                      2007    50.7 10238807     1704. #FF6600   
10 Comoros                   2007    65.2   710960      986. #FF6600

In this way, we can group countries somehow, and one of many possibilities is to convert country_bg to a factor with five categories (e.g., one colour per continent) with:

library(dplyr)

bad_data %>%
  mutate(country_bg = as.factor(country_bg))

or matching HEX codes to a string:

bad_data %>%
  inner_join(
    bad_data %>%
      distinct(country_bg) %>%
      mutate(
        continent = case_when(
          country_bg == "#FF6600" ~ "Asia",
          country_bg == "#99CCFF" ~ "Europe",
          country_bg == "#FFFFCC" ~ "Africa",
          country_bg == "#993366" ~ "Americas",
          country_bg == "#666699" ~ "Oceania",
          TRUE ~ NA_character_
        )
      )
  )

I hope this is useful!

To leave a comment for the author, please follow the link and comment on their blog: pacha.dev/blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)