Site icon R-bloggers

Reading the background colour of an XLSX file in a Tidy way with readxl

[This article was first published on https://pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !DOCTYPE html> < charset="utf-8"> < http-equiv="X-UA-Compatible" content="IE=edge"> < name="viewport" content="width=device-width, initial-scale=1.0"> pacha.dev/blog < !-- MathJax Configuration --> < !-- Smart header: libraries detected based on content --> < !-- File: /tmp/tmp.gvQ59BdPSQ/index.html -->
  • < !-- DEBUG: Found sourceCode --> < !-- Load custom CSS after any library CSS to ensure proper precedence -->
  • < header class="site-top">

    Mauricio “Pachá” Vargas Sepúlveda

    Blog with notes about R, Shiny, SQL, Python, Linux and C++. This blog is listed on R-Bloggers.

    HOME 🏠
    < !-- categories are printed below this--> < nav class="sidebar-nav">

    Categories

    < header id="title-block-header" class="quarto-title-block default">

    Reading the background colour of an XLSX file in a Tidy way with readxl

    Data organization issues derived from not using the tools correctly and its solution.
    Author

    Mauricio “Pachá” Vargas S.

    Published

    August 25, 2025

    Because of delays with my scholarship payment, if this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor. If you play the electric guitar, the same scholarship chaos led me to turn my guitar pedals and DIY kits hobby into a business, and you can check those here.

    I had to read some XLSX (Microsoft Excel) files for a consultancy project I am leading now in Chile, when I found a severe data organization problem. Instead of coding the municipalities in a district as:

    Distrito    Circunscripción Provincial  Comuna
    14        Talagante                     Talagante
    14        Talagante                     El Monte
    14        Talagante                     Isla de Maipo
    14        Talagante                     Padre Hurtado
    14        Talagante                     Peñaflor
    ...

    Some sheets in the files used multiple background colours to indicate the district that each municipality (“comuna”) belongs to when it comes to define who is the congressperson for that municipality (the map congressperson-municipality is a 1-to-many when the map mayor-municipality is 1-to-1).

    This way of representing information is bad for everybody. It is not clear for the reader and it is more time consumer for whom created the file. A good resource about proper data organization with spreadsheets is Data organization in spreadsheets by Karl Broman and Kara Woo that I suggest to my students each semester.

    Think about the gapminder dataset. If instead of this:

    You get this:

    Then you need extra information not present in the file to know the continent for each country. This is not so bad but it is for more specific information (e.g., think about the international phone code).

    I searched online and found a few cumbersome solutions from nearly ten years ago using the xlsx package. Not happy with those, I used my C++ knowledge obtained in ECE244 and made some changes to the readxl package.

    The change adds an extra extract_colors parameter to read_excel(), and allows to do this:

    file <- readxl_example("gapminder-2007.xlsx")
    
    # extract_colors = FALSE (default)
    good_data <- read_excel(file, sheet = "good")
    
    > good_data                                                                   
    # A tibble: 142 × 6
       country                  continent  year lifeExp      pop gdpPercap
       <chr>                    <chr>     <dbl>   <dbl>    <dbl>     <dbl>
     1 Algeria                  Africa     2007    72.3 33333216     6223.
     2 Angola                   Africa     2007    42.7 12420476     4797.
     3 Benin                    Africa     2007    56.7  8078314     1441.
     4 Botswana                 Africa     2007    50.7  1639131    12570.
     5 Burkina Faso             Africa     2007    52.3 14326203     1217.
     6 Burundi                  Africa     2007    49.6  8390505      430.
     7 Cameroon                 Africa     2007    50.4 17696293     2042.
     8 Central African Republic Africa     2007    44.7  4369038      706.
     9 Chad                     Africa     2007    50.7 10238807     1704.
    10 Comoros                  Africa     2007    65.2   710960      986.
    # ℹ 132 more rows
    # ℹ Use `print(n = ...)` to see more rows
    # using the new functionality
    bad_data <- read_excel(file, sheet = "bad", extract_colors = TRUE)
    
    > bad_data                                                                    
    # A tibble: 142 × 6
       country                   year lifeExp      pop gdpPercap country_bg
       <chr>                    <dbl>   <dbl>    <dbl>     <dbl> <chr>     
     1 Algeria                   2007    72.3 33333216     6223. #FF6600   
     2 Angola                    2007    42.7 12420476     4797. #FF6600   
     3 Benin                     2007    56.7  8078314     1441. #FF6600   
     4 Botswana                  2007    50.7  1639131    12570. #FF6600   
     5 Burkina Faso              2007    52.3 14326203     1217. #FF6600   
     6 Burundi                   2007    49.6  8390505      430. #FF6600   
     7 Cameroon                  2007    50.4 17696293     2042. #FF6600   
     8 Central African Republic  2007    44.7  4369038      706. #FF6600   
     9 Chad                      2007    50.7 10238807     1704. #FF6600   
    10 Comoros                   2007    65.2   710960      986. #FF6600

    In this way, we can group countries somehow, and one of many possibilities is to convert country_bg to a factor with five categories (e.g., one colour per continent) with:

    library(dplyr)
    
    bad_data %>%
      mutate(country_bg = as.factor(country_bg))

    or matching HEX codes to a string:

    bad_data %>%
      inner_join(
        bad_data %>%
          distinct(country_bg) %>%
          mutate(
            continent = case_when(
              country_bg == "#FF6600" ~ "Asia",
              country_bg == "#99CCFF" ~ "Europe",
              country_bg == "#FFFFCC" ~ "Africa",
              country_bg == "#993366" ~ "Americas",
              country_bg == "#666699" ~ "Oceania",
              TRUE ~ NA_character_
            )
          )
      )

    I hope this is useful!

    < footer>

    Loading…

  • < !-- Load shared sidebar -->
    To leave a comment for the author, please follow the link and comment on their blog: https://pacha.dev/blog.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
    Exit mobile version