Reading the background colour of an XLSX file in a Tidy way with readxl
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Because of delays with my scholarship payment, if this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor. If you play the electric guitar, the same scholarship chaos led me to turn my guitar pedals and DIY kits hobby into a business, and you can check those here.
I had to read some XLSX (Microsoft Excel) files for a consultancy project I am leading now in Chile, when I found a severe data organization problem. Instead of coding the municipalities in a district as:
Distrito Circunscripción Provincial Comuna 14 Talagante Talagante 14 Talagante El Monte 14 Talagante Isla de Maipo 14 Talagante Padre Hurtado 14 Talagante Peñaflor ...
Some sheets in the files used multiple background colours to indicate the district that each municipality (“comuna”) belongs to when it comes to define who is the congressperson for that municipality (the map congressperson-municipality is a 1-to-many when the map mayor-municipality is 1-to-1).
This way of representing information is bad for everybody. It is not clear for the reader and it is more time consumer for whom created the file. A good resource about proper data organization with spreadsheets is Data organization in spreadsheets by Karl Broman and Kara Woo that I suggest to my students each semester.
Think about the gapminder dataset. If instead of this:
You get this:
Then you need extra information not present in the file to know the continent for each country. This is not so bad but it is for more specific information (e.g., think about the international phone code).
I searched online and found a few cumbersome solutions from nearly ten years ago using the xlsx package. Not happy with those, I used my C++ knowledge obtained in ECE244 and made some changes to the readxl package.
The change adds an extra extract_colors
parameter to read_excel()
, and allows to do this:
# extract_colors = FALSE (default) good_data <- read_excel(file, sheet = "good") > good_data # A tibble: 142 × 6 country continent year lifeExp pop gdpPercap <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Algeria Africa 2007 72.3 33333216 6223. 2 Angola Africa 2007 42.7 12420476 4797. 3 Benin Africa 2007 56.7 8078314 1441. 4 Botswana Africa 2007 50.7 1639131 12570. 5 Burkina Faso Africa 2007 52.3 14326203 1217. 6 Burundi Africa 2007 49.6 8390505 430. 7 Cameroon Africa 2007 50.4 17696293 2042. 8 Central African Republic Africa 2007 44.7 4369038 706. 9 Chad Africa 2007 50.7 10238807 1704. 10 Comoros Africa 2007 65.2 710960 986. # ℹ 132 more rows # ℹ Use `print(n = ...)` to see more rows bad_data <- read_excel(file, sheet = "bad", extract_colors = TRUE) > bad_data # A tibble: 142 × 6 country year lifeExp pop gdpPercap country_bg <chr> <dbl> <dbl> <dbl> <dbl> <chr> 1 Algeria 2007 72.3 33333216 6223. #FF6600 2 Angola 2007 42.7 12420476 4797. #FF6600 3 Benin 2007 56.7 8078314 1441. #FF6600 4 Botswana 2007 50.7 1639131 12570. #FF6600 5 Burkina Faso 2007 52.3 14326203 1217. #FF6600 6 Burundi 2007 49.6 8390505 430. #FF6600 7 Cameroon 2007 50.4 17696293 2042. #FF6600 8 Central African Republic 2007 44.7 4369038 706. #FF6600 9 Chad 2007 50.7 10238807 1704. #FF6600 10 Comoros 2007 65.2 710960 986. #FF6600
In this way, we can group countries somehow, and one of many possibilities is to convert country_bg
to a factor with five categories (e.g., one colour per continent) with:
library(dplyr) bad_data %>% mutate(country_bg = as.factor(country_bg))
or matching HEX codes to a string:
bad_data %>% inner_join( bad_data %>% distinct(country_bg) %>% mutate( continent = case_when( country_bg == "#FF6600" ~ "Asia", country_bg == "#99CCFF" ~ "Europe", country_bg == "#FFFFCC" ~ "Africa", country_bg == "#993366" ~ "Americas", country_bg == "#666699" ~ "Oceania", TRUE ~ NA_character_ ) ) )
I hope this is useful!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.