Reading conditional formatting in XLSX files in a Tidy way with readxl
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Because of delays with my scholarship payment, if this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee or GitHub Sponsorships. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor. If you play the electric guitar, the same scholarship chaos led me to turn my guitar pedals and DIY kits hobby into a business, and you can check those here.
Continuing with the previous post, I found an additional data organization issue: Using conditional format as information.
Like all tools (R included), spreadsheets are good when used properly. Conditional formats provide visual guides but there are better ways to express categories such as “above the mean” (e.g. using “VLOOKUP” to clearly add categories to the tables).
Think about the gapminder dataset. If instead of this:
You get this:
Then without the “good” sheet you need extra information not present in the file to know what the colours in “lifeExp” mean. In this case, the colour is used to express if the country in question has a life expectancy above or below the continent’s mean. This could be guessed dedicating time to explore the data, which can be time consuming and lead to wrong conclusions.
The changes I made to the readxl package previously cannot read the colours dynamically created with conditional formatting, and I had to expand my Pull Request.
The new change alters the C++ code such that the extra extract_colors
parameter in read_excel()
, and now allows to do this:
file <- readxl_example("gapminder-2007.xlsx") # using the new functionality bad_data <- read_excel(file, sheet = "bad", extract_colors = TRUE) > bad_data # A tibble: 142 × 7 country year lifeExp pop gdpPercap country_bg lifeExp_bg <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> 1 Algeria 2007 72.3 3.33e7 6223. #FF6600 #CCFFCC 2 Angola 2007 42.7 1.24e7 4797. #FF6600 #FFCCCC 3 Benin 2007 56.7 8.08e6 1441. #FF6600 #FFCCCC 4 Botswana 2007 50.7 1.64e6 12570. #FF6600 #FFCCCC 5 Burkina Faso 2007 52.3 1.43e7 1217. #FF6600 #FFCCCC 6 Burundi 2007 49.6 8.39e6 430. #FF6600 #FFCCCC 7 Cameroon 2007 50.4 1.77e7 2042. #FF6600 #FFCCCC 8 Central African Republic 2007 44.7 4.37e6 706. #FF6600 #FFCCCC 9 Chad 2007 50.7 1.02e7 1704. #FF6600 #FFCCCC 10 Comoros 2007 65.2 7.11e5 986. #FF6600 #CCFFCC # ℹ 132 more rows # ℹ Use `print(n = ...)` to see more rows
In this way, we can extract some information from the conditional formatting, such as:
> unique(bad_data$lifeExp_bg) [1] "#CCFFCC" "#FFCCCC"
and compare the unique values in the “good” and “bad” sheets:
> good_data <- read_excel(file, sheet = "good") > unique(good_data$lifeExpOverContinentAvg) [1] "Above Continent Average" "Below Continent Average"
This is an example of some trime consuming tasks as without the “good” sheet we would need to guess this with the “bad” sheet:
> library(dplyr) bad_data %>% group_by(country_bg) %>% mutate(lifeExpAvg = mean(lifeExp, na.rm = T)) %>% mutate(lifeExpOverContinentAvg = case_when( lifeExp > lifeExpAvg ~ "Above Continent Average", TRUE ~ "Below Continent Average" )) # A tibble: 142 × 9 # Groups: country_bg [5] country year lifeExp pop gdpPercap country_bg lifeExp_bg lifeExpAvg <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> 1 Algeria 2007 72.3 3.33e7 6223. #FF6600 #CCFFCC 54.8 2 Angola 2007 42.7 1.24e7 4797. #FF6600 #FFCCCC 54.8 3 Benin 2007 56.7 8.08e6 1441. #FF6600 #FFCCCC 54.8 4 Botswana 2007 50.7 1.64e6 12570. #FF6600 #FFCCCC 54.8 5 Burkina Faso 2007 52.3 1.43e7 1217. #FF6600 #FFCCCC 54.8 6 Burundi 2007 49.6 8.39e6 430. #FF6600 #FFCCCC 54.8 7 Cameroon 2007 50.4 1.77e7 2042. #FF6600 #FFCCCC 54.8 8 Central Afri… 2007 44.7 4.37e6 706. #FF6600 #FFCCCC 54.8 9 Chad 2007 50.7 1.02e7 1704. #FF6600 #FFCCCC 54.8 10 Comoros 2007 65.2 7.11e5 986. #FF6600 #CCFFCC 54.8 # ℹ 132 more rows # ℹ 1 more variable: lifeExpOverContinentAvg <chr>
I hope this is useful!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.