Going from a human readable Excel file to a machine-readable csv with {tidyxl}

September 10, 2018
By

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


I won’t write a very long introduction; we all know that Excel is ubiquitous in business, and that
it has a lot of very nice features, especially for business practitioners that do not know any
programming. However, when people use Excel for purposes it was not designed for, it can be a
hassle. Often, people use Excel as a reporting tool, which it is not; they create very elaborated
and complicated spreadsheets that are human readable, but impossible to import within any other tool.

In this blog post (which will probably be part of a series), I show you how you can go from this:

to this:

You can find the data I will use here. Click on the “Time use” folder and you can download the workbook.

The Excel workbook contains several sheets (in French and English) of the amount of time Luxembourguish
citizens spend from Monday to Sunday. For example, on average, people that are in employment spend
almost 8 hours sleeping during the week days, and 8:45 hours on Saturday.

As you can see from the screenshot, each sheet contains several tables that have lots of headers
and these tables are next to one another. Trying to import these sheets with good ol’ readxl::read_excel()
produces a monster.

This is where {tidyxl} comes into play. Let’s import the workbook with {tidyxl}:

library(tidyverse)
library(tidyxl)

time_use_xl <- xlsx_cells("time-use.xlsx")

Let’s see what happened:

head(time_use_xl)
## # A tibble: 6 x 21
##   sheet address   row   col is_blank data_type error logical numeric
##                        
## 1 Index A1          1     1 FALSE    character   NA           NA
## 2 Index B1          1     2 TRUE     blank       NA           NA
## 3 Index C1          1     3 TRUE     blank       NA           NA
## 4 Index D1          1     4 TRUE     blank       NA           NA
## 5 Index E1          1     5 TRUE     blank       NA           NA
## 6 Index F1          1     6 TRUE     blank       NA           NA
## # ... with 12 more variables: date , character ,
## #   character_formatted , formula , is_array ,
## #   formula_ref , formula_group , comment , height ,
## #   width , style_format , local_format_id 

As you can see, the sheet was imported, but the result might be unexpected. Actually, time_use_xl
is a tibble object, where each row is one cell of the Excel sheet. This might seem very complicated
to handle, but you will see that it actually makes things way easier.

I only want to work on the English sheets so I use the following code to ignore the French ones:

sheets <- xlsx_sheet_names("~/Downloads/time-use.xlsx") %>%
    keep(grepl(pattern = ".*day$", .))

Also, there’s a sheet that aggregates the results for week days and weekends, which I also ignore.

Now, to extract the tables from each sheet I wrote the following function:

extract_data <- function(sheet){
    activities <- sheet %>%
        filter(col == 2) %>%
        select(row, character) %>%
        filter(row %in% seq(6,58)) %>%
        rename(activities = character) %>%
        select(-row)
    
    cols_to_extract <- sheet %>% 
        filter(grepl("Population who completed.*", character)) %>% 
        pull(col)
    
    headers_pos <- cols_to_extract - 1
    
    headers <- sheet %>%
        filter(col %in% headers_pos, row == 3) %>%
        pull(character)
    
    cols_to_extract %>% 
        map(~filter(sheet, col %in% .)) %>%
        map(~select(., sheet, address, row, col, character)) %>%
        map(~filter(., row %in% seq(6,58))) %>%
        map(~select(., character)) %>%
        map2(.x = ., .y = headers, ~mutate(.x, "population" = .y)) %>%
        map(., ~bind_cols(activities, .)) %>%
        bind_rows()
}

Let’s study it step by step and see how it works. First, there’s the argument, sheet. This function
will be mapped to each sheet of the workbook. Then, the first block I wrote, extracts the
activities:

    activities <- sheet %>%
        filter(col == 2) %>%
        select(row, character) %>%
        filter(row %in% seq(6,58)) %>%
        rename(activities = character) %>%
        select(-row)

I only keep the second column (filter(col == 2)); col is a column of the tibble and if you
look inside the workbook, you will notice that the activities are on the second column, or the B
column. Then, I select two columns, the row and the character column. row is self-explanatory
and character actually contains whatever is written inside the cells. Then, I only keep rows
6 to 58, because that is what interests me; the rest is either empty cells, or unneeded. Finally,
I rename the character column to activities and remove the row column.

The second block:

    cols_to_extract <- sheet %>% 
        filter(grepl("Population who completed.*", character)) %>% 
        pull(col)

returns the index of the columns I want to extract. I am only interested in the people that have
completed the activities, so using grepl() inside filter(), I located these columns, and use
pull()… to pull them out of the data frame! cols_to_extract is thus a nice atomic vector of
columns that I want to keep.

In the third block, I extract the headers:

    headers_pos <- cols_to_extract - 1

Why - 1? This is because if you look in the Excel, you will see that the headers are one column before
the column labeled “People who completed the activity”. For example on column G, I have “People who completed the activity”
and on column F I have the header, in this case “Male”.

Now I actually extract the headers:

    headers <- sheet %>%
        filter(col %in% headers_pos, row == 3) %>%
        pull(character)

Headers are always on the third row, but on different columns, hence the col %in% headers_pos. I
then pull out the values inside the cells with pull(character). So my headers object will be
an atomic vector with “All”, “Male”, “Female”, “10 – 19 years”, etc… everything on row 3.

Finally, the last block, actually extracts the data:

    cols_to_extract %>% 
        map(~filter(sheet, col %in% .)) %>%
        map(~select(., sheet, address, row, col, character)) %>%
        map(~filter(., row %in% seq(6,58))) %>%
        map(~select(., character)) %>%
        map2(.x = ., .y = headers, ~mutate(.x, "population" = .y)) %>%
        map(., ~bind_cols(activities, .)) %>%
        bind_rows()

cols_to_extract is a vector with the positions of the columns that interest me. So for example
“4”, “7”, “10” and so on. I map this vector to the sheet, which returns me a list of extracted
data frames. I pass this down to a select() (which is inside map()… why? Because the input
parameter is a list of data frames). So for each data frame inside the list, I select the columns
sheet, address, row, col and character. Then, for each data frame inside the list,
I use filter() to only keep the rows from position 6 to 58. Then, I only select the character
column, which actually contains the text inside the cell. Then, using map2(), I add the values
inside the headers object as a new column, called population. Then, I bind the activities
column to the data frame and bind all the rows together.

Time to use this function! Let’s see:

clean_data <- sheets %>%
    map(~filter(time_use_xl, sheet %in% .)) %>%
    set_names(sheets) %>%
    map(extract_data) %>%
    map2(.x = ., .y = sheets, ~mutate(.x, "day" = .y)) %>%
    bind_rows() %>%
    select(day, population, activities, time = character)

glimpse(clean_data)
## Observations: 2,968
## Variables: 4
## $ day         "Year 2014_Monday til Friday", "Year 2014_Monday ti...
## $ population  "All", "All", "All", "All", "All", "All", "All", "A...
## $ activities  "Personal care", "Sleep", "Eating", "Other personal...
## $ time        "11:07", "08:26", "01:47", "00:56", "07:37", "07:47...

So I map my list of sheets to the tibble I imported with readxl, use set_names to
name the elements of my list (which is superfluous, but I wanted to show this; might interest you!)
and then map this result to my little function. I could stop here,
but I then add a new column to each data frame that contains the day on which the data was
measured, bind the rows together and reorder the columns. Done!

Now, how did I come up with this function? I did not start with a function. I started by writing
some code that did what I wanted for one table only, inside one sheet only. Only when I got
something that worked, did I start to generalize to several tables and then to several sheets. Most
of the time spent was actually in trying to find patterns in the Excel sheet that I could use
to write my function (for example noticing that the headers I wanted where always one column before
the column I was interested in). This is my advice when working with function programming; always
solve the issue for one element, wrap this code inside a function, and then simply map this function
to a list of elements!

If you found this blog post useful, you might want to follow me on twitter
for blog post updates.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)