# Tidying messy Excel data (tidyxl)

**R on Abhijit Dasgupta**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Well, here’s what I was dealing with:

Notice that we have 3 header rows, first with patient IDs, second with spine region, and third with variable names (A and B, to protect the innocent).

#### Goal

A dataset that, for each patient and each angle gives us corresponding values of A and B. So this would be a four-column data set with ID, angle, A and B.

## Attempt 1 (readxl)

d1 <- readxl::read_excel('spreadsheet1.xlsx') head(d1) ## # A tibble: 6 x 26 ## X__1 patient `44` `44__1` `10` `10__1` `3` `3__1` `53` `53__1` #### 1 IDS T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 ## 2 angles A B A B A B A B ## 3 60 31.83… 1 31.52… 1 32.9… 0 31.8… 0 ## 4 65 31.66… 1 31.33… 1 32.2… 0 32.3… 0 ## 5 70 31.45… 1 31.09… 0.20200… 31.7… 0 32.5… 0 ## 6 75 31.08… 1 30.96… 0.44831… 31.2… 8.641… 32.3… 1 ## # ... with 16 more variables: `2` , `2__1` , `8` , ## # `8__1` , `6` , `6__1` , `43` , `43__1` , ## # `48` , `48__1` , `46` , `46__1` , `4` , ## # `4__1` , `9` , `9__1`

This strategy gives us funky column names, and pushes two of the headers into data rows. Since the headers are in rows, they’re a little harder to extract and work with. More worrisome is the fact that since the headers leaked into the data rows, the columns are all of type `character`

rather than type `numeric`

, which would now require further careful conversion after cleaning. So I don’t think `readxl`

is the way to go here, if there’s a better solution.

## Attempt 2 (tidyxl)

d2 <- tidyxl::xlsx_cells('spreadsheet1.xlsx') head(d2) ## # A tibble: 6 x 21 ## sheet address row col is_blank data_type error logical numeric #### 1 T5T6 B1 1 2 FALSE character NA NA ## 2 T5T6 C1 1 3 FALSE numeric NA 44. ## 3 T5T6 D1 1 4 FALSE numeric NA 44. ## 4 T5T6 E1 1 5 FALSE numeric NA 10. ## 5 T5T6 F1 1 6 FALSE numeric NA 10. ## 6 T5T6 G1 1 7 FALSE numeric NA 3. ## # ... with 12 more variables: date , character , ## # character_formatted , formula

, is_array , ## # formula_ref , formula_group , comment , height , ## # width , style_format , local_format_id

The `xlsx_cells`

captures the data in a tidy fashion, explicitly calling out rows and columns and other metadata within each cell. We can clean up this data using tidyverse functions:

library(tidyverse) cleanData1 <- function(d) { angle <- d %>% filter(row >= 4, col == 1) %>% pull(numeric) name <- d %>% filter(row %in% c(1,3), col >= 3) %>% mutate(character = ifelse(is.na(character), as.character(numeric), character)) %>% select(row, col, character) %>% filter(!is.na(character)) %>% spread(row, character) %>% unite(ID, `1`:`3`, sep = '_') %>% pull(ID) data <- d %>% filter(row >= 4, col >= 3) %>% filter(!is.na(numeric)) %>% select(row, col, numeric) %>% spread(col, numeric) %>% select(-row) %>% set_names(name) %>% cbind(angle) %>% gather(variable, value, -angle) %>% separate(variable, c('ID','Measure'), sep = '_') %>% spread(Measure, value) %>% select(ID, angle, A, B) %>% arrange(ID, angle) return(data) } head(cleanData1(d2)) ## ID angle A B ## 1 10 60 31.52867 1.000000 ## 2 10 65 31.33477 1.000000 ## 3 10 70 31.09272 0.202002 ## 4 10 75 30.96078 0.448317 ## 5 10 80 30.79397 0.670876 ## 6 10 85 30.52185 0.461406

This is a lot of data munging, and though `dplyr`

is powerful, it took a lot of trial and error to get the final pipeline done.

Nonetheless, I was really psyched about `tidyxl`

, since it automated a job that would have taken manual manipulation (I had 12 spreadsheets like this to process). I was going to write a blog post on this cool package that made my life dealing with messy Excel file a piece of cake. But wait, there’s more…

## Attempt 3 (tidyxl + unpivotr)

I didn’t know about `unpivotr`

until this post:

So maybe all that complicated munging can be simplfied.When your spreadsheet is too ??? for readxl, tidyxl + unpivotr helps you tackle charming features like “data as formatting” and “data in the layout”. https://t.co/ABerpfHT8W

— Jenny Bryan (@JennyBryan) December 7, 2017

# devtools::install_github('nacnudus/unpivotr') library(unpivotr) cleanData2 <- function(d){ bl <- d %>% select(row, col, data_type, numeric, character) %>% behead('N', ID) %>% behead('N', spine) %>% behead('N', variable) # Extract the angles column bl1 <- bl %>% filter(variable == 'angles') %>% spatter(variable) %>% select(row, angles) # Extract the rest of the columns bl2 <- bl %>% filter(variable %in% c('A','B')) %>% select(-spine, -col) %>% spatter(ID) %>% # Spread to columns select(-character) %>% # All my variables are numeric gather(ID, value, -row, -variable) %>% spread(variable, value) final <- bl1 %>% left_join(bl2) %>% # put things back together arrange(ID, angles) %>% select(ID, everything(),-row) # re-arrange columns return(final) } cleanData2(d2) ## # A tibble: 588 x 4 ## ID angles A B #### 1 10 60. 31.5 1.00 ## 2 10 65. 31.3 1.00 ## 3 10 70. 31.1 0.202 ## 4 10 75. 31.0 0.448 ## 5 10 80. 30.8 0.671 ## 6 10 85. 30.5 0.461 ## 7 10 90. 30.3 0.245 ## 8 10 95. 30.0 0.159 ## 9 10 100. 29.7 0.170 ## 10 10 105. 29.2 0.421 ## # ... with 578 more rows

In this example, I’m using the `behead`

function (available in the development version of `unpivotr`

on GitHub) to extract out the three
rows of headers. Then I’m extracting out the `angles`

column separately and merging it with the rest of the columns.

In case you’re wondering about the “N” in the

`behead`

code,`unpivotr`

has a geographic options system as to where the headers are with respect to the main code. This vignette explains this nomenclature.

## Attempt 4 (tidyxl + unpivotr)

After re-reading the `unpivotr`

documentation, I realized that the `angles`

column could be treated as a row header in the `unpivotr`

code. So I further modified the function:

cleanData3 <- function(d) { final <- d %>% select(row, col, data_type, numeric, character) %>% behead('N', ID) %>% # Extract column headers behead('N', spine) %>% behead('N', variable) %>% behead('W', angles) %>% # angles as row header select(numeric, ID:angles, data_type, -spine) %>% # all vars are numeric filter(variable %in% c('A', 'B')) %>% # Kills off some extra columns spatter(variable) # Spreads, using data_type, numeric return(final) } cleanData3(d2) ## # A tibble: 588 x 4 ## ID angles A B #### 1 10 60. 31.5 1.00 ## 2 10 65. 31.3 1.00 ## 3 10 70. 31.1 0.202 ## 4 10 75. 31.0 0.448 ## 5 10 80. 30.8 0.671 ## 6 10 85. 30.5 0.461 ## 7 10 90. 30.3 0.245 ## 8 10 95. 30.0 0.159 ## 9 10 100. 29.7 0.170 ## 10 10 105. 29.2 0.421 ## # ... with 578 more rows

I get to the same output, but with much cleaner code. This is cool!!

I’m going to go deeper into the `unpivotr`

documentation and see what else can be in my regular pipeline. A big thank you to the tool-makers that create these tools that make everyday activies easier and make us stay saner.

**leave a comment**for the author, please follow the link and comment on their blog:

**R on Abhijit Dasgupta**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.