Extract tables from messy spreadsheets with jailbreakr

August 17, 2016
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R has some good tools for importing data from spreadsheets, among them the readxl package for Excel and the googlesheets package for Google Sheets. But these only work well when the data in the spreadsheet are arranged as a rectangular table, and not overly encumbered with formatting or generated with formulas. As Jenny Bryan pointed out in her recent talk at the useR!2016 conference (and embedded below, or download PDF slides here), in practice few spreadsheets have “a clean little rectangle of data in the upper-left corner”, because most people use spreadsheets not just a file format for data retrieval, but also as a reporting/visualization/analysis tool.

Nonetheless, for a practicing data scientist, there’s a lot of useful data locked up in these messy spreadsheets that needs to be imported into R before we can begin analysis. As just one example given by Jenny in her talk, this spreadsheet was included as one of 15,000 spreadsheet attachments (one with 175 tabs!) in the Enron Corpus.

Enron

To make it easier to import data into R from messy spreadsheets like this, Jenny and co-author Richard G. FitzJohn created the jailbreakr package. The package is in its early stages, but it can already import Excel (xlsx format) and Google Sheets intro R as a new “linen” objects from which small sub-tables can easily be extracted as data frames. It can also print spreadsheets in a condensed text-based format with one character per cell — useful if you’re trying to figure out why an apparently simple spreadsheet isn’t importing as you expect. (Check out the “weekend getaway winner” story near the end of Jenny’s talk for a great example.)

The jailbreakr package isn’t yet on CRAN, but if you want to try it out you can download it from the Github repository (or even contribute!) at the link below.

Github (rsheets): jailbreakr

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)