Tidying messy Excel data (Introduction)

April 30, 2018
By

(This article was first published on R on Abhijit Dasgupta, and kindly contributed to R-bloggers)

Personal expressiveness, or how data is stored in a spreadsheet

When you get data from a broad research community, the variability in how that data is
formatted and stored is truly astonishing. Of course there are the standardized formats that are output from machines, like Next Generation Sequencing and other automated systems. That is a saving grace!

But for smaller data, or data collected in the lab, the possibilities are truly endless! You can get every possiblle color-coding of rows, columns and cells, merged cells,
hidden columns and rows, and inopportune blank spaces that convert numbers to characters,
and notes where there should be numbers. That’s just from the more organized spreadsheets. Then you get multiple tables in the same spreadsheet, ad hoc computations in some cells,
cells copied by hand (with error), and sundry other variations on this theme. In other words, it can be a nightmare scenario for the analyst. To wit,

In thinking about this post, I went back and looked at the documentation of the readxl package, which has made reading Excel files into R a lot easier than before. This package is quite powerful, so as long as data are in a
relatively clean tabular form, this tool can pull it into R; see these vignettes to get a real sense of how to process decently behaved Excel files with R.

On a side note, how many ways of importing Excel files into R can you name, or have you used?

The readxl package has been my friend for a while, but then I received some well-intentioned spreadsheets that even readxl wouldn’t touch, despite much coaxing. Now, I had two options: ask the collaborator to re-format the spreadsheet (manually, of course :smile:), which in my experience is a disaster waiting to happen; or just take the spreadsheet as is and figure out how to import it into R. I almost always take the latter route, and tidyxl is my bosom friend in this effort. In the next part of this series, I’ll describe my experiences with tidyxl and why, every time I use it, I say a small blessing for the team that created it.

To leave a comment for the author, please follow the link and comment on their blog: R on Abhijit Dasgupta.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)