Microsoft Office Metadata with R

June 10, 2013

(This article was first published on Joe's Data Diner, and kindly contributed to R-bloggers)

Sometimes I need to retrieve various items of metadata from Microsoft Office files. For the 'old-style' files (i.e. '.doc' and '.xls'), a Python solution such as hachoir was perhaps the best way to extract this data from the OLE2 file format, although perhaps it was always possible in R too. When I started digging around for a similar solution for the 'new-style' files (i.e. '.docx' and '.xlsx'), I was pleasantly surprised to find that the file structure is much more open; indeed, the format is called Office Open XML. I am by no means an expert, but basically each file is a zipped set of XML files. This makes getting at the metadata much easier. I found a simple example in Python by zeekay on Stack Overflow. My code below is an unashamed replication of this in R.
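As a rough illustration of the zip-plus-XML idea, here is a minimal sketch in base R. It is not the original listing from the post. The function names `parse_core_properties` and `read_office_metadata` are hypothetical, and the sketch assumes the core metadata sits in `docProps/core.xml` inside the archive, which is where Office normally writes it. The regular expression is a shortcut that only picks up simple, non-nested elements and does not decode XML entities; a proper XML parser would be more robust.

```r
# Sketch only: base R, no extra packages. Function names are hypothetical.

# Extract simple <prefix:name>value</prefix:name> elements from core.xml text.
parse_core_properties <- function(xml_text) {
  xml_text <- paste(xml_text, collapse = "")
  # Matches elements with no nested tags, e.g. <dc:creator>Joe</dc:creator>
  pattern <- "<([A-Za-z]+:[A-Za-z]+)[^>]*>([^<]*)</\\1>"
  hits <- regmatches(xml_text, gregexpr(pattern, xml_text, perl = TRUE))[[1]]
  if (length(hits) == 0) return(character(0))
  values <- sub(pattern, "\\2", hits, perl = TRUE)
  names(values) <- sub(pattern, "\\1", hits, perl = TRUE)
  values
}

# Read the metadata straight out of the zipped .docx/.xlsx file.
# Assumption: the core properties part is stored at docProps/core.xml.
read_office_metadata <- function(path) {
  con <- unz(path, "docProps/core.xml")
  on.exit(close(con))
  parse_core_properties(readLines(con, warn = FALSE, encoding = "UTF-8"))
}

# Usage (the file name is a placeholder):
# read_office_metadata("report.docx")
```

The result is a named character vector, so an individual field such as the author can be picked out with `["dc:creator"]`, assuming the file records one.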
