Microsoft Office Metadata with R

June 10, 2013

(This article was first published on Joe's Data Diner, and kindly contributed to R-bloggers)

Sometimes I need to retrieve various items of metadata from Microsoft Office files. For the ‘old-style’ files (i.e. ‘.doc’ and ‘.xls’), a Python solution such as hachoir was perhaps the best way to extract this data from the OLE2 file format, although perhaps it was always possible in R too?

When I started digging around for a similar solution for the ‘new-style’ files (i.e. ‘.docx’ and ‘.xlsx’), I was pleasantly surprised to find that the file structure is much more open; indeed, it is called Office Open XML. I am by no means an expert, but basically each file is a zip archive of XML files. This makes getting at the metadata so much easier. I found a simple example in Python by zeekay on Stack Overflow. My code below is an unashamed replication of this in R.
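As a minimal sketch of the approach described above (not the author's original listing), the core properties can be read by pulling one XML file out of the zip archive and parsing it. This assumes the xml2 package is installed and that the metadata sits in `docProps/core.xml`, which is where Office normally writes it. The function names `parse_core_properties` and `office_metadata` are hypothetical, chosen for illustration.

```r
# Minimal sketch, assuming the xml2 package is installed.
# `parse_core_properties` and `office_metadata` are hypothetical helper names.
library(xml2)

# Turn the contents of docProps/core.xml into a named character vector.
parse_core_properties <- function(xml_string) {
  doc <- read_xml(xml_string)
  nodes <- xml_children(doc)  # one child element per property
  # Names are the element names without their namespace prefix.
  setNames(xml_text(nodes), xml_name(nodes))
}

# Read docProps/core.xml straight out of the zip archive without unpacking it.
# Assumption: the file stores its core properties at this path.
office_metadata <- function(path) {
  con <- unz(path, "docProps/core.xml")
  on.exit(close(con))
  xml_string <- paste(readLines(con, warn = FALSE, encoding = "UTF-8"),
                      collapse = "\n")
  parse_core_properties(xml_string)
}
```

For a hypothetical file `report.docx`, `office_metadata("report.docx")` would return whatever properties that file happens to record, typically fields such as the creator and the created and modified timestamps. Reading through `unz()` avoids writing temporary files; `utils::unzip()` into `tempdir()` would work just as well.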
