Reading data from the new version of Google Spreadsheets

June 3, 2014
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Spreadsheets remain an important way for people to share and work with data. Among other providers, Google has provided the ability to create online spreadsheets and other documents.

Back in 2009, David Smith posted a blog entry on how to use R, and specifically the XML package to import data from a Google Spreadsheet. Once you marked your Google sheet as exported, it took about two lines of code to import your data into a data frame.

But things have changed

More recently, it seems that Google changed and improved the Spreadsheet product. Google's own overview of changes lists some changes, but one change isn't on this list. In the previous version, it was possible to publish a sheet as a csv file. In the new version it is still possible to publish a sheet, but the ability to do this as csv is no longer there.

On April 5, 2014 somebody asked a question on StackOverflow on how to deal with this.

Because I had the same need to import data from a spreadsheet shared in our team, I set out to find and answer.

Quick overview of publishing a sheet in the new version of Google docs

To publish a Google Docs spreadsheet is really as simple as following these three steps:

  • Create a google sheet
  • Publish to web
  • Copy the document link

R code to read the Google data

Here is the code. You will need to load the XML package before using this.

The function

readGoogleSheet()

returns a list of data frames, one for each table found on the Google sheet:

library(XML)
readGoogleSheet <- function(url, na.string="", header=TRUE){
  stopifnot(require(XML))
  # Suppress warnings because Google docs seems to have incomplete final line
  suppressWarnings({
    doc <- paste(readLines(url), collapse=" ")
  })
  if(nchar(doc) == 0) stop("No content found")
  htmlTable <- gsub("^.*?(", doc)
  ret <- readHTMLTable(htmlTable, header=header, stringsAsFactors=FALSE, as.data.frame=TRUE)
  lapply(ret, function(x){ x[ x == na.string] <- NA; x})
}

Next, we need a function to clean the individual tables.

cleanGoogleTable()

removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:

cleanGoogleTable <- function(dat, table=1, skip=0, ncols=NA, nrows=-1, header=TRUE, dropFirstCol=NA){
  if(!is.data.frame(dat)){
    dat <- dat[[table]]
  }
  if(is.na(dropFirstCol)) {
    firstCol <- na.omit(dat[[1]])
    if(all(firstCol == ".") || all(firstCol== as.character(seq_along(firstCol)))) {
      dat <- dat[, -1]
    }
  } else if(dropFirstCol) {
    dat <- dat[, -1]
  }
  if(skip > 0){
    dat <- dat[-seq_len(skip), ]
  }
  if(nrow(dat) == 1) return(dat)
  if(nrow(dat) >= 2){
    if(all(is.na(dat[2, ]))) dat <- dat[-2, ]
  }
  if(header && nrow(dat) > 1){
    header <- as.character(dat[1, ])
    names(dat) <- header
    dat <- dat[-1, ]
  }
  # Keep only desired columns
  if(!is.na(ncols)){
    ncols <- min(ncols, ncol(dat))
    dat <- dat[, seq_len(ncols)]
  }
  # Keep only desired rows
  if(nrows > 0){
    nrows <- min(nrows, nrow(dat))
    dat <- dat[seq_len(nrows), ]
  }
  # Rename rows
  rownames(dat) <- seq_len(nrow(dat))
  dat
}

See it in action

I created a Google Spreadsheet with data about the periodic table of elements. I then published this sheet. If you follow the link, you will notice that the document contains two sheets. Sheet1 contains the periodic table data.

To read the table, try the following three lines of code:

gdoc <- "https://docs.google.com/spreadsheets/d/1MQ50_tn76GqQAOpFigcHms4zFqkoM_JS4sOittv_vgA/pubhtml"
elem <- readGoogleSheet(gdoc)
m <- cleanGoogleTable(elem, table=1)
head(m)
    Atomic no          Name Symbol Group Period Block State at STP Occurrence
  1         1      Hydrogen      H     1      1     s          Gas Primordial
  2         2        Helium     He    18      1     s          Gas Primordial
  3         3       Lithium     Li     1      2     s        Solid Primordial
  4         4     Beryllium     Be     2      2     s        Solid Primordial
  5         5         Boron      B    13      2     p        Solid Primordial
  6         6        Carbon      C    14      2     p        Solid Primordial
tail(m)
    Atomic no          Name Symbol Group Period Block State at STP Occurrence
113       113   (Ununtrium)    Uut    13      7     p               Synthetic
114       114 (Ununquadium)    Uuq    14      7     p               Synthetic
115       115 (Ununpentium)    Uup    15      7     p               Synthetic
116       116  (Ununhexium)    Uuh    16      7     p               Synthetic
117       117 (Ununseptium)    Uus    17      7     p               Synthetic
118       118  (Ununoctium)    Uuo    18      7     p               Synthetic

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)