XLConnect 0.2-4

February 3, 2013
By

(This article was first published on Mirai Solutions » R, and kindly contributed to R-bloggers)

Mirai Solutions GmbH (http://www.mirai-solutions.com) is very pleased to announce the release of XLConnect 0.2-4, which is available from CRAN.

This newest release comes along with a number of new features:

  • Ability to read cached cell values. There is a new argument useCachedValues to methods readWorksheet and readNamedRegion. This is especially useful for cells with complex formulas that XLConnect (actually the underlying Apache POI library) cannot evaluate and so far would have resulted in NAs being returned. With this new option, XLConnect can be instructed to read the cached cell values instead of evaluating the cells – which should also result in improved runtime performance.
  • There are new arguments keep and drop to readWorksheet and readNamedRegion which allow filtering of columns before the data is imported to R. This can result in much better performance compared to importing the full data to R and then doing the subsetting in R.
  • There is a new vignette called XLConnect for the Impatient which provides a one page overview of XLConnect’s read/write methods – to get new users started quickly.
  • The bounding box detection algorithm for readWorksheet has been reworked to better deal with leading and trailing rows and columns of blank (empty) cells. In addition, new arguments autofitRow and autofitCol have been introduced that allow removal of leading and trailing rows and columns even in case boundaries have been specified by the user. This is useful in situations where the data is expected within certain given boundaries but the exact location is unknown.
  • String conversions when reading in data from Excel now respect the data format of the cells.
  • Support for numeric missing value identifiers has been added. See method setMissingValue for more information.
  • XLConnect 0.2-4 comes along with Apache POI 3.9 (see package dependency XLConnectJars)

A summary of these news can be found in the NEWS file on CRAN.

To see some of the new features in action, please find below some examples:

# Load the XLConnect package
require(XLConnect)

# Download sample data
download.file("http://miraisolutions.files.wordpress.com/2013/01/example.xlsx", destfile = "example.xlsx")

# Load the workbook
wb = loadWorkbook("example.xlsx")

# Read the data on the sheet named "data"
data = readWorksheet(wb, sheet = "data")
# Note all the NA values due to the cells with text containing
# 'some stuff'
print(head(data))

# Let's assume we only roughly know the data is located within some
# given boundaries (exact location would be startRow = 10,
# startCol = 8, endRow = 42, endCol = 20)
data = readWorksheet(wb, sheet = "data", startRow = 7, startCol = 7,
                     endRow = 46, endCol = 23)
# However, XLConnect is capable to find the actual data region
print(head(data))
# --> This is due to the arguments autofitRow and autofitCol
# which default to TRUE

# Let's see what would happen if we put them to FALSE
data = readWorksheet(wb, sheet = "data", startRow = 7, startCol = 7,
                     endRow = 46, endCol = 23, autofitRow = FALSE,
                     autofitCol = FALSE)
# --> Well, it reads the region exactly as specified above,
# so we get NAs obviously.

# Going back to the previous version
data = readWorksheet(wb, sheet = "data", startRow = 7, startCol = 7,
                     endRow = 46, endCol = 23)

# Did you note the column named 'complex' which is all NAs?
# This is due to the fact that the 'complex' column is computed
# using the BESSELJ function in Excel which is not yet implemented
# in Apache POI. However, we can still read the values by instructing
# XLConnect to read the _cached_ values rather than trying to evaluate
# the formula
data = readWorksheet(wb, sheet = "data", startRow = 7, startCol = 7,
                     endRow = 46, endCol = 23, useCachedValues = TRUE)
print(head(data))

# Let's get on to filtering. Let's assume we are only interested
# in the columns "car", "mpg" and "wt". Generally, we could
# do something like this:
data = readWorksheet(wb, sheet = "data", startRow = 7, startCol = 7,
                     endRow = 46, endCol = 23, useCachedValues = TRUE)
dataSub = data[, c("car", "mpg", "wt")]
print(head(dataSub))
# What we do here is import ALL the data and subset the data
# in R. This is fine and no problem for small data sets. For
# large data sets, however, there might be a performance penalty.
# It may be better to the subsetting on the Java side directly:
data = readWorksheet(wb, sheet = "data", startRow = 7, startCol = 7,
                     endRow = 46, endCol = 23, useCachedValues = TRUE,
                     keep = c("car", "mpg", "wt"))
print(head(data))

# Similarly, instead of 'keeping' columns, you can also drop
# them:
data = readWorksheet(wb, sheet = "data", startRow = 7, startCol = 7,
                     endRow = 46, endCol = 23, useCachedValues = TRUE,
                     drop = c("disp", "hp", "drat", "carb"))
print(head(data))

# Now on to missing values. Noted the values 999999 in column
# "gear"? We might would want those to read in as NAs. How can
# we achieve that?

# Set the missing value identifier
setMissingValue(wb, value = 999999)
# Read in the data
data = readWorksheet(wb, sheet = "data", startRow = 7, startCol = 7,
                     endRow = 46, endCol = 23, useCachedValues = TRUE)
# Here we go ...
print(head(data))

XLConnect is a comprehensive and cross-platform R package for manipulating Microsoft Excel files from within R. XLConnect differs from other related R packages in that it is completely cross-platform and as such runs under Windows, Unix/Linux and Mac (32- and 64-bit). Moreover, it does not require any installation of Microsoft Excel or any other special drivers to be able to read & write Excel files. The only requirement is a recent version of a Java Runtime Environment (JRE).

The package can easily be installed from CRAN via install.packages("XLConnect") (use install.packages("XLConnect", type = "source") on Mac OS X). In order to get started have a look at the XLConnect and XLConnect for the Impatient package vignettes, the numerous demos available via demo(package = "XLConnect") or browse through the comprehensive reference manual.

We are always happy to hear from you! For general feedback write us an email to [email protected]. If you find any bugs or hope to see some currently missing features in the future, just send us an email to [email protected]. You may also want to check out our XLConnect FAQ.


To leave a comment for the author, please follow the link and comment on his blog: Mirai Solutions » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.