XLConnect – A platform-independent interface to Excel

August 31, 2011
By

(This article was first published on Mirai Solutions » R, and kindly contributed to R-bloggers)

XLConnect is a comprehensive and platform-independent R package for manipulating Microsoft Excel files from within R. XLConnect differs from other related R packages in that it is completely cross-platform and as such runs under Windows, Unix/Linux and Mac (32- and 64-bit). Moreover, it does not require any installation of Microsoft Excel or any other special drivers to be able to read & write Excel files. The only requirement is a recent version of a Java Runtime Environment (JRE). Also, XLConnect can deal with the old *.xls (BIFF) and the new *.xlsx (Office Open XML) file formats. Under the hood, XLConnect uses Apache POI (http://poi.apache.org) – a Java API to manipulate Microsoft Office documents.

XLConnect‘s main features include (as of version 0.1-5):

  • Reading & writing of worksheets (via data.frames)
  • Reading & writing of named regions/ranges (via data.frames)
  • Creating/removing/renaming/cloning worksheets
  • Adding graphics
  • (Named) cellstyles: data formats, borders, back- and foreground fill color, fill pattern, text wrapping
  • Controlling sheet visibility
  • Defining column width and row height
  • Merging/unmerging cells
  • Setting/getting cell formulas
  • Defining formula recalculation behavior (when workbooks are opened)
  • Setting auto-filters
  • Style actions: controlling application of cell styles when writing (e.g. when using templates)
  • Defining behavior when error cells are encountered

In order to make the XLConnect package a bit more accessible, this post will run through an example whose use case is the generation of a very simple Excel report showing the recent development of currency exchange rates (EUR, USD, GBP, JPY vs Swiss Franc (CHF)). Note that this post does not intend to give any investment advise!

##################################################
# Installation of Packages Required by this Demo #
##################################################

# XLConnect: Excel Connector for R (that's what this demo is about)

# fImport: Rmetrics - Economical and Financial Data Import
# forecast: Forecasting functions for time series
# zoo: S3 Infrastructure for Regular and Irregular Time Series
# ggplot2: An implementation of the Grammar of Graphics

install.packages(c("XLConnect", "fImport", "forecast", "zoo", "ggplot2"))

#################
# Load Packages #
#################

require(XLConnect)
require(fImport)
require(forecast)
require(zoo)
require(ggplot2)

####################
# Data Preparation #
####################

# Currencies we're interested in compared to CHF
currencies = c("EUR", "USD", "GBP", "JPY")

# Fetch currency exchange rates (currency to CHF) from OANDA (last 366 days)
curr = do.call("cbind", args = lapply(currencies,
               function(cur) oandaSeries(paste(cur, "CHF", sep = "/"))))
# Make a copy for later use
curr.orig = curr
# Scale currencies to exchange rate on first day in the series (baseline)
curr = curr * matrix(1/curr[1,], nrow = nrow(curr),
                     ncol = ncol(curr), byrow = TRUE) - 1
# Some data transformations to bring the data into a simple data.frame
curr = transform(curr, Time = time(curr)@Data)
names(curr) = c(currencies, "Time")
# Cyclic shift to bring the Time column to the front
curr = curr[(seq(along = curr) - 2) %% ncol(curr) + 1]

# Let's do some predictions ...

# Number of days to predict
predictDays = 20
# For each currency ...
currFit = sapply(curr[, -1], function(cur) {
  as.numeric(forecast(cur, h = predictDays)$mean)
  # Note: Normally one would obviously do some model diagnostics first...
})
# Add Time column to predictions
currFit = cbind(
  Time = seq(from = curr[nrow(curr), "Time"],
             length.out = predictDays + 1, by = "days")[-1],
  as.data.frame(currFit))

# Bind actual data with predictions
curr = rbind(curr, currFit)

###########################################
# Writing Data to Excel / Excel Reporting #
###########################################

## Let's start simple!

# Workbook filename
wbFilename = "swiss_franc.xlsx"
# Create a new workbook
wb = loadWorkbook(wbFilename, create = TRUE)

# Create a new sheet named 'Swiss_Franc'
sheet = "Swiss_Franc"
createSheet(wb, name = sheet)
# Alternatively, in a more object-oriented style:
# wb$createSheet(name = sheet)

# Create a new Excel name referring to the top left corner
# of the sheet 'Swiss_Franc' - this name is going to hold
# our currency data
dataName = "currency"
createName(wb, name = dataName, formula = paste(sheet, "$A$1", sep = "!"))
# Alternatively:
# wb$createName(name = dataName, formula = paste(sheet, "$A$1", sep = "!"))

# Write the currency data to the named region created above
# Note: the named region will be automatically redefined to encompass all
# written data
writeNamedRegion(wb, data = curr, name = dataName, header = TRUE)
# Alternatively:
# wb$writeNamedRegion(data = curr, name = dataName, header = TRUE)

# Save the workbook (this actually writes the file to disk)
saveWorkbook(wb)

# Just dumping out some data to Excel is a very common case.
# Therefore, there is a simpler way of doing this:
if(file.exists(wbFilename))
  file.remove(wbFilename)
writeNamedRegionToFile(wbFilename, data = curr, name = dataName,
                       formula = paste(sheet, "$A$1", sep = "!"),
                       header = TRUE)
# Note: writeNamedRegionToFile automatically generates the required sheet
# and Excel name!

## In the following steps we will make the currency report a bit nicer...

# In addition to writing out the data, for each currency we want to
# highlight the points in time when there was a change of more than 2%
# compared to the previous day.

# Load the workbook created above
wb = loadWorkbook(wbFilename)

# Create a date cell style with a custom format for the Time column
# (only show year, month and day without any time fields)
csDate = createCellStyle(wb, name = "date")
setDataFormat(csDate, format = "yyyy-mm-dd")
# Create a time/date cell style for the prediction records
csPrediction = createCellStyle(wb, name = "prediction")
setDataFormat(csPrediction, format = "yyyy-mm-dd")
setFillPattern(csPrediction, fill = XLC$FILL.SOLID_FOREGROUND)
setFillForegroundColor(csPrediction, color = XLC$COLOR.GREY_25_PERCENT)
# Create a percentage cell style
# Number format: 2 digits after decimal point
csPercentage = createCellStyle(wb, name = "currency")
setDataFormat(csPercentage, format = "0.00%")
# Create a highlighting cell style
csHlight = createCellStyle(wb, name = "highlight")
setFillPattern(csHlight, fill = XLC$FILL.SOLID_FOREGROUND)
setFillForegroundColor(csHlight, color = XLC$COLOR.CORNFLOWER_BLUE)
setDataFormat(csHlight, format = "0.00%")

# Index for all rows except header row
allRows = seq(length = nrow(curr)) + 1

# Apply date cell style to the Time column
setCellStyle(wb, sheet = sheet, row = allRows, col = 1, cellstyle = csDate)
# Set column width such that the full date column is visible
setColumnWidth(wb, sheet = sheet, column = 1, width = 2800)
# Apply prediction cell style
setCellStyle(wb, sheet = sheet, row = tail(allRows, n = predictDays), col = 1,
             cellstyle = csPrediction)
# Apply number format to the currency columns
currencyColumns = seq(along = currencies) + 1
for(col in currencyColumns) {
  setCellStyle(wb, sheet = sheet, row = allRows, col = col,
               cellstyle = csPercentage)
}

# Check if there was a change of more than 2% compared to the previous day
# (per currency)
idx = rollapply(curr.orig, width = 2, FUN = function(x) abs(x[2] / x[1] - 1),
                by.column = TRUE) > 0.02
idx = rbind(rep(FALSE, ncol(idx)), idx)
widx = apply(as.data.frame(idx), which)
# Apply highlighting cell style
for(i in seq(along = currencies)) {
  if(length(widx[[i]]) > 0) {
    setCellStyle(wb, sheet = sheet, row = widx[[i]] + 1, col = i + 1,
                 cellstyle = csHlight)
  }
  # Note:
  # +1 for row since there is a header row
  # +1 for column since the first column is the time column
}

saveWorkbook(wb)

## In a next step we are going to add a currency graph to our report

wb = loadWorkbook(wbFilename)

# Stack currencies into a currency variable (for use with ggplot2 below)
gcurr = reshape(curr, varying = currencies, direction = "long",
                v.names = "Value", times = currencies, timevar = "Currency")
# Also add a discriminator column to differentiate between actual and
# prediction values
gcurr[["Type"]] = ifelse(gcurr$Time %in% currFit$Time, "prediction", "actual")

# Create a png graph showing the currencies in the context of the Swiss Franc
png(filename = "swiss_franc.png", width = 800, height = 600)
ggplot(gcurr, aes(Time, Value, colour = Currency, linetype = Type)) +
  geom_line() + stat_smooth(method = "loess") + xlab("") +
  scale_y_continuous("Change to baseline", formatter = "percent") +
  opts(title = "Currencies vs Swiss Franc",
       axis.title.y = theme_text(size = 10, angle = 90, vjust = 0.3))
dev.off()

# Define where the image should be placed via a named region;
# let's put the image two columns left to the data starting in the 5th row
createName(wb, name = "graph",
           formula = paste(sheet, idx2cref(c(5, ncol(curr) + 2)), sep = "!"))
# Note: idx2cref converts indices (row, col) to Excel cell references

# Put the image created above at the corresponding location
addImage(wb, filename = "swiss_franc.png", name = "graph",
         originalSize = TRUE)

# As a short side-note: summary also works on workbooks!
summary(wb)

saveWorkbook(wb)

## Want to add formulas to your reports? Here we go...

wb = loadWorkbook(wbFilename)

# Let's add the maximum downturn (= minimum) fur the currencies at the bottom
# of the data (offset by three rows)

# Get reference coordinates (corners) of the named region containing our data
corners = getReferenceCoordinates(wb, dataName)
# Excel area references to calculate the average for
areaRefs = idx2aref(t(sapply(currencyColumns,
                    function(col) c(corners[1,1] + 1, col,
                                    corners[2,1], col))))
# Note: corners[1,1] + 1 as we do not want to include the header row in
# the formula

# Construct the required formulas
avg = paste("MIN(", areaRefs,  ")", sep = "")

# Set cell formulas
setCellFormula(wb, sheet = sheet, row = corners[2,1] + 3,
               col = currencyColumns, formula = avg)
# Add some cell styling...
csAvg = createCellStyle(wb, name = "min")
setDataFormat(csAvg, format = "0.00%")
setBorder(csAvg, side = "top", type = XLC$BORDER.MEDIUM,
          color = XLC$COLOR.BLACK)
setCellStyle(wb, sheet = sheet, row = corners[2,1] + 3,
             col = currencyColumns, cellstyle = csAvg)

# And, as we are already at it - let's also add an Excel auto-filter to
# the columns
setAutoFilter(wb, sheet = sheet, reference = aref(corners[1,], corners[2,]))

saveWorkbook(wb)

# Depending on the size and complexity of reports to create, starting from
# scratch and coding everything in R can be cumbersome (especially cell styling).
# Therefore, XLConnect supports working with templates. The application of cell
# styles when writing data to Excel can be controlled via so-called "style
# actions" (see function setStyleAction).

#######################################
# Reading / Importing Data from Excel #
#######################################

# Reading (rectangular) data from Excel is straight forward.
# There is basically two ways:
# - reading data from worksheets via readWorksheet
# - reading data from named regions via readNamedRegion

# Generally, reading from named regions is the simplest and preferred way but
# requires a named region to be set up correspondingly. If this is not the case,
# readWorksheet can be used.

## Reading from named regions

wb = loadWorkbook(wbFilename)

# Read the named region that was created in the section above
data = readNamedRegion(wb, name = dataName, header = TRUE)
# Alternatively: wb$readNamedRegion(name = dataName, header = TRUE)
# Do some calculations ...
colMeans(data[, currencyColumns])

## Reading from worksheets

data = readWorksheet(wb, sheet = sheet, header = TRUE)
# Note: When not specifying any of the boundaries (startRow, startCol, endRow,
# endCol) XLConnect will try to determine the boundaries automatically.
# If there are several data regions on a worksheet, this may not give the desired
# result and you may need to specify some boundaries.
# Actually, in our example the above won't give the desired result as the data
# read in also includes the mean values computed via the Excel formulas.
tail(data)

# Therefore we need to do the following
data = readWorksheet(wb, sheet = sheet, startRow = -1, endRow = nrow(curr) + 1,
                     startCol = -1, endCol = -1, header = TRUE)
tail(data)
# Note: -1 means that the correspondingly values will be automatically
# determined. Currently it is still necessary to specify either none or all
# boundaries. This may be changed in the future to simplify the application of
# readWorksheet and to allow something like:
# data = readWorksheet(wb, sheet = sheet, endRow = nrow(curr) + 1,
#                      header = TRUE)

# Also note that for simple cases there are the functions
# readNamedRegionFromFile and readWorksheetFromFile that prevent you having to
# go via the sequence loadWorkbook & readNamedRegion/readWorksheet:
data = readNamedRegionFromFile(wbFilename, name = dataName, header = TRUE)

data = readWorksheetFromFile(wbFilename, sheet = sheet, startRow = -1,
                             endRow = nrow(curr + 1), startCol = -1,
                             endCol = -1, header = TRUE)

## Some additional syntactic sugar:

# In case a workbook contains only a few named data regions, you can use 'with'
# to reference them directly (via the name of the named region) without having
# to read them in manually first:
with(wb, {
  colMeans(currency[, currencyColumns])
})

# Note: 'with' should only be used if the number of named regions in the workbook
# is small, since 'with' basically reads in all named regions automatically.

###########################
# Handling of Error Cells #
###########################

# XLConnect supports two different schemes on how to deal with error cells
# (cells that produce an error when trying to be read):
# - immediately stop on the first occurrence on an error cell
# - produce warnings for erroring cells and assign NA as their value

# Read example workbook containing various forms of error cells
errorCell <- system.file("demoFiles/errorCell.xlsx", package = "XLConnect")
wb = loadWorkbook(errorCell)

# Define to produce warnings on erroring cells and assume error cells to be NA
# (this is the default behavior)
onErrorCell(wb, XLC$ERROR.WARN)
# Read named region containing error cells
data = readNamedRegion(wb, name = "MyData")
print(data)

# Define to stop immediately when encountering error cells
onErrorCell(wb, XLC$ERROR.STOP)
# Read named region containing error cells
data = readNamedRegion(wb, name = "MyData")

We hope that the above example succeeded in giving a better understanding of XLConnect and its capabilities. The package can easily be installed from CRAN via install.packages(“XLConnect”). You may also want to have a look at the numerous demos available via demo(package = “XLConnect”). E.g. in order to run the “writeNamedRegion” demo simply run demo(topic = “writeNamedRegion”, package = “XLConnect”).

We are always glad to receive feedback! For general feedback write us an email to [email protected]. If you find any bugs or hope to see some currently missing features in the future, just send us an email to [email protected]. You may also want to check out our XLConnect FAQ.

Further references & resources:


To leave a comment for the author, please follow the link and comment on his blog: Mirai Solutions » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.