When life gives you coloured cells, make categories

August 5, 2014
By

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.

I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.

I have created a simple Excel workbook. It contains one sheet, in which you will find one column headed “x” followed by the numbers 1-10. The cells A2:A11 are filled with colour and alternate: red-green-red-green-red-green-red-green-red-green.

We’ll read it into R using the xlsx package and extract the worksheet.

library(xlsx)
wb     <- loadWorkbook("test.xlsx")
sheet1 <- getSheets(wb)[[1]]

Now we can get the rows and the cells.

# get all rows
rows  <- getRows(sheet1)
cells <- getCells(rows)
# quick look at the values
sapply(cells, getCellValue)
#  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1 11.1 
#  "x"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9" "10" 

Colour information is contained in the cell styles.

styles <- sapply(cells, getCellStyle)

Somewhat confusingly, although the Excel GUI specifies “fill background”, the property is accessed using getFillForegroundXSSFColor().

Here’s a little function which, given a cell, returns the RGB code for the fill colour. The header cell has no fill colour, so getRgb() returns an error that we can catch.

cellColor <- function(style) {
    fg  <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    rgb <- paste(rgb, collapse = "")
    return(rgb)
}

Applying that to our styles gives this:

sapply(styles, cellColor)
     1.1      2.1      3.1      4.1      5.1      6.1      7.1      8.1      9.1     10.1 
      "" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" 
    11.1 
"00ff00" 

Knowing that, for example, green = 00ff00 = normal and red = ff0000 = tumour, we can now generate category labels something like this:

pheno <- list(normal = "00ff00", tumour = "ff0000")
m     <- match(sapply(styles, cellColor), pheno)
labs  <-names(pheno)[m]
labs
 [1] NA       "tumour" "normal" "tumour" "normal" "tumour" "normal" "tumour" "normal"
[10] "tumour" "normal"

and get on with our lives. Or at least, our analyses.


Filed under: bioinformatics, programming, R, statistics Tagged: excel, spreadsheet

To leave a comment for the author, please follow the link and comment on his blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.