When life gives you coloured cells, make categories

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.

I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.

I have created a simple Excel workbook. It contains one sheet, in which you will find one column headed “x” followed by the numbers 1-10. The cells A2:A11 are filled with colour and alternate: red-green-red-green-red-green-red-green-red-green.

We’ll read it into R using the xlsx package and extract the worksheet.

wb     <- loadWorkbook("test.xlsx")
sheet1 <- getSheets(wb)[[1]]

Now we can get the rows and the cells.

# get all rows
rows  <- getRows(sheet1)
cells <- getCells(rows)
# quick look at the values
sapply(cells, getCellValue)
#  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1 11.1 
#  "x"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9" "10" 

Colour information is contained in the cell styles.

styles <- sapply(cells, getCellStyle)

Somewhat confusingly, although the Excel GUI specifies “fill background”, the property is accessed using getFillForegroundXSSFColor().

Here’s a little function which, given a cell, returns the RGB code for the fill colour. The header cell has no fill colour, so getRgb() returns an error that we can catch.

cellColor <- function(style) {
    fg  <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    rgb <- paste(rgb, collapse = "")

Applying that to our styles gives this:

sapply(styles, cellColor)
     1.1      2.1      3.1      4.1      5.1      6.1      7.1      8.1      9.1     10.1 
      "" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" 

Knowing that, for example, green = 00ff00 = normal and red = ff0000 = tumour, we can now generate category labels something like this:

pheno <- list(normal = "00ff00", tumour = "ff0000")
m     <- match(sapply(styles, cellColor), pheno)
labs  <-names(pheno)[m]
 [1] NA       "tumour" "normal" "tumour" "normal" "tumour" "normal" "tumour" "normal"
[10] "tumour" "normal"

and get on with our lives. Or at least, our analyses.

Filed under: bioinformatics, programming, R, statistics Tagged: excel, spreadsheet

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)