When life gives you coloured cells, make categories

August 5, 2014

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.

I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.

I have created a simple Excel workbook. It contains one sheet, in which you will find one column headed “x” followed by the numbers 1-10. The cells A2:A11 are filled with colour and alternate: red-green-red-green-red-green-red-green-red-green.

We’ll read it into R using the xlsx package and extract the worksheet.

wb     <- loadWorkbook("test.xlsx")
sheet1 <- getSheets(wb)[[1]]

Now we can get the rows and the cells.

# get all rows
rows  <- getRows(sheet1)
cells <- getCells(rows)
# quick look at the values
sapply(cells, getCellValue)
#  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1 11.1 
#  "x"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9" "10" 

Colour information is contained in the cell styles.

styles <- sapply(cells, getCellStyle)

Somewhat confusingly, although the Excel GUI specifies “fill background”, the property is accessed using getFillForegroundXSSFColor().

Here’s a little function which, given a cell, returns the RGB code for the fill colour. The header cell has no fill colour, so getRgb() returns an error that we can catch.

cellColor <- function(style) {
    fg  <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    rgb <- paste(rgb, collapse = "")

Applying that to our styles gives this:

sapply(styles, cellColor)
     1.1      2.1      3.1      4.1      5.1      6.1      7.1      8.1      9.1     10.1 
      "" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" 

Knowing that, for example, green = 00ff00 = normal and red = ff0000 = tumour, we can now generate category labels something like this:

pheno <- list(normal = "00ff00", tumour = "ff0000")
m     <- match(sapply(styles, cellColor), pheno)
labs  <-names(pheno)[m]
 [1] NA       "tumour" "normal" "tumour" "normal" "tumour" "normal" "tumour" "normal"
[10] "tumour" "normal"

and get on with our lives. Or at least, our analyses.

Filed under: bioinformatics, programming, R, statistics Tagged: excel, spreadsheet

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training



CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)