Microarrays, scan dates and Bioconductor: it shouldn’t be this difficult

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When dealing with data from high-throughput experimental platforms such as microarrays, it’s important to account for potential batch effects. A simple example: if you process all your normal tissue samples this week and your cancerous tissue samples next week, you’re in big trouble. Differences between cancer and normal are now confounded with processing time and you may as well start over with new microarrays.

Processing date is often a good surrogate for batch and it was once easy to extract dates from Affymetrix CEL files using Bioconductor. It seems that this is no longer the case.

Once upon a time (about 10 months ago), I could write code to process CEL files from the Affymetrix Human Exon 1.0 ST array that looked like this:

# assuming a covdesc file in same directory as CEL files
exp.cel         <- read.affy(path = "data/celfiles")
exp.cel@cdfName <- "exon.pmcdf"
exp.rma         <- rma(exp.cel)

Getting the scan date could not have been easier:

pd <- pData(protocolData(exp.rma))
#                               ScanDate
# HCT_CON_A.CEL     2009-12-11T02:57:33Z
# HCT_CON_B.CEL     2009-12-11T00:56:07Z
# HCT_Treated_A.CEL 2009-12-11T04:27:12Z
# HCT_Treated_B.CEL 2009-12-11T01:26:20Z
# HT29_CON_A.CEL    2009-12-11T03:27:22Z
# HT29_CON_B.CEL    2009-12-11T03:56:58Z

Alas, this code now fails at the first hurdle (reading CEL files):

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
The affy package is not designed for this array type.
Please use either the oligo or xps package.


Fair enough. I’ve used the oligo package before, so we try that instead:

# assuming CEL files in ./data/celfiles
exp.cel <- read.celfiles(list.celfiles("data/celfiles", full.names = T))
exp.rma <- rma(exp.cel)

Wait – what happened to the dates?

pd <- pData(protocolData(exp.rma))
#                                              exprs dates
# HCT_CON_A.CEL         data/celfiles//HCT_CON_A.CEL      
# HCT_CON_B.CEL         data/celfiles//HCT_CON_B.CEL      
# HCT_Treated_A.CEL data/celfiles//HCT_Treated_A.CEL      
# HCT_Treated_B.CEL data/celfiles//HCT_Treated_B.CEL      
# HT29_CON_A.CEL       data/celfiles//HT29_CON_A.CEL      
# HT29_CON_B.CEL       data/celfiles//HT29_CON_B.CEL

# oligo has a runDate() method
# no joy there either
#  [1]                
# Levels:

OK – let’s look at xps. A prerequisite for installation is something called ROOT; not the super user, but a data analysis framework. If you’re on an Ubuntu-like system, ignore all the confusing advice around the Web about setting environment variables and just:

sudo apt-get install root-system

After xps installation we read the manual:

In contrast to other packages, which rely on the Bioconductor method for creating cdf environments, we need to create ROOT scheme files directly from the Affymetrix source files, which need to be downloaded first from the Affymetrix web site. However, once created, it is in principle possible to distribute the ROOT scheme files, too.

I’m sorry. There are only so many hoops that I’m willing to jump through in order to simply read from a file and this is a step too far.

So my current fix – use apt-cel-convert, part of the Affymetrix Power Tools suite, to convert to text and then grep for dates in the format MM/DD/YY:

apt-cel-convert -f text -o . myfile.CEL
grep ^DatHeader myfile.CEL | grep -ioP "\b\d+\/\d+\/\d+\b"
# 01/15/13

I don’t know yet whether those regular expressions work for all cases.

Have a better suggestion? Let’s hear it.

Update – for completeness I’m using R 3.0.1, Bioconductor 2.12, oligo 1.24.2 on Ubuntu 10.04. I’ve also used a CEL file from a different platform, HG U133A, which gave this:

# [1] [0..46114]  10434-10425 #799 133A 8-222-02:CLS=4733 RWS=4733 XIN=3  YIN=3  VE=17        # 2.0 08/22/02 12:08:07    \024  \024 HG-U133A.1sq \024  \024  \024  \024  \024  \024  \024  # \024  \024 6
# Levels: [0..46114]  10434-10425 #799 133A 8-222-02:CLS=4733 RWS=4733 XIN=3  YIN=3  VE=17        # 2.0 08/22/02 12:08:07    \024  \024 HG-U133A.1sq \024  \024  \024  \024  \024  \024  \024  # \024  \024 6

i.e. the entire DatHeader line has been stored as a factor.

Filed under: bioinformatics, R, research diary, statistics Tagged: affymetrix, batch effect, microarray

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)