When dealing with data from high-throughput experimental platforms such as microarrays, it’s important to account for potential batch effects. A simple example: if you process all your normal tissue samples this week and your cancerous tissue samples next week, you’re in big trouble. Differences between cancer and normal are now confounded with processing time and you may as well start over with new microarrays.
Processing date is often a good surrogate for batch and it was once easy to extract dates from Affymetrix CEL files using Bioconductor. It seems that this is no longer the case.
Once upon a time (about 10 months ago), I could write code to process CEL files from the Affymetrix Human Exon 1.0 ST array that looked like this:
library(simpleaffy) # assuming a covdesc file in same directory as CEL files exp.cel <- read.affy(path = "data/celfiles") [email protected] <- "exon.pmcdf" exp.rma <- rma(exp.cel)
Getting the scan date could not have been easier:
pd <- pData(protocolData(exp.rma)) head(pd) # ScanDate # HCT_CON_A.CEL 2009-12-11T02:57:33Z # HCT_CON_B.CEL 2009-12-11T00:56:07Z # HCT_Treated_A.CEL 2009-12-11T04:27:12Z # HCT_Treated_B.CEL 2009-12-11T01:26:20Z # HT29_CON_A.CEL 2009-12-11T03:27:22Z # HT29_CON_B.CEL 2009-12-11T03:56:58Z
Alas, this code now fails at the first hurdle (reading CEL files):
Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData, : The affy package is not designed for this array type. Please use either the oligo or xps package. FALSE
Fair enough. I’ve used the oligo package before, so we try that instead:
library(oligo) # assuming CEL files in ./data/celfiles exp.cel <- read.celfiles(list.celfiles("data/celfiles", full.names = T)) exp.rma <- rma(exp.cel)
Wait – what happened to the dates?
pd <- pData(protocolData(exp.rma)) head(pd) # exprs dates # HCT_CON_A.CEL data/celfiles//HCT_CON_A.CEL # HCT_CON_B.CEL data/celfiles//HCT_CON_B.CEL # HCT_Treated_A.CEL data/celfiles//HCT_Treated_A.CEL # HCT_Treated_B.CEL data/celfiles//HCT_Treated_B.CEL # HT29_CON_A.CEL data/celfiles//HT29_CON_A.CEL # HT29_CON_B.CEL data/celfiles//HT29_CON_B.CEL # oligo has a runDate() method # no joy there either runDate(exp.cel) #  # Levels:
OK – let’s look at xps. A prerequisite for installation is something called ROOT; not the super user, but a data analysis framework. If you’re on an Ubuntu-like system, ignore all the confusing advice around the Web about setting environment variables and just:
sudo apt-get install root-system
After xps installation we read the manual:
In contrast to other packages, which rely on the Bioconductor method for creating cdf environments, we need to create ROOT scheme files directly from the Affymetrix source files, which need to be downloaded first from the Affymetrix web site. However, once created, it is in principle possible to distribute the ROOT scheme files, too.
I’m sorry. There are only so many hoops that I’m willing to jump through in order to simply read from a file and this is a step too far.
So my current fix – use apt-cel-convert, part of the Affymetrix Power Tools suite, to convert to text and then grep for dates in the format MM/DD/YY:
apt-cel-convert -f text -o . myfile.CEL grep ^DatHeader myfile.CEL | grep -ioP "\b\d+\/\d+\/\d+\b" # 01/15/13
I don’t know yet whether those regular expressions work for all cases.
Have a better suggestion? Let’s hear it.
Update – for completeness I’m using R 3.0.1, Bioconductor 2.12, oligo 1.24.2 on Ubuntu 10.04. I’ve also used a CEL file from a different platform, HG U133A, which gave this:
runDate(exp.cel) #  [0..46114] 10434-10425 #799 133A 8-222-02:CLS=4733 RWS=4733 XIN=3 YIN=3 VE=17 # 2.0 08/22/02 12:08:07 \024 \024 HG-U133A.1sq \024 \024 \024 \024 \024 \024 \024 # \024 \024 6 # Levels: [0..46114] 10434-10425 #799 133A 8-222-02:CLS=4733 RWS=4733 XIN=3 YIN=3 VE=17 # 2.0 08/22/02 12:08:07 \024 \024 HG-U133A.1sq \024 \024 \024 \024 \024 \024 \024 # \024 \024 6
i.e. the entire DatHeader line has been stored as a factor.