I’ve recently been working with methylation data; specifically, from the Illumina Infinium HumanMethylation450 bead chip. It’s a rather complex array which uses two types of probes to determine the methylation state of DNA at ~ 485 000 sites in the genome.
The Bioconductor project has risen to the challenge with a (somewhat bewildering) variety of packages to analyse data from this chip. I’ve used lumi, methylumi and minfi, but will focus here on just the latter.
Update: a related post from Oliver Hofmann (@fiamh).
The data came to me in a standard directory and file structure (created, I think, using BeadStudio), as raw IDAT files with a sample sheet. So, reading the data into R was straightforward:
library(minfi) library(IlluminaHumanMethylation450kmanifest) baseDir <- "path/to/datadir" targets <- read.450k.sheet(baseDir) RGset <- read.450k.exp(base = baseDir, targets = targets)
This creates an RGChannelSet:
RGChannelSet (storageMode: lockedEnvironment) assayData: 622399 features, 108 samples element names: Green, Red protocolData: none phenoData sampleNames: 6057841162_R01C01 6057841162_R02C01 ... 6264488081_R06C02 (108 total) varLabels: Sample_Name Sample_Well ... filenames (9 total) varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: IlluminaHumanMethylation450k
The minfi package provides a function to convert the red/green channel values to raw intensity measurements representing methylated or unmethylated:
raw <- preprocessRaw(RGset)
That results in a MethylSet:
MethylSet (storageMode: lockedEnvironment) assayData: 485577 features, 108 samples element names: Meth, Unmeth protocolData: none phenoData sampleNames: 6057841162_R01C01 6057841162_R02C01 ... 6264488081_R06C02 (108 total) varLabels: Sample_Name Sample_Well ... filenames (9 total) varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: IlluminaHumanMethylation450k Preprocessing Raw (no normalization or bg correction) minfi version 1.0.0 Manifest version 0.2.0
The functions getMeth() and getUnmeth() extract the methylated/unmethylated intensities, respectively, to a matrix:
meth <- getMeth(raw) dim(meth) ##  485577 108
The MethylSet object, raw, has the same structure as the more familiar eSet, so we can extract phenotype data using pData():
names(pData(raw)) ##  "Sample_Name" "Sample_Well" "Sample_Plate" "Sample_Group" "Pool_ID" ##  "Array" "Slide" "Basename" "filenames"
It turns out that for my data, the 108 samples were run in 9 groups of 12 and are ordered as such in the sample sheet: that is samples 1-12 = Group1, 13-24 = Group2 and so on, up to 97-108 = Group9. It also turns out that the variable pData(raw)$Array is ordered in groups of 6. So for the first 18 samples:
head(pData(raw)$Array, 18) ##  "R01C01" "R02C01" "R03C01" "R04C01" "R05C01" "R06C01" "R01C02" "R02C02" ##  "R03C02" "R04C02" "R05C02" "R06C02" "R01C01" "R02C01" "R03C01" "R04C01" ##  "R05C01" "R06C01"
Keep those groups and orderings in mind, as we generate a boxplot of raw methylated intensities for the 108 samples:
library(RColorBrewer) cols <- brewer.pal(9, "Set3") groups.uniq <- unique(pData(raw)$Sample_Group) m <- match(pData(raw)$Sample_Group, groups.uniq) boxplot(meth, col = cols[m])
Result on the right.
Interesting and evidently, something “non-random” is going on. It looks as though we have 2 artifacts to consider. First, for these samples, groups 6-9 differ from groups 1-5. More interestingly, within groups, we see a cyclic signal where intensity rises gradually across the first 6 arrays, then falls and rises again for the second set of 6 arrays.
Take-home message: very simple plots can be informative. Now to figure out what it all means.