I’ve recently been working with methylation data; specifically, from the Illumina Infinium HumanMethylation450 bead chip. It’s a rather complex array which uses two types of probes to determine the methylation state of DNA at ~ 485 000 sites in the genome.
The Bioconductor project has risen to the challenge with a (somewhat bewildering) variety of packages to analyse data from this chip. I’ve used lumi, methylumi and minfi, but will focus here on just the latter.
Update: a related post from Oliver Hofmann (@fiamh).
The data came to me in a standard directory and file structure (created, I think, using BeadStudio), as raw IDAT files with a sample sheet. So, reading the data into R was straightforward:
library(minfi) library(IlluminaHumanMethylation450kmanifest) baseDir <- "path/to/datadir" targets <- read.450k.sheet(baseDir) RGset <- read.450k.exp(base = baseDir, targets = targets)
This creates an RGChannelSet:
RGChannelSet (storageMode: lockedEnvironment)
assayData: 622399 features, 108 samples
element names: Green, Red
protocolData: none
phenoData
sampleNames: 6057841162_R01C01 6057841162_R02C01 ...
6264488081_R06C02 (108 total)
varLabels: Sample_Name Sample_Well ... filenames (9 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: IlluminaHumanMethylation450k
The minfi package provides a function to convert the red/green channel values to raw intensity measurements representing methylated or unmethylated:
raw <- preprocessRaw(RGset)
That results in a MethylSet:
MethylSet (storageMode: lockedEnvironment)
assayData: 485577 features, 108 samples
element names: Meth, Unmeth
protocolData: none
phenoData
sampleNames: 6057841162_R01C01 6057841162_R02C01 ...
6264488081_R06C02 (108 total)
varLabels: Sample_Name Sample_Well ... filenames (9 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: IlluminaHumanMethylation450k
Preprocessing Raw (no normalization or bg correction) minfi version 1.0.0
Manifest version 0.2.0
The functions getMeth() and getUnmeth() extract the methylated/unmethylated intensities, respectively, to a matrix:
meth <- getMeth(raw) dim(meth) ## [1] 485577 108
The MethylSet object, raw, has the same structure as the more familiar eSet, so we can extract phenotype data using pData():
names(pData(raw)) ## [1] "Sample_Name" "Sample_Well" "Sample_Plate" "Sample_Group" "Pool_ID" ## [6] "Array" "Slide" "Basename" "filenames"
It turns out that for my data, the 108 samples were run in 9 groups of 12 and are ordered as such in the sample sheet: that is samples 1-12 = Group1, 13-24 = Group2 and so on, up to 97-108 = Group9. It also turns out that the variable pData(raw)$Array is ordered in groups of 6. So for the first 18 samples:
head(pData(raw)$Array, 18) ## [1] "R01C01" "R02C01" "R03C01" "R04C01" "R05C01" "R06C01" "R01C02" "R02C02" ## [9] "R03C02" "R04C02" "R05C02" "R06C02" "R01C01" "R02C01" "R03C01" "R04C01" ## [17] "R05C01" "R06C01"
Keep those groups and orderings in mind, as we generate a boxplot of raw methylated intensities for the 108 samples:
library(RColorBrewer) cols <- brewer.pal(9, "Set3") groups.uniq <- unique(pData(raw)$Sample_Group) m <- match(pData(raw)$Sample_Group, groups.uniq) boxplot(meth, col = cols[m])
Result on the right.
Interesting and evidently, something “non-random” is going on. It looks as though we have 2 artifacts to consider. First, for these samples, groups 6-9 differ from groups 1-5. More interestingly, within groups, we see a cyclic signal where intensity rises gradually across the first 6 arrays, then falls and rises again for the second set of 6 arrays.
I love that when I tweet about this technical, esoteric stuff, I get instant and useful responses. Thanks for following: @iameltonjohn, @tweetingtechno and @fiamh.
Take-home message: very simple plots can be informative. Now to figure out what it all means.
Filed under: bioinformatics, R, research diary, statistics Tagged: illumina 450k, methylation, microarray
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...


Zero Inflated Models and Generalized Linear Mixed Models with R.
Zuur, Saveliev, Ieno (2012).