After a (really long!) hiatus I am reactivating my statistical blog. The first article concerns the clarification of a point made in the manual of our recently published statistical model for short RNA sequencing data.
The background for this post, in case one wants to skip reading the manuscript (please do read it !), centers around the limitations of existing methods for the analysis of data for this very promising class of biomarkers. To overcome these limitations our group comprised from investigators from Division of Nephrology, University of New Mexico and the Galas Lab at Pacific Northwest Research Institute introduced a novel method for the analysis of short RNA sequencing (sRNAseq) data. This method (RNAseqGAMLSS), which was derived from first principles modeling of the short RNAseq process, was shown to have a number of desirable properties in an analysis of nearly 200 public and internal datasets:
- It can quantify the intrinsic, sequencing specific bias of sRNAseq from calibration, synthetic equimolar mixes of the target short RNAs (e.g. microRNAs)
- It can use such estimates to correct for the bias present in experimental samples of different composition and input than the calibration runs. This in turns opens the way for the use of short RNAseq measurements in personalized medicine applications (as explained here)
- Adapted to the problem of differential expression analysis, our method exhibited greater accuracy, higher sensitivity and specificity than six existing algorithms (DESeq2, edgeR, EBSeq, limma, DSS, voom) for the analysis of short RNA-seq data.
- In contrast to these popular methods which force the expression profiles to have a certain form of symmetry (equal number and magnitude of over-expressed and under-expressed sequences), our method can be used to discover global, directional changes in expression profiles which are missed by the aforementioned methods. Accounting for such a possibility may be appropriate in certain instances, in which the disease process leads to loss or gain in the number of cells of origin of the affected organ/tissue.
The proposed methodology which is based on Generalized Additive Models for Location, Scale and Shape (GAMLSS) involves the fitting of simultaneous regression models for the location (mean) and the scale (dispersion) of sequence counts using either the Negative Binomial or a particular parameterization of the Normal distribution. However there is price to pay for the advantages of RNASeqGAMLSS over alternatives: this comes in the form of a small (but not infinitesimal) probability that the fitting algorithm will execute successfully. In the manual of our method (Section 6.4) we explain that a numerically more stable way of fitting these complex models exists and should be adapted if one encounters numerical optimization errors with the default approach used in the Nucleic Acids Research (NAR) manuscript. The three steps of this sequential fitting strategy are as follows:
- One fits a Poisson count mixed model to the RNAseq data, to obtain estimates of the relative mean expression for the different short RNA species in the expression profile
- These estimates are used to fix the values of the mean parameter model of the RNASeqGAMLSS model while estimating the values of the dispersion parameters.
- Finally, one uses the values of the mean (Step 1) and dispersion (Step 2) parameters to fit the general RNASeqGAMLSS model
In essence one ignores the overdispersion (additional variability) of the short RNAseq data (Step 1) to guide the algorithm into estimates of the dispersion parameters (Step 2). Finally one uses the separate estimates of the mean (Step 1) and dispersion (Step 2) parameters as an initial point for the simultaneous estimation of both (Step 3). The reason that this approach works is because the dispersion parameters do not affect the mean parameters, so that the Poisson distribution of Step 1 has the same mean as the RNASeqGAMLSS model. Hence the estimates produced by this Step are identical (to the limit of numerical precision) to those that would have been produced by a successful execution of the RNASeqGAMLSS optimization algorithms. Fixing these values when fitting the RNASeqGAMLSS model in Step 2 facilitates estimation of the dispersion parameters. Having very good initial guesses for these parameters virtually guarantees convergence of the 3rd Step (which is the only step in the NAR paper).
A fully worked out example is shown below (note that the data used in the NAR paper, source code, manual that includes instructions to compile the source code of the RNASeqGAMLSS and Win64 DLL libraries are all available in the BitBucket repository for this project)
First, we load the data, the C++ libraries and extract the data to the two groups we would like to compare :
library(TMB) ## the TMB framework for optimizati library(lme4) ## load the DE C++ libraries dyn.load(dynlib("LQNO_DE")) ## Note about the form of data storage for use with this software ##================================================================================ ## the long format should be employed when storing microRNA data for ## GAMLSS type of analyses: in this format, the data are arranged in columns: ## - column miRNA : yields the name of the miRNA ## - column reads : reads of the miRNA from a *single sample* ## - column SampleID: the sample the reads were recorded at ## this is called long format, because data from experiments are stored in ## different rows (rather than appearing as consecutive columns) ##================================================================================ ## lads the data from the 286 series load("286.RData") ## Obtain data for GAMLSS - we will compare the two ratiometric series datRat<-subset(dat286.long,(Series=="RatioB" | Series =="RatioA") & Amount=="100 fmoles") datRat$SampleID<-factor(datRat$SampleID) datRat$Series<-factor(datRat$Series) ## STEP 0: PREPARE THE DATA FOR THE RNASeqGAMLSS FIT u_X<-as.numeric(factor(datRat$miRNA)) ## maps readings to the identity of the miRNA u_G<-as.numeric(factor(datRat$Series)) ## maps counts to group y=datRat$reads ## extract the actual counts X<-model.matrix(~Series,data=datRat) ## design matrix (ANOVA) for group comparisons
Secondly, we fit the Poisson model (Step 1), using the facilities of the lme4 R package:
## STEP 1: USE A POISSON MODEL TO OBTAIN ESTIMATES FOR THE MU SUBMODEL ##========================================================================================== ## fit the parameters for the mu submodel using the poisson GLM gl<-glmer(reads~Series+(0+Series|miRNA),data=datRat,family="poisson")
Then we extract the values of these parameters and used them to fix the values of the mean submodel (Step 2):
## STEP 2: USE THE MU MODEL ESTIMATES TO FIT THE PHI SUBMODEL ##========================================================================================== ## initializes standard deviation of RE for the mu submodel sigmu=sqrt(diag((summary(gl)[["varcor"]])[])) sigsig=rep(1,max(u_G)) ## initializes standard deviation of RE for the phi submodel b=fixef(gl) ## initial values for the overall group means (mean submodel) ## initial values for the variation of miRNAs around their group mean (mean submodel) u_m=as.matrix(ranef(gl)$miRNA) ## Very rough initial values for the phi submodel parameters s_b=rep(0,ncol(X)) ## initial values for the overall group means (phi submodel) ## initial values for the variation of miRNAs around their group mean (phi submodel) u_s= matrix(0,max(u_X),max(u_G)) ## MAP list that allow us to fix some parameters to their values MAP<-NULL MAP[["b"]]<-factor(rep(NA,length(b))) MAP[["u_m"]]<-factor(rep(NA,length(c(u_m)))) MAP[["sigmu"]]<-factor(rep(NA,length(sigmu))) ## construct the AD object - note that we fix the mu at their values while estimating the ## phi submodel obj.TMB<-MakeADFun(data=list(y=y,X=X,u_X=u_X,u_G=u_G), parameters=list(b=b,s_b=s_b,u_m=u_m,u_s=u_s, sigmu=sigmu,sigsig=sigsig), DLL="LQNO_DE",random=c("u_s"),hessian=FALSE,silent=TRUE, method="BFGS",random.start=expression(last.par.best[random]), ADReport=TRUE,map=MAP) ## parameter estimation - note errors may be generated during some iterations f.TMB<-nlminb(obj.TMB$par,obj.TMB$fn,obj.TMB$gr, control=list(eval.max=10000,iter.max=10000),lower=-30,upper=30) ## obtain the report on the parameters to extract the fitted values of the gamlss model rep<-sdreport(obj.TMB) u_s = matrix(summary(rep,"random",p.value=FALSE)[,1],ncol=max(u_G)) dummy<-summary(rep,"fixed",p.value=FALSE)[,1] ## parameter estimates s_b=dummy[1:max(u_G)] sigsig=dummy[-(1:max(u_G))]
Finally, we refit the model letting all parameters vary:
## STEP 3: REFIT THE MODEL WITHOUT FIXING ANY PARAMETERS ##========================================================================================== obj.TMB<-MakeADFun(data=list(y=y,X=X,u_X=u_X,u_G=u_G), parameters=list(b=b,s_b=s_b,u_m=u_m,u_s=u_s, sigmu=sigmu,sigsig=sigsig), DLL="LQNO_DE",random=c("u_m","u_s"),hessian=TRUE,silent=TRUE, method="BFGS",random.start=expression(last.par.best[random]), ADReport=TRUE) ## scale objective by the magnitude of the deviance of the fitted Poisson model f.TMB<-nlminb(obj.TMB$par,obj.TMB$fn,obj.TMB$gr, control=list(eval.max=10000,iter.max=10000,scale=deviance(gl)),lower=-30,upper=30) ## obtain the report on the parameters rep<-sdreport(obj.TMB) ## differential expression ratios, standard errors z and p-values gamlssAD<-summary(rep,"report",p.value=TRUE)[1:nlevels(datRat$miRNA),] rownames(gamlssAD)<-levels(datRat$miRNA) ## rownames are the miRNAs; columns are estimates, standard error, z value and pvalue ## the final estimates with their standard errors head(gamlssAD)
These steps (and the RNASeqGAMLSS code) is going to be incorporated in an upcoming Bioconductor package for the analysis of short RNA sequencing data by Dr Lorena Pantano PhD. Until this package becomes available, the aforementioned code snippets may adapted very easily to one’s application by suitable adaptations of the code (i.e. the names of the columns corresponding to the RNA species, sample identifiers and experimental groups).