Sequential Fitting Strategies For Models of short RNA Sequencing Data

June 17, 2017
By

(This article was first published on R – Statistical Reflections of a Medical Doctor, and kindly contributed to R-bloggers)

After a (really long!) hiatus I am reactivating my statistical blog. The first article  concerns the clarification of a point made in the manual of our recently published statistical model for short RNA sequencing data.
The background for this post, in case one wants to skip reading the manuscript (please do read it !), centers around the limitations of existing methods for the analysis of data for this very promising class of biomarkers. To overcome these limitations our group comprised from investigators from Division of Nephrology, University of New Mexico and the Galas Lab at Pacific Northwest Research Institute introduced a novel method for the analysis of short RNA sequencing (sRNAseq) data. This method (RNAseqGAMLSS), which was derived from first principles modeling of the short RNAseq process, was shown to have a number of desirable properties in an analysis of nearly 200 public and internal datasets:

  1. It can quantify the intrinsic, sequencing specific bias of sRNAseq from calibration, synthetic equimolar mixes of the target short RNAs (e.g. microRNAs)
  2.  It can use such estimates to correct for the bias present in experimental samples of different composition and input than the calibration runs. This in turns opens the way for the use of short RNAseq measurements in personalized medicine applications (as explained here)
  3. Adapted to the problem of differential expression analysis, our method exhibited  greater accuracy, higher sensitivity and specificity than six existing algorithms (DESeq2, edgeR, EBSeq, limma, DSS, voom) for the analysis of short RNA-seq data.
  4. In contrast to these popular methods which force the expression profiles to have a certain form of symmetry (equal number and magnitude of over-expressed and under-expressed sequences), our method can be used to discover global, directional changes in expression profiles which are missed by the aforementioned methods. Accounting for such a possibility may be appropriate in certain instances, in which the disease process leads to loss or gain in the number of cells of origin of the affected organ/tissue.

The proposed methodology which is based on Generalized Additive Models for Location, Scale and Shape (GAMLSS) involves the fitting of simultaneous regression models for the location (mean) and the scale (dispersion) of sequence counts using either the Negative Binomial or a particular parameterization of the Normal distribution. However there is price to pay for the advantages of RNASeqGAMLSS over alternatives: this comes in the form of a small (but not infinitesimal) probability[1] that the fitting algorithm will execute successfully. In the manual of our method (Section 6.4) we explain that a numerically more stable way of fitting these complex models exists and should be adapted if one encounters numerical optimization errors with the default approach used in the Nucleic Acids Research (NAR) manuscript. The three steps of this sequential fitting strategy are as follows:

  1. One fits a Poisson count mixed model to the RNAseq data, to obtain estimates of the relative mean expression for the different short RNA species in the expression profile
  2. These estimates are used to fix the values of the mean parameter model of the RNASeqGAMLSS model while estimating the values of the dispersion parameters.
  3. Finally, one uses the values of the mean (Step 1) and dispersion (Step 2) parameters to fit the general RNASeqGAMLSS model

In essence one ignores the overdispersion (additional variability) of the short RNAseq data (Step 1) to guide the algorithm into estimates of the dispersion parameters (Step 2). Finally one uses the separate estimates of the mean (Step 1) and dispersion (Step 2) parameters as an initial point for the simultaneous estimation of both (Step 3). The reason that this approach works is because the dispersion parameters do not affect the mean parameters, so that the Poisson distribution of Step 1 has the same mean as the RNASeqGAMLSS model. Hence the estimates produced by this Step are identical (to the limit of numerical precision) to those that would have been produced by a successful execution of the RNASeqGAMLSS optimization algorithms. Fixing these values when fitting the RNASeqGAMLSS model in Step 2 facilitates estimation of the dispersion parameters. Having very good initial guesses for these parameters virtually guarantees convergence of the 3rd Step (which is the only step in the NAR paper).

A fully worked out example is shown below (note that the data used in the NAR paper, source code, manual that includes instructions to compile the source code of the RNASeqGAMLSS and Win64 DLL libraries are all available in the BitBucket repository for this project)

First, we load the data, the C++ libraries and extract the data to the two groups we would like to compare :

library(TMB) ## the TMB framework for optimizati
library(lme4)
## load the DE C++ libraries
dyn.load(dynlib("LQNO_DE"))
## Note about the form of data storage for use with this software
##================================================================================
## the long format should be employed when storing microRNA data for
## GAMLSS type of analyses: in this format, the data are arranged in columns:
## - column miRNA : yields the name of the miRNA
## - column reads : reads of the miRNA from a *single sample*
## - column SampleID: the sample the reads were recorded at
## this is called long format, because data from experiments are stored in
## different rows (rather than appearing as consecutive columns)
##================================================================================
## lads the data from the 286 series
load("286.RData")
## Obtain data for GAMLSS - we will compare the two ratiometric series
datRat<-subset(dat286.long,(Series=="RatioB" | Series =="RatioA") & Amount=="100 fmoles")
datRat$SampleID<-factor(datRat$SampleID)
datRat$Series<-factor(datRat$Series)

## STEP 0: PREPARE THE DATA FOR THE RNASeqGAMLSS FIT
u_X<-as.numeric(factor(datRat$miRNA)) ## maps readings to the identity of the miRNA
u_G<-as.numeric(factor(datRat$Series)) ## maps counts to group
y=datRat$reads ## extract the actual counts
X<-model.matrix(~Series,data=datRat) ## design matrix (ANOVA) for group comparisons

Secondly, we fit the Poisson model (Step 1), using the facilities of the lme4 R package:

## STEP 1: USE A POISSON MODEL TO OBTAIN ESTIMATES FOR THE MU SUBMODEL
##==========================================================================================
## fit the parameters for the mu submodel using the poisson GLM
gl<-glmer(reads~Series+(0+Series|miRNA),data=datRat,family="poisson")

Then we extract the values of these parameters and used them to fix the values of the mean submodel (Step 2):

## STEP 2: USE THE MU MODEL ESTIMATES TO FIT THE PHI SUBMODEL
##==========================================================================================
## initializes standard deviation of RE for the mu submodel
sigmu=sqrt(diag((summary(gl)[["varcor"]])[[1]]))
sigsig=rep(1,max(u_G)) ## initializes standard deviation of RE for the phi submodel
b=fixef(gl) ## initial values for the overall group means (mean submodel)
## initial values for the variation of miRNAs around their group mean (mean submodel)
u_m=as.matrix(ranef(gl)$miRNA)
## Very rough initial values for the phi submodel parameters
s_b=rep(0,ncol(X)) ## initial values for the overall group means (phi submodel)
## initial values for the variation of miRNAs around their group mean (phi submodel)
u_s= matrix(0,max(u_X),max(u_G))
## MAP list that allow us to fix some parameters to their values
MAP<-NULL
MAP[["b"]]<-factor(rep(NA,length(b)))
MAP[["u_m"]]<-factor(rep(NA,length(c(u_m))))
MAP[["sigmu"]]<-factor(rep(NA,length(sigmu)))
## construct the AD object - note that we fix the mu at their values while estimating the
## phi submodel
obj.TMB<-MakeADFun(data=list(y=y,X=X,u_X=u_X,u_G=u_G),
parameters=list(b=b,s_b=s_b,u_m=u_m,u_s=u_s,
sigmu=sigmu,sigsig=sigsig),
DLL="LQNO_DE",random=c("u_s"),hessian=FALSE,silent=TRUE,
method="BFGS",random.start=expression(last.par.best[random]),
ADReport=TRUE,map=MAP)
## parameter estimation - note errors may be generated during some iterations
f.TMB<-nlminb(obj.TMB$par,obj.TMB$fn,obj.TMB$gr,
control=list(eval.max=10000,iter.max=10000),lower=-30,upper=30)
## obtain the report on the parameters to extract the fitted values of the gamlss model
rep<-sdreport(obj.TMB)
u_s = matrix(summary(rep,"random",p.value=FALSE)[,1],ncol=max(u_G))
dummy<-summary(rep,"fixed",p.value=FALSE)[,1] ## parameter estimates
s_b=dummy[1:max(u_G)]
sigsig=dummy[-(1:max(u_G))]

Finally, we refit the model letting all parameters vary:


## STEP 3: REFIT THE MODEL WITHOUT FIXING ANY PARAMETERS
##==========================================================================================
obj.TMB<-MakeADFun(data=list(y=y,X=X,u_X=u_X,u_G=u_G),
parameters=list(b=b,s_b=s_b,u_m=u_m,u_s=u_s,
sigmu=sigmu,sigsig=sigsig),
DLL="LQNO_DE",random=c("u_m","u_s"),hessian=TRUE,silent=TRUE,
method="BFGS",random.start=expression(last.par.best[random]),
ADReport=TRUE)
## scale objective by the magnitude of the deviance of the fitted Poisson model
f.TMB<-nlminb(obj.TMB$par,obj.TMB$fn,obj.TMB$gr,
control=list(eval.max=10000,iter.max=10000,scale=deviance(gl)),lower=-30,upper=30)
## obtain the report on the parameters
rep<-sdreport(obj.TMB)
## differential expression ratios, standard errors z and p-values
gamlssAD<-summary(rep,"report",p.value=TRUE)[1:nlevels(datRat$miRNA),]
rownames(gamlssAD)<-levels(datRat$miRNA)
## rownames are the miRNAs; columns are estimates, standard error, z value and pvalue

## the final estimates with their standard errors
head(gamlssAD)

These steps (and the RNASeqGAMLSS code) is going to be incorporated in an upcoming Bioconductor package for the analysis of short RNA sequencing data by Dr Lorena Pantano PhD. Until this package becomes available, the aforementioned code snippets may adapted very easily to one’s application by suitable adaptations of the code (i.e. the names of the columns corresponding to the RNA species, sample identifiers and experimental groups).

1. This is particularly likely when the underlying software implementations are not compiled against the Intel®Math Kernel Libraries.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Reflections of a Medical Doctor.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)