Estimation of hydraulic conductivity and its uncertainty from grain-size data using GLUE and artificial neural networks.


Abstract
Various approaches exist to relate saturated hydraulic conductivity (Ks) to grain-size data. Most methods use a single grain-size parameter and hence omit the information encompassed by the entire grain-size distribution. This study compares two data-driven modelling methods, multiple linear regression and artificial neural networks, that use the entire grain-size distribution as input for Ks prediction. Besides the predictive capacity of the methods, the uncertainty associated with the model predictions is also evaluated, since such information is important for stochastic groundwater flow and contaminant transport modelling.
Artificial neural networks (ANNs) are combined with a generalised likelihood uncertainty estimation (GLUE) approach to predict Ks from grain-size data. The resulting GLUE-ANN hydraulic conductivity predictions and associated uncertainty estimates are compared with those obtained from the multiple linear regression models by leave-one-out cross-validation. The GLUE-ANN ensemble prediction proved to be slightly better than multiple linear regression. The prediction uncertainty, however, was reduced by half an order of magnitude on average, and decreased by up to an order of magnitude. This demonstrates that the proposed method outperforms classical data-driven modelling techniques. Moreover, a comparison with methods from the literature demonstrates the importance of site-specific calibration. The data set used for this purpose originates mainly from unconsolidated sandy sediments of the Neogene aquifer, northern Belgium. The proposed predictive models are developed from 173 pairs of grain-size and Ks data. Finally, an application with the optimised models is presented for a borehole lacking Ks data.
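
For context, a classic single-parameter estimator of the kind the paper improves upon is the Hazen formula, which uses only the 10% passing grain diameter d10. A minimal sketch for comparison (illustrative only, not part of the paper's method; the empirical coefficient C is commonly taken around 1 for Ks in cm/s and d10 in mm):

    # Hazen-type estimate: Ks [cm/s] ~ C * d10^2, with d10 in mm and C ~ 1
    hazen_Ks <- function(d10_mm, C=1) C * d10_mm^2
    hazen_Ks(0.2)  # roughly 0.04 cm/s for a fine sand with d10 = 0.2 mm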

Keywords
Early stopping – Cross-validation – Generalised likelihood uncertainty estimation – Artificial neural networks – Sedimentary aquifer – Data-driven modelling – Likelihood measures – Principal component analysis – GLUE-ANN

Reference
Rogiers, B., Mallants, D., Batelaan, O., Gedeon, M., Huysmans, M., Dassargues, A. 2012. Estimation of hydraulic conductivity and its uncertainty from grain-size data using GLUE and artificial neural networks. Mathematical Geosciences 44: 739–763.



An example script for performing such an analysis in the R language is provided with the paper. The most up-to-date version can be found below. Two simple examples are provided:

The first example models y = x² (plus noise) for x in the [0,100] range. It plots two figures: one with the results for the training data, and one predicting new data on the [-100,200] range. As the second figure makes clear, data outside the training range cannot be predicted.
The second example performs a leave-one-out cross-validation with the airquality dataset:

R script with examples of GLUE-ANN analyses:

################################################################################
### GLUE-ANN parameters + examples (Rogiers et al. 2012)                     ###
################################################################################
# A dataframe has to be provided with the independent variable(s), and a vector
# with the dependent variable. A list of parameters has to be set to define the
# a priori distributions of the stochastic parameters, and fixed values for the
# ones considered to be constant.
# Be sure to load the additional GLUE-ANN functions into R before trying
# the examples provided below. The AMORE and Hmisc packages should be
# installed and loaded:
install.packages(c('AMORE', 'Hmisc'))
library(AMORE)
library(Hmisc)
 
### Parameter list #############################################################
# A list of all required parameters with their default values.
 
  ### Pre-processing parameters ################################################
  # Orthogonalise input data; options are: 'Cov', 'Cor', and 'No'
    orthogonal <- 'Cor'
  # Number of input variables; provide a vector for stochastic treatment
    variableNumber <- c(1:ncol(inputData))
  # T: always select the most variance-explaining PCs; F: choose PCs randomly
    dimRed <- T
  # Rescale output to 0-1 range
    rescale.output <- T
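  # (Illustration: orthogonal='Cor' performs a PCA on the correlation matrix,
  # i.e. princomp(inputData, cor=TRUE), and uses the component scores as
  # model inputs; see train.glue.ann below.)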
 
  ### ANN parameters ###########################################################
  # Enable early stopping
    ES <- T
  # Part of the data used for early stopping
    randomESPart <- 0.5
  # T: resample until the early stopping samples have a mean similar to that
  # of the training samples
    nonEqualMeans <- T
  # Maximum difference between early stopping and training observation means
    maxMeanDiff <- mean(outputData)*0.2
  # Maximum number of training cycles; provide a vector for stochastic treatment
    nCycles <- 100
  # Number of hidden layers; provide a vector for stochastic treatment
    HLrange <- c(1:1)
  # Number of hidden nodes; provide a vector for stochastic treatment
    HNrange <- c(1:10)
  # Maximum number of hidden nodes per input variable
    maxHNPerVar <- Inf
  # Hidden layer activation function; options are 'tansig', 'sigmoid', 'purelin'
    hidden.layer <- 'tansig'
  # Output layer activation function; options are 'tansig', 'sigmoid', 'purelin'
    output.layer <- 'sigmoid'
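  # (Illustration: each ensemble member is built with an architecture vector
  # c(nvars, hiddenNodes, 1), e.g. c(3, 5, 1) for 3 inputs, one hidden layer
  # of 5 nodes and 1 output; see the newff() call in nncalc below.)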
 
  ### GLUE parameters ##########################################################
  # Number of models in the ensemble
    nSets <- 100
  # Number of random variates for weights based on Stedinger et al (2008)
    nStedinger <- 50
  # GLUE weighting procedure; options are 'MSE', 'MSEes' (for weights based on
  # the early stopping subset performance), 'NSEff' and 'Stedinger'
    weighting <- 'Stedinger'
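  # (Illustration: with 'Stedinger', model j receives likelihood weight
  # exp(-n/2 * MSE_j / se2), where n is the number of observations and se2 is
  # the error variance of the best model in the ensemble; see
  # predict.glue.ann below.)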
 
  ### Cross-validation parameters ##############################################
  # Use leave-one-out cross-validation
    cv <- T
 
### Examples ###################################################################
 
  ### Example of training and prediction for fictional data ####################
  # Create fictional data for x range [1,100]
    inputData <- data.frame(x=seq(1,100,0.5))
  # Dependent variable y = x^2                              
    outputData <- inputData$x ^ 2
  # Add random noise to y, representing different sources of error
    outputData <- jitter(outputData, amount=mean(outputData)*0.2)  
  # Run GLUE-ANN functions   
    glue.ann.ensemble <- train.glue.ann(inputData, outputData, nCycles=500)
    training.results <- predict.glue.ann(inputData, glue.ann.ensemble)
  # Visualise results for training data
    plot(inputData$x, training.results$eMean, type='l', xlab='x', ylab=
    expression(y == x^2 + epsilon), ylim=c(-2000,13000))
    lines(inputData$x, glue.ann.ensemble$outputData, lty=2, col='red')
    lines(inputData$x, training.results$eQ025, lty=3, col='blue')
    lines(inputData$x, training.results$eQ975, lty=3, col='blue')
    legend('topleft', c('GLUE-ANN ensemble mean prediction','Observations',
    '95% prediction uncertainty'), lty=c(1,2,3),col=c('black','red','blue'),
    bty='n')
  # Predict new data
    newData <- data.frame(x=c(-100:200)+0.5)                                       
    newData.predictions <- predict.glue.ann(newData, glue.ann.ensemble)
  # Visualise results of prediction
    plot(newData$x, newData.predictions$eMean, type='l', xlab='x', ylab=
    expression(y == x^2 + epsilon), ylim=c(-2000,13000))
    lines(newData$x, jitter(newData$x^2, amount=mean(outputData)*0.2), lty=2,
    col='red')
    lines(newData$x, newData.predictions$eQ025, lty=3, col='blue')
    lines(newData$x, newData.predictions$eQ975, lty=3, col='blue')
    abline(v=c(0,100))
    text(50,13000,'Training data range')
    legend('bottomright', c('GLUE-ANN ensemble mean prediction','Observations',
    '95% prediction uncertainty'), lty=c(1,2,3),col=c('black','red','blue'),
    bty='n')
 
  ### Example of cross-validation with an R dataset ############################
  # Read input and output data (ozone concentration as a function of solar
  # radiation, wind and temperature)
    outputData <- na.omit(airquality)[,1]
    inputData <- na.omit(airquality)[,2:4]
  # Perform cross-validation
    cross_validation <- train.glue.ann(inputData, outputData, cv=T,
    nCycles=1000, nSets=10)
  # Visualisation
    plot(cross_validation$target, ylim=c(-100,200), xlab='Sample number', ylab=
    'Ozone')
    points(cross_validation$eMean, pch=2, col='red')
    lines(cross_validation$eQ025, lty=3, col='blue')
    lines(cross_validation$eQ975, lty=3, col='blue')
    legend('bottomright', c('Observations','GLUE-ANN ensemble mean prediction',
    '95% prediction uncertainty'), pch=c(1,2,NA), lty=c(NA,NA,3),col=c('black',
    'red','blue'),bty='n')
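
The object returned in cross-validation mode can be summarised further; for instance, the root mean squared error and the coverage of the 95% prediction interval (two illustrative lines, not part of the published script):

    rmse <- sqrt(mean((cross_validation$target - cross_validation$eMean)^2))
    coverage <- mean(cross_validation$target >= cross_validation$eQ025 &
                     cross_validation$target <= cross_validation$eQ975)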

R functions for GLUE-ANN analysis with the AMORE package (Castejón Limas et al. 2010):
################################################################################
### GLUE-ANN - functions (Rogiers et al. 2012)                               ###
################################################################################
 
### ANN training function ######################################################
nncalc <- function(Ptrain, Pval, targetTrain, targetVal, architecture, n, Stao=1
, es=T, hidden.layer="sigmoid", output.layer="purelin")
{
  neuralnetwork <- newff(n.neurons=architecture, learning.rate.global=1e-2,
  momentum.global=0.5, error.criterium = 'LMS', Stao = Stao, hidden.layer=
  hidden.layer, output.layer=output.layer, method="ADAPTgdwm")
  if(!es) { Pval <- NULL; targetVal <- NULL }     
  capture.output(results <- train(neuralnetwork, P=Ptrain, T=targetTrain,
  Pval=Pval, Tval=targetVal, error.criterium='LMS', report=T, show.step=1,
  n.shows=n, Stao=Stao))
  return(results)
}
 
### Sampling parameter vectors function ########################################
resample <- function(x, n, ...) x[sample.int(length(x), size=n, ...)]
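# (sample.int is used so that a length-one x, e.g. nCycles=100, is treated as
# a single candidate value; sample(100, 1) would instead draw from 1:100)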
 
### Model ensemble training function ###########################################
train.glue.ann <- function(inputData,outputData,orthogonal='Cor',variableNumber=
c(1:ncol(inputData)),dimRed=T,rescale.output=T,ES=T,randomESPart=0.5,
nonEqualMeans=T,maxMeanDiff=mean(outputData)*0.2,nCycles=100,HLrange=c(1:1),
HNrange=c(1:10),maxHNPerVar=Inf,hidden.layer='tansig',output.layer='sigmoid',
nSets=100,nStedinger=50,weighting='Stedinger',cv=F)
{
  ### Initialisation ###########################################################
  parameters <- NULL
  parameters$orthogonal <- orthogonal
  parameters$variableNumber <- variableNumber
  parameters$dimRed <- dimRed
  parameters$rescale.output <- rescale.output
  parameters$ES <- ES
  parameters$randomESPart <- randomESPart
  parameters$nonEqualMeans <- nonEqualMeans
  parameters$maxMeanDiff <- maxMeanDiff
  parameters$nCycles <- nCycles
  parameters$HLrange <- HLrange
  parameters$HNrange <- HNrange
  parameters$maxHNPerVar <- maxHNPerVar
  parameters$hidden.layer <- hidden.layer
  parameters$output.layer <- output.layer
  parameters$nSets <- nSets
  parameters$nStedinger <- nStedinger
  parameters$weighting <- weighting
  parameters$cv <- cv        
  glue.ann <- NULL
  cvPredictions <- NULL
  glue.ann$parameters <- parameters      
  glue.ann$esNumber <- matrix(nrow=nrow(inputData), ncol=nSets)
 
  ### Training loop ############################################################
  for(crossValidationSampleNumber in 1:ifelse(cv,nrow(inputData),1))
  {
    if(cv)
    { 
        inputDataModellingCV <- as.data.frame(
        inputData[crossValidationSampleNumber,])
        inputDataModelling <- as.data.frame(
        inputData[-crossValidationSampleNumber,])
        outputDataModellingCV <- outputData[crossValidationSampleNumber]
        outputDataModelling <- outputData[-crossValidationSampleNumber] 
    } else {
        inputDataModelling <- inputData
        outputDataModelling <- outputData
    }
    names(inputDataModelling) <- c(1:ncol(inputDataModelling))
    if(cv) names(inputDataModellingCV) <- c(1:ncol(inputDataModellingCV))
    glue.ann$inputData <- inputDataModelling
    glue.ann$outputData <- outputDataModelling
    P <- cbind(inputDataModelling)    
    if(orthogonal != 'No'){Ppca <- princomp(P, cor=ifelse(orthogonal=='Cor', T,
    F));P <- Ppca$scores}
    trans <- P
    glue.ann$P <- P
    for(i in 1:ncol(inputData)) trans[,i] <- (trans[,i]-mean(P[,i]))/sd(P[,i])
    if(rescale.output) outputDataModelling <- (outputDataModelling - (
    min(glue.ann$outputData)))/(((max(glue.ann$outputData))-(
    min(glue.ann$outputData))))
    if(rescale.output & cv) outputDataModellingCV <- (outputDataModellingCV -(
    min(glue.ann$outputData)))/(((max(glue.ann$outputData))-(
    min(glue.ann$outputData))))
    variableslist <- list();networks <- list(); esNumber <- NULL
    earlyStoppingMSE <- NULL; earlyStoppingCycle <- NULL; hn <- list()
    totalVar <- NULL; trainingMSE <- NULL
    for(j in 1:nSets)
    {
      if(cv) cat(paste('# Cross-validating sample',crossValidationSampleNumber,
      ' '))
      esDataNumbers <- sample(1:nrow(inputDataModelling),
      round(nrow(inputDataModelling)*randomESPart,0))
      # Resample until the early stopping and training subsets have similar
      # means; a per-iteration check is used so the constraint applies to
      # every ensemble member, not only the first one
      if(nonEqualMeans)
      {
        repeat
        {
          esMean <- mean(outputDataModelling[esDataNumbers])
          trainMean <- mean(outputDataModelling[-esDataNumbers])
          if(abs(esMean-trainMean) < maxMeanDiff) break
          esDataNumbers <- sample(1:nrow(inputDataModelling),
          round(nrow(inputDataModelling)*randomESPart,0))
        }
      }
      trainingDataNumbers <- c(1:nrow(inputDataModelling))[-esDataNumbers]
      if(!ES) {trainingDataNumbers <- c(1:nrow(inputDataModelling))
      esDataNumbers <- trainingDataNumbers}
      nvars <- resample(variableNumber,1)
      variables <- resample(c(1:ncol(inputDataModelling)),nvars)
      if(dimRed) variables <- c(1:length(variables))
      variableslist[[j]] <- variables
      hiddenLayers <- resample(HLrange,1, replace=T)
      hiddenNodes <- resample(HNrange,hiddenLayers, replace=T)
      for(i in 1:hiddenLayers) if(maxHNPerVar < hiddenNodes[i]/nvars) {
      hiddenNodes[i] <- resample((min(HNrange):maxHNPerVar),1, replace=T)}
      ncyc <- resample(nCycles,1)
      cat('# Training ANN nr. ',j,', with architecture: ',sep='')
      cat(c(nvars,hiddenNodes[1:hiddenLayers],1), sep='-')
      Ptrain <- as.data.frame(trans[trainingDataNumbers,variables]   )
      Pval <- as.data.frame(trans[esDataNumbers,variables])
      targetTrain <- outputDataModelling[trainingDataNumbers]
      targetVal <- outputDataModelling[esDataNumbers]
      networks[[j]] <- nncalc(Ptrain, Pval, targetTrain, targetVal,
      architecture=c(nvars,hiddenNodes,1), ncyc, Stao=10, es=ES, hidden.layer=
      hidden.layer, output.layer=output.layer)
      if(ES) esNumber[j] <- which.min(networks[[j]]$Merror[,2])
      predTrain <- sim(networks[[j]]$net, Ptrain)
      trainingMSE[j] <- mean((targetTrain-predTrain)^2)
      if(ES) predEarlyStopping <- sim(networks[[j]]$net, Pval)
      if(ES) earlyStoppingMSE[j] <- mean((targetVal-predEarlyStopping)^2)
      if(ES) totalVar[j] <- var(c(predTrain-targetTrain, predEarlyStopping-
      targetVal))
      if(!ES) totalVar[j] <- var(c(predTrain-targetTrain))
      if(ES) earlyStoppingCycle[j] <- which.min(networks[[j]]$Merror[,2])
      hn[[j]] <- hiddenNodes
      if(ES) cat(paste(' # Training MSE: ',round(trainingMSE[j],2),
      ', early stopping MSE: ',round(earlyStoppingMSE[j],2),', cycles: ',
      earlyStoppingCycle[j],'\n', sep=''))
      if(!ES) cat(paste(' # Training MSE: ',round(trainingMSE[j],2),
      ', cycles: ',ncyc,'\n', sep=''))
    }
    glue.ann$networks <- networks
    glue.ann$variables <- variableslist
    glue.ann$trainMSE <- trainingMSE
    if(ES) glue.ann$esMSE <- earlyStoppingMSE
    if(ES) glue.ann$totalMSE <- (1-randomESPart) * glue.ann$trainMSE +
    randomESPart * glue.ann$esMSE
    if(!ES) glue.ann$totalMSE <- glue.ann$trainMSE
    glue.ann$totalVar <- totalVar
    glue.ann$hn <- hn
    if(orthogonal != 'No') glue.ann$pca <- Ppca
 
    ### Predict cross-validated sample #########################################
    if(cv)
    {
      cvPrediction <- predict.glue.ann(inputDataModellingCV, glue.ann)
      cvPredictions$MSE[crossValidationSampleNumber] <- mean(
      (outputData[crossValidationSampleNumber] - cvPrediction$eMean)^2)
      cvPredictions$eMean[crossValidationSampleNumber] <- cvPrediction$eMean
      cvPredictions$eMedian[crossValidationSampleNumber] <- cvPrediction$eMedian
      cvPredictions$eMin[crossValidationSampleNumber] <- cvPrediction$eMin
      cvPredictions$eMax[crossValidationSampleNumber] <- cvPrediction$eMax
      cvPredictions$eQ025[crossValidationSampleNumber] <- cvPrediction$eQ025
      cvPredictions$eQ975[crossValidationSampleNumber] <- cvPrediction$eQ975
      cvPredictions$MLE[crossValidationSampleNumber] <- cvPrediction$MLE
      cvPredictions$target[crossValidationSampleNumber] <-
      outputData[crossValidationSampleNumber]
    }
  }
  if(cv){return(cvPredictions)} else {return(glue.ann)}
}
 
### Model ensemble prediction function #########################################
predict.glue.ann <- function(inputData, glue.ann,
weighting=glue.ann$parameters$weighting)
{
  names(inputData) <- c(1:ncol(inputData))
  if(glue.ann$parameters$orthogonal != 'No') inputData <-
  predict(glue.ann$pca, newdata=inputData) # orthogonalise
  for(i in 1:ncol(inputData)) inputData[,i] <-
  (inputData[,i]-mean(glue.ann$P[,i]))/sd(glue.ann$P[,i]) # standardise
 
  ### Predict model ensemble results for new data ##############################
  resultsMatrix <- matrix(nrow=nrow(inputData), ncol=glue.ann$parameters$nSets)
  if(nrow(inputData) > 1) {for(i in 1:glue.ann$parameters$nSets) {
  resultsMatrix[,i] <- sim(glue.ann$networks[[i]]$net,
  inputData[,glue.ann$variables[[i]]])}}
  if(nrow(inputData) == 1) {for(i in 1:glue.ann$parameters$nSets) {
  resultsMatrix[,i] <- sim(glue.ann$networks[[i]]$net, t(
  inputData[,glue.ann$variables[[i]]]))}}
 
  ### Construct GLUE weights ###################################################
  if(weighting=='MSEes'){weightsGLUE <- (max(glue.ann$esMSE)
  - glue.ann$esMSE)
  weightsGLUE[which(weightsGLUE < 0)] <- 0
  weightsGLUE <- weightsGLUE/sum(weightsGLUE)}
  if(weighting=='MSE'){weightsGLUE <- (
  max(glue.ann$totalMSE) - glue.ann$totalMSE)^2
  weightsGLUE[which(weightsGLUE < 0)] <- 0
  weightsGLUE <- weightsGLUE/sum(weightsGLUE)}
  if(weighting=='Stedinger')
  {
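    # se2: the error variance of the best-performing ensemble member, used as
    # the estimate of the measurement error variance in the formal likelihood
    # of Stedinger et al. (2008)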
    se2 <- min(glue.ann$totalVar)
    n <- length(glue.ann$outputData)
    weightsGLUE <- exp((-n/2)* (glue.ann$totalMSE)/(se2) )
  }
  if(weighting=='NSEff')
  {
    weightsGLUE <- (1-(glue.ann$totalMSE/var(glue.ann$outputData)))
    weightsGLUE[which(weightsGLUE < 0)] <- 0
  }
 
  ### Weight model predictions #################################################
  predictions <- NULL
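  # Maximum likelihood estimate: output of the single model with the smallest
  # total error variance (evaluated here for the first input row only)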
  predictions$MLE <- resultsMatrix[1,which.min(glue.ann$totalVar)]    
  if(weighting != 'Stedinger')
  {
    predictions$eMean <- apply(resultsMatrix, 1, "wtd.mean", weights =
    weightsGLUE, normwt=T)
    predictions$eMedian <- apply(resultsMatrix, 1, "wtd.quantile", weights =
    weightsGLUE, probs = c(0.5), normwt=T)
    predictions$eQ025 <- apply(resultsMatrix, 1, "wtd.quantile", weights =
    weightsGLUE, probs = c(0.025), normwt=T)
    predictions$eQ975 <- apply(resultsMatrix, 1, "wtd.quantile", weights =
    weightsGLUE, probs = c(0.975), normwt=T)
    predictions$eMin <- apply(resultsMatrix, 1, "min")
    predictions$eMax <- apply(resultsMatrix, 1, "max")
    predictions$eResults <- resultsMatrix
  }
  if(weighting=='Stedinger')
  {
    resultsMatrix_sted <- matrix(nrow=nrow(inputData), ncol=
    glue.ann$parameters$nSets*glue.ann$parameters$nStedinger)
    for(i in 1:nrow(inputData))
    {
      for(k in 1:glue.ann$parameters$nSets)
      {
        errorVector <- rnorm(glue.ann$parameters$nStedinger,mean=0,sd=sqrt(se2))
        resultsMatrix_sted[i,(((k-1)*glue.ann$parameters$nStedinger)+c(
        1:glue.ann$parameters$nStedinger))] <- errorVector + resultsMatrix[i,k]
      }
    }
    weightsGLUE <- rep(weightsGLUE, each=glue.ann$parameters$nStedinger)
    predictions$eMean <- apply(resultsMatrix_sted, 1, "wtd.mean", weights =
    weightsGLUE, normwt=T)
    for(i in 1:nrow(inputData))
    {
      quantiles <- wtd.quantile(resultsMatrix_sted[i,], weights = weightsGLUE,
      probs = c(0.025, 0.5, 0.975), normwt=T)
      predictions$eMedian[i] <- quantiles[2]
      predictions$eQ025[i] <- quantiles[1]
      predictions$eQ975[i] <- quantiles[3] 
    }
    predictions$eResults <- resultsMatrix_sted
    predictions$eMin <- apply(resultsMatrix_sted, 1, "min")
    predictions$eMax <- apply(resultsMatrix_sted, 1, "max")    
  }
 
  ### Backtransform of predictions according to training data output ###########
  if(glue.ann$parameters$rescale.output)
  {
    outputRange <- max(glue.ann$outputData)-min(glue.ann$outputData)
    outputMin <- min(glue.ann$outputData)
    predictions$MLE <- predictions$MLE * outputRange + outputMin
    predictions$eMean <- predictions$eMean * outputRange + outputMin
    predictions$eMedian <- predictions$eMedian * outputRange + outputMin
    predictions$eQ025 <- predictions$eQ025 * outputRange + outputMin
    predictions$eQ975 <- predictions$eQ975 * outputRange + outputMin
    predictions$eMin <- predictions$eMin * outputRange + outputMin
    predictions$eMax <- predictions$eMax * outputRange + outputMin
    predictions$eResults <- predictions$eResults * outputRange + outputMin
  }
  return(predictions)
}
