Estimation of hydraulic conductivity and its uncertainty from grain-size data using GLUE and artificial neural networks.
Abstract
Various approaches exist to relate saturated hydraulic conductivity (Ks) to grain-size data. Most methods use a single grain-size parameter and hence omit the information encompassed by the entire grain-size distribution. This study compares two data-driven modelling methods—multiple linear regression and artificial neural networks—that use the entire grain-size distribution data as input for Ks prediction. Besides the predictive capacity of the methods, the uncertainty associated with the model predictions is also evaluated, since such information is important for stochastic groundwater flow and contaminant transport modelling.
Artificial neural networks (ANNs) are combined with a generalised likelihood uncertainty estimation (GLUE) approach to predict Ks from grain-size data. The resulting GLUE-ANN hydraulic conductivity predictions and associated uncertainty estimates are compared with those obtained from the multiple linear regression models by leave-one-out cross-validation. The GLUE-ANN ensemble prediction proved to be slightly better than multiple linear regression, while the prediction uncertainty was reduced by half an order of magnitude on average, and by up to an order of magnitude at best. This demonstrates that the proposed method outperforms classical data-driven modelling techniques. Moreover, a comparison with methods from the literature demonstrates the importance of site-specific calibration. The data set used for this purpose originates mainly from unconsolidated sandy sediments of the Neogene aquifer, northern Belgium. The predictive models are developed from 173 grain-size–Ks pairs. Finally, an application of the optimised models is presented for a borehole lacking Ks data.
Keywords
Early stopping – Cross-validation – Generalised likelihood uncertainty estimation – Artificial neural networks – Sedimentary aquifer – Data-driven modelling – Likelihood measures – Principal component analysis – GLUE-ANN
Reference
Rogiers, B., Mallants, D., Batelaan, O., Gedeon, M., Huysmans, M., Dassargues, A. 2012. Estimation of hydraulic conductivity and its uncertainty from grain-size data using GLUE and artificial neural networks. Mathematical Geosciences 44: 739–763.
An example script for performing such an analysis in the R language is provided with the paper; the most up-to-date version can be found below. Two simple examples are included:
The first models y = x² on the range x ∈ [0, 100]. It produces two figures: one with the results for the training data, and one predicting new data on the range [-100, 200]. As the second figure makes clear, the model cannot be expected to predict data outside the training range.
The second example performs a leave-one-out cross-validation with the airquality dataset; the general pattern that the cv option automates is sketched just below.
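For readers unfamiliar with leave-one-out cross-validation: each sample is held out in turn, the model is trained on the remaining samples, and the held-out sample is then predicted. The cv=T option of train.glue.ann automates this for the whole ensemble. A minimal sketch of the pattern, using lm purely as an illustrative stand-in model (it is not part of the GLUE-ANN code):

# Leave-one-out cross-validation pattern (sketch; lm is a stand-in model)
df <- na.omit(airquality)[, 1:4]                 # Ozone, Solar.R, Wind, Temp
loo <- sapply(seq_len(nrow(df)), function(i) {
  fit <- lm(Ozone ~ ., data = df[-i, ])          # train without sample i
  predict(fit, newdata = df[i, , drop = FALSE])  # predict the held-out sample
})
sqrt(mean((df$Ozone - loo)^2))                   # leave-one-out RMSE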
R script with examples of GLUE-ANN analyses:
################################################################################
### GLUE-ANN parameters + examples (Rogiers et al. 2012)                     ###
################################################################################

# A dataframe has to be provided with the independent variable(s), and a vector
# with the dependent variable. A list of parameters has to be set to define the
# a priori distributions of the stochastic parameters, and fixed values for the
# ones considered to be constant.
# Be sure to load the additional GLUE-ANN functions into R before trying
# the examples provided below. The AMORE and Hmisc packages should be
# installed and loaded:
install.packages(c('AMORE', 'Hmisc'))
library(AMORE)
library(Hmisc)

### Parameter list #############################################################
# A list of all required parameters with their default values.

### Pre-processing parameters ##################################################
# Orthogonalise input data; options are: 'Cov', 'Cor', and 'No'
orthogonal <- 'Cor'
# Number of input variables, provide a vector for stochastic treatment
variableNumber <- c(1:ncol(inputData))
# T: always select the most variance-explaining PCs; F: choose PCs randomly
dimRed <- T
# Rescale output to the 0-1 range
rescale.output <- T

### ANN parameters #############################################################
# Enable early stopping
ES <- T
# Part of the data used for early stopping
randomESPart <- 0.5
# Early stopping samples should have a similar mean as the training samples
nonEqualMeans <- T
# Maximum difference between early stopping and training observation means
maxMeanDiff <- mean(outputData)*0.2
# Maximum number of training cycles, provide a vector for stochastic treatment
nCycles <- 100
# Number of hidden layers, provide a vector for stochastic treatment
HLrange <- c(1:1)
# Number of hidden nodes, provide a vector for stochastic treatment
HNrange <- c(1:10)
# Maximum number of hidden nodes per input variable
maxHNPerVar <- Inf
# Hidden layer activation function; options are 'tansig', 'sigmoid', 'purelin'
hidden.layer <- 'tansig'
# Output layer activation function; options are 'tansig', 'sigmoid', 'purelin'
output.layer <- 'sigmoid'

### GLUE parameters ############################################################
# Number of models in the ensemble
nSets <- 100
# Number of random variates for weights based on Stedinger et al (2008)
nStedinger <- 50
# GLUE weighting procedure; options are 'MSE', 'MSEes' (for weights based on
# the early stopping subset performance), 'NSEff' and 'Stedinger'
weighting <- 'Stedinger'

### Cross-validation parameters ################################################
# Use leave-one-out cross-validation
cv <- T

### Examples ###################################################################

### Example of training and prediction for fictional data ######################
# Create fictional data for the x range [1,100]
inputData <- data.frame(x=seq(1, 100, 0.5))
# Dependent variable y = x^2
outputData <- inputData$x ^ 2
# Add random noise to y, representing different sources of error
outputData <- jitter(outputData, amount=mean(outputData)*0.2)

# Run the GLUE-ANN functions
glue.ann.ensemble <- train.glue.ann(inputData, outputData, nCycles=500)
training.results <- predict.glue.ann(inputData, glue.ann.ensemble)

# Visualise the results for the training data
plot(inputData$x, training.results$eMean, type='l', xlab='x',
     ylab=expression(y == x^2 + epsilon), ylim=c(-2000, 13000))
lines(inputData$x, glue.ann.ensemble$outputData, lty=2, col='red')
lines(inputData$x, training.results$eQ025, lty=3, col='blue')
lines(inputData$x, training.results$eQ975, lty=3, col='blue')
legend('topleft', c('GLUE-ANN ensemble mean prediction', 'Observations',
       '95% prediction uncertainty'), lty=c(1, 2, 3),
       col=c('black', 'red', 'blue'), bty='n')

# Predict new data
newData <- data.frame(x=c(-100:200)+0.5)
newData.predictions <- predict.glue.ann(newData, glue.ann.ensemble)

# Visualise the results of the prediction
plot(newData$x, newData.predictions$eMean, type='l', xlab='x',
     ylab=expression(y == x^2 + epsilon), ylim=c(-2000, 13000))
lines(newData$x, jitter(newData$x^2, amount=mean(outputData)*0.2), lty=2,
      col='red')
lines(newData$x, newData.predictions$eQ025, lty=3, col='blue')
lines(newData$x, newData.predictions$eQ975, lty=3, col='blue')
abline(v=c(0, 100))
text(50, 13000, 'Training data range')
legend('bottomright', c('GLUE-ANN ensemble mean prediction', 'Observations',
       '95% prediction uncertainty'), lty=c(1, 2, 3),
       col=c('black', 'red', 'blue'), bty='n')

### Example of cross-validation with an R dataset ##############################
# Read input and output data (ozone concentration as a function of wind,
# temperature and solar radiation)
outputData <- na.omit(airquality)[, 1]
inputData <- na.omit(airquality)[, 2:4]

# Perform the cross-validation
cross_validation <- train.glue.ann(inputData, outputData, cv=T, nCycles=1000,
                                   nSets=10)

# Visualisation
plot(cross_validation$target, ylim=c(-100, 200), xlab='Sample number',
     ylab='Ozone')
points(cross_validation$eMean, pch=2, col='red')
lines(cross_validation$eQ025, lty=3, col='blue')
lines(cross_validation$eQ975, lty=3, col='blue')
legend('bottomright', c('Observations', 'GLUE-ANN ensemble mean prediction',
       '95% prediction uncertainty'), pch=c(1, 2, NA), lty=c(NA, NA, 3),
       col=c('black', 'red', 'blue'), bty='n')
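The cross-validation example returns the observed targets together with the ensemble mean and the 2.5% and 97.5% quantile predictions, so the leave-one-out performance is easy to summarise. A possible follow-up, not part of the original script, using only fields returned by train.glue.ann with cv=T:

# Summarise the leave-one-out performance of the GLUE-ANN ensemble
rmse <- sqrt(mean((cross_validation$target - cross_validation$eMean)^2))
# Fraction of observations falling inside the 95% prediction interval;
# for well-calibrated uncertainty estimates this should be close to 0.95
coverage <- mean(cross_validation$target >= cross_validation$eQ025 &
                 cross_validation$target <= cross_validation$eQ975)
print(c(RMSE = rmse, coverage = coverage))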
R functions for GLUE-ANN analysis with the AMORE package (Castejón Limas et al 2010):
################################################################################
### GLUE-ANN - functions (Rogiers et al. 2012)                               ###
################################################################################

### ANN training function ######################################################
nncalc <- function(Ptrain, Pval, targetTrain, targetVal, architecture, n,
                   Stao=1, es=T, hidden.layer="sigmoid", output.layer="purelin")
{
  neuralnetwork <- newff(n.neurons=architecture, learning.rate.global=1e-2,
                         momentum.global=0.5, error.criterium='LMS', Stao=Stao,
                         hidden.layer=hidden.layer, output.layer=output.layer,
                         method="ADAPTgdwm")
  if(!es) { Pval <- NULL; targetVal <- NULL }
  capture.output(results <- train(neuralnetwork, Ptrain, targetTrain,
                                  Stao=Stao, Pval=Pval, Tval=targetVal,
                                  error.criterium='LMS', report=T, show.step=1,
                                  n.shows=n))
  return(results)
}

### Sampling parameter vectors function ########################################
resample <- function(x, n, ...) x[sample.int(length(x), size=n, ...)]

### Model ensemble training function ###########################################
train.glue.ann <- function(inputData, outputData, orthogonal='Cor',
    variableNumber=c(1:ncol(inputData)), dimRed=T, rescale.output=T, ES=T,
    randomESPart=0.5, nonEqualMeans=T, maxMeanDiff=mean(outputData)*0.2,
    nCycles=100, HLrange=c(1:1), HNrange=c(1:10), maxHNPerVar=Inf,
    hidden.layer='tansig', output.layer='sigmoid', nSets=100, nStedinger=50,
    weighting='Stedinger', cv=F)
{
  ### Initialisation ###########################################################
  parameters <- NULL
  parameters$orthogonal <- orthogonal
  parameters$variableNumber <- variableNumber
  parameters$dimRed <- dimRed
  parameters$rescale.output <- rescale.output
  parameters$ES <- ES
  parameters$randomESPart <- randomESPart
  parameters$nonEqualMeans <- nonEqualMeans
  parameters$maxMeanDiff <- maxMeanDiff
  parameters$nCycles <- nCycles
  parameters$HLrange <- HLrange
  parameters$HNrange <- HNrange
  parameters$maxHNPerVar <- maxHNPerVar
  parameters$hidden.layer <- hidden.layer
  parameters$output.layer <- output.layer
  parameters$nSets <- nSets
  parameters$nStedinger <- nStedinger
  parameters$weighting <- weighting
  parameters$cv <- cv
  glue.ann <- NULL
  cvPredictions <- NULL
  glue.ann$parameters <- parameters
  glue.ann$esNumber <- matrix(nrow=nrow(inputData), ncol=nSets)

  ### Training loop ############################################################
  for(crossValidationSampleNumber in 1:ifelse(cv, nrow(inputData), 1))
  {
    if(cv) {
      inputDataModellingCV <- as.data.frame(
        inputData[crossValidationSampleNumber,])
      inputDataModelling <- as.data.frame(
        inputData[-crossValidationSampleNumber,])
      outputDataModellingCV <- outputData[crossValidationSampleNumber]
      outputDataModelling <- outputData[-crossValidationSampleNumber]
    } else {
      inputDataModelling <- inputData
      outputDataModelling <- outputData
    }
    names(inputDataModelling) <- c(1:ncol(inputDataModelling))
    if(cv) names(inputDataModellingCV) <- c(1:ncol(inputDataModellingCV))
    glue.ann$inputData <- inputDataModelling
    glue.ann$outputData <- outputDataModelling
    P <- cbind(inputDataModelling)
    if(orthogonal != 'No') {
      Ppca <- princomp(P, cor=ifelse(orthogonal=='Cor', T, F))
      P <- Ppca$scores
    }
    trans <- P
    glue.ann$P <- P
    # Standardise the (orthogonalised) input variables
    for(i in 1:ncol(inputData)) trans[,i] <- (trans[,i]-mean(P[,i]))/sd(P[,i])
    # Rescale the output to the 0-1 range
    if(rescale.output) outputDataModelling <- (outputDataModelling -
      min(glue.ann$outputData)) / (max(glue.ann$outputData) -
      min(glue.ann$outputData))
    if(rescale.output & cv) outputDataModellingCV <- (outputDataModellingCV -
      min(glue.ann$outputData)) / (max(glue.ann$outputData) -
      min(glue.ann$outputData))
    variableslist <- list(); networks <- list(); esNumber <- NULL
    earlyStoppingMSE <- NULL; earlyStoppingCycle <- NULL; hn <- list()
    totalVar <- NULL; trainingMSE <- NULL
    for(j in 1:nSets)
    {
      if(cv) cat(paste('# Cross-validating sample',
                       crossValidationSampleNumber, ' '))
      # Sample an early stopping subset, optionally with a mean similar to
      # that of the training subset
      esDataNumbers <- sample(1:nrow(inputDataModelling),
                              round(nrow(inputDataModelling)*randomESPart, 0))
      while(nonEqualMeans) {
        esDataNumbers <- sample(1:nrow(inputDataModelling),
                                round(nrow(inputDataModelling)*randomESPart, 0))
        esMean <- mean(outputDataModelling[esDataNumbers])
        trainMean <- mean(outputDataModelling[-esDataNumbers])
        if(abs(esMean-trainMean) < maxMeanDiff) nonEqualMeans <- F
      }
      trainingDataNumbers <- c(1:nrow(inputDataModelling))[-esDataNumbers]
      if(!ES) {
        trainingDataNumbers <- c(1:nrow(inputDataModelling))
        esDataNumbers <- trainingDataNumbers
      }
      # Sample the ANN inputs and architecture
      nvars <- resample(variableNumber, 1)
      variables <- resample(c(1:ncol(inputDataModelling)), nvars)
      if(dimRed) variables <- c(1:length(variables))
      variableslist[[j]] <- variables
      hiddenLayers <- resample(HLrange, 1, replace=T)
      hiddenNodes <- resample(HNrange, hiddenLayers, replace=T)
      for(i in 1:hiddenLayers) if(maxHNPerVar < hiddenNodes[i]/nvars) {
        hiddenNodes[i] <- resample((min(HNrange):maxHNPerVar), 1, replace=T)
      }
      ncyc <- resample(nCycles, 1)
      cat('# Training ANN nr. ', j, ', with architecture: ', sep='')
      cat(c(nvars, hiddenNodes[1:hiddenLayers], 1), sep='-')
      # Train the ANN
      Ptrain <- as.data.frame(trans[trainingDataNumbers, variables])
      Pval <- as.data.frame(trans[esDataNumbers, variables])
      targetTrain <- outputDataModelling[trainingDataNumbers]
      targetVal <- outputDataModelling[esDataNumbers]
      networks[[j]] <- nncalc(Ptrain, Pval, targetTrain, targetVal,
                              architecture=c(nvars, hiddenNodes, 1), n=ncyc,
                              Stao=10, es=ES, hidden.layer=hidden.layer,
                              output.layer=output.layer)
      if(ES) esNumber[j] <- which.min(networks[[j]]$Merror[,2])
      # Collect performance measures
      predTrain <- sim(networks[[j]]$net, Ptrain)
      trainingMSE[j] <- mean((targetTrain-predTrain)^2)
      if(ES) predEarlyStopping <- sim(networks[[j]]$net, Pval)
      if(ES) earlyStoppingMSE[j] <- mean((targetVal-predEarlyStopping)^2)
      if(ES) totalVar[j] <- var(c(predTrain-targetTrain,
                                  predEarlyStopping-targetVal))
      if(!ES) totalVar[j] <- var(c(predTrain-targetTrain))
      if(ES) earlyStoppingCycle[j] <- which.min(networks[[j]]$Merror[,2])
      hn[[j]] <- hiddenNodes
      if(ES) cat(paste(' # Training MSE: ', round(trainingMSE[j], 2),
                       ', early stopping MSE: ', round(earlyStoppingMSE[j], 2),
                       ', cycles: ', earlyStoppingCycle[j], '\n', sep=''))
      if(!ES) cat(paste(' # Training MSE: ', round(trainingMSE[j], 2),
                        ', cycles: ', ncyc, '\n', sep=''))
    }
    glue.ann$networks <- networks
    glue.ann$variables <- variableslist
    glue.ann$trainMSE <- trainingMSE
    if(ES) glue.ann$esMSE <- earlyStoppingMSE
    if(ES) glue.ann$totalMSE <- (1-randomESPart) * glue.ann$trainMSE +
      randomESPart * glue.ann$esMSE
    if(!ES) glue.ann$totalMSE <- glue.ann$trainMSE
    glue.ann$totalVar <- totalVar
    glue.ann$hn <- hn
    if(orthogonal != 'No') glue.ann$pca <- Ppca

    ### Predict cross-validated sample #########################################
    if(cv) {
      cvPrediction <- predict.glue.ann(inputDataModellingCV, glue.ann)
      cvPredictions$MSE[crossValidationSampleNumber] <- mean(
        (outputData[crossValidationSampleNumber] - cvPrediction$eMean)^2)
      cvPredictions$eMean[crossValidationSampleNumber] <- cvPrediction$eMean
      cvPredictions$eMedian[crossValidationSampleNumber] <- cvPrediction$eMedian
      cvPredictions$eMin[crossValidationSampleNumber] <- cvPrediction$eMin
      cvPredictions$eMax[crossValidationSampleNumber] <- cvPrediction$eMax
      cvPredictions$eQ025[crossValidationSampleNumber] <- cvPrediction$eQ025
      cvPredictions$eQ975[crossValidationSampleNumber] <- cvPrediction$eQ975
      cvPredictions$MLE[crossValidationSampleNumber] <- cvPrediction$MLE
      cvPredictions$target[crossValidationSampleNumber] <-
        outputData[crossValidationSampleNumber]
    }
  }
  if(cv) { return(cvPredictions) } else { return(glue.ann) }
}

### Model ensemble prediction function #########################################
predict.glue.ann <- function(inputData, glue.ann,
                             weighting=glue.ann$parameters$weighting)
{
  names(inputData) <- c(1:ncol(inputData))
  # Orthogonalise
  if(glue.ann$parameters$orthogonal != 'No')
    inputData <- predict(glue.ann$pca, newdata=inputData)
  # Standardise
  for(i in 1:ncol(inputData))
    inputData[,i] <- (inputData[,i]-mean(glue.ann$P[,i]))/sd(glue.ann$P[,i])

  ### Predict model ensemble results for new data ##############################
  resultsMatrix <- matrix(nrow=nrow(inputData), ncol=glue.ann$parameters$nSets)
  if(nrow(inputData) > 1) {
    for(i in 1:glue.ann$parameters$nSets) {
      resultsMatrix[,i] <- sim(glue.ann$networks[[i]]$net,
                               inputData[,glue.ann$variables[[i]]])
    }
  }
  if(nrow(inputData) == 1) {
    for(i in 1:glue.ann$parameters$nSets) {
      resultsMatrix[,i] <- sim(glue.ann$networks[[i]]$net,
                               t(inputData[,glue.ann$variables[[i]]]))
    }
  }

  ### Construct GLUE weights ###################################################
  if(weighting=='MSEes') {
    weightsGLUE <- max(glue.ann$esMSE) - glue.ann$esMSE
    weightsGLUE[which(weightsGLUE < 0)] <- 0
    weightsGLUE <- weightsGLUE/sum(weightsGLUE)
  }
  if(weighting=='MSE') {
    weightsGLUE <- (max(glue.ann$totalMSE) - glue.ann$totalMSE)^2
    weightsGLUE[which(weightsGLUE < 0)] <- 0
    weightsGLUE <- weightsGLUE/sum(weightsGLUE)
  }
  if(weighting=='Stedinger') {
    se2 <- min(glue.ann$totalVar)
    n <- length(glue.ann$outputData)
    weightsGLUE <- exp((-n/2) * glue.ann$totalMSE / se2)
  }
  if(weighting=='NSEff') {
    weightsGLUE <- 1 - glue.ann$totalMSE/var(glue.ann$outputData)
    weightsGLUE[which(weightsGLUE < 0)] <- 0
  }

  ### Weight model predictions #################################################
  predictions <- NULL
  # Prediction of the single most likely model (first input row)
  predictions$MLE <- resultsMatrix[1, which.min(glue.ann$totalVar)]
  if(weighting != 'Stedinger') {
    predictions$eMean <- apply(resultsMatrix, 1, "wtd.mean",
                               weights=weightsGLUE, normwt=T)
    predictions$eMedian <- apply(resultsMatrix, 1, "wtd.quantile",
                                 weights=weightsGLUE, probs=c(0.5), normwt=T)
    predictions$eQ025 <- apply(resultsMatrix, 1, "wtd.quantile",
                               weights=weightsGLUE, probs=c(0.025), normwt=T)
    predictions$eQ975 <- apply(resultsMatrix, 1, "wtd.quantile",
                               weights=weightsGLUE, probs=c(0.975), normwt=T)
    predictions$eMin <- apply(resultsMatrix, 1, "min")
    predictions$eMax <- apply(resultsMatrix, 1, "max")
    predictions$eResults <- resultsMatrix
  }
  if(weighting=='Stedinger') {
    # Add nStedinger random error realisations to each member prediction
    resultsMatrix_sted <- matrix(nrow=nrow(inputData),
      ncol=glue.ann$parameters$nSets*glue.ann$parameters$nStedinger)
    for(i in 1:nrow(inputData)) {
      for(k in 1:glue.ann$parameters$nSets) {
        errorVector <- rnorm(glue.ann$parameters$nStedinger, mean=0,
                             sd=sqrt(se2))
        resultsMatrix_sted[i, ((k-1)*glue.ann$parameters$nStedinger) +
          c(1:glue.ann$parameters$nStedinger)] <- errorVector +
          resultsMatrix[i,k]
      }
    }
    weightsGLUE <- rep(weightsGLUE, each=glue.ann$parameters$nStedinger)
    predictions$eMean <- apply(resultsMatrix_sted, 1, "wtd.mean",
                               weights=weightsGLUE, normwt=T)
    for(i in 1:nrow(inputData)) {
      quantiles <- wtd.quantile(resultsMatrix_sted[i,], weights=weightsGLUE,
                                probs=c(0.025, 0.5, 0.975), normwt=T)
      predictions$eMedian[i] <- quantiles[2]
      predictions$eQ025[i] <- quantiles[1]
      predictions$eQ975[i] <- quantiles[3]
    }
    predictions$eResults <- resultsMatrix_sted
    predictions$eMin <- apply(resultsMatrix_sted, 1, "min")
    predictions$eMax <- apply(resultsMatrix_sted, 1, "max")
  }

  ### Backtransform predictions according to the training data output ##########
  if(glue.ann$parameters$rescale.output) {
    outputRange <- max(glue.ann$outputData) - min(glue.ann$outputData)
    outputMin <- min(glue.ann$outputData)
    predictions$MLE <- predictions$MLE * outputRange + outputMin
    predictions$eMean <- predictions$eMean * outputRange + outputMin
    predictions$eMedian <- predictions$eMedian * outputRange + outputMin
    predictions$eQ025 <- predictions$eQ025 * outputRange + outputMin
    predictions$eQ975 <- predictions$eQ975 * outputRange + outputMin
    predictions$eMin <- predictions$eMin * outputRange + outputMin
    predictions$eMax <- predictions$eMax * outputRange + outputMin
    predictions$eResults <- predictions$eResults * outputRange + outputMin
  }
  return(predictions)
}
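The weighting='Stedinger' option above assigns each ensemble member a pseudo-likelihood weight following Stedinger et al (2008), w_j ∝ exp(-n · MSE_j / (2 · σ²)), where n is the number of calibration samples and σ² is estimated as the smallest total error variance in the ensemble. A standalone sketch of this computation (the names mse_j and se2_min are illustrative, not taken from the functions above):

# Stedinger et al (2008) pseudo-likelihood weights, as used by the
# weighting='Stedinger' option (standalone sketch; names are illustrative)
stedinger_weights <- function(mse_j, se2_min, n) {
  w <- exp(-(n / 2) * mse_j / se2_min)  # lower-MSE models get exponentially
  w / sum(w)                            # larger weight; normalise to sum to 1
}
# Example: three models, 173 calibration samples (as in the paper's data set)
round(stedinger_weights(mse_j = c(0.010, 0.012, 0.020), se2_min = 0.008,
                        n = 173), 3)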