(This article was first published on

**Bart Rogiers - Sreigor**, and kindly contributed to R-bloggers)Here is a follow-up of my first post about using R. For our yearly KU Leuven Geology PhD Seminar (08-09/05/2012), I quickly pasted this script together from several examples I had run into in the past, as well as some things that I have been doing myself in R. Be aware these analyses of the Meuse dataset are not always very useful! This was just to demonstrate several numerical/statistical techniques within R, in an attempt to add a few more R converts to the community. Below the example graphs, some explanation about what they represent, and the R code that I sent to all members of the statistics and numerical analysis discussion group.

The Meuse data set consists of 155 samples of top soil heavy metal concentrations (ppm), along with a number of soil and landscape variables, collected in a flood plain of the river Meuse, near the village Stein. Below a map of the relative cadmium concentrations, and the scatterplot matrix for the different metal concentrations.

Linear mixed effects models are suited for linear dependence modelling, when your dataset has one or more categorical variables, and you expect both a general relationship, as well as slight differences between the different classes. What one could do with the Meuse dataset is the following: predict the copper concentrations in function of the lead concentration. The bold black line is the result of linear regression. By allowing an additional random effect for each flood frequency class, one obtains the other 3 lines, i.e. one for each class (this was done with lme4). This is probably not very useful for this dataset, but I have used this extension of linear regression several times by relating different material properties with for example the stratigraphy as categorical variable. This definitely is a useful technique within the geosciences.

Next is the non-linear modelling of the metal concentrations’ dependence on elevation, organic matter content and the distance from the river Meuse. While the concentrations show mutually linear dependence, this is not true for the other three variables, and we can for example use an artificial neural network (data transformations might of course also work, but I wanted to show how to use ANNs in R with AMORE).

The two figures below show the model fit (predicted vs target values) after training a single network, and the leave-one-out cross-validation results to estimate the true predictive error. Especially from the lead and zinc plots, it becomes apparent that the training error is smaller than the true predictive error. Adjusting with the ANN parameters would probably yield better results though, but this was outside of the scope of this demonstration.

Next is data clustering, an unsupervised machine learning technique. In the Meuse dataset, it would be nice to discover different classes based on the metal concentrations. The lead vs copper scatterplot nicely shows different classes due to the distance from the Meuse. Supposed we do not know these distances, are we able to detect these classes just from the concentration data?

Two algorithms are compared here: k-means and model-based clustering. For the first one the number of classes is specified. The plots are created for up to ten classes. The model-based clustering (done with MCLUST) uses expectation maximization methods to estimate the optimal number of classes. This example clearly demonstrates the disadvantages of k-means clustering, as all clusters are equal in size and have a more or less spherical shape. The use of model-based clustering overcomes these problems, and provides a decent classification, according to the distance to the river Meuse, as given in the plot above.

Finally, some geostatistics of course! Below are the examples for the ordinary kriging of the logarithmic zinc concentrations, as provided in the gstat demo.

And here is the script, if you want to try it yourself:

####################################################################

# PhD Seminar KU Leuven Geology Section: R examples, by B. Rogiers #

####################################################################

# Download and install R from: http://cran.freestatistics.org/

# Open this script in RGui, or a more advanced R editor like RStudio: http://rstudio.org/

# Execute the code lines one by one

install.packages(c('gstat','lme4','AMORE','mclust')) # install packages; only necessary the first time!

library(gstat) # load geostatistics package, that contains our dataset

data(meuse) # load the example dataset

?meuse # Help file for the Meuse river data set

plot(meuse$x, meuse$y, cex=(meuse$cadmium)^(1/3), pch=16, xlab='X coordinate',ylab='Y coordinate',main='Cadmium concentration (ppm)') # plot sample locations and cadmium concentration

plot(meuse[,3:6]) # concentration scatterplot matrix

n <- nrow(meuse) # number of samples

# Dependence modelling

# Linear mixed models

library(lme4) # load necessary package

classes <- as.numeric(meuse$ffreq) # choose data classes: flood frequency 1,2,3

plot(meuse$lead, meuse$copper, col=classes, pch=classes, xlab='Lead (ppm)',ylab='Copper (ppm)',main='Flood frequency classes') # plot lead and copper concentrations for each of the classes

abline(lm(meuse$copper ~ meuse$lead)$coef, lwd=2) # draw a fitted linear regression line

?lmer # Help file for the lmer function

lmer(meuse$copper ~ meuse$lead + (meuse$lead|classes)) # fit a linear mixed effects model, with a random effect for each class

coefs <- coef(lmer(meuse$copper ~ meuse$lead + (meuse$lead|classes))) # save the coefficients of this model

for(i in 1:3) abline(a=coefs$classes[i,1], b=coefs$classes[i,2],col=i, lty=i) # plot the 3 resulting lines

# Neural networks

library(AMORE) # load necessary package

plot(meuse[,c(3:7,9,14)]) # scatterplot matrix of some numeric parameters

input <- meuse[-c(42,43),c(7,9,14)] # use elevation, distance to the Meuse, and organic matter content (input data), to model

target <- meuse[-c(42,43),c(3:6)] # the cadmium, copper, lead and zinc concentrations (target values); line 42 and 43 contain NA values

input <- scale(input, center=T, scale=T) # standardize input data

target <- scale(target, center=T, scale=T) # standardize target data

?newff # Help file for the newff function

ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1, error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", output.layer="purelin", method="ADAPTgdwm") # create a neural network

?train # Help file for the train function

trainedANN <- train(ann, input, target, show.step=1, n.shows=1000) # train the neural network

fit <- sim(trainedANN$net, input) # get the predictions for all data

par(mfrow=c(2,2)) # plot 2 x 2 figures

plot(fit[,1], target[,1], xlab='Model prediction', ylab='Target variable', main='Cadmium') # scatterplot of observations vs predictions

plot(fit[,2], target[,2], xlab='Model prediction', ylab='Target variable', main='Copper') # scatterplot of observations vs predictions

plot(fit[,3], target[,3], xlab='Model prediction', ylab='Target variable', main='Lead') # scatterplot of observations vs predictions

plot(fit[,4], target[,4], xlab='Model prediction', ylab='Target variable', main='Zinc') # scatterplot of observations vs predictions

par(mfrow=c(1,1)) # reset plot options

# Uncertainty quantification

# http://episun7.med.utah.edu/~alun/teach/stats/week08.pdf

hist(meuse$cadmium) # histogram of the cadmium concentrations

mean(meuse$cadmium) # mean of cadmium concentrations

var(meuse$cadmium)/n # variance of the sample mean

dataSample <- meuse$cadmium # use all data

#n <- 15;dataSample <- sample(meuse$cadmium, size=n) # use smaller data sample

# Bootstrapping

# estimate variance of a statistic

# good for smaller samples or awkward distributions

# generate set of random samples with replacement to evaluate the statistic

# non-parametric bootstrap:

s <- 1000 # number of resampling cases

t <- rep(0,s) # initialize the resulting vector of means

for (i in 1:s) t[i] = mean(sample(dataSample,replace=TRUE)) # loop through all resampling cases, and calculate the mean

mean(t) # bootstrapped estimate of mean

var(t) # bootstrapped estimate of variance of the mean

# semi-parametric bootstrap:

a <- 0.1 # smoothing parameter

for (i in 1:s) t[i] = mean( sample(dataSample,replace=TRUE) + a * rnorm(n)) # loop through all resampling cases, add random normal error to each sample, and calculate the mean

mean(t) # bootstrapped estimate of mean

var(t) # bootstrapped estimate of variance of the mean

# Jack-knifing

# estimate variance and bias of a statistic

# more orderly version of the bootstrap

# generate n samples by leaving one observation out at a time

t = rep(0,n) # initialize the resulting vector of means

for (i in 1:n) t[i] = mean(dataSample[-i]) # loop through all cases

mean(t) # jack-knifed estimate of the mean

(n-1)/n * sum( (t-mean(t))^2 ) # jack-knifed estimate of the variance of the mean

n * mean(dataSample) - (n-1) * mean(t) # jack-knifed bias-corrected estimate

# Cross-validation

# evaluating predictive error for a model fitted to all data

cv <- function(input=input, target=target, sample=i) # function to predict a sample through cross-validation, with the ANN we created above

{

cvSample <- input[i,];inputCV <- input[-i,];targetCV <- target[-i,] # remove the ith sample from the data

ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1, error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", output.layer="purelin", method="ADAPTgdwm")

trainedANNCV <- train(ann, inputCV, targetCV, show.step=500, n.shows=1)

return(sim(trainedANNCV$net, input)[i,]) # predict the left out sample

}

results <- matrix(nrow=nrow(input), ncol=4) # initialize the results matrix

for(i in 1:nrow(input)) results[i,] <- cv(input, target, i) # loop through all samples, applying the above function

par(mfrow=c(2,2)) # plot 2 x 2 figures

for(i in 1:4) {plot(results[,i], target[,i], xlab='Model prediction',ylab='Target variable',main=c('Cadmium','Copper','Lead','Zinc')[i]); print(mean((target[,i]-results[,i])^2))} # plot observations vs predictions for each element

par(mfrow=c(1,1)) # reset plot options

# Clustering

# Unsupervised machine learning technique to find classes in a dataset

plot(meuse$lead, meuse$copper, cex=(meuse$dist.m)^(1/3)/7, pch=16, xlab='Lead (ppm)',ylab='Copper (ppm)',main='Distance to Meuse river') # let us classify the lead and copper concentrations; two groups are clearly present: close to and far away from the Meuse

# K-means clustering

# http://en.wikipedia.org/wiki/K-means_clustering

?kmeans # Help file for the kmeans function

par(ask=T) # adapt graphical parameters; now press enter in the console window to display plots

for(i in 1:10) # loop from 1 to 10 k-means classes

{

classes <- kmeans(x=data.frame(lead=meuse$lead, copper=meuse$copper), centers=i, iter.max=100, nstart=100)$cluster # classify

plot(meuse$lead, meuse$copper, col=classes, pch=classes,xlab='Lead (ppm)',ylab='Copper (ppm)', main=paste('K-means clustering, with',i,'classes')) # plot data with classes

}

par(ask=F) # reset graphical parameters

# Model-based clustering

# http://www.stat.washington.edu/raftery/Research/mbc.html

library(mclust) # load necessary package

?Mclust # Help file for the Mclust function

classes <- Mclust(data.frame(lead=meuse$lead, copper=meuse$copper))$classification # classify

plot(meuse$lead, meuse$copper, col=classes, pch=classes,xlab='Lead (ppm)',ylab='Copper (ppm)',main='Model-based clustering') # plot data with classes

# Geostatistics

?gstat # Help file for the gstat function

coordinates(meuse) = ~ x+y # Set x and y variables as coordinates

data(meuse.grid) # Load data grid for estimating Zinc concentrations

gridded(meuse.grid) = ~ x+y # Set data as gridded with x and y as coordinates

v <- vgm(0.581, "Sph", 900, nug = 0.0554) # Variogram model

g <- gstat(id = "ln.zinc", form = log(zinc) ~ 1,data = meuse, nmax = 40, nmin = 20, maxdist = 1000, model = v) # Initialize gstat object

plot(variogram(g), model=v) # Plot variogram model fit

x <- predict(g, meuse.grid) # Predict using ordinary kriging

spplot(x["ln.zinc.pred"], main = "log(zinc) ordinary kriging predictions", col.regions=terrain.colors(30)) # Plot ordinary kriging predictions

x$ln.zinc.se = sqrt(x$ln.zinc.var) # convert kriging variance to kriging standard error for plotting

spplot(x["ln.zinc.se"], main = "log(zinc) ordinary kriging prediction standard errors", col.regions=terrain.colors(30)) # Plot ordinary kriging standard error

# need more examples?

demo(cokriging)

demo(examples)

# Other useful info:

# The R project website: http://www.r-project.org/

# List of packages: http://cran.freestatistics.org/web/packages/available_packages_by_name.html

# Visualization package for the R statistical analysis platform, loosely based on "The Grammar of Graphics" from Leland Wilkinson: http://had.co.nz/ggplot2/

# R graph gallery: http://addictedtor.free.fr/graphiques/

# R & LaTeX:

# knitr: http://yihui.name/knitr/

# Sweave: http://www.statistik.lmu.de/~leisch/Sweave/

To

**leave a comment**for the author, please follow the link and comment on their blog:**Bart Rogiers - Sreigor**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...