# Use R! – Part 2

May 9, 2012
By

(This article was first published on Bart Rogiers - Sreigor, and kindly contributed to R-bloggers)

Here is a follow-up of my first post about using R. For our yearly KU Leuven Geology PhD Seminar (08-09/05/2012), I quickly pasted this script together from several examples I had run into in the past, as well as some things that I have been doing myself in R. Be aware these analyses of the Meuse dataset are not always very useful! This was just to demonstrate several numerical/statistical techniques within R, in an attempt to add a few more R converts to the community. Below the example graphs, some explanation about what they represent, and the R code that I sent to all members of the statistics and numerical analysis discussion group.
The Meuse data set consists of 155 samples of top soil heavy metal concentrations (ppm), along with a number of soil and landscape variables, collected in a flood plain of the river Meuse, near the village Stein. Below a map of the relative cadmium concentrations, and the scatterplot matrix for the different metal concentrations.

Linear mixed effects models are suited for linear dependence modelling, when your dataset has one or more categorical variables, and you expect both a general relationship, as well as slight differences between the different classes. What one could do with the Meuse dataset is the following: predict the copper concentrations in function of the lead concentration. The bold black line is the result of linear regression. By allowing an additional random effect for each flood frequency class, one obtains the other 3 lines, i.e. one for each class (this was done with lme4). This is probably not very useful for this dataset, but I have used this extension of linear regression several times by relating different material properties with for example the stratigraphy as categorical variable. This definitely is a useful technique within the geosciences.

Next is the non-linear modelling of the metal concentrations’ dependence on elevation, organic matter content and the distance from the river Meuse. While the concentrations show mutually linear dependence, this is not true for the other three variables, and we can for example use an artificial neural network (data transformations might of course also work, but I wanted to show how to use ANNs in R with AMORE).

The two figures below show the model fit (predicted vs target values) after training a single network, and the leave-one-out cross-validation results to estimate the true predictive error. Especially from the lead and zinc plots, it becomes apparent that the training error is smaller than the true predictive error. Adjusting with the ANN parameters would probably yield better results though, but this was outside of the scope of this demonstration.

Next is data clustering, an unsupervised machine learning technique. In the Meuse dataset, it would be nice to discover different classes based on the metal concentrations. The lead vs copper scatterplot nicely shows different classes due to the distance from the Meuse. Supposed we do not know these distances, are we able to detect these classes just from the concentration data?

Two algorithms are compared here: k-means and model-based clustering. For the first one the number of classes is specified. The plots are created for up to ten classes. The model-based clustering (done with MCLUST) uses expectation maximization methods to estimate the optimal number of classes. This example clearly demonstrates the disadvantages of k-means clustering, as all clusters are equal in size and have a more or less spherical shape. The use of model-based clustering overcomes these problems, and provides a decent classification, according to the distance to the river Meuse, as given in the plot above.

Finally, some geostatistics of course! Below are the examples for the ordinary kriging of the logarithmic zinc concentrations, as provided in the gstat demo.

And here is the script, if you want to try it yourself:

`##################################################################### PhD Seminar KU Leuven Geology Section: R examples, by B. Rogiers ###################################################################### Download and install R from: http://cran.freestatistics.org/# Open this script in RGui, or a more advanced R editor like RStudio: http://rstudio.org/# Execute the code lines one by one install.packages(c('gstat','lme4','AMORE','mclust')) # install packages; only necessary the first time! library(gstat) # load geostatistics package, that contains our datasetdata(meuse) # load the example dataset?meuse # Help file for the Meuse river data setplot(meuse\$x, meuse\$y, cex=(meuse\$cadmium)^(1/3), pch=16, xlab='X coordinate',ylab='Y coordinate',main='Cadmium concentration (ppm)') # plot sample locations and cadmium concentrationplot(meuse[,3:6]) # concentration scatterplot matrixn <- nrow(meuse) # number of samples # Dependence modelling   # Linear mixed models     library(lme4) # load necessary package    classes <- as.numeric(meuse\$ffreq) # choose data classes: flood frequency 1,2,3    plot(meuse\$lead, meuse\$copper, col=classes, pch=classes, xlab='Lead (ppm)',ylab='Copper (ppm)',main='Flood frequency classes') # plot lead and copper concentrations for each of the classes    abline(lm(meuse\$copper ~ meuse\$lead)\$coef, lwd=2) # draw a fitted linear regression line    ?lmer # Help file for the lmer function    lmer(meuse\$copper ~ meuse\$lead + (meuse\$lead|classes)) # fit a linear mixed effects model, with a random effect for each class    coefs <- coef(lmer(meuse\$copper ~ meuse\$lead + (meuse\$lead|classes))) # save the coefficients of this model    for(i in 1:3) abline(a=coefs\$classes[i,1], b=coefs\$classes[i,2],col=i, lty=i) # plot the 3 resulting lines   # Neural networks     library(AMORE) # load necessary package    plot(meuse[,c(3:7,9,14)]) # scatterplot matrix of some numeric parameters    input <- meuse[-c(42,43),c(7,9,14)] # use elevation, distance to the Meuse, and organic matter content (input data), to model    target <- meuse[-c(42,43),c(3:6)] # the cadmium, copper, lead and zinc concentrations (target values); line 42 and 43 contain NA values    input <- scale(input, center=T, scale=T) # standardize input data    target <- scale(target, center=T, scale=T) # standardize target data    ?newff # Help file for the newff function    ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1, error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", output.layer="purelin", method="ADAPTgdwm") # create a neural network    ?train # Help file for the train function    trainedANN <- train(ann, input, target, show.step=1, n.shows=1000) # train the neural network    fit <- sim(trainedANN\$net, input) # get the predictions for all data    par(mfrow=c(2,2)) # plot 2 x 2 figures    plot(fit[,1], target[,1], xlab='Model prediction', ylab='Target variable', main='Cadmium') # scatterplot of observations vs predictions    plot(fit[,2], target[,2], xlab='Model prediction', ylab='Target variable', main='Copper') # scatterplot of observations vs predictions    plot(fit[,3], target[,3], xlab='Model prediction', ylab='Target variable', main='Lead') # scatterplot of observations vs predictions    plot(fit[,4], target[,4], xlab='Model prediction', ylab='Target variable', main='Zinc') # scatterplot of observations vs predictions    par(mfrow=c(1,1)) # reset plot options # Uncertainty quantification# http://episun7.med.utah.edu/~alun/teach/stats/week08.pdf   hist(meuse\$cadmium) # histogram of the cadmium concentrations  mean(meuse\$cadmium) # mean of cadmium concentrations  var(meuse\$cadmium)/n # variance of the sample mean  dataSample <- meuse\$cadmium # use all data  #n <- 15;dataSample <- sample(meuse\$cadmium, size=n) # use smaller data sample   # Bootstrapping  # estimate variance of a statistic  # good for smaller samples or awkward distributions  # generate set of random samples with replacement to evaluate the statistic     # non-parametric bootstrap:       s <- 1000 # number of resampling cases      t <- rep(0,s) # initialize the resulting vector of means      for (i in 1:s) t[i] = mean(sample(dataSample,replace=TRUE)) # loop through all resampling cases, and calculate the mean      mean(t) # bootstrapped estimate of mean      var(t) # bootstrapped estimate of variance of the mean     # semi-parametric bootstrap:       a <- 0.1 # smoothing parameter      for (i in 1:s) t[i] = mean( sample(dataSample,replace=TRUE) + a * rnorm(n)) # loop through all resampling cases, add random normal error to each sample, and calculate the mean      mean(t) # bootstrapped estimate of mean      var(t) # bootstrapped estimate of variance of the mean   # Jack-knifing  # estimate variance and bias of a statistic  # more orderly version of the bootstrap  # generate n samples by leaving one observation out at a time     t = rep(0,n) # initialize the resulting vector of means    for (i in 1:n) t[i] = mean(dataSample[-i]) # loop through all cases    mean(t) # jack-knifed estimate of the mean    (n-1)/n * sum( (t-mean(t))^2 ) # jack-knifed estimate of the variance of the mean    n * mean(dataSample) - (n-1) * mean(t) # jack-knifed bias-corrected estimate   # Cross-validation  # evaluating predictive error for a model fitted to all data     cv <- function(input=input, target=target, sample=i) # function to predict a sample through cross-validation, with the ANN we created above    {      cvSample <- input[i,];inputCV <- input[-i,];targetCV <- target[-i,] # remove the ith sample from the data      ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1, error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", output.layer="purelin", method="ADAPTgdwm")      trainedANNCV <- train(ann, inputCV, targetCV, show.step=500, n.shows=1)      return(sim(trainedANNCV\$net, input)[i,]) # predict the left out sample    }    results <- matrix(nrow=nrow(input), ncol=4) # initialize the results matrix    for(i in 1:nrow(input)) results[i,] <- cv(input, target, i) # loop through all samples, applying the above function    par(mfrow=c(2,2)) # plot 2 x 2 figures    for(i in 1:4) {plot(results[,i], target[,i], xlab='Model prediction',ylab='Target variable',main=c('Cadmium','Copper','Lead','Zinc')[i]); print(mean((target[,i]-results[,i])^2))} # plot observations vs predictions for each element    par(mfrow=c(1,1)) # reset plot options # Clustering# Unsupervised machine learning technique to find classes in a dataset     plot(meuse\$lead, meuse\$copper, cex=(meuse\$dist.m)^(1/3)/7, pch=16, xlab='Lead (ppm)',ylab='Copper (ppm)',main='Distance to Meuse river') # let us classify the lead and copper concentrations; two groups are clearly present: close to and far away from the Meuse   # K-means clustering  # http://en.wikipedia.org/wiki/K-means_clustering     ?kmeans # Help file for the kmeans function    par(ask=T) # adapt graphical parameters; now press enter in the console window to display plots    for(i in 1:10) # loop from 1 to 10 k-means classes    {      classes <- kmeans(x=data.frame(lead=meuse\$lead, copper=meuse\$copper), centers=i, iter.max=100, nstart=100)\$cluster # classify      plot(meuse\$lead, meuse\$copper, col=classes, pch=classes,xlab='Lead (ppm)',ylab='Copper (ppm)', main=paste('K-means clustering, with',i,'classes')) # plot data with classes    }    par(ask=F) # reset graphical parameters   # Model-based clustering  # http://www.stat.washington.edu/raftery/Research/mbc.html     library(mclust) # load necessary package    ?Mclust # Help file for the Mclust function    classes <- Mclust(data.frame(lead=meuse\$lead, copper=meuse\$copper))\$classification # classify    plot(meuse\$lead, meuse\$copper, col=classes, pch=classes,xlab='Lead (ppm)',ylab='Copper (ppm)',main='Model-based clustering') # plot data with classes # Geostatistics   ?gstat # Help file for the gstat function  coordinates(meuse) = ~ x+y # Set x and y variables as coordinates  data(meuse.grid) # Load data grid for estimating Zinc concentrations  gridded(meuse.grid) = ~ x+y # Set data as gridded with x and y as coordinates  v <- vgm(0.581, "Sph", 900, nug = 0.0554) # Variogram model  g <- gstat(id = "ln.zinc", form = log(zinc) ~ 1,data = meuse, nmax = 40, nmin = 20, maxdist = 1000, model = v) # Initialize gstat object  plot(variogram(g), model=v) # Plot variogram model fit  x <- predict(g, meuse.grid) # Predict using ordinary kriging  spplot(x["ln.zinc.pred"], main = "log(zinc) ordinary kriging predictions", col.regions=terrain.colors(30)) # Plot ordinary kriging predictions  x\$ln.zinc.se = sqrt(x\$ln.zinc.var) # convert kriging variance to kriging standard error for plotting  spplot(x["ln.zinc.se"], main = "log(zinc) ordinary kriging prediction standard errors", col.regions=terrain.colors(30)) # Plot ordinary kriging standard error   # need more examples?     demo(cokriging)    demo(examples) # Other useful info:   # The R project website: http://www.r-project.org/  # List of packages: http://cran.freestatistics.org/web/packages/available_packages_by_name.html  # Visualization package for the R statistical analysis platform, loosely based on "The Grammar of Graphics" from Leland Wilkinson: http://had.co.nz/ggplot2/  # R graph gallery: http://addictedtor.free.fr/graphiques/  # R & LaTeX:     # knitr: http://yihui.name/knitr/    # Sweave: http://www.statistik.lmu.de/~leisch/Sweave/`
Created by Pretty R at inside-R.org

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...