Use R! – Part 2

May 9, 2012



Here is a follow-up to my first post about using R. For our yearly KU Leuven Geology PhD Seminar (08-09/05/2012), I quickly pasted this script together from several examples I had run into in the past, as well as some things I have been doing myself in R. Be aware that these analyses of the Meuse dataset are not always very useful! The point was simply to demonstrate several numerical/statistical techniques within R, in an attempt to add a few more R converts to the community. Below are the example graphs, some explanation of what they represent, and the R code that I sent to all members of the statistics and numerical analysis discussion group.

The Meuse dataset consists of 155 samples of top soil heavy metal concentrations (ppm), along with a number of soil and landscape variables, collected in a flood plain of the river Meuse, near the village of Stein. Below are a map of the relative cadmium concentrations, and the scatterplot matrix for the different metal concentrations.


Linear mixed effects models are suited for linear dependence modelling when your dataset has one or more categorical variables, and you expect both a general relationship and slight differences between the classes. One thing we can do with the Meuse dataset is predict the copper concentration as a function of the lead concentration. The bold black line is the result of ordinary linear regression. By allowing an additional random effect for each flood frequency class, one obtains the other three lines, i.e. one per class (this was done with lme4). This is probably not very useful for this particular dataset, but I have used this extension of linear regression several times myself, relating different material properties with, for example, the stratigraphy as categorical variable. It is definitely a useful technique within the geosciences.


Next is non-linear modelling of the metal concentrations' dependence on elevation, organic matter content and the distance from the river Meuse. While the concentrations depend more or less linearly on each other, this is not true for these three variables, so we can, for example, use an artificial neural network (data transformations might of course also work, but I wanted to show how to use ANNs in R with AMORE).


The two figures below show the model fit (predicted vs target values) after training a single network, and the leave-one-out cross-validation results, which estimate the true predictive error. Especially from the lead and zinc plots, it becomes apparent that the training error is smaller than the true predictive error. Tuning the ANN parameters would probably yield better results, but that was outside the scope of this demonstration.


Next is data clustering, an unsupervised machine learning technique. For the Meuse dataset, it would be nice to discover different classes based on the metal concentrations. The lead vs copper scatterplot nicely shows different classes related to the distance from the Meuse. Suppose we do not know these distances: can we detect these classes from the concentration data alone?


Two algorithms are compared here: k-means and model-based clustering. For k-means, the number of classes has to be specified in advance; plots are created for up to ten classes. Model-based clustering (done with MCLUST) fits the data with expectation-maximization methods and estimates the optimal number of classes. This example clearly demonstrates the disadvantages of k-means clustering, as all clusters come out roughly equal in size and more or less spherical in shape. Model-based clustering overcomes these problems, and provides a decent classification according to the distance to the river Meuse, as shown in the plot above.


Finally, some geostatistics of course! Below are the examples for ordinary kriging of the log-transformed zinc concentrations, as provided in the gstat demo.



And here is the script, if you want to try it yourself:


####################################################################
# PhD Seminar KU Leuven Geology Section: R examples, by B. Rogiers #
####################################################################
# Download and install R from: http://cran.freestatistics.org/
# Open this script in RGui, or a more advanced R editor like RStudio: http://rstudio.org/
# Execute the code lines one by one
 
install.packages(c('gstat','lme4','AMORE','mclust')) # install packages; only necessary the first time!
 
library(gstat) # load the geostatistics package that contains our dataset
data(meuse) # load the example dataset
?meuse # Help file for the Meuse river data set
plot(meuse$x, meuse$y, cex=(meuse$cadmium)^(1/3), pch=16, xlab='X coordinate',ylab='Y coordinate',main='Cadmium concentration (ppm)') # plot sample locations and cadmium concentration
plot(meuse[,3:6]) # concentration scatterplot matrix
n <- nrow(meuse) # number of samples
 
# Dependence modelling
 
# Linear mixed models
 
library(lme4) # load necessary package
classes <- as.numeric(meuse$ffreq) # choose data classes: flood frequency 1,2,3
plot(meuse$lead, meuse$copper, col=classes, pch=classes, xlab='Lead (ppm)',ylab='Copper (ppm)',main='Flood frequency classes') # plot lead and copper concentrations for each of the classes
abline(lm(meuse$copper ~ meuse$lead)$coef, lwd=2) # draw a fitted linear regression line
?lmer # Help file for the lmer function
lmm <- lmer(meuse$copper ~ meuse$lead + (meuse$lead|classes)) # fit a linear mixed effects model, with a random effect for each class
lmm # inspect the fitted model
coefs <- coef(lmm) # save the coefficients of this model
for(i in 1:3) abline(a=coefs$classes[i,1], b=coefs$classes[i,2],col=i, lty=i) # plot the 3 resulting lines
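# equivalently, and more idiomatically in lme4, one can use the formula interface with
# the data argument, grouping directly on the flood frequency factor (a sketch; this
# should fit the same model as above):
lmer(copper ~ lead + (lead | ffreq), data = meuse)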
 
# Neural networks
 
library(AMORE) # load necessary package
plot(meuse[,c(3:7,9,14)]) # scatterplot matrix of some numeric parameters
input <- meuse[-c(42,43),c(7,9,14)] # use elevation, distance to the Meuse, and organic matter content (input data), to model
target <- meuse[-c(42,43),c(3:6)] # the cadmium, copper, lead and zinc concentrations (target values); rows 42 and 43 contain NA values
input <- scale(input, center=T, scale=T) # standardize input data
target <- scale(target, center=T, scale=T) # standardize target data
?newff # Help file for the newff function
ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1, error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", output.layer="purelin", method="ADAPTgdwm") # create a neural network
?train # Help file for the train function
trainedANN <- train(ann, input, target, show.step=1, n.shows=1000) # train the neural network
fit <- sim(trainedANN$net, input) # get the predictions for all data
par(mfrow=c(2,2)) # plot 2 x 2 figures
plot(fit[,1], target[,1], xlab='Model prediction', ylab='Target variable', main='Cadmium') # scatterplot of observations vs predictions
plot(fit[,2], target[,2], xlab='Model prediction', ylab='Target variable', main='Copper') # scatterplot of observations vs predictions
plot(fit[,3], target[,3], xlab='Model prediction', ylab='Target variable', main='Lead') # scatterplot of observations vs predictions
plot(fit[,4], target[,4], xlab='Model prediction', ylab='Target variable', main='Zinc') # scatterplot of observations vs predictions
par(mfrow=c(1,1)) # reset plot options
 
# Uncertainty quantification
# http://episun7.med.utah.edu/~alun/teach/stats/week08.pdf
 
hist(meuse$cadmium) # histogram of the cadmium concentrations
mean(meuse$cadmium) # mean of cadmium concentrations
var(meuse$cadmium)/n # variance of the sample mean
dataSample <- meuse$cadmium # use all data
#n <- 15;dataSample <- sample(meuse$cadmium, size=n) # use smaller data sample
 
# Bootstrapping
# estimate variance of a statistic
# good for smaller samples or awkward distributions
# generate a set of random samples with replacement to evaluate the statistic
 
# non-parametric bootstrap:
 
s <- 1000 # number of resampling cases
t <- rep(0,s) # initialize the resulting vector of means
for (i in 1:s) t[i] = mean(sample(dataSample,replace=TRUE)) # loop through all resampling cases, and calculate the mean
mean(t) # bootstrapped estimate of mean
var(t) # bootstrapped estimate of variance of the mean
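# the same non-parametric bootstrap with the recommended boot package, as a sketch of
# the equivalent call (boot() expects a statistic of the data and a vector of indices):
library(boot) # ships with R
bootMean <- boot(dataSample, statistic=function(d, idx) mean(d[idx]), R=s) # s resamples of the mean
mean(bootMean$t[,1]) # bootstrapped estimate of the mean
var(bootMean$t[,1]) # bootstrapped estimate of the variance of the mean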
 
# semi-parametric bootstrap:
 
a <- 0.1 # smoothing parameter
for (i in 1:s) t[i] = mean( sample(dataSample,replace=TRUE) + a * rnorm(n)) # loop through all resampling cases, add random normal error to each sample, and calculate the mean
mean(t) # bootstrapped estimate of mean
var(t) # bootstrapped estimate of variance of the mean
 
# Jack-knifing
# estimate variance and bias of a statistic
# more orderly version of the bootstrap
# generate n samples by leaving one observation out at a time
 
t = rep(0,n) # initialize the resulting vector of means
for (i in 1:n) t[i] = mean(dataSample[-i]) # loop through all cases
mean(t) # jack-knifed estimate of the mean
(n-1)/n * sum( (t-mean(t))^2 ) # jack-knifed estimate of the variance of the mean
n * mean(dataSample) - (n-1) * mean(t) # jack-knifed bias-corrected estimate
 
# Cross-validation
# evaluating predictive error for a model fitted to all data
 
cv <- function(input, target, i) # function to predict sample i through leave-one-out cross-validation, with the same ANN architecture as above
{
inputCV <- input[-i,]; targetCV <- target[-i,] # remove the ith sample from the training data
ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1, error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", output.layer="purelin", method="ADAPTgdwm") # create a fresh network
trainedANNCV <- train(ann, inputCV, targetCV, show.step=500, n.shows=1) # train on the remaining samples
return(sim(trainedANNCV$net, input)[i,]) # predict the left-out sample
}
results <- matrix(nrow=nrow(input), ncol=4) # initialize the results matrix
for(i in 1:nrow(input)) results[i,] <- cv(input, target, i) # loop through all samples, applying the above function
par(mfrow=c(2,2)) # plot 2 x 2 figures
for(i in 1:4) {plot(results[,i], target[,i], xlab='Model prediction',ylab='Target variable',main=c('Cadmium','Copper','Lead','Zinc')[i]); print(mean((target[,i]-results[,i])^2))} # plot observations vs predictions for each element
par(mfrow=c(1,1)) # reset plot options
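# for comparison with the cross-validation MSE printed above, the training MSE per
# element, computed from the 'fit' predictions of the single network trained earlier;
# this should be smaller than the cross-validation estimate of the predictive error
for(i in 1:4) print(mean((target[,i]-fit[,i])^2)) # training mean squared error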
 
# Clustering
# Unsupervised machine learning technique to find classes in a dataset
 
plot(meuse$lead, meuse$copper, cex=(meuse$dist.m)^(1/3)/7, pch=16, xlab='Lead (ppm)',ylab='Copper (ppm)',main='Distance to Meuse river') # let us classify the lead and copper concentrations; two groups are clearly present: close to and far away from the Meuse
 
# K-means clustering
# http://en.wikipedia.org/wiki/K-means_clustering
 
?kmeans # Help file for the kmeans function
par(ask=T) # adapt graphical parameters; now press enter in the console window to display plots
for(i in 1:10) # loop from 1 to 10 k-means classes
{
classes <- kmeans(x=data.frame(lead=meuse$lead, copper=meuse$copper), centers=i, iter.max=100, nstart=100)$cluster # classify
plot(meuse$lead, meuse$copper, col=classes, pch=classes,xlab='Lead (ppm)',ylab='Copper (ppm)', main=paste('K-means clustering, with',i,'classes')) # plot data with classes
}
par(ask=F) # reset graphical parameters
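# a quick 'elbow' diagnostic to help choose the number of classes: plot the total
# within-cluster sum of squares against the number of classes and look for the bend
wss <- sapply(1:10, function(k) sum(kmeans(data.frame(lead=meuse$lead, copper=meuse$copper), centers=k, iter.max=100, nstart=100)$withinss)) # total within-cluster SS for 1 to 10 classes
plot(1:10, wss, type='b', xlab='Number of classes', ylab='Total within-cluster sum of squares', main='K-means elbow plot')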
 
# Model-based clustering
# http://www.stat.washington.edu/raftery/Research/mbc.html
 
library(mclust) # load necessary package
?Mclust # Help file for the Mclust function
classes <- Mclust(data.frame(lead=meuse$lead, copper=meuse$copper))$classification # classify
plot(meuse$lead, meuse$copper, col=classes, pch=classes,xlab='Lead (ppm)',ylab='Copper (ppm)',main='Model-based clustering') # plot data with classes
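# to see how the number of classes was selected, inspect the BIC values for each
# covariance model and number of components (mclust picks the model maximizing BIC):
plot(mclustBIC(data.frame(lead=meuse$lead, copper=meuse$copper))) # BIC per model type and number of classes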
 
# Geostatistics
 
?gstat # Help file for the gstat function
coordinates(meuse) = ~ x+y # Set x and y variables as coordinates
data(meuse.grid) # Load data grid for estimating Zinc concentrations
gridded(meuse.grid) = ~ x+y # Set data as gridded with x and y as coordinates
v <- vgm(0.581, "Sph", 900, nug = 0.0554) # Variogram model
g <- gstat(id = "ln.zinc", form = log(zinc) ~ 1,data = meuse, nmax = 40, nmin = 20, maxdist = 1000, model = v) # Initialize gstat object
plot(variogram(g), model=v) # Plot variogram model fit
x <- predict(g, meuse.grid) # Predict using ordinary kriging
spplot(x["ln.zinc.pred"], main = "log(zinc) ordinary kriging predictions", col.regions=terrain.colors(30)) # Plot ordinary kriging predictions
x$ln.zinc.se = sqrt(x$ln.zinc.var) # convert kriging variance to kriging standard error for plotting
spplot(x["ln.zinc.se"], main = "log(zinc) ordinary kriging prediction standard errors", col.regions=terrain.colors(30)) # Plot ordinary kriging standard error
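# the same ordinary kriging can also be done in a single call with gstat's krige()
# convenience wrapper (note: global kriging unless nmax/nmin/maxdist are supplied):
x2 <- krige(log(zinc) ~ 1, meuse, meuse.grid, model = v) # predictions in var1.pred, kriging variances in var1.var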
 
# need more examples?
 
demo(cokriging)
demo(examples)
 
# Other useful info:
 
# The R project website: http://www.r-project.org/
# List of packages: http://cran.freestatistics.org/web/packages/available_packages_by_name.html
# Visualization package for the R statistical analysis platform, loosely based on "The Grammar of Graphics" from Leland Wilkinson: http://had.co.nz/ggplot2/
# R graph gallery: http://addictedtor.free.fr/graphiques/
# R & LaTeX:
# knitr: http://yihui.name/knitr/
# Sweave: http://www.statistik.lmu.de/~leisch/Sweave/
