Site icon R-bloggers

Visualization of spatial cross-validation partitioning

[This article was first published on r-bloggers on Machine Learning in R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In July mlr got a new feature that extended the support for spatial data: The ability to visualize spatial partitions in cross-validation (CV) 9d4f3. When one uses the resampling descriptions “SpCV” or “SpRepCV” in mlr, the k-means clustering approach after Brenning (2005) is used to partition the dataset into equally sized, spatially disjoint subsets. See also this post on r-spatial.org and the mlr vignette about spatial data for more information.

Visualization of partitions

When using random partitiong in a normal cross-validation, one is usually not interested in the random pattern of the datasets train/test split However, for spatial data this information is important since it can help identifying problematic folds (ones that did not converge or showed a bad performance) and also proves that the k-means clustering algorithm did a good job on partitioning the dataset.

The new function to visualize these partitions in mlr is called createSpatialResamplingPlots(). It uses ggplot2 and its new geom_sf() function to create spatial plots based on the resampling indices of a resample() object. In this post we will use the examples of the function to demonstrate its use.

First, we create a resampling description rdesc using spatial partitioning with five folds and repeat it 4 times. This rdesc object is put into a resample() call together with our example task for spatial matters, spatial.task. Finally, we use the classif.qda learner to have a quick model fit.

library(mlr)
rdesc = makeResampleDesc("SpRepCV", folds = 5, reps = 4)
resamp = resample(makeLearner("classif.qda"), spatial.task, rdesc)

Now we can use createSpatialResamplingPlots() to automatically create one plot for each fold of the resamp object. Usually we do not want to plot all repetitions of the CV. We can restrict the number of repetitions in the argument repetitions.

Besides the required arguments task and resample, the user can specifiy the coordinate reference system that should be used for the plots. Here it is important to set the correct EPSG number in argument crs to receive accurate spatial plots. In the background, geom_sf() (more specifically coords_sf()) will transform the CRS on the fly to EPSG: 4326. This is done because lat/lon reference systems are better for plotting as UTM coordinates usually clutter the axis. However, if you insist on using UTM projection instead of WGS84 (EPSG: 4326) you can set the EPSG code of your choice in argument datum.

plots = createSpatialResamplingPlots(spatial.task, resamp, crs = 32717,
  repetitions = 2, x.axis.breaks = c(-79.065, -79.085),
  y.axis.breaks = c(-3.970, -4))

To avoid overlapping axis breaks you might want to set the axis breaks on your own as we did here.

Now we got a list of 2L back from createSpatialResamplingPlots(). In the first list we got all the plots, one for each fold. Since we used two repetitions and five folds, we have a total of ten instances in it.

The second list consists of labels for each plot. These are automatically created by createSpatialResamplingPlots() and can serve as titles later on. Note that for now we just created the ggplot objects (stored in the first list of the plots object). We still need to plot them!

Single ggplot objects can be plotted by just extracting a certain object from the list, e.g. plots[[1]][[3]]. This would plot fold #3 of repetition one.

plots[[1]][[3]]

However usually we want to visualize all plots in a grid. For this purpose we highly recommend to use the cowplot package and its function plot_grid().

The returned objects of createSpatialResamplingPlots() are already tailored to be used with this function. We just need to hand over the list of plots and (optional) the labels:

cowplot::plot_grid(plotlist = plots[["Plots"]], ncol = 5, nrow = 2,
  labels = plots[["Labels"]])

Multiple resample objects

createSpatialResamplingPlots() is quite powerful and can also take multiple resample() objects as inputs with the aim to compare both. A typical use case is to visualize the differences between spatial and non-spatial partitioning.

To do so, we first create two resample() objects (one using “SpRepCV”, one using “RepCV”):

rdesc1 = makeResampleDesc("SpRepCV", folds = 5, reps = 4)
r1 = resample(makeLearner("classif.qda"), spatial.task, rdesc1)
rdesc2 = makeResampleDesc("RepCV", folds = 5, reps = 4)
r2 = resample(makeLearner("classif.qda"), spatial.task, rdesc2)

Now we can hand over both objects using a named list. This way the list names will also directly be used as a prefix in the resulting plot labels.

plots = createSpatialResamplingPlots(spatial.task,
  list("SpRepCV" = r1, "RepCV" = r2), crs = 32717, repetitions = 1,
  x.axis.breaks = c(-79.055, -79.085), y.axis.breaks = c(-3.975, -4))

cowplot::plot_grid(plotlist = plots[["Plots"]], ncol = 5, nrow = 2,
  labels = plots[["Labels"]])

References

Brenning, A. (2012). Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest. In 2012 IEEE International Geoscience and Remote Sensing Symposium. IEEE. https://doi.org/10.1109/igarss.2012.6352393

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers on Machine Learning in R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.