mlr got a new feature that extended the support for spatial data: The ability to visualize spatial partitions in cross-validation (CV) 9d4f3.
When one uses the resampling descriptions “SpCV” or “SpRepCV” in
mlr, the k-means clustering approach after Brenning (2005) is used to partition the dataset into equally sized, spatially disjoint subsets.
See also this post on r-spatial.org and the mlr vignette about spatial data for more information.
Visualization of partitions
When using random partitiong in a normal cross-validation, one is usually not interested in the random pattern of the datasets train/test split
However, for spatial data this information is important since it can help identifying problematic folds (ones that did not converge or showed a bad performance) and also proves that the k-means clustering algorithm did a good job on partitioning the dataset.
The new function to visualize these partitions in
mlr is called
ggplot2 and its new
geom_sf() function to create spatial plots based on the resampling indices of a
In this post we will use the examples of the function to demonstrate its use.
First, we create a resampling description
rdesc using spatial partitioning with five folds and repeat it 4 times.
rdesc object is put into a
resample() call together with our example task for spatial matters,
Finally, we use the
classif.qda learner to have a quick model fit.
library(mlr) rdesc = makeResampleDesc("SpRepCV", folds = 5, reps = 4) resamp = resample(makeLearner("classif.qda"), spatial.task, rdesc)
Now we can use
createSpatialResamplingPlots() to automatically create one plot for each fold of the
Usually we do not want to plot all repetitions of the CV.
We can restrict the number of repetitions in the argument
Besides the required arguments
resample, the user can specifiy the coordinate reference system that should be used for the plots.
Here it is important to set the correct EPSG number in argument
crs to receive accurate spatial plots.
In the background,
geom_sf() (more specifically
coords_sf()) will transform the CRS on the fly to EPSG: 4326.
This is done because lat/lon reference systems are better for plotting as UTM coordinates usually clutter the axis.
However, if you insist on using UTM projection instead of WGS84 (EPSG: 4326) you can set the EPSG code of your choice in argument
plots = createSpatialResamplingPlots(spatial.task, resamp, crs = 32717, repetitions = 2, x.axis.breaks = c(-79.065, -79.085), y.axis.breaks = c(-3.970, -4))
To avoid overlapping axis breaks you might want to set the axis breaks on your own as we did here.
Now we got a list of 2L back from
In the first list we got all the plots, one for each fold.
Since we used two repetitions and five folds, we have a total of ten instances in it.
The second list consists of labels for each plot.
These are automatically created by
createSpatialResamplingPlots() and can serve as titles later on.
Note that for now we just created the
ggplot objects (stored in the first list of the
We still need to plot them!
ggplot objects can be plotted by just extracting a certain object from the list, e.g.
This would plot fold #3 of repetition one.
However usually we want to visualize all plots in a grid.
For this purpose we highly recommend to use the
cowplot package and its function
The returned objects of
createSpatialResamplingPlots() are already tailored to be used with this function.
We just need to hand over the list of plots and (optional) the labels:
cowplot::plot_grid(plotlist = plots[["Plots"]], ncol = 5, nrow = 2, labels = plots[["Labels"]])
Multiple resample objects
createSpatialResamplingPlots() is quite powerful and can also take multiple
resample() objects as inputs with the aim to compare both.
A typical use case is to visualize the differences between spatial and non-spatial partitioning.
To do so, we first create two
resample() objects (one using “SpRepCV”, one using “RepCV”):
rdesc1 = makeResampleDesc("SpRepCV", folds = 5, reps = 4) r1 = resample(makeLearner("classif.qda"), spatial.task, rdesc1) rdesc2 = makeResampleDesc("RepCV", folds = 5, reps = 4) r2 = resample(makeLearner("classif.qda"), spatial.task, rdesc2)
Now we can hand over both objects using a named list.
This way the list names will also directly be used as a prefix in the resulting plot labels.
plots = createSpatialResamplingPlots(spatial.task, list("SpRepCV" = r1, "RepCV" = r2), crs = 32717, repetitions = 1, x.axis.breaks = c(-79.055, -79.085), y.axis.breaks = c(-3.975, -4)) cowplot::plot_grid(plotlist = plots[["Plots"]], ncol = 5, nrow = 2, labels = plots[["Labels"]])
Brenning, A. (2012). Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest. In 2012 IEEE International Geoscience and Remote Sensing Symposium. IEEE. https://doi.org/10.1109/igarss.2012.6352393