Specialized R packages for spatial cross-validation: sperrorest and blockCV
This is the sixth and final blog post in a series on spatial machine learning with R.
You can find the list of other blog posts in this series in part one.
This document provides an overview of two R packages, sperrorest and blockCV, that can be used for spatial cross-validation but operate outside standard machine learning frameworks like caret, tidymodels, or mlr3.
All of the examples below use the same dataset, which includes temperature measurements in Spain, a set of covariates, and the spatial coordinates of the temperature measurements.
spain <- sf::read_sf("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/spain.gpkg")
covariates <- terra::rast("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/predictors.tif")
temperature <- sf::read_sf("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/temp_train.gpkg")
temperature <- terra::extract(covariates, temperature, bind = TRUE) |>
  sf::st_as_sf()
sperrorest
The sperrorest (https://doi.org/10.32614/CRAN.package.sperrorest) package is designed for spatial error estimation and variable importance assessment for predictive models. The package itself does not fit the models but provides a set of functions for spatial cross-validation, including data partitioning and model cross-validation.
While the sperrorest package has many functions (including a set of functions for data partitioning), its main function is sperrorest(). It performs spatial cross-validation for spatial prediction models, including variable importance assessment and prediction error estimation. To use this function, we need to provide the formula, the data, the coordinates, the model function, the model arguments, the prediction function, the sampling function, and the sampling arguments.
Let’s do it step by step. First, we need to prepare the data by extracting the coordinates and creating a data frame with the dependent variable, covariates, and coordinates.
library(sperrorest)
library(ranger)

coordinates <- sf::st_coordinates(temperature)
temperature_df <- sf::st_drop_geometry(temperature)
temperature_df$x <- coordinates[, 1]
temperature_df$y <- coordinates[, 2]
Second, we need to define the formula for the model.
response_name <- "temp"
covariate_names <- colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
fo <- as.formula(paste(
  response_name,
  "~",
  paste(covariate_names, collapse = " + ")
))
Third, we need to define the custom prediction function. The sperrorest package works with many model functions, but it requires a custom prediction function to extract the predictions from the model object. In this example, we use the ranger model: its predict() function returns a list with several elements, so the predictions must be extracted from that list.1
mypred <- function(object, newdata) {
  predict(object, newdata)$predictions
}
Fourth, we can perform the spatial cross-validation using the sperrorest() function. We just need to provide the previously prepared data, the formula, the model function, and the prediction function. Moreover, we can also define some additional parameters of the model, such as the number of trees in the ranger model. Finally, the important part is to define the sampling function (smp_fun) and its arguments (smp_args). The sampling function partitions the data into training and testing sets: here, we use the partition_kmeans() function to partition the data spatially into folds using k-means clustering of the coordinates.2
# Spatial cross-validation
set.seed(6)
sp_res <- sperrorest(
  formula = fo,
  data = temperature_df,
  coords = c("x", "y"),
  model_fun = ranger,
  model_args = list(num.trees = 100),
  pred_fun = mypred,
  smp_fun = partition_kmeans,
  smp_args = list(repetition = 1:2, nfold = 3),
  progress = FALSE
)
The result is a list with several components, including the error at the repetition and fold levels, the resampling object, the variable importance (only when importance = TRUE), the benchmark, and the package version.
summary(sp_res$error_rep)
                     mean           sd       median          IQR
train_bias    0.003667002 0.0224183696  0.003667002 0.0158521812
train_stddev  0.399217628 0.0044832845  0.399217628 0.0031701609
train_rmse    0.399038491 0.0042677806  0.399038491 0.0030177766
train_mad     0.316892003 0.0078406248  0.316892003 0.0055441590
train_median  0.033006028 0.0001897285  0.033006028 0.0001341583
train_iqr     0.427813987 0.0083456683  0.427813987 0.0059012787
train_count 390.000000000 0.0000000000 390.000000000 0.0000000000
test_bias    -0.009090677 0.0742284999 -0.009090677 0.0524874756
test_stddev   1.447676903 0.0934364402  1.447676903 0.0660695405
test_rmse     1.444959463 0.0926651037  1.444959463 0.0655241232
test_mad      1.341466967 0.1144815717  1.341466967 0.0809506957
test_median   0.101932523 0.0950544902  0.101932523 0.0672136746
test_iqr      1.876642239 0.1533036116  1.876642239 0.1084020233
test_count  195.000000000 0.0000000000 195.000000000 0.0000000000
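To also obtain the variable importance mentioned above, the cross-validation can be rerun with the importance argument enabled. A minimal sketch, assuming the importance and imp_permutations arguments of sperrorest() (imp_permutations is kept small here only to reduce runtime):

set.seed(6)
sp_res_imp <- sperrorest(
  formula = fo,
  data = temperature_df,
  coords = c("x", "y"),
  model_fun = ranger,
  model_args = list(num.trees = 100),
  pred_fun = mypred,
  smp_fun = partition_kmeans,
  smp_args = list(repetition = 1:2, nfold = 3),
  importance = TRUE, # permutation-based variable importance
  imp_permutations = 10, # number of permutations per variable (small for speed)
  progress = FALSE
)
summary(sp_res_imp$importance)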
We can contrast the obtained results with non-spatial cross-validation by changing the sampling function to partition_cv().
# Non-spatial cross-validation
set.seed(11)
nsp_res <- sperrorest(
  formula = fo,
  data = temperature_df,
  coords = c("x", "y"),
  model_fun = ranger,
  model_args = list(num.trees = 100),
  pred_fun = mypred,
  smp_fun = partition_cv,
  smp_args = list(repetition = 1:2, nfold = 3),
  progress = FALSE
)
To compare both results, we can plot the RMSE values for the training and testing sets of both spatial and non-spatial cross-validation.
library(ggplot2)

# Extract train/test RMSE from spatial CV
sp_train_rmse <- sp_res$error_rep$train_rmse
sp_test_rmse <- sp_res$error_rep$test_rmse

# Extract train/test RMSE from non-spatial CV
nsp_train_rmse <- nsp_res$error_rep$train_rmse
nsp_test_rmse <- nsp_res$error_rep$test_rmse

# Build data frame
rmse_df <- data.frame(
  CV_Type = rep(c("Spatial", "Non-Spatial"), each = 4),
  Set = rep(c("Train", "Test"), each = 2),
  RMSE = c(sp_train_rmse, sp_test_rmse, nsp_train_rmse, nsp_test_rmse)
)

ggplot(rmse_df, aes(x = CV_Type, y = RMSE, fill = Set)) +
  geom_boxplot() +
  facet_wrap(~Set) +
  labs(title = "RMSE Comparison", x = "CV Method", y = "RMSE")
The results show that the test-set error estimated with spatial cross-validation is less optimistic than the one obtained with non-spatial cross-validation.
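For a quick numeric check, we can also compare the mean test RMSE values of both approaches directly:

# mean test RMSE across repetitions
mean(sp_res$error_rep$test_rmse) # spatial CV
mean(nsp_res$error_rep$test_rmse) # non-spatial CV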
More examples of the package's use can be found at https://giscience-fsu.github.io/sperrorest/articles/spatial-modeling-use-case.html.
blockCV
The blockCV (https://doi.org/10.1111/2041-210X.13107) package provides a set of functions for block cross-validation, spatial and environmental clustering, and spatial autocorrelation estimation. The package itself does not fit the models.
# remotes::install_github("cran/automap")
library(blockCV)
Cross-validation strategies separate the data into training and testing sets to evaluate the model’s performance. The blockCV package provides several cross-validation strategies, including block cross-validation, spatial clustering, environmental clustering, buffering LOO, and Nearest Neighbour Distance Matching (NNDM) LOO.
The block cross-validation is performed using the cv_spatial() function. It assigns blocks to the training and testing folds randomly, systematically, or in a checkerboard pattern (the selection argument).
set.seed(61)
sb1 <- cv_spatial(
  x = temperature,
  k = 10, # number of folds
  size = 300000, # size of the blocks in meters
  selection = "random", # random blocks-to-fold
  iteration = 50, # find evenly dispersed folds
  progress = FALSE,
  biomod2 = TRUE
)
   train test
1    170   25
2    169   26
3    171   24
4    180   15
5    181   14
6    183   12
7    182   13
8    165   30
9    179   16
10   175   20
The result is a list with several components, including the folds list, the folds IDs, the biomod table, the number of folds, the input size, the column name, the blocks, and the records. For example, we can check the structure of the folds list with the str() function.
str(sb1$folds_list)
List of 10
 $ :List of 2
  ..$ : int [1:170] 117 87 103 84 105 99 119 101 98 97 ...
  ..$ : int [1:25] 23 31 57 39 58 60 21 22 32 27 ...
 $ :List of 2
  ..$ : int [1:169] 117 87 103 84 105 99 119 101 98 97 ...
  ..$ : int [1:26] 153 155 93 108 95 91 88 154 94 140 ...
 $ :List of 2
  ..$ : int [1:171] 117 87 103 84 105 99 119 101 98 97 ...
  ..$ : int [1:24] 12 45 175 176 166 20 177 19 167 16 ...
 $ :List of 2
  ..$ : int [1:180] 117 87 103 84 105 99 119 101 98 97 ...
  ..$ : int [1:15] 134 138 136 133 187 192 174 173 189 190 ...
 $ :List of 2
  ..$ : int [1:181] 117 87 103 84 105 99 119 101 98 97 ...
  ..$ : int [1:14] 161 142 143 150 160 149 146 147 158 151 ...
 $ :List of 2
  ..$ : int [1:183] 117 87 103 84 105 99 119 101 98 97 ...
  ..$ : int [1:12] 4 8 5 2 193 1 6 194 164 3 ...
 $ :List of 2
  ..$ : int [1:182] 117 87 103 84 105 99 119 101 98 97 ...
  ..$ : int [1:13] 128 124 129 130 127 132 131 120 126 125 ...
 $ :List of 2
  ..$ : int [1:165] 117 87 103 84 105 99 119 101 98 97 ...
  ..$ : int [1:30] 61 70 75 52 77 83 79 86 50 48 ...
 $ :List of 2
  ..$ : int [1:179] 23 31 57 39 58 60 21 22 32 27 ...
  ..$ : int [1:16] 117 87 103 84 105 99 119 101 98 97 ...
 $ :List of 2
  ..$ : int [1:175] 117 87 103 84 105 99 119 101 98 97 ...
  ..$ : int [1:20] 163 181 159 157 186 67 71 41 178 179 ...
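Each element of the folds list holds two integer vectors of row indices: the first for the training set and the second for the testing set. For example, the indices of the first fold can be accessed as follows:

# training and testing indices of the first fold
fold1_train <- sb1$folds_list[[1]][[1]]
fold1_test <- sb1$folds_list[[1]][[2]]
length(fold1_train) # 170 training points
length(fold1_test) # 25 testing points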
The cv_plot() function additionally allows for the visualization of cross-validation results.
cv_plot(sb1, temperature)
Let’s compare the results of the block cross-validation with systematic and checkerboard patterns.
set.seed(16)
sb2 <- cv_spatial(
  x = temperature,
  k = 10,
  rows_cols = c(4, 6),
  hexagon = FALSE,
  selection = "systematic"
)
   train test
1    172   23
2    180   15
3    177   18
4    169   26
5    178   17
6    182   13
7    180   15
8    162   33
9    178   17
10   177   18
cv_plot(sb2, temperature)
set.seed(12)
sb3 <- cv_spatial(
  x = temperature,
  k = 10,
  size = 300000,
  hexagon = FALSE,
  selection = "checkerboard"
)
  train test
1    98   97
2    97   98
cv_plot(sb3, temperature)
The clustering strategies (cv_cluster()) are used to group the data into clusters based on spatial or environmental similarity. The spatial similarity is based only on the clustering of the spatial coordinates.
set.seed(7)
scv <- cv_cluster(x = temperature, k = 10)
   train test
1    169   26
2    169   26
3    173   22
4    178   17
5    179   16
6    182   13
7    176   19
8    171   24
9    178   17
10   180   15
cv_plot(scv, temperature)
The environmental clustering, on the other hand, is based on the clustering of the values of the covariates extracted from the raster data.
set.seed(2)
ecv <- cv_cluster(x = temperature, r = covariates, k = 5, scale = TRUE)
  train test
1   164   31
2   190    5
3   182   13
4   154   41
5    90  105
cv_plot(ecv, temperature)
The next cross-validation strategy is buffering LOO (also known as spatial LOO). It is performed using the cv_buffer() function, which draws a buffer around each (test) point and uses the points outside the buffer as the training set.3
set.seed(21)
bloo <- cv_buffer(x = temperature, size = 300000, progress = FALSE)
     train            test
 Min.   : 97.0   Min.   :1
 Mean   :132.7   Mean   :1
 Max.   :170.0   Max.   :1
cv_plot(bloo, temperature, num_plots = c(1, 50, 100))
Note that above, we plot only the first, 50th, and 100th points to avoid overplotting.
The last cross-validation strategy implemented in the blockCV package is the Nearest Neighbour Distance Matching (NNDM) LOO. It is performed using the cv_nndm() function, which tries to match the nearest neighbor distance distribution function between the test and training data to the one between the target prediction points and the training points. Thus, in this case, we need to provide more arguments, including a raster with the covariates, the number of samples, the sampling strategy, and the minimum training size.
set.seed(12)
nncv <- cv_nndm(
  x = temperature,
  r = covariates,
  size = 300000,
  num_sample = 5000,
  sampling = "regular",
  min_train = 0.1,
  plot = TRUE
)
     train            test
 Min.   :192.0   Min.   :1
 Mean   :192.9   Mean   :1
 Max.   :193.0   Max.   :1
cv_plot(nncv, temperature, num_plots = c(1, 50, 100))
Let’s now use the block cross-validation to fit and evaluate a model.
# define formula
response_name <- "temp"
covariate_names <- colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
fo <- as.formula(paste(
  response_name,
  "~",
  paste(covariate_names, collapse = " + ")
))

# extract the folds
folds <- sb1$folds_list

model_rmse <- data.frame(fold = seq_along(folds), rmse = rep(NA, length(folds)))

for (k in seq_along(folds)) {
  trainSet <- unlist(folds[[k]][1]) # training set indices; first element
  testSet <- unlist(folds[[k]][2]) # testing set indices; second element
  rf <- ranger(fo, temperature_df[trainSet, ], num.trees = 100) # model fitting on training set
  pred <- predict(rf, temperature_df[testSet, ])$predictions # predict the test set
  model_rmse[k, "rmse"] <- sqrt(mean(
    (temperature_df[testSet, response_name] - pred)^2
  )) # calculate RMSE
}
model_rmse
   fold      rmse
1     1 1.1401856
2     2 0.7848595
3     3 0.9219486
4     4 0.8007453
5     5 0.6485726
6     6 1.2989341
7     7 0.8140209
8     8 1.2050918
9     9 0.9696801
10   10 0.8681127
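To summarize the model's performance in a single number, we can average the per-fold values:

# mean RMSE across the ten spatial folds
mean(model_rmse$rmse)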
The blockCV package also provides functions for checking the similarity between the folds (cv_similarity()) and estimating the effective range of spatial autocorrelation (cv_spatial_autocor()). The first function is used to check the similarity between the folds in the cross-validation.
cv_similarity(cv = sb1, x = temperature, r = covariates, progress = FALSE)
Warning: The following aesthetics were dropped during statistical transformation: colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?
The second function is used to estimate the effective range of spatial autocorrelation of all input raster layers or the response data – its role is to help determine the size of the blocks for the block cross-validation.
cv_spatial_autocor(r = covariates, num_sample = 5000, progress = FALSE)
[1] "cv_spatial_autocor"
More examples of the package’s use can be found at https://cran.r-project.org/web/packages/blockCV/vignettes/tutorial_2.html.
Closing remarks
This blog post concludes the series on machine learning for spatial data with R. The series showed established frameworks for machine learning in R, such as caret, tidymodels, and mlr3, focusing on how to use them for spatial data problems. It then presented specialized R packages for spatial machine learning, such as spatialRF, RandomForestsGLS, and meteo, and two R packages aimed at spatial cross-validation: sperrorest and blockCV.
The series concentrated on so-called “classical” machine learning methods, with an emphasis on random forests. It did not cover deep learning methods, as they require a different approach and tools. You can find more information on deep learning in R in the documentation of packages like keras3 (https://keras3.posit.co/), torch (https://torch.mlverse.org/), or luz (https://mlverse.github.io/luz/). Moreover, the sits package provides various deep learning tools for satellite image time series analysis – you can learn more about it in the book “Satellite Image Time Series Analysis on Earth Observation Data Cubes”.
The number of methods and tools for spatial machine learning is growing rapidly, and it is impossible to cover all of them in a single series. However, the presented packages and methods should provide a good starting point for anyone interested in spatial machine learning with R. If you have any suggestions for other packages and methods that should be included in this series, feel free to contact me – you can find my contact details on my website.
This blog post was originally written as a supplement to the poster “An Inventory of Spatial Machine Learning Packages in R” presented at the FOSSGIS 2025 conference in Muenster, Germany. The poster is available at https://doi.org/10.5281/zenodo.15088973.
Footnotes
1. More information on custom prediction functions is available at https://cran.r-project.org/web/packages/sperrorest/vignettes/custom-pred-and-model-functions.html.↩︎
2. There are several other partition functions available in the package, including partition_disc(), partition_tiles(), and partition_cv().↩︎
3. This approach is a form of leave-one-out cross-validation.↩︎
Citation
@online{nowosad2025,
  author = {Nowosad, Jakub},
  title = {Specialized {R} Packages for Spatial Cross-Validation: Sperrorest and {blockCV}},
  date = {2025-07-09},
  url = {https://geocompx.org/post/2025/sml-bp6/},
  langid = {en}
}