Safe-and-simple cleaning of species occurrences

[This article was first published on modTools, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my species distribution modelling courses, for a quick and safe removal of the most common biodiversity database errors, I’ve so far used functions from the scrubr R package, namely ‘coord_incomplete’, ‘coord_impossible’, ‘coord_unlikely’, ‘coord_imprecise’ and ‘coord_uncertain’. There are other R packages for species occurrence data cleaning (e.g. CoordinateCleaner, biogeo, or the GUI-based bdclean), but they either require (somewhat laborious) data formatting, and/or they can eliminate many valid records (e.g. near region centroids or towns or institutions, or wrongly considered outliers) if their argument values are not mindfully tailored and their results not carefully inspected. As I need students to have at least most of the worst data errors cleaned before modelling, but can’t spend much of their time and attention on something that’s not the main focus of the course, scrubr was my go-to option… until it was archived, both on CRAN and on GitHub! While I always advise students to do a more thorough data inspection for real work after the course, I implemented those no-fuss, safe-and-simple essential cleaning procedures in a new cleanCoords function in package fuzzySim. This function removes (any or all of) duplicate, missing, impossible, unlikely, imprecise or overly uncertain spatial coordinates.

Here’s a usage example, which currently involves installing the latest development version of fuzzySim from R-Forge, and also the geodata package for downloading some GBIF records to clean:

install.packages("fuzzySim", repos="")

# download some species occurrences from GBIF:
occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer", fixnames = FALSE)

# clean the occurrences:
occ_clean <- fuzzySim::cleanCoords(occ, coord.cols = c("decimalLongitude", "decimalLatitude"), uncert.col = "coordinateUncertaintyInMeters", uncert.limit = 10000)  # 10 km tolerance

# 761 rows in input data
# 574 rows after 'rm.dup'
# 573 rows after 'rm.equal'
# 573 rows after 'rm.imposs'
# 573 rows after 'rm.missing.any'
# 573 rows after ''
# 568 rows after 'rm.imprec.any'
# 462 rows after 'rm.uncert' (with uncert.limit=10000 and

Run help(cleanCoords) to see the arguments that you can turn on or off. As I always like my students to look at the results of every step of the analysis before moving on, I added a plot argument which is TRUE by default. So you’ll also get a figure like this, with the selected presences in blue and the excluded ones in red:

Feedback welcome before I update the package on CRAN!

To leave a comment for the author, please follow the link and comment on their blog: modTools. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)