Removing absences from GBIF datasets

[This article was first published on modTools, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I often come across GBIF users who are unaware that the records available for a given taxon are not necessarily all presences: there’s a column named “occurrenceStatus” whose value can be “PRESENT” or “ABSENT”! The absence records can, of course, be removed with simple operations in R or even omitted from the download, but many users overlook or accidentally skip this crucial step, and then end up analysing species distributions with some absence records being used as presences.

To reduce the chances of this happening, I just added new arguments to the fuzzySim function cleanCoords() (which was explained in a previous post) to allow removing absence records too, when the user wants to filter only the presences. Here’s a usage example using aardvark occurrence records downloaded with the geodata package. Note that this requires installing the latest version of fuzzySim (4.9.9):

occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer",
       fixnames = FALSE)
# NOTE: as per the function help file, if you use GBIF data, remember to check the data use agreement and follow guidance on how to properly cite the actual data sources!

names(occ)

occ_clean <- fuzzySim::cleanCoords(occ,
             coord.cols = c("decimalLongitude", "decimalLatitude"),
             uncert.col = "coordinateUncertaintyInMeters",
             uncert.limit = 10000,
             abs.col = "occurrenceStatus")

# 764 rows in input data
# 576 rows after 'rm.dup'
# 575 rows after 'rm.equal'
# 575 rows after 'rm.imposs'
# 575 rows after 'rm.missing.any'
# 575 rows after 'rm.zero.any'
# 570 rows after 'rm.imprec.any'
# 465 rows after 'rm.uncert' (with uncert.limit=10000 and uncert.na.pass=TRUE)
# 461 rows after 'rm.abs'

As you can see, besides some common biodiversity data issues such as duplicated or erroneous coordinates, additional records were removed because they represented absences. Hopefully this can help prevent their incorrect use as species presences. Feedback welcome!

To leave a comment for the author, please follow the link and comment on their blog: modTools.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)