Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Some OpenStreetMap elements have Wikidata entities as attributes. For example, the summit of Mont Blanc has the key/value pair wikidata=Q583 pointing to Mont Blanc, where we can see that the entity has the correct identifier “OpenStreetMap node ID” (P11693) pointing back to node 281399025; and everything is fine…
However, sometimes Wikidata entities are missing the OSM ID(s). Here is my workflow to find these entities in a defined area to check and complete them.
Config
library(osmdata)   # get data from OSM
library(WikidataR) # get data from Wikidata
library(sf)
library(dplyr)
library(purrr)
library(tidyr)
library(glue)
library(janitor)
Data
First we choose an area of interest, here around Termignon, and get the OSM data.
osm_wd <- getbb("Termignon, France") |> 
  opq() |> 
  add_osm_feature(key = "wikidata") |> 
  osmdata_sf()
Utilities
Then we make two functions that query the Wikidata API to get the OSM identifiers associated with a Wikidata entity, plus a helper that prefixes the OSM IDs so that both sides can be compared. Three properties are needed (for nodes, ways and relations). An entity can have zero, one, two or three of these properties, and each property can hold one or several IDs.
#' Get the OpenStreetMap IDs of a wikidata item
#'
#' @param item (char) wikidata ID (e.g. "Q19368619")
#'
#' @returns (vec<char>) 
#'   OSM ID (possibly several), prefixed with 
#'    "n" for "node",
#'    "w" for "way" and
#'    "r" for relation
#'  NULL if no OSM ID or
#'  NA if not available,
#' @examples get_wd_osm_id("Q19368619")
get_wd_osm_id <- purrr::possibly(function(item) {
  
  # avoid spending time querying if no wikidata item
  if (is.na(item) || item == "" || !stringr::str_detect(item, "^Q[0-9]+$")) {
    return(NA_character_)
  } else {
    i <- WikidataR::get_item(item)
    
    # P402 relation
    relation <- purrr::pluck(i, 
                             1, "claims", "P402", "mainsnak", "datavalue", "value")
    relation <- if (!is.null(relation)) { paste0("r", relation) } else { NULL }
    
    # P10689 way
    way <- purrr::pluck(i, 
                        1, "claims", "P10689", "mainsnak", "datavalue", "value")
    way <- if (!is.null(way)) { paste0("w", way) } else { NULL }
    
    # P11693 node
    node <- purrr::pluck(i, 
                         1, "claims", "P11693", "mainsnak", "datavalue", "value")
    node <- if (!is.null(node)) { paste0("n", node) } else { NULL }
    
    return(purrr::compact(c(relation, way, node)))
  }
}, otherwise = NA_character_)
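To see what the extraction step does without calling the API, here is a minimal offline sketch. The `mock_item` list is a hand-built, simplified stand-in for what `WikidataR::get_item()` returns (the real response carries many more fields), the IDs in it are made up, and `extract_osm_ids()` is a hypothetical helper that only mirrors the `pluck()` logic above.

```r
library(purrr)

# Hand-built, simplified stand-in for a get_item() result (assumption:
# only the fields walked by pluck() are reproduced; IDs are made up)
mock_item <- list(
  list(
    id = "Q0",
    claims = list(
      # two node IDs for the same entity
      P11693 = list(mainsnak = list(datavalue = list(value = c("111", "222")))),
      # one relation ID; no way property (P10689) at all
      P402   = list(mainsnak = list(datavalue = list(value = "333")))
    )
  )
)

# Hypothetical helper: same lookup as above, looping over the three properties
extract_osm_ids <- function(i) {
  props <- c(r = "P402", w = "P10689", n = "P11693")
  ids <- imap(props, \(prop, prefix) {
    value <- pluck(i, 1, "claims", prop, "mainsnak", "datavalue", "value")
    if (is.null(value)) NULL else paste0(prefix, value)
  })
  # missing properties are NULL and disappear when flattening
  unlist(ids, use.names = FALSE)
}

extract_osm_ids(mock_item)
```

A missing property simply contributes nothing, and a property with several values contributes several prefixed IDs.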
#' Add a column with the OpenStreetMap IDs from wikidata
#' 
#' For all features of an osmdata sf object having a wikidata ID, get the 
#' associated OSM IDs recorded in wikidata
#'
#' @param osmdata_features (sf) osmdata sub-object (osm_points, osm_lines,...)
#'   with a `wikidata` column
#'
#' @returns (sf) input object with a new column `wd_osm_id` as a list-column of
#'   character IDs (or NULL)
#' @examples add_osm_id_from_wikidata(my_osmdata_sf$osm_points)
add_osm_id_from_wikidata <- function(osmdata_features) {
  geom_type <- sf::st_geometry_type(osmdata_features, by_geometry = FALSE)
  osmdata_features |>
    tibble::as_tibble(.name_repair = janitor::make_clean_names) |> 
    dplyr::mutate(
      wd_osm_id = purrr::map(
        wikidata,
        slowly(get_wd_osm_id), 
        .progress = glue::glue("Getting OSM ID for {geom_type}...")))
}
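The two purrr wrappers used above can be tried offline. This is only a sketch: `fake_lookup()` is a hypothetical stand-in for the remote query, and the 0.1 s pause is an arbitrary value chosen for the demo (by default `slowly()` waits one second between calls).

```r
library(purrr)

# Hypothetical stand-in for a remote lookup: errors on malformed IDs
fake_lookup <- function(id) {
  if (!grepl("^Q[0-9]+$", id)) stop("not a Wikidata ID: ", id)
  paste0("n", sub("^Q", "", id))
}

# possibly(): return a default value instead of stopping on error
safe_lookup <- possibly(fake_lookup, otherwise = NA_character_)

# slowly(): wait between successive calls to spare the remote service
polite_lookup <- slowly(safe_lookup, rate = rate_delay(pause = 0.1))

results <- map(c("Q1", "oops", "Q22"), polite_lookup)
results
```

The failing call yields `NA` rather than aborting the whole `map()`, which is why a single bad entity does not interrupt a long run.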
#' Add a prefix to the OSM ID according to the element geometry type
#' 
#' It will allow us to compare `osm_id` and `wd_osm_id`
#'
#' @param x (sf) OSM data sub-object
#'
#' @returns x with the `osm_id` field prefixed with "n" for "node",
#'    "w" for "way" and "r" for relation
#' @examples prefix_osm_id(osm_wd$osm_points)
prefix_osm_id <- function(x, p = c("n", "w", "r")) {
  # Note: the `p` argument is overridden below; the prefix is always derived
  # from the geometry type of `x`
  geom_type <- sf::st_geometry_type(x, by_geometry = FALSE)
  p <- case_when(geom_type == "POINT" ~ "n",
                 geom_type %in% c("LINESTRING", "POLYGON") ~ "w",
                 geom_type %in% c("MULTILINESTRING", "MULTIPOLYGON") ~ "r",
                 .default = "")
  
  x |> 
    mutate(osm_id = glue("{p}{osm_id}"))
}
With these functions we can look at our OSM data, keep the elements that have a Wikidata attribute, and fetch the OSM IDs recorded in the corresponding Wikidata entities. We can then check whether the IDs match and, where they are missing, add them manually in Wikidata. Since the OSM data is split into different objects according to the geometry type, we need to do this for each of them.
osm_wd_augmented <- list(
  prefix_osm_id(osm_wd$osm_points, "n"),
  prefix_osm_id(osm_wd$osm_lines, "w"),
  prefix_osm_id(osm_wd$osm_polygons, "w"),
  prefix_osm_id(osm_wd$osm_multilines, "r"),
  prefix_osm_id(osm_wd$osm_multipolygons, "r")) |>
  map(\(x) {
    x |> 
      filter(!is.na(wikidata) & wikidata != "") |> 
      add_osm_id_from_wikidata() |> 
      select(osm_id, name, wikidata, wd_osm_id) |> 
      unnest(wd_osm_id, keep_empty = TRUE)}) |> 
  list_rbind() |> 
  distinct()
We get a new variable wd_osm_id, which means “this is one of the OSM identifiers recorded in the Wikidata entity that the OSM element points to”.
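The reshaping done by `unnest(..., keep_empty = TRUE)` can be illustrated on a toy table. The IDs below are made up; only the shape of the list-column matches what the pipeline produces.

```r
library(tibble)
library(tidyr)

# Toy data (made-up IDs): one list entry per OSM element
toy <- tibble(
  osm_id    = c("n1", "w2", "r3"),
  wikidata  = c("Q1", "Q2", "Q3"),
  wd_osm_id = list(c("n1", "r9"), NULL, "r3")
)

# One row per (element, Wikidata-side ID) pair; empty entries are kept as NA
toy_long <- unnest(toy, wd_osm_id, keep_empty = TRUE)
toy_long
```

The element with two IDs becomes two rows, and the element whose entry is NULL is kept as a single row with `NA`, so it is not silently dropped.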
Use cases
For example, to see the OSM elements whose Wikidata entity does not link back:
osm_wd_augmented |> filter(is.na(wd_osm_id)) |> arrange(name)
# A tibble: 29 × 4
   osm_id      name                                 wikidata   wd_osm_id
   <glue>      <chr>                                <chr>      <chr>
 1 w194448267  Cenischia                            Q3539637   <NA>
 2 n41645050   Champagny-en-Vanoise                 Q34791526  <NA>
 3 w108980863  Chapelle Notre-Dame de la Visitation Q13518652  <NA>
 4 w131213010  Chapelle Saint-Antoine               Q22975798  <NA>
 5 w131213054  Chapelle Saint-Sébastien             Q22968509  <NA>
 6 n4573291011 Cinéma Chantelouve                   Q61858809  <NA>
 7 r2149907    Communauté de Communes Terra Modana  Q17355571  <NA>
 8 w37792978   Dora di Bardonecchia                 Q3714186   <NA>
 9 w131215535  Espace Baroque                       Q22968504  <NA>
10 r377905     GR 55 La Vanoise                     Q124149580 <NA>
# ℹ 19 more rows
Some may be legitimate (?), but others may need editing…
Another example: checking for inconsistencies between osm_id and wd_osm_id:
osm_wd_augmented |> filter(osm_id != wd_osm_id) |> arrange(name)
# A tibble: 25 × 4
   osm_id       name                                          wikidata wd_osm_id
   <glue>       <chr>                                         <chr>    <chr>
 1 n26691864    Albertville                                   Q159469  r111528
 2 n26691864    Albertville                                   Q159469  r17160712
 3 n41644953    Aussois                                       Q567783  r89823
 4 r2149905     Communauté de Communes de Haute Maurienne-Va… Q2987514 r6876759
 5 n11646993705 Dent Parrachée                                Q1189850 n6705389…
 6 w1257820968  Ferrovia del Moncenisio                       Q950823  r15987785
 7 w1257820969  Ferrovia del Moncenisio                       Q950823  r15987785
 8 w1257820970  Ferrovia del Moncenisio                       Q950823  r15987785
 9 w1257820971  Ferrovia del Moncenisio                       Q950823  r15987785
10 w993297614   Glacier de Méan-Martin                        Q348352… w42239115
# ℹ 15 more rows
It could indicate that the OSM elements have been heavily edited or deleted/recreated without updating the corresponding Wikidata entity.
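Because the comparison is done row by row, an element whose Wikidata entity lists several OSM IDs is reported as soon as one of them differs, even if another one matches. As a possible refinement (a sketch on made-up toy data, not part of the workflow above), we can report an element only when none of the IDs recorded in Wikidata matches it.

```r
library(dplyr)

# Toy data (made-up IDs) with the same columns as osm_wd_augmented
toy <- tibble::tibble(
  osm_id    = c("n1", "n1", "w2", "r3", "w4"),
  name      = c("A", "A", "B", "C", "D"),
  wd_osm_id = c("n1", "r9", "r8", NA, "w4")
)

# Row-wise comparison, as in the text: "A" shows up because of its second ID
row_mismatch <- toy |>
  filter(osm_id != wd_osm_id)

# Element-wise comparison: keep an element only if no Wikidata-side ID matches
element_mismatch <- toy |>
  filter(!is.na(wd_osm_id)) |>
  group_by(osm_id) |>
  filter(!any(osm_id == wd_osm_id)) |>
  ungroup()

element_mismatch
```

With this stricter filter only element "B" remains, since "A" has one matching ID among its two.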
Corrections require manual back and forth between R, the OSM and Wikidata websites, but these utilities make it quite easy to improve data quality.
