[This article was first published on   Spatial Processing in R, and kindly contributed to R-bloggers].  (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
            Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The problem
Last week, I replied to this interesting question posted by @Tim_K over stackoverflow. He was seeking efficient solutions to identify all points falling within a maximum distance of xx meters with respect to each single point in a spatial points dataset.If you have a look at the thread, you will see that a simple solution based on creating a “buffered” polygon dataset beforehand and then intersecting it with the original points is quite fast for “reasonably sized” datasets, thanks to sf spatial indexing capabilities which reduce the number of the required comparisons to be done (See http://r-spatial.org/r/2017/06/22/spatial-index.html). In practice, something like this:
# create test data: 50000 uniformly distributed points on a "square" of 100000
# metres
maxdist <- 500
pts     <- data.frame(x = runif(50000, 0, 100000),
                      y = runif(50000, 0, 100000),
                      id = 1:50000) %>%
  sf::st_as_sf(coords = c("x", "y"))
# create buffered polygons
pts_buf <- sf::st_buffer(pts, maxdist)
# Find points within 500 meters wrt each point
int <- sf::st_intersects(pts_buf, pts)
int
## Sparse geometry binary predicate list of length 50000, where the predicate was `intersects'
## first 10 elements:
##  1: 1, 11046, 21668, 25417
##  2: 2, 8720, 12595, 23620, 26926, 27169, 39484
##  3: 3, 11782, 20058, 27869, 33151, 47864
##  4: 4, 35665, 45691
##  5: 5, 37093, 37989
##  6: 6, 31487
##  7: 7, 38433, 42009, 45597, 49806
##  8: 8, 12129, 31486
##  9: 9, 27840, 35577, 36797, 40906
##  10: 10, 15482, 16615, 26103, 41417However, this starts to have problems over really large datasets, because the total number of comparisons to be done still rapidly increase besides the use of spatial indexes. A test done by changing the number of points in the above example in the range 25000 – 475000 shows for example this kind of behavior, for two different values of maxdist (500 and 2000 m):
On the test dataset, the relationships are almost perfectly quadratic (due to the uniform distribution of points). Extrapolating them to the 12 Million points dataset of the OP, we would get an execution time of about 14 hours for maxdist = 500, and a staggering 3.5 days formaxdist = 2000. Still doable, but not ideal…My suggestion to the OP was therefore to “split” the points in chunks based on the x-coordinate and then work on a per-split basis, eventually assigning each chunk to a different core within a parallellized cycle.In the end, I got curious and decided to give it a go to see what kind of performance improvement it was possible to obtain with that kind of approach. You can find results of some tests below.
Last week, I replied to this interesting question posted by @Tim_K over stackoverflow. He was seeking efficient solutions to identify all points falling within a maximum distance of xx meters with respect to each single point in a spatial points dataset.
If you have a look at the thread, you will see that a simple solution based on creating a “buffered” polygon dataset beforehand and then intersecting it with the original points is quite fast for “reasonably sized” datasets, thanks to 
sf spatial indexing capabilities which reduce the number of the required comparisons to be done (See http://r-spatial.org/r/2017/06/22/spatial-index.html). In practice, something like this:# create test data: 50000 uniformly distributed points on a "square" of 100000
# metres
maxdist <- 500
pts     <- data.frame(x = runif(50000, 0, 100000),
                      y = runif(50000, 0, 100000),
                      id = 1:50000) %>%
  sf::st_as_sf(coords = c("x", "y"))
# create buffered polygons
pts_buf <- sf::st_buffer(pts, maxdist)
# Find points within 500 meters wrt each point
int <- sf::st_intersects(pts_buf, pts)
int
## Sparse geometry binary predicate list of length 50000, where the predicate was `intersects'
## first 10 elements:
##  1: 1, 11046, 21668, 25417
##  2: 2, 8720, 12595, 23620, 26926, 27169, 39484
##  3: 3, 11782, 20058, 27869, 33151, 47864
##  4: 4, 35665, 45691
##  5: 5, 37093, 37989
##  6: 6, 31487
##  7: 7, 38433, 42009, 45597, 49806
##  8: 8, 12129, 31486
##  9: 9, 27840, 35577, 36797, 40906
##  10: 10, 15482, 16615, 26103, 41417However, this starts to have problems over really large datasets, because the total number of comparisons to be done still rapidly increase besides the use of spatial indexes. A test done by changing the number of points in the above example in the range 25000 – 475000 shows for example this kind of behavior, for two different values of 
maxdist (500 and 2000 m):On the test dataset, the relationships are almost perfectly quadratic (due to the uniform distribution of points). Extrapolating them to the 12 Million points dataset of the OP, we would get an execution time of about 14 hours for 
maxdist = 500, and a staggering 3.5 days formaxdist = 2000. Still doable, but not ideal…My suggestion to the OP was therefore to “split” the points in chunks based on the x-coordinate and then work on a per-split basis, eventually assigning each chunk to a different core within a parallellized cycle.
In the end, I got curious and decided to give it a go to see what kind of performance improvement it was possible to obtain with that kind of approach. You can find results of some tests below.
A (possible) solution: Speeding up computation by combining data.table and sf_intersect
The idea here is to use a simple divide-and-conquer approach.We first split the total spatial extent of the dataset in a certain number of regular quadrants. We then iterate over the quadrants and for each one we:- Extract the points contained into the quadrant and apply a buffer to them;
- Extract the points contained in a slightly larger area, computed by expanding the quadrant by an amount equal to the maximum distance for which we want to identify the “neighbors”;
- Compute and save the intersection between the buffered points and the points contained in the “expanded” quadrant
“Graphically”, this translates to exploring the dataset like this:
, where the points included in the current “quadrant” are shown in green and the additional points needed to perform the analysis for that quadrant are shown in red.
Provided that the subsetting operations do not introduce an excessive overhead (i.e., they are fast enough…) this should provide a performance boost, because it should consistently reduce the total number of comparisons to be done.Now, every “R” expert will tell you that if you need to perform fast subsetting over large datasets the way to go is to use properly indexeddata.tables, which provide lightning-speed subsetting capabilities.So, let’s see how we could code this in a functions:points_in_distance <- function(in_pts,
                               maxdist,
                               ncuts = 10) {
  require(data.table)
  require(sf)
  # convert points to data.table and create a unique identifier
  pts <-  data.table(in_pts)
  pts <- pts[, or_id := 1:dim(in_pts)[1]]
  # divide the extent in quadrants in ncuts*ncuts quadrants and assign each
  # point to a quadrant, then create the index over "x" to speed-up
  # the subsetting
  range_x  <- range(pts$x)
  limits_x <-(range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts)
  range_y  <- range(pts$y)
  limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
  pts[, `:=`(xcut =  as.integer(cut(x, ncuts, labels = 1:ncuts)),
             ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))]  %>%
    setkey(x)
  results <- list()
  count <- 0
  # start cycling over quadrants
  for (cutx in seq_len(ncuts)) {
    # get the points included in a x-slice extended by `maxdist`, and build
    # an index over y to speed-up subsetting in the inner cycle
    min_x_comp    <- ifelse(cutx == 1,
                            limits_x[cutx],
                            (limits_x[cutx] - maxdist))
    max_x_comp    <- ifelse(cutx == ncuts,
                            limits_x[cutx + 1],
                            (limits_x[cutx + 1] + maxdist))
    subpts_x <- pts[x >= min_x_comp & x < max_x_comp] %>%
      setkey(y)
    for (cuty in seq_len(ncuts)) {
      count <- count + 1
      # subset over subpts_x to find the final set of points needed for the
      # comparisons
      min_y_comp  <- ifelse(cuty == 1,
                            limits_y[cuty],
                            (limits_x[cuty] - maxdist))
      max_y_comp  <- ifelse(cuty == ncuts,
                            limits_x[cuty + 1],
                            (limits_x[cuty + 1] + maxdist))
      subpts_comp <- subpts_x[y >= min_y_comp & y < max_y_comp]
      # subset over subpts_comp to get the points included in a x/y chunk,
      # which "neighbours" we want to find. Then buffer them by maxdist.
      subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
        sf::st_as_sf() %>% 
        sf::st_buffer(maxdist)
      # retransform to sf since data.tables lost the geometric attrributes
      subpts_comp <- sf::st_as_sf(subpts_comp)
      # compute the intersection and save results in a element of "results".
      # For each point, save its "or_id" and the "or_ids" of the points within "dist"
      inters <- sf::st_intersects(subpts_buf, subpts_comp)
      # save results
      results[[count]] <- data.table(
        id = subpts_buf$or_id,
        int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))
    }
  }
  data.table::rbindlist(results)
}The function takes as input a points sf object, a target distance and a number of “cuts” to use to divide the extent in quadrants, and provides in output a data frame in which, for each original point, the “ids” of the points within maxdist are reported in the int_ids list column.Now, let’s see if this works:pts <- data.frame(x = runif(20000, 0, 100000),
                  y = runif(20000, 0, 100000),
                  id = 1:20000) %>%
  st_as_sf(coords = c("x", "y"), remove = FALSE)
maxdist <- 2000
out <- points_in_distance(pts, maxdist = maxdist, ncut = 10)
out
##           id                              int_ids
##     1: 15119  2054,18031, 9802, 8524, 4107, 7412,
##     2: 14392 12213,10696, 6399,14392, 4610, 1200,
##     3: 14675  2054,18031, 9802, 8524, 4107, 7412,
##     4:  3089 12293,18031, 8524, 4107,12727, 2726,
##     5: 17282       9802,8524,4107,7412,2726,1275,
##    ---                                           
## 19995:   248 16610, 7643, 8059,15998,16680, 1348,
## 19996:  8433 16821,15638,16680, 3876,13851, 1348,
## 19997: 17770 11060, 7643, 8059,19868, 7776,10146,
## 19998: 11963  9948, 9136,15956,18512, 9219, 8925,
## 19999: 15750  5291,18093,14462,15362,12575, 5189,
# get a random point
sel_id  <- sample(pts$id,1)
pt_sel  <- pts[sel_id, ]
pt_buff <- pt_sel %>%  sf::st_buffer(maxdist)
# get ids of points within maxdist
id_inters <- unlist(out[id == sel_id, ]$int_ids)
pt_inters <- pts[id_inters,]
#plot results
ggplot(pt_buff)  + theme_light() +
  geom_point(data = pts, aes(x = x, y = y), size = 1) +
  geom_sf(col = "blue", size = 1.2, fill = "transparent") +
  geom_sf(data = pt_inters, col = "red", size = 1.5) +
  geom_point(data = pt_sel, aes(x = x, y = y), size = 2, col = "green") +
  xlim(st_bbox(pt_buff)[1] - maxdist, st_bbox(pt_buff)[3] + maxdist) +
  ylim(st_bbox(pt_buff)[2] - maxdist, st_bbox(pt_buff)[4] + maxdist) + 
  ggtitle(paste0("id = ", sel_id, " - Number of points within distance = ", length(id_inters)))
So far, so good. Now, let’s do the same exercise with varying number of points to see how it behaves in term of speed:
Already not bad! In particular for the maxdist = 2000 case, we get a quite large speed improvement!However, a nice thing about the points_in_distance approach is that it is easily parallelizable. All is needed is to change some lines of the function so that the outer loop over the x “chunks” exploits a parallel backend of some kind. (You can find an example implementation exploiting foreachin this gist)Looking good! Some more skilled programmer could probably squeeze out even more speed from it by some additional data.table magic, but the improvement is very noticeable.In terms of execution time, extrapolating again to the “infamous” 12 Million points dataset, this would be what we get:
    Method     Maxdist     Expected completion time (hours)        st_intersect     500     15.00        points_in_distance – serial     500     2.50        points_in_distance – parallel     500     0.57        st_intersect     2000     85.00        points_in_distance – serial     2000     15.20        points_in_distance – parallel     2000     3.18     
So, we get a 5-6X speed improvement already on the “serial” implementation, and another 5X thanks to parallelization over 6 cores! On themaxdist = 2000 case, this means going from more than 3 days to about 3 hours. And if we had more cores and RAM to throw at it, it would finish in minutes!
Nice!
The idea here is to use a simple divide-and-conquer approach.
We first split the total spatial extent of the dataset in a certain number of regular quadrants. We then iterate over the quadrants and for each one we:
- Extract the points contained into the quadrant and apply a buffer to them;
- Extract the points contained in a slightly larger area, computed by expanding the quadrant by an amount equal to the maximum distance for which we want to identify the “neighbors”;
- Compute and save the intersection between the buffered points and the points contained in the “expanded” quadrant
“Graphically”, this translates to exploring the dataset like this:
, where the points included in the current “quadrant” are shown in green and the additional points needed to perform the analysis for that quadrant are shown in red.
Provided that the subsetting operations do not introduce an excessive overhead (i.e., they are fast enough…) this should provide a performance boost, because it should consistently reduce the total number of comparisons to be done.
, where the points included in the current “quadrant” are shown in green and the additional points needed to perform the analysis for that quadrant are shown in red.
Provided that the subsetting operations do not introduce an excessive overhead (i.e., they are fast enough…) this should provide a performance boost, because it should consistently reduce the total number of comparisons to be done.
Now, every “R” expert will tell you that if you need to perform fast subsetting over large datasets the way to go is to use properly indexed
data.tables, which provide lightning-speed subsetting capabilities.So, let’s see how we could code this in a functions:
points_in_distance <- function(in_pts,
                               maxdist,
                               ncuts = 10) {
  require(data.table)
  require(sf)
  # convert points to data.table and create a unique identifier
  pts <-  data.table(in_pts)
  pts <- pts[, or_id := 1:dim(in_pts)[1]]
  # divide the extent in quadrants in ncuts*ncuts quadrants and assign each
  # point to a quadrant, then create the index over "x" to speed-up
  # the subsetting
  range_x  <- range(pts$x)
  limits_x <-(range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts)
  range_y  <- range(pts$y)
  limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
  pts[, `:=`(xcut =  as.integer(cut(x, ncuts, labels = 1:ncuts)),
             ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))]  %>%
    setkey(x)
  results <- list()
  count <- 0
  # start cycling over quadrants
  for (cutx in seq_len(ncuts)) {
    # get the points included in a x-slice extended by `maxdist`, and build
    # an index over y to speed-up subsetting in the inner cycle
    min_x_comp    <- ifelse(cutx == 1,
                            limits_x[cutx],
                            (limits_x[cutx] - maxdist))
    max_x_comp    <- ifelse(cutx == ncuts,
                            limits_x[cutx + 1],
                            (limits_x[cutx + 1] + maxdist))
    subpts_x <- pts[x >= min_x_comp & x < max_x_comp] %>%
      setkey(y)
    for (cuty in seq_len(ncuts)) {
      count <- count + 1
      # subset over subpts_x to find the final set of points needed for the
      # comparisons
      min_y_comp  <- ifelse(cuty == 1,
                            limits_y[cuty],
                            (limits_x[cuty] - maxdist))
      max_y_comp  <- ifelse(cuty == ncuts,
                            limits_x[cuty + 1],
                            (limits_x[cuty + 1] + maxdist))
      subpts_comp <- subpts_x[y >= min_y_comp & y < max_y_comp]
      # subset over subpts_comp to get the points included in a x/y chunk,
      # which "neighbours" we want to find. Then buffer them by maxdist.
      subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
        sf::st_as_sf() %>% 
        sf::st_buffer(maxdist)
      # retransform to sf since data.tables lost the geometric attrributes
      subpts_comp <- sf::st_as_sf(subpts_comp)
      # compute the intersection and save results in a element of "results".
      # For each point, save its "or_id" and the "or_ids" of the points within "dist"
      inters <- sf::st_intersects(subpts_buf, subpts_comp)
      # save results
      results[[count]] <- data.table(
        id = subpts_buf$or_id,
        int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))
    }
  }
  data.table::rbindlist(results)
}The function takes as input a points 
sf object, a target distance and a number of “cuts” to use to divide the extent in quadrants, and provides in output a data frame in which, for each original point, the “ids” of the points within maxdist are reported in the int_ids list column.Now, let’s see if this works:
pts <- data.frame(x = runif(20000, 0, 100000),
                  y = runif(20000, 0, 100000),
                  id = 1:20000) %>%
  st_as_sf(coords = c("x", "y"), remove = FALSE)
maxdist <- 2000
out <- points_in_distance(pts, maxdist = maxdist, ncut = 10)
out
##           id                              int_ids
##     1: 15119  2054,18031, 9802, 8524, 4107, 7412,
##     2: 14392 12213,10696, 6399,14392, 4610, 1200,
##     3: 14675  2054,18031, 9802, 8524, 4107, 7412,
##     4:  3089 12293,18031, 8524, 4107,12727, 2726,
##     5: 17282       9802,8524,4107,7412,2726,1275,
##    ---                                           
## 19995:   248 16610, 7643, 8059,15998,16680, 1348,
## 19996:  8433 16821,15638,16680, 3876,13851, 1348,
## 19997: 17770 11060, 7643, 8059,19868, 7776,10146,
## 19998: 11963  9948, 9136,15956,18512, 9219, 8925,
## 19999: 15750  5291,18093,14462,15362,12575, 5189,
# get a random point
sel_id  <- sample(pts$id,1)
pt_sel  <- pts[sel_id, ]
pt_buff <- pt_sel %>%  sf::st_buffer(maxdist)
# get ids of points within maxdist
id_inters <- unlist(out[id == sel_id, ]$int_ids)
pt_inters <- pts[id_inters,]
#plot results
ggplot(pt_buff)  + theme_light() +
  geom_point(data = pts, aes(x = x, y = y), size = 1) +
  geom_sf(col = "blue", size = 1.2, fill = "transparent") +
  geom_sf(data = pt_inters, col = "red", size = 1.5) +
  geom_point(data = pt_sel, aes(x = x, y = y), size = 2, col = "green") +
  xlim(st_bbox(pt_buff)[1] - maxdist, st_bbox(pt_buff)[3] + maxdist) +
  ylim(st_bbox(pt_buff)[2] - maxdist, st_bbox(pt_buff)[4] + maxdist) + 
  ggtitle(paste0("id = ", sel_id, " - Number of points within distance = ", length(id_inters)))So far, so good. Now, let’s do the same exercise with varying number of points to see how it behaves in term of speed:
Already not bad! In particular for the 
maxdist = 2000 case, we get a quite large speed improvement!However, a nice thing about the 
points_in_distance approach is that it is easily parallelizable. All is needed is to change some lines of the function so that the outer loop over the x “chunks” exploits a parallel backend of some kind. (You can find an example implementation exploiting foreachin this gist)Looking good! Some more skilled programmer could probably squeeze out even more speed from it by some additional 
data.table magic, but the improvement is very noticeable.In terms of execution time, extrapolating again to the “infamous” 12 Million points dataset, this would be what we get:
| Method | Maxdist | Expected completion time (hours) | 
|---|---|---|
| st_intersect | 500 | 15.00 | 
| points_in_distance – serial | 500 | 2.50 | 
| points_in_distance – parallel | 500 | 0.57 | 
| st_intersect | 2000 | 85.00 | 
| points_in_distance – serial | 2000 | 15.20 | 
| points_in_distance – parallel | 2000 | 3.18 | 
So, we get a 5-6X speed improvement already on the “serial” implementation, and another 5X thanks to parallelization over 6 cores! On the
maxdist = 2000 case, this means going from more than 3 days to about 3 hours. And if we had more cores and RAM to throw at it, it would finish in minutes!Nice!
Final Notes
- The timings shown here are merely indicative, and related to the particular test-dataset we built. On a less uniformly distributed dataset I would expect a lower speed improvement.
- Some time is “wasted” because sfdoes not (yet) extenddata.tables, making it necessary to recreatesfobjects from thedata.tablesubsets.
- The parallel implementation is quick-and-dirty, and it is a bit of a memory-hog! Be careful before throwing at it 25 processors!
- Speed is influenced in a non-trivial way by the number of “cuts” used to subdivide the spatial extent. There may be a sweet-spot related to points distribution and maxdist allowing reaching maximum speed.
- A similar approach for parallelization could exploit repeatedly “cropping” the original sfpoints object over the extent of the chunk/extended chunk. Thedata.tableapproach seems however to be faster.
That’s all! Hope you liked this (rather long) post!
		
            
- The timings shown here are merely indicative, and related to the particular test-dataset we built. On a less uniformly distributed dataset I would expect a lower speed improvement.
- Some time is “wasted” becausesfdoes not (yet) extenddata.tables, making it necessary to recreatesfobjects from thedata.tablesubsets.
- The parallel implementation is quick-and-dirty, and it is a bit of a memory-hog! Be careful before throwing at it 25 processors!
- Speed is influenced in a non-trivial way by the number of “cuts” used to subdivide the spatial extent. There may be a sweet-spot related to points distribution and maxdist allowing reaching maximum speed.
- A similar approach for parallelization could exploit repeatedly “cropping” the originalsfpoints object over the extent of the chunk/extended chunk. Thedata.tableapproach seems however to be faster.
To leave a comment for the author, please follow the link and comment on their blog:  Spatial Processing in R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
