Comparison of Partition Around Medoid R programming Implementations

[This article was first published on mlampros, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Back in September 2016 I implemented the ClusterR package. One of the algorithms included in ClusterR was the ‘Partition Around Medoids’ (Cluster_Medoids) algorithm which was based on the paper “Anja Struyf, Mia Hubert, Peter J. Rousseeuw, (Feb. 1997), Clustering in an Object-Oriented Environment, Journal of Statistical Software, Vol 1, Issue 4” (at that time I didn’t have access to the book of Kaufman and Rousseeuw, Finding Groups in Data (1990) where the exact algorithm was described), thus I implemented the code and compared my results with the output of the cluster::pam() function, which was available at that time. Thus, my method was not an exact but an approximate one. Recently, a user of the ClusterR package opened an issue mentioning that the results were not optimal compared to the cluster::pam() function and this allowed me to go through my code once again and also to compare my results to new R packages that were not existent at that time. Most of these R packages include a new version of the ‘Partition Around Medoids’ algorithm, “Erich Schubert, Peter J. Rousseeuw,”Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms” 2019, ”.

In this blog-post, I’ll use the following R packages,

to compare between the (as of December 2022) existing ‘Partition Around Medoids’ implementations in terms of output dissimilarity cost and elapsed time.


# required R packages

require(ClusterR)

## Loading required package: ClusterR

## Loading required package: gtools

require(cluster)

## Loading required package: cluster

require(kmed)

## Loading required package: kmed

require(fastkmedoids)

## Loading required package: fastkmedoids

## 
## Attaching package: 'fastkmedoids'

## The following object is masked from 'package:cluster':
## 
##     pam

require(sf)

## Loading required package: sf

## Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2; sf_use_s2() is TRUE

require(data.table)

## Loading required package: data.table

require(geodist)

## Loading required package: geodist

require(glue)

## Loading required package: glue

require(magrittr)

## Loading required package: magrittr

require(ggplot2)

## Loading required package: ggplot2

require(mapview)

## Loading required package: mapview

require(knitr)

## Loading required package: knitr


Datasets


For comparison purposes, I’ll use the following datasets,

  • a 2-column dataset using values of the Normal Distribution (standard deviation is equal to 0.25 and the mean parameter takes the value of 0 or 1)
  • the ‘dietary_survey_IBS’ dataset which exists in the ClusterR package
  • the ‘soybean’ dataset which exists in the ClusterR package
  • the ‘agriculture’ dataset which exists in the cluster package
  • a ‘geospatial’ dataset that shows how nicely the ‘Partition Around Medoids’ algorithm can cluster coordinate points based on a ‘geodesic’ distance matrix (assuming we can visually decide on the number of clusters)

The next function contains the previously mentioned datasets,


datasets_clust = function(data_name) {
  
  if (data_name == 'rnorm_data') {
    n = 100
    set.seed(1)
    x = rbind(matrix(rnorm(n, mean = 0, sd = 0.25), ncol = 2),
              matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2))
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'dietary_survey_IBS') {
    data(dietary_survey_IBS, package = 'ClusterR')
    x = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'soybean') {
    data(soybean, package = 'ClusterR')
    x = soybean[, -ncol(soybean)]
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'agriculture') {
    data(agriculture, package = 'cluster')
    x = agriculture
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'geospatial') {
    wkt_lst = list(black_sea = "POLYGON ((31.47957 43.64944, 31.47957 42.82356, 36.17885 42.82356, 36.17885 43.64944, 31.47957 43.64944))", 
                   caspian_sea = "POLYGON ((49.23243 42.75324, 49.23243 41.99335, 51.54848 41.99335, 51.54848 42.75324, 49.23243 42.75324))", 
                   mediterranean_sea = "POLYGON ((16.55062 35.08059, 16.55062 34.13804, 20.20771 34.13804, 20.20771 35.08059, 16.55062 35.08059))", 
                   red_sea = "POLYGON ((36.61694 23.61829, 36.61694 22.60845, 38.18655 22.60845, 38.18655 23.61829, 36.61694 23.61829))")
   
    nam_wkts = names(wkt_lst)
    CRS_wkt = 4326
    sample_point_size = 250              # sample that many random points from each sea WKT polygon
    
    count = 1
    all_dfs = list()
    
    for (nam in nam_wkts) {
      WKT = wkt_lst[[nam]]
      read_wkt_inp = sf::st_as_sfc(WKT, crs = sf::st_crs(CRS_wkt))
      # to sample random points see:  https://stackoverflow.com/a/70073632/8302386
      random_lat_lons = sf::st_sample(x = read_wkt_inp, size = sample_point_size, type = "random", crs = sf::st_crs(CRS_wkt))
      random_df = sf::st_coordinates(random_lat_lons)
      random_df = data.table::as.data.table(random_df)
      colnames(random_df) = c('lon', 'lat')
      random_df$sea = rep(x = nam, times = nrow(random_df))
      random_df$code = rep(x = count, times = nrow(random_df))
      all_dfs[[count]] = random_df
      
      count = count + 1
    }
    
    dat = data.table::rbindlist(all_dfs)
    dat_sf = sf::st_as_sf(dat, coords = c('lon', 'lat'), crs = CRS_wkt)

    # add also an outlier which is between the 'mediterranean', 'black' and 'red' sea
    # and will be assigned to the one that is closest in terms of distance
    outlier_lon = 34.8988917972
    outlier_lat = 35.0385655983
    dat_sf_update = data.frame(lon = outlier_lon, lat = outlier_lat)
    
    x_mt_update = geodist::geodist(x = dat[, 1:2], y = dat_sf_update, measure = 'geodesic')
    x_mt_update = data.table::data.table(x_mt_update)
    x_mt_update$sea = dat$sea
    x_mt_update = x_mt_update[order(x_mt_update$V1, decreasing = F), ]

    dat_sf_update_obs = data.frame(lon = outlier_lon, 
                                   lat = outlier_lat, 
                                   sea = x_mt_update$sea[1], 
                                   code = unique(subset(dat, sea == x_mt_update$sea[1])$code))   # it is assigned (based on distance) to the 'black sea' although it lies in the meditteranean
    dat_use = rbind(dat_sf_update_obs, dat)
    
    # leaflet map
    dat_sf_upd = sf::st_as_sf(dat_use, coords = c('lon', 'lat'), crs = CRS_wkt)
    mp = mapview::mapview(dat_sf_upd, zcol = 'sea', legend = TRUE)

    x = dat_use[, 1:2]
    lst_out = list(x = x, 
                   dataset = data_name, 
                   sea_names = dat_use$sea,
                   code = dat_use$code,
                   leaflet_map = mp)
  }
  else {
    stop(glue::glue("The dataset '{data_name}' does not exist!"), call. = FALSE)
  }
  
  return(lst_out)
}


‘Partion Around Medoids’ R package implementations


The next function includes the

  • cluster::pam(do.swap = TRUE, variant = ‘original’)
  • cluster::pam(do.swap = TRUE, variant = ‘faster’)
  • kmed::skm()
  • fastkmedoids::pam()
  • fastkmedoids::fastpam(initializer = “LAB”)
  • ClusterR::Cluster_Medoids(swap_phase = TRUE)

implementations and will allow comparing the dissimilarity cost and elapsed time for the mentioned datasets. In the ClusterR package, I included the ClusterR::cost_clusters_from_dissim_medoids() function that takes

  • a dissimilarity matrix
  • the output medoids of each implementation

and returns the total dissimilarity cost.


compare_medoid_implementations = function(x,
                                          y = NULL,
                                          max_k = 15,
                                          geodesic_dist = FALSE,
                                          round_digits = 5,
                                          compute_rand_index = FALSE) {
  if (!geodesic_dist) {
    x = ClusterR::center_scale(data = x, mean_center = TRUE, sd_scale = TRUE)
    x_mt = ClusterR::distance_matrix(x, method = "euclidean", upper = TRUE, diagonal = TRUE)
  }
  else {
    x_mt = geodist::geodist(x = x, measure = 'geodesic')
  }
  
  dv = as.vector(stats::dist(x))    # compute distance matrix for 'fastkmedoids' (lower triangular part)
  results = cluster_out = list()
  
  for (k in 2:max_k) {

    # cluster [ 'original' ]
    t_start = proc.time()
    set.seed(k)
    pm = cluster::pam(x = x_mt, k, metric = "euclidean", do.swap = TRUE, stand = FALSE, variant = 'original')
    pm_secs = as.numeric((proc.time() - t_start)["elapsed"])
    pm_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = pm$id.med)$cost
    
    # cluster [ 'faster' ]
    t_start = proc.time()
    set.seed(k)
    pm_fast = cluster::pam(x = x_mt, k, metric = "euclidean", do.swap = TRUE, stand = FALSE, variant = 'faster')
    pm_fast_secs = as.numeric((proc.time() - t_start)["elapsed"])
    pm_fast_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = pm_fast$id.med)$cost
    
    # kmed
    t_start = proc.time()
    km = kmed::skm(distdata = x_mt, ncluster = k, seeding = k, iterate = 10)
    km_secs = as.numeric((proc.time() - t_start)["elapsed"])
    km_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = km$medoid)$cost
    
    # fastkmedoids  ('pam' function)
    t_start = proc.time()
    set.seed(k)
    fkm = fastkmedoids::pam(rdist = dv, n = nrow(x), k = k)
    fkm_secs = as.numeric((proc.time() - t_start)["elapsed"])
    fkm_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = [email protected] + 1)$cost       # output indices correspond to Cpp indexing, thus add 1

    # fastkmedoids  ('fastpam' function with 'initializer' set to 'LAB')
    t_start = proc.time()
    fkm_lab = fastkmedoids::fastpam(rdist = dv, n = nrow(x), k = k, initializer = "LAB", seed = k)
    fkm_lab_secs = as.numeric((proc.time() - t_start)["elapsed"])
    fkm_lab_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = [email protected] + 1)$cost       # output indices correspond to Cpp indexing, thus add 1
    
    # ClusterR
    t_start = proc.time()
    clst = ClusterR::Cluster_Medoids(data = x_mt, clusters = k, verbose = FALSE, swap_phase = TRUE, seed = k)
    clst_secs = as.numeric((proc.time() - t_start)["elapsed"])
    clst_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = clst$medoids)$cost
    
    clust_algos = c('cluster_pam',
                    'cluster_pam_fast',
                    'kmed_skm',
                    'fastkmedoids_pam',
                    'fastkmedoids_fastpam',
                    'ClusterR_Medoids')
    
    dissim_costs = c(round(pm_cost, digits = round_digits),
                     round(pm_fast_cost, digits = round_digits),
                     round(km_cost, digits = round_digits),
                     round(fkm_cost, digits = round_digits),
                     round(fkm_lab_cost, digits = round_digits),
                     round(clst_cost, digits = round_digits))
    
    time_bench = c(round(pm_secs, digits = round_digits),
                   round(pm_fast_secs, digits = round_digits),
                   round(km_secs, digits = round_digits),
                   round(fkm_secs, digits = round_digits),
                   round(fkm_lab_secs, digits = round_digits),
                   round(clst_secs, digits = round_digits))
    
    dtbl = list(pam_R_function = clust_algos, 
                dissim_cost = dissim_costs,
                timing = time_bench,
                k = rep(k, length(clust_algos)))
    
    if (compute_rand_index) {
      # rand-index (or accuracy)
      clust_acc = list(k = k,
                       cluster_pam = ClusterR::external_validation(true_labels = y, clusters = pm$clustering, method = 'rand_index'),
                       cluster_pam_fast = ClusterR::external_validation(true_labels = y, clusters = pm_fast$clustering, method = 'rand_index'),
                       kmed_skm = ClusterR::external_validation(true_labels = y, clusters = km$cluster, method = 'rand_index'),
                       fastkmedoids_pam = ClusterR::external_validation(true_labels = y, clusters = [email protected], method = 'rand_index'),
                       fastkmedoids_fastpam = ClusterR::external_validation(true_labels = y, clusters = [email protected], method = 'rand_index'),
                       ClusterR_Medoids = ClusterR::external_validation(true_labels = y, clusters = clst$clusters, method = 'rand_index'))
      
      data.table::setDT(clust_acc)
      cluster_out[[k]] = clust_acc
    }
    
    data.table::setDT(dtbl)
    results[[k]] = dtbl
  }
  
  results = data.table::rbindlist(results)
  
  if (compute_rand_index) {
    return(list(results = results, cluster_out = cluster_out))
  }
  else {
   return(results) 
  }
}


In a for-loop, we’ll iterate over the datasets and the ‘partition around medoids’ implementations and we’ll save the output results to a list object,


dataset_names = c('rnorm_data', 'dietary_survey_IBS', 'soybean', 'agriculture', 'geospatial')
lst_all = list()
geo_flag = FALSE
y = NULL

for (dat_name in dataset_names) {
  
  iter_dat = datasets_clust(data_name = dat_name)
  cat(glue::glue("Dataset: '{dat_name}'  Number of rows: {nrow(iter_dat$x)}"), '\n')
  
  if (dat_name == 'geospatial') {
    mp_view = iter_dat$leaflet_map
    y = iter_dat$code
    geo_flag = TRUE
  }
  else {
    geo_flag = FALSE
  }

  iter_out = suppressWarnings(compare_medoid_implementations(x = iter_dat$x,
                                                             y = y,
                                                             max_k = 10, 
                                                             geodesic_dist = geo_flag,
                                                             round_digits = 5,
                                                             compute_rand_index = geo_flag))
  lst_all[[dat_name]] = iter_out
}

## Dataset: 'rnorm_data'  Number of rows: 100 
## Dataset: 'dietary_survey_IBS'  Number of rows: 400 
## Dataset: 'soybean'  Number of rows: 307 
## Dataset: 'agriculture'  Number of rows: 12 
## Dataset: 'geospatial'  Number of rows: 1001


then we’ll use a ggplot2 visualization function to visualize the dissimilarity cost and elapsed time for each k from 2 to 10


multi_plot_data = function(data_index, y_column, digits = 2, size_geom_bar_text = 7) {
  
  data_index$k = factor(data_index$k)               # convert the 'k' column to factor
  levels(data_index$k) = as.factor(glue::glue("k = {levels(data_index$k)}"))
  data_index[[y_column]] = round(data_index[[y_column]], digits = digits)

  plt = ggplot2::ggplot(data_index, ggplot2::aes(x = factor(pam_R_function), y = .data[[y_column]], fill = factor(pam_R_function))) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::facet_wrap(~k, scales = "free_y") +
    ggplot2::geom_text(ggplot2::aes(label = .data[[y_column]]), position = ggplot2::position_dodge(width = 0.9), col = 'black', fontface = 'bold', size = size_geom_bar_text) +
    ggplot2::theme(panel.background = ggplot2::element_rect(fill = "white", colour = "#6D9EC1", size = 3, linetype = "solid"),
          strip.text.x = ggplot2::element_text(size = 20, colour = 'black', face = 'bold'),
          strip.background = element_rect(fill = 'orange', colour = 'black'),
          plot.title = ggplot2::element_text(size = 19, hjust = 0.5),
          axis.text = ggplot2::element_text(size = 19, face = 'bold'),
          axis.text.x = ggplot2::element_text(size = 25, face = 'bold', angle = 35, hjust = 1),
          axis.title = ggplot2::element_text(size = 19, face = 'bold'),
          axis.title.x = ggplot2::element_blank(),
          panel.grid = ggplot2::element_blank(),
          legend.position = "none")
  
  return(plt)
}


Visualization of the dissimilarity cost for each one of the datasets


rnorm_data dataset

y_axis = 'dissim_cost'

data_name = 'rnorm_data'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)

Alt Text


dietary_survey_IBS dataset

data_name = 'dietary_survey_IBS'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)

Alt Text


soybean dataset

data_name = 'soybean'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)

Alt Text


agriculture dataset

data_name = 'agriculture'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)

Alt Text


geospatial dataset

data_name = 'geospatial'
multi_plot_data(data_index = lst_all[[data_name]]$results, y_column = y_axis, size_geom_bar_text = 5)

Alt Text


For the ‘geospatial’ data I also computed the Rand Index (or accuracy) and for k = 4 (which is the actual number of clusters as the following leaflet map shows) we have a perfect clustering (or a rand-index of 1.0). I also included a single outlier (a pair of coordinates) between the ‘mediterranean’, ‘black’ and ‘red’ sea, which shall be assigned to the ‘black’ sea based on the distance (which happens in all implementations),

dtbl_rand_index = data.table::rbindlist(lst_all[[data_name]]$cluster_out)
knitr::kable(round(dtbl_rand_index, digits = 3))
k cluster_pam cluster_pam_fast kmed_skm fastkmedoids_pam fastkmedoids_fastpam ClusterR_Medoids
2 0.624 0.624 0.624 0.624 0.624 0.624
3 0.874 0.874 0.875 0.846 0.846 0.875
4 1.000 1.000 1.000 1.000 1.000 1.000
5 0.969 0.969 0.969 0.969 0.969 0.969
6 0.938 0.938 0.937 0.938 0.938 0.938
7 0.927 0.906 0.907 0.927 0.927 0.927
8 0.917 0.896 0.896 0.917 0.917 0.917
9 0.886 0.886 0.865 0.885 0.885 0.885
10 0.875 0.854 0.855 0.875 0.875 0.855


The Leaflet map shows the 4 clusters of the geospatial dataset which gives a rand-index of 1.0,

mp_view

Alt Text


Visualization of the elapsed time for each one of the datasets


rnorm_data dataset

y_axis = 'timing'

data_name = 'rnorm_data'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)

Alt Text


dietary_survey_IBS dataset

data_name = 'dietary_survey_IBS'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)

Alt Text


soybean dataset

data_name = 'soybean'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)

Alt Text


agriculture dataset

data_name = 'agriculture'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)

Alt Text


geospatial dataset

data_name = 'geospatial'
multi_plot_data(data_index = lst_all[[data_name]]$results, y_column = y_axis)

Alt Text


Conclusions

  • Using different datasets and a range of values for k (from 2 to 10) we see that even the exact algorithm does not always give the lowest dissimilarity cost. Moreover, there is a difference in the dissimilarity cost for the different ‘k’ values and the included ‘datasets’ between the various implementations. This probably has to do with the fact that the majority of these implementations are approximate and also differently implemented.
  • Regarding the elapsed time the bar-plots show that the fastkmedoids::fastpam() function returns the results faster (compared to the other implementations) in all cases and this is more obvious in the ‘geospatial’ dataset where we have a medium sized dataset consisting of 1000 observations.


Notes

All ‘Partition Around Medoid’ functions take a dissimilarity matrix as input and not the initial input data, therefore the elapsed time does not include the computation of the dissimilarity (or distance) matrix.


To leave a comment for the author, please follow the link and comment on their blog: mlampros.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)