Clustering using the ClusterR package


This blog post is about clustering and specifically about my recently released package on CRAN, ClusterR. The following notes and examples are based mainly on the package Vignette.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is the main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

The most prominent examples of clustering algorithms are Connectivity-based clustering (hierarchical clustering), Centroid-based clustering (k-means, k-medoids,…), Distribution-based clustering (Gaussian mixture models) and Density-based clustering (DBSCAN, OPTICS,…).

The ClusterR package consists of centroid-based (k-means, mini-batch-kmeans, k-medoids) and distribution-based (GMM) clustering algorithms. Furthermore, the package offers functions to

  • validate the output using the true labels,
  • plot the results using either a silhouette or a 2-dimensional plot,
  • predict new observations,
  • estimate the optimal number of clusters for each algorithm separately.

Gaussian Mixture Models (GMM)

Gaussian Mixture Models are a probabilistic model for representing normally distributed subpopulations within an overall population. A Gaussian mixture model is parameterized by two types of values, the mixture component weights and the component means and covariances (for the multivariate case). If the number of components is known, expectation maximization is the technique most commonly used to estimate the mixture model’s parameters.
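For reference, the mixture density has the standard form, with component weights \phi_k (summing to one), means \mu_k and covariances \Sigma_k:

p(x) \;=\; \sum_{k=1}^{K} \phi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \phi_k \ge 0, \quad \sum_{k=1}^{K} \phi_k = 1

Expectation maximization alternates between computing the posterior component memberships (E-step) and re-estimating \phi_k, \mu_k and \Sigma_k from them (M-step).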

The GMM function in the ClusterR package is an R implementation of the Armadillo library class for modeling data as a Gaussian Mixture Model (GMM), under the assumption of diagonal covariance matrices. A number of function parameters can be tuned, among others the gaussian_comps, the dist_mode (eucl_dist, maha_dist), the seed_mode (static_subset, random_subset, static_spread, random_spread), the km_iter and the em_iter (more information about the parameters can be found in the package documentation). I’ll illustrate the GMM function using the synthetic data dietary_survey_IBS,

<span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ClusterR</span><span class="p">)</span><span class="w">

</span><span class="n">data</span><span class="p">(</span><span class="n">dietary_survey_IBS</span><span class="p">)</span><span class="w">

</span><span class="nf">dim</span><span class="p">(</span><span class="n">dietary_survey_IBS</span><span class="p">)</span><span class="w">

</span><span class="c1">## [1] 400  43
</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dietary_survey_IBS</span><span class="p">[,</span><span class="w"> </span><span class="o">-</span><span class="n">ncol</span><span class="p">(</span><span class="n">dietary_survey_IBS</span><span class="p">)]</span><span class="w">   </span><span class="c1"># data (excluding the response variable)
</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dietary_survey_IBS</span><span class="p">[,</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">dietary_survey_IBS</span><span class="p">)]</span><span class="w">    </span><span class="c1"># the response variable
</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">center_scale</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">mean_center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">sd_scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">  </span><span class="c1"># centering and scaling the data
</span><span class="w">
</span><span class="n">gmm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">GMM</span><span class="p">(</span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">dist_mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"maha_dist"</span><span class="p">,</span><span class="w"> </span><span class="n">seed_mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"random_subset"</span><span class="p">,</span><span class="w"> </span><span class="n">km_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w">
          
          </span><span class="n">em_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">          

</span><span class="c1"># predict centroids, covariance matrix and weights
</span><span class="w">
</span><span class="n">pr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict_GMM</span><span class="p">(</span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">gmm</span><span class="o">$</span><span class="n">centroids</span><span class="p">,</span><span class="w"> </span><span class="n">gmm</span><span class="o">$</span><span class="n">covariance_matrices</span><span class="p">,</span><span class="w"> </span><span class="n">gmm</span><span class="o">$</span><span class="n">weights</span><span class="p">)</span><span class="w">    

</span>

The GMM function initially returns the centroids, the covariance matrix (where each row represents a diagonal covariance matrix), the weights and the log-likelihoods for each gaussian component. The predict_GMM function then takes the output of the GMM model and returns the most probable cluster for each observation.
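For a quick look at these objects (using only the component names that already appear in this post):

str(gmm, max.level = 1)     # centroids, covariance matrices, weights, log-likelihoods
head(pr$cluster_labels)     # the most probable cluster for each observation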

In addition to the previously mentioned functions, the Optimal_Clusters_GMM function can be used to estimate the number of clusters in the data using either the AIC (Akaike information criterion) or the BIC (Bayesian information criterion),

<span class="w">
</span><span class="n">opt_gmm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Optimal_Clusters_GMM</span><span class="p">(</span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">max_clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">criterion</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"BIC"</span><span class="p">,</span><span class="w"> 
                               
                               </span><span class="n">dist_mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"maha_dist"</span><span class="p">,</span><span class="w"> </span><span class="n">seed_mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"random_subset"</span><span class="p">,</span><span class="w">
                               
                               </span><span class="n">km_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">em_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">var_floor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e-10</span><span class="p">,</span><span class="w"> 
                               
                               </span><span class="n">plot_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">

</span>

[Figure: BIC values of the GMM for 1 to 10 clusters]

When selecting among a specific number of models, the one with the lowest BIC should be preferred, which here corresponds to 2 clusters.
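As a reminder, for a model with maximized likelihood \hat{L}, k estimated parameters and n observations, the criterion is

\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})

so lower values indicate a better trade-off between goodness of fit and model complexity.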

Assuming that true labels are available, one can use the external_validation function with one of the following methods to validate the output clusters: rand_index, adjusted_rand_index, jaccard_index, fowlkes_Mallows_index, mirkin_metric, purity, entropy, nmi (normalized mutual information) and var_info (variation of information),

<span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">external_validation</span><span class="p">(</span><span class="n">dietary_survey_IBS</span><span class="o">$</span><span class="n">class</span><span class="p">,</span><span class="w"> </span><span class="n">pr</span><span class="o">$</span><span class="n">cluster_labels</span><span class="p">,</span><span class="w"> 
                          
                          </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"adjusted_rand_index"</span><span class="p">,</span><span class="w"> </span><span class="n">summary_stats</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">

</span><span class="n">res</span><span class="w">

</span><span class="c1">##  
## ---------------------------------------- 
## purity                         : 1 
## entropy                        : 0 
## normalized mutual information  : 1 
## variation of information       : 0 
## ---------------------------------------- 
## specificity                    : 1 
## sensitivity                    : 1 
## precision                      : 1 
## recall                         : 1 
## F-measure                      : 1 
## ---------------------------------------- 
## accuracy OR rand-index         : 1 
## adjusted-rand-index            : 1 
## jaccard-index                  : 1 
## fowlkes-mallows-index          : 1 
## mirkin-metric                  : 0 
## ----------------------------------------
</span><span class="w">
</span>

and if the summary_stats parameter is set to TRUE then it also returns the specificity, sensitivity, precision, recall and F-measure metrics.

k-means

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is often called the k-means algorithm; it is also referred to as Lloyd’s algorithm, particularly in the computer science community.
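Formally, given observations x_1, \ldots, x_n, k-means seeks a partition S = \{S_1, \ldots, S_k\} that minimizes the within-cluster sum of squares,

\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2

where \mu_i is the mean of the points in S_i.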

The ClusterR package provides two different k-means functions: KMeans_arma, which is an R implementation of the k-means Armadillo library class, and KMeans_rcpp, which uses the RcppArmadillo package. Both functions produce the same results; however, they return different attributes, which I'll explain in the next code chunks.

KMeans_arma

KMeans_arma is faster than the KMeans_rcpp function; however, it initially outputs only the centroids for a specific number of clusters. Furthermore, the number of columns of the data should be larger than the number of clusters, otherwise it raises an error. The clustering runs faster on multi-core machines when OpenMP is enabled (e.g. -fopenmp in GCC). The algorithm is initialized once, and 10 iterations are typically sufficient for convergence. The initial centroids are seeded using one of keep_existing, static_subset, random_subset, static_spread and random_spread. If seed_mode equals keep_existing, then the user should supply a matrix of centroids.

I'll reduce the dimensions of the dietary_survey_IBS data using PCA, specifically the princomp function of the stats package, so that a 2-dimensional plot of the resulting clusters is possible,

<span class="w">
</span><span class="n">pca_dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stats</span><span class="o">::</span><span class="n">princomp</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span><span class="o">$</span><span class="n">scores</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">]</span><span class="w">

</span><span class="n">km</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">KMeans_arma</span><span class="p">(</span><span class="n">pca_dat</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">n_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">seed_mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"random_subset"</span><span class="p">,</span><span class="w"> 
                 
                 </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">CENTROIDS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w">

</span><span class="n">pr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict_KMeans</span><span class="p">(</span><span class="n">pca_dat</span><span class="p">,</span><span class="w"> </span><span class="n">km</span><span class="p">)</span><span class="w">

</span><span class="n">table</span><span class="p">(</span><span class="n">dietary_survey_IBS</span><span class="o">$</span><span class="n">class</span><span class="p">,</span><span class="w"> </span><span class="n">pr</span><span class="p">)</span><span class="w">

</span><span class="nf">class</span><span class="p">(</span><span class="n">km</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'matrix'</span><span class="w">

</span><span class="n">plot_2d</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pca_dat</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.vector</span><span class="p">(</span><span class="n">pr</span><span class="p">),</span><span class="w"> </span><span class="n">centroids_medoids</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">km</span><span class="p">))</span><span class="w">
</span>

[Figure: 2-dimensional plot of the k-means clusters and centroids]

KMeans_rcpp

As stated before, the KMeans_rcpp function offers some additional features in comparison to the KMeans_arma function:

  • It allows for multiple initializations (which can be parallelized if OpenMP is available)
  • Besides optimal_init, quantile_init, random and kmeans++ initializations, one can specify the centroids using the CENTROIDS parameter (see the sketch after this list)
  • The running time and convergence of the algorithm can be adjusted using the num_init, max_iters and tol parameters
  • If num_init > 1, then KMeans_rcpp returns the attributes of the best initialization, using the within-cluster-sum-of-squared-error as criterion
  • The algorithm returns the following attributes: clusters, fuzzy_clusters (if fuzzy = TRUE), centroids, total_SSE, best_initialization, WCSS_per_cluster, obs_per_cluster, between.SS_DIV_total.SS
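As a brief sketch of the last two points (the starting matrix init below is a hypothetical example, not taken from the post):

init = pca_dat[sample(nrow(pca_dat), 2), ]             # two user-chosen rows as starting centroids
km_user = KMeans_rcpp(pca_dat, clusters = 2, max_iters = 100,
                      CENTROIDS = init, fuzzy = TRUE)  # fuzzy = TRUE additionally returns soft assignments
head(km_user$fuzzy_clusters)                           # membership probabilities per cluster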

More details about KMeans_rcpp can be found in the package documentation. I’ll explain the various parameters of the KMeans_rcpp using a vector quantization example and the OpenImageR package,

<span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">OpenImageR</span><span class="p">)</span><span class="w">

</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'elephant.jpg'</span><span class="w">

</span><span class="n">im</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">readImage</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="w">

</span><span class="c1"># first resize the image to reduce the dimensions
</span><span class="n">im</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">resizeImage</span><span class="p">(</span><span class="n">im</span><span class="p">,</span><span class="w"> </span><span class="m">75</span><span class="p">,</span><span class="w"> </span><span class="m">75</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'bilinear'</span><span class="p">)</span><span class="w">            

</span><span class="n">imageShow</span><span class="p">(</span><span class="n">im</span><span class="p">)</span><span class="w">                                                </span><span class="c1"># plot the original image
</span><span class="w">
</span><span class="n">im2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">im</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">as.vector</span><span class="p">)</span><span class="w">                                </span><span class="c1"># vectorize RGB
</span>

[Figure: the original (resized) elephant image]

<span class="w">
</span><span class="c1"># perform KMeans_rcpp clustering
</span><span class="w">
</span><span class="n">km_rc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">KMeans_rcpp</span><span class="p">(</span><span class="n">im2</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">num_init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">max_iters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> 
                    
                    </span><span class="n">initializer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'optimal_init'</span><span class="p">,</span><span class="w"> </span><span class="n">threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

</span><span class="n">km_rc</span><span class="o">$</span><span class="n">between.SS_DIV_total.SS</span><span class="w">

</span><span class="c1">## [1] 0.9873009
</span>

The attribute between.SS_DIV_total.SS is equal to (total_SSE - sum(WCSS_per_cluster)) / total_SSE. If there is no clustering pattern, the between sum of squares will be a very small fraction of the total sum of squares, whereas if between.SS_DIV_total.SS is close to 1.0 the observations cluster pretty well.
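The ratio can be verified directly from the returned attributes:

(km_rc$total_SSE - sum(km_rc$WCSS_per_cluster)) / km_rc$total_SSE    # equals between.SS_DIV_total.SS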

<span class="n">getcent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">km_rc</span><span class="o">$</span><span class="n">centroids</span><span class="w">

</span><span class="n">getclust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">km_rc</span><span class="o">$</span><span class="n">clusters</span><span class="w">

</span><span class="n">new_im</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getcent</span><span class="p">[</span><span class="n">getclust</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">     </span><span class="c1"># each observation is associated with the nearby centroid
</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">new_im</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">im</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">im</span><span class="p">),</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">        </span><span class="c1"># back-convertion to a 3-dimensional image
</span><span class="w">
</span><span class="n">imageShow</span><span class="p">(</span><span class="n">new_im</span><span class="p">)</span><span class="w">
</span>

[Figure: the vector-quantized elephant image (5 clusters)]

As a follow-up one can take advantage of the Optimal_Clusters_KMeans function (which indirectly uses KMeans_rcpp) to estimate the optimal number of clusters. The available criteria are variance_explained, WCSSE (within-cluster-sum-of-squared-error), dissimilarity, silhouette, distortion_fK, AIC, BIC and Adjusted_Rsquared. More information on each criterion can be found in the package documentation.

In the next code chunk I'll use the distortion_fK criterion, which is fully described in the paper "Selection of K in K-means clustering" (Pham, Dimov and Nguyen, 2004),

<span class="w">
</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Optimal_Clusters_KMeans</span><span class="p">(</span><span class="n">im2</span><span class="p">,</span><span class="w"> </span><span class="n">max_clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">plot_clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> 
                              
                              </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w"> </span><span class="n">criterion</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'distortion_fK'</span><span class="p">,</span><span class="w"> </span><span class="n">fK_threshold</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.85</span><span class="p">)</span><span class="w">

</span>

[Figure: f(K) values for 1 to 10 clusters]

Values below the fixed threshold (here fK_threshold = 0.85) could be recommended for clustering. However, there are multiple optimal clusterings here, which highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters; the final decision on which value to adopt is left at the discretion of the user.

Mini-batch-kmeans

Mini-batch-kmeans is a variation of the classical k-means algorithm. It is particularly useful for big data sets because rather than using the whole data (as k-means does) it uses mini-batches from random data samples to optimize the objective function.
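To give an idea of the mechanics, here is a minimal sketch of a single mini-batch update in R (illustrative only, not ClusterR's internal implementation; the per-center learning rate follows the common formulation of Sculley, 2010):

# one mini-batch update: assign each sampled point to its nearest center,
# then move that center towards the point with a decaying learning rate
mini_batch_step = function(data, centers, counts, batch_size = 20) {
  idx = sample(nrow(data), batch_size)                         # draw a random mini-batch
  for (i in idx) {
    j = which.min(colSums((t(centers) - data[i, ])^2))         # nearest center (squared euclidean)
    counts[j] = counts[j] + 1
    eta = 1 / counts[j]                                        # per-center learning rate
    centers[j, ] = (1 - eta) * centers[j, ] + eta * data[i, ]  # convex update towards the point
  }
  list(centers = centers, counts = counts)
}

Repeating such steps over many random batches typically converges much faster than full passes over the data when the number of observations is large.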

The parameters of the MiniBatchKmeans algorithm are almost the same as for the KMeans_rcpp function in the ClusterR package. The most important differences are the batch_size (the size of the mini-batches) and the init_fraction (the fraction of the data used to initialize the centroids, which applies if the initializer equals 'kmeans++' or 'quantile_init').

I’ll take advantage of the vector quantization example to show the differences in computation time and output quality between the KMeans_rcpp and MiniBatchKmeans functions,

<span class="w">
</span><span class="n">path_d</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'dog.jpg'</span><span class="w">

</span><span class="n">im_d</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">readImage</span><span class="p">(</span><span class="n">path_d</span><span class="p">)</span><span class="w">

</span><span class="c1"># first resize the image to reduce the dimensions
</span><span class="n">im_d</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">resizeImage</span><span class="p">(</span><span class="n">im_d</span><span class="p">,</span><span class="w"> </span><span class="m">350</span><span class="p">,</span><span class="w"> </span><span class="m">350</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'bilinear'</span><span class="p">)</span><span class="w">            

</span><span class="n">imageShow</span><span class="p">(</span><span class="n">im_d</span><span class="p">)</span><span class="w">                                                </span><span class="c1"># plot the original image
</span>

[Figure: the original (resized) dog image]

<span class="n">im3</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">im_d</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">as.vector</span><span class="p">)</span><span class="w">                                </span><span class="c1"># vectorize RGB
</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">im3</span><span class="p">)</span><span class="w">                                              </span><span class="c1"># initial dimensions of the data
</span><span class="w">
</span><span class="c1"># 122500      3
</span>

First, we perform a k-means clustering,

<span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">km_init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">KMeans_rcpp</span><span class="p">(</span><span class="n">im3</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">num_init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">max_iters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> 
                    
                    </span><span class="n">initializer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'kmeans++'</span><span class="p">,</span><span class="w"> </span><span class="n">threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

</span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">start</span><span class="w">
  
</span><span class="n">cat</span><span class="p">(</span><span class="s1">'time to complete :'</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="nf">attributes</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="o">$</span><span class="n">units</span><span class="p">,</span><span class="w"> </span><span class="s1">'\n'</span><span class="p">)</span><span class="w">

</span><span class="c1"># time to complete : 2.44029 secs
</span><span class="w">
</span><span class="n">getcent_init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">km_init</span><span class="o">$</span><span class="n">centroids</span><span class="w">

</span><span class="n">getclust_init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">km_init</span><span class="o">$</span><span class="n">clusters</span><span class="w">

</span><span class="n">new_im_init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getcent_init</span><span class="p">[</span><span class="n">getclust_init</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">  </span><span class="c1"># each observation is associated with the nearby centroid
</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">new_im_init</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">im_d</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">im_d</span><span class="p">),</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">     </span><span class="c1"># back-convertion to a 3-dimensional image
</span><span class="w">
</span><span class="n">imageShow</span><span class="p">(</span><span class="n">new_im_init</span><span class="p">)</span><span class="w">
</span>

[Figure: the dog image quantized with KMeans_rcpp]

and then a mini-batch-kmeans clustering,

<span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">km_mb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MiniBatchKmeans</span><span class="p">(</span><span class="n">im3</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">batch_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">num_init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">max_iters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> 
                        
                        </span><span class="n">init_fraction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">initializer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'kmeans++'</span><span class="p">,</span><span class="w"> </span><span class="n">early_stop_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w">
                        
                        </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

</span><span class="n">pr_mb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict_MBatchKMeans</span><span class="p">(</span><span class="n">im3</span><span class="p">,</span><span class="w"> </span><span class="n">km_mb</span><span class="o">$</span><span class="n">centroids</span><span class="p">)</span><span class="w">

</span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">start</span><span class="w">
  
</span><span class="n">cat</span><span class="p">(</span><span class="s1">'time to complete :'</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="nf">attributes</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="o">$</span><span class="n">units</span><span class="p">,</span><span class="w"> </span><span class="s1">'\n'</span><span class="p">)</span><span class="w">

</span><span class="c1"># time to complete : 0.8346727 secs 
</span><span class="w">
</span><span class="n">getcent_mb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">km_mb</span><span class="o">$</span><span class="n">centroids</span><span class="w">

</span><span class="n">new_im_mb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getcent_mb</span><span class="p">[</span><span class="n">pr_mb</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">   </span><span class="c1"># each observation is associated with the nearby centroid
</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">new_im_mb</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">im_d</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">im_d</span><span class="p">),</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">     </span><span class="c1"># back-convertion to a 3-dimensional image
</span><span class="w">
</span><span class="n">imageShow</span><span class="p">(</span><span class="n">new_im_mb</span><span class="p">)</span><span class="w">
</span>

[Figure: the dog image quantized with MiniBatchKmeans]

For a slight difference in output quality, mini-batch-kmeans returns its output on average more than twice as fast as classical k-means for this data set.


K-Medoids

The k-medoids algorithm (Kaufman & Rousseeuw, 1987) is a clustering algorithm related to the k-means algorithm and the medoid shift algorithm. Both the k-means and k-medoids algorithms are partitional and both attempt to minimize the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars) and works with arbitrary distance metrics between data points. A useful tool for determining k is the silhouette width. K-medoids is more robust to noise and outliers than k-means, because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. the most centrally located point in the cluster.
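A minimal illustration of this definition (using a toy Euclidean dissimilarity matrix; any dissimilarity would do):

d = as.matrix(dist(pca_dat[1:10, ]))    # pairwise dissimilarities of 10 observations
medoid = which.min(rowSums(d))          # the observation with minimal total dissimilarity to the rest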

The most common realization of the k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm. PAM proceeds in two phases: BUILD and SWAP. In the BUILD phase, the algorithm searches for a good set of initial medoids and in the SWAP phase all possible swaps between the BUILD-medoids and the observations take place so that there is no further decrease of the objective (Clustering in an Object-Oriented Environment, A.Struyf, M. Hubert, P. Rousseeuw., 1997).

In the ClusterR package, the Cluster_Medoids and Clara_Medoids functions correspond to PAM (partitioning around medoids) and CLARA (clustering large applications) algorithms.

In the following code chunk, I'll make use of the mushroom data to illustrate how k-medoids work with a distance metric other than the Euclidean. The mushroom data consist of 23 categorical attributes (including the class) and 8124 instances. More information about the data can be found in the package documentation.

Cluster_Medoids

The Cluster_Medoids function can also take – besides a matrix or data frame – a dissimilarity matrix as input. In the case of the mushroom data, where all the features are categorical (with two or more unique values) it would be meaningful to use the gower distance. The gower distance applies a different function to each predictor depending on its type (numeric, ordered, factor). This dissimilarity measure is implemented in many R packages, among others in the cluster package (daisy function) and in the FD package (gowdis function). I’ll take advantage of the gowdis function of the FD package as it also allows user-defined weights for each separate predictor,

<span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">mushroom</span><span class="p">)</span><span class="w">

</span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mushroom</span><span class="p">[,</span><span class="w"> </span><span class="m">-1</span><span class="p">]</span><span class="w">

</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">mushroom</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w">            </span><span class="c1"># convert the labels to numeric
</span><span class="w">
</span><span class="n">gwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">FD</span><span class="o">::</span><span class="n">gowdis</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w">           </span><span class="c1"># calculate the 'gower' distance for the factor variables
</span><span class="w">
</span><span class="n">gwd_mat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">gwd</span><span class="p">)</span><span class="w">                 </span><span class="c1"># convert the distances to a matrix
</span><span class="w">
</span><span class="n">cm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Cluster_Medoids</span><span class="p">(</span><span class="n">gwd_mat</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">swap_phase</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

</span>

adjusted_rand_index    avg_silhouette_width
0.5733587              0.2545221

As mentioned before the gowdis function of the FD package allows the user to give different weights to each separate variable. The weights parameter can be tuned, for example by using random search, in order to achieve better clustering results. For instance, by using the following weights for each separate variable one can improve both the adjusted-rand-index (external validation) and the average silhouette width (internal validation),

predictors                   weights
cap_shape                      4.626
cap_surface                   38.323
cap_color                     55.899
bruises                       34.028
odor                         169.608
gill_attachment                6.643
gill_spacing                  42.08
gill_size                     57.366
gill_color                    37.938
stalk_shape                   33.081
stalk_root                    65.105
stalk_surface_above_ring      18.718
stalk_surface_below_ring      76.165
stalk_color_above_ring        27.596
stalk_color_below_ring        26.238
veil_type                      0.0
veil_color                     1.507
ring_number                   37.314
ring_type                     32.685
spore_print_color            127.87
population                    64.019
habitat                       44.519
<span class="w">
</span><span class="n">weights</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4.626</span><span class="p">,</span><span class="w"> </span><span class="m">38.323</span><span class="p">,</span><span class="w"> </span><span class="m">55.899</span><span class="p">,</span><span class="w"> </span><span class="m">34.028</span><span class="p">,</span><span class="w"> </span><span class="m">169.608</span><span class="p">,</span><span class="w"> </span><span class="m">6.643</span><span class="p">,</span><span class="w"> </span><span class="m">42.08</span><span class="p">,</span><span class="w"> </span><span class="m">57.366</span><span class="p">,</span><span class="w"> </span><span class="m">37.938</span><span class="p">,</span><span class="w"> 
            
            </span><span class="m">33.081</span><span class="p">,</span><span class="w"> </span><span class="m">65.105</span><span class="p">,</span><span class="w"> </span><span class="m">18.718</span><span class="p">,</span><span class="w"> </span><span class="m">76.165</span><span class="p">,</span><span class="w"> </span><span class="m">27.596</span><span class="p">,</span><span class="w"> </span><span class="m">26.238</span><span class="p">,</span><span class="w"> </span><span class="m">0.0</span><span class="p">,</span><span class="w"> </span><span class="m">1.507</span><span class="p">,</span><span class="w"> </span><span class="m">37.314</span><span class="p">,</span><span class="w"> 
            
            </span><span class="m">32.685</span><span class="p">,</span><span class="w"> </span><span class="m">127.87</span><span class="p">,</span><span class="w"> </span><span class="m">64.019</span><span class="p">,</span><span class="w"> </span><span class="m">44.519</span><span class="p">)</span><span class="w">

</span><span class="n">gwd_w</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">FD</span><span class="o">::</span><span class="n">gowdis</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">w</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">weights</span><span class="p">)</span><span class="w">       </span><span class="c1"># 'gower' distance using weights
</span><span class="w">
</span><span class="n">gwd_mat_w</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">gwd_w</span><span class="p">)</span><span class="w">                 </span><span class="c1"># convert the distances to a matrix
</span><span class="w">
</span><span class="n">cm_w</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Cluster_Medoids</span><span class="p">(</span><span class="n">gwd_mat_w</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">swap_phase</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

</span>

adjusted_rand_index    avg_silhouette_width
0.6197672              0.3000048
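A minimal sketch of such a random search over the weights (the number of iterations and the weight range are arbitrary choices for illustration, and it assumes the Cluster_Medoids output exposes the labels as cl$clusters):

set.seed(1)
best = list(score = -Inf, w = NULL)
for (iter in 1:25) {
  w = runif(ncol(X), min = 0, max = 100)              # draw a random weight vector
  d_w = as.matrix(FD::gowdis(X, w = w))               # weighted 'gower' dissimilarities
  cl = Cluster_Medoids(d_w, clusters = 2, swap_phase = TRUE, verbose = F)
  sc = external_validation(y, cl$clusters, method = "adjusted_rand_index")
  if (sc > best$score) best = list(score = sc, w = w) # keep the best-scoring weights
}
best$score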

Clara_Medoids

CLARA (CLustering LARge Applications) is an obvious choice for clustering larger data sets. Instead of finding medoids for the entire data set (for which it would also be infeasible to calculate the dissimilarity matrix), CLARA draws a small sample from the data and applies the PAM algorithm to generate an optimal set of medoids for the sample. The quality of the resulting medoids is measured by the average dissimilarity between every object in the entire data set and the medoid of its cluster.

The Clara_Medoids function in the ClusterR package follows the same logic by applying the Cluster_Medoids function to each selected sample. Clara_Medoids takes two additional parameters, samples and sample_size. The first indicates the number of samples to draw from the data set, while the second is the fraction of the data to draw in each sample iteration (a float between 0.0 and 1.0). I have to point out that the Clara_Medoids function does not accept a dissimilarity matrix as input, as the Cluster_Medoids function does.

I’ll apply the Clara_Medoids function to the previously used mushroom data set by using the hamming distance as a dissimilarity metric and I’ll compare the computation time and output with the results of the Cluster_Medoids function. The hamming distance is appropriate for the mushroom data as it’s applicable to discrete variables and it’s defined as the number of attributes that take different values for two compared instances (Data Mining Algorithms: Explained using R, Pawel Cichosz, 2015, page 318).
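For two discrete vectors the hamming distance is simply the count of mismatching attributes:

a = c(1, 2, 3, 1)
b = c(1, 3, 3, 2)
sum(a != b)    # 2 attributes differ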

<span class="w">
</span><span class="n">cl_X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">X</span><span class="w">        </span><span class="c1"># copy initial data 
</span><span class="w">
</span><span class="c1"># the Clara_Medoids function allows only numeric attributes
# so first convert to numeric
</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">ncol</span><span class="p">(</span><span class="n">cl_X</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">cl_X</span><span class="p">[,</span><span class="w"> </span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">cl_X</span><span class="p">[,</span><span class="w"> </span><span class="n">i</span><span class="p">])</span><span class="w"> </span><span class="p">}</span><span class="w">

</span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">cl_f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Clara_Medoids</span><span class="p">(</span><span class="n">cl_X</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">distance_metric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'hamming'</span><span class="p">,</span><span class="w"> </span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> 
                     
                     </span><span class="n">sample_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">swap_phase</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w"> </span><span class="n">threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">start</span><span class="w">
  
</span><span class="n">cat</span><span class="p">(</span><span class="s1">'time to complete :'</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="nf">attributes</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="o">$</span><span class="n">units</span><span class="p">,</span><span class="w"> </span><span class="s1">'\n'</span><span class="p">)</span><span class="w">

</span><span class="c1"># time to complete : 3.104652 secs 
</span><span class="w">
</span>

adjusted_rand_index    avg_silhouette_width
0.5944456              0.2678507

<span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">cl_e</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Cluster_Medoids</span><span class="p">(</span><span class="n">cl_X</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">distance_metric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'hamming'</span><span class="p">,</span><span class="w"> </span><span class="n">swap_phase</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> 
                       
                       </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w"> </span><span class="n">threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">start</span><span class="w">
  
</span><span class="n">cat</span><span class="p">(</span><span class="s1">'time to complete :'</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="nf">attributes</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="o">$</span><span class="n">units</span><span class="p">,</span><span class="w"> </span><span class="s1">'\n'</span><span class="p">)</span><span class="w">

</span><span class="c1"># time to complete : 14.93423 secs 
</span>

adjusted_rand_index    avg_silhouette_width
0.5733587              0.2545221

Using the hamming distance, the Clara_Medoids and Cluster_Medoids functions return approximately the same result (comparable also with the gower distance results), except that Clara_Medoids runs more than four times faster than Cluster_Medoids for this data set.

Using the objects returned in the last two code chunks, one can also plot the silhouette widths with the Silhouette_Dissimilarity_Plot function. Worth mentioning is that the dissimilarities and silhouette widths of the Clara_Medoids function are based on the best-selected sample and not on the entire data set, as is the case for the Cluster_Medoids function.

<span class="w">
</span><span class="c1"># Silhouette Plot for the "Clara_Medoids" object
</span><span class="w">
</span><span class="n">Silhouette_Dissimilarity_Plot</span><span class="p">(</span><span class="n">cl_f</span><span class="p">,</span><span class="w"> </span><span class="n">silhouette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span>

[Figure: silhouette widths and dissimilarities for the Clara_Medoids object]

<span class="w">
</span><span class="c1"># Silhouette Plot for the "Cluster_Medoids" object
</span><span class="w">
</span><span class="n">Silhouette_Dissimilarity_Plot</span><span class="p">(</span><span class="n">cl_e</span><span class="p">,</span><span class="w"> </span><span class="n">silhouette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span>

[Figure: silhouette widths and dissimilarities for the Cluster_Medoids object]

References for the K-Medoids functions

It took me a while to figure out how to implement the k-medoids functions (Cluster_Medoids and Clara_Medoids), as I initially thought that the starting medoids are initialized in the same way the centroids are initialized in the k-means algorithm. Because I didn't have access to the book of Kaufman and Rousseeuw (only to the article: Anja Struyf, Mia Hubert, Peter J. Rousseeuw (Feb. 1997), Clustering in an Object-Oriented Environment, Journal of Statistical Software, Vol 1, Issue 4), the cluster package was a big help, as it has very detailed documentation. It includes both the partitioning around medoids (pam) and the clustering large applications (clara) algorithms, so one can compare the output results.

In the following code chunk I’ll benchmark the pam function of the cluster package and the Cluster_Medoids function of the ClusterR package, based on the previously used vector quantization example, and I’ll compare the output medoids,

<span class="w">
</span><span class="c1">#====================================
# pam function of the cluster package
#====================================
</span><span class="w">
</span><span class="n">start_pm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">pm_vq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cluster</span><span class="o">::</span><span class="n">pam</span><span class="p">(</span><span class="n">im2</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">metric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'euclidean'</span><span class="p">,</span><span class="w"> </span><span class="n">do.swap</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">end_pm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">t_pm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">end_pm</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">start_pm</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s1">'time to complete :'</span><span class="p">,</span><span class="w"> </span><span class="n">t_pm</span><span class="p">,</span><span class="w"> </span><span class="nf">attributes</span><span class="p">(</span><span class="n">t_pm</span><span class="p">)</span><span class="o">$</span><span class="n">units</span><span class="p">,</span><span class="w"> </span><span class="s1">'\n'</span><span class="p">)</span><span class="w">

</span><span class="c1"># time to complete : 12.05006 secs 
</span><span class="w">

</span><span class="n">pm_vq</span><span class="o">$</span><span class="n">medoids</span><span class="w">

</span><span class="c1">#           [,1]      [,2]      [,3]
# [1,] 1.0000000 1.0000000 1.0000000
# [2,] 0.6325856 0.6210824 0.5758536
# [3,] 0.4480000 0.4467451 0.4191373
# [4,] 0.2884769 0.2884769 0.2806337
# [5,] 0.1058824 0.1058824 0.1058824
</span><span class="w">

</span><span class="c1">#===========================================================================================================
# Cluster_Medoids function of the ClusterR package using 6 threads (parallelization if OpenMP is available)
#===========================================================================================================
</span><span class="w">
</span><span class="n">start_vq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">cl_vq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Cluster_Medoids</span><span class="p">(</span><span class="n">im2</span><span class="p">,</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">distance_metric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'euclidean'</span><span class="p">,</span><span class="w"> 

                        </span><span class="n">swap_phase</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w"> </span><span class="n">threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w">
</span><span class="n">end_vq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">

</span><span class="n">t_vq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">end_vq</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">start_vq</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s1">'time to complete :'</span><span class="p">,</span><span class="w"> </span><span class="n">t_vq</span><span class="p">,</span><span class="w"> </span><span class="nf">attributes</span><span class="p">(</span><span class="n">t_vq</span><span class="p">)</span><span class="o">$</span><span class="n">units</span><span class="p">,</span><span class="w"> </span><span class="s1">'\n'</span><span class="p">)</span><span class="w">

</span><span class="c1&...
