How to Report the Distribution of Attributes per Cluster

Posted on January 15, 2021 by George Pipis in R bloggers | 0 Comments

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers].


Let’s say that you have applied a clustering algorithm and would like to report the distribution of the categorical variables per cluster in a “tidy” report. Below is one way to do it in R.

Generate the Data

Let’s assume that we ended up with 3 clusters, “C1”, “C2” and “C3”, and that we have 3 categorical attributes:

Gender: “M”, “F”
Type: “A”, “B”, “C”, “D”
Category: “High”, “Medium”, “Low”

library(tidyverse)

# Fix the seed so the simulated data are reproducible
set.seed(5)

# Cluster C1: 500 observations
df1 <- tibble(ID = seq_len(500)) %>%
  mutate(Cluster = "C1",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.6, 0.4)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.2, 0.3, 0.4, 0.1)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.1, 0.6, 0.3)))

# Cluster C2: 300 observations
df2 <- tibble(ID = seq_len(300)) %>%
  mutate(Cluster = "C2",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.4, 0.6)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.4, 0.1, 0.2, 0.3)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.7, 0.2, 0.1)))

# Cluster C3: 200 observations
df3 <- tibble(ID = seq_len(200)) %>%
  mutate(Cluster = "C3",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.2, 0.8)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.5, 0.3, 0.1, 0.1)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.1, 0.2, 0.7)))

# Stack the three clusters into a single data frame
df <- bind_rows(df1, df2, df3)

df
 

# A tibble: 1,000 x 5
      ID Cluster Gender Type  Category
   <int> <chr>   <chr>  <chr> <chr>   
 1     1 C1      M      C     Medium  
 2     2 C1      F      C     Medium  
 3     3 C1      F      C     Medium  
 4     4 C1      M      B     Low     
 5     5 C1      M      B     Low     
 6     6 C1      F      C     Medium  
 7     7 C1      M      C     Medium  
 8     8 C1      F      B     High    
 9     9 C1      F      C     Medium  
10    10 C1      M      A     Medium  
# ... with 990 more rows
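
The three blocks above differ only in the cluster name, the sample size and the sampling probabilities, so the same data generation could be sketched with a small helper. The function `make_cluster()` and the object names `df_alt` and `cluster_sizes` are hypothetical and not part of the original post; this is a minimal sketch under those assumptions, followed by a quick check of the cluster sizes.

```r
library(tidyverse)

# Hypothetical helper (not from the original post): simulate one cluster
# of `size` rows with the given sampling probabilities per attribute level.
make_cluster <- function(name, size, p_gender, p_type, p_category) {
  tibble(ID = seq_len(size)) %>%
    mutate(Cluster = name,
           Gender = sample(c("M", "F"), size, replace = TRUE, prob = p_gender),
           Type = sample(c("A", "B", "C", "D"), size, replace = TRUE, prob = p_type),
           Category = sample(c("High", "Medium", "Low"), size, replace = TRUE, prob = p_category))
}

set.seed(5)

df_alt <- bind_rows(
  make_cluster("C1", 500, c(0.6, 0.4), c(0.2, 0.3, 0.4, 0.1), c(0.1, 0.6, 0.3)),
  make_cluster("C2", 300, c(0.4, 0.6), c(0.4, 0.1, 0.2, 0.3), c(0.7, 0.2, 0.1)),
  make_cluster("C3", 200, c(0.2, 0.8), c(0.5, 0.3, 0.1, 0.1), c(0.1, 0.2, 0.7))
)

# Sanity check: rows per cluster (C1: 500, C2: 300, C3: 200)
cluster_sizes <- count(df_alt, Cluster)
cluster_sizes
```

Checking the cluster sizes before building the report is a cheap way to confirm that the rows were stacked as intended.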

Report the Distribution of Attributes



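For each attribute, we compute the share of each value within every cluster and stack the results into a single table.

# Attribute columns: everything after ID and Cluster
attributes <- names(df)[3:ncol(df)]

output <- NULL

for (a in attributes) {

  # Count rows per (attribute value, cluster), convert the counts to
  # within-cluster proportions, then spread the clusters into columns.
  # group_by(across(all_of(...))) replaces the deprecated group_by_(),
  # and pivot_wider() replaces the superseded spread().
  tmp <- df %>%
    group_by(across(all_of(c(a, "Cluster")))) %>%
    summarise(n = n(), .groups = "drop") %>%
    group_by(Cluster) %>%
    mutate(Prop = n / sum(n)) %>%
    ungroup() %>%
    select(-n) %>%
    pivot_wider(names_from = Cluster, values_from = Prop) %>%
    mutate(Attribute = a) %>%
    select(Attribute, everything())
  colnames(tmp)[1:2] <- c("attribute", "values")

  # Append this attribute's rows to the report
  output <- rbind(output, tmp)
}

output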
 

# A tibble: 9 x 5
  attribute values    C1    C2    C3
  <chr>     <chr>  <dbl> <dbl> <dbl>
1 Gender    F      0.398 0.593 0.78 
2 Gender    M      0.602 0.407 0.22 
3 Type      A      0.188 0.413 0.425
4 Type      B      0.318 0.1   0.365
5 Type      C      0.39  0.193 0.105
6 Type      D      0.104 0.293 0.105
7 Category  High   0.114 0.683 0.065
8 Category  Low    0.312 0.103 0.75 
9 Category  Medium 0.574 0.213 0.185
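
The same kind of report can also be built without a loop by first reshaping the attributes into long format. The sketch below is an alternative, not the original post's method; it runs on a small hypothetical data set (`toy`) with the same column layout as `df`, so the proportions are easy to verify by hand. It assumes all attribute columns share one type (character here), and it fills value/cluster combinations that never occur with 0 via `values_fill`, whereas the loop-based version would leave them as `NA`.

```r
library(tidyverse)

# Toy data (hypothetical values) with the same layout as df:
# an ID, a Cluster label and categorical attribute columns.
toy <- tibble(
  ID = 1:6,
  Cluster = c("C1", "C1", "C1", "C1", "C2", "C2"),
  Gender = c("M", "M", "M", "F", "F", "F"),
  Type = c("A", "B", "A", "A", "B", "B")
)

report <- toy %>%
  # One row per (ID, attribute) pair
  pivot_longer(-c(ID, Cluster), names_to = "attribute", values_to = "values") %>%
  # Count each attribute value within each cluster
  count(Cluster, attribute, values) %>%
  # Convert counts to proportions within cluster and attribute
  group_by(Cluster, attribute) %>%
  mutate(Prop = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # One column per cluster; combinations that never occur become 0
  pivot_wider(names_from = Cluster, values_from = Prop, values_fill = 0) %>%
  arrange(attribute, values)

report
```

With this layout, the proportions in each cluster column sum to 1 within every attribute, which is a convenient check on the report. Applied to `df`, the rows would be ordered alphabetically by attribute rather than in column order.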
