Reasonable Inheritance of Cluster Identities in Repetitive Clustering

[This article was first published on joy of data » R, and kindly contributed to R-bloggers].

… or Inferring Identity from Observations

Let’s assume the following application:

A conservation organisation starts a project to geographically catalogue the remaining representatives of an endangered plant species. For that purpose hikers are encouraged to report the location of the plant whenever they encounter it. Because those hikers use GPS technology ranging from cheap smartphones to high-end GPS devices, and because weather and environmental circumstances differ, the measurements are of varying accuracy. The goal of the conservation organisation is to build a map locating all found plants, with an ID assigned to each of them. Every time a new location measurement is entered into the system, a clustering is applied to identify related measurements, i.e. measurements belonging to the same plant.
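
To make the setup concrete, here is a minimal sketch of such a re-clustering step. The coordinates, the choice of single-linkage hierarchical clustering and the 10 m cut-off are all assumptions made for illustration; the application described above does not prescribe a particular clustering method.

```r
# Hypothetical measurements: planar coordinates in metres (assumed data)
pts <- rbind(
  c(0, 0), c(3, 1), c(1, 4),   # scattered around one plant
  c(100, 100), c(102, 97)      # scattered around another plant
)

# Assumption: single-linkage hierarchical clustering, cut at max_dist metres.
# Returns one temporary cluster ID (as a string) per measurement.
recluster <- function(pts, max_dist = 10) {
  tree <- hclust(dist(pts), method = "single")
  as.character(cutree(tree, h = max_dist))
}

cx <- recluster(pts)
print(cx)
```

The IDs returned by cutree() only number the clusters of this particular run. They carry no memory of earlier runs, so some rule is needed to map them back onto the IDs assigned so far.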

“I am he as you are he as you are me …

(… And we are all together” from “I am the Walrus” by the Beatles) So far so good. Where it gets a bit tricky is deciding how to deal with the IDs of clusters / plants when a newly introduced location estimate does not just humbly join an established cluster but causes trouble by messing up previously identified clusters / plants. Take the picture to the right. So far we had two plants with separate IDs. In the good case they stay separate and the new point is assigned to the red cluster. In the bad case the new point causes red and blue to merge, which poses the question whether the merged cluster is red, blue or something new altogether. Here we are dealing with a clear draw and very few points and clusters, but it is easy to come up with more ambiguous cases, such as the one described above. To make reasonable decisions in those cases, well-chosen heuristics are needed, if possible ones that are at least mathematically plausible.

“Who Cares?”

Fair question, as one might argue that an ID only serves the purpose of differentiating and that there is no need to maintain a family tree of clusters. In the above use case, too, this argument is not easily dismissed. But a stable inheritance of IDs might make it simpler to understand the dynamics of how clustering takes place. A large number of representatives might render a cluster and its represented entity “important”, and it would be weird to have no stable way to refer to it. Some other possible motivations come to mind as well. Maybe the organisation will send researchers to selected plants to examine them and henceforth intends to refer to those plants specifically.

“Take arms!”

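# calculates the contingency table described below:
# rows are the known cluster IDs in c0 (unlabeled "?" elements are ignored),
# columns are the temporary IDs in cx assigned by the new clustering
cross <- function(c0, cx) {
  uc0 <- unique(c0[c0 != "?"])
  ucx <- unique(cx)
  
  tab <- matrix(0, 
      ncol=length(ucx), 
      nrow=length(uc0), 
      dimnames=list(uc0, ucx)
    )
  
  for(id_c0 in uc0) {
    for(id_cx in ucx) {
      # number of elements that cluster id_c0 and set id_cx have in common
      tab[id_c0, id_cx] <- sum(c0 == id_c0 & cx == id_cx)
    }
  }
  
  return(tab)
}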

# helper function: "A B" -> c("A","B")
sv <- function(str) {
  strsplit(str," +")[[1]]
}

So how might we approach this almost philosophical problem? I guess what is needed first is a handy way to represent the relations. For that purpose, something one might be inclined to call a “set-theoretic contingency table” makes sense. Rows represent the clusters identified so far, columns represent the result of the newly performed clustering, and the values are the number of elements the respective clusters have in common. Take the illustration on the right-hand side for an example: the new clustering leads to a temporary cluster with ID 2, which has 1 element in common with cluster C. Choosing A for clustered set 3 is obvious. Choosing B for 2 and C for 1 is less evident formally, but probably still the obvious choice for a human being.

\text{E(x), E(i) are the elements contained in cluster x and clustered set i.}
\newline
(\text{Contingency Table})_{(x,i)} = \#(E(x) \cap E(i))

> c0 <-  sv("A A B B C C C ?")
> cx  <- sv("3 3 2 2 2 1 1 2")
> 
> cross(c0,cx)
  3 2 1
A 2 0 0
B 0 2 0
C 0 1 2
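
As a cross-check, the same counts can be obtained with base R's table(). The following is only a sketch: the name cross_tab is hypothetical, and the factor levels are set explicitly so that rows and columns keep their order of first appearance instead of being sorted.

```r
# Hypothetical alternative: build the contingency table with table()
cross_tab <- function(c0, cx) {
  known <- c0 != "?"   # elements that already carry a cluster ID
  tab <- table(
    factor(c0[known], levels = unique(c0[known])),
    factor(cx[known], levels = unique(cx))   # keep columns of "?"-only sets
  )
  # convert to a plain integer matrix with row and column names
  matrix(as.integer(tab), nrow = nrow(tab), dimnames = unname(dimnames(tab)))
}

c0 <- c("A", "A", "B", "B", "C", "C", "C", "?")
cx <- c("3", "3", "2", "2", "2", "1", "1", "2")
m <- cross_tab(c0, cx)
print(m)
```

A clustered set that consists only of “?” elements still gets a column here, filled with zeros, because the column levels are taken from all of cx.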

Choosing a Label for a Mixed Set

Continuing with the above example: clustered set 2 contains elements of type B and C. In this case one might say: “The choice of B is most reasonable as there are two Bs, one C and one unsettled element”. Fair enough, but what if we face a draw? Or what if we had two Bs and five more elements of different types, like C, D, E, F, G? It might seem odd, but in a space of high dimensionality this is, I guess, a possibility.

Or take the situation illustrated to the right. For set 1 the label is a clear choice. But with the above democratic labeling heuristic we would have to choose the same label for set 2 as well, and this would lead to a conflict. :/
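
The democratic heuristic and the conflict it can produce might be sketched as follows. The function name majority_labels and the small table are hypothetical; the table is made up to mimic the situation of one old cluster dominating two new sets, and is not taken from the illustration.

```r
# Majority vote: every clustered set takes the label of the old cluster
# that contributes the most elements; a draw or an empty column yields "+"
majority_labels <- function(tab) {
  sapply(colnames(tab), function(id_cx) {
    col <- tab[, id_cx]
    winners <- rownames(tab)[col == max(col) & col > 0]
    if (length(winners) == 1) winners else "+"
  })
}

# Hypothetical table: old cluster A was split across the new sets 1 and 2
tab <- matrix(c(2, 0, 2, 1), nrow = 2,
              dimnames = list(c("A", "B"), c("1", "2")))

votes <- majority_labels(tab)

# labels that were handed out more than once
conflicts <- unique(votes[votes != "+" & duplicated(votes)])
print(votes)
print(conflicts)
```

Both sets vote for A, so the heuristic alone cannot decide which of them inherits the label.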

A Conservative Approach to Restore Peace

To make a long story short, a possible way to go might be to take a very conservative stance and expect a cluster to properly tend its flock if it would like to keep its label. That is, if a cluster loses an element or gains one that already belonged to another cluster, its new label is chosen randomly. Whether a cluster may keep its label can be told from the contingency table: the condition is met if one and only one field in its row is non-zero and that field is also the only non-zero field in its column. Previously unlabeled elements (“?”) do not appear in the table and therefore do not disturb a label.

# determines unambiguous cluster labeling cases;
# "+" marks a clustered set that has to receive a new label
labeling <- function(cross) {
  labels <- c()
  
  for(id_cx in colnames(cross)) {
    col <- cross[, id_cx]
    labels[id_cx] <- "+"
    
    # the column must contain exactly one non-zero field ...
    if(sum(col > 0) == 1) {
      id_c0 <- rownames(cross)[which.max(col)]
      
      # ... and that field must be the only non-zero one in its row
      if(sum(cross[id_c0, ] > 0) == 1) {
        labels[id_cx] <- id_c0
      }
    }
  }
  
  return(labels)
}

And now in action:

> c0 <-  sv("A A B B C C C D D ?")
> cx  <- sv("3 3 2 2 1 1 1 1 4 2")
> 
> x <- cross(c0,cx)
> x
  3 2 1 4
A 2 0 0 0
B 0 2 0 0
C 0 0 3 0
D 0 0 1 1
> 
> labeling(x)
  3   2   1   4 
"A" "B" "+" "+"

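What remains is to replace every “+” with an actual new ID and to map the labels back onto the individual measurements. A minimal sketch, assuming that IDs are single capital letters: instead of drawing the new label at random, it simply takes the next unused letter. The function name assign_ids is hypothetical, and the labels vector repeats the output shown above.

```r
# Replace "+" by fresh IDs and return one final ID per element of cx
assign_ids <- function(cx, labels, used) {
  fresh <- setdiff(LETTERS, used)   # assumption: IDs are capital letters
  new <- which(labels == "+")       # clustered sets that need a new label
  labels[new] <- fresh[seq_along(new)]
  unname(labels[cx])                # look up each element's set label
}

cx     <- c("3", "3", "2", "2", "1", "1", "1", "1", "4", "2")
labels <- c("3" = "A", "2" = "B", "1" = "+", "4" = "+")

ids <- assign_ids(cx, labels, used = c("A", "B", "C", "D"))
print(ids)
```

Clusters A and B keep their IDs, while the former clusters C and D disappear in favour of two newly labeled ones. With more than 26 IDs in play the letter scheme runs out, so a real system would need a different ID generator.
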
Much Ado about Something

Congratulations on making it to this point: you are now part of a small distinguished circle! Write me a mail and I will organize a session for you in which you will receive the fierce-looking joyofdata tattoo on your forehead, which will grant you bargains in organic supermarkets all over the world and will facilitate meeting people at night clubs. Okay, seriously, I’d be interested in input!


(original article published on www.joyofdata.de)
