Fast Estimates for Clustering Similarity
The literature on clustering similarity measures is rich, with many different measures proposed over the last fifty years, starting with the well-known Rand index (1971). Most similarity measures can be classified as either pair-counting or information-theoretic. Pair-counting measures quantify how much two clusterings (that is, two partitions of a support set) A and B agree or disagree on pairs of elements of the support set, whereas information-theoretic measures quantify the amount of mutual information between the two clusterings. Adjusted measures are often used to discount the amount of similarity between two clusterings that is due to pure chance.
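As a concrete reference point, the plain Rand index mentioned above can be computed directly from pair counts. The sketch below is an illustrative implementation of that classical definition (it is not the estimator discussed in the talk); the function name and label-list representation are my own choices:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Plain Rand index: the fraction of element pairs on which the two
    clusterings agree, i.e. pairs that are either together in both
    clusterings or separated in both.

    Each clustering is given as a list of cluster labels, one label per
    element of the support set.
    """
    n = len(labels_a)
    assert n == len(labels_b), "both clusterings must cover the same support set"
    agreements = 0
    for i, j in combinations(range(n), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:   # both together, or both separated
            agreements += 1
    return agreements / (n * (n - 1) // 2)
```

Note that the index depends only on which pairs are co-clustered, not on the label values themselves, so relabeling the clusters leaves it unchanged; this O(n^2) pair loop is the textbook definition, not an optimized computation.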
In this talk, I will present my joint work with Hugo Sanz (UPC) on estimating the similarity of two clusterings using small random samples. This work is a follow-up to my previous work with Alfredo Viola and Jun Wang on similarity estimates for sets. Our proposal for clustering similarity applies to a large and general class of pair-counting measures, including the most popular ones, such as the Rand index, the adjusted Rand index, and the correlation coefficient. Some provisions must be made to avoid bias, since the simplest idea, namely working with the clusterings induced by a single random sample of the support set, does not work. We show that our estimates are unbiased, and we also quantify their accuracy (standard error). I will also describe our ongoing work on information-theoretic measures.
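To make the naive idea mentioned above concrete, the sketch below restricts both clusterings to one random sample of the support set and evaluates a pair-counting measure on the induced clusterings. This is only an illustration of the simple approach the abstract says must be corrected for bias, not the authors' actual estimator; all names here are my own:

```python
import random
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Plain Rand index over all element pairs (textbook definition)."""
    n = len(labels_a)
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )
    return agreements / (n * (n - 1) // 2)

def naive_sampled_rand_index(labels_a, labels_b, sample_size, rng):
    """Naive single-sample estimate: draw one random sample of the
    support set, restrict both clusterings to it, and compute the
    measure on the induced clusterings.  As the abstract notes, this
    simplest idea does not work in general (it can be biased for
    adjusted measures), which is what the talk's estimator corrects.
    """
    idx = rng.sample(range(len(labels_a)), sample_size)
    return rand_index([labels_a[i] for i in idx],
                      [labels_b[i] for i in idx])
```

With the sample size equal to the full support set, the estimate coincides with the exact value; for smaller samples it fluctuates from draw to draw, which is why quantifying the standard error matters.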