066: Distance-Based Constrained Clustering

Proposed by Thi-Bich-Hanh Dao

Cluster Analysis is a Data Mining task that aims at partitioning a given set of objects into clusters, such that objects in the same cluster are similar to each other and different from the objects in other clusters. We consider a dataset of objects and a dissimilarity measure defined between any two objects. The homogeneity of the clusters is usually expressed by an optimization criterion, which can be, among others:

Maximizing the minimal split between clusters, where the minimal split is the smallest dissimilarity between two objects of different clusters;
Minimizing the maximal diameter of clusters, where the maximal diameter is the largest dissimilarity between two objects in the same cluster;
Minimizing the within-cluster sum of dissimilarities;
Minimizing the within-cluster sum of squares: in a Euclidean space, this is the sum of the squared Euclidean distances between each object and the centroid of the cluster containing it;
etc.
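
As an illustration, the four criteria listed above can be evaluated for a given partition as in the following minimal Python sketch. The function names, the representation of a partition as a list of cluster labels, and the toy dataset are assumptions made for this example and are not part of the problem specification. In particular, the within-cluster sum of dissimilarities is assumed here to count each unordered pair of objects once; other conventions exist.

```python
import math
from itertools import combinations


def min_split(d, labels):
    """Smallest dissimilarity between two objects of different clusters."""
    pairs = combinations(range(len(labels)), 2)
    return min((d[i][j] for i, j in pairs if labels[i] != labels[j]),
               default=math.inf)  # a single cluster has no split


def max_diameter(d, labels):
    """Largest dissimilarity between two objects of the same cluster."""
    pairs = combinations(range(len(labels)), 2)
    return max((d[i][j] for i, j in pairs if labels[i] == labels[j]),
               default=0.0)  # all-singleton clusters have diameter 0


def within_cluster_sum(d, labels):
    """Sum of dissimilarities over unordered pairs in the same cluster
    (assumption: each pair is counted once)."""
    pairs = combinations(range(len(labels)), 2)
    return sum(d[i][j] for i, j in pairs if labels[i] == labels[j])


def within_cluster_sum_of_squares(points, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[k] for p in members) / len(members)
                    for k in range(dim)]
        total += sum((p[k] - centroid[k]) ** 2
                     for p in members for k in range(dim))
    return total


# Toy dataset (hypothetical): four points on a line, split into two clusters.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
labels = [0, 0, 1, 1]
d = [[math.dist(p, q) for q in points] for p in points]

print(min_split(d, labels))                           # 4.0
print(max_diameter(d, labels))                        # 1.0
print(within_cluster_sum(d, labels))                  # 2.0
print(within_cluster_sum_of_squares(points, labels))  # 1.0
```

Note that the first three criteria only need the dissimilarity matrix, whereas the within-cluster sum of squares needs the coordinates of the objects in order to compute the centroids.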

Prior knowledge of the user can be integrated into clustering, which leads to Constrained Clustering. User constraints can be instance-level constraints or cluster-level constraints. Instance-level constraints are must-link or cannot-link constraints, which state that two objects must be or cannot be in the same cluster. Cluster-level constraints state requirements on the size, the diameter, the density, etc. of the clusters. Optimizing any of the criteria above, except the split, is NP-hard. The split criterion, which can be optimized in polynomial time without user constraints, becomes NP-hard when user constraints are added.
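
To make the two kinds of user constraints concrete, the following minimal Python sketch checks whether a given partition satisfies a set of must-link and cannot-link constraints, and a simple cluster-level constraint on cluster sizes. The function names, the encoding of constraints as pairs of object indices, and the size bounds are assumptions chosen for this example. The sketch only verifies a candidate partition; it does not search for one.

```python
from collections import Counter


def satisfies_instance_constraints(labels, must_link, cannot_link):
    """True if every must-link pair shares a cluster and every
    cannot-link pair is separated."""
    return (all(labels[i] == labels[j] for i, j in must_link)
            and all(labels[i] != labels[j] for i, j in cannot_link))


def satisfies_size_bounds(labels, min_size, max_size):
    """Cluster-level constraint (hypothetical example): every cluster
    has between min_size and max_size objects."""
    sizes = Counter(labels)
    return all(min_size <= s <= max_size for s in sizes.values())


# Hypothetical partition of four objects into two clusters.
labels = [0, 0, 1, 1]
must_link = [(0, 1)]    # objects 0 and 1 must be together
cannot_link = [(1, 2)]  # objects 1 and 2 must be separated

print(satisfies_instance_constraints(labels, must_link, cannot_link))  # True
print(satisfies_size_bounds(labels, min_size=2, max_size=3))           # True
```

Checking a partition in this way takes time linear in the number of constraints and objects; the difficulty stated above lies in finding a partition that satisfies the constraints while optimizing the chosen criterion.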