Cluster points so that the distance between any two points within a cluster does not exceed a given maximum

Question

I'm currently struggling with a field that is new to me, namely clustering. I would really appreciate any help I could get!

The starting situation is that a data set $(x_k)_{k\in\{1,\dots,n\}} \subseteq \mathbb{R}^N$ is given. The task is to partition this set into clusters $C_1,\dots,C_m$ (where $m$ is not preset) so that, for a given $c \in \mathbb{R}_{>0}$, the conditions $$ \begin{aligned} &\forall i \in \{1,\dots,m\} \ \forall x,y \in C_i \colon \ \Vert x-y \Vert \leq c \\ &\forall i,j \in \{1,\dots,m\} \ \forall x \in C_i \ \forall y \in C_j \colon \ i \neq j \ \Longrightarrow \ \Vert x - y \Vert > c \end{aligned} $$ hold and $m$ is minimal. In other words, what I'm looking for is: how can I divide the initial data set into as few clusters as possible so that the elements within each cluster have distance at most $c$ and two elements of distinct clusters have distance greater than $c$? (One could maybe also ask this question with distances replaced by similarities.)

Does anybody know some keywords I could look for? It would be great if there already was an algorithm or easy implementation for that. I'm also happy if somebody knows something that solves a problem which is close to mine.

Also, is there a method that would allow replacing "$\Vert x-y\Vert$" by an arbitrary distance measure $d(x,y)$ and that relies only on the distances between the given points and no others? By that I mean that some of my ideas would use custom distances (or similarities) for which it would be too expensive to calculate the distance for new points (for example, from the mean of some of the given points to another point).

Regards Murp

score 2 · Accepted Answer · answered Aug 12 '14 at 18:39

It seems that you are interested in partitional clustering; I would start by having a look at the k-means partitional clustering algorithm. The distance used by the algorithm is Euclidean, and the number of clusters $m$ is an input given at the very beginning by the user. Typically one runs k-means several times with different numbers of clusters and assesses the quality of each partition with ad hoc methods. If you want to use more general distances on your discrete dataset, then I would suggest having a look at PAM (Partitioning Around Medoids): the R implementation of PAM allows the user to give as input a dissimilarity matrix produced with any user-defined distance.
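
As an illustration only, here is a minimal Python sketch of the medoid idea (Python rather than the R implementation mentioned above). It runs a simple alternating k-medoids heuristic, which is not the full PAM build/swap procedure, on a precomputed dissimilarity matrix, and then increases $m$ until no cluster contains two points farther apart than $c$. The function names, the farthest-point initialisation and the toy data are assumptions made for the example; being a heuristic, it is not guaranteed to find the minimal $m$, nor to separate distinct clusters by more than $c$.

```python
from itertools import combinations


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed dissimilarity matrix.

    Simple "assign, then update medoids" heuristic, not the full PAM
    build/swap procedure. Only entries of dist are ever used.
    """
    n = len(dist)
    # Farthest-point initialisation (an assumption, chosen to be deterministic).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))

    def assign(meds):
        # Label each point with the position of its nearest medoid.
        return [min(range(k), key=lambda j: dist[i][meds[j]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:  # empty cluster: keep the old medoid
                new_medoids.append(medoids[j])
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(min(members, key=lambda p: sum(dist[p][q] for q in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids


def max_diameter(dist, labels):
    """Largest distance between two points that share a cluster."""
    pairs = combinations(range(len(labels)), 2)
    return max((dist[i][j] for i, j in pairs if labels[i] == labels[j]), default=0.0)


def smallest_k_within(dist, c):
    """Try k = 1, 2, ... and return the first clustering with diameter <= c."""
    for k in range(1, len(dist) + 1):
        labels, _ = k_medoids(dist, k)
        if max_diameter(dist, labels) <= c:
            return k, labels


# Toy data: five points on a line; any user-defined distance could fill dist.
points = [0.0, 1.0, 10.0, 11.0, 20.0]
dist = [[abs(a - b) for b in points] for a in points]
k, labels = smallest_k_within(dist, c=2.0)
print(k, labels)
```

Since the code only ever reads entries of `dist`, no distances to new points (such as cluster means) are required, which is the property asked for in the question.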

Density-based clustering algorithms are interesting too: DBSCAN is a rather famous one.
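
For orientation, a compact Python sketch of the DBSCAN idea on a precomputed distance matrix could look as follows. The parameter names `eps` and `min_pts`, the label convention (`-1` for noise) and the toy data are assumptions made for this example. Note that DBSCAN chains neighbouring points together, so a cluster can contain two points that are farther apart than `eps`; it therefore does not enforce the maximum-distance condition from the question by itself.

```python
def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN on a precomputed distance matrix.

    Returns one label per point: 0, 1, ... for clusters, -1 for noise.
    """
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned; do not expand again
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point: grow the cluster
                queue.extend(nb)
    return labels


# Toy data: five points on a line, distances given as a matrix.
points = [0.0, 1.0, 10.0, 11.0, 20.0]
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=2.0, min_pts=2)
print(labels)
```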

On software: I believe R has all you need to explore the possibilities of partitional / hierarchical clustering. Additional packages can provide you with more sophisticated and recent algorithms.

For a related topic with an example, please have a look at my answer in here: http://math.stackexchange.com/questions/887314/finding-similarity-between-elements-using-statistics/887357#887357 — Avitus, Aug 12 '14 at 18:42
Thank you very much for your suggestions! Especially the PAM was useful. I implemented it in MATLAB using http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Partitioning_Around_Medoids_%28PAM%29 and it seems to work just fine. It's not exactly what I was looking for but at least it doesn't make me calculate extra distances. Also it's still pretty close to what I wanted. — Murp, Aug 13 '14 at 16:59
