0

I have a symmetric matrix of non-Euclidean distances of size $N$ (say, 500) and I would like to select one cluster of a fixed size $K$ (say, 25), so that it has the smallest average distance within this cluster. What is a good algorithm for doing that given combinatorial complexity of the problem?

Currently I have implemented the following algorithm, which is not perfect in finding the optimum:

  1. Take $K$ points at random, form the cluster
  2. Find $K$ points with smallest average distance to the points in the cluster at step 1). Call these $K$ points the new cluster
  3. Repeat 1) and 2) until selected $K$ points are the same in both steps or until the new cluster has the larger average distance than the old cluster.

1 Answers1

0

Seems like you're re-inventing the k-means algorithm that has been here for a while (Lloyd's algorithm from 1957). Although the problem normally minimizes Euclidean distances, it shouldn't be a problem as long as your distance function is a metric.

k-means++ due to careful seeding provides more stable results in less iterations (original paper).

Pang
  • 399
  • 5
  • 8
Tombart
  • 116
  • Thank you for the answer! It applies well to the problem with one exception. What I aim to find is just one optimal cluster with minimum distance, whereas k-means and k-means++ would minimize the sum of distances over all clusters. Are you aware of a modification for k-means or separate algorithms in which we just aim to find one cluster with minimum within distance? – Andrey Zubanov Apr 19 '19 at 20:58