7

Let's say you have a set of numbers S and you want to obtain subsets of $S$, $s_1,s_2,\ldots, s_i$, where $i < N$.

Is there a particular operation that will group the numbers that are "close to each other"?

I will give an example to clarify the question:

Let's say you have a set $S$ which contains $\{1.4, 2, 2.5, 2.7, 14, 16, 49, 57, 58\}$

And let's say that you can have maximum of $N=4$ subsets.

So, you might end up with a result that looks like this:

$s_1 = \{1.4, 2, 2.5, 2.7\}$
$s_2 = \{14, 16\}$
$s_3 = \{49\}$
$s_4 = \{57, 58\}$

I am looking for either the name what such a problem would be called (which I can then use to research and write an algorithm). If you can provide a simple solution, even better.

Thank you for any help with this.

Stahl
  • 23,212
chaimp
  • 173
  • set-theory is not the tag. The algorithm will depend on what you mean by "close". – Aryabhata Oct 05 '10 at 14:59
  • I agree that it depends on what I mean by "close". I am looking for an algorithm that at least generally does what I'm asking so I have a starting point. Right now, I don't even know what to call this. Do you have any suggestions? – chaimp Oct 05 '10 at 15:01
  • You are just looking for the $N$ maximal values of $x_{i+1}-x_i$ after sorting your set $S = {x_1, \dots, x_n}$. I doubt this has some name. – j.p. Oct 05 '10 at 15:04
  • 3
    Sounds like "clustering"... – J. M. ain't a mathematician Oct 05 '10 at 15:05
  • The differences belonging to your example are 0.6, 0.5, 0.2, 11.3, 2, 33, 8, 1. The 4 biggest values are 33 (= 49-16), 11.3 (= 14-2.7) and 8 (= 57-49) giving your result. – j.p. Oct 05 '10 at 15:12
  • Thank J.M. for the name "clustering". Makes perfect sense and sometimes giving something a name can make all the difference. – chaimp Oct 05 '10 at 15:15
  • 1
    And big thank you to jug for providing the answer. That is a very elegant solution. It does not require any recursion or special analysis. I think that's exactly what I'm looking for. – chaimp Oct 05 '10 at 15:17
  • 3
    Jug's answer makes sense, but it's not "the" answer. In some applications it would not be optimal. For instance, let S have 100 values in the interval [0,1], 100 values in [2,3], 100 values in [4,5], 100 values in [6,7], and one value of 10. If the objective is to minimize the sums of squares to the centers of the clusters, 10 must be clustered with the values in [6,7] even though the gap is largest. It merely turns out in the particular example of the OP, the clustering is so strong that any remotely reasonable algorithm will work. – whuber Oct 05 '10 at 15:57
  • 1
    @whuber: I agree. That's why I posted it just as a comment. It's maybe "the" most trivial approach (which might be anyway enough for jeffp). @jeffp: Even if you won't implement the methods in the answer of whuber, I'd recommend to accept his answer (or whatever better answer will be posted), as it provides a better solution in general. – j.p. Oct 05 '10 at 16:13
  • Thanks - I have implemented jug's comment and it is working for some cases, but as whuber pointed it there are cases where it does not do what I would consider to be ideal. Either way it is a great starting point. I will probably end up going with what whuber suggested for my final implementation. Thanks everybody! – chaimp Oct 06 '10 at 02:06

2 Answers2

10

There are many solutions to this problem extant, each rediscovered many times. Search "statistical clustering methods" or "cluster analysis." A particularly nice one for your problem is called "K-means". This was rediscovered by cartographers (!) where it is called "Jenks' method" (and is worth mentioning because it's used by some popular mapping software, so perhaps more people have heard of this than of any other method). The idea behind K-means is to minimize the population-weighted sum of variances of the subsets.

BTW, the challenge in most clustering algorithms lies in determining the number of clusters; usually that needs to be specified (as $N = 4$ in this example) or at least hinted at by the user. For instance, there always exists a minimum K-means solution: partition $S$ into singletons. If you require $i < n$, merge the two nearest neighbors but leave all the other singletons as they are: the sum of variances is then just one-quarter the square of the smallest difference among the $x_i$.

whuber
  • 8,531
0

There is a somewhat related operation, condensation, in the theory of linear orderings. For example, you can take a linear ordering and identify any two elements $x,y$ which are connected by some finite chain, i.e. there are elements $x = v_1,\ldots,v_n = y$ such that $v_1 < \cdots < v_n$ and there are no elements 'in between', i.e. no $z$ such that $v_i < z < v_{i+1}$. This will compress all copies of $\mathbb{Z}$ and $\mathbb{N}$ (or its opposite) to a single point.

This particular condensation can be used to (de)construct all countable linear orders, if I remember correctly. There are also some other condensations, take a look at Rosenstein's Linear orderings.

Yuval Filmus
  • 57,157