0

I'm not particularly well-educated in mathematics. I've been reading about the, 'curse of dimensionality,' and how distance metrics become meaningless in high-dimensional spaces.

I then thought: What if you treat each dimension in a high-dimensional vector as a 1D scalar, and find the vector with the nearest scalar value for the same dimension. Then, you add the index for this, 'single-dimension nearest neighbor,' to a new vector. You iterate through this process for every dimension in the vector and in the end, you define the nearest neighbor as the vector that appears most often in the vector of single-dimensional nearest neighbors.

Is this utter nonsense from a mathematical perspective? I have no idea how to even begin analyzing the efficacy of this mathematically.

Rushabh Mehta
  • 13,663
  • " I've been reading about the, 'curse of dimensionality,' and how distance metrics become meaningless in high-dimensional spaces." Never heard of it. Can you fill me in? Distance metrics are no problem from finite dimensions and the Euclid metric $\sqrt{\sum_{k=1}^n (x_k-y_k)^2}$ is fairly standard. Now, infinite dimensions may have an issue, which yours doesn't solve. – fleablood Jan 21 '21 at 00:42
  • I don't really understand the approach. Could you do an example for us to see? Also, @fleablood is right: we do have distance metrics in finite dimensions, however large (in fact, we have infinitely many of them). Also, don't pick random tags for your question. I've edited the tags to something more appropriate. – Rushabh Mehta Jan 21 '21 at 00:43
  • Ah... I read up. https://en.wikipedia.org/wiki/Curse_of_dimensionality. The issue isn't that it's hard to have metrics. Its that statistically the distances between two points will be very insignificant due to the high number of dimensions will make variation minimal. – fleablood Jan 21 '21 at 00:47
  • @fleablood: the question https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions has some quotes that show where this is coming from. The Euclidean metric is well defined in high dimensions, but the volume distribution makes some of our usual concepts not useful – Ross Millikan Jan 21 '21 at 00:47
  • @RossMillikan gotcha! Okay, but the OP's method (which seems to be a variation of the Euclidean metric.... good job for the OP; creative analysis and (perceived) problem solving...) unfortunately doesn't address the issue of volume distribution, sadly. – fleablood Jan 21 '21 at 00:50
  • Just so you know, there are some algorithms to try to mitigate the "curse of dimensionality". See Laplacian Eigenmaps, for example. – Ty Jensen Jan 21 '21 at 01:03

1 Answers1

1

When you make a definition like this there are two criteria of interest. The first is whether it is well defined-is it clear what vector it will choose as the nearest neighbor. The second is whether it is useful. I think well defined fails here. If you choose one vector out of a group of $10,000$ in $100$ dimensional space, it sounds like you will find the vectors that are closest in each coordinate direction to the chosen vector and take a plurality vote among them. If you have many more vectors than dimensions, the nearest vector in each dimension may be different, so each of them gets one vote. Which is the nearest? Maybe you just want to make a list of the several nearest vectors. That will clearly be well defined.

Whether it is interesting depends on whether the vector or vectors you select as nearest neighbor(s) are similar to the original one in the sense of interest to you and your community. I can't comment on that.

Ross Millikan
  • 374,822