
Finding approximate list of nearest neighbors for high-dimensional vectors


Jim Meyer
Jul 29, 2004, 2:04:23 PM

Hi, I guess this one is for those with a firm knowledge of data structures.
Basically the problem is that of finding an approximate list of nearest
neighbors for high-dimensional vectors (1.000 plus dimensions, 10.000 plus
vectors).

A simple brute-force approach takes O(d * n^2) for n vectors with d
dimensions, in order to determine the k nearest neighbors for each vector.
Instead, I'm looking for a smarter and faster approach, where I don't have
to examine each unique pair of vectors, in order to get an approximate list
of nearest neighbors. The list does not have to be the exact nearest
neighbors, but should of course contain those that are good candidates.
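
A minimal brute-force baseline of this kind might look like the following
(a sketch only, assuming dense NumPy arrays and Euclidean distance; the
function name and parameters are illustrative, not anyone's actual code):

import numpy as np

def knn_brute_force(X, k):
    """X: (n, d) array. Returns the indices of the k nearest neighbors
    (Euclidean) for every vector; O(d * n^2) time overall."""
    sq = (X ** 2).sum(axis=1)
    # pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # exclude each point itself
    return np.argsort(d2, axis=1)[:, :k]  # k nearest indices per row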

Currently, each vector is stored in its own hash table, which has an entry
for every non-zero dimension. Also, I have a hash table with an inverted
index, which allows me to lookup the vectors that have a non-zero value for
a certain dimension. This information could be used to create a large
matrix, but since the vectors seldom have values for even 1/10 the
dimensions, the matrix is very sparse and has a lot of zero-values.
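
As a concrete sketch of that storage scheme (an illustration only; the dict
layout and function names are made up): each vector is a {dimension: value}
map, the inverted index maps a dimension to the ids of vectors that are
non-zero there, and distances touch only the union of non-zero dimensions.

from collections import defaultdict
import math

def build_inverted_index(vectors):
    """vectors: dict of id -> {dim: value}. Returns dim -> set of ids."""
    inverted = defaultdict(set)
    for vid, vec in vectors.items():
        for dim in vec:
            inverted[dim].add(vid)
    return inverted

def sparse_distance(a, b):
    """Euclidean distance over the union of non-zero dimensions of a and b."""
    dims = set(a) | set(b)
    return math.sqrt(sum((a.get(d, 0.0) - b.get(d, 0.0)) ** 2 for d in dims))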

I know of spatial access trees, but have been unable to find any that can
handle this many dimensions effectively.

Does anyone know any smart algorithms, data structures or heuristics that
can be used?

I would appreciate any help I can get.

Regards, Jim.


rossum
Jul 29, 2004, 5:21:46 PM

On Thu, 29 Jul 2004 20:04:23 +0200, "Jim Meyer" <jimm...@email.dk>
wrote:

>Hi, I guess this one is for those with a firm knowledge of data structures.
>Basically the problem is that of finding an approximate list of nearest
>neighbors for high-dimensional vectors (1.000 plus dimensions, 10.000 plus
>vectors).
>

Point of clarification: are you using "." as a thousands separator
here? That is, do you have ten thousand plus vectors of one thousand
dimensions each?

rossum

--

The ultimate truth is that there is no Ultimate Truth

Dr Chaos
Jul 29, 2004, 7:49:06 PM

There are a number of algorithms for higher-dimensional searching, but
dimension=1000 is certainly too high for any method, if the data are
'generic' and really fill a 1000 dimensional space.

The 'k-d tree' algorithm is prominent, for d=2 to 15 or so. I've just
finished a new implementation myself.

For your problem I think you will have to work harder, using knowledge
about the data.

In particular, in higher dimensional spaces, unlike our intuitive sense
of 2-d and 3-d, almost all the volume is close to the edge.

I.e. in a 1000-dimensional sphere, almost all the volume is close to a
999-dimensional annulus near the surface. And the distance to the nearest
neighbor is not much less than the distance to a random neighbor.

The result is that the smarter algorithms (like k-d trees), which can find
near neighbors quickly in low-dimensional Euclidean and similar spaces, lose
most of their advantage there.

I would think about devising projections of your high-dimensional space,
using domain-specific knowledge to minimize important topological problems;
find possible near neighbors in medium-dimensional spaces with k-d trees,
using multiple projections; and then confirm the distances back in the
high-dimensional space.
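
A rough sketch of that pipeline (an illustration, not Dr Chaos's code: a
random Gaussian projection stands in for a domain-specific projection, and
SciPy's cKDTree is assumed to be available):

import numpy as np
from scipy.spatial import cKDTree

def approx_knn_via_projection(X, k, proj_dim=10, n_candidates=50, seed=0):
    """X: (n, d) array with n > n_candidates. For each row, return k
    approximate nearest neighbors: candidates come from a k-d tree in the
    projected space, final ranking uses distances in the original space."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], proj_dim))
    Y = X @ R                                       # medium-dimensional projection
    tree = cKDTree(Y)
    _, cand = tree.query(Y, k=n_candidates + 1)     # +1: each point finds itself
    result = []
    for i, row in enumerate(cand):
        cands = [int(j) for j in row if j != i]
        d = np.linalg.norm(X[cands] - X[i], axis=1) # confirm in original space
        result.append([cands[j] for j in np.argsort(d)[:k]])
    return result

This uses a single projection for brevity; the suggestion above is to use
several projections and merge the candidate sets before confirming distances.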

Jim Meyer
Jul 29, 2004, 8:03:47 PM

Yes, it's ten thousand plus vectors of a thousand plus dimensions each.

"rossum" <ross...@coldmail.com> wrote in message
news:sgqig0h5aljasnrmg...@4ax.com...

Jim Meyer
Jul 30, 2004, 4:17:12 AM

"Dr Chaos" <mbNOSPA...@yahoo.dot.com> wrote in message
news:cec2dh$f69$1...@news1.ucsd.edu...


In fact, there are sometimes up to 20,000 dimensions or so, but I'm using a
weighting function to select up to 1,000 of the most important dimensions.


>
> The 'k-d tree' algorithm is prominent, for d=2 to 15 or so. I've just
> finished a new implementation myself.
>
> For your problem I think you will have to work harder, using knowledge
> about the data.
>
> In particular, in higher dimensional spaces, unlike our intuitive sense
> of 2-d and 3-d, almost all the volume is close to the edge.
>
> I.e. in a 1000 dimensional sphere, almost all the volume is close to 999
> dimensional annulus. And the distance to the nearest neighbor is not
> much less than the distance to a random neighbor.
>

Yes, if presented in a matrix, the matrix will be very, very sparse - around
90% of the values are 0 (all my other values are positive). My inverted index
allows me to calculate the distance efficiently, by only evaluating the
union of non-zero dimensions for a pair of vectors.

> The result there is that the smarter algorithms (like k-d trees) which
> can find near neighbors in Euclidean and similar spaces.
>
> I would think about devising projections of your high-dimensional space,
> using domain specific knowledge to minimize important topological
> problems, find possible near neighbors in medium-dimensional spaces with
> kd trees,
> using multiple projections, and then confirm the distances back in the
> high dimensional spaces.

Does the k-d tree utilize the fact that most vectors have only a fraction of
non-zero dimensions?

In my experience projections are computationally expensive, typically
O(n^2) - at least methods such as Singular Value Decomposition and most other
matrix rank-reduction techniques. This "preprocessing" step will be too
expensive unless it can be done in a smarter way. Suggestions?


Paul E. Black
Jul 30, 2004, 1:17:20 PM

"Jim Meyer" <jimm...@email.dk> writes:

> Basically the problem is that of finding an approximate list of nearest
> neighbors for high-dimensional vectors (1.000 plus dimensions, 10.000 plus
> vectors).

If the vectors typically have most of their magnitude in just a few
dimensions, you might classify the vectors by their major component(s).
The (more) exact matching is done in much smaller groups.
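
A small sketch of that grouping step (a simplification using only the single
largest component; the names are made up, not the poster's code):

from collections import defaultdict

def group_by_major_component(vectors):
    """vectors: dict of id -> {dim: value} sparse maps. Returns a map from a
    dimension to the ids whose largest-magnitude component lies there."""
    groups = defaultdict(list)
    for vid, vec in vectors.items():
        major = max(vec, key=lambda d: abs(vec[d]))
        groups[major].append(vid)
    return groups

The (more) exact matching can then run within each group using whatever
distance routine you already have.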

A really crazy possibility is using a genetic algorithm/simulated
annealing search to construct an approximate solution to a traveling
salesman problem.

Naw, you'd probably have to make order of n^2 swaps to get anything
close to an optimal path. Even then translating an optimal path into
nearest neighbors is fraught with difficulties.

-paul-
--
Paul E. Black (p.b...@acm.org)

Thad Smith
Jul 31, 2004, 11:53:49 AM

Jim Meyer wrote:
>
> Hi, I guess this one is for those with a firm knowledge of data structures.
> Basically the problem is that of finding an approximate list of nearest
> neighbors for high-dimensional vectors (1.000 plus dimensions, 10.000 plus
> vectors).
>
> A simple brute-force approach takes O(d * n^2) for n vectors with d
> dimensions, in order to determine the k nearest neighbors for each vector.
> Instead, I'm looking for a smarter and faster approach, where I don't have
> to examine each unique pair of vectors, in order to get an approximate list
> of nearest neighbors. The list does not have to be the exact nearest
> neighbors, but should of course contains those that are good candidates.

One approach would be to choose one point, then compute the distance to
all other points, giving its neighbors. Points close to the first point
will tend to have the same neighbors. In general, if the distance from
S to P is pd and from S to Q is qd, then by the triangle inequality
|pd - qd| <= distance(P, Q) <= pd + qd. From that, you can use the
distance list of the first point to restrict the points that you consider
as candidate neighbors.

After you have exhausted the neighborhood of the original point, choose
another point distant from it to repeat the process for another
cluster. For following clusters, attempt to find new points which are
distant from all the primary points so far.
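
A sketch of that pruning, simplified to a single seed point (an illustration
only; it relies on the triangle-inequality lower bound |d(S,P) - d(S,Q)| on
d(P,Q) to skip candidates that cannot beat the current k-th best distance):

import numpy as np

def knn_with_seed_pruning(X, k):
    """X: (n, d) array. Exact k nearest neighbors per point, skipping many
    distance computations via the seed-distance lower bound."""
    n = X.shape[0]
    d_seed = np.linalg.norm(X - X[0], axis=1)       # one pass: seed to everything
    neighbors = []
    for p in range(n):
        lower = np.abs(d_seed - d_seed[p])          # lower bounds on d(p, q)
        order = np.argsort(lower)                   # try the most promising first
        best, worst = [], np.inf
        for q in order:
            if q == p:
                continue
            if lower[q] >= worst:
                break                               # nothing further can do better
            d = float(np.linalg.norm(X[p] - X[q]))
            best.append((d, int(q)))
            best.sort()
            best = best[:k]
            if len(best) == k:
                worst = best[-1][0]
        neighbors.append([q for _, q in best])
    return neighbors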

Thad

Rober...@yahoogroups.com
Jul 31, 2004, 9:12:19 PM

> From: "Jim Meyer" <jimm...@email.dk>

> Basically the problem is that of finding an approximate list of
> nearest neighbors for high-dimensional vectors (1.000 plus dimensions,
> 10.000 plus vectors).
(where the period is the thousands separator in your notation)

Use a fixed but pseudorandom linear-combination function to condense
each 1000-vector down to a much smaller number of dimensions, for
example I use 15-dimensional vectors typically. Then compute the
distribution of coordinate-values of each single dimension (of the 15)
separately, to decide how to bin them efficiently (appx. ten or so
equally-filled bins across the span of each such dimension). Then build
the vectors one by one into a decision tree based on successive
coordinates binned in that way, but before inserting each new vector,
use dynamic programming to find nearest neighbor(s) to the new vector
among all those previously in the decision tree. (I assume by "nearest"
you mean smallest value of distance, per some metric, presumably the
usual "Euclidean" metric.)
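
A rough sketch of the condensing and binning steps as described (this is an
illustration, not the poster's actual ProxHash code; the decision-tree and
dynamic-programming search parts are omitted):

import numpy as np

def condense_and_bin(X, out_dim=15, n_bins=10, seed=42):
    """Project each row of X to out_dim coordinates with a fixed pseudorandom
    linear combination, then bin every condensed coordinate into roughly
    equally filled bins based on its empirical quantiles."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], out_dim))  # fixed pseudorandom mixing
    Y = X @ R                                       # condensed vectors
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]    # interior quantile levels
    edges = np.quantile(Y, qs, axis=0)              # (n_bins - 1, out_dim) bin edges
    bins = np.stack([np.searchsorted(edges[:, j], Y[:, j])
                     for j in range(out_dim)], axis=1)
    return Y, bins                                  # condensed vectors and bin codes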

For your purpose, if you have all the data at the start, rather than
incrementally inserting new data, it may be better to fold *all* the
condensed vectors into the decision tree at once, and then for each
vector use dynamic programming to find nearest neighbors among all
other vectors rather than only among those that arrived earlier.

With sufficiently large dimension in the condensed space, and with
original data not deliberately contrived to break the condensing
method, the number of false collisions (condensed pair very near
despite original pair very far apart) will be small enough to be
manageable.

I call my method "ProxHash" i.e. "proximity hashing". Typically I apply
it to ASCII text strings/messages, generating statistics of
character-sequences as the very-high-dimension vectors, then running
the ProxHash function to condense each such vector to 15 dimensions.
But whatever your original high-dimension vectors are, the same method
should work.

Note the dynamic-programming algorithm gives you exactly the nearest
neighbors in condensed space, which are approximately the nearest neighbors
in the original space.

You may wish to use dynamic programming to generate only one or two or
three nearest neighbors for each point, and build a nearest-neighbor
graph connecting them, and traverse that graph to get additional near
neighbors if you want more than that many from any given vector. But if
you fall into a tight cluster whose nearest neighbors are all each other,
so that that portion of the graph is disconnected from all the rest, and
you need one additional vector near them all, then you may need to revisit
the dynamic-programming algorithm to find the next nearest neighbor beyond
the ones you already have, and thereby link that cluster to some outside
vector.
