[SciPy-User] efficient computation of point cloud nearest neighbors

Christian Jauvin

unread,

May 27, 2011, 7:13:19 PM5/27/11

to scipy...@scipy.org

Hi,

I need to compute the k nearest neighbors of every point in a point
cloud of at least a million points.

I've been looking at the documentation for the scipy.spatial.KDTree
and cKDTree classes, but it's not clear to me how I should use their
query() method (and possibly their distance_upper_bound parameter) to
optimize the computation.

I've also been looking at the ANN and FLANN C++ libraries, but in both
cases I've had trouble compiling/installing their Python bindings on
my Ubuntu system.

I'd appreciate some advice as to what would be the ideal strategy to
solve this problem (either with Scipy or some other package that I
wouldn't know about).

Thanks in advance,

Christian
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

Ralf Gommers

unread,

May 29, 2011, 1:59:37 PM5/29/11

to SciPy Users List

On Sat, May 28, 2011 at 1:13 AM, Christian Jauvin <cja...@gmail.com> wrote:

Hi,

I need to compute the k nearest neighbors of every point in a point
cloud of at least a million points.

I've been looking at the documentation for the scipy.spatial.KDTree
and cKDTree classes, but it's not clear to me how I should use their
query() method (and possibly their distance_upper_bound parameter) to
optimize the computation.

I've also been looking at the ANN and FLANN C++ libraries, but in both
cases I've had trouble compiling/installing their Python bindings on
my Ubuntu system.

This is the second issue with ANN bindings reported in a week, so I had a look at scikits.ann. Then I found http://blog.physionconsulting.com/?p=17. So it looks like there should be a big "this is deprecated" warning on scikits.ann. It would be helpful if someone can confirm that KdTree/cKdTree in scikits.spatial is about as fast as ANN/FLANN.

Thanks,
Ralf

Gael Varoquaux

unread,

May 29, 2011, 2:15:38 PM5/29/11

to SciPy Users List

On Sun, May 29, 2011 at 07:59:37PM +0200, Ralf Gommers wrote:
> This is the second issue with ANN bindings reported in a week, so I had a
> look at scikits.ann. Then I found

> [2]http://blog.physionconsulting.com/?p=17. So it looks like there should

> be a big "this is deprecated" warning on scikits.ann. It would be helpful
> if someone can confirm that KdTree/cKdTree in scikits.spatial is about as
> fast as ANN/FLANN.

Regarding speed of nearest neighbors, a big caveat is that it depends a
lot on the dimensionality of the search space.

For low dimensionality, KDTree is super fast. It breaks down at around
10d, because the space starts becoming too 'empty': splitting it with
plane to separate in half-spaces with equal partitions of points ends up
quickly creating as many planes as they are points. The next thing is the
BallTree, that creates nested balls. In low dimensions it is as fast as
the KDTree, and it scales a bit higher, up to 20d is a good guess. A
BallTree implementation, contributed by Jake VanderPlas, can be found in
the scikits.learn. For references, see
http://scikit-learn.sourceforge.net/modules/neighbors.html#efficient-implementation-the-ball-tree

Above d ~ 20, a brute force search is quicker if you want exact nearest
neighbor. The scikits.learn's nearest neighbor search implements by
default an automatic switch.

I have never tried ANN (approximate nearest neighbor) but I wouldn't be
surprised if it were faster than a brute force in this regime.

All that to say the ANN probably has strong usecases and cannot always be
replaced by a KDTree.

Gael

Ralf Gommers

unread,

May 29, 2011, 2:32:58 PM5/29/11

to SciPy Users List, barr...@gmail.com

scikits.ann exposes a single class, kdtree. scipy.spatial.KDTree.query seems to be able to do approximate nearest neighbors:

    KDTree.query(self, x, k=1, eps=0, p=2, distance_upper_bound=inf)
    .....
    eps : nonnegative float
        Return approximate nearest neighbors; the kth returned value
        is guaranteed to be no further than (1+eps) times the
        distance to the real kth nearest neighbor.

So it looks to me like scipy.spatial has things covered. Which is also what Barry Wark (scikits.ann author) seems to say in the blog post I linked to.

Barry, could you confirm this and if appropriate put up some deprecation warnings?

Thanks,
Ralf

denis

unread,

May 30, 2011, 1:01:27 PM5/30/11

to scipy...@scipy.org

Christian, folks,
a couple of comments:

FLANN gets a lot of speed by quitting early, after looking at e.g.
.1N or .01N or (FLANN default) 32*leafsize points.
Accuracy *may* decrease -- guarantees are gone --
but I've found big speedup for ~ same accuracy,
especially in dimensions say > 20 where "distance whiteout" sets in.
I've added cutoff= to Anne Archibald's nice cython ckdtree.pyx,
also verbose= to help use it;
Would like a friendly proofreader
or else post some data, let me try it here.

What's your metric ?
(Choice of metric is *really* important for clustering -- Hastie p.
506).
FLANN does Euclidean, L2, only (and returns dist^2);
ANN can be compiled 3 ways for L1 L2 Lmax;
cKDTree does any Lp metric.

cheers
-- denis

On May 28, 1:13 am, Christian Jauvin <cjau...@gmail.com> wrote:
> Hi,
>
> I need to compute the k nearest neighbors of every point in a point
> cloud of at least a million points.

Anne Archibald

unread,

May 30, 2011, 2:20:09 PM5/30/11

to SciPy Users List

Hi,

In fact, I wrote KDTree and cKDTree based on a description of the
algorithm used in ANN, so it should have the same O behaviour. cKDTree
has certainly not seen the amount of tuning and polishing ANN has, but
it should be pretty clean.

For the OP's problem, the easiest way to get an answer is to make a
cKDTree of the point cloud and then just call query with an array of
all the points. This runs a separate query for each point, but each
query uses the same tree and the looping is done in C, so it should be
fairly fast.

If this is not fast enough, it might be worth trying a two-tree query
- that is, putting both the query points and the potential neighbours
in kd-trees. Then there's an algorithm that saves a lot of tree
traversal by using the spatial structure of the query points. (In this
case the two trees are even the same.) Such an algorithm is even
implemented, but unfortunately only in the pure python KDTree. If the
OP really needs this to be fast, then the best thing to do would
probably be to port KDTree.query_tree to cython. The algorithm is a
little messy but not too complicated.

I don't know how the FLANN optimizations fit into all this.

Anne

Sturla Molden

unread,

May 31, 2011, 11:27:29 AM5/31/11

to SciPy Users List

Den 30.05.2011 20:20, skrev Anne Archibald:
> If this is not fast enough, it might be worth trying a two-tree query
> - that is, putting both the query points and the potential neighbours
> in kd-trees. Then there's an algorithm that saves a lot of tree
> traversal by using the spatial structure of the query points. (In this
> case the two trees are even the same.) Such an algorithm is even
> implemented, but unfortunately only in the pure python KDTree. If the
> OP really needs this to be fast, then the best thing to do would
> probably be to port KDTree.query_tree to cython. The algorithm is a
> little messy but not too complicated.

In this case we just need one kd-tree. Instead of starting from the
root, we begin with the leaf containing the query point and work our way
downwards. We then find a better branching point from which to start
than the root. That is not messy at all :-)

Another thing to note is that making the kd-tree is very fast whereas
searching it is slow. So using multiprocessing is an option.

Sturla

Anne Archibald

unread,

Jun 1, 2011, 4:07:28 PM6/1/11

to SciPy Users List

On 31 May 2011 11:27, Sturla Molden <stu...@molden.no> wrote:
> Den 30.05.2011 20:20, skrev Anne Archibald:
>> If this is not fast enough, it might be worth trying a two-tree query
>> - that is, putting both the query points and the potential neighbours
>> in kd-trees. Then there's an algorithm that saves a lot of tree
>> traversal by using the spatial structure of the query points. (In this
>> case the two trees are even the same.) Such an algorithm is even
>> implemented, but unfortunately only in the pure python KDTree. If the
>> OP really needs this to be fast, then the best thing to do would
>> probably be to port KDTree.query_tree to cython. The algorithm is a
>> little messy but not too complicated.
>
> In this case we just need one kd-tree. Instead of starting from the
> root, we begin with the leaf containing the query point and work our way
> downwards. We then find a better branching point from which to start
> than the root. That is not messy at all :-)

But we can sometimes do better - all the leaves in a leaf node will
have very similar neighbour sets, for example, so in principle one can
avoid traversing (part of) the tree once for each. I'm not sure how
much speedup is really possible, though; since there are kn neighbours
to be listed, you're never going to beat O(kn), and the simple
query-everything approach is only O(kn log n) or so.

> Another thing to note is that making the kd-tree is very fast whereas
> searching it is slow. So using multiprocessing is an option.

cKDTrees cannot currently be copied, but it would be simple to
implement. This would save a bit of time when multiprocessing. That
said, they are also immutable, so multiple threads/processes can
happily operate on the same one.

Anne

Sturla Molden

unread,

Jun 1, 2011, 5:47:40 PM6/1/11

to scipy...@scipy.org

Den 01.06.2011 22:07, skrev Anne Archibald:
> cKDTrees cannot currently be copied, but it would be simple to
> implement. This would save a bit of time when multiprocessing.

There is not much to save here, constructing a kd-tree is very fast
compared to searching it. At least it is in the common case of finding k
nearest neightbours for each point in a cloud of n points.

> That
> said, they are also immutable, so multiple threads/processes can
> happily operate on the same one.

Shared memory do not have the same base address in different processes,
so all pointers are invalid. Thus the kd-tree must be built with integer
offsets instead of pointers, like my first Python version did.

You'll get the same problem if you want to serialize a kd-tree: pointers
must be saved as offsets.

os.fork() will copy-on-write on Linux though.

Sturla Molden

unread,

Jun 1, 2011, 6:17:10 PM6/1/11

to scipy...@scipy.org

Den 01.06.2011 23:47, skrev Sturla Molden:
>
> os.fork() will copy-on-write on Linux though.
>

On Windows we can allocate shared memory as private copy-on-write pages.
It is not as useful as Linux' os.fork(), as pointers cannot be shared,
but still a nice way to save some RAM if we want to share (and
write-protect) large ndarrays. I might add this option to my shared
memory ndarrays one day, but not today :)

Reply all

Reply to author

Forward