pairwise distance of row vectors with missing (nan) values


Moritz Beber

Jul 23, 2014, 10:28:43 AM
to bottl...@googlegroups.com
Dear all,

I recently wanted to compute the pairwise distances between the row vectors of a roughly 3k x 3k matrix. scipy.spatial.distance.pdist is usually very good at this and offers a nice variety of distance measures. My problem, however, is that the matrix contains NaN values, which scipy does not handle the way I need: for each pair of vectors I would like a joint mask, so that a position is ignored whenever either vector has a NaN there. Please also see http://stackoverflow.com/questions/24781461/compute-the-pairwise-distance-in-scipy-with-missing-values for more details.
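To make the joint-mask idea concrete, here is a minimal pure-NumPy sketch. The function name `nan_euclidean_pdist` is made up for illustration, it is not part of scipy or bottleneck, and it assumes Euclidean distance with no rescaling for the number of dropped positions. Pairs that share no valid position are left as NaN. The output order follows the condensed form that pdist uses.

```python
import numpy as np


def nan_euclidean_pdist(X):
    """Condensed pairwise Euclidean distances between the rows of X.

    For each pair of rows, only positions where BOTH rows are non-NaN
    are used (the joint mask). Hypothetical helper, for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    valid = ~np.isnan(X)
    out = np.full(n * (n - 1) // 2, np.nan)
    k = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            mask = valid[i] & valid[j]  # joint mask for this pair of rows
            if mask.any():
                diff = X[i, mask] - X[j, mask]
                out[k] = np.sqrt(np.dot(diff, diff))
            k += 1
    return out


X = np.array([[0.0, 0.0, np.nan],
              [3.0, 4.0, 1.0],
              [np.nan, 0.0, 1.0]])
# Pairs (0,1), (0,2), (1,2) give distances 5.0, 0.0 and 4.0
print(nan_euclidean_pdist(X))
```

The double Python loop is far too slow for a 3k x 3k matrix; the sketch only pins down the semantics that a compiled version would have to reproduce.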

Today I remembered that bottleneck is great with NaNs and immediately found the `nn` function. The code is very hard for me to read, though, because of the template system. I found bottleneck/src/template/func/nn.py, but is there a way to look at the generated Cython code? And can parts of the code be reused elsewhere without all the bells and whistles of bottleneck?

Thank you for any pointers,
Moritz

Steven Troxler

Jul 23, 2014, 2:32:11 PM
to bottl...@googlegroups.com
Hi Moritz,

The cython code isn't a part of the repository, but if you build the code, the template system will generate pyx files (which in turn generate C files and then .so files). So after building, you should be able to find the Cython code here:
  •  bottleneck/bottleneck/src/func/nn.pyx

Hopefully you'll find that much easier to read than the template file.

Good luck with your work! If it works out and you wind up distributing code adapted from bottleneck, remember to include the license.

-Steven

Moritz Beber

Jul 23, 2014, 2:50:30 PM
to bottl...@googlegroups.com
Hey Steven,

Thank you for your reply.


On Wed, Jul 23, 2014 at 8:32 PM, Steven Troxler <steven....@gmail.com> wrote:
> Hi Moritz,
>
> The cython code isn't a part of the repository, but if you build the code, the template system will generate pyx files (which in turn generate C files and then .so files). So after building, you should be able to find the Cython code here:
>   •  bottleneck/bottleneck/src/func/nn.pyx
>
> Hopefully you'll find that much easier to read than the template file.


I thought that would be the case but after running `python setup.py build_ext --inplace` and `python setup.py build` a `find . -name \*.pyx` gets me only ./sandbox/nanmean.pyx which is kinda weird. Any ideas what I'm doing wrong?
 

> Good luck with your work! If it works out and you wind up distributing code adapted from bottleneck, remember to include the license.

Absolutely!
 

> -Steven


Moritz

Steven Troxler

Jul 23, 2014, 2:58:33 PM
to bottl...@googlegroups.com
Okay, I see. Bottleneck gets distributed with the C files (so that cython isn't a requirement), so `python setup.py build` will just use those C files.

To build the pyx files from the templates, try running `make pyx`. You can check the Makefile for details if you want.

-Steven

Moritz Beber

Jul 23, 2014, 3:32:55 PM
to bottl...@googlegroups.com
On Wed, Jul 23, 2014 at 8:58 PM, Steven Troxler <steven....@gmail.com> wrote:
> Okay, I see. Bottleneck gets distributed with the C files (so that cython isn't a requirement), so `python setup.py build` will just use those C files.
>
> To build the pyx files from the templates, try running `make pyx`. You can check the Makefile for details if you want.

I had grabbed the repo from GitHub, but `make pyx` did the trick anyway. Thanks.

If the next question is silly, feel free to point me to the NumPy C documentation, but I'm surprised that the little test `if ai == ai` (in `nanmean_1d_float64_axisNone`, for instance) is really the fastest way to ignore NaNs. When I loop over two vectors and want to apply the same mask to both, do you reckon that `if ai == ai and bi == bi` is still the fastest way to go?
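For readers unfamiliar with the `ai == ai` trick: under IEEE 754, NaN is the only floating-point value that compares unequal to itself, so the self-comparison doubles as a NaN check. Below is a plain-Python sketch of the two-vector loop; `masked_sq_diff_sum` is a hypothetical name, and the squared-difference accumulation is just one possible distance ingredient, chosen for illustration.

```python
def masked_sq_diff_sum(a, b):
    """Sum of squared differences over positions where neither value is NaN.

    Hypothetical helper; returns the sum and the number of positions used.
    """
    total = 0.0
    count = 0
    for ai, bi in zip(a, b):
        # NaN is the only float that is not equal to itself, so this
        # condition is True only when both values are non-NaN.
        if ai == ai and bi == bi:
            d = ai - bi
            total += d * d
            count += 1
    return total, count


nan = float("nan")
print(nan == nan)  # False
print(masked_sq_diff_sum([1.0, nan, 3.0, 4.0], [1.0, 2.0, nan, 6.0]))  # (4.0, 2)
```

Whether this is actually the fastest test inside a compiled Cython loop is exactly the open question here; the sketch only shows that the joint condition is correct.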

Keith Goodman

Jul 23, 2014, 4:14:53 PM
to bottl...@googlegroups.com
On Wed, Jul 23, 2014 at 12:32 PM, Moritz Beber <moritz...@gmail.com> wrote:

> If the next question is silly, feel free to point me to the NumPy C documentation, but I'm surprised that the little test `if ai == ai` (in `nanmean_1d_float64_axisNone`, for instance) is really the fastest way to ignore NaNs. When I loop over two vectors and want to apply the same mask to both, do you reckon that `if ai == ai and bi == bi` is still the fastest way to go?

Try to improve it, then time it.
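One way to follow that advice is a small timing harness built on the standard-library `timeit` module. The sketch below compares two NumPy-level ways of building the joint mask; the function names and the 20% NaN fraction are assumptions made for illustration. It times whole-array operations, not the element-wise test inside a compiled Cython loop, so the numbers will not carry over directly.

```python
import timeit

import numpy as np


def joint_mask_selfcmp(a, b):
    # x == x is False exactly where x is NaN
    return (a == a) & (b == b)


def joint_mask_isnan(a, b):
    return ~(np.isnan(a) | np.isnan(b))


rng = np.random.default_rng(0)
a = rng.random(3000)
b = rng.random(3000)
a[rng.random(3000) < 0.2] = np.nan  # assumed NaN fraction, adjust as needed
b[rng.random(3000) < 0.2] = np.nan

# Check that both candidates agree before comparing their speed
assert np.array_equal(joint_mask_selfcmp(a, b), joint_mask_isnan(a, b))

for fn in (joint_mask_selfcmp, joint_mask_isnan):
    best = min(timeit.repeat(lambda: fn(a, b), number=1000, repeat=5))
    print(f"{fn.__name__}: {best / 1000 * 1e6:.2f} us per call")
```

Taking the minimum over several repeats reduces the influence of background noise on the comparison.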

Moritz Beber

Jul 24, 2014, 6:10:38 AM
to bottl...@googlegroups.com
So I tried a few different versions; you can see the results in this notebook: http://nbviewer.ipython.org/gist/Midnighter/b81d5732a0ef88f2e185

In it I mainly test two versions:

1.) nan_pdist2, which has the fast loop over axis 1.
2.) nan_pdist3, which has the fast loop over axis 0.

What I like about nan_pdist2 is that the distance measure is a separate function (so that I can easily plug in different distance measures), but I'm confused by the increased running times for higher NaN content. Can anyone explain that? I like that nan_pdist3 builds sums going over axis 1 so that entire columns can be skipped. I do not see, however, how to disentangle the distance measure from the loop in that case. Maybe with a template system similar to Bottleneck's that doesn't matter, though.
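The trade-off can be illustrated with a column-at-a-time accumulation in NumPy. This is not the notebook's nan_pdist3; it is a sketch under the assumption that "building sums" means accumulating per-pair squared differences one column after another, and `nan_sqeuclidean_by_column` is a made-up name. Columns that are entirely NaN are skipped outright, but the squared difference is hard-wired into the loop body, which is the entanglement described above.

```python
import numpy as np


def nan_sqeuclidean_by_column(X):
    """Per-pair sums of squared differences, accumulated column by column.

    Hypothetical sketch. Returns (sums, counts) as square matrices, where
    counts[i, j] is the number of columns in which rows i and j are both
    non-NaN.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    sums = np.zeros((n, n))
    counts = np.zeros((n, n), dtype=int)
    for c in range(m):
        col = X[:, c]
        ok = col == col  # False where the entry is NaN
        if not ok.any():
            continue  # the whole column is NaN, so skip it
        filled = np.where(ok, col, 0.0)  # placeholder zeros, masked out below
        diff = filled[:, None] - filled[None, :]
        pair_ok = ok[:, None] & ok[None, :]  # joint mask for every pair
        sums += np.where(pair_ok, diff * diff, 0.0)  # metric is baked in here
        counts += pair_ok
    return sums, counts


X = np.array([[0.0, np.nan, 0.0],
              [3.0, np.nan, 4.0],
              [np.nan, np.nan, 0.0]])
sums, counts = nan_sqeuclidean_by_column(X)
print(np.sqrt(sums[0, 1]), counts[0, 1])  # 5.0 2
```

Swapping in another measure means changing the accumulation line and the final reduction together, which is where a template system could generate one loop per measure.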

Thank you for your input,
Moritz