Distance.jl

278 views
Skip to first unread message

Dahua

unread,
Feb 8, 2013, 11:25:40 AM2/8/13
to juli...@googlegroups.com
I am starting porting our machine learning codes to Julia, and will contribute those parts that I think will be useful to others as packages.

I just completed the Distance package (https://github.com/lindahua/Distance.jl), which provides optimized functions to evaluate distances between vectors. 

This package provides specialized routines to compute column-wise and pair-wise distances, which is remarkably faster (sometimes over 100x) than straightforward loop implementation (see benchmark in the readme). Many algorithms involve computation of distances, so I think other projects may benefit from this.

The Devectorize package is extensively used here -- which is quite useful in accelerating computations -- actually one of the motivation that I wrote Devectorize is that I found some vectorized codes are too slow when I tried to implement distance functions in Julia.

Dahua

unread,
Feb 8, 2013, 11:28:05 AM2/8/13
to juli...@googlegroups.com
Please let me know if you have any feedback or suggestions.

I will submit it to METADATA.jl later if there is no need for substantial change.

John Myles White

unread,
Feb 8, 2013, 11:30:38 AM2/8/13
to juli...@googlegroups.com
This package looks really promising. I think that some of this material should be incorporated into Base after a little review given that we've been discussing the fate of a pairwise distances function for some time.

 -- John

Diego Javier Zea

unread,
Feb 8, 2013, 12:11:08 PM2/8/13
to juli...@googlegroups.com
I use to calculate euclidean distance between atoms a lot. It's going to be very useful to me :)

How do you get the speed up with respect to the straightforward loop implementation?

Stefan Karpinski

unread,
Feb 8, 2013, 12:27:12 PM2/8/13
to Julia Dev
This looks great. Thanks, Dahua! Perhaps distance computations should just live in this package for now and not be in base at all. There's really quite a lot going on here and it might be better to just say "hey, check out this package if you need to compute lots of distances."

Dahua

unread,
Feb 8, 2013, 12:51:28 PM2/8/13
to juli...@googlegroups.com
The trick here is to turn this into matrix-multiplication.

Let X and Y be the input matrices, respectively of size d x m and d x n. Each column is a vector of length d, and you want to compute pairwise squared Euclidean distances between these columns and generate an m x n matrix R.

Then R can be expressed as

R[i, j] 
= sum_k (X(k, i) - Y(k, j))^2 
= sum_k X(k, i))^2  + sum_k Y(k, j)^2 - 2 * X(k, i) * Y(k, j)

Let M[i, j] be a matrix such that M[i, j] = sum_k X(k, i) * Y(k, j), then you can express the matrix M as X' * Y -- this is a matrix multiplication, which can be computed very fast using BLAS.

According to the analysis above, the computation can be re-organized into three steps

1 -- compute the sum of squares of each column in X and Y, resulting in sx and sy, this is column-wise sums
2 -- compute the matrix M, using matrix multiplication. Julia provides the At_mul_B function, so that you don't even need to store the transposed version of X.
3 -- combine the terms into R, as R[i, j] = sx[i] + sy[j] - M[i, j]

For squared Euclidean, R is just the result. If you want the Euclidean, just take the square roots.

The same decomposition can be applied to a variety of distances whose core part is a quadratic form. In Distance.jl, I used this strategy for many distances.

Stefan Karpinski

unread,
Feb 8, 2013, 12:56:17 PM2/8/13
to Julia Dev
Great stuff.

Tim Holy

unread,
Feb 8, 2013, 1:01:12 PM2/8/13
to juli...@googlegroups.com
This is great, and will be immensely useful to me. I particularly appreciate
the attention to performance---pairwise distance computation is famously a
bottleneck---and I'm pleased you've gone with a cache-friendly approach.

Since you're interested in performance, I'll explore it a little more. I'm
curious how this does better than a for loop. Have you tried linear indexing?
I.e., just

m = size(A,1)
joffset = (j-1)*m # the offset to the jth column
for j = joffset+1:joffset+m
# now do stuff with A[j]
end

This is usually much faster than using two indices. I use this strategy all
the time.

Alternatively, might you use sub() rather than b[:,j] when snipping out
columns? b[:,j] copies the data, which is not ideal, whereas sub() does not.
However, I bet linear indexing would beat both---it's perfectly in-place with
no temporary creation.

Finally, have you considered using BLAS? It's possible that the combination of
copy! (to a temp buffer) and Blas.axpy! might be a faster way to do the
coordinatewise subtraction when the dimensionality of your data points is
large (although it does require a copy, which might cause it to lose). For
Euclidean you can presumably also use Blas.nrm2. I used BLAS.axpy! in my Grid
package (the restrict/prolong operators), and it's a pretty remarkable
performance gain.

Documentation on BLAS:
http://www.netlib.org/lapack/lug/node145.html

Best,
--Tim

Dahua

unread,
Feb 8, 2013, 1:10:39 PM2/8/13
to juli...@googlegroups.com
Tim,

Level-3 BLAS (in particular GEMM) is used in a variety of pairwise distance computation routines in Distance.jl -- that's why it can achieve over 100x performance gain.

I actually talked about a little bit above (in my reply to Diego above) on how I make it possible to apply GEMM for pairwise Euclidean distance computation.

Dahua

unread,
Feb 8, 2013, 1:14:17 PM2/8/13
to juli...@googlegroups.com
Perhaps there are places which can be further improved using BLAS level-1 routines and linear indexing. Will review the codes again to see.

Tim Holy

unread,
Feb 8, 2013, 1:22:01 PM2/8/13
to juli...@googlegroups.com
On Friday, February 08, 2013 10:10:39 AM Dahua wrote:
> I actually talked about a little bit above (in my reply to Diego above) on
> how I make it possible to apply GEMM for pairwise Euclidean distance
> computation.

Caught me between email downloads. Sorry about that!

I didn't read as far into your code as I should have, indeed the GEMM trick is
great.

--Tim

Diego Javier Zea

unread,
Feb 8, 2013, 1:57:49 PM2/8/13
to juli...@googlegroups.com
Thanks Tim & Dahua for the answer and information :)
Reply all
Reply to author
Forward
0 new messages