I have been working with John on improving machine learning packages.
As part of this effort, I rewrote the k-means algorithm in Clustering.jl.
It is now substantially faster (100x to 200x) than before. In a benchmark on my MacBook Pro, it takes 0.5 seconds to cluster 10,000 samples (each of dimension 100) into 50 clusters, about 0.01 seconds per iteration.
Several key modifications make it this fast:
1. Switch from row-major to column-major storage, which is more cache-friendly with respect to Julia's memory layout (Julia arrays are column-major).
2. Use the Distance.jl package to evaluate pairwise distances; internally it uses BLAS-3 routines for speed.
3. Remember which clusters were affected during re-assignment, so as to reduce the computation in the ensuing updates of centers and distances.
4. Reuse memory carefully, which substantially reduces re-allocation of arrays at each iteration.
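To illustrate point 1, here is a small Python/NumPy sketch (not the actual Julia code; the array and its shape are made up for the example) of why storing samples as columns of a column-major matrix helps: each sample then occupies one contiguous run of memory, which matches how Julia lays out its arrays and makes per-sample scans cache-friendly.

```python
import numpy as np

# Column-major (Fortran-order) storage with samples as columns: each
# sample is then a contiguous block of memory, which is the layout Julia
# uses natively. Scanning one sample touches consecutive addresses.
d, n = 100, 10000
X = np.asfortranarray(np.random.rand(d, n))  # d x n, samples are columns

# A single sample (one column) is a contiguous slice of memory:
col = X[:, 0]
assert col.flags['C_CONTIGUOUS']  # stride-1 access over one sample
```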
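The BLAS-3 idea behind point 2 can be sketched as follows (a Python/NumPy illustration of the underlying identity, not the Distance.jl code; `pairwise_sqdist` is a hypothetical name). Squared Euclidean distances decompose as ||x - c||^2 = ||x||^2 - 2 x'c + ||c||^2, so the dominant cost collapses into a single matrix multiply (GEMM):

```python
import numpy as np

def pairwise_sqdist(X, C):
    """Squared Euclidean distances between the columns of X (d x n)
    and the columns of C (d x k), returned as an n x k matrix.

    Uses the identity ||x - c||^2 = ||x||^2 - 2 x'c + ||c||^2, so the
    dominant cost, X.T @ C, is a single BLAS-3 (GEMM) call instead of
    n*k separate vector-difference computations."""
    x2 = np.sum(X * X, axis=0)                    # ||x||^2, shape (n,)
    c2 = np.sum(C * C, axis=0)                    # ||c||^2, shape (k,)
    D = x2[:, None] - 2.0 * (X.T @ C) + c2[None, :]
    np.maximum(D, 0.0, out=D)                     # clamp round-off negatives
    return D
```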
In terms of functionality, it now supports more options. Refer to the README of Clustering.jl for details.
There will be updates to several machine learning packages (e.g. Clustering.jl, kNN.jl, Classification.jl, SVM.jl) in the coming weeks.