MLBase.jl package for machine learning


Dahua

Jun 16, 2013, 9:16:40 PM6/16/13
to juli...@googlegroups.com
This package has been in METADATA for quite a while and serves as the basis for several other packages (e.g. Clustering.jl). I recently reorganized it and added a number of new features. I feel it would be useful to announce it here so that people don't have to reinvent the wheel.

I consider this package an extension of the Julia base, focusing on efficient implementations of functions commonly used in machine learning (a brief sketch follows the list):

* In-place vector arithmetic
* Broadcasting matrix/vector arithmetic
* Efficient column-wise and row-wise reduction
* Column-wise and row-wise norms and normalization
* Integer-related statistics
* Computations with positive definite matrices
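As a small example of the in-place arithmetic, using the add! function that also comes up later in this thread (a minimal sketch; the exact signature shown is illustrative):

using MLBase

x = rand(100)
y = rand(100)

# add y into x without allocating a result vector
# (taken here to mutate x in place and return it)
add!(x, y)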

Features that I think are worth highlighting (a short sketch follows the list):

* Column-wise and row-wise reduction functions are implemented with special attention to efficiency. For example, the package provides a function ``vsqsum`` to compute column-wise or row-wise sums of squares. Benchmarks show that ``vsqsum(x, 2)`` is nearly 10x faster than ``sum(abs2(x), 2)``. A table of detailed benchmarks is given in the project readme.

* It defines three positive definite matrix types (PDMat, PDiagMat, and ScalMat) representing full positive definite matrices, positive diagonal matrices, and matrices of the form s * eye(d). Specialized methods that exploit the structure of each type are implemented behind a uniform interface. This provides a generic framework for writing machine learning algorithms that use positive definite matrices (e.g. Gaussian models) while ensuring that the most efficient implementation is used in the actual computation.
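A short sketch illustrating both features (constructors as named above; the quad method used for the quadratic form is an assumed part of the uniform interface):

using MLBase

x = rand(1000, 1000)

# row-wise sums of squares in one pass, with no temporary for abs2(x)
r = vsqsum(x, 2)

d = 3
A = rand(d, d)
C1 = PDMat(A' * A + eye(d))    # full positive definite matrix
C2 = PDiagMat(rand(d) + 1.0)   # positive diagonal matrix
C3 = ScalMat(d, 2.0)           # 2.0 * eye(d)

v = rand(d)
for C in (C1, C2, C3)
    # quadratic form v' * C * v, dispatched to the method
    # specialized for each matrix type (quad is assumed here)
    println(quad(C, v))
end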


The package is registered in METADATA; you can install it with Pkg.add("MLBase").

John Myles White

Jun 17, 2013, 11:40:15 AM6/17/13
to juli...@googlegroups.com
I think a good chunk of this material should be in Base, especially basic things like add!.

 -- John

Dahua

Jun 17, 2013, 11:54:04 AM6/17/13
to juli...@googlegroups.com
I developed MLBase.jl as a transitional layer between the Julia base and machine learning algorithms (and other code).

In writing many such algorithms, I repeatedly found myself rewriting basic computational routines, such as computing the sum of squares for each row or column. One can, of course, write

sum(abs2(x), 2)

This is short enough, but as you can see from my benchmarks, it is nearly 8x-10x slower than the devectorized version implemented in MLBase. So I created this package to host these supporting routines.
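For reference, the devectorized version is essentially a single-pass loop of the following form (a simplified sketch, not the exact MLBase implementation):

# row-wise sums of squares (the vector form of sum(abs2(x), 2)),
# computed in one pass over the data with no temporary matrix
function rowwise_sqsum(x::Matrix{Float64})
    m, n = size(x)
    r = zeros(m)
    for j in 1:n       # outer loop over columns: column-major friendly
        for i in 1:m
            r[i] += abs2(x[i, j])
        end
    end
    return r
end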

John, you are right that some of these things could probably go into Julia Base -- that was my thinking: I experiment with things in MLBase first, and once something feels mature enough for Julia Base, I will open a pull request.

Stefan Karpinski

Jun 17, 2013, 12:11:35 PM6/17/13
to Julia Dev
This is great stuff, Dahua. I fully agree – a fair amount of this should migrate into Base.

Simon Kornblith

Jun 17, 2013, 1:17:03 PM6/17/13
to juli...@googlegroups.com
Ideally, reducedim() would be fast enough that the reductions could be implemented as one-liners, but that's clearly not the case at the moment due to lack of inlining for all but a few functions in Base. I've run into similar performance issues in my code, but I'm reducing across 3D arrays rather than matrices, so MLBase.jl sadly doesn't help me.
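For concreteness, the kind of one-liner this refers to (a sketch against the reducedim signature in Base at the time; the array shape and dimensions here are illustrative):

A = rand(32, 32, 512)

# sum over the first two dimensions of a 3D array;
# convenient, but slow while the reducing function is not inlined
r = reducedim(+, A, [1, 2], 0.0)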

Simon

Tim Holy

Jun 17, 2013, 1:48:52 PM6/17/13
to juli...@googlegroups.com
I agree that it's wonderful functionality, and in my own work I've gotten some
good value out of this and other packages of yours. You write great,
performant code, Dahua.

But some food for thought in terms of which operations should migrate to
base and which should stay in packages: some of this may not be as necessary
as it once was, given broadcast and friends. Alternatively, the combination of
array views (hopefully coming soon) and fast cartesian iteration may obviate
some of the need for such a large number of algorithms specialized on
two-dimensional data sets. (E.g., see a trial run in
https://github.com/JuliaLang/julia/pull/3224,
which got more than a 2x boost even though the inner loop was decidedly
nontrivial; it would have been much higher for a simple inner loop.)

Of course, if views are not coming soon and broadcast doesn't solve all
problems, then it might be motivation to contribute a larger subset of this to
base.

Best,
--Tim
> >> https://github.com/lindahua/MLBase.jl

Dahua

Jun 17, 2013, 2:03:48 PM6/17/13
to juli...@googlegroups.com
I filed issue #3424 for improved in-place vector operations. When this gets sorted out, the vector arithmetic functions here may be deprecated.