Hi all,
I'm a grad student in NLP at the University of Washington. I've been using Breeze/Nak for my research, and have been developing some additions within the Nak framework. I'd like to contribute that work back to the project, and I'm also interested in contributing to the core development work of making Nak a well-rounded ML library. First I'll give a little update on what I've done so far, and then get to the more important point of the post, which is to continue a discussion on how to make Nak rad(der).
I've been doing some clean-up and updating work in passing, plus a few minor modifications and additions. I currently have one branch with these changes. It's a little messy, since it covers a variety of different things :( and it needs a bit of clean-up before a PR, but basically the changes in this branch are:
* updated to Breeze 0.8, SBT 0.13.5, scallop 0.9.5, Scala 2.11.1, scala-logging 2.1.2, scalatest 2.2.0
* added an implementation of kNN (with test)
* added some distance metrics as UFunc implicits
* updated the generic type argument of DataMatrix for labels and changed features to DenseVectors, so instead of representing a Seq[Example[Int,Seq[Double]]], it represents a Seq[Example[L,DenseVector[Double]]].
*** This seemed reasonable to me, but honestly DataMatrix should probably be brought more in line with an R-style DataFrame. Discussion on this point below.
* Added AveragedPerceptronTrainer to Perceptron to average learned weights, since this is often better. (includes test)
* Added Iris dataset for tests
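For anyone who wants a feel for the kNN and distance-metric items without pulling the branch, here's a minimal standalone sketch. To be clear, this is not the code in my branch: it uses plain Scala `Vector[Double]` instead of Breeze's DenseVector, plain functions instead of UFunc implicits, and the `Example` case class and `KnnSketch` object are hypothetical stand-ins, so treat it as an illustration of the idea only.

```scala
// Hypothetical stand-in for an Example[L, T] type; not Nak's actual class.
final case class Example[L, T](label: L, features: T)

object KnnSketch {
  // Plain Scala vectors are used here so the sketch has no Breeze dependency.
  type Vec = Vector[Double]
  type Distance = (Vec, Vec) => Double

  // Euclidean (L2) distance between two vectors of equal length.
  val euclidean: Distance = (a, b) =>
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Manhattan (L1) distance between two vectors of equal length.
  val manhattan: Distance = (a, b) =>
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

  /** Majority vote among the k training examples closest to `query`.
    * Ties between labels are broken arbitrarily (by map iteration order).
    */
  def classify[L](train: Seq[Example[L, Vec]], k: Int, dist: Distance)(query: Vec): L = {
    require(train.nonEmpty && k > 0, "need at least one example and k > 0")
    train
      .sortBy(ex => dist(ex.features, query)) // nearest first
      .take(k)                                // keep the k nearest neighbours
      .groupBy(_.label)                       // bucket neighbours by label
      .maxBy { case (_, members) => members.size }
      ._1
  }
}
```

The label type is generic and the distance is passed in as a value, which is the same shape as the Seq[Example[L,DenseVector[Double]]] change above: something like `KnnSketch.classify(train, 3, KnnSketch.euclidean)(query)` works for any label type `L`.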
I have another branch that has an implementation of Neighborhood Components Analysis (NCA) for kNN on DenseVectors.
I'll submit these as PRs at some point, but as I mentioned, I need to clean them up a bit first and wanted to get some feedback on whether any of the random changes in the first branch should be removed.
You can check out the branches on my GitHub at
https://github.com/gabeos/nak/tree/devel and
https://github.com/gabeos/nak/tree/feature/NCA respectively
Anyway, that's just a little update on what I've been doing. The main point I wanted to bring to the list is nailing down some immediate directions, and hopefully generating interest in contributions from more folks on the list. Below I've given some smaller chunks for discussion, basically the things I noticed while developing in Nak that could be unified within the current framework. Maybe we can break them out into individual posts if needed. I think there's also a larger discussion of how Nak can unify around one or a small set of goals/concepts to become a really interesting, easy-to-use, and useful library for ML in Scala. I'm not sure it needs to be a one-size-fits-all solution for ML, but I think there are a few different directions the library could go that would be really great, so I've also listed a few questions that I think would be good to consider.
Improvements:
1) Liblinear/Breeze-ML integration.
A pretty obvious first step, which has already been discussed on this board, is integrating the two classification hierarchies currently in Nak. Liblinear has its own hierarchy, and integrating it into the Breeze-ML hierarchy would be a great start towards making Nak more unified. I think the Breeze-ML hierarchy is a bit more general and better integrated with Breeze, so my guess is that the Liblinear classifier should be subsumed into that hierarchy. That said, I think bringing over some things, like the FeaturizedClassifier, could be useful.
2) Improving Data representation
Some basic improvements I can see happening quickly would be:
* rewrite DataMatrix as a fully-featured DataFrame, building on Breeze's data structures to get good slicing, sparsity, vector operations, etc.
* refactor current Datasets object into stats package and integrate stats with classifiers so it's easy to run different types of tests
Unification:
1) Approach to classification pipeline i.e. Framework vs Solution
Basically, the question is whether the project should focus on providing easy access to common ML algorithms for people with data who need "Solutions", or whether it should gear itself towards providing a framework for specializing ML algorithms and writing new ones. This may be somewhat of a false dichotomy, but I think the two considerations point dev effort in somewhat different directions. It seems that right now the library is geared a little more towards solutions, since it primarily offers a set of algorithms. I would lean towards the latter: making the library a framework that offers excellent data structures and class hierarchies for rolling your own ML algorithms, plus a solid set of baseline implementations.
2) Should there be a focus on a specific framing of ML problems, a la factorie? What would that be?
3) ...? There's plenty more to discuss, but I'd prefer to leave the discussion a bit open ended, and honestly, I don't think I have a clear picture of what all of the options are, so I'd love to hear more questions :)
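To make the "framework" option in point 1 a bit more concrete, here's a rough sketch of what I mean by small class hierarchies plus baseline implementations. All of the names below (`Classifier`, `Trainer`, `FrameworkSketch`) are hypothetical and not taken from Nak or Breeze, and the averaged perceptron here is a simplified binary version written for this post, not the AveragedPerceptronTrainer in my branch.

```scala
// Hypothetical minimal hierarchy: a trained model and the thing that produces it.
trait Classifier[L, T] { def classify(features: T): L }
trait Trainer[L, T] { def train(data: Seq[(L, T)]): Classifier[L, T] }

object FrameworkSketch {
  // Plain Scala vectors stand in for Breeze DenseVectors to keep this self-contained.
  type Vec = Vector[Double]

  private def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Baseline implementation: binary averaged perceptron (labels are true/false). */
  final class AveragedPerceptronTrainer(epochs: Int) extends Trainer[Boolean, Vec] {
    def train(data: Seq[(Boolean, Vec)]): Classifier[Boolean, Vec] = {
      require(data.nonEmpty && epochs > 0, "need data and at least one epoch")
      val dim = data.head._2.length
      var w = Vector.fill(dim)(0.0)    // current weights
      var b = 0.0                      // current bias
      var wSum = Vector.fill(dim)(0.0) // running sum of weights over all steps
      var bSum = 0.0                   // running sum of bias over all steps
      var steps = 0
      for (_ <- 1 to epochs; (label, x) <- data) {
        val y = if (label) 1.0 else -1.0
        if (y * (dot(w, x) + b) <= 0.0) { // mistake: standard perceptron update
          w = w.zip(x).map { case (wi, xi) => wi + y * xi }
          b += y
        }
        // Accumulate after every example so the average covers all steps.
        wSum = wSum.zip(w).map { case (s, wi) => s + wi }
        bSum += b
        steps += 1
      }
      val wAvg = wSum.map(_ / steps)
      val bAvg = bSum / steps
      new Classifier[Boolean, Vec] {
        def classify(features: Vec): Boolean = dot(wAvg, features) + bAvg > 0.0
      }
    }
  }
}
```

The idea is that someone writing a new algorithm only implements `Trainer`, and anything written against `Classifier` (evaluation, cross-validation, etc.) works with it unchanged, while people who just want a "Solution" can grab the baseline as-is.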
Anyway, sorry for the long post. I've been meaning to write this up for a while, and things kept coming up. I'm excited to be working with Nak and I look forward to moving the library forward!