Moving Nak Forward (Cont..)


Gabriel Schubiner

Jul 3, 2014, 7:05:40 PM
to scalanlp...@googlegroups.com
Hi all,

I'm a grad student at University of Washington in NLP. I've been using Breeze/Nak for my research, and have been developing some additions within the Nak framework. I'd like to contribute that work back to the project, and am interested in contributing to base development work of making Nak a well-rounded ML library as well. First I'll give a little update on what I've done so far, and then get to the more important point of the post, which is to continue a discussion on how to make Nak rad(der).

I've been doing some clean-up and updating work in passing, and a few minor modifications and additions. I currently have one branch with these changes (which is a little messy, since it covers a variety of different things.. :(.., and needs a bit of clean-up before PR) but basically the changes in this branch are:
* updated to breeze 0.8, SBT 0.13.5, scallop 0.9.5, Scala 2.11.1, scala-logging 2.1.2, scalatest 2.2.0
* added an implementation of kNN (with test)
* added some distance metrics as UFunc implicits
* made DataMatrix generic in the label type and changed features to DenseVectors, so instead of representing a Seq[Example[Int,Seq[Double]]], it is a Seq[Example[L,DenseVector[Double]]].
*** This seemed reasonable to me, but honestly DataMatrix should probably be brought more in line with an R-style DataFrame. Discussion on this point below.
* Added AveragedPerceptronTrainer to Perceptron to average learned weights, since this is often better. (includes test)
* Added Iris dataset for tests
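
For anyone following along, the weight averaging mentioned in the AveragedPerceptronTrainer bullet above can be sketched roughly as follows. This is a minimal illustration in plain Scala on arrays, not the code from the branch: the object name `AveragedPerceptronSketch`, the `train`/`predict` signatures, and the +1/-1 label encoding are all assumptions made for the example.

```scala
// Hypothetical sketch of an averaged binary perceptron (not Nak's API).
object AveragedPerceptronSketch {

  /** Dot product of two equal-length arrays. */
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  /** Runs the perceptron update over `data` for `epochs` passes and returns
    * the average of the weight vector taken after every example.
    * Labels are assumed to be +1.0 or -1.0. */
  def train(data: Seq[(Array[Double], Double)], epochs: Int): Array[Double] = {
    val dim = data.head._1.length
    val w = Array.fill(dim)(0.0)    // current weights
    val sum = Array.fill(dim)(0.0)  // running sum of weights over all steps
    var steps = 0
    for (_ <- 0 until epochs; (x, y) <- data) {
      // Standard perceptron rule: update only on a mistake (or zero score).
      if (y * dot(w, x) <= 0.0) {
        for (i <- 0 until dim) w(i) += y * x(i)
      }
      for (i <- 0 until dim) sum(i) += w(i)
      steps += 1
    }
    sum.map(_ / steps)
  }

  /** Predicts +1.0 or -1.0 from the sign of the score. */
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0.0) 1.0 else -1.0
}
```

The averaged vector tends to be less sensitive to the last few updates than the final weight vector, which is the usual motivation for averaging.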

I have another branch that has an implementation of Neighborhood Components Analysis (NCA) for kNN on DenseVectors.

I'll submit these as PRs at some point, but as I mentioned, I need to clean them up a bit first and wanted to get some feedback on whether any of the random changes in the first branch should be removed.
You can check out the branches on my github @ https://github.com/gabeos/nak/tree/devel and https://github.com/gabeos/nak/tree/feature/NCA respectively


Anyway, that's just a little update on what I've been doing. Really the main point I wanted to bring to the list for discussion is nailing down some immediate directions, and hopefully generating some interest in contributions from some more folks on the list. Below are some smaller chunks of thought for discussion, basically the things I noticed when developing in Nak that could be unified within the current framework. Maybe we can break them out into individual posts if needed. I think there's a larger discussion of how Nak can unify around one or a small set of goals/concepts to become a really interesting, easy-to-use, and useful library for ML in Scala. I'm not sure it needs to be a one-size-fits-all solution for ML, but I think there are a few different directions the library could go that would be really great, so I've also listed a few questions that I think would be good to consider in this discussion.

Improvements:
1) Liblinear/Breeze-ML integration.
    A pretty obvious first step, which has already been discussed on this board, is integrating the two classification hierarchies currently in Nak. Liblinear has its own hierarchy, and integrating it into the Breeze-ML hierarchy would be a great start towards making Nak more unified. I think the Breeze-ML hierarchy is a bit more general and better integrated with Breeze, so my guess is that the Liblinear classifier should be subsumed into that hierarchy; however, I think bringing over some things, like the FeaturizedClassifier, could be useful.

2) Improving Data representation
    Some basic improvements I can see happening quickly would be:
    * rewrite DataMatrix as a fully-featured DataFrame, building on Breeze's data structures to get good slicing, sparsity, vector operations, etc.
    * refactor current Datasets object into stats package and integrate stats with classifiers so it's easy to run different types of tests

Unification:

1) Approach to classification pipeline i.e. Framework vs Solution
    Basically, whether the project should focus on providing easy access to common ML algorithms for people with data that need "Solutions", or whether it should gear itself towards providing a framework for specializing ML algorithms and writing new ones. This may be somewhat of a false dichotomy, but I think the two considerations offer somewhat different directions to focus dev effort. It seems that right now the library is geared a little bit more towards solutions since it primarily offers the set of algorithms. I would tend to lean towards the latter, making the library a framework offering excellent data structures and class hierarchies to roll your own ML algorithms and a solid set of baseline implementations.

2) Should there be a focus on a specific framing of ML problems, a la factorie? What would that be?

3) ...? There's plenty more to discuss, but I'd prefer to leave the discussion a bit open ended, and honestly, I don't think I have a clear picture of what all of the options are, so I'd love to hear more questions :)

Anyway, sorry for the long post, I've been meaning to write this up for a while and things kept coming in.. I'm excited to be working with Nak and I look forward to moving the library forward!

David Hall

Jul 3, 2014, 11:30:12 PM
to scalanlp...@googlegroups.com
On Thu, Jul 3, 2014 at 6:05 PM, Gabriel Schubiner <gab...@cs.washington.edu> wrote:
Hi all,

I'm a grad student at University of Washington in NLP. I've been using Breeze/Nak for my research, and have been developing some additions within the Nak framework. I'd like to contribute that work back to the project, and am interested in contributing to base development work of making Nak a well-rounded ML library as well. First I'll give a little update on what I've done so far, and then get to the more important point of the post, which is to continue a discussion on how to make Nak rad(der).

Great!
 

I've been doing some clean-up and updating work in passing, and a few minor modifications and additions. I currently have one branch with these changes (which is a little messy, since it covers a variety of different things.. :(.., and needs a bit of clean-up before PR) but basically the changes in this branch are:
* updated to breeze 0.8, SBT 0.13.5, scallop 0.9.5, Scala 2.11.1, scala-logging 2.1.2, scalatest 2.2.0
* added an implementation of kNN (with test)
* added some distance metrics as UFunc implicits
* made DataMatrix generic in the label type and changed features to DenseVectors, so instead of representing a Seq[Example[Int,Seq[Double]]], it is a Seq[Example[L,DenseVector[Double]]].
*** This seemed reasonable to me, but honestly DataMatrix should probably be brought more in line with an R-style DataFrame. Discussion on this point below.

agreed.
 
* Added AveragedPerceptronTrainer to Perceptron to average learned weights, since this is often better. (includes test)

Possibly we should have something in StochasticGradient in Breeze to support this as well.
 
* Added Iris dataset for tests

I have another branch that has an implementation of Neighborhood Components Analysis (NCA) for kNN on DenseVectors.

I'll submit these as PRs at some point, but as I mentioned, I need to clean them up a bit first and wanted to get some feedback on whether any of the random changes in the first branch should be removed.
You can check out the branches on my github @ https://github.com/gabeos/nak/tree/devel and https://github.com/gabeos/nak/tree/feature/NCA respectively

Cool! I'll take a look later.
 


Anyway, that's just a little update on what I've been doing. Really the main point I wanted to bring to the list for discussion is nailing down some immediate directions, and hopefully generating some interest in contributions from some more folks on the list. Below are some smaller chunks of thought for discussion, basically the things I noticed when developing in Nak that could be unified within the current framework. Maybe we can break them out into individual posts if needed. I think there's a larger discussion of how Nak can unify around one or a small set of goals/concepts to become a really interesting, easy-to-use, and useful library for ML in Scala. I'm not sure it needs to be a one-size-fits-all solution for ML, but I think there are a few different directions the library could go that would be really great, so I've also listed a few questions that I think would be good to consider in this discussion.

Improvements:
1) Liblinear/Breeze-ML integration.
    A pretty obvious first step, which has already been discussed on this board, is integrating the two classification hierarchies currently in Nak. Liblinear has its own hierarchy, and integrating it into the Breeze-ML hierarchy would be a great start towards making Nak more unified. I think the Breeze-ML hierarchy is a bit more general and better integrated with Breeze, so my guess is that the Liblinear classifier should be subsumed into that hierarchy; however, I think bringing over some things, like the FeaturizedClassifier, could be useful.

That sounds good.  At this point you probably know what's going on in the Nak package better than I do. I do agree ideally that any hierarchy that we design should have liblinear as just one of several possible backends, even if in practice we only provide support for it out-of-the-box.

2) Improving Data representation
    Some basic improvements I can see happening quickly would be:
    * rewrite DataMatrix as a fully-featured DataFrame, building on Breeze's data structures to get good slicing, sparsity, vector operations, etc.

Agreed. Have you seen saddle? http://saddle.github.io/ 

Development is basically stalled from what I can tell, but they were trying to do something like a data frame. It's probably worth seeing which ideas are worth copying in breeze.
 
    * refactor current Datasets object into stats package and integrate stats with classifiers so it's easy to run different types of tests

Sounds good. Any pure statistical testing (i.e. not machine learning specific) should probably be available in breeze, to the extent that it is possible.
 

Unification:

1) Approach to classification pipeline i.e. Framework vs Solution
    Basically, whether the project should focus on providing easy access to common ML algorithms for people with data that need "Solutions", or whether it should gear itself towards providing a framework for specializing ML algorithms and writing new ones. This may be somewhat of a false dichotomy, but I think the two considerations offer somewhat different directions to focus dev effort. It seems that right now the library is geared a little bit more towards solutions since it primarily offers the set of algorithms. I would tend to lean towards the latter, making the library a framework offering excellent data structures and class hierarchies to roll your own ML algorithms and a solid set of baseline implementations.

I think that makes sense. That said, I think it's important to not step on factorie's toes; we should be careful to strike that balance.
 

2) Should there be a focus on a specific framing of ML problems, a la factorie? What would that be?

The one thing I don't want is a bag of algorithms. I guess that is consistent with the idea of being a framework and not a solution. I think I mentioned off-list that I really like the R glm-style model specification. Basically, I'd like to be more declarative/functional, compared to factorie's imperative style. I'm of course open to other ideas. There's always something like trying to replicate infer.net.
 

3) ...? There's plenty more to discuss, but I'd prefer to leave the discussion a bit open ended, and honestly, I don't think I have a clear picture of what all of the options are, so I'd love to hear more questions :)

Anyway, sorry for the long post, I've been meaning to write this up for a while and things kept coming in.. I'm excited to be working with Nak and I look forward to moving the library forward!

Awesome!

--David 

Jason Baldridge

Jul 4, 2014, 8:18:06 AM
to scalanlp...@googlegroups.com
It's really great to see this! A few comments, which I'll just put here rather than in-lining. (FWIW, we'd probably want separate threads if we want to discuss different things in detail.)

Happy with all the stuff in the first branch -- really appreciate the cleanup and update!

It would be great to unify the two parts of Nak. The reason liblinear is in there is that it was faster than equivalent model training in breeze-ml, so I'd like it to stick around until we have an alternative. But, the liblinear bit is only partly "breezy", so unifying those would make things more coherent.

Regarding framework vs solution, I tend to fall more toward the latter. I like it when a person can come to a toolkit and, based on a simple example, start building their own classifiers in short order. In other words, folks often just need a few straightforward ways to build a few popular models for most things. It would obviously be nice to have both -- and perhaps there is some way of doing that with Nak, e.g. with a nak-core and nak-solution, or something of that nature.

Though I haven't used it yet myself, several developers at my startup are using scikit-learn, and it seems it is well-designed in terms of having good core data and model representations that make it very straightforward to try out several different modeling options with very few code changes. This seems to me like a very useful property, but perhaps that gets into the bag-of-algorithms situation that David would like to avoid.

It's really great that you are taking this on! FWIW, I've been contacted by various folks about contributing to Nak over the past year and hope they are following the list and can chime in to start helping out.

-Jason

--
Jason Baldridge
Associate Professor, Dept. of Linguistics, UT Austin
Co-founder & Chief Scientist, People Pattern
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Elias Ponvert

Jul 4, 2014, 11:42:30 AM
to scalanlp...@googlegroups.com
These all sound like really positive contributions! +1

FWIW I also vote for "solution" -- there are already Factorie and Figaro as ML development frameworks in Scala.

David Hall

Jul 4, 2014, 12:39:17 PM
to scalanlp...@googlegroups.com
On Fri, Jul 4, 2014 at 7:18 AM, Jason Baldridge <jasonba...@gmail.com> wrote:
Though I haven't used it yet myself, several developers at my startup are using scikit-learn, and it seems it is well-designed in terms of having good core data and model representations that make it very straightforward to try out several different modeling options with very few code changes. This seems to me like a very useful property, but perhaps that gets into the bag-of-algorithms situation that David would like to avoid.

I definitely think there should be some out-of-the-box solutions, but I'd rather not have an architecture that consists of a bunch of free-standing implementations. Since I'm unlikely to build it, I guess it doesn't matter quite as much.

IMHO, supervised ML is best thought of as consisting of a feature function (which might factorize for structured prediction), an optional transform of that feature function (e.g. neural nets, decision trees), a loss function, and a regularizer. With standard (sub)gradient optimization, you can reproduce pretty much every classification algorithm[1]. We should have something that is just logistic regression (identity feature function, no transform, log loss, and L2 regularization), and SVMs (same, but hinge loss).

-- David

[1] Kernels don't quite fall into this framework, but honestly I don't have much love for kernels. I guess you could do those Cholesky decomposition things and turn the kernel matrix into a codebook, or something.
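
To make the decomposition above concrete, here is a rough sketch of how a feature function, a loss, and an L2 regularizer might plug into one stochastic (sub)gradient loop, so that logistic regression and a linear SVM differ only in the loss object passed in. Everything here is an assumption for illustration: the names `LinearModelSketch`, `Loss`, `LogLoss`, `HingeLoss`, and `train` are hypothetical and are not part of Nak or Breeze, labels are assumed to be +1/-1, and the optional transform step is left out.

```scala
// Hypothetical sketch: one SGD trainer, pluggable loss (not Nak/Breeze API).
object LinearModelSketch {
  type Vec = Array[Double]

  /** A margin-based loss on m = y * score, with its (sub)derivative in m. */
  trait Loss {
    def value(m: Double): Double
    def dMargin(m: Double): Double
  }

  /** Log loss: log(1 + exp(-m)); gives logistic regression. */
  object LogLoss extends Loss {
    def value(m: Double): Double = math.log1p(math.exp(-m))
    def dMargin(m: Double): Double = -1.0 / (1.0 + math.exp(m))
  }

  /** Hinge loss: max(0, 1 - m); gives a linear SVM. */
  object HingeLoss extends Loss {
    def value(m: Double): Double = math.max(0.0, 1.0 - m)
    def dMargin(m: Double): Double = if (m < 1.0) -1.0 else 0.0
  }

  def dot(a: Vec, b: Vec): Double = a.indices.map(i => a(i) * b(i)).sum

  /** Plain SGD on loss(y * w.f(x)) + (l2 / 2) * ||w||^2, one example at a time.
    * `featurize` defaults to the identity feature function. */
  def train(data: Seq[(Vec, Double)], loss: Loss, l2: Double, rate: Double,
            epochs: Int, featurize: Vec => Vec = x => x): Vec = {
    val feats = data.map { case (x, y) => (featurize(x), y) }
    val w = Array.fill(feats.head._1.length)(0.0)
    for (_ <- 0 until epochs; (f, y) <- feats) {
      // Chain rule: d(loss)/dw = dMargin(m) * y * f, plus the L2 term l2 * w.
      val g = loss.dMargin(y * dot(w, f))
      for (i <- w.indices) w(i) -= rate * (g * y * f(i) + l2 * w(i))
    }
    w
  }
}
```

With this shape, `train(data, LogLoss, ...)` and `train(data, HingeLoss, ...)` share all of the optimization code, which is roughly the property described above.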

Gabriel Schubiner

Jul 5, 2014, 1:58:17 PM
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
Fantastic, thanks for the responses all. I'm thinking that we can break out into a couple threads now, perhaps just two to start, one on the data model, which I think can be relatively separate from the discussion of what the API looks like (which I think basically encompasses the sol'n vs. framework debate.)

I agree with many of the points brought up. I think it'd be great to offer the scikit-learn style ease of access for basic algorithms, but (and this is with a bias here, since I'll be using Nak for my own ML-y research stuff) I'd love to ensure that there is enough flexibility to extend the framework with new implementations and variations easily. I think having breeze as a separate library from Nak will actually encourage this, since the optimization code is necessarily decoupled from the algorithms, etc.

Okay, on to the new threads..