Continuing break out discussion from https://groups.google.com/forum/#!topic/scalanlp-discuss/51_T6N7FVpI
For the data representation, I think one of the great things Nak could offer would be an easy way to implement feature extraction for novel problem sets. I wrote an implementation of feature extraction using stacked traits, which worked out well but was also a bit fragile. A purely functional style would work too, but might be more difficult to optimize for global features. If we can knock out a good API, I think it would also help Chalk a lot (although I'm not sure by whom or how regularly that is being developed).
The things I see needing immediate attention here are:
1) Data Structures -- Nailing down what the underlying data structures of the ML algorithms will be would help a lot towards standardizing the API and making it easier to extend the library.
Things to think about:
** Flexible representation for datasets. Does a Data Frame model cover this?
** Ensure that the implicits from breeze for the linear algebra data structures that will likely underlie any Nak structures are easily accessible in training methods.
I say this mostly because I had a tough time writing my NCA code generically, and ended up just writing separate Dense and Sparse versions.
** What functionality should these have? I think it'd be great if we could keep the data structures very accessible through functional code without losing too much efficiency
2) Feature extraction pipeline
Things:
** How to define extractors: two options off the top of my head are stacked traits, or plain functions that can be aggregated (registered and cached); a rough sketch of the function-based option follows this list.
** How featurization should be called: explicitly, or more implicitly as a part of classifier training/testing?
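As a rough sketch of the second option (everything here is illustrative, not an existing Nak API), extractors could just be functions that produce feature maps, with aggregation, registration, and caching layered on by ordinary function composition:

```scala
// Sketch only: featurizers as plain functions (hypothetical names, REPL-style).
type Featurizer[D] = D => Map[String, Double]

// Aggregate any number of featurizers by summing their feature maps.
def aggregate[D](fs: Seq[Featurizer[D]]): Featurizer[D] =
  d => fs.foldLeft(Map.empty[String, Double]) { (acc, f) =>
    f(d).foldLeft(acc) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0.0) + v) }
  }

// A "registry" is just a collection of featurizers that extractors get added to...
val registry = scala.collection.mutable.ArrayBuffer.empty[Featurizer[String]]
registry += (w => Map(s"word=$w" -> 1.0))
registry += (w => Map(s"suffix=${w.takeRight(2)}" -> 1.0))

// ...and caching is another function wrapper.
def cached[D](f: Featurizer[D]): Featurizer[D] = {
  val cache = scala.collection.mutable.HashMap.empty[D, Map[String, Double]]
  d => cache.getOrElseUpdate(d, f(d))
}

val extract = cached(aggregate(registry.toSeq))
extract("dogs")  // Map(word=dogs -> 1.0, suffix=gs -> 1.0)
```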
On Sat, Jul 5, 2014 at 1:09 PM, Gabriel Schubiner <gab...@cs.washington.edu> wrote:
Continuing break out discussion from https://groups.google.com/forum/#!topic/scalanlp-discuss/51_T6N7FVpI
For the data representation, I think one of the great things Nak could offer would be an easy way to implement feature extraction for novel problem sets. I wrote an implementation of feature extraction using stacked traits, which worked out well but was also a bit fragile. A purely functional style would work too, but might be more difficult to optimize for global features. If we can knock out a good API, I think it would also help Chalk a lot (although I'm not sure by whom or how regularly that is being developed).
Sounds good. FWIW, Chalk is dead. It's been moved into epic (https://github.com/dlwh/epic), though I haven't updated Chalk to that effect yet.

I have not had good experience with anything that looks like the cake pattern. Regular old composition is almost always better. I don't see your work lying around in your nak branch.
The things I see needing immediate attention here are:
1) Data Structures -- Nailing down what the underlying data structures of the ML algorithms will be would help a lot towards standardizing the API and making it easier to extend the library.
Things to think about:
** Flexible representation for datasets. Does a Data Frame model cover this?

For unstructured classification, I think this more or less works. I don't know that it makes as much sense when we're doing structured prediction, but that's ok, I think.

** Ensure that the implicits from breeze for the linear algebra data structures that will likely underlie any Nak structures are easily accessible in training methods.
I say this mostly because I had a tough time writing my NCA code generically, and ended up just writing separate Dense and Sparse versions.

Hrm, I spent a lot of time trying to make sure that it's reasonably easy to do this (mostly via the InnerProductSpace-type implicits). I don't see many references to Sparse in your NCA branch, but I'd be happy to help work through that. In particular, the Adagrad stochastic gradient implementations should work with arbitrary Vector types.
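For what it's worth, a minimal sketch of what "write it once, generically" can look like; this uses the low-level Breeze operator type classes directly (the InnerProductSpace-style bundles package several of these together), and the function itself is illustrative rather than Nak code:

```scala
import breeze.linalg.{DenseVector, SparseVector}
import breeze.linalg.operators.{OpMulInner, OpSub}

// One generic squared Euclidean distance, usable for any vector type V for
// which Breeze can supply subtraction and an inner product.
def squaredDist[V](x: V, y: V)(implicit
    sub: OpSub.Impl2[V, V, V],
    dot: OpMulInner.Impl2[V, V, Double]): Double = {
  val d = sub(x, y)  // x - y
  dot(d, d)          // (x - y) dot (x - y)
}

squaredDist(DenseVector(1.0, 2.0), DenseVector(3.0, 4.0))  // dense: 8.0

val a = SparseVector.zeros[Double](3); a(0) = 1.0
val b = SparseVector.zeros[Double](3); b(2) = 2.0
squaredDist(a, b)                                          // sparse: 5.0
```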
** What functionality should these have? I think it'd be great if we could keep the data structures very accessible through functional code without losing too much efficiency
2) Feature extraction pipeline
Things:
** How to define extractors: two options off the top of my head are stacked traits, or plain functions that can be aggregated (registered and cached).

So, what we've done in Epic is to define a DSL object that provides "base" featurizers (e.g. identity, suffixes, prefixes) and then combinators, either in the form of + or *, or as offset-calculators that turn, e.g., "identity" into "identity of the previous word" via the (-1) affix. Individual featurizers are responsible for caching. At the end of the day, in Epic featurizers produce an Array[Feature], though there's no reason it couldn't be a counter or something.

As an example of what it looks like, this is the set of features for the POS tagger we use: https://github.com/dlwh/epic/blob/master/src/main/scala/epic/features/WordFeaturizer.scala#L31 There are some quirks in the implementation that I'd like to fix eventually.

What I'd really like is to be able to write this in an R-glm style, where you say something like tag ~ word + suffixes(-1), (tag(-1) * tag) ~ word + word(-1), and it will automatically give you a featurizer for a CRF. Or something. (A toy sketch of this combinator style follows at the end of this message.)
** How featurization should be called: explicitly, or more implicitly as a part of classifier training/testing?
I think I like implicitly more. (Of course, there's an extent to which some preprocessing should be done explicitly, e.g. tokenization, or reading in the file, if nothing else)
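For illustration only (none of the names below are Epic's actual featurizer API), a toy Scala sketch of the combinator style described above: a couple of "base" featurizers, + for concatenation, and a (-1)-style offset that re-targets a featurizer at a neighboring word:

```scala
// Toy sketch: hypothetical names, not Epic's real WordFeaturizer machinery.
trait Feature
case class WordFeature(w: String) extends Feature
case class SuffixFeature(s: String) extends Feature

// A featurizer inspects one position of a sentence and emits features.
trait Featurizer { self =>
  def apply(words: IndexedSeq[String], pos: Int): Array[Feature]

  // `+` concatenates the output of two featurizers.
  def +(other: Featurizer): Featurizer = new Featurizer {
    def apply(words: IndexedSeq[String], pos: Int): Array[Feature] =
      self(words, pos) ++ other(words, pos)
  }

  // Offset combinator: this featurizer, applied k positions away, e.g. word(-1).
  def apply(k: Int): Featurizer = new Featurizer {
    def apply(words: IndexedSeq[String], pos: Int): Array[Feature] = {
      val i = pos + k
      if (i >= 0 && i < words.length) self(words, i) else Array.empty[Feature]
    }
  }
}

// "Base" featurizers. (A real version would also encode the offset into the
// feature identity so that word and word(-1) produce distinct features.)
val word: Featurizer = new Featurizer {
  def apply(words: IndexedSeq[String], pos: Int) = Array[Feature](WordFeature(words(pos)))
}
val suffixes: Featurizer = new Featurizer {
  def apply(words: IndexedSeq[String], pos: Int) =
    (1 to 3).map(n => SuffixFeature(words(pos).takeRight(n)): Feature).toArray
}

// Loosely in the spirit of `word + word(-1) + suffixes(-1)`:
val posFeatures = word + word(-1) + suffixes(-1)
posFeatures(IndexedSeq("the", "dogs", "barked"), 1)
```

The R-glm-style formula could then just be sugar that expands into a composition like the last line, plus label templates for the CRF case.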
On Saturday, July 5, 2014 3:23:44 PM UTC-7, David Hall wrote:

On Sat, Jul 5, 2014 at 1:09 PM, Gabriel Schubiner <gab...@cs.washington.edu> wrote:
I have not had good experience with anything that looks like the cake pattern. Regular old composition is almost always better. I don't see your work lying around in your nak branch.
Yeah, that code is all in my private research repository, but basically there's a FeatureSet class that handles caching and is composed with any number of mixin traits that each take in a datum and give back an Array[Double], which are all concatenated in the base mixin trait. I mostly did it this way to play around with and understand self-typing and cake, but it's actually kind of nice, although it would require anyone who wants to add a feature to use abstract override, which might confuse some people, and it requires the traits to be mixed in statically, rather than using a registry.
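To make that concrete, here is a rough reconstruction of the pattern (the feature templates are made up; this is not the actual research code):

```scala
// Rough reconstruction of the stacked-trait idea (illustrative names only).
trait FeatureSet[Datum] {
  def features(d: Datum): Array[Double]
}

// Base implementation: contributes nothing; stacked traits concatenate onto it.
trait BaseFeatures[Datum] extends FeatureSet[Datum] {
  def features(d: Datum): Array[Double] = Array.empty[Double]
}

// Each feature template is a stackable trait using abstract override.
trait LengthFeatures extends FeatureSet[String] {
  abstract override def features(d: String): Array[Double] =
    super.features(d) ++ Array(d.length.toDouble)
}

trait VowelFeatures extends FeatureSet[String] {
  abstract override def features(d: String): Array[Double] =
    super.features(d) ++ Array(d.count(c => "aeiou".indexOf(c) >= 0).toDouble)
}

// Templates are chosen statically, at mixin time -- the limitation noted above.
object WordFeatures extends BaseFeatures[String] with LengthFeatures with VowelFeatures

WordFeatures.features("banana")  // Array(6.0, 3.0)
```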
** Flexible representation for datasets. Does a Data Frame model cover this?

For unstructured classification, I think this more or less works. I don't know that it makes as much sense when we're doing structured prediction, but that's ok, I think.

** Ensure that the implicits from breeze for the linear algebra data structures that will likely underlie any Nak structures are easily accessible in training methods.
I say this mostly because I had a tough time writing my NCA code generically, and ended up just writing separate Dense and Sparse versions.

Hrm, I spent a lot of time trying to make sure that it's reasonably easy to do this (mostly via the InnerProductSpace-type implicits). I don't see many references to Sparse in your NCA branch, but I'd be happy to help work through that. In particular, the Adagrad stochastic gradient implementations should work with arbitrary Vector types.

Yeah, I think it was partially that I didn't totally grok the implicits implementation details quite yet, but I just pushed some initial attempts at a Sparse implementation. It'd be great to combine the dense and sparse trainers, but I don't know how to get the right implicits in scope, since the generic types don't specify that the feature vector has to be a sparse vector or a dense vector.
Right now I'm having the problem that even with the separate implementations, the types are aggregated into one generic type in NCA, which uses the decomposedMahalanobis distance, and that requires either a sparse vector or a dense vector. Perhaps it could be solved by requiring some Can...'s, but really the issue is that there has to be a way to restrict the feature vector type to one of the two supported vectors, and I'm not quite sure how to do that. Any suggestions would be lovely :)
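One possible shape for the "requiring some Can...'s" idea, purely as a sketch with made-up names: a small evidence type class that is only instantiated for the two supported vector types, so the generic NCA code can demand it as a constraint:

```scala
import breeze.linalg.{DenseMatrix, DenseVector, SparseVector}

// Sketch only: a hypothetical evidence type class restricting the feature-vector
// type T to representations the decomposed Mahalanobis distance can handle.
trait CanDecomposedMahalanobis[T] {
  def distance(A: DenseMatrix[Double], x: T, y: T): Double
}

object CanDecomposedMahalanobis {
  private def sqNorm(v: DenseVector[Double]): Double = v dot v

  implicit val denseEv: CanDecomposedMahalanobis[DenseVector[Double]] =
    new CanDecomposedMahalanobis[DenseVector[Double]] {
      def distance(A: DenseMatrix[Double], x: DenseVector[Double], y: DenseVector[Double]) =
        sqNorm(A * (x - y))
    }

  implicit val sparseEv: CanDecomposedMahalanobis[SparseVector[Double]] =
    new CanDecomposedMahalanobis[SparseVector[Double]] {
      // A real instance would exploit sparsity instead of densifying.
      def distance(A: DenseMatrix[Double], x: SparseVector[Double], y: SparseVector[Double]) =
        sqNorm(A * (x - y).toDenseVector)
    }
}

// Generic code then takes the evidence instead of a concrete vector type:
def nearestUnderMetric[T](A: DenseMatrix[Double], q: T, data: Seq[T])
                         (implicit ev: CanDecomposedMahalanobis[T]): T =
  data.minBy(ev.distance(A, q, _))
```

Only DenseVector[Double] and SparseVector[Double] get instances, so any other feature-vector type fails at compile time at the call site rather than at runtime.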
There's also still some CSCMatrix functionality missing to make the Objective work, namely a MulMatrix and a MulScalar, so I'll probably add those in and submit a PR. Also, just to note, this code also depends on my new reshape implicit, which seems to be working, so I'll submit a PR for that once I clean it up a bit.
** How to define extractors: two options off the top of my head are stacked traits, or plain functions that can be aggregated (registered and cached).

So, what we've done in Epic is to define a DSL object that provides "base" featurizers (e.g. identity, suffixes, prefixes) and then combinators, either in the form of + or *, or as offset-calculators that turn, e.g., "identity" into "identity of the previous word" via the (-1) affix. Individual featurizers are responsible for caching. At the end of the day, in Epic featurizers produce an Array[Feature], though there's no reason it couldn't be a counter or something.

As an example of what it looks like, this is the set of features for the POS tagger we use: https://github.com/dlwh/epic/blob/master/src/main/scala/epic/features/WordFeaturizer.scala#L31 There are some quirks in the implementation that I'd like to fix eventually.

What I'd really like is to be able to write this in an R-glm style, where you say something like tag ~ word + suffixes(-1), (tag(-1) * tag) ~ word + word(-1), and it will automatically give you a featurizer for a CRF. Or something.
I like the idea of doing something like R's glm as well; it seems like it'd be a nice way to reconcile some of the linear models currently in Nak with a feature extraction concept. I'm not super familiar with R, though, so I'll have to take a look at what they're doing.
Another question this brings up, after looking through some of your Epic code, is whether Nak can get itself into shape to the point where Epic would depend on Nak a bit more. I noticed you have implementations of StructurePerceptron and SVM, which could be refactored into a more general form in Nak. Also, the way that Epic is structured seems like it would offer some good guidance on how Nak could/would/should be used, and maybe a goal could be to develop Nak with the intent of making it useful and natural to use in Epic. Not sure how you feel about that, David, since Epic is a great project on its own, and I don't want to compromise anything there. Mostly I was thinking that we could use Epic as a way to inform Nak's development, e.g. making sure that the data model is easily extensible to Slabs: since Slabs should remain in Epic, being NLP specific, it would be great if Nak could offer a framework in which Slabs could be easily coded and used as data in Nak's ML API.