Fwd: [scalanlp] Passing type arguments


David Hall

Mar 3, 2015, 9:31:02 PM
to scalanlp...@googlegroups.com, Simon Hafner

---------- Forwarded message ----------
From: Simon Hafner <hafne...@gmail.com>
Date: Tue, Mar 3, 2015 at 6:19 PM
Subject: [scalanlp] Passing type arguments
To: scalanlp <scal...@googlegroups.com>


The problem: How to pass the input types to an annotator. Possible solutions:

a) Actually don't have different input types. Force everything to be
Sentence, Token, etc.
Drawback: No more object hierarchies.

b) Make the compiler find out the types. That's what I did originally.
Makes the best API.

    val sentenceSegmenter = epic.preprocess.MLSentenceSegmenter.bundled().get
    val tokenizer = epic.preprocess.TreebankTokenizer
    val slabs = documents.map(Slab(_))
      .map(sentenceSegmenter(_))
      .map(tokenizer(_))

Drawback: ~20-30s compile time per pipeline. Unacceptable.

c) Pass the types as arguments to the constructor.

    val sentenceSegmenter = epic.preprocess.MLSentenceSegmenter.bundled().get
    val tokenizer = epic.preprocess.TreebankTokenizer.slab[Sentence]
    val parser = epic.models.ParserSelector.loadParser("en").get
      .slab[Sentence, ContentToken]
    val slabs = documents.map(Slab(_))
      .map(sentenceSegmenter(_))
      .map(tokenizer(_))
      .map(parser(_))

Drawback: Doesn't work for the parser, because the apply for slabs is
via implicit class. Would need to do it explicitly, shouldn't be too
big of a problem.

d) Via apply method.

    val sentenceSegmenter = epic.preprocess.MLSentenceSegmenter.bundled().get
    val tokenizer = epic.preprocess.TreebankTokenizer
    val parser = epic.models.ParserSelector.loadParser("en").get
    val slabs = documents.map(Slab(_))
      .map(sentenceSegmenter(_))
      .map(tokenizer[Sentence](_))
      .map(parser[Sentence, ContentToken](_))

Drawback: Doesn't really work, because the type information is passed
around at class level. Would require some major rewrite of slab code.

e) Your solution.


David Hall

Mar 3, 2015, 9:35:37 PM
to scalanlp...@googlegroups.com, Simon Hafner
So obviously I'd love if (b) could work, but that compile time is insane. Any idea if we can help the compiler out a little by adding some extra well placed implicits to hold its hand?

How much of an advantage are the hlists (vs hacky intersection type), given these compile times? Intersection types work with (b) and don't have bad compile time properties.
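
For readers unfamiliar with the intersection-type approach mentioned here, the following is a minimal sketch of the idea, not epic's actual Slab code. All names (IntersectionSketch, Slab, segment, tokenize, layers) are hypothetical, and the type parameter is a phantom that only records which annotation kinds have been added.

```scala
object IntersectionSketch {
  // Marker traits for annotation kinds (hypothetical stand-ins).
  trait Sentence
  trait Token

  // T is a phantom type: it only records which annotation kinds were added.
  final class Slab[+T](val content: String, val layers: Set[String])

  // Adds sentences to any slab, whatever it already carries.
  def segment[In](slab: Slab[In]): Slab[In with Sentence] =
    new Slab[In with Sentence](slab.content, slab.layers + "Sentence")

  // Requires sentences to be present; the bound is checked at compile time.
  def tokenize[In <: Sentence](slab: Slab[In]): Slab[In with Token] =
    new Slab[In with Token](slab.content, slab.layers + "Token")

  val raw: Slab[Any] = new Slab[Any]("Hello world.", Set.empty)
  val annotated = tokenize(segment(raw))

  // The inferred type conforms to the intersection the pipeline promises.
  val checked: Slab[Sentence with Token] = annotated

  // tokenize(raw) is rejected by the compiler: Any is not a subtype of Sentence.
}
```

The compiler only has to check one subtype bound per call, which is presumably why this style avoids the long implicit searches of the HList version.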

I don't totally understand (c). What are the acceptable things to put in epic.preprocess.TreebankTokenizer.slab[???]

(a) and (d) are not ideal.

Simon Hafner

Mar 3, 2015, 10:11:20 PM
to scalanlp...@googlegroups.com
2015-03-03 20:35 GMT-06:00 David Hall <david....@gmail.com>:
> So obviously I'd love if (b) could work, but that compile time is insane.
> Any idea if we can help the compiler out a little by adding some extra well
> placed implicits to hold its hand?

The code I used before:
https://github.com/reactormonk/epic/blob/170e8cf273a15c855b1f47610810b7305de6fea0/slab/src/main/scala/slab/typeclasses.scala#L39-L63

It basically selects ALL the subtypes and merges all matches via a monad.
Maybe switch it to only select the first type? It should speed things
up (to be tested), but drops anything after the first find of a
sentence. Honestly this behavior would probably be preferred to
merging all found objects. It should also be faster.
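
A minimal sketch of what "only select the first type" could look like at the type level. This is not the linked typeclasses.scala code: it uses a hand-rolled HList instead of a library one, and every name (SelectFirstSketch, HCons, SelectFirst, inHead, inTail) is hypothetical. The head instance is preferred over the tail instance through the usual low-priority-trait pattern, so resolution stops at the first conforming element.

```scala
object SelectFirstSketch {
  // A minimal heterogeneous list, standing in for a library HList.
  sealed trait HList
  final case class HCons[+H, +T <: HList](head: H, tail: T) extends HList
  sealed trait HNil extends HList
  case object HNil extends HNil

  // Evidence that a value conforming to U can be pulled out of L.
  trait SelectFirst[L <: HList, U] { def apply(list: L): U }

  trait LowPrioritySelectFirst {
    // Fallback: skip the head and keep searching the tail.
    implicit def inTail[H, T <: HList, U](
        implicit rest: SelectFirst[T, U]): SelectFirst[HCons[H, T], U] =
      new SelectFirst[HCons[H, T], U] {
        def apply(list: HCons[H, T]): U = rest(list.tail)
      }
  }

  object SelectFirst extends LowPrioritySelectFirst {
    // Preferred: the head conforms to U, so the search stops here.
    implicit def inHead[H, T <: HList, U](
        implicit ev: H <:< U): SelectFirst[HCons[H, T], U] =
      new SelectFirst[HCons[H, T], U] {
        def apply(list: HCons[H, T]): U = ev(list.head)
      }
  }

  final case class Sentence(text: String)
  final case class Token(text: String)

  type Doc = HCons[Token, HCons[Sentence, HCons[Sentence, HNil]]]
  val doc: Doc =
    HCons(Token("Hi"), HCons(Sentence("first"), HCons(Sentence("second"), HNil)))

  // Resolved at compile time; the second Sentence entry is never looked at.
  val firstSentence: Sentence = implicitly[SelectFirst[Doc, Sentence]].apply(doc)
}
```

As noted above, the trade-off is that anything after the first match is dropped rather than merged.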

> How much of an advantage are the hlists (vs hacky intersection type), given
> these compile times? Intersection types work with (b) and don't have bad
> compile time properties.
So basically Sentence with Token in this case with a lot of case
statements? Might work, but I haven't messed with intersection types
enough to see the pros and cons.

> I don't totally understand (c). What are the acceptable things to put in
> epic.preprocess.TreebankTokenizer.slab[???]
As specified by the slab method, which needs to be defined on a
per-annotator basis. In the case of the TreebankTokenizer, it would be
S <: Sentence.
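
A rough sketch of the shape of (c), assuming a whitespace tokenizer as a stand-in. None of these names (SlabMethodSketch, SlabTokenizer, WhitespaceTokenizer, PlainSentence) come from epic; only the idea of a per-annotator slab method with a bound such as S <: Sentence is taken from the description above.

```scala
object SlabMethodSketch {
  trait Sentence { def text: String }
  final case class PlainSentence(text: String) extends Sentence
  final case class Token(text: String)

  // An annotator whose input type S is fixed once, when it is constructed.
  final class SlabTokenizer[S <: Sentence] {
    def apply(sentences: Seq[S]): Seq[(S, Seq[Token])] =
      sentences.map { s =>
        val tokens = s.text.split("\\s+").filter(_.nonEmpty).map(Token(_)).toSeq
        s -> tokens
      }
  }

  object WhitespaceTokenizer {
    // The bound on S decides what may go in the brackets.
    def slab[S <: Sentence]: SlabTokenizer[S] = new SlabTokenizer[S]
  }

  val tokenizer = WhitespaceTokenizer.slab[PlainSentence]
  val result = tokenizer(Seq(PlainSentence("Hello world")))

  // WhitespaceTokenizer.slab[String] does not compile: String is not a Sentence.
}
```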

> (a) and (d) are not ideal.
I agree. Dropped.

David Hall

Mar 4, 2015, 12:06:36 PM
to scalanlp...@googlegroups.com
On Tue, Mar 3, 2015 at 7:11 PM, Simon Hafner <hafne...@gmail.com> wrote:
> 2015-03-03 20:35 GMT-06:00 David Hall <david....@gmail.com>:
>> So obviously I'd love if (b) could work, but that compile time is insane.
>> Any idea if we can help the compiler out a little by adding some extra well
>> placed implicits to hold its hand?
>
> The code I used before:
> https://github.com/reactormonk/epic/blob/170e8cf273a15c855b1f47610810b7305de6fea0/slab/src/main/scala/slab/typeclasses.scala#L39-L63
>
> it basically select ALL the subtypes and merges all matches via monad.
> Maybe switch it to only select the first type? It should speed things
> up (to be tested), but drops anything after the first find of a
> sentence. Honestly this behavior would probably be preferred to
> merging all found objects. It should also be faster.

Cool, let's try that out!
 

>> How much of an advantage are the hlists (vs hacky intersection type), given
>> these compile times? Intersection types work with (b) and don't have bad
>> compile time properties.
> So basically Sentence with Token in this case with a lot of case
> statements? Might work, but I haven't messed with intersection types
> enough to see the pros and cons.

I don't actually see why there are case statements? The way I had it before we just used the ClassTag to index the spans of different types, and keep them in sorted order for fast lookup. 
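
To illustrate the ClassTag indexing described here, a minimal sketch under assumptions: the names (SpanIndexSketch, Span, SpanIndex, add, iterator) are hypothetical, and the earlier implementation may well have used a different data structure. Each annotation is filed under its runtime class, and every bucket is kept sorted by span so no pattern matching over types is needed on lookup.

```scala
import scala.reflect.ClassTag

object SpanIndexSketch {
  final case class Span(begin: Int, end: Int)

  final class SpanIndex private (byClass: Map[Class[_], Vector[(Span, Any)]]) {
    // Files the value under the runtime class of A and re-sorts that bucket.
    def add[A](span: Span, value: A)(implicit tag: ClassTag[A]): SpanIndex = {
      val key: Class[_] = tag.runtimeClass
      val bucket: Vector[(Span, Any)] =
        (byClass.getOrElse(key, Vector.empty[(Span, Any)]) :+ ((span, value)))
          .sortBy { case (s, _) => (s.begin, s.end) }
      new SpanIndex(byClass.updated(key, bucket))
    }

    // Returns annotations stored under exactly A's class, in span order.
    def iterator[A](implicit tag: ClassTag[A]): Iterator[(Span, A)] =
      byClass.getOrElse(tag.runtimeClass, Vector.empty[(Span, Any)]).iterator
        .map { case (s, v) => (s, v.asInstanceOf[A]) }
  }

  object SpanIndex {
    val empty: SpanIndex = new SpanIndex(Map.empty)
  }

  final case class Sentence(id: Int)
  final case class Token(text: String)

  val index: SpanIndex = SpanIndex.empty
    .add(Span(6, 11), Token("world"))
    .add(Span(0, 5), Token("Hello"))
    .add(Span(0, 11), Sentence(0))

  // Tokens come back in span order, regardless of insertion order.
  val tokens: List[String] = index.iterator[Token].map(_._2.text).toList
}
```

Re-sorting on every insert keeps the sketch short; a sorted map or binary-search insertion would be the obvious refinement if lookup speed matters.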
 

>> I don't totally understand (c). What are the acceptable things to put in
>> epic.preprocess.TreebankTokenizer.slab[???]
> as specified by the slab method, which needs to be defined on an
> annotator basis. in the case of the TreebankTokenizer, it would be
> S <: Sentence.

Right. Yeah, I don't like this either.

Simon Hafner

Mar 4, 2015, 10:49:15 PM
to scalanlp...@googlegroups.com
2015-03-03 21:11 GMT-06:00 Simon Hafner <hafne...@gmail.com>:
> 2015-03-03 20:35 GMT-06:00 David Hall <david....@gmail.com>:
>> So obviously I'd love if (b) could work, but that compile time is insane.
>> Any idea if we can help the compiler out a little by adding some extra well
>> placed implicits to hold its hand?
>
> The code I used before:
> https://github.com/reactormonk/epic/blob/170e8cf273a15c855b1f47610810b7305de6fea0/slab/src/main/scala/slab/typeclasses.scala#L39-L63
>
> it basically select ALL the subtypes and merges all matches via monad.
> Maybe switch it to only select the first type? It should speed things
> up (to be tested), but drops anything after the first find of a
> sentence. Honestly this behavior would probably be preferred to
> merging all found objects. It should also be faster.
It is faster. Works as expected. Code:
https://github.com/reactormonk/epic/blob/master/src/test/scala/epic/documentation/ReadmeExamplesTest.scala#L22-L32

I'll also adapt the PoS tagger and the NER, anything else that needs a
slab adapter?

Simon Hafner

Mar 5, 2015, 9:01:26 AM
to scalanlp...@googlegroups.com
The NER model uses `Any` as Label, which isn't really helpful. Is
there a way to further specify that?

https://github.com/reactormonk/epic/blob/master/src/main/scala/epic/models/NerModelLoader.scala#L12

Also, the AnnotatedLabel used by
https://github.com/reactormonk/epic/blob/master/src/main/scala/epic/models/PosTagModelLoader.scala#L13
might conflict with other AnnotatedLabels, which will all get selected
together. But I don't see a way out of this, because it's a case
class, so creating a new subclass isn't too easy either.