Announcing Epic, a Natural Language Processing Parser/POS Tagger, and more


David Hall

May 29, 2014, 1:44:56 AM
to scala-...@googlegroups.com, scalanlp...@googlegroups.com
Hi everyone,

I'm pleased to finally release my parser, Epic. Epic is, first and foremost, a statistical parser for natural language text. It also includes a part-of-speech tagger and a named entity recognition system, and it doubles as a framework for building "structured prediction" systems, which is just a fancy way of saying machine learning classifiers that produce structured outputs like parse trees or part-of-speech tag sequences.

I'm releasing parsing models for 8 languages: English, Basque, French, German, Hungarian, Korean, Polish, and Swedish. (It is one of the better parsers available for all of these languages, though not quite the best in anything except Polish.) I am also releasing part-of-speech taggers for these languages (except Korean). They are also around state of the art, I think. Finally, I'm releasing a pretty okay NER system for English.

All of these models assume well-edited text, like newswire or other formal writing. They are unlikely to work on Tweets, for example. I would be happy to assist in the construction of more models for other languages, or models that are more portable to other domains. (Especially if you're interested in hiring a consultant...)

Documentation and more information are available on the GitHub page: github.com/dlwh/epic

I'm tagging this release as 0.1, but it's been under development for a very long time. That said, it hasn't been used in the field yet, so I want to reserve the right to change the API, and I plan on doing so.

-- David


Jason Baldridge

May 30, 2014, 4:32:10 PM
to scalanlp...@googlegroups.com, scala-...@googlegroups.com
Congrats on getting that out! I haven't looked at it other than the announcement, but from that it sounds like you've got a lot of the core bits in there that one needs for NLP tools. I'm wondering if we should consider retiring Chalk, pulling out anything Epic needs, and letting Epic be the NLP toolkit? Just a thought.
--
Jason Baldridge
Associate Professor, Dept. of Linguistics, UT Austin
Co-founder & Chief Scientist, People Pattern
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

David Hall

May 30, 2014, 8:59:43 PM
to scalanlp...@googlegroups.com, scala-...@googlegroups.com
I think that's a reasonable idea at this point, since it's under more active development. One argument against is that we could let Chalk be more... agnostic about backend (e.g. making a Stanford CoreNLP backend), but I'm open to whatever.

Jason Baldridge

May 31, 2014, 11:40:13 AM
to scalanlp...@googlegroups.com, scala-...@googlegroups.com
Chalk has been mostly on ice, and Epic has momentum and a cooler name, so let's go with that. Unless you think Chalk could be good as a thin layer defining things like Slabs and such -- not so much as a way to create a backend-agnostic package but more if it makes sense for modularity's sake.

Elias Ponvert

Jun 1, 2014, 8:31:21 AM
to scalanlp...@googlegroups.com

One very nice feature of OpenNLP is the *even* more low-level stuff -- sentence identification, tokenization -- using MaxEnt, with models available for various languages. These are great for getting started with the technology on raw text. Do you all think including that sort of stuff in Epic is useful, or would Chalk be a home for that functionality?

-Eli

David Hall

Jun 1, 2014, 1:51:43 PM
to scalanlp...@googlegroups.com
I think it's definitely useful. I actually built a MaxEnt sentence splitter as part of this release. (It's currently only trained on English, and potentially not very good.) I have a hand-written JFlex treebank tokenizer as well, but that could be turned into something ML-based.

-- David

Elias Ponvert

Jun 1, 2014, 9:14:50 PM
to scalanlp...@googlegroups.com
Wow, that's great! For the record, I do not believe tokenization should *necessarily* be ML-based.

David Hall

Jun 1, 2014, 11:34:45 PM
to scalanlp...@googlegroups.com
Ok. I'll probably do this then.

I'd like to get discussion going on a few things. I'm going to enumerate things that I think are worth keeping, and then argue that we should throw a few things out.

Keep (and Modify)
  • Slabs. I like slabs. Maybe they should be shapeless-ized, and the base implementation needs to be improved, as has been noted, but it's a good abstraction.
     A few questions on Slab:
    1. What ContentTypes are we envisioning, apart from strings? I guess really long texts that need to be ropes or file backed or something?
    2. Epic has Span as one of its primitives. Because of this, I'm inclined to separate the label of a span from the span's range that the label applies to. Thoughts?
    3. Better slab implementation: anyone have a better idea than bucketing annotations by type and then sorting within a bucket?
  • Corpus Readers: MASC and CoNLL
  • One of the porter stemmer implementations.
  • Tokenizers, Segmenters, etc. need to be Slabified. 
  • EnglishWordClassGenerator, maybe CaseFolder
  • dramage's HTML stuff
Remove
  • LDA: I threw it together quickly and didn't test it carefully. We'd want to revisit it/put it in a separate project anyway. (I get emails asking me about the Stanford topic modeling toolkit every once in a while. Possibly worth making something like that.)
  • Twokenize? Or rather, I'd make this into a JFlex tokenizer, possibly even integrating it directly into the standard treebank tokenizer. (Most of the processing is probably safe to do everywhere, right?)
  • LanguagePacks: bad abstraction
  • Text/LabeledText. Don't think I ever used these.

Thoughts?

-- David   

David Hall

Jun 2, 2014, 12:26:17 AM
to scalanlp...@googlegroups.com
Oh, one other thing: I won't bring the Akka stuff over. Akka always leaves a bad taste in my mouth anytime I get anywhere near it. We can build a layer that interfaces with Akka later, but I'd rather not lock us into a particular abstraction right now. (One of the things I dislike about Akka in particular is that by using it you're committing yourself to its world: its configuration system, its (lack of) type system, etc.)

Elias Ponvert

Jun 2, 2014, 10:23:46 AM
to scalanlp...@googlegroups.com
+1 I like it a lot. 

WRT Akka I totally agree. I even like Akka a lot (though I'm hardly an expert), but it's better suited for a services layer on top of a library like Epic or Chalk.

I'd only note briefly that Twokenize is pretty good; I've been happy with it. It'd take some work to convert it to use slabs or offsets, and more work still to convert it to a JFlex tokenizer. It'd be great if it could make it in, and I think eventually it should (I'll help convert it), but I wouldn't want it to block the merging you're suggesting.

Jason Baldridge

Jun 2, 2014, 12:46:55 PM
to scalanlp...@googlegroups.com
+1 too, this all makes a lot of sense.

Like Elias, I agree Akka has some great uses, but it makes sense to leave it out of Epic.

Regarding other content types, I mainly was thinking of images, or mixtures of text and images.

No thoughts on better slab implementation -- let's get a sensible easy thing going and then iterate and improve.

Agree on the removal suggestions. Perhaps for stuff like Twokenize we could have a minimal library that provides those and that Epic doesn't depend on. Or we could even re-purpose Chalk as a package for storing or managing language/domain specific models for Epic and include code like Twokenize?

Regarding spans, that sounds good. Perhaps we can do this with traits and mix-ins to get LabeledSpans when we want them?

Steven Bethard

Jun 2, 2014, 1:52:52 PM
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu

On Sunday, June 1, 2014 11:34:45 PM UTC-4, David Hall wrote:
Keep (and Modify)
  • Slabs. I like slabs. Maybe they should be shapeless-ized, and the base implementation needs to be improved, as has been noted, but it's a good abstraction.
     A few questions on Slab:
    1. What ContentTypes are we envisioning, apart from strings? I guess really long texts that need to be ropes or file backed or something?
One possibility that I considered is having speech Slabs as well. You could imagine adding text annotations to spans of audio, e.g. their transcriptions. Is anyone likely to do this soon? Probably not. So if you want to simplify things to just assume strings, it probably wouldn't be the end of the world.
    2. Epic has Span as one of its primitives. Because of this, I'm inclined to separate the label of a span from the span's range that the label applies to. Thoughts?
I'm not sure I quite understood this. Can you give a brief snippet of what class structure you're proposing here?

    3. Better slab implementation: anyone have a better idea than bucketing annotations by type and then sorting within a bucket?
The UIMA CAS approach is probably a pretty good place to start (see the Implementation -> Index Repository section):

Design and implementation of the UIMA Common Analysis System

They do basically what you say - they have a sorted index for each type - but they take some extra care to handle superclass/subclass relations properly, e.g. doing the merge-sort across indexes to cover all subtypes when searching for a supertype.
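
Something like this, roughly -- a minimal sketch of the bucketed-index idea, with all names made up for illustration (this is neither Chalk's nor UIMA's actual API):

    import scala.reflect.ClassTag

    case class Span(begin: Int, end: Int)

    // One bucket per annotation class, each kept sorted by begin offset.
    class AnnotationIndex {
      private var buckets = Map.empty[Class[_], Vector[(Span, Any)]]

      def add(span: Span, label: AnyRef): Unit = {
        val bucket = buckets.getOrElse(label.getClass, Vector.empty)
        buckets += label.getClass -> (bucket :+ ((span, label))).sortBy(_._1.begin)
      }

      // Searching for a supertype merges every bucket whose class is a
      // subtype, like the UIMA CAS merge across subtype indexes.
      def query[A: ClassTag]: Vector[(Span, A)] = {
        val wanted = implicitly[ClassTag[A]].runtimeClass
        buckets.iterator
          .collect { case (cls, anns) if wanted.isAssignableFrom(cls) => anns }
          .flatten
          .toVector
          .sortBy(_._1.begin)
          .map { case (s, l) => (s, l.asInstanceOf[A]) }
      }
    }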

Steve 

David Hall

Jun 2, 2014, 2:55:43 PM
to scalanlp...@googlegroups.com

On Mon, Jun 2, 2014 at 10:52 AM, Steven Bethard <bet...@cis.uab.edu> wrote:
I'm not sure I quite understood this. Can you give a brief snippet of what class structure you're proposing here?

A Span is just Span(begin, end), and annotations are essentially (Span, LabelType). It's possible I can just rename Chalk's Span to LabeledSpan.
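
In code, roughly (a sketch of the proposed shape, not the final API):

    // Sketch of the proposed split (names illustrative):
    case class Span(begin: Int, end: Int) {
      def length: Int = end - begin
    }

    // roughly what Chalk's current Span would become:
    case class LabeledSpan[L](span: Span, label: L)

    // e.g. a named-entity annotation over character offsets [0, 12):
    val mention = LabeledSpan(Span(0, 12), "PERSON")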

-- David

Steven Bethard

Jun 2, 2014, 5:33:49 PM
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
Having a span be just Span(begin, end) sounds good to me.

Steve

David Hall

Jun 4, 2014, 8:41:19 PM
to scalanlp...@googlegroups.com
Ok, this is the current status:


Tokenizers and SentenceSegmenters are now Slabified. The rest of Epic mostly doesn't use slabs yet. AnalysisFunctions and Slabs have been improved a little bit.

Twokenize has been removed. TreebankTokenizer now subsumes it. (I checked against Brendan's examples. There are some differences, namely that we split contractions and we turn ")" into -RRB-, etc.)
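
For instance, assuming standard PTB conventions (illustrative input/output, not verified against the actual tokenizer):

    // Illustrative only: standard PTB conventions, not verified Epic output.
    val input  = "I can't pay (yet)."
    val output = Seq("I", "ca", "n't", "pay", "-LRB-", "yet", "-RRB-", ".")
    // contractions are split in two; round brackets become -LRB- / -RRB-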

-- David

Steven Bethard

Jun 5, 2014, 1:18:10 PM
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
On Wednesday, June 4, 2014 8:41:19 PM UTC-4, David Hall wrote:
Ok, this is the current status:


I got errors [1] when trying to build this until I manually ran `cp sbt-launch.jar project/sbt-launch.jar`. Maybe the sbt-launch.jar in project is out of date? I don't really use SBT much, so I'm not sure whether the problem's on my end or in the repository.

I noticed in AnalysisPipeline.main that you convinced Scala to compile "sentenceSegmenter andThen tokenizer", whereas I always had to use the workaround "stringBegin andThen sentenceSegmenter andThen tokenizer". What was the trick? Is it your definition of andThen?

Steve

[1]
$ ./sbt 
Getting org.scala-tools.sbt sbt_2.9.1 0.13.2 ...

:: problems summary ::
:::: WARNINGS
module not found: org.scala-tools.sbt#sbt_2.9.1;0.13.2

==== local: tried

 /Users/bethard/.ivy2/local/org.scala-tools.sbt/sbt_2.9.1/0.13.2/ivys/ivy.xml

==== typesafe-ivy-releases: tried


==== Maven Central: tried


==== Scala-Tools Maven2 Repository: tried


==== Scala-Tools Maven2 Snapshots Repository: tried


::::::::::::::::::::::::::::::::::::::::::::::

::          UNRESOLVED DEPENDENCIES         ::

::::::::::::::::::::::::::::::::::::::::::::::

:: org.scala-tools.sbt#sbt_2.9.1;0.13.2: not found

::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
unresolved dependency: org.scala-tools.sbt#sbt_2.9.1;0.13.2: not found
Error during sbt execution: Error retrieving required libraries
  (see /Users/bethard/.sbt/boot/update.log for complete log)
Error: Could not retrieve sbt 0.13.2

David Hall

Jun 5, 2014, 1:42:54 PM
to scalanlp...@googlegroups.com
I didn't realize I was still bundling an SBT. I've stopped doing that, as a general rule.

Yeah. The big change is that AnalysisPipelines are no longer Function1. They instead have a method def apply[Input <: RequiredInput](...). The new andThen is kind of the only thing it could be, given the new definition. One thing that we can't eloquently express right now is the conjunction of two pipelines that have disjoint requirements. Scala is not going to figure out that if I have x andThen y, where x needs Foo and produces Bar, and y needs a Foo, Bar, and Baz to produce a Bam, then x andThen y expects a Foo with Baz. I think we'd need to move to something else.
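
Here's a stripped-down sketch of the issue (annotation sets tracked as a phantom type; this is a simplification, not the real signatures):

    class Slab[+Annotations] // annotations present so far, as a phantom type

    trait Foo; trait Bar; trait Baz; trait Bam

    trait AnalysisFunction[Required, Added] { self =>
      def apply[In <: Required](slab: Slab[In]): Slab[In with Added]

      // Composition only typechecks when `next` needs nothing beyond
      // Required with Added; Scala won't infer that a missing Baz should
      // be pushed onto the composite's input requirement.
      def andThen[Added2](next: AnalysisFunction[Required with Added, Added2])
          : AnalysisFunction[Required, Added with Added2] =
        new AnalysisFunction[Required, Added with Added2] {
          def apply[In <: Required](slab: Slab[In]): Slab[In with Added with Added2] =
            next(self(slab))
        }
    }

    val x: AnalysisFunction[Foo, Bar] = ???
    val y: AnalysisFunction[Foo with Bar with Baz, Bam] = ???
    // x andThen y   // does not compile: y's Baz requirement can't be
    //               // propagated into the composite's input type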


A few questions: 

1) right now my Tree class knows about spans over tokens. This is very useful in the learning pipeline. Should I introduce another annotation type (or types) for constituents that we should put in slabs? How should constituents be related to one another? should I instead have a ParseAnnotation that has a Tree and the relevant char offsets?
2) What is the best way to represent relationships between Spans? We have a good mechanism for representing annotations of a particular span, but for things like coref or SRL we want to be able to relate two or more spans to one another. Maybe just pick one span to "host" the annotation and it has pointers to the other spans involved?

-- David

David Hall

Jun 5, 2014, 2:53:08 PM
to scalanlp...@googlegroups.com
Also, I'm now meant to give a talk on Breeze and Epic in two weeks at SF Scala. Anyone have thoughts on how to motivate parsing to a general CS audience?

Steven Bethard

Jun 7, 2014, 5:35:35 PM
to scalanlp...@googlegroups.com
On Thu, Jun 5, 2014 at 1:42 PM, David Hall <dl...@cs.berkeley.edu> wrote:
> I didn't realize I was still bundling an SBT. I've stopped doing that, as a
> general rule.

Sounds like a good idea.

> Yeah. The big change is that AnalysisPipelines are no longer Function1. They
> instead have a method def apply[Input <: RequiredInput](...). The new
> andThen is kind of the only thing it could be, given the new definition.
> One thing that we can't eloquently express right now is the conjunction of
> two pipelines that have disjoint requirements. Scala is not going to figure
> out that if I have x andThen y, where x needs Foo and produces Bar, and y
> needs a Foo, Bar, and Baz to produce a Bam, then x andThen y expects a Foo
> with Baz. I think we'd need to move to something else.

Is that something else Spire? Would it handle this case? I don't know
how important that case is to me, but it doesn't seem unreasonable...

> A few questions:
>
> 1) right now my Tree class knows about spans over tokens. This is very
> useful in the learning pipeline. Should I introduce another annotation type
> (or types) for constituents that we should put in slabs? How should
> constituents be related to one another? should I instead have a
> ParseAnnotation that has a Tree and the relevant char offsets?

It's nice to have constituents as annotations so that you can, say,
search for the largest constituent that a particular named entity
contains. In theory, you don't need to provide any relation between
constituents since that can be inferred from Slab.covered. In
practice, it might still be useful to have a pointer from parent to
child (or vice versa) since that would be quicker than a Slab lookup.
How hard would it be to just make your Tree an annotation? (Assuming
constituents are Trees and sub-Trees.) Then we'd get whatever helpful
utility methods you already have for Trees.

> 2) What is the best way to represent relationships between Spans? We have a
> good mechanism for representing annotations of a particular span, but for
> things like coref or SRL we want to be able to relate two or more spans to
> one another. Maybe just pick one span to "host" the annotation and it has
> pointers to the other spans involved?

The "host" approach is pretty much what people do in UIMA. So a
predicate has pointers to its arguments. With coreference, either each
mention points to its antecedent, or there's an Entity annotation that
has pointers to all the EntityMention annotations. (What the begin and
end offsets are for the Entity is a point of contention...)
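
In Scala terms, that pattern might look something like this (a sketch of the UIMA-style idea, not an existing API):

    case class Span(begin: Int, end: Int)

    case class EntityMention(span: Span)

    // The Entity "hosts" the relation and points at its mentions; what
    // span the Entity itself should carry is the contentious part.
    case class Entity(span: Span, mentions: Vector[EntityMention])

    val m1 = EntityMention(Span(0, 12))  // "Barack Obama"
    val m2 = EntityMention(Span(40, 42)) // "he"
    val entity = Entity(Span(0, 42), Vector(m1, m2))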

One other thing I guess I should mention is that various corpora have
annotations with discontinuous spans. I haven't thought at all about
what we'd have to do to handle that sensibly. (UIMA doesn't really
support it either.)

Steve

David Hall

Jun 7, 2014, 6:05:29 PM
to scalanlp...@googlegroups.com
On Sat, Jun 7, 2014 at 2:35 PM, Steven Bethard <steven....@gmail.com> wrote:
On Thu, Jun 5, 2014 at 1:42 PM, David Hall <dl...@cs.berkeley.edu> wrote:
> I didn't realize I was still bundling an SBT. I've stopped doing that, as a
> general rule.

Sounds like a good idea.

> Yeah. The big change is that AnalysisPipelines are no longer Function1. They
> instead have a method def apply[Input <: RequiredInput](...). The new
> andThen is kind of the only thing it could be, given the  new definition.
> One thing that we can't eloquently express right now is the conjunction of
> two pipelines that have disjoint requirements. Scala is not going to figure
> out that if I have x andThen y, where x needs Foo and produces Bar, and y
> needs a Foo, Bar, and Baz to produce a Bam, then x andThen y expects a Foo
> with Baz. I think we'd need to move to something else.

Is that something else Spire? Would it handle this case? I don't know
how important that case is to me, but it doesn't seem unreasonable...

Shapeless, yeah. It's possible to manually specify the types in the current approach, I think, but the inference isn't powerful enough to do it by itself.
 

> A few questions:
>
> 1) right now my Tree class knows about spans over tokens. This is very
> useful in the learning pipeline. Should I introduce another annotation type
> (or types) for constituents that we should put in slabs? How should
> constituents be related to one another? should I instead have a
> ParseAnnotation that has a Tree and the relevant char offsets?

It's nice to have constituents as annotations so that you can, say,
search for the largest constituent that a particular named entity
contains. In theory, you don't need to provide any relation between
constituents since that can be inferred from Slab.covered. In
practice, it might still be useful to have a pointer from parent to
child (or vice versa) since that would be quicker than a Slab lookup.
How hard would it be to just make your Tree an annotation? (Assuming
constituents are Trees and sub-Trees.) Then we'd get whatever helpful
utility methods you already have for Trees.

The biggest thing is (carefully!) going through the code usages and being clear about whether a begin/end pair refers to tokens or to char offsets.
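
One cheap way to keep the two straight would be distinct wrapper types, e.g. (just a sketch of the idea, not existing Epic code):

    // Zero-overhead wrappers so token indices and char offsets
    // can't be silently confused.
    case class CharOffset(value: Int) extends AnyVal
    case class TokenIndex(value: Int) extends AnyVal

    case class CharSpan(begin: CharOffset, end: CharOffset)  // [begin, end) in chars
    case class TokenSpan(begin: TokenIndex, end: TokenIndex) // [begin, end) in tokens

    // Converting demands an explicit token -> char mapping:
    def toCharSpan(ts: TokenSpan,
                   tokenStarts: Array[Int],
                   tokenEnds: Array[Int]): CharSpan =
      CharSpan(CharOffset(tokenStarts(ts.begin.value)),
               CharOffset(tokenEnds(ts.end.value - 1)))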
 

> 2) What is the best way to represent relationships between Spans? We have a
> good mechanism for representing annotations of a particular span, but for
> things like coref or SRL we want to be able to relate two or more spans to
> one another. Maybe just pick one span to "host" the annotation and it has
> pointers to the other spans involved?

The "host" approach is pretty much what people do in UIMA. So a
predicate has pointers to its arguments. With coreference, either each
mention points to its antecedent, or there's an Entity annotation that
has pointers to all the EntityMention annotations. (What the begin and
end offsets are for the Entity is a point of contention...)

I like that latter one. I do see the potential for problems with where the entity goes. I'd probably do the entire document? (Or, rather, the largest chunk that was fed to the coref system at one time.)
 

One other thing I guess I should mention is that various corpora have
annotations with discontinuous spans. I haven't thought at all about
what we'd have to do to handle that sensibly. (UIMA doesn't really
support it either.)

Ah. Yeah. Pointers probably work OK here, I guess?
 

Steve

Steven Bethard

Jun 7, 2014, 6:25:37 PM
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
On Saturday, June 7, 2014 6:05:29 PM UTC-4, David Hall wrote:
On Sat, Jun 7, 2014 at 2:35 PM, Steven Bethard <steven....@gmail.com> wrote:
> Yeah. The big change is that AnalysisPipelines are no longer Function1. They
> instead have a method def apply[Input <: RequiredInput](...). The new
> andThen is kind of the only thing it could be, given the  new definition.
> One thing that we can't eloquently express right now is the conjunction of
> two pipelines that have disjoint requirements. Scala is not going to figure
> out that if I have x andThen y, where x needs Foo and produces Bar, and y
> needs a Foo, Bar, and Baz to produce a Bam, then x andThen y expects a Foo
> with Baz. I think we'd need to move to something else.

Is that something else Spire? Would it handle this case? I don't know
how important that case is to me, but it doesn't seem unreasonable...

Shapeless, yeah. It's possible to manually specify the types in the current approach, I think, but the inference isn't powerful enough to do it by itself.

Seems like a good argument for moving to shapeless then. Making users figure out the types for pipelines sounds like a recipe for frustration. ;-)
 
The "host" approach is pretty much what people do in UIMA. So a
predicate has pointers to its arguments. With coreference, either each
mention points to its antecedent, or there's an Entity annotation that
has pointers to all the EntityMention annotations. (What the begin and
end offsets are for the Entity is a point of contention...)

I like that latter one. I do see the potential for problems with where the entity goes. I'd probably do the entire document? (Or, rather, the largest chunk that was fed to the coref system at one time.)

Yeah, that seems reasonable. The only other alternative I can think of is to start from the start of the first mention and end at the end of the last mention. One argument for doing that would be that if you had a large document with sections, you could ask for the Entity annotations that appear wholly within that section.

In general, when trying to figure out what the offsets should be, I try to ask myself: Who might want to search for this, and where would they expect to find it?
 
One other thing I guess I should mention is that various corpora have
annotations with discontinuous spans. I haven't thought at all about
what we'd have to do to handle that sensibly. (UIMA doesn't really
support it either.)

Ah. Yeah. Pointers probably work ok here, i guess?

So have a primary Span with a pointer to the secondary discontinuous span? Yeah, that might work, but I don't know how preceding, following, covered, etc. should handle such a thing.

Steve

Jason Baldridge

Jun 9, 2014, 10:45:48 AM
to scalanlp...@googlegroups.com
Cool! I'd focus on:

 - High-level motivation, which you can do in one or two ways. One is the language acquisition perspective: how do kids come up with these grammar things, and why is machine learning relevant? (This could be pretty much a side note, perhaps saying a bit about what Dan Klein and others have done in unsupervised parsing land.) The other is practical: identifying pred-arg relationships so that one can query DBs using natural language, or talk to robots, etc. They'll get it if you talk about it in terms of turning unstructured language data into structured pred-arg data -- just make it clear to them that this is a deeper level of processing than what is usually meant by unstructured-to-structured processing of natural language.

 - Make sure to emphasize how NL parsing is harder than you'd think. I really like the "she saw her duck with a telescope" and "the a are of I" examples -- they drive the point home quickly and effectively. Lillian Lee made two nice slides for this in the talk we gave at SXSW a couple of years back (feel free to use/modify, and I'm happy to share the Keynote source): http://www.jasonbaldridge.com/papers/hlt-sxsw12.pdf

 - Which leads to: motivate the computational challenge of the problem, which is fun to geek out on, motivates the GPU angle nicely, and links back to why we care about ML approaches as a way to manage both ambiguity and search in parsing.

Then, show them all the cool stuff you did, and how Scala helped get it done! (And of course, anything that Scala made harder, etc.)

If you make slides, it would be great to see them!

Jason

David Hall

Jun 9, 2014, 9:02:08 PM
to scalanlp...@googlegroups.com
Awesome, thanks so much!