One very nice feature of OpenNLP is the *even* more low-level stuff -- sentence identification, tokenization -- using MaxEnt, with models available for various languages. These are great for getting started with raw text. Do you all think including that sort of stuff in Epic is useful, or would Chalk be a home for that functionality?
-Eli
Keep (and Modify)
- Slabs. I like slabs. Maybe they should be shapeless-ized, and the base implementation needs to be improved, as has been noted, but it's a good abstraction.
A few questions on Slab:
- What ContentTypes are we envisioning, apart from strings? I guess really long texts that need to be ropes or file backed or something?
- Epic has Span as one of its primitives. Because of this, I'm inclined to separate the label of a span from the range that the label applies to. Thoughts?
- Better slab implementation: anyone have a better idea than bucketed annotations by type, sorted within a bucket?
I'm not sure I quite understood this. Can you give a brief snippet of what class structure you're proposing here?
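To make the "bucketed by type, sorted within a bucket" idea concrete, here's roughly the class structure I have in mind. This is a toy sketch -- `BucketedSlab`, `Span`, `add`, and `covered` are invented names for illustration, not the actual Slab API:

```scala
import scala.reflect.ClassTag

// Toy sketch: bucket annotations by runtime class, keep each bucket sorted.
case class Span(begin: Int, end: Int)

class BucketedSlab(val content: String,
                   buckets: Map[Class[_], Vector[(Span, Any)]] = Map.empty) {

  // Add an annotation to its type's bucket, keeping the bucket sorted by
  // (begin, end). A real implementation would insert in place rather than
  // re-sorting on every add.
  def add[A](span: Span, annotation: A): BucketedSlab = {
    val key = annotation.getClass
    val bucket = (buckets.getOrElse(key, Vector.empty) :+ ((span, annotation): (Span, Any)))
      .sortBy { case (s, _) => (s.begin, s.end) }
    new BucketedSlab(content, buckets.updated(key, bucket))
  }

  // All annotations of type A whose spans lie inside `outer`. Because each
  // bucket is sorted, a real implementation could binary-search the bounds
  // instead of scanning.
  def covered[A](outer: Span)(implicit ct: ClassTag[A]): Vector[(Span, A)] =
    buckets.getOrElse(ct.runtimeClass, Vector.empty)
      .filter { case (s, _) => outer.begin <= s.begin && s.end <= outer.end }
      .map { case (s, a) => (s, a.asInstanceOf[A]) }
}
```

The bucket-per-type layout keeps queries like "all Tokens inside this Sentence" from having to wade through unrelated annotation types.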
Ok, this is the current status:
On Thu, Jun 5, 2014 at 1:42 PM, David Hall <dl...@cs.berkeley.edu> wrote:
> I didn't realize I was still bundling an SBT. I've stopped doing that, as a
> general rule.

Sounds like a good idea.

> Yeah. The big change is that AnalysisPipelines are no longer Function1. They
> instead have a method def apply[Input <: RequiredInput](...). The new
> andThen is kind of the only thing it could be, given the new definition.
> One thing that we can't eloquently express right now is the conjunction of
> two pipelines that have disjoint requirements. Scala is not going to figure
> out that if I have x andThen y, where x needs Foo and produces Bar, and y
> needs a Foo, Bar, and Baz to produce a Bam, then x andThen y expects a Foo
> with Baz. I think we'd need to move to something else.

Is that something else Spire? Would it handle this case? I don't know
how important that case is to me, but it doesn't seem unreasonable...
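To make the andThen limitation concrete, here's a stripped-down toy model -- all names are invented, not the real Epic types. Stages track required and produced annotation types as phantom types, and the "only thing it could be" andThen lets the next stage require what the first stage's input and output together provide, but nothing more:

```scala
// Phantom annotation types for the example.
trait Foo; trait Bar; trait Baz; trait Bam

// A slab carrying a runtime set of annotation names, plus a phantom type T
// recording at compile time which annotation types are present.
case class Slab[T](annotations: Set[String]) {
  def add[U](name: String): Slab[T with U] = Slab[T with U](annotations + name)
}

// A stage that accepts any slab with at least R and enriches it with P.
trait Stage[R, P] { self =>
  def apply[I <: R](in: Slab[I]): Slab[I with P]

  // The next stage may require what we require plus what we produce. A
  // `next` that also needs a disjoint Baz will NOT typecheck here, which is
  // exactly the x-andThen-y case described above.
  def andThen[P2](next: Stage[R with P, P2]): Stage[R, P with P2] =
    new Stage[R, P with P2] {
      def apply[I <: R](in: Slab[I]): Slab[I with P with P2] = next(self(in))
    }
}
```

In this toy model `Stage[Foo, Bar] andThen Stage[Foo with Bar, Bam]` composes fine, but `Stage[Foo, Bar] andThen Stage[Foo with Bar with Baz, Bam]` is a type error, even though morally it should just be a `Stage[Foo with Baz, Bar with Bam]`.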
> A few questions:
>
> 1) right now my Tree class knows about spans over tokens. This is very
> useful in the learning pipeline. Should I introduce another annotation type
> (or types) for constituents that we should put in slabs? How should
> constituents be related to one another? should I instead have a
> ParseAnnotation that has a Tree and the relevant char offsets?

It's nice to have constituents as annotations so that you can, say,
search for the largest constituent that a particular named entity
contains. In theory, you don't need to provide any relation between
constituents since that can be inferred from Slab.covered. In
practice, it might still be useful to have a pointer from parent to
child (or vice versa) since that would be quicker than a Slab lookup.
How hard would it be to just make your Tree an annotation? (Assuming
constituents are Trees and sub-Trees.) Then we'd get whatever helpful
utility methods you already have for Trees.
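That kind of query needs nothing but span arithmetic. A toy sketch, assuming simple character offsets -- `Span`, `Constituent`, and `largestInside` are invented names, not anything in Epic:

```scala
case class Span(begin: Int, end: Int) {
  // True when this span fully covers `other`.
  def covers(other: Span): Boolean = begin <= other.begin && other.end <= end
}
case class Constituent(label: String, span: Span)

// The largest constituent that the mention contains. No parent/child
// pointers needed, though pointers would avoid the linear scan.
def largestInside(constituents: Seq[Constituent], mention: Span): Option[Constituent] =
  constituents.filter(c => mention covers c.span)
    .sortBy(c => c.span.end - c.span.begin)
    .lastOption
```

With parent/child pointers on the constituents, the same query would be a constant-time hop instead of a scan over a bucket.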
> 2) What is the best way to represent relationships between Spans? We have a
> good mechanism for representing annotations of a particular span, but for
> things like coref or SRL we want to be able to relate two or more spans to
> one another. Maybe just pick one span to "host" the annotation and it has
> pointers to the other spans involved?

The "host" approach is pretty much what people do in UIMA. So a
predicate has pointers to its arguments. With coreference, either each
mention points to its antecedent, or there's an Entity annotation that
has pointers to all the EntityMention annotations. (What the begin and
end offsets are for the Entity is a point of contention...)
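The Entity-with-pointers variant might look like this -- a sketch with invented names, not UIMA's or Epic's actual types:

```scala
case class Span(begin: Int, end: Int)
case class EntityMention(span: Span, text: String)

// Host annotation: the Entity points at all of its mentions, rather than
// each mention pointing at its antecedent.
case class Entity(mentions: Seq[EntityMention]) {
  // One debatable answer to "what are the Entity's offsets": the hull of
  // its mentions. (The whole document is another candidate.)
  def span: Span =
    Span(mentions.map(_.span.begin).min, mentions.map(_.span.end).max)
}
```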
One other thing I guess I should mention is that various corpora have
annotations with discontinuous spans. I haven't thought at all about
what we'd have to do to handle that sensibly. (UIMA doesn't really
support it either.)
Steve
On Sat, Jun 7, 2014 at 2:35 PM, Steven Bethard <steven....@gmail.com> wrote:
>> Yeah. The big change is that AnalysisPipelines are no longer Function1. They
>> instead have a method def apply[Input <: RequiredInput](...). The new
>> andThen is kind of the only thing it could be, given the new definition.
>> One thing that we can't eloquently express right now is the conjunction of
>> two pipelines that have disjoint requirements. Scala is not going to figure
>> out that if I have x andThen y, where x needs Foo and produces Bar, and y
>> needs a Foo, Bar, and Baz to produce a Bam, then x andThen y expects a Foo
>> with Baz. I think we'd need to move to something else.
>
> Is that something else Spire? Would it handle this case? I don't know
> how important that case is to me, but it doesn't seem unreasonable...

Shapeless, yeah. I think it's possible to manually specify the types in the current approach, but the inference isn't powerful enough to do that itself, I think.

> The "host" approach is pretty much what people do in UIMA. So a
> predicate has pointers to its arguments. With coreference, either each
> mention points to its antecedent, or there's an Entity annotation that
> has pointers to all the EntityMention annotations. (What the begin and
> end offsets are for the Entity is a point of contention...)

I like that latter one. I do see the potential for problems with where the entity goes. I'd probably do the entire document? (Or, rather, the largest chunk that was fed to the coref system at one time.)

> One other thing I guess I should mention is that various corpora have
> annotations with discontinuous spans. I haven't thought at all about
> what we'd have to do to handle that sensibly. (UIMA doesn't really
> support it either.)

Ah. Yeah. Pointers probably work ok here, I guess?
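A pointer-based sketch of discontinuous spans, in the same host-annotation spirit -- names here are invented for illustration:

```scala
case class Span(begin: Int, end: Int)

// A discontinuous annotation as a host pointing at its pieces, e.g. the
// particle verb "rang ... up" annotated as a single unit.
case class DiscontinuousAnnotation(label: String, pieces: Seq[Span]) {
  // The text the annotation covers, skipping the gaps between pieces.
  def coveredText(content: String): String =
    pieces.sortBy(_.begin).map(p => content.substring(p.begin, p.end)).mkString(" ")
}
```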