Fwd: [scalanlp] Tree data type

23 views

Skip to first unread message

David Hall

unread,

Feb 12, 2015, 11:28:30 PM2/12/15

to scalanlp...@googlegroups.com, Simon Hafner

Simon, actually scalanlp-discuss is the right list, sorry. I'll reply to this email.

---------- Forwarded message ----------
From: Simon Hafner <hafne...@gmail.com>
Date: Wed, Feb 11, 2015 at 7:02 PM
Subject: [scalanlp] Tree data type
To: scal...@googlegroups.com

I'm trying to figure out a data type for trees. Epic currently encodes the position of a tree node as a Span object relative to the token sequence. So in the tree

| \

B C

Would be encoded as

Node(A, Span(0,2), List(Node(B, Span(0,1)), Node(C, Span(1,2)))

In slabs, there is not really a concept of a sentence or a sorted list, so the Spans can't reference any index. But the spans would allow to flatten the tree into a list and reconstruct it, which would make the serialization easier.

One idea (inspired by Jason) is to store the tokens in the leafs, but I don't know how exactly to store them, because each object is the same. Option[Token] maybe? But then there's the possibility of an invalid tree. It would also be impossible to flatten a tree without any further addition.

Another idea would be to use char offset spans instead. Each tree node has a span attribute which corresponds to which characters it stores. To retrieve the corresponding token on a leaf node, index the tokens by span (which is an int) and retrieve the corresponding token. This would also allow to flatten the tree and rely on the spans to reconstruct it.

What I've found on UIMA:

https://mail-archives.apache.org/mod_mbox/uima-user/200902.mbox/%3C498B9A...@linuxusers.de%3E

https://uima.apache.org/d/uimaj-2.6.0/apidocs/org/apache/uima/cas/text/AnnotationTree.html

There doesn't seem to be too much, except the tree nodes have the span attribute as char offset and no way to flatten the tree.

I'm leaning towards the second option with char offsets because it makes flattening possible, and the first option makes it easy to produce invalid trees where the leafs don't have the tokens.

What are your thoughts? What are the problems with either idea, or is it possible to combine the two?

--
You received this message because you are subscribed to the Google Groups "ScalaNLP" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scalanlp+u...@googlegroups.com.
To post to this group, send email to scal...@googlegroups.com.
Visit this group at http://groups.google.com/group/scalanlp.
For more options, visit https://groups.google.com/d/optout.

David Hall

unread,

Feb 12, 2015, 11:34:47 PM2/12/15

to scalanlp...@googlegroups.com, Simon Hafner

On Thu, Feb 12, 2015 at 8:28 PM, David Hall <david....@gmail.com> wrote:

Simon, actually scalanlp-discuss is the right list, sorry. I'll reply to this email.

---------- Forwarded message ----------
From: Simon Hafner <hafne...@gmail.com>
Date: Wed, Feb 11, 2015 at 7:02 PM
Subject: [scalanlp] Tree data type
To: scal...@googlegroups.com

I'm trying to figure out a data type for trees. Epic currently encodes the position of a tree node as a Span object relative to the token sequence. So in the tree

A
| \
B C

Would be encoded as
Node(A, Span(0,2), List(Node(B, Span(0,1)), Node(C, Span(1,2)))

The current data structure Tree[L] is in service of the underlying algorithms. It might make sense to add another data structure (Parse[L, W]?) that is more specific.

In slabs, there is not really a concept of a sentence or a sorted list, so the Spans can't reference any index. But the spans would allow to flatten the tree into a list and reconstruct it, which would make the serialization easier.

Trees really are about relationships between spans (subject to certain constraints). It might make more sense to extend slabs to have a notion of a dependency link between two spans. (here, parent constituent would be the head and its children the dependent.) This has come up before in our early discussion of slabs (Steven Bethard, I think.)

Maybe all spans of a given type have (0 or more) dependents of the same type? Maybe some of them are "roots" for a particular type, which are the only ones directly accessible by the Slab interface?

One idea (inspired by Jason) is to store the tokens in the leafs, but I don't know how exactly to store them, because each object is the same. Option[Token] maybe? But then there's the possibility of an invalid tree. It would also be impossible to flatten a tree without any further addition.

I would be ok with changing Tree[L] to Tree[L, W] and messing with the Tree hierarchy to support that.

Another idea would be to use char offset spans instead. Each tree node has a span attribute which corresponds to which characters it stores. To retrieve the corresponding token on a leaf node, index the tokens by span (which is an int) and retrieve the corresponding token. This would also allow to flatten the tree and rely on the spans to reconstruct it.