Tree data type

Simon Hafner

unread,

Feb 11, 2015, 10:02:33 PM2/11/15

to scal...@googlegroups.com

I'm trying to figure out a data type for trees. Epic currently encodes the position of a tree node as a Span object relative to the token sequence. So in the tree

A

| \

B C

Would be encoded as

Node(A, Span(0,2), List(Node(B, Span(0,1)), Node(C, Span(1,2)))

In slabs, there is not really a concept of a sentence or a sorted list, so the Spans can't reference any index. But the spans would allow to flatten the tree into a list and reconstruct it, which would make the serialization easier.

One idea (inspired by Jason) is to store the tokens in the leafs, but I don't know how exactly to store them, because each object is the same. Option[Token] maybe? But then there's the possibility of an invalid tree. It would also be impossible to flatten a tree without any further addition.

Another idea would be to use char offset spans instead. Each tree node has a span attribute which corresponds to which characters it stores. To retrieve the corresponding token on a leaf node, index the tokens by span (which is an int) and retrieve the corresponding token. This would also allow to flatten the tree and rely on the spans to reconstruct it.

What I've found on UIMA:

https://mail-archives.apache.org/mod_mbox/uima-user/200902.mbox/%3C498B9A...@linuxusers.de%3E

https://uima.apache.org/d/uimaj-2.6.0/apidocs/org/apache/uima/cas/text/AnnotationTree.html

There doesn't seem to be too much, except the tree nodes have the span attribute as char offset and no way to flatten the tree.

I'm leaning towards the second option with char offsets because it makes flattening possible, and the first option makes it easy to produce invalid trees where the leafs don't have the tokens.

What are your thoughts? What are the problems with either idea, or is it possible to combine the two?

Simon Hafner

unread,

Feb 13, 2015, 1:43:04 PM2/13/15

to scal...@googlegroups.com

> On Thu, Feb 12, 2015 at 8:28 PM, David Hall <> wrote:

>> I'm trying to figure out a data type for trees. Epic currently encodes the

>> position of a tree node as a Span object relative to the token sequence. So

>> in the tree

>>

>> A

>> | \

>> B C

>>

>> Would be encoded as

>> Node(A, Span(0,2), List(Node(B, Span(0,1)), Node(C, Span(1,2)))

>

> The current data structure Tree[L] is in service of the underlying

> algorithms. It might make sense to add another data structure (Parse[L, W]?)

> that is more specific.

Or epic.slab.Tree. Or reuse scalaz.Tree. So another entirely different type.

>> In slabs, there is not really a concept of a sentence or a sorted list, so

>> the Spans can't reference any index. But the spans would allow to flatten

>> the tree into a list and reconstruct it, which would make the serialization

>> easier.

>

> Trees really are about relationships between spans (subject to certain

> constraints). It might make more sense to extend slabs to have a notion of a

> dependency link between two spans. (here, parent constituent would be the

> head and its children the dependent.) This has come up before in our early

> discussion of slabs (Steven Bethard, I think.)

Are the two spans arranged in a context-free manner? Otherwise we're

drifting into graph territory. Also, is it possible to keep all data on

a node or is it required to be able to annotate the connections too?

> Maybe all spans of a given type have (0 or more) dependents of the same

> type? Maybe some of them are "roots" for a particular type, which are the

> only ones directly accessible by the Slab interface?

Yes, that's what I intend to do. The current interface of a parser

is sentence: Vector[String] => scalaz.Tree[T <: SpanAnnotation], so one

root node per sentence per parser.

>> One idea (inspired by Jason) is to store the tokens in the leafs, but I

>> don't know how exactly to store them, because each object is the same.

>> Option[Token] maybe? But then there's the possibility of an invalid tree. It

>> would also be impossible to flatten a tree without any further addition.

>

> I would be ok with changing Tree[L] to Tree[L, W] and messing with the Tree

> hierarchy to support that.

Yes, but I don't see how it would be possible to make the leaf nodes

another type. And it wouldn't really be checked by the compiler either.

>> Another idea would be to use char offset spans instead. Each tree node has

>> a span attribute which corresponds to which characters it stores. To

>> retrieve the corresponding token on a leaf node, index the tokens by span

>> (which is an int) and retrieve the corresponding token. This would also

>> allow to flatten the tree and rely on the spans to reconstruct it.

>

> That's currently how I've been doing. It's kind of clunky.

That would be the serialization format in the end. I'd prefer to be

able to infer the tree from a (ordered) list of SpanAnnotation. Also,

it has the benefit that I don't have to deal with the special case of

leaf nodes.

And imo each clunky interface can be made prettier by a few helpers

here and there.

David Hall

unread,

Feb 14, 2015, 12:41:00 PM2/14/15

to scal...@googlegroups.com

On Fri, Feb 13, 2015 at 10:43 AM, Simon Hafner <hafne...@gmail.com> wrote:

> On Thu, Feb 12, 2015 at 8:28 PM, David Hall <> wrote:
>> I'm trying to figure out a data type for trees. Epic currently encodes the
>> position of a tree node as a Span object relative to the token sequence. So
>> in the tree
>>
>> A
>> | \
>> B C
>>
>> Would be encoded as
>> Node(A, Span(0,2), List(Node(B, Span(0,1)), Node(C, Span(1,2)))
>
>
> The current data structure Tree[L] is in service of the underlying
> algorithms. It might make sense to add another data structure (Parse[L, W]?)
> that is more specific.
Or epic.slab.Tree. Or reuse scalaz.Tree. So another entirely different type.

I don't know why, but scalaz always makes me uneasy. They tend to use names and symbols that scare mere mortal programmers. Plus performance is usually awful.

>> In slabs, there is not really a concept of a sentence or a sorted list, so
>> the Spans can't reference any index. But the spans would allow to flatten
>> the tree into a list and reconstruct it, which would make the serialization
>> easier.
>
>
> Trees really are about relationships between spans (subject to certain
> constraints). It might make more sense to extend slabs to have a notion of a
> dependency link between two spans. (here, parent constituent would be the
> head and its children the dependent.) This has come up before in our early
> discussion of slabs (Steven Bethard, I think.)

Are the two spans arranged in a context-free manner? Otherwise we're
drifting into graph territory. Also, is it possible to keep all data on
a node or is it required to be able to annotate the connections too?

What's wrong with graphs?

You'll need data on links, in general, but that doesn't seem like a problem.

--
You received this message because you are subscribed to the Google Groups "ScalaNLP" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scalanlp+u...@googlegroups.com.
To post to this group, send email to scal...@googlegroups.com.
Visit this group at http://groups.google.com/group/scalanlp.
For more options, visit https://groups.google.com/d/optout.

Simon Hafner

unread,

Feb 14, 2015, 1:15:23 PM2/14/15

to scal...@googlegroups.com

2015-02-14 11:41 GMT-06:00 David Hall <david....@gmail.com>:
> On Fri, Feb 13, 2015 at 10:43 AM, Simon Hafner <hafne...@gmail.com>
> wrote:
>>
>> > On Thu, Feb 12, 2015 at 8:28 PM, David Hall <> wrote:
>> >> I'm trying to figure out a data type for trees. Epic currently encodes
>> >> the
>> >> position of a tree node as a Span object relative to the token
>> >> sequence. So
>> >> in the tree
>> >>
>> >> A
>> >> | \
>> >> B C
>> >>
>> >> Would be encoded as
>> >> Node(A, Span(0,2), List(Node(B, Span(0,1)), Node(C, Span(1,2)))
>> >
>> >
>> > The current data structure Tree[L] is in service of the underlying
>> > algorithms. It might make sense to add another data structure (Parse[L,
>> > W]?)
>> > that is more specific.
>> Or epic.slab.Tree. Or reuse scalaz.Tree. So another entirely different
>> type.
>
> I don't know why, but scalaz always makes me uneasy. They tend to use names
> and symbols that scare mere mortal programmers. Plus performance is usually
> awful.

Said tree doesn't use symbols, although a few names. And streams.
Also, performance might be an issue because it isn't optimized.

>> >> In slabs, there is not really a concept of a sentence or a sorted list,
>> >> so
>> >> the Spans can't reference any index. But the spans would allow to
>> >> flatten
>> >> the tree into a list and reconstruct it, which would make the
>> >> serialization
>> >> easier.
>> >
>> >
>> > Trees really are about relationships between spans (subject to certain
>> > constraints). It might make more sense to extend slabs to have a notion
>> > of a
>> > dependency link between two spans. (here, parent constituent would be
>> > the
>> > head and its children the dependent.) This has come up before in our
>> > early
>> > discussion of slabs (Steven Bethard, I think.)
>>
>> Are the two spans arranged in a context-free manner? Otherwise we're
>> drifting into graph territory. Also, is it possible to keep all data on
>> a node or is it required to be able to annotate the connections too?
>
> What's wrong with graphs?

Not too much. However, it's a different data structure which requires
a different approach. We could go full on GrAF, I'd be fine with that.
And then encode the trees with it. It does however kinda kill the
whole idea of slabs where you have different types. With GrAF, you can
connect everything with everything, which makes typechecking
impossible.

> You'll need data on links, in general, but that doesn't seem like a problem.

It's a different data type, kinda, or you get a Tuple2[Child, Link] in
your children list.

Simon Hafner

unread,

Feb 17, 2015, 10:58:58 AM2/17/15

to scal...@googlegroups.com

I also don't like the idea of referencing tokens because it makes it hard to serialize, since there are now two references to a single Token object, which requires ids or copying. Both require full view of the data instead of just localized serialization.

Simon Hafner

unread,

Feb 18, 2015, 12:03:04 AM2/18/15

to scal...@googlegroups.com

Could you give me an example where you would need a reference to the Tokens in the parsetree?

David Hall

unread,

Feb 18, 2015, 1:55:46 AM2/18/15

to scal...@googlegroups.com

I mean, printing the parse tree in PTB format, for sure. The equivalent of "covered" is probably sufficient.

On Tue, Feb 17, 2015 at 9:03 PM, Simon Hafner <hafne...@gmail.com> wrote:

Could you give me an example where you would need a reference to the Tokens in the parsetree?

--

Simon Hafner

unread,

Feb 18, 2015, 10:53:06 AM2/18/15

to scal...@googlegroups.com

2015-02-18 0:55 GMT-06:00 David Hall <david....@gmail.com>:
> I mean, printing the parse tree in PTB format, for sure. The equivalent of
> "covered" is probably sufficient.

I intend to have a Tree[T <: SpanAnnotation] where the SpanAnnotation
has char offset semantics.

David Hall

unread,

Feb 18, 2015, 1:20:45 PM2/18/15

to scal...@googlegroups.com

Should be fine then

Simon Hafner

unread,

Feb 18, 2015, 1:28:39 PM2/18/15

to scal...@googlegroups.com

2015-02-18 12:20 GMT-06:00 David Hall <david....@gmail.com>:
> On Wed, Feb 18, 2015 at 7:53 AM, Simon Hafner <hafne...@gmail.com> wrote:
>>
>> 2015-02-18 0:55 GMT-06:00 David Hall <david....@gmail.com>:
>> > I mean, printing the parse tree in PTB format, for sure. The equivalent
>> > of
>> > "covered" is probably sufficient.
>> I intend to have a Tree[T <: SpanAnnotation] where the SpanAnnotation
>> has char offset semantics.
> Should be fine then

Perfect. Any idea what to use as tree? Should I copy/paste the code
from the epic.trees.Tree over into slab, make it a masterTree and
derive epic.trees.Tree from that with token span semantics and
epic.slab.Tree via char offset span semantics?

David Hall

unread,

Feb 18, 2015, 1:34:06 PM2/18/15

to scal...@googlegroups.com

There's nothing in Tree that requires token semantics. You can probably just use it directly.

Simon Hafner

unread,

Feb 18, 2015, 1:42:50 PM2/18/15

to scal...@googlegroups.com

2015-02-18 12:34 GMT-06:00 David Hall <david....@gmail.com>:
> On Wed, Feb 18, 2015 at 10:28 AM, Simon Hafner <hafne...@gmail.com>
> wrote:
>> 2015-02-18 12:20 GMT-06:00 David Hall <david....@gmail.com>:
>> > On Wed, Feb 18, 2015 at 7:53 AM, Simon Hafner <hafne...@gmail.com>
>> > wrote:
>> >>
>> >> 2015-02-18 0:55 GMT-06:00 David Hall <david....@gmail.com>:
>> >> > I mean, printing the parse tree in PTB format, for sure. The
>> >> > equivalent
>> >> > of
>> >> > "covered" is probably sufficient.
>> >> I intend to have a Tree[T <: SpanAnnotation] where the SpanAnnotation
>> >> has char offset semantics.
>> > Should be fine then
>> Perfect. Any idea what to use as tree? Should I copy/paste the code
>> from the epic.trees.Tree over into slab, make it a masterTree and
>> derive epic.trees.Tree from that with token span semantics and
>> epic.slab.Tree via char offset span semantics?
> There's nothing in Tree that requires token semantics. You can probably just
> use it directly.

https://github.com/reactormonk/epic/blob/master/src/main/scala/epic/trees/Tree.scala#L42-L49

David Hall

unread,

Feb 18, 2015, 1:46:08 PM2/18/15

to scal...@googlegroups.com

Ah. Maybe rename to isContiguous?

Simon Hafner

unread,

Feb 18, 2015, 1:51:22 PM2/18/15

to scal...@googlegroups.com

2015-02-18 12:46 GMT-06:00 David Hall <david....@gmail.com>:
> Ah. Maybe rename to isContiguous?

Good with me. I would like a different data type for the Tree used in
the epic parsers which has the token span semantics vs. the tree used
in slabs, so the two are different to the typechecker.

epic.slab.utils.Tree // other name?
epic.slab.Tree extends epic.slab.utils.Tree with SpanAnnotation
epic.trees.Tree extends epic.slab.utils.Tree {
def span: Span
...
}

David Hall

unread,

Feb 18, 2015, 2:35:27 PM2/18/15

to scal...@googlegroups.com

makes sense. Another name might be good, but I don't have a great idea for that.

}

Simon Hafner

unread,

Feb 19, 2015, 11:07:21 PM2/19/15

to scal...@googlegroups.com

Nope, doesn't work without ugly type magic, so I'll have to duplicate
code a bit. What exactly speaks against using scalaz.Tree[T <:
SpanAnnotation] ?
http://docs.typelevel.org/api/scalaz/stable/7.0.3/doc/index.html#scalaz.Tree

I don't really like the idea of just copy/pasting out code from epic
for an n-ary tree. So copy/pasting or scalaz?

David Hall

unread,

Feb 21, 2015, 1:16:35 AM2/21/15

to scal...@googlegroups.com

Honestly, I mostly just don't like scalaz. But also the draw method is not the normal NLP/PTB standard.

What about the tree or dag structured span relationships idea? We'll need that for coreference resolution anyway.

-- David

Reply all

Reply to author

Forward