I'm trying to figure out a data type for trees. Epic currently encodes the position of a tree node as a Span object relative to the token sequence. So the tree
A
| \
B C
would be encoded as
Node(A, Span(0,2), List(Node(B, Span(0,1)), Node(C, Span(1,2))))
In slabs, there is no real concept of a sentence or a sorted list of tokens, so the Spans have no index to reference. But the spans would make it possible to flatten the tree into a list and reconstruct it, which would make serialization easier.
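To make the flattening idea concrete, here is a minimal sketch of how spans alone could carry the tree structure. The names `Span`, `Node`, and `FlatTree` are hypothetical stand-ins, not Epic's or the slab library's actual types, and the sketch assumes the flat list is kept in pre-order (each parent before its children).

```scala
// Hypothetical types for illustration; not Epic's or the slab library's API.
final case class Span(begin: Int, end: Int) {
  def contains(other: Span): Boolean = begin <= other.begin && other.end <= end
}
final case class Node(label: String, span: Span, children: List[Node] = Nil)

object FlatTree {
  // Pre-order traversal: every parent is emitted before its children.
  def flatten(n: Node): List[(String, Span)] =
    (n.label, n.span) :: n.children.flatMap(flatten)

  // Rebuild from a pre-order list: take one node, then keep taking children
  // for as long as the next span is contained in the current node's span.
  def rebuild(flat: List[(String, Span)]): Node = {
    def go(items: List[(String, Span)]): (Node, List[(String, Span)]) = items match {
      case (label, span) :: tail =>
        var rest = tail
        val kids = List.newBuilder[Node]
        while (rest.nonEmpty && span.contains(rest.head._2)) {
          val (child, remaining) = go(rest)
          kids += child
          rest = remaining
        }
        (Node(label, span, kids.result()), rest)
      case Nil =>
        throw new IllegalArgumentException("empty node list")
    }
    go(flat)._1
  }
}
```

For the example tree, `FlatTree.flatten` yields the nodes A, B, C with their spans, and `FlatTree.rebuild` nests B and C under A again because their spans fall inside Span(0,2). Whether the spans are token indices or something else does not matter to this logic, only that containment is well defined.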
One idea (inspired by Jason) is to store the tokens in the leaves, but I don't know exactly how to store them, because every node has the same type. Option[Token], maybe? But then invalid trees become possible, for example a leaf without a token. It would also be impossible to flatten the tree without some further addition.
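A rough sketch of what the Option[Token] variant might look like, and why it admits invalid trees. `Token`, `OptNode`, and `isValid` are hypothetical names used only for illustration.

```scala
// Hypothetical types for illustration only.
final case class Token(text: String)

final case class OptNode(label: String, token: Option[Token], children: List[OptNode] = Nil) {
  def isLeaf: Boolean = children.isEmpty

  // A tree is valid only if exactly the leaves carry tokens.
  def isValid: Boolean =
    if (isLeaf) token.isDefined
    else token.isEmpty && children.forall(_.isValid)
}

object OptNodeExample {
  val good: OptNode =
    OptNode("A", None, List(OptNode("B", Some(Token("b"))), OptNode("C", Some(Token("c")))))

  // This type-checks even though the leaf C has no token.
  val bad: OptNode =
    OptNode("A", None, List(OptNode("B", Some(Token("b"))), OptNode("C", None)))
}
```

The compiler accepts both values, so validity would have to be checked at runtime with something like `isValid`.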
Another idea would be to use char offset spans instead. Each tree node has a span attribute that records which characters it covers. To retrieve the token for a leaf node, index the tokens by span (which is an int) and look up the leaf's span. This would also make it possible to flatten the tree and rely on the spans to reconstruct it.
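A minimal sketch of the leaf lookup under the char offset scheme. `CharSpan`, `Tok`, `CharNode`, and `CharOffsetLookup` are hypothetical names, and the span is modelled as a plain (begin, end) pair here, which is an assumption rather than how the slab library actually represents it.

```scala
// Hypothetical types for illustration; the (begin, end) pair is an assumption.
final case class CharSpan(begin: Int, end: Int)
final case class Tok(text: String, span: CharSpan)

final case class CharNode(label: String, span: CharSpan, children: List[CharNode] = Nil) {
  // Leaves in left-to-right order.
  def leaves: List[CharNode] =
    if (children.isEmpty) List(this) else children.flatMap(_.leaves)
}

object CharOffsetLookup {
  // Index the tokens by their char span, then look up each leaf's span.
  // A leaf whose span matches no token yields None.
  def tokensOf(tree: CharNode, tokens: Seq[Tok]): List[Option[Tok]] = {
    val bySpan: Map[CharSpan, Tok] = tokens.map(t => t.span -> t).toMap
    tree.leaves.map(leaf => bySpan.get(leaf.span))
  }
}
```

For the text "Hi there", with tokens at char spans (0,2) and (3,8), a tree A over (0,8) with leaves B over (0,2) and C over (3,8) resolves B to "Hi" and C to "there". The tree itself never stores a token, so a mismatch shows up as a failed lookup rather than a malformed node.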
There doesn't seem to be much difference between the two, except that in the second option the tree nodes carry their span as a char offset, while the first option has no way to flatten the tree.
I'm leaning towards the second option with char offsets because it makes flattening possible, whereas the first option makes it easy to produce invalid trees where the leaves don't have tokens.
What are your thoughts? What are the problems with either idea, or is it possible to combine the two?