multiple tokens per word

Jens Grivolla

unread,
Apr 30, 2021, 9:47:21 AM
to dkpro-core-developers
Hi,

While working on the DKPro integration of UDPipe (in order to update to newer code and models), we found that UDPipe produces multiple tokens for words that include clitics, e.g. "moverse" in Spanish, which is decomposed into "mover" and "se" (the same happens in Portuguese, Catalan, etc.).

The original wrapper currently in DKPro produces tokens that crash the pipeline (IIRC it creates overlapping tokens, which are then not handled correctly by other components). As a fix, we changed the tokenizer not to produce the sub-tokens, but the result is incompatible with e.g. the UDPipe PoS-tagging models, leading to erroneous results.

So we are now looking for a way to correctly represent those sub-tokens within the DKPro typesystem. Since we don't actually have the offsets for the different parts of the original word, we were thinking of just arbitrarily setting the offsets so that all tokens form a non-overlapping sequence that together covers the original word, e.g. for a 10-character word with 3 sub-tokens, to have the first 4 characters covered by the first sub-token, the next 4 by the second, and the last 2 by the third sub-token.
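The offset arithmetic described above could be sketched as follows. This is a hypothetical helper, not part of DKPro; it just shows one simple scheme for dividing a word's character span into contiguous, non-overlapping sub-spans (here the remainder goes to the first sub-tokens, so a 10-character word with 3 parts yields spans of 4, 3, and 3 characters):

```python
def split_offsets(begin, end, n_parts):
    """Divide the span [begin, end) into n_parts contiguous,
    non-overlapping sub-spans whose union covers the whole span.

    Hypothetical sketch; the DKPro wrapper would use the resulting
    (begin, end) pairs as Token offsets for the sub-tokens.
    """
    length = end - begin
    base, extra = divmod(length, n_parts)
    spans = []
    start = begin
    for i in range(n_parts):
        # the first `extra` sub-spans get one extra character
        size = base + (1 if i < extra else 0)
        spans.append((start, start + size))
        start += size
    return spans
```

For a 10-character word this gives `split_offsets(0, 10, 3)` → `[(0, 4), (4, 7), (7, 10)]`: non-overlapping spans that together cover all 10 characters.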

Before we go ahead with this, I would just like to confirm that this is the "correct" way to go about it, and to ask whether there are other tokenizers in DKPro that already handle sub-tokens, so we can do it the same way.

Best,
Jens

Richard Eckart de Castilho

unread,
Apr 30, 2021, 9:53:28 AM
to dkpro-core...@googlegroups.com
Hi Jens,

> On 30. Apr 2021, at 15:47, Jens Grivolla <jens.g...@gmail.com> wrote:
>
> So we are now looking for a way to correctly represent those sub-tokens within the DKPro typesystem. Since we don't actually have the offsets for the different parts of the original word, we were thinking of just arbitrarily setting the offsets so that all tokens form a non-overlapping sequence that together covers the original word, e.g. for a 10-character word with 3 sub-tokens, to have the first 4 characters covered by the first sub-token, the next 4 by the second, and the last 2 by the third sub-token.

That approach won't work in all cases. E.g., if you have the word `à`, which is actually `a` + `a`, then you cannot subdivide the single character into the two words.

The approach that was designed to address this is to have multiple tokens at the same offsets, plus an "order" feature which says in which order to read/process them. It is described here:

https://github.com/dkpro/dkpro-core/issues/1152

As far as I can see, the "order" feature described in the issue has been introduced already, but so
far no component uses it.
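To illustrate the idea (a minimal sketch with a stand-in dataclass, not the actual DKPro Token type or its API): two syntactic words can share the same character span, and downstream components would sort by offset plus the order feature instead of offset alone.

```python
from dataclasses import dataclass

@dataclass
class Token:
    # Hypothetical stand-in for the DKPro Token annotation
    begin: int
    end: int
    form: str
    order: int  # reading order among tokens sharing the same offsets

# The word at offsets [5, 6) maps to two syntactic words
# occupying the same single-character span:
tokens = [
    Token(5, 6, "a", order=1),
    Token(5, 6, "a", order=0),
]

# Components read tokens sorted by (begin, order), not begin alone,
# so co-located tokens still have a well-defined sequence.
reading_order = sorted(tokens, key=lambda t: (t.begin, t.order))
```

The point is that offsets alone no longer define a total order once tokens may coincide, which is exactly what the "order" feature restores.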

Does that make sense to you?

Cheers,

-- Richard

Jens Grivolla

unread,
May 3, 2021, 12:41:13 PM
to dkpro-core-developers
Thanks Richard, that makes a lot of sense. I hadn't thought about the Portuguese case; for Spanish and Catalan, e.g., it is always possible to split the word into sub-spans.

Since we are also working with Portuguese, we will try the "order" feature, which appears to be available in 1.12.x, on which we are currently basing our work. We will also try to move to 2.x, but we'll have to see whether we can get all of our components to work. Luckily, a large part of our pipeline is already DKPro Core.

Best,
Jens
