Hi,
while working on the DKPro integration of UDpipe (in order to update to newer code and models) we found that UDpipe produces multiple tokens for words that include clitics. For example, "moverse" in Spanish is decomposed into "mover" and "se" (the same happens in Portuguese, Catalan, etc.).
The original wrapper as currently in DKPro produces tokens that crash the pipeline (IIRC it creates overlapping tokens which are then not correctly handled by other components). As a fix we changed the tokenizer to not produce the sub-tokens, but the result is incompatible with e.g. the UDpipe PoS-tagging models, leading to erroneous results.
So we are now looking for a way to correctly represent those sub-tokens within the DKPro typesystem. Since we don't actually have the offsets for the different parts of the original word, we were thinking of just arbitrarily setting the offsets so that all sub-tokens form a non-overlapping sequence that together covers the original word. For example, for a 10-character word with 3 sub-tokens, the first 4 characters would be covered by the first sub-token, the next 3 by the second, and the last 3 by the third.
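To make the idea concrete, here is a minimal sketch of the offset assignment in plain Java. The class and method names (SubTokenOffsets, split) are hypothetical and not part of DKPro or UDpipe; the sketch only computes begin/end pairs and leaves the creation of the actual token annotations out. It assumes the word has at least as many characters as sub-tokens and gives any leftover characters to the earliest sub-tokens.

```java
/** Hypothetical helper for illustration only; not a DKPro or UDpipe API. */
final class SubTokenOffsets {

    private SubTokenOffsets() {
        // static helper, no instances
    }

    /**
     * Splits the span [begin, end) into `parts` contiguous, non-overlapping
     * spans that together cover the whole span. Each row of the result is
     * {subBegin, subEnd}. The boundaries are arbitrary: they are NOT the
     * linguistic boundaries of the sub-tokens.
     */
    static int[][] split(int begin, int end, int parts) {
        int length = end - begin;
        if (parts < 1 || length < parts) {
            // Assumption of this sketch: every sub-token gets at least one character.
            throw new IllegalArgumentException(
                    "cannot split " + length + " characters into " + parts + " parts");
        }
        int base = length / parts;   // minimum size of every sub-span
        int extra = length % parts;  // the first `extra` sub-spans get one more character
        int[][] spans = new int[parts][];
        int cursor = begin;
        for (int i = 0; i < parts; i++) {
            int size = base + (i < extra ? 1 : 0);
            spans[i] = new int[] {cursor, cursor + size};
            cursor += size;
        }
        return spans;
    }
}
```

For "moverse" at offsets 0 to 7 with two sub-tokens, split(0, 7, 2) yields the spans 0-4 and 4-7, so "mover" would be attached to "move" and "se" to "rse". That illustrates the catch: the spans are consistent for downstream components, but the covered text no longer matches the sub-token form.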
Before we go ahead with this, I would like to confirm whether that's the "correct" way to go about it, and whether there are other tokenizers in DKPro that already handle sub-tokens, so that we can do it the same way.
Best,
Jens