Hi!
As noted previously, the orth column and the associated indices are regarded as non-essential for
the shared task. They are provided as is: we assumed this extra information
could prove useful to participants, but thoroughly checking
the contents was not deemed a high priority for AXOLOTL-24. Participants
are obviously welcome to do further preprocessing if they feel this is
relevant.
As for Finnish specifically:
1. some of the punctuation marks in the target words within the examples
of usage are intentional: they can indicate missing or reconstructed
segments of text. We therefore decided against removing them globally.
While commas, periods, colons and semicolons can likely be stripped
safely, it would be more reasonable to preserve quotes, square brackets
and hyphens.
2. as mentioned earlier, this
information has not been and will not be verified for the train set, owing to
the time it would take to fix the automatic alignment.
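
Regarding point 1, here is a minimal Python sketch of the conservative stripping described there, in case it helps. It is only an illustration, not something the organizers ran: the function name and the `STRIPPABLE` constant are hypothetical, and it assumes you want to drop commas, periods, colons and semicolons wherever they occur in a target word while leaving quotes, square brackets and hyphens untouched.

```python
# Hypothetical helper: not part of the AXOLOTL-24 data or tooling.
# Only the four marks listed as "likely safe" in point 1 are removed.
STRIPPABLE = ",.:;"

# Translation table that deletes every character in STRIPPABLE.
_DELETE_TABLE = str.maketrans("", "", STRIPPABLE)


def strip_safe_punctuation(target: str) -> str:
    """Remove commas, periods, colons and semicolons from a target word.

    Quotes, square brackets and hyphens are preserved, since they may
    mark missing or reconstructed segments of text.
    """
    return target.translate(_DELETE_TABLE)
```

If you would rather only trim these marks at the edges of the word, `target.strip(",.:;")` does that instead. Either way, any character indices you rely on would need to be recomputed after stripping.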
Best,
On behalf of the other organizers,
Timothee Mickus