Final scores and "Call for information"

Rob v

Sep 4, 2021, 5:58:34 AM
to MultiLexNorm
Dear participants,

Attached is the PDF with the results of the shared task (I have included MoNoise scores for reference). Some interesting results were obtained, and I am very interested in learning about your approaches!
Thank you all for participating!

I would like to ask all participants to e-mail the following information to multil...@gmail.com

team-name:
members + affiliation: 
short description of system for overview paper (1 paragraph):
Will publish code: yes/no
Used additional annotated normalization data: yes/no (and which?)
Other external resources used: 

We are still running the external evaluation, and will post the results here when available.
multilexnorm_results.pdf

Rob v

Sep 4, 2021, 6:03:38 AM
to MultiLexNorm
Please send the information as soon as possible, and at the latest three days before the paper deadline (September 19).

robvanderg

Sep 8, 2021, 6:58:29 AM
to MultiLexNorm
Updated results, now including the extrinsic evaluation, are attached. The ranking of the teams seems similar, but interestingly MFR ranks much higher compared to ERR.

On Saturday, 4 September 2021 at 12:03:38 UTC+2, ro...@itu.dk wrote:
multilexnorm_scores (4).pdf

Milan Straka

Sep 15, 2021, 7:58:12 AM
to robvanderg, MultiLexNorm
Hi Rob,

> -----Original message-----
> From: robvanderg <robva...@live.nl>
> Sent: 8 Sep 2021, 03:58
>
> Updated results, now including the extrinsic evaluation, are attached. The
> ranking of the teams seems similar, but interestingly MFR ranks much higher
> compared to ERR.

I have a question regarding the extrinsic evaluation, because I am not
sure what the reported LAS numbers are.

The MultiLexNorm web page states:

    As secondary evaluation, we will include an evaluation of the
    downstream effect of normalization. We will focus on dependency
    parsing, and include the raw input data with the distributed test data
    for some of the languages. Then, we train a dependency parser for each
    available language on canonical data, and evaluate the effect of
    having normalization versus the original data.

My expectation was that using the extrinsic evaluation treebanks:
- we pass the forms through the submitted systems (including LAI)
- we then run MaChAmp (trained on canonical data) to get the parses
- we then compute LAS using the predicted parses
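
For concreteness, by LAS I mean the standard labeled attachment score: the
percentage of words whose predicted head and dependency label both match the
gold tree. A minimal sketch of that computation -- assuming the gold and
predicted CoNLL-U files contain the same sentences with identical
tokenization, and with purely illustrative file handling -- would be:

    # Minimal LAS sketch; assumes gold and predicted CoNLL-U files contain
    # the same sentences with identical tokenization.
    def read_heads_deprels(path):
        sents, sent = [], []
        for line in open(path, encoding="utf-8"):
            line = line.rstrip("\n")
            if not line:
                if sent:
                    sents.append(sent)
                    sent = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():  # skip multiword tokens (3-4) and empty nodes (3.1)
                    sent.append((cols[6], cols[7]))  # (HEAD, DEPREL)
        if sent:
            sents.append(sent)
        return sents

    def las(gold_path, pred_path):
        gold, pred = read_heads_deprels(gold_path), read_heads_deprels(pred_path)
        correct = total = 0
        for gold_sent, pred_sent in zip(gold, pred):
            for (gh, gd), (ph, pd) in zip(gold_sent, pred_sent):
                total += 1
                correct += int(gh == ph and gd == pd)
        return 100.0 * correct / total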

If that were the case, the numbers are way too low -- for example:

- on it_postwita, LAI gives 66.49, but the MaChAmp paper
(https://arxiv.org/pdf/2005.14672.pdf) reports 74.9; the former is UD 2.8 and
the latter UD 2.6, but I still would not expect that much of a difference.
Also, UDPipe 2 trained on only the UD 2.6 it-postwita training data gives us
83.6 LAS.

- on it_twittiro, LAI gives 70.06, the MaChAmp paper reports 77.3, and UDPipe
2 trained solely on UD 2.6 it-twittiro gives 80.28. Note that it-twittiro is
nearly identical in UD 2.6 and UD 2.8.


Also, lexical normalization can change the number of words. How is the LAS
score then computed? I could imagine some kind of LCS alignment followed by an
F1 LAS score; but at least in the case of the above two treebanks, the gold
trees are annotated on the original data (i.e., before normalization), while
we would need gold trees on the normalized data.
(Ah -- there is some kind of merging in 1.machamp.pred.py, maybe that
handles it? But I am unsure how a consistent tokenization would be
obtained from the normalized text.)

One last nitpick: how are multiword tokens handled (are they passed as input
to the lexical normalization, or are the syntax trees remapped to tokens
instead of words)?

Thanks very much,
cheers,
Milan Straka




robvanderg

Sep 15, 2021, 10:14:40 AM
to MultiLexNorm
Hi Milan, 

Thanks for your questions, and apologies that we didn't make the extrinsic evaluation clearer (yet); the paper will contain a more in-depth description, and the full scripts are online for details as well.

In short, we opted to evaluate a domain-adaptation scenario, meaning that the model is not trained on the Twitter training data, but on the largest canonical UD treebank for each language instead. The exact treebanks can be found here: https://bitbucket.org/robvanderg/multilexnorm/src/78f48b8c1ae1ac12996f64f3c21344c0643d9103/extrEval/scripts/1.machamp.pred.py#lines-5

Of course, another obvious setting would be to train on Twitter data. However, our main reason not to do this right now is that it would then also make sense to normalize the training data first (because otherwise you are actually increasing the distance between training and test data, instead of decreasing it!), and this would complicate the setup of the shared task quite substantially.

> Also, lexical normalization can change the number of words. How is the LAS score then computed?
Very good question, and we have racked our brains about this as well. In the end, we opted to ignore the merging of words (as you found in 1.machamp.pred.py, where merges are "undone"), because merging is rare, we cannot check whether it is correct, and it is complex to fold merges into the LAS score. For splits, we decided to be lenient (following https://aclanthology.org/2021.eacl-main.200.pdf, which does this for POS tags) and count an arc as correct if it is linked to one of the correct subwords. This gives a small benefit to teams that split more, so in the paper we will definitely also report the number of splits. We also believe teams would probably not have gamed this, as they were unaware of this metric. The alignment of the splits happens in the evaluation script itself (2.extrTable.py).
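
To illustrate the idea, here is a rough sketch of the lenient scoring (this is
not the actual 2.extrTable.py code, and the alignment structure is only
hypothetical):

    # Sketch of lenient arc scoring for 1-to-N splits; NOT the actual
    # 2.extrTable.py implementation. `alignment` maps each gold word ID
    # (1-based, CoNLL-U style) to the set of predicted word IDs it was split
    # into; heads use 0 for the root.
    def lenient_las(gold_heads, gold_deprels, pred_heads, pred_deprels, alignment):
        correct = 0
        for g, (gh, gd) in enumerate(zip(gold_heads, gold_deprels), start=1):
            # the gold head counts as hit if the predicted head is any of its subwords
            head_ids = alignment[gh] if gh != 0 else {0}
            if any(pred_deprels[p - 1] == gd and pred_heads[p - 1] in head_ids
                   for p in alignment[g]):
                correct += 1
        return 100.0 * correct / len(gold_heads)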

> One last nitpick: how are multiword tokens handled (are they passed as input to the lexical normalization, or are the syntax trees remapped to tokens instead of words)?
Do you mean the multiword tokens in the annotated treebanks? For the treebanks that have them, they are removed, as MaChAmp does not support them: https://bitbucket.org/robvanderg/multilexnorm/src/78f48b8c1ae1ac12996f64f3c21344c0643d9103/extrEval/scripts/0.preprocess.sh#lines-7
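
For reference, dropping multiword tokens from a CoNLL-U file amounts to
something like the sketch below (illustrative only -- the actual removal is
done in 0.preprocess.sh):

    # Drop multiword-token range lines (IDs like "3-4") from a CoNLL-U file,
    # keeping the individual syntactic word lines they cover.
    import sys

    def strip_mwt(in_path, out_path):
        with open(in_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                cols = line.split("\t")
                if line.strip() and not line.startswith("#") and "-" in cols[0]:
                    continue  # skip the range line, e.g. "3-4 <tab> della ..."
                fout.write(line)

    if __name__ == "__main__":
        strip_mwt(sys.argv[1], sys.argv[2])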

Hope this clears it up a bit, and let me know if you have any more questions.

robvanderg

Sep 20, 2021, 9:36:39 AM
to MultiLexNorm
The link for the paper submission is: https://www.softconf.com/emnlp2021/W-NUT2021/

On Wednesday, 15 September 2021 at 07:14:40 UTC-7, robvanderg wrote:

Milan Straka

Sep 21, 2021, 6:27:39 AM
to robvanderg, MultiLexNorm
Hi Rob,

I realized I never responded -- so thank you for all your answers!

I had read the 1.machamp.pred.py script before, but did not realize the
models listed were not the Twitter models, sorry. Your decision definitely
makes sense, given that otherwise the model is trained on the unnormalized
data. Just for fun, I repeated the extrinsic evaluation with the two Italian
Twitter models (so the parsing happens with a UDPipe 2 model trained solely
on it-postwita and it-twittiro), and the results are much less consistent
with the intrinsic evaluation (with LAI being close to the top):

treebank & avg. & it-postwita & it-twittiro \\
cl-monoise & \textbf{77.58} & 78.35 & \textbf{76.81} \\
yvesscherrer-1 & 77.51 & 78.41 & 76.60 \\
LAI & 77.48 & \textbf{78.54} & 76.43 \\
maet-2 & 77.48 & 78.50 & 76.46 \\
learnML-1 & 77.48 & 78.50 & 76.46 \\
MFR & 77.44 & 78.32 & 76.56 \\
monoise & 77.42 & 78.34 & 76.50 \\
yvesscherrer-2 & 77.41 & 78.36 & 76.46 \\
bucuram-2 & 77.16 & 78.07 & 76.25 \\
bucuram-1 & 77.16 & 78.07 & 76.25 \\
davda54-1 & 77.13 & 78.14 & 76.11 \\
davda54-2 & 77.13 & 78.11 & 76.15 \\
machamp & 77.02 & 77.97 & 76.08 \\
maet-1 & 76.69 & 77.23 & 76.15 \\
thunderml-2$^*$ & 76.27 & 77.08 & 75.45 \\
team-1 & 76.27 & 77.08 & 75.45 \\
DiveshRK-2$^*$ & 76.24 & 76.95 & 75.52 \\
DiveshRK-1$^*$ & 76.24 & 76.95 & 75.52 \\
team-2 & 76.23 & 77.03 & 75.42 \\
learnML-2 & 76.22 & 76.93 & 75.52 \\
thunderml-1$^*$ & 76.10 & 76.95 & 75.24 \\

But I did not double-check my code, so there could be some bugs.

Cheers,
Milan Straka


Milan Straka

Sep 21, 2021, 6:32:05 AM
to robvanderg, MultiLexNorm
Hi,

> -----Original message-----
> From: robvanderg <robva...@live.nl>
> Sent: 20 Sep 2021, 06:36
>
> The link for the paper submission
> is: https://www.softconf.com/emnlp2021/W-NUT2021/

should we submit an anonymized version of the paper (so no URLs or model
names to be released, anonymous team names, etc.)? Or, given that we are
submitting a system-description paper, can it be non-anonymous?

We will be referring to our results, so some information can be deduced
from our CodaLab usernames anyway.

Thanks & cheers,
Milan Straka

robvanderg

Sep 21, 2021, 9:47:22 AM
to MultiLexNorm
Hi Milan, 

Thanks for sharing the results, that is interesting! The differences seem to be smaller, and the gains disappear. It would be interesting for future work to dig deeper in this direction (normalizing the training data, for example).

Please do not include your names yet, and use placeholders for the URLs, but team names are fine.

Best, 
Rob


On Tuesday, 21 September 2021 at 12:32:05 UTC+2, str...@ufal.mff.cuni.cz wrote: