Different results on CodaLab and in the Overview paper


Milan Straka

unread,
Oct 7, 2021, 5:10:20 PM10/7/21
to robvanderg, MultiLexNorm
Hi Rob,

we found out that the results of the davda54 (ÚFAL) submissions in
CodaLab and in the overview paper (and in the results sent to the
multilexnorm mailing list) are a bit different -- specifically:
- in CodaLab, davda54-1 avg ERR is 66.34, davda54-2 avg ERR is 67.42
- in the overview paper, davda54-1 is 66.21, davda54-2 is 67.30
Regarding the individual treebank results, some are the same, but
several are different.

Is this a known issue (i.e., the final evaluation is supposed to be
different to CodaLab)? In our paper, we present ablation experiments
based on the CodaLab evaluation, which are then inconsistent with the
official results.
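For context, the "avg ERR" figures above are Error Reduction Rates. A minimal sketch of the metric, assuming the standard definition used in lexical normalization shared tasks (accuracy gain over the leave-as-is baseline, normalized by that baseline's error rate); the function name and data layout here are illustrative, not taken from the official scorer:

```python
def err(gold, pred, raw):
    """Error Reduction Rate over parallel token lists.

    gold: reference normalizations, pred: system output,
    raw: the original (unnormalized) tokens, i.e. the
    leave-as-is (LAI) baseline simply copies `raw`.
    """
    assert len(gold) == len(pred) == len(raw)
    n = len(gold)
    correct_sys = sum(g == p for g, p in zip(gold, pred))
    correct_lai = sum(g == r for g, r in zip(gold, raw))
    if correct_lai == n:  # nothing needed normalization
        return 0.0
    # Fraction of the baseline's errors that the system fixed.
    return (correct_sys - correct_lai) / (n - correct_lai)

# Example: two of three tokens need normalization; the system fixes one.
raw  = ["u",   "hahahaha", "ok"]
gold = ["you", "haha",     "ok"]
pred = ["you", "hahahaha", "ok"]
print(err(gold, pred, raw))  # 0.5
```

This also illustrates why a data-version change in which interjections are normalized (or not) shifts ERR: it changes both which tokens count as baseline errors and which system outputs count as correct.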

Thanks,
cheers,
Milan Straka

robvanderg

unread,
Oct 8, 2021, 3:31:00 AM10/8/21
to MultiLexNorm
Hi Milan, 

Thanks for noticing! During the setup of CodaLab we noticed that there are minor differences between Python 2 and Python 3, so I assumed those caused the discrepancies. After taking a closer look, however, it seems our CodaLab setup used a slightly older version of the data (different people were in charge of the data updates and of CodaLab). The differences are minor and only concern interjections, where some were accidentally still normalized (hahahaha->haha is now not normalized anymore). I don't think this should change any conclusions, as most probably all systems make the same mistakes on the old version of the data.

Apologies for overlooking this! I just pushed the script that's used for generating the table to the repo: https://bitbucket.org/robvanderg/multilexnorm/src/master/scripts/mainTable.py -- so to get the correct numbers, you can clone the repo, put your submissions in the submission folder, and run the script.


On Thursday, October 7, 2021 at 23:10:20 UTC+2, str...@ufal.mff.cuni.cz wrote:

Milan Straka

unread,
Oct 8, 2021, 8:46:11 AM10/8/21
to robvanderg, MultiLexNorm
Hi Rob,

> -----Original message-----
> From: robvanderg <robva...@live.nl>
> Sent: 8 Oct 2021, 00:31

thanks a lot for the script, it made our work easy :-) We recomputed the ablation experiments (obtaining the same results for the two runs as in the official results), updated the ablation discussion (we originally saw a ~0.1 percentage point improvement from beam search, which turned out to be caused purely by the inconsistent evaluation across settings), and uploaded the final camera-ready version to SoftConf -- if it is still possible to use it, that would be great.

Thanks & cheers,
Milan Straka

