Baselines Trial Data + update scoring script

Els

Mar 24, 2010, 6:09:31 AM

to SemEval2010_Cross-Lingual Word Sense Disambiguation

Hi,

while running the scoring script on the Trial baselines,
I discovered some issues with the scoring results:

1. If you end the final system guess with ";"
(as specified in the documentation),
the scorer counts an extra "empty" guess.
I discussed this with Diana (from the lexical substitution task),
and we should indeed remove the final ";" from both the system
output and the gold standard files.

I will also run an extra check on all system files to make sure
that your test files are scored correctly.
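
To illustrate why the trailing ";" produces an extra guess, here is a minimal
Python sketch. It is not taken from the scorer itself; the function name
parse_guesses and the sample guesses are made up for illustration, and it
only assumes that guesses are separated by ";".

```python
def parse_guesses(field, strip_trailing=True):
    """Split a ';'-separated guess field into a list of guesses.

    Hypothetical helper, not part of the actual scorer.
    """
    if strip_trailing:
        # Drop surrounding whitespace and any final ';' before splitting.
        field = field.strip().rstrip(";")
    return [guess.strip() for guess in field.split(";")]


field = "oever;bank;"

# A naive split on ';' leaves an empty string after the final separator,
# which would then be counted as an additional (empty) guess.
print(parse_guesses(field, strip_trailing=False))  # ['oever', 'bank', '']

# Removing the final ';' first gives only the real guesses.
print(parse_guesses(field))  # ['oever', 'bank']
```

If you generate your output files with a script, a check like this on each
line is a quick way to confirm that no empty guess slips in.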

2. I made some other changes to the script (all documented in the
comments in the header of the script).
The most important are:

* fix for Unicode, as suggested by Simone
=> I checked the other occurrences of \w; they are only used
in regexps matching the English target word, so this should be fine.
=> I opted to keep Simone's fix instead of a conversion table,
so that two different words (differing only in their accents)
do not match.

* the evaluation is now case-insensitive
=> I noticed that for "bank" there are a lot of proper names in the
corpus,
but not all lemmatizers keep the initial uppercase character,
and I don't want to penalize lemmatizer errors.
=> This is the only case where duplication in the system output
is allowed
(e.g. Bank and bank)

* I've made a fix for Windows input, where scoring went wrong as well.
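
For those who pre-process their own files, the three points above can be
sketched roughly as follows in Python. This is only an illustration under my
own assumptions, not the scorer's code: the helper names (clean_line,
same_guess) and the sample words are hypothetical, and the stray "\r" shown
here is just one common way Windows line endings can upset a line-based
parser, not necessarily the problem that was fixed in the script.

```python
import re

# In Python 3, \w on a str pattern is Unicode-aware by default;
# re.ASCII restricts it to [a-zA-Z0-9_].
UNICODE_WORD = re.compile(r"\w+")
ASCII_WORD = re.compile(r"\w+", re.ASCII)


def clean_line(line):
    """Remove a trailing Unix (\\n) or Windows (\\r\\n) line ending."""
    return line.rstrip("\r\n")


def same_guess(a, b):
    """Compare two guesses ignoring case but keeping accents distinct."""
    return a.casefold() == b.casefold()


cote_coast = "c\u00f4te"      # French "cote" with circumflex: coast
cote_side = "c\u00f4t\u00e9"  # same letters plus acute accent: side

# An ASCII-only \w breaks an accented word into pieces.
print(ASCII_WORD.findall(cote_side))    # ['c', 't']
print(UNICODE_WORD.findall(cote_side))  # ['côté']

# Case differences are ignored, accent differences are not.
print(same_guess("Bank", "bank"))           # True
print(same_guess(cote_coast, cote_side))    # False

# A Windows line ending left in place sticks to the last guess.
raw = "bank;oever\r\n"
print(raw.rstrip("\n").split(";")[-1] == "oever")  # False
print(clean_line(raw).split(";")[-1] == "oever")   # True
```

Comparing on the accented forms directly, rather than mapping them through a
conversion table, is what keeps two different words from being treated as
the same guess.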

You can find the Trial baselines (Trial_Baselines.pdf)
and an update of the scorer on:
http://lt3.hogent.be/semeval/Trial/