Serious problem with scorer and unicode diacritics

Maarten van Gompel

Mar 17, 2010, 6:37:32 AM
to cross-li...@googlegroups.com
Hi Els, Veronique, co-participants,

I stumbled upon some serious problems whilst scoring the Spanish trial
data. First of all, some of the gold data is missing newline characters.
In occupation.es.gold there is no newline between occupation.n.es 9 and
occupation.n.es 10, and the same occurs for movement.n.es 9 and
movement.n.es 10. I didn't check all the other languages, so I can't rule
out that the same problem occurs in more files, but this error is easily
corrected once you are aware of it.
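
A quick way to look for this kind of merged line in other gold files could be sketched as follows. This is only a sketch under assumptions: the instance header format is modelled on the baseline output attached to this thread ("occupation.n.es 9 ::"), the sample gold answers with counts are made up for illustration, and the helper name `lines_with_merged_instances` is hypothetical.

```python
import re

# Assumed instance header, modelled on the baseline output in this thread,
# e.g. "occupation.n.es 9 ::". The real gold format may differ.
HEADER = re.compile(r"\S+\.\w+\.\w+ \d+ ::")


def lines_with_merged_instances(lines):
    """Return (line_number, header_count) for lines holding more than one instance."""
    problems = []
    for number, line in enumerate(lines, start=1):
        count = len(HEADER.findall(line))
        if count > 1:
            problems.append((number, count))
    return problems


# Made-up sample: the second line is missing a newline before instance 10.
sample = [
    "occupation.n.es 8 :: ocupación 3;",
    "occupation.n.es 9 :: ocupación 2;occupation.n.es 10 :: empleo 1;",
]
for number, count in lines_with_merged_instances(sample):
    print(f"line {number}: {count} instances on one line")
```

To check a real file, the lines could be read with `open(path, encoding="utf-8")` and passed to the same function.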

A far more serious issue is that the scorer doesn't play well with
Unicode (UTF-8). I hadn't noticed the problem before, as I was working
mostly on Dutch, which isn't rich in diacritics, but now that I'm
trying Spanish, the scorer seems broken. This will also affect other
languages with any kind of diacritics.

For occupation.n, I have a Spanish baseline generator which simply
predicts "ocupación" for all instances. However, when this is scored I
get a score of 0.00, even though the gold standard also predicts
ocupación in some instances!

It seems as if all system predictions with diacritics are silently
discarded. My hypothesis is confirmed when I bluntly remove diacritics
from both the gold and system file (ocupación => ocupacion), and scoring
proceeds as expected.

As diacritics are an essential part of most of the languages in this
task, this has a huge negative impact on the scores. In the
cross-lingual lexical substitution task, the gold standard simply omits
any kind of diacritics, and system output is expected to do the same,
circumventing this problem (although I don't think this is a very
elegant solution).

For your debugging purposes, I attach my baseline output for
occupation.es and a verbose log of the scorer (notice the lower number
for 'attempted': it skips precisely those attempts that would be correct).

How shall we proceed?

Regards,

--

Maarten van Gompel (Proycon)
Induction of Linguistic Knowledge Research Group, University of Tilburg

pro...@anaproy.nl
pro...@unilang.org

--------------------------------------------------------------------------
Personal Homepage: http://proycon.anaproy.nl
UniLang Language Community: http://www.unilang.org
--------------------------------------------------------------------------
JABBER: maar...@luon.net, AIM: proycon, YAHOO: proycon
MSN: pro...@anaproy.nl
--------------------------------------------------------------------------

occupation.es.baseline.best
scorer.log

Simone Paolo Ponzetto

Mar 17, 2010, 7:01:34 AM
to SemEval2010_Cross-Lingual Word Sense Disambiguation
Hi Maarten,

good timing: last night we also noticed a UTF-8 related problem
with the scorer. We printed the value of the variable $sub at line 324
of the scorer and noticed that words with diacritics, such as "gewächs",
get chopped down to "chs".

We fixed the error by changing the original regexp in line 322

if ($res =~ /(\w[\w\'-\s]+) (\d+)/) {

to a more relaxed one

if ($res =~ /(.+) (\d+)/) {
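
The effect of the two patterns on raw UTF-8 bytes can be reproduced outside the scorer. The sketch below is in Python rather than Perl and makes several assumptions: the input strings ("gewächs 3", "ocupación 3") are invented examples, not taken from the scorer's data; the hyphen in the character class is moved to the end because Python rejects the range as written in the Perl pattern; and Python's bytes patterns, where \w matches only ASCII word characters, stand in for a scorer that reads its input without decoding it.

```python
import re

# Python approximations of the two patterns from the scorer (line 322).
STRICT = re.compile(rb"(\w[\w'\s-]+) (\d+)")
RELAXED = re.compile(rb"(.+) (\d+)")


def capture(pattern, text):
    """Search the UTF-8 bytes of `text` and return the first captured group."""
    match = pattern.search(text.encode("utf-8"))
    return match.group(1).decode("utf-8") if match else None


for word in ("gewächs 3", "ocupación 3"):
    print(word, "| strict:", capture(STRICT, word), "| relaxed:", capture(RELAXED, word))
```

Under these assumptions the strict pattern captures only "chs" for "gewächs 3" and finds no match at all for "ocupación 3", while the relaxed pattern captures the whole word in both cases. Note that `.+` is greedy and accepts any characters, so it validates the input far less strictly than the original pattern did.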

Hope it helps, but please do check: after some years Perl is still an
esoteric language for me.

Best - Simone

On Mar 17, 11:37 am, Maarten van Gompel <proy...@anaproy.nl> wrote:
> [occupation.es.baseline.best, < 1K]
> occupation.n.es 1 :: ocupación;
> occupation.n.es 2 :: ocupación;
> occupation.n.es 3 :: ocupación;
> occupation.n.es 4 :: ocupación;
> occupation.n.es 5 :: ocupación;
> occupation.n.es 6 :: ocupación;
> occupation.n.es 7 :: ocupación;
> occupation.n.es 8 :: ocupación;
> occupation.n.es 9 :: ocupación;
> occupation.n.es 10 :: ocupación;
> occupation.n.es 11 :: ocupación;
> occupation.n.es 12 :: ocupación;
> occupation.n.es 13 :: ocupación;
> occupation.n.es 14 :: ocupación;
> occupation.n.es 15 :: ocupación;
> occupation.n.es 16 :: ocupación;
> occupation.n.es 17 :: ocupación;
> occupation.n.es 18 :: ocupación;
> occupation.n.es 19 :: ocupación;
> occupation.n.es 20 :: ocupación;
>
> [scorer.log, 1K]

Maarten van Gompel

Mar 17, 2010, 7:21:52 AM
to cross-li...@googlegroups.com
Hi Simone,

Thanks for your response. I checked and can confirm that your fix indeed
resolves the problem I had. I don't really know how Perl handles
Unicode, but I suppose \w doesn't match non-ASCII characters.
There seem to be a few more places where \w is used in regexps, and I'm
not sure whether those will cause problems elsewhere. But for now things
seem to be working again, and if the task organizers apply your fix as
well, everything will hopefully work out.
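
The supposition about \w can be illustrated with a small experiment. This is a Python sketch, not Perl: whether the scorer behaves the same way depends on whether it decodes its input as UTF-8, which this thread does not establish. The helper name `is_all_word_chars` is hypothetical.

```python
import re


def is_all_word_chars(token: str, decode_first: bool) -> bool:
    """Check whether \\w+ covers the whole token, on raw bytes or on decoded text."""
    raw = token.encode("utf-8")  # what a program sees when it reads bytes
    if decode_first:
        # Decoded text: \w matches Unicode letters, including ones with diacritics.
        return re.fullmatch(r"\w+", raw.decode("utf-8")) is not None
    # Raw bytes: \w matches only ASCII letters, digits and underscore.
    return re.fullmatch(rb"\w+", raw) is not None


print(is_all_word_chars("ocupacion", decode_first=False))
print(is_all_word_chars("ocupación", decode_first=False))
print(is_all_word_chars("ocupación", decode_first=True))
```

If the same holds for the scorer, decoding the input once when it is read would address every place where \w is used, rather than relaxing each regexp individually.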

Thanks,

--

Maarten van Gompel (Proycon)
ILK, University of Tilburg

pro...@anaproy.nl
pro...@unilang.org

--------------------------------------------------------------------------


Personal Homepage: http://proycon.anaproy.nl
UniLang Language Community: http://www.unilang.org
--------------------------------------------------------------------------

JABBER: maar...@luon.net, AIM: proycon, YAHOO: proycon
MSN: pro...@anaproy.nl
--------------------------------------------------------------------------

Els

Mar 17, 2010, 8:16:08 AM
to SemEval2010_Cross-Lingual Word Sense Disambiguation
Hi,

I also mainly tested the script on Dutch and didn't come across the
UTF-8 problem. I'm sorry for that!

We had decided not to throw out accented characters, because in some
cases they can distinguish between two meanings. I've now done a quick
check on the translations for the test words and, at first sight, didn't
find any translations that differ only in their accents.

So we will in any case ignore accents for the evaluation. For the time
being it's OK to use the fix proposed by Simone, but it is probably safer
to convert all accented characters to their unaccented counterparts, so
that we don't run into similar problems at other places in the script.
I will create a conversion table and update the Perl script accordingly,
and will send it around at the beginning of next week at the latest.

So for training/testing your systems, you can ignore all accents.
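
As an illustration of the conversion-table idea, here is a minimal Python sketch. It is not the table the organizers will distribute: the character list is an assumption covering only some common accented letters in the task languages, and the name `normalise_answer` is hypothetical.

```python
# Assumed, incomplete table of accented letters and their unaccented counterparts.
ACCENTED = "áàâäéèêëíìîïóòôöúùûüñç"
UNACCENTED = "aaaaeeeeiiiioooouuuunc"

# Cover upper case as well; every letter above has a single-character upper-case form.
TABLE = str.maketrans(ACCENTED + ACCENTED.upper(), UNACCENTED + UNACCENTED.upper())


def normalise_answer(text: str) -> str:
    """Replace every accented letter listed in the table; leave everything else alone."""
    return text.translate(TABLE)


print(normalise_answer("ocupación"))
print(normalise_answer("Gewächs"))
```

For the scores to be comparable, the same conversion would have to be applied to both the gold file and the system output. Characters missing from the table pass through unchanged, so the table needs to cover every accented character that occurs in the data.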

Best,
Els

Els

Mar 17, 2010, 8:36:08 AM
to SemEval2010_Cross-Lingual Word Sense Disambiguation
I will do a check on the newlines in the gold standard
for the test data, thanks for mentioning the problem!

Best,
Els

Maarten van Gompel

Mar 17, 2010, 9:10:03 AM
to cross-li...@googlegroups.com
Hi Els,

If you are looking for a simple conversion script to strip all
characters of their accents, I already have a small Python script
lying around that does precisely that, which may save you some work. You
will find it attached to this message. It takes one UTF-8 encoded file as
argument and prints ASCII output to stdout. In my personal opinion,
though, it's a less elegant solution than fixing the scorer using
Simone's fix.
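
The attached script itself is not reproduced in this thread. The following is only a sketch of how such a script could work, using Unicode decomposition from Python's standard library instead of an explicit table; the function names `strip_accents` and `strip_file` are hypothetical.

```python
import sys
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters, drop combining marks, and force the result to ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no ASCII base letter (for example "ß" or "ø") become "?".
    return without_marks.encode("ascii", errors="replace").decode("ascii")


def strip_file(path: str) -> None:
    """Read one UTF-8 encoded file and print its accent-free contents to stdout."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            sys.stdout.write(strip_accents(line))


print(strip_accents("ocupación"))
print(strip_accents("gewächs"))
```

Calling `strip_file(sys.argv[1])` would turn this into a command-line tool matching the description above: one UTF-8 file as argument, ASCII on stdout.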

By the way, I'm not sure if this has been asked already, but how many
different 'runs' of our system may we submit? Just one or are multiple
allowed?

Ciao,

--

Maarten van Gompel (Proycon)

stripaccents