Queries about the training data

Mishal Kazmi (Student)

Oct 14, 2015, 9:53:57 AM
to ists-s...@googlegroups.com, Peter Schüller
Hello,

I had some queries regarding the training data.

Firstly, the encoding has not been specified, so certain sentences are problematic.
Secondly, there may be an issue with the encoding altogether, such as:

Fran\C3 \A7 ois Hollande threatens legal action over affair claims
where \C3 and \A7 have a space between them

Or

Oh , little town of Bethlehem \E2

Or

[ Iran leader Rouhani ] [ says ] [ nuclear deal ] [ with U.S. ] [ possible ] [ within \C2 \BF three months\C2 ] [ \BF ]
where they have been chunked differently.
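
To make the symptom easy to reproduce, here is a minimal sketch that checks such lines for valid UTF-8. It assumes that the \XX notation above is just how a viewer displays raw bytes and that the rest of each line is plain ASCII; the helper names escapes_to_bytes and is_valid_utf8 are made up for this illustration and are not part of the task's tooling.

```python
import re

def escapes_to_bytes(text: str) -> bytes:
    """Turn display escapes such as \\C3 into the raw byte 0xC3.

    Assumption: everything outside the escapes is plain ASCII.
    """
    # Map each \XX escape to the character with that code point, then
    # encode as Latin-1 so that code point N becomes the single byte N.
    as_chars = re.sub(
        r"\\([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    return as_chars.encode("latin-1")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

samples = [
    r"Fran\C3 \A7 ois Hollande threatens legal action over affair claims",
    r"Oh , little town of Bethlehem \E2",
    r"Fran\C3\A7ois",  # what an intact sequence would look like
]
for sample in samples:
    print(is_valid_utf8(escapes_to_bytes(sample)), sample)
```

The first two samples are rejected because a multi-byte lead byte is followed by a space or by the end of the line; the third, with the two bytes adjacent, is accepted.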

Could you please provide us with more details?

Thank you,

Mishal Kazmi
PhD Student in Electronics Engineering
Human Language and Speech Technologies Lab
Sabanci University

Peter Schüller

Oct 15, 2015, 12:58:54 AM
to Interpretable STS Semeval Task, peter.s...@marmara.edu.tr
Dear Organizers,

Actually, I am quite sure there is a problem with the encoding of the training data: 0xC2 0xBF is the UTF-8 encoding of the inverted question mark (¿), and 0xC3 0xA7 is the UTF-8 encoding of the French c with cedilla (ç), as required in François.
These byte sequences are present in the training data, but each byte is followed by a space character, which should not be the case.

So if we assume the input is UTF-8, then it is broken UTF-8 because of the extra spaces.
If we assume the input is Latin-1, then instead of 'François' (with 0xE7 for the ç) we see 'FranÃ § ois' (0xC3 0x20 0xA7 0x20 instead of 0xE7).

This problem is present in both the .chunk.txt and the .txt files.
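
The two readings can be reproduced with a short sketch. The byte string below is typed in by hand from the description above, not read from the actual files, and drop_space_after_high_bytes is a hypothetical workaround, not the organizers' fix. It is only a heuristic: it would also delete a legitimate space that follows a correctly encoded non-ASCII character.

```python
import re

# Bytes as described above: "Fran", 0xC3, space, 0xA7, space, "ois".
broken = b"Fran\xc3 \xa7 ois Hollande"

# Read as Latin-1, every byte maps to some character, so decoding never
# fails, but the result is mojibake with stray spaces.
print(broken.decode("latin-1"))  # FranÃ § ois Hollande

# Read as UTF-8, the lead byte 0xC3 is followed by a space instead of a
# continuation byte, so strict decoding raises an error.
try:
    broken.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)

def drop_space_after_high_bytes(data: bytes) -> bytes:
    """Hypothetical repair: delete one space after every byte >= 0x80."""
    return re.sub(rb"([\x80-\xff]) ", rb"\1", data)

repaired = drop_space_after_high_bytes(broken)
print(repaired.decode("utf-8"))  # François Hollande
```

A truncated sequence, such as a lone 0xE2 at the end of a line, cannot be recovered this way because the continuation bytes are missing.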

Best Regards,
Peter Schüller

Eneko Agirre

Oct 15, 2015, 2:19:44 AM
to Peter Schüller, Interpretable STS Semeval Task, peter.s...@marmara.edu.tr

Thanks again. We'll look into that and get back to you when it is fixed.

best

eneko

--

Eneko Agirre
Euskal Herriko Unibertsitatea
University of the Basque Country
http://ixa2.si.ehu.eus/eneko

Interpretable STS Semeval Task

Oct 26, 2015, 1:17:06 PM
to Interpretable STS Semeval Task, schue...@gmail.com, peter.s...@marmara.edu.tr
Dear participants,

Thanks for your contributions regarding the training data. We have checked the encoding issues you mentioned, and the UTF-8 version of the training data is already available on the web site.

In case you have any further comments, please send them to the forum.

Peter Schüller

Oct 28, 2015, 8:52:03 AM
to Interpretable STS Semeval Task, schue...@gmail.com, peter.s...@marmara.edu.tr, ists-s...@googlegroups.com
Thank you very much, now the data looks perfect!

Best Regards,
Peter Schüller