to ists-s...@googlegroups.com, Peter Schüller
Hello,
I have some questions regarding the training data.
First, the encoding has not been specified, so certain sentences are problematic.
Second, there may be an issue with the encoding itself, for example:
Fran\C3 \A7 ois Hollande threatens legal action over affair claims
where \C3 and \A7 have a space between them
Or
Oh , little town of Bethlehem \E2
Or
[ Iran leader Rouhani ] [ says ] [ nuclear deal ] [ with U.S. ] [ possible ] [ within \C2 \BF three months\C2 ] [ \BF ] where they have been chunked differently.
Could you please provide us with more details?
Thank you,
Mishal Kazmi
PhD Student in Electronics Engineering Human Language and Speech Technologies Lab
Sabanci University
Peter Schüller
Oct 15, 2015, 12:58:54 AM
to Interpretable STS Semeval Task, peter.s...@marmara.edu.tr
Dear Organizers,
Actually, I am quite sure there is a problem with the encoding of the training data: 0xC2 0xBF is the UTF-8 encoding of the inverted question mark (¿) and 0xC3 0xA7 is the UTF-8 encoding of the French ç, as required in François. These byte sequences are present in the training data, but there is a space character after each byte, which should not be the case.
So if we assume the input is UTF-8, then it is broken UTF-8 because of the extra spaces. If we assume the input is Latin-1, then instead of 'François' (with 0xE7 for the ç) we see 'FranÃ § ois' (0xC3 0x20 0xA7 0x20 instead of 0xE7).
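For anyone who wants to check these byte values themselves, here is a minimal Python sketch. The byte string `broken` is typed in by hand from the example quoted earlier in the thread; it is an assumption about what the file contains, not something read from the actual training data.

```python
# Illustrative check of the byte values discussed above.
c_cedilla = "ç"
inverted_q = "¿"

utf8_c = c_cedilla.encode("utf-8")      # two bytes: 0xC3 0xA7
utf8_q = inverted_q.encode("utf-8")     # two bytes: 0xC2 0xBF
latin1_c = c_cedilla.encode("latin-1")  # one byte: 0xE7

# Hand-typed stand-in for the reported sequence: a space (0x20) after each byte.
broken = b"Fran\xc3 \xa7 ois"

# Read as Latin-1, every byte maps to some character, so decoding
# succeeds but produces the wrong text.
as_latin1 = broken.decode("latin-1")

# Read as UTF-8, the lead byte 0xC3 must be followed directly by a
# continuation byte, not by a space, so decoding fails.
try:
    broken.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_c.hex(), utf8_q.hex(), latin1_c.hex())
print(as_latin1)
print(utf8_ok)
```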
This problem is present in both the .chunk.txt and the .txt files.
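Until corrected files are available, a workaround along the following lines might help. This is only a sketch under the assumption that every byte of a two- or three-byte UTF-8 sequence is followed by exactly one stray space; the function name `repair_spaced_utf8` is hypothetical and the sample bytes are typed in by hand rather than read from the task files.

```python
import re

# Assumption: each byte of a multi-byte UTF-8 sequence is followed by one
# stray space (0x20). Only 2- and 3-byte sequences are handled here.
_BROKEN_SEQ = re.compile(
    rb"([\xC2-\xDF]) ([\x80-\xBF]) ?"                  # lead + 1 continuation
    rb"|([\xE0-\xEF]) ([\x80-\xBF]) ([\x80-\xBF]) ?"   # lead + 2 continuations
)


def repair_spaced_utf8(data: bytes) -> bytes:
    """Remove the stray spaces inside and directly after spaced-out sequences."""
    def join(match):
        # Keep only the captured bytes of whichever alternative matched.
        return b"".join(g for g in match.groups() if g is not None)
    return _BROKEN_SEQ.sub(join, data)


sample = b"Fran\xc3 \xa7 ois Hollande threatens legal action over affair claims"
fixed = repair_spaced_utf8(sample).decode("utf-8")
print(fixed)
```

Truncated sequences (such as a lone 0xE2 at the end of a line) and sequences split across chunk brackets are left untouched by this sketch, so the result may still not decode as UTF-8 for every line.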
Best Regards, Peter Schüller
Eneko Agirre
Oct 15, 2015, 2:19:44 AM
to Peter Schüller, Interpretable STS Semeval Task, peter.s...@marmara.edu.tr
thanks again. we'll look into that and get back to you when it's fixed
best
eneko
On 10/15/2015 at 06:58 AM, Peter Schüller wrote:
to Interpretable STS Semeval Task, schue...@gmail.com, peter.s...@marmara.edu.tr
Dear participants,
thanks for your contributions regarding the training data. We have checked the encoding issues you mentioned, and the UTF-8 version of the training data is already available on the web site.
In case you have any further comments, please send them to the forum.
Peter Schüller
Oct 28, 2015, 8:52:03 AM
to Interpretable STS Semeval Task, schue...@gmail.com, peter.s...@marmara.edu.tr, ists-s...@googlegroups.com