Some issues with trial data

Maarten van Gompel

Nov 11, 2009, 10:35:43 AM

to SemEval2010.CrossLingualLexicalSubstitution

Dear organisers,

I wanted to bring to your attention some technical issues I found
with the trial data, clls.trial.data. I notice you make extensive use
of XML entities, but some of them are broken, possibly by a
tokenisation process. As a result, the XML does not validate, and
strict XML parsers may fail when processing it. See for example instance
161; the same problem occurs in several other spots as well.
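
A minimal sketch of how such breakage can be located, assuming Python and its standard xml.etree.ElementTree parser (the function name find_xml_error is made up for illustration and is not part of any task tooling). It reports the position of the first well-formedness error:

```python
import xml.etree.ElementTree as ET


def find_xml_error(text):
    """Return (line, column, message) for the first parse error, or None.

    `text` may be a str or bytes object holding the whole XML document.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        line, column = err.position
        return line, column, str(err)
    return None
```

The parser stops at the first error, so later problems only show up once the earlier ones have been fixed.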

Secondly, there are some encoding issues. It seems something went
wrong here, as the text is not valid UTF-8, ISO-8859-1, cp1252, or any
other encoding I recognize. Look for example at instances 16, 100,
112, 133, 190, and several others.
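
As a rough sketch of how affected lines can be found, assuming Python and a file read in binary mode (undecodable_lines is a hypothetical helper name), one can try to decode each line strictly and collect the line numbers that fail:

```python
def undecodable_lines(raw, encoding="utf-8"):
    """Return the 1-based numbers of lines in `raw` (bytes) that fail to decode."""
    bad = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode(encoding)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            bad.append(number)
    return bad
```

This only helps for strict encodings such as UTF-8. Decoding as ISO-8859-1 never fails in Python, because every byte maps to some character, so for that encoding the result has to be judged by eye.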

Kind regards,

--

Maarten van Gompel (Proycon)
Induction of Linguistic Knowledge Research Group
University of Tilburg

Diana McCarthy

Nov 12, 2009, 4:28:47 AM

to clls...@googlegroups.com

Dear Maarten

Both issues arose because we extracted the data directly from the English
Internet Corpus (produced by Serge Sharoff) and haven't changed the XML
from that. We are discussing what to do about this and will let you know
ASAP.

very best

Diana

--

===========================================================================
Diana McCarthy, http://www.dianamccarthy.co.uk/
Lexical Computing Ltd. http://www.sketchengine.co.uk/
===========================================================================

Diana McCarthy

Nov 13, 2009, 2:49:26 AM

to clls...@googlegroups.com

Hi Maarten

As I mentioned, these XML and encoding issues arose because we have left
the original corpus data (http://corpus.leeds.ac.uk/internet.html) as it
was, without any further cleaning of the sentences. We will clean the
data for the test release. We will not change the release of trial data
because it has already been released; however, we will mention the issue
on our task web site and perhaps release a script for fixing some of the
XML errors so that the file validates.
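
A fix-up of this kind could look roughly like the sketch below. It is not the organisers' script. It assumes Python, and it assumes that the damage consists of whitespace inserted inside entities by tokenisation (for example "& amp ;" instead of "&amp;"), which the thread does not confirm. The name rejoin_entities is hypothetical.

```python
import re

# Assumed damage pattern: optional whitespace between "&", the entity
# name or character reference, and the closing ";".
_SPLIT_ENTITY = re.compile(
    r"&\s*(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+)\s*;"
)


def rejoin_entities(text):
    """Collapse whitespace inside the predefined XML entities and
    numeric character references; everything else is left untouched."""
    return _SPLIT_ENTITY.sub(r"&\1;", text)
```

A bare "&" that is not followed by one of these names is left as it is, so a file may still need further repair before it parses.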

very best wishes

Diana

Maarten van Gompel

Nov 13, 2009, 4:09:11 AM

to clls...@googlegroups.com

Hi Diana,

Thank you very much, that indeed seems a good solution.

Regards,
