Dear all,
The evaluation period has started! The test data is available at the following address:
http://nlp.uned.es/weps/weps2/WePS2_test_data.zip
Please send your system's output in a zip file to the organizers' address:
weps-or...@lsi.uned.es
Remember that the deadline for this submission is the 8th of December. The evaluation results for each
team will be sent back on the 17th of December.
In the test data you will find metadata files describing the documents and the web pages for each name.
* Metadata
Each XML file contains the metadata of the top 150 search results for one name.
It includes the URL, title, rank number (starting at 1) and MIME type of each document.
Not all documents in the metadata files are part of the WePS-2 corpus. The attribute "inWepsCorpus"
on each "doc" element indicates whether the referenced document is included or not. Documents that are
not included will not be evaluated in either the attribute extraction or the clustering task.
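As a minimal sketch, the "inWepsCorpus" flag could be used to select the documents that will be evaluated as shown below. Only the "doc" element and the "inWepsCorpus" attribute are named in this message; the root element, the "rank" attribute name, and the value "true" are assumptions made for illustration, so please check one of the distributed files for the exact layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the root element, the "rank" attribute name and the
# "true"/"false" values are assumed, not taken from the real metadata files.
SAMPLE = """<corpus>
  <doc rank="1" inWepsCorpus="true"/>
  <doc rank="2" inWepsCorpus="false"/>
  <doc rank="3" inWepsCorpus="true"/>
</corpus>"""


def ranks_in_corpus(xml_text, true_values=("true", "yes", "1")):
    """Return the ranks of the "doc" elements flagged as part of the corpus."""
    root = ET.fromstring(xml_text)
    ranks = []
    for doc in root.iter("doc"):
        # Normalize the flag so that "True" or " true " are also accepted.
        flag = doc.get("inWepsCorpus", "").strip().lower()
        if flag in true_values:
            ranks.append(int(doc.get("rank")))
    return ranks


evaluated_ranks = ranks_in_corpus(SAMPLE)
```

Documents whose rank is not in the returned list can be skipped in both tasks.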
* Web Pages
The web pages directory contains all the documents downloaded from the search results of each person name.
Documents are named according to their position in the ranking (001.html, 002.html). In many cases the list
of files skips numbers from the original ranking. This is because not all documents have been downloaded and
included in the corpus. Only HTML and plain text documents were downloaded, and documents not containing
the query string (the person name) were ignored too. In some cases the document could not be downloaded or the
server was unavailable.
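Since the file names encode the rank, the set of documents actually present for a name can be recovered from the directory listing. The following is only a sketch: the function names are hypothetical, and it assumes every document file is named with a numeric stem followed by an extension, as in the examples above.

```python
import re
from pathlib import Path

# Matches names such as "001.html"; the numeric stem is the rank.
# Any other file in the directory is ignored.
RANK_FILE = re.compile(r"^(\d+)\.\w+$")


def ranks_from_filenames(names):
    """Map file names like "001.html" to integer ranks, sorted ascending."""
    ranks = []
    for name in names:
        match = RANK_FILE.match(name)
        if match:
            ranks.append(int(match.group(1)))
    return sorted(ranks)


def ranks_in_directory(directory):
    """List the ranks of the documents found in one person-name directory."""
    return ranks_from_filenames(p.name for p in Path(directory).iterdir())
```

Gaps in the returned list correspond to search results that were not downloaded or not included in the corpus.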
* Clustering task
- The format of your system output must be the same as that found in the
training data. For instance:
<clustering>
<entity id="0">
<doc rank="0"/>
<doc rank="1"/>
<doc rank="3"/>
<doc rank="4"/>
</entity>
<entity id="1">
<doc rank="5"/>
....
- One XML file is expected for each clustering problem (person name).
Please check that your XML is well-formed so that it can be parsed. Name the files using
the person name in uppercase, as in "AMANDA_LENTZ.xml".
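One possible way to produce the clustering file and check that it parses is sketched below. It emits only the elements shown in the example above ("clustering", "entity", "doc"); the function names and the input representation (a list of clusters, each a list of ranks) are assumptions for illustration, and the training data remains the reference for the format.

```python
import xml.etree.ElementTree as ET


def clustering_to_xml(clusters):
    """Serialize a list of clusters (each a list of ranks) to XML text."""
    root = ET.Element("clustering")
    for entity_id, ranks in enumerate(clusters):
        entity = ET.SubElement(root, "entity", id=str(entity_id))
        for rank in ranks:
            ET.SubElement(entity, "doc", rank=str(rank))
    return ET.tostring(root, encoding="unicode")


def output_filename(person_name):
    """Build a file name such as "AMANDA_LENTZ.xml" from "Amanda Lentz"."""
    return "_".join(person_name.split()).upper() + ".xml"


xml_text = clustering_to_xml([[0, 1, 3, 4], [5]])
# Parsing the result back is a cheap well-formedness check before submission;
# ET.fromstring raises ParseError if the text is not well-formed.
parsed = ET.fromstring(xml_text)
```

Building the file with an XML library rather than by string concatenation avoids unclosed tags and unescaped characters.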
* Attribute Extraction Task
- Create your system output as directed in Section 6 of the task
guideline (or the same format as the training data). Information about
each name should be in a separate file, which has a name like
"Alexander_Macomb.txt".
- All the files should be in a single directory. The name of the
directory should be the task name and your site name (e.g. AE_NYU;
change NYU to your site name). Please make a zip/tar/tgz file (e.g.
AE_NYU.tgz) which contains the directory. Then submit it by e-mail to
the organizers.
- Note that the training data includes an "Education" attribute, which was
later changed to "school", "major" and "degree". We apologize that we
did not have time to update the training data accordingly.
- Information about "pages to ignore" (Section 3 in the guideline) will
be distributed after the evaluation. Please create your data for all
pages. In other words, you don't have to detect which pages to ignore by
yourself. We will not evaluate the pages to ignore, even if you produce
output for them.
- After submission, we will check our gold data against the system outputs.
However, we expect that checking it against all the outputs would be too
time-consuming. We are likely to check only against output that was
produced by at least two systems.
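For the packaging step of the attribute extraction submission, a sketch along the following lines could be used. The function name is hypothetical; it assumes the per-name files are already in a directory such as AE_NYU and writes the archive next to that directory.

```python
import tarfile
from pathlib import Path


def package_submission(directory, archive_path=None):
    """Create a .tgz archive that contains the submission directory itself."""
    directory = Path(directory)
    if archive_path is None:
        # "AE_NYU" becomes "AE_NYU.tgz" in the same parent directory.
        archive_path = directory.with_suffix(".tgz")
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps "AE_NYU/..." as the top-level entry in the archive.
        archive.add(directory, arcname=directory.name)
    return Path(archive_path)
```

Calling `package_submission("AE_NYU")` would then produce AE_NYU.tgz, ready to be attached to the submission e-mail.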
Javier Artiles (on behalf of the WePS organizers)