we know it's a lot of work, and we appreciate all your efforts.
The result submission deadline for both tasks is now June 23, 2010.
The Web pages have been updated accordingly (mind your browser cache).
Martin
--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc
Martin Potthast wrote:
: we know it's a lot of work, and we appreciate all your efforts.
: The result submission deadline for both tasks is now June 23, 2010.
May I kindly request that for any future PAN competitions
the deadline NOT be extended?
After all, the training corpus has been open for several months now,
so any possible improvements in _software_ detection (as opposed to
manual post-processing or whatever) should have been implemented and tested
a long time ago.
The deadline extension will help only those competitors
who have very slow and inefficient software(*), or who manually post-process
or tune the results. Even I have been able to make some marginal manual
improvements, which goes totally against the purpose of _software_ plagiarism
detection.
The availability timespan of the competition corpus is the only
means of comparing the run time amongst the competitors, and extending
the deadline has lowered the weight given to the speed of the detection
software even further.
Here is what I would suggest for next year in order to improve
the competition:
- create the competition corpus the same way as the training corpus
was made. If I understand it correctly, this year the software
plagiarist has been improved/modified since the training corpus
was released, and can now create new types of plagiarism,
which we did not have a chance to test for earlier.
- have the training and competition corpus in exactly the same format:
a single ZIP file, the same directory structure,
not the part1-partN mess that the training corpus is now.
It would allow us to fully script the detection software.
Evaluating the competition corpus would then just require
running the top-level script with another ZIP file as an argument.
- the time span between the competition corpus release and the submission
deadline should be a day or two (one week as an _absolute_ maximum),
if we want the results to bear any significance for real-world
systems. Spending several weeks evaluating only 20 thousand
documents is simply ridiculous. Real-world production systems
should be able to handle millions of documents efficiently,
not choke on 20,000 documents.
- separate the intrinsic and external detection (and possibly compute
the mean of the results), as discussed earlier.
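
The single-ZIP workflow suggested in the list above could be sketched
roughly as follows. This is only an illustration in Python, not any
participant's actual system: the names list_documents, detect and main
are hypothetical, the assumption that documents are *.txt files is a
guess, and detect is a placeholder where real detection software
would go.

```python
import sys
import tempfile
import time
import zipfile
from pathlib import Path


def list_documents(corpus_zip, workdir):
    """Unpack the corpus ZIP into workdir and return its .txt files, sorted."""
    with zipfile.ZipFile(corpus_zip) as archive:
        archive.extractall(workdir)
    # Assumption: every document is a .txt file somewhere below the root.
    return sorted(Path(workdir).rglob("*.txt"))


def detect(documents):
    """Hypothetical placeholder for the detector; reports no detections."""
    return {doc.name: [] for doc in documents}


def main(argv):
    """Run the whole pipeline on one corpus ZIP and report the wall time."""
    if len(argv) != 2:
        print("usage: run_detection.py CORPUS.zip", file=sys.stderr)
        return 2
    start = time.monotonic()
    with tempfile.TemporaryDirectory() as workdir:
        documents = list_documents(argv[1], workdir)
        results = detect(documents)
    elapsed = time.monotonic() - start
    print(f"{len(results)} documents processed in {elapsed:.2f} s")
    return 0
```

A top-level script would simply end with sys.exit(main(sys.argv)), so that
the training corpus and the competition corpus differ only in the ZIP file
name given on the command line, and the printed wall time gives a directly
comparable run time.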
Thanks for reading my overly long e-mail :-)
-Jan Kasprzak
(*) <sarcasm>Several _weeks_ of run time? Do you guys use Java
or whatever?</sarcasm>
--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
Please don't top post and in particular don't attach entire digests to your
mail or we'll all soon be using bittorrent to read the list. --Alan Cox