Deadline extension until June 23, 2010

Martin Potthast

Jun 14, 2010, 4:17:05 AM
to pan-workshop-series
Dear participants,

we know it's a lot of work, and we appreciate all your efforts.

The result submission deadline for both tasks is now June 23, 2010.

The Web pages have been updated accordingly (mind your browser cache).

Martin


--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc

Jan Kasprzak

Jun 15, 2010, 8:44:40 AM
to pan-works...@googlegroups.com
Martin, and all participants,

Martin Potthast wrote:
: we know it's a lot of work, and we appreciate all your efforts.


: The result submission deadline for both tasks is now June 23, 2010.

Can I kindly request that in future PAN competitions
the deadline NOT be extended?

After all, the training corpus has been open for several months now,
so any possible improvements in _software_ detection (as opposed to
manual post-processing or whatever) should have been implemented and tested
a long time ago.

The deadline extension will help only those competitors
who have very slow and inefficient software(*), or who manually post-process
or tune the results. Even I have been able to make some marginal manual
improvements, which goes totally against the purpose of _software_ plagiarism
detection.

The availability timespan of the competition corpus is the only
means of comparing the run time amongst the competitors, and by extending
the deadline the weight of the speed of the detection software has now been
even further lowered.

What I would suggest for next year in order to improve
the competition:

- create the competition corpus the same way as the training corpus
was made. If I understand it correctly, this year the software
plagiarist has been improved/modified since the training corpus
was released, and can now create new types of plagiarism,
which we did not have a chance to test for earlier.

- have the training and competition corpus in exactly the same format:
a single ZIP file with the same directory structure,
not the part1-partN mess that the training corpus is now.
It would allow us to fully script the detection software.
Evaluating the competition corpus would then only require
running the top-level script with another ZIP file as an argument.

- the time span between the competition corpus release and the submission
deadline should be a day or two (one week as an _absolute_ maximum),
if we want the results to bear any significance for real-world
systems. Spending several weeks evaluating only 20 thousand
documents is simply ridiculous. Real-world production systems
should be able to handle millions of documents efficiently,
not choke on 20,000 documents.

- separate the intrinsic and external detection (and possibly compute
the mean of the results), as discussed earlier.
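
A minimal sketch of what such a top-level script might look like, assuming the corpus arrives as a single ZIP file containing plain-text documents. The `detect` function is a hypothetical placeholder for the actual detection software, and the `*.txt` pattern is an assumption about the corpus layout, not something specified in this thread:

```python
import sys
import tempfile
import zipfile
from pathlib import Path


def detect(text):
    """Hypothetical placeholder: return the detections for one document."""
    return []


def run(corpus_zip):
    """Unpack one corpus ZIP and run detection on every .txt file in it."""
    results = {}
    with tempfile.TemporaryDirectory() as workdir:
        with zipfile.ZipFile(corpus_zip) as archive:
            archive.extractall(workdir)
        # Assumption: documents are .txt files somewhere below the archive root.
        for document in sorted(Path(workdir).rglob("*.txt")):
            key = document.relative_to(workdir).as_posix()
            results[key] = detect(document.read_text(encoding="utf-8"))
    return results


if __name__ == "__main__":
    # Each command-line argument is treated as one corpus ZIP file.
    for corpus_zip in sys.argv[1:]:
        for name, detections in run(corpus_zip).items():
            print(f"{name}: {len(detections)} detection(s)")
```

With a layout like this, switching from the training corpus to the competition corpus would only mean passing a different ZIP file on the command line.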

Thanks for reading my overly long e-mail :-)

-Jan Kasprzak

(*) <sarkasm>Several _weeks_ of run time? Do you guys use Java
or whatever?</sarkasm>

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
Please don't top post and in particular don't attach entire digests to your
mail or we'll all soon be using bittorrent to read the list. --Alan Cox

Luca de Alfaro

Jun 15, 2010, 12:33:31 PM
to pan-works...@googlegroups.com
I second this request. 

The time after the release of the training corpus is what matters for building and improving systems. The final corpus should be used only for automated labeling.
Extending the deadline makes it possible for people who have the time to select the most dubious classifications and review them by hand, which is against the spirit of the competition.
The only need for a window of > 2 days is to (a) accommodate people who are traveling or have other commitments on those exact days, and (b) facilitate the work of people who rely on web APIs etc. (systems may be down).
I think one week, or at most 10 days, is a reasonable period between evaluation corpus publication and result submission.

Luca


--
You received this message because you are subscribed to the Google Group "PAN".
Visit this group at http://groups.google.com/group/pan-workshop-series
To unsubscribe send email to pan-workshop-se...@googlegroups.com.

Jim White

Jun 15, 2010, 1:16:48 PM
to pan-works...@googlegroups.com
I'm sensitive to the concerns over possible advantage, given that this task is a "competition". In our case, though, the need for more time is simply to be able to submit any kind of entry, not to do in-depth analysis and tweaking based on the test set, and we will not be doing any review "by hand". Time demands for other commitments being what they've been, we've not been able to even start writing the code to process the test set and generate the results file until today. And since our code is purely experimental (and not very sophisticated), it will take a couple of days to run on the lowly dual-core machine at hand.

Thank you very much for the time extension!

Jim

Alzahrani, Salha

Jun 16, 2010, 1:25:48 AM
to pan-works...@googlegroups.com
As Jim said, the purpose of asking for an extension is not to gain an extra advantage through in-depth analysis. Also, it's a shame to accuse others of manually post-processing or tuning the results, since that would go against academic ethics.

The reason I asked for an extension is that I'm working alone. Working alone can lead to some mistakes. I wish I had a group to share knowledge and divide tasks with.

However, I'm a PhD student, and I am able to learn and work independently. I've programmed the software for this competition from scratch. I have tested it, measured the processing time, and tried to estimate how long processing the whole training corpus will take. I was planning to use a grid computer at my university. Unfortunately, I have not had time to redesign my code with MPI.

Strictly speaking, I'm not very sophisticated, nor am I working within a team of experts in plagiarism detection. I'm just a student struggling to finalize my results!

Thank you to the PAN committee for this extension. At least you help students to participate.

Salha