How large will the competition corpus be for external analysis?

Tartessos

Feb 18, 2010, 3:14:53 PM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

Hello everybody from Huelva (south-west Spain)!

I'm happy to participate this year.

I'd like to know how big the competition corpus will be this
year.

Last year, it was about the same size as the development corpus
(1 GB source + 1 GB suspicious).

This year we have a better development corpus than in PAN'09 (thanks
to the Weimar and Valencia universities), much bigger, so we can
practice with different folds of it. But I think that if the
competition corpus is as big as the development corpus, it will be an
inconvenience for many potential teams, because of the long analysis
time that preliminary experiments may need.

Last year, I was trying with some colleagues to create a first
proposal for PAN'09, but the analysis took so long that we had no
hope of finishing any trial before the deadline, and so we refused to
participate.

Even though we expect no problem this year with a slightly bigger
competition corpus, it would be good for beginners in this task if the
competition corpus did not grow as much as the development corpus.

If analysis time is not an important variable to evaluate in the
competition, I think there is no reason to increase the corpus size.
The same size as last year would be enough.

However, it would also be a challenge to test a bigger corpus every
year (we know that the growing availability of sources for plagiarism
is also part of the real problem), but in that case, I think that
analysis time and machine requirements should also be evaluation
variables to keep in mind.

Regards,

Diego

Martin Potthast

Feb 18, 2010, 4:33:44 PM

to pan-works...@googlegroups.com

Hi Diego,

> I'm happy to participate this year.

We're happy to have you on board!

> I'd like to know how big the competition corpus will be this
> year.

It will be about the same size as this year's training corpus.

> Last year, I was trying with some colleagues to create a first
> proposal for PAN'09, but the analysis took so long that we had no
> hope of finishing any trial before the deadline, and so we refused to
> participate.

I'm sorry to hear that.

> Even though we expect no problem this year with a slightly bigger
> competition corpus, it would be good for beginners in this task if the
> competition corpus did not grow as much as the development corpus.
>
> If analysis time is not an important variable to evaluate in the
> competition, I think there is no reason to increase the corpus size.
> The same size as last year would be enough.

In fact, analysis time is important. If you consider the Web as the
collection of source documents, you'll see that plagiarism detection
then needs different solutions than it does for a small collection.
We are increasing the size of the competition corpus to foster
research and development in this direction.

The good news is that you don't have to start from scratch, since
there is already a lot of research on related subjects like
near-duplicate detection.
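
A minimal sketch of one common near-duplicate detection building block, word n-gram shingling with Jaccard similarity, may make this pointer more concrete. It is not taken from any PAN system; the function names `shingles` and `jaccard`, the shingle length of 3, and the two example sentences are assumptions chosen for illustration.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical example texts (not from the PAN corpus).
source = "the quick brown fox jumps over the lazy dog"
suspicious = "a quick brown fox jumps over a sleeping dog"

# The two texts share 3 of 11 distinct shingles.
print(jaccard(shingles(source), shingles(suspicious)))
```

A high similarity between a suspicious document and a source document marks the pair as worth a closer look; a practical system would still need a way to avoid computing this for every pair.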

Best regards,
Martin

--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc

Tartessos

Feb 20, 2010, 4:31:03 AM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

Dear Mr. Martin:

After reading my own post and your reply again, I have to correct my
translation. My English is not as good as I'd like it to be.

On 18 Feb, 22:33, Martin Potthast <martin.potth...@uni-weimar.de>
wrote:

>
> > Last year, I was trying with some colleagues to create a first
> > proposal for PAN'09, but the analysis took so long that we had no
> > hope of finishing any trial before the deadline, and so we refused to
> > participate.
>
> I'm sorry to hear that.
>

When I wrote "...we refused to participate", I think I used too harsh
an expression. It would be more correct to say "and so we gave up
participating, to try again on a better occasion with a faster,
well-adjusted, and tested version".

Anyway, all of us were happy to have tried, even though we stopped
that preliminary project. We learnt many things about plagiarism
detection and closely followed the progress of the competition and the
papers.

I take the opportunity to congratulate not only the winning team
(Grozea et al.; an excellent proposal and of course a worthy winner),
but also the other competitors, because we can learn from every one of
them, even for tasks other than plagiarism detection.

In particular, I congratulate the second-placed team (Kasprzak et
al.) for having the fastest system, and the third (Basile et al.)
for having the cheapest architecture (running on a standard PC),
both with results not far behind.

I am also grateful to Palkovskii for the lesson on countermeasures in
his papers.

I could write a lot about how much I have enjoyed reading all your
papers and how many things I have learnt from them.

That has been my motivation to work this year, and so we hope to offer
a really competitive proposal.

I hope all of you enjoy this task as much as we do.

Thanks again to everybody, the organizing committee and the
participants of PAN'09!

Martin Potthast

Feb 20, 2010, 5:33:31 AM

to pan-works...@googlegroups.com

> When I wrote "...we refused to participate", I think I used too harsh
> an expression. It would be more correct to say "and so we gave up
> participating, to try again on a better occasion with a faster,
> well-adjusted, and tested version".

We know our corpus poses quite a challenge. By doubling its size we
increase that challenge even more. However, if you look at last year's
campaign, most of the participants tackled the problem simply by
exhaustive comparison, thus eliminating the problem of candidate
retrieval. This is fine by us; however, it does not help to develop
practical solutions that may work at Web scale. This year, by doubling
the corpus size we have roughly quadrupled the effort necessary to do
an exhaustive comparison. Now two things may happen: people start using
cluster computers or supercomputers in order to perform exhaustive
comparisons in a reasonable time, or, hopefully, people start
developing more worthwhile solutions, which may lead them to new
insights into the subject.
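
To make the contrast between exhaustive comparison and candidate retrieval concrete, here is a minimal sketch. With S source and Q suspicious documents, exhaustive comparison examines S × Q pairs, so doubling both sets multiplies the work by four. One possible form of candidate retrieval is an inverted index over word n-grams, which returns only the sources sharing text with a suspicious document. The function names, the n-gram length, the `min_shared` threshold, the document counts, and the example documents are all assumptions for illustration, not a description of any participant's system.

```python
from collections import defaultdict


def exhaustive_pairs(n_source, n_suspicious):
    """Number of document pairs an exhaustive comparison must examine."""
    return n_source * n_suspicious


def build_index(sources, n=3):
    """Map each word n-gram to the ids of the source documents containing it."""
    index = defaultdict(set)
    for doc_id, text in sources.items():
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return index


def candidates(index, suspicious_text, n=3, min_shared=1):
    """Return ids of source documents sharing at least min_shared n-grams."""
    words = suspicious_text.lower().split()
    grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    hits = defaultdict(int)
    for gram in grams:
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return {doc_id for doc_id, count in hits.items() if count >= min_shared}


# Hypothetical document counts: doubling both sets quadruples the pairs.
print(exhaustive_pairs(1000, 1000), exhaustive_pairs(2000, 2000))

# Hypothetical source documents.
sources = {
    "s1": "the quick brown fox jumps over the lazy dog",
    "s2": "an entirely unrelated sentence about corpus size",
}
index = build_index(sources)
print(candidates(index, "yesterday the quick brown fox ran away"))
```

Only the candidates returned by the index need a detailed comparison, so the cost per suspicious document depends on how many sources actually share text with it rather than on the size of the whole collection.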

> Anyway, all of us were happy to have tried, even though we stopped
> that preliminary project. We learnt many things about plagiarism
> detection and closely followed the progress of the competition and the
> papers.
>
> I take the opportunity to congratulate not only the winning team
> (Grozea et al.; an excellent proposal and of course a worthy winner),
> but also the other competitors, because we can learn from every one of
> them, even for tasks other than plagiarism detection.
>
> In particular, I congratulate the second-placed team (Kasprzak et
> al.) for having the fastest system, and the third (Basile et al.)
> for having the cheapest architecture (running on a standard PC),
> both with results not far behind.
>
> I am also grateful to Palkovskii for the lesson on countermeasures in
> his papers.
>
> I could write a lot about how much I have enjoyed reading all your
> papers and how many things I have learnt from them.
>
> That has been my motivation to work this year, and so we hope to offer
> a really competitive proposal.
>
> I hope all of you enjoy this task as much as we do.
>
> Thanks again to everybody, the organizing committee and the
> participants of PAN'09!

I don't know what to say, other than: Thank you very much!

Best,
Martin

PS:

> After reading my own post and your reply again, I have to correct my
> translation. My English is not as good as I'd like it to be.

Diego, your English is perfectly fine; you make yourself understood!

But if you sometimes wonder how others typically write a phrase,
consider asking our Netspeak service at www.netspeak.cc. For your
phrase above, it offers a rich choice of examples, just by querying
"we ? to participate" (deep link:
http://webis21.medien.uni-weimar.de/netspeak/netspeak?q=we+%3f+to+participate&exact).

Tartessos

Feb 21, 2010, 4:37:45 AM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

On 20 feb, 11:33, Martin Potthast <martin.potth...@uni-weimar.de>
wrote:

> > When I wrote "...we refused to participate", I think I used too harsh
> > an expression. It would be more correct to say "and so we gave up
> > participating, to try again on a better occasion with a faster,
> > well-adjusted, and tested version".
>
> We know our corpus poses quite a challenge. By doubling its size we
> increase that challenge even more. However, if you look at last year's
> campaign, most of the participants tackled the problem simply by
> exhaustive comparison, thus eliminating the problem of candidate
> retrieval. This is fine by us; however, it does not help to develop
> practical solutions that may work at Web scale. This year, by doubling
> the corpus size we have roughly quadrupled the effort necessary to do
> an exhaustive comparison. Now two things may happen: people start using
> cluster computers or supercomputers in order to perform exhaustive
> comparisons in a reasonable time, or, hopefully, people start
> developing more worthwhile solutions, which may lead them to new
> insights into the subject.
>

I fully agree with your reasons. However, this suggests that this year
the competition ought to measure analysis time, or at least open a
sub-competition for the fastest analysis system with good enough
results (maybe even 15% lower than the no_matter_time winner, or some
similar relationship).
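
A minimal sketch of how such a sub-competition rule could be applied, assuming "15% lower" means a score within 15% (relative) of the best score. The function name, the system names, and all scores and runtimes below are hypothetical illustrations, not PAN results.

```python
def fastest_within_tolerance(results, tolerance=0.15):
    """Pick the fastest system whose score is within tolerance of the best score.

    results maps a system name to a (score, runtime_hours) pair.
    """
    best = max(score for score, _ in results.values())
    threshold = best * (1 - tolerance)
    eligible = {name: r for name, r in results.items() if r[0] >= threshold}
    return min(eligible, key=lambda name: eligible[name][1])


# Hypothetical results: (detection score, runtime in hours).
results = {
    "A": (0.70, 48.0),  # best score, but slow
    "B": (0.62, 2.0),   # within 15% of the best score, and much faster
    "C": (0.40, 0.5),   # fastest, but too far below the best score
}
print(fastest_within_tolerance(results))
```

Under this rule system A would win the no_matter_time ranking, while system B would win the speed award.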

It may also be good to decide which system is most useful on a
single PC, or a similarly cheap architecture, where time would not
be so important.

Really, these different awards could be sponsored by different
partners, showing where their interests lie.
These partners could be involved in working out how to check that the
competitors are using real data (we know it's very difficult to
ensure).

> consider asking our Netspeak service at www.netspeak.cc. For your
> phrase above, it offers a rich choice of examples, just by querying
> "we ? to participate" (deep link: http://webis21.medien.uni-weimar.de/netspeak/netspeak?q=we+%3f+to+par...).
>
Thanks, Martin. I'll take note :)

Diego.

Martin Potthast

Feb 21, 2010, 5:12:49 AM

to pan-works...@googlegroups.com

Hi Diego,

> I fully agree with your reasons. However, this suggests that this year
> the competition ought to measure analysis time, or at least open a
> sub-competition for the fastest analysis system with good enough
> results (maybe even 15% lower than the no_matter_time winner, or some
> similar relationship).

It would be nice to measure running time, but doing so poses a number
of problems:
A common test bed is required: either everybody has to buy the same
computer with the same configuration, or you'll have to send us your
programs so that they'd be run on a computer of our choice. The former
is of course unrealistic, the latter less so. However, you (=the
participants) sending us your programs would mean that we become a
part of your debugging cycle, and we certainly don't want that. All
kinds of things might happen when you send us your program, such as it
doesn't work or compile, we use it the wrong way, it never stops, it
requires pre-installed hardware or software we don't have, and many
more. We also thought about a live competition where everyone brings
their software, sets it up, and runs it at the workshop site. But this
adds even more imponderables.

Therefore we came to the conclusion that a simple interface is the
best choice, that is, you send us your detection results, and later on
a description of how you obtained them. From your result set we can
judge your performance, and from your description we can judge
whether the runtime complexity of your approach affects its
practicability.
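
As a rough illustration of judging performance from a submitted result set alone, the sketch below compares detected passages with the truly plagiarized ones at character level. It assumes each passage is given as an (offset, length) pair within one suspicious document; that representation, the function names, and the example numbers are assumptions for illustration and not the official PAN result format or performance measures.

```python
def covered_chars(spans):
    """Expand (offset, length) spans into the set of character positions covered."""
    chars = set()
    for offset, length in spans:
        chars.update(range(offset, offset + length))
    return chars


def precision_recall(detected, actual):
    """Character-level precision and recall of detected spans against actual ones."""
    det, act = covered_chars(detected), covered_chars(actual)
    overlap = len(det & act)
    precision = overlap / len(det) if det else 0.0
    recall = overlap / len(act) if act else 0.0
    return precision, recall


# Hypothetical example: one plagiarized passage, one partially correct detection.
actual = [(100, 50)]    # characters 100..149 are plagiarized
detected = [(120, 40)]  # characters 120..159 were reported
print(precision_recall(detected, actual))  # 30 of 40 reported, 30 of 50 found
```

Nothing in this evaluation depends on how long the detector ran or on what hardware, which is why running time has to be judged from the written description instead.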

> It may also be good to decide which system is most useful on a
> single PC, or a similarly cheap architecture, where time would not
> be so important.

The point above applies here too: we can't require everyone to buy a
certain type of architecture. Even if we could, we can't check whether
people are actually using it or not. On the other hand, if someone has
a supercomputer at her disposal, why disallow its usage? Depending on
how you define "fairness" (there are many possible definitions for this
situation), this might seem unfair to them.

Best regards,
Martin
