Task 1 Evaluation

Markus Muhr

May 27, 2010, 5:56:19 AM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Hi!

I have a question concerning the evaluation of the results. This year
there is no explicit distinction between external and intrinsic
plagiarism in the test corpus, but I would like to know whether the
actual evaluation will make a distinction. In other words, will there
be separate evaluations for external and intrinsic plagiarism in
addition to an overall evaluation?

Furthermore, can one suspicious document contain both intrinsic and
external plagiarism passages, or is it one or the other, if any?

Regards,
Markus Muhr

Martin Potthast

May 27, 2010, 7:51:40 AM
to pan-works...@googlegroups.com
Hi Markus,

> I have a question concerning the evaluation of the results. This year
> there is no explicit distinction between external and intrinsic
> plagiarism in the test corpus, but I would like to know whether the
> actual evaluation will make a distinction. In other words, will there
> be separate evaluations for external and intrinsic plagiarism in
> addition to an overall evaluation?

We will certainly dig into the submitted results to see how
participants coped with the different kinds and types of plagiarism.
But just like last year, the task winner will be determined on the
whole of the test corpus.

> Furthermore, can one suspicious document contain both intrinsic and
> external plagiarism passages, or is it one or the other, if any?

I'm afraid I cannot answer this question at the moment.

Best,
Martin


--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc

Tartessos

May 28, 2010, 2:28:39 PM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Hello again!

However, I think it is not such a good idea to evaluate both kinds of
systems together on the same corpus, especially for external systems.

The larger number of suspicious documents will increase the number of
false positives, so the precision measure will suffer if detections
reported for the extra suspicious documents added for intrinsic
analysis are counted.

To compare results and their evolution since last year, it might be
better to have similar but separate corpora for the two tasks.

I would prefer a source-document corpus twice as large (as in the
training corpus), in order to analyze how the performance of external
systems is affected by the greater chance of coincidental matches for
every suspicious document, plagiarized or not.

Anyway, this depends on how the evaluation program works when
evaluating external systems.

Also, because we think that analysis time is very important (we do not
mind that it is not evaluated in this competition), keeping the
proportion between the source and suspicious corpora, as well as the
number and proportion of plagiarized sections, would be a good idea
for comparing the evolution of the state of the art.

Best regards,

Diego Rodríguez

Martin Potthast

May 30, 2010, 5:57:23 AM
to pan-works...@googlegroups.com
Hi Diego,

first of all, my apologies for the late answer.

> However, I think it is not such a good idea to evaluate both kinds of
> systems together on the same corpus, especially for external systems.
>
> The larger number of suspicious documents will increase the number of
> false positives, so the precision measure will suffer if detections
> reported for the extra suspicious documents added for intrinsic
> analysis are counted.

You may be right, but then again, a real plagiarism detector has no
knowledge about whether or not it can expect plagiarized documents
whose sources are available somewhere.

> To compare results and their evolution since last year, it might be
> better to have similar but separate corpora for the two tasks.

There is actually no difference from last year with regard to
performance evaluation: last year's winner was determined on the whole
of the corpus! We will, however, also compute external-only
performance as well as intrinsic-only performance.

> I would prefer a source-document corpus twice as large (as in the
> training corpus), in order to analyze how the performance of external
> systems is affected by the greater chance of coincidental matches for
> every suspicious document, plagiarized or not.
>
> Anyway, this depends on how the evaluation program works when
> evaluating external systems.
>
> Also, because we think that analysis time is very important (we do not
> mind that it is not evaluated in this competition), keeping the
> proportion between the source and suspicious corpora, as well as the
> number and proportion of plagiarized sections, would be a good idea
> for comparing the evolution of the state of the art.

In fact, just like in the PAN-PC-09, the number of source documents is
the same as the number of suspicious documents. Note, however, that we
do not hand out the source documents for the intrinsic portion, since
this would make the evaluation of intrinsic plagiarism detectors
pointless. This is why the released corpus contains more suspicious
documents than source documents.

Tartessos

May 31, 2010, 4:04:01 AM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Hi, Martin,

Thanks for your reply.

> You may be right, but then again, a real plagiarism detector has no
> knowledge about whether or not it can expect plagiarized documents
> whose sources are available somewhere.

I do not want to discuss the advantages and drawbacks of intrinsic and
external systems, which I think are obvious to all of us.

> There is actually no difference from last year with regard to
> performance evaluation: last year's winner was determined on the whole
> of the corpus! We will, however, also compute external-only
> performance as well as intrinsic-only performance.
...
> In fact, just like in the PAN-PC-09, the number of source documents is
> the same as the number of suspicious documents. Note, however, that we
> do not hand out the source documents for the intrinsic portion, since
> this would make the evaluation of intrinsic plagiarism detectors
> pointless. This is why the released corpus contains more suspicious
> documents than source documents.

I hope that, when computing the precision of external analysis, no XML
results for the extra files added for intrinsic evaluation will be
counted. Otherwise, the proportion between externally plagiarized
files (and/or sections) and unplagiarized ones would be worse than in
the training corpus or in last year's competition.

We have to remember that, as far as precision is concerned, evaluating
an unplagiarized file can never increase the average: precision is
computed only over detected sections, and a correct detection cannot
occur in the new unplagiarized files or, from the external point of
view, in sections with intrinsic plagiarism. However, every detection
reported there (always false in this case) would count as 0% precision
when the average is computed.

I would like the precision of external analysis to be computed without
any XML results for the extra files added for intrinsic evaluation,
because otherwise the proportion between externally plagiarized files
(and/or sections) and unplagiarized ones would be worse than in the
external section of PAN-PC-09 or in last year's competition. This
would result in lower performance for external systems this year.

Combining the results to obtain a global performance is easy, as in
last year's competition.

Best regards,

Diego

Martin Potthast

May 31, 2010, 4:19:35 AM
to pan-works...@googlegroups.com
Hi Diego,

> I hope that, when computing the precision of external analysis, no XML
> results for the extra files added for intrinsic evaluation will be
> counted. Otherwise, the proportion between externally plagiarized
> files (and/or sections) and unplagiarized ones would be worse than in
> the training corpus or in last year's competition.
>
> We have to remember that, as far as precision is concerned, evaluating
> an unplagiarized file can never increase the average: precision is
> computed only over detected sections, and a correct detection cannot
> occur in the new unplagiarized files or, from the external point of
> view, in sections with intrinsic plagiarism. However, every detection
> reported there (always false in this case) would count as 0% precision
> when the average is computed.
>
> I would like the precision of external analysis to be computed without
> any XML results for the extra files added for intrinsic evaluation,
> because otherwise the proportion between externally plagiarized files
> (and/or sections) and unplagiarized ones would be worse than in the
> external section of PAN-PC-09 or in last year's competition. This
> would result in lower performance for external systems this year.
>
> Combining the results to obtain a global performance is easy, as in
> last year's competition.

I take it your primary concern is that the introduction of the
suspicious documents from the intrinsic portion of the corpus opens
the possibility of detections that otherwise wouldn't be made.
While this is true, I emphasize that this is intended: a plagiarism
detector has no knowledge about whether to expect cases for which
sources are available or cases for which no sources are available. In
this connection, last year's distinction between intrinsic cases and
external cases may have been a mistake.

Rest assured, however, that we will compute the performance of each
plagiarism detector on external cases only as well as on intrinsic
cases only.

Markus Muhr

May 31, 2010, 5:15:04 AM
to pan-works...@googlegroups.com
Hi Martin,

In my opinion, intrinsic plagiarism detection on its own may be more of
academic interest, since it would cause a lot of problems if someone were
accused of plagiarism without proof in the form of a source document. However, I
think intrinsic plagiarism detection methods may serve as a preliminary
filtering step in real-world problems. In other words, intrinsic plagiarism
detection methods should not serve as detectors per se, but as a filtering step,
and should be evaluated as such.

Furthermore, I want to add that if you take last year's best approaches for
external and intrinsic plagiarism to form a hybrid approach, the
overall results will most likely be very bad in terms of precision, so I
think Diego made a good point by pointing out this fact.

Best,
Markus

Tartessos

May 31, 2010, 8:05:00 AM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Thanks, Markus and Martin.

We agree with Markus, and that is why we focus on external
systems. One should also keep in mind that a group assignment written by
several colleagues could be flagged as possibly plagiarized by an
external system (a false positive). Still, there is no reason to
exclude that interesting task (you will agree), and I think it is good
for the competition that teams try both tasks, like last year.

Just one more remark on the earlier question:

If the number of unplagiarized suspicious files (or of files that an
external system cannot handle, such as those included for intrinsic
purposes) tends to infinity, then, since only false positives are
possible on them, precision, and consequently overall performance,
will tend to zero.

Including more documents for intrinsic systems will only affect the
analysis time, with no consequences for the performance measures
other than greater accuracy due to the larger number of analyzed
documents.

And even this is not entirely true: for documents that are more than
50% plagiarized, detection will also affect the real performance of
an intrinsic system (it may even mark just the opposite if all the
plagiarized passages come from a single source document).

The way we will be evaluated this year breaks continuity with last
year's competition (making it difficult to compare progress in
external systems), and it is not clear how it will benefit the scoring
of intrinsic systems.

The same external (or even intrinsic) systems developed last year
would achieve lower performance this year for the reasons above. It
would be difficult to evaluate progress in the state of the art if
the evaluation does not follow stable criteria.

These are the real reasons why the corpora should not be mixed.

There is no problem with anybody developing a system that includes
both technologies to detect plagiarism, if the result is better on
each task.

The problem is how to interpret and compare the results from year to
year if the test conditions are so different.

Best,
Diego



Martin Potthast

Jun 1, 2010, 9:52:15 AM
to pan-works...@googlegroups.com
Hi Markus, hi Diego, and everyone else,

in order for this debate to lead somewhere, in the mail following this
one, I will summarize what we have so far. To all those not interested
in the details, you may stop reading now.

But first, let me give detailed answers to your questions / concerns:

> In my opinion, intrinsic plagiarism detection on its own may be more of
> academic interest, since it would cause a lot of problems if someone were
> accused of plagiarism without proof in the form of a source document.

This is exactly the point, but for plagiarism detection in general:
regardless of which detection algorithm you use, to establish beyond
doubt that something was plagiarized the algorithm would need to
determine whether someone acted knowingly and with bad intentions.
Isn't a necessary pre-condition for some text being plagiarism the
guilty conscience of the plagiarist? Otherwise such cases would be
only text reuse.

I very much like the idea of Paul Clough to do away with all the
negative connotations the term "plagiarism" brings about, and instead
use the more abstract term "text reuse" which just implies that some
piece of text is derived from another piece of text, for whatever
reason. In fact, the latter is what all our IR algorithms can truly
detect, while the former can only be determined by a (human) judge,
maybe with the assistance of some text reuse detector.

This raises the question of why we still call our research plagiarism
detection, which is easily answered by comparing the following
two queries on Google Scholar:
Plagiarism: http://scholar.google.com/scholar?hl=en&q=plagiarism&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0
"Text Reuse": http://scholar.google.com/scholar?hl=en&q=%22text+reuse%22&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0

> However, I
> think intrinsic plagiarism detection methods may serve as a preliminary
> filtering step in real-world problems. In other words, intrinsic plagiarism
> detection methods should not serve as detectors per se, but as a filtering step
> and should be evaluated as such.

I don't agree: human readers can, very accurately, spot style changes.
Also, a lot of advances are being made in automatic authorship
verification. In fact, if I were in possession of an ideal authorship
verifier, I could tell whether a document has been written by exactly
one author or rather by two authors, and which part has been written
by which author. If the document is allegedly written by one author,
this would be just as good evidence for possible plagiarism as
retrieving the original source.

The only thing we need to research and develop now is such an
authorship verifier. How difficult can it be? ;-)

> Furthermore, I want to add that if you take last year's best approaches for
> external and intrinsic plagiarism to form a hybrid approach, the
> overall results will most likely be very bad in terms of precision, so I
> think Diego made a good point by pointing out this fact.

You may be right, but then again, this is where the research starts,
isn't it? We need reliable algorithms! I.e., who wants to be judged by
an algorithm or a set of algorithms that can't even distinguish
background noise from true plagiarism / text reuse?

> If the number of unplagiarized suspicious files (or of files that an
> external system cannot handle, such as those included for intrinsic
> purposes) tends to infinity, then, since only false positives are
> possible on them, precision, and consequently overall performance,
> will tend to zero.

If this is true, then there is no chance to build a real plagiarism detector.
First, exactly the same would happen if we simply increased the number
of source documents. What if we offered the whole World Wide Web as
source documents?
Second, we imagine detection algorithms to compare _all_ the documents
in the world that deserve checking (or simply all documents?) against
the whole Web.

> Including more documents for intrinsic systems will only affect the
> analysis time, with no consequences for the performance measures
> other than greater accuracy due to the larger number of analyzed
> documents.

I'm afraid this is not true. Our evaluation measures are affected
only by the number of cases S which are actually there, and by the
detections R reported by an algorithm. The number of documents is not
considered in any of the measures.
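
The point can be illustrated with a minimal sketch. It assumes a simplified, character-level precision and recall in which every case in S and every detection in R is a set of (document, offset) pairs, and each measure is averaged over detections or cases respectively. The helper names `chars`, `precision`, and `recall` are hypothetical and this is not the official evaluation script.

```python
def chars(doc, start, end):
    """Set of (document, offset) pairs covered by the passage [start, end)."""
    return {(doc, i) for i in range(start, end)}

def precision(S, R):
    """Average, over detections r in R, of the fraction of r that hits a case."""
    if not R:
        return 0.0
    truth = set().union(*S)
    return sum(len(r & truth) / len(r) for r in R) / len(R)

def recall(S, R):
    """Average, over cases s in S, of the fraction of s that was detected."""
    if not S:
        return 0.0
    found = set().union(*R)
    return sum(len(s & found) / len(s) for s in S) / len(S)

# One true case in doc1, detected exactly (assumed toy data).
S = [chars("doc1", 0, 100)]
R_clean = [chars("doc1", 0, 100)]

# Adding unplagiarized documents changes neither S nor R, so nothing moves.
# A false detection in such a document, however, averages in as 0 precision.
R_noisy = R_clean + [chars("doc2", 0, 50)]

print(precision(S, R_clean), recall(S, R_clean))
print(precision(S, R_noisy), recall(S, R_noisy))
```

In this toy setting the document count never appears in either function; only a detection reported on an extra document lowers precision.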

> And even this is not entirely true: for documents that are more than
> 50% plagiarized, detection will also affect the real performance of
> an intrinsic system (it may even mark just the opposite if all the
> plagiarized passages come from a single source document).

Right, and this poses some interesting challenges for intrinsic
plagiarism detection, doesn't it?
Since this is a classification problem that requires further context
knowledge, what would be the safe bet if such knowledge is not at
hand?

> The way we will be evaluated this year breaks continuity with last
> year's competition (making it difficult to compare progress in
> external systems), and it is not clear how it will benefit the scoring
> of intrinsic systems.

As I said before, our evaluation does not differ much from before: the
algorithms will be evaluated on the whole of the corpus, and also the
performance for external plagiarism and intrinsic plagiarism will be
measured individually. Please compare the three tables found on last
year's Web page:
http://www.uni-weimar.de/medien/webis/research/workshopseries/pan-09/competition.html#results
The third table, entitled "Overall Tasks", was used to determine the winner.

The only difference this year is that there are more documents to be
compared, which opens up the possibility of more background-noise
overlaps between documents and, consequently, more false positive
detections. In order to prepare and train your algorithms accordingly,
you might simply combine the two sub-directories intrinsic and
external of the PAN-PC-09.
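
A minimal sketch of such a combination, assuming the corpus root contains the two sub-directories named `external` and `intrinsic` and that documents are plain `.txt` files. The function name `pooled_documents` and the file pattern are hypothetical; the actual layout of the PAN-PC-09 may differ. The files are only listed, not copied or renamed, so any annotations that refer to file names stay valid.

```python
from pathlib import Path

def pooled_documents(corpus_root, parts=("external", "intrinsic"), pattern="*.txt"):
    """Return one list of text files found under all given sub-directories.

    The sub-directory names and the file pattern are assumptions about the
    corpus layout, not a documented interface.
    """
    root = Path(corpus_root)
    pool = []
    for part in parts:
        # rglob also descends into nested folders below each part
        pool.extend(sorted((root / part).rglob(pattern)))
    return pool

# Usage (path is a placeholder):
# documents = pooled_documents("/data/pan-pc-09")
# A detector would then be run on every file in `documents`.
```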

Moreover, please note that last year's corpus was not perfect. We have
reworked the corpus from scratch and repaired many of the problems
(e.g., unwanted duplication) encountered during PAN-09. However, we
went to great lengths in order to ensure comparability, and we will
provide detailed analyses that go beyond those of last year.

> The same external (or even intrinsic) systems developed last year
> would achieve lower performance this year for the reasons above. It
> would be difficult to evaluate progress in the state of the art if
> the evaluation does not follow stable criteria.
>

> The problem is how to interpret and compare the results from year to
> year if the test conditions are so different.

Absolutely right, but Rome wasn't built in a day, and if plagiarism
detectors require research and development, so do evaluation
frameworks for plagiarism detectors. Think of last year, maybe, as
version 0.9; OK but not perfect. Many participants pointed out some
problems with PAN-PC-09. Hopefully, this year is version 1.0. :-) And
if everything goes according to plan next year will be only version
1.0.1. We will, however, press on and try new ways to evaluate
plagiarism detection success.

Best,
Martin

PS: Now follows a summary of our debate.

Martin Potthast

Jun 1, 2010, 9:53:50 AM
to pan-works...@googlegroups.com
Hi again,

here is the summary; please let me know if I missed something:

Situation:
- Intrinsic and external plagiarism cases are now mixed.

Reason:
- A plagiarism detector has no knowledge of whether a source is available or not.

I would like to emphasize that, right from the start, we mentioned
that there will be no distinction between intrinsic and external
plagiarism detection this year. This is not something that came as a
surprise, but something that was planned and communicated long ago.


Your concerns:
1. More suspicious documents may lead to a higher amount of overlaps
between documents.
2. The distinction of last year allowed researchers to focus.
3. Intrinsic plagiarism detection is rendered difficult by the fact
that there are documents with more than 50% plagiarism.
4. Intrinsic plagiarism detection is not good enough to be applied in
practice since it does not provide proof.
5. The setting is not the same as last year.

Our answers:
-> 1. This can be generalized: more documents yield an exponential
growth of possibilities for overlap. But this should not stop us from
adding more documents.
-> 2. Agreed.
-> 3. True. Just like last year, the corpus does not contain documents
with more than 50% intrinsic plagiarism.
-> 4. This is not true. Given a sufficiently good classifier
that distinguishes authors, intrinsic detection offers proof just like
external detection does. In turn, external detection cannot offer
proof either, but only evidence! Consider the distinction between
plagiarism and text reuse in this connection.
-> 5. Not entirely true: we have changed and reworked the corpus in
order to remove some of last year's problems. For instance, we have
deduplicated the documents pretty thoroughly. The evaluation, however,
is done just like last year.

Possible reactions:
a. Roll back, and re-release the test collection, this time including
a distinction between intrinsic and external cases.
b. Re-introduce the distinction in the upcoming PAN-PC-10, to be
released after the lab.
c. Leave everything as it is now.

Current conclusion:
At the moment, I'm torn between (b) and (c). The former only because
of concern 2. We will discuss this when we compile PAN-PC-10, which
will replace PAN-PC-09. Option (a), to me, is out of the question
since it involves too high a risk of error and misunderstanding on both
your and our side, and since time is already running short.

Best,
Martin

Tartessos

Jun 2, 2010, 5:48:18 AM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

Thanks, Martin.

> Current conclusion:
> At the moment, I'm torn between (b) and (c). The former only because
> of concern 2. We will discuss this when we compile PAN-PC-10, which
> will replace PAN-PC-09. Option (a), to me, is out of the question
> since it involves too high a risk of error and misunderstanding on both
> your and our side, and since time is already running short.
>
I agree: it is too late for option (a). It may be better to leave
things as they are now.
However, for development purposes I think it would be better to have
both corpora separate, as in PAN-PC-09.
For testing hybrid systems, we can mix both corpora and get similar
results.

However, it would be possible to have an option (d) next year with
three separate blocks: external, intrinsic and hybrid.

Anyway, in order to compare performance gains, I think it will be
necessary to fix the percentage of plagiarized documents in every
corpus, the ratio between the number of documents and the number of
plagiarized sections, the distribution of plagiarized lengths and of
obfuscation types, and the ratio between source and suspicious
documents (for a possible evaluation of analysis time in external
systems).

Of course, new obfuscation types (such as the new human-simulated
obfuscation) and similar changes are difficult to introduce without
altering the line of the corpus at all, but one should try to
minimize this (I'm sure you are trying). Anyway, we understand that it
really is an improvement, and we applaud it.

Actually, I wanted to ask whether the evaluation for the different
kinds of plagiarism would maintain the 50% ratio of unplagiarized
suspicious documents in each case (intrinsic and external), or whether
this is not possible due to the nature of this corpus (i.e., the same
suspicious document could contain both intrinsic and external
plagiarism).

If the sections are numerically separated (even if we are not told
how it is done) and the two kinds of plagiarism are not mixed in the
same suspicious document, it would not be so difficult to obtain
highly comparable partial results.

Please keep this in mind, at least for next year's competition and
corpus (v1.1 ;-) ).


Best regards,

Diego
I.E.S. "José Caballero" / Universidad de Huelva (Huelva - Spain).

Jan Kasprzak

Jun 2, 2010, 7:21:03 AM
to pan-works...@googlegroups.com
Martin Potthast wrote:
: here the summary; please let me know if I missed something:

I think the main problem is that mixing a relatively good external
algorithm with a generally lower-performance intrinsic algorithm can
lead to a _lower_ overall score, because of much lower overall precision
and only marginally higher recall (and possibly higher granularity)
than with a good external algorithm alone.
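
A small numeric sketch of this effect, assuming the overall score behaves like the harmonic mean (F-measure) of precision and recall and ignoring granularity. All figures below are invented for illustration and are not measured results of any system.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical external-only detector: high precision, moderate recall.
external_only = f_measure(0.90, 0.60)

# Hypothetical hybrid: the intrinsic component adds a little recall
# but many false positives, which pulls the overall precision down.
hybrid = f_measure(0.50, 0.65)

print(f"external only: {external_only:.3f}")
print(f"hybrid:        {hybrid:.3f}")
```

Under these assumed numbers the hybrid scores lower than the external detector alone, which is the situation described above.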

-Jan Kasprzak

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
Please don't top post and in particular don't attach entire digests to your
mail or we'll all soon be using bittorrent to read the list. --Alan Cox

Tartessos

Jun 3, 2010, 5:35:02 AM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Hello Jan, hello Martin, and hello to anybody who is still
interested in this thread:


> I think the main problem is that mixing a relatively good external
> algorithm with a generally lower-performance intrinsic algorithm can
> lead to a _lower_ overall score, because of much lower overall precision
> and only marginally higher recall (and possibly higher granularity)
> than with a good external algorithm alone.

At least with last year's state-of-the-art technology, Jan is
right. However, it is still a good idea to find out how to improve and
combine both kinds of analysis, if a better global performance is
finally achieved.

At the moment, as Jan comments, if only a single submission is
allowed, it is quite possible that the few teams that tried to improve
intrinsic technology, having no possibility of scoring separately and
thereby improving their global rank, will refrain from combining or
submitting the results of intrinsic analysis in order to get a better
global rank, since only the best overall result will count.

Even if, as Martin said, this competition corpus is more similar to a
real case, we are probably discouraging the development of
intrinsic systems.

For the next competition (for this year I think it is too late), I suggest:

Three separate corpora:
- Intrinsic and external, as in last year's competition
- a new, mixed one (as now), whose performance would be computed as the
harmonic mean of the intrinsic and external performance, evaluated on
the same corpus but separately.

Then nobody would try to use only one analyzer on that corpus, and the
teams that combine more technologies would benefit.
It would also be easier to compare progress from one PAN to the next
in any case.
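
The suggested scoring for the mixed corpus could be sketched as follows, assuming each system receives one intrinsic score and one external score between 0 and 1. The function name `mixed_corpus_score` and the example scores are hypothetical.

```python
def mixed_corpus_score(intrinsic_score, external_score):
    """Harmonic mean of the two partial scores; 0.0 if either one is 0."""
    if intrinsic_score <= 0 or external_score <= 0:
        return 0.0
    return (2 * intrinsic_score * external_score
            / (intrinsic_score + external_score))

# A system that submits only external detections gets nothing on the mixed corpus,
# while a system that covers both tasks is rewarded.
external_only_system = mixed_corpus_score(0.0, 0.7)
two_task_system = mixed_corpus_score(0.3, 0.7)

print(external_only_system, two_task_system)
```

Because the harmonic mean is dominated by the smaller value, a single-analyzer submission cannot rank well under this proposal.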

After this discussion, I have realized that the analysis type that is
really harmed most is the intrinsic one.

We are not really affected, as we are using only external analysis.
I'd like to hear the opinion of M. Granitzer et al., who last year
were the only team brave enough to try both challenges.

Regards,

Diego.

Martin Potthast

Jun 4, 2010, 3:49:32 AM
to pan-works...@googlegroups.com
Hi everybody,

it is possible that this year's decision to combine both kinds of
detection methods is to the disadvantage of intrinsic plagiarism
detection research; i.e. the opportunistic decision being not to apply
intrinsic plagiarism detection at all because of its possible low
precision. We will take this into consideration when compiling the
final corpus for release.

On the other hand, the one who builds an external detection algorithm
plus a high-precision intrinsic detection algorithm (an algorithm
which makes safe choices) may gain an advantage over those who just
build external algorithms.

Best,
Martin


Cristian Grozea

Jun 8, 2010, 8:34:53 AM
to pan-works...@googlegroups.com
Dear Martin,

I have yet to understand how the scoring works now for the combined case.
Obviously the passages copied from the sources could be detected by an
intrinsic plagiarism method as intrinsic plagiarism.

Let's assume some competitor finds each and every external plagiarism
passage in the suspicious documents but flags them all as intrinsic
plagiarism.
What will the score be?

Thank you in advance for clarifying this.

Best regards,
Cristian


--
Dr. Cristian Grozea
Fraunhofer Institute FIRST
Kekulestrasse 7
Berlin 12489, Germany

Martin Potthast

Jun 9, 2010, 4:00:07 AM
to pan-works...@googlegroups.com
Hi Cristian,

> Let's assume some competitor finds each and every external plagiarism
> passage in the suspicious documents but flags them all as intrinsic
> plagiarism.
> What will the score be?

Assuming that the plagiarized passage is about the same length as the
source passage, this should be 0.5 recall at 1 precision.
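
A minimal sketch of where these numbers come from, assuming that a plagiarism case counts the characters of both the plagiarized passage and its source passage, and that a detection flagged as intrinsic covers only the plagiarized passage. The file names, offsets, and the helper `passage` are hypothetical.

```python
def passage(doc, start, length):
    """Set of (document, offset) pairs covered by a passage."""
    return {(doc, i) for i in range(start, start + length)}

# One external case: 500 plagiarized characters plus the 500-character source.
case = passage("suspicious.txt", 200, 500) | passage("source.txt", 40, 500)

# The detector finds the plagiarized passage but reports no source passage.
detection = passage("suspicious.txt", 200, 500)

recall = len(case & detection) / len(case)            # half of the case is covered
precision = len(case & detection) / len(detection)    # everything reported is correct

print(recall, precision)
```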

Best,
Martin
