Call for Participation

Martin Potthast

Dec 10, 2012, 9:24:16 AM

to pan-works...@googlegroups.com

Dear everyone,

I am happy to announce next year's PAN. Below you will find the Call
for Participation. We hope to welcome you again for another round of
exciting evaluations around the three tasks of plagiarism detection,
author identification, and this year's newcomer, author profiling.

Please note that the timeline of the workshop has been shifted towards
the beginning of the year, and evaluations will take place from
January till June.

Here's the CfP:

-------------------------------------------------------------------------------
PAN @ CLEF: Call for Participation
-------------------------------------------------------------------------------

We invite you to take part in one of the following evaluations:

1. Plagiarism Detection -- Given a document, is it an original?
This task is divided into source retrieval and text alignment.
Source retrieval is about searching for likely sources of a suspicious document.
Text alignment is about matching passages of reused text between documents.

2. Author Identification -- Given a document, who wrote it?
This task focuses on authorship verification and methods to answer the question
whether two given documents have the same author or not. This question
accurately emulates the real-world problem that most forensic linguists face
every day.

3. Author Profiling -- Given a document, what's its author's age / gender?
This task is concerned with predicting an author's demographics from her
writing. Besides being personally identifiable, an author's style may also
reveal her age and gender. Accurate predictors are of key interest to forensic
linguists and marketers alike.

Learn more at http://pan.webis.de.

PAN is held in conjunction with the CLEF'13 conference in Valencia, Spain.

-------------------------------------------------------------------------------
Important Dates
-------------------------------------------------------------------------------

now open          Registration
Dec 15, 2012      Training data release
Mar 31, 2013      Run submission
Jun 16, 2013      Notebook submission
Sep 23-26, 2013   Conference

-------------------------------------------------------------------------------
Organization
-------------------------------------------------------------------------------

Martin Potthast, Tim Gollub, Matthias Hagen, Benno Stein
Webis @ Bauhaus-Universität Weimar

Parth Gupta, Paolo Rosso
NLEL @ Universidad Politécnica de Valencia

Efstathios Stamatatos
University of the Aegean

Moshe Koppel
Bar-Ilan University

Patrick Juola
Duquesne University

Shlomo Argamon
Illinois Institute of Technology

Giacomo Inches
IRGroup @ University of Lugano

Francisco Rangel
Autoritas Consulting

Arun kumar Jayapal

Dec 18, 2012, 5:25:41 AM

to pan-works...@googlegroups.com

Hi Martin,

Hope you are doing well!

This time, I would again like to participate in the plagiarism
detection task and the author identification task. During registration I
did not provide an affiliation. Is it mandatory to provide
affiliation information?

Thanks,
Arun


Jan Kasprzak

Dec 18, 2012, 7:52:07 AM

to pan-works...@googlegroups.com

Hello, Martin and all!

Martin Potthast wrote:
: 1. Plagiarism Detection -- Given a document, is it an original?
: This task is divided into source retrieval and text alignment.
: Source retrieval is about searching for likely sources of a suspicious document.
: Text alignment is about matching passages of reused text between documents.

I am not sure whether you have already made the decision about rules,
or whether you are even willing to discuss it here. But maybe now is the best
time to start the discussion about the rules and scoring:

As for the detailed (pair-wise?) task, can I suggest that we
abandon plagdet, which has clear shortcomings for this task?
See Cristian's poster from this year for a specific example,
and our paper for the parameters we had to modify in order to achieve
good results.

We can discuss a scoring system here, or in a private mail,
or I can create a mailing list on our listserver, if needed.

Possible ideas for improvement are:

- no granularity at all

- algorithms tuned to the new scoring should work the same way
on a corpus with 50 % of plagiarized documents as well as
on real-world data with > 99 % of non-plagiarized
documents. If possible, the scoring system should not induce
a dependency on the corpus structure at all.

- assigning and evaluating the confidence value (0..100%) to the detections
(or -100%..100% to _each_ part of the suspicious document).

- treating intrinsic plagiarism detections (those with no counterpart
in source documents) as "better than nothing", which was not
the case in previous years (given the low precision of the
intrinsic detectors)

- maybe enforce better matching of the source and suspicious document
passages. Currently, the following two results are
given the same score:

Gold standard:
src offset=0, src length=1000, susp offset=0, susp length=1000
src offset=5000, src length=1000, susp offset=5000, susp length=1000

Results 1: the same as Gold standard

Results 2:
src offset=5000, src length=1000, susp offset=0, susp length=1000
src offset=0, src length=1000, susp offset=5000, susp length=1000

- explicit rules about the passage boundaries, for example:
- leading and trailing whitespace is never part of the plagiarized
passage, but
- leading and trailing punctuation is

- if the computing speed is included in the results at all (I am not sure
about it), it should account for possible parallelization, and should
not prevent the obvious optimizations (like caching the tokenized
data).

What do you think about it?

Have a nice day,

-Jan "Yenya" Kasprzak

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
Please don't top post and in particular don't attach entire digests to your
mail or we'll all soon be using bittorrent to read the list. --Alan Cox

Martin Potthast

Dec 18, 2012, 8:47:53 AM

to pan-workshop-series

Hi Arun,

thanks for joining again.

If you are affiliated with a university or company on whose behalf you
participate, it'd be nice to name them. If you are participating on
your own, it is not necessary.

Martin


--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.org

Martin Potthast

Dec 18, 2012, 3:34:41 PM

to pan-workshop-series

Hi Jan,

thanks for bringing this up. There's always room for discussion!

> As for the detailed (pair-wise?) task, can I suggest that we
> abandon plagdet, which has clear shortcomings for this task?
> See Cristian's poster from this year for a specific example,
> and our paper for the parameters we had to modify in order to achieve
> good results.

Regarding Cristian's comments, we have some counterarguments, which you
can find in last year's overview paper. In fact, the situation is not
as dire as it may seem.

Also, with regard to your parameter tuning, I still believe this was
due to errors in the corpus: some plagiarized passages were annotated
as one that really should have been two.

> We can discuss a scoring system here, or in a private mail,
> or I can create mailing list on our listserver, if needed.

As for alternative scorings: maybe we will not rank systems at all
this year (or only list them in alphabetical order). We may announce
the three best systems, and we're thinking of introducing alternative
ways of combining the measures into one, so that there's no single
measure people don't like, but many.

My suggestion is to focus on precision and recall, and granularity (if
you so choose).
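
For reference, plagdet combines the three measures as the F-measure of
precision and recall divided by log2(1 + granularity). A minimal sketch
of that combination follows; the function name and the example numbers
are made up for illustration and are not taken from the official
performance measure script:

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Combine precision, recall, and granularity into a single score."""
    if granularity < 1.0:
        raise ValueError("granularity is at least 1 by definition")
    if precision + recall == 0.0:
        return 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return f1 / math.log2(1.0 + granularity)


# Made-up numbers: same precision and recall, different granularity.
one_piece = plagdet(0.9, 0.8, 1.0)   # each case reported as one detection
two_pieces = plagdet(0.9, 0.8, 2.0)  # each case reported in two fragments
```

With a granularity of 1 the score equals the F-measure; reporting every
case in two fragments divides it by log2(3), roughly 1.58.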

> - no granularity at all

We will certainly continue to measure granularity. However, everyone
is free to choose which measure they prefer.

It's no use making the task easier by just dropping things that we
find too difficult. Rather, we should think about how we can make the
task more demanding and measure more intricate things.

> - algorithms tuned to the new scoring should work the same way
> on a corpus with 50 % of plagiarized documents as well as
> on real-world data with > 99 % of non-plagiarized
> documents. If possible, the scoring system should not induce
> a dependency on the corpus structure at all.

Every evaluation depends on the underlying corpus used. If some
algorithm detects more than it should (say, detections where there is
nothing to be detected), this harms its precision.
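
To illustrate with made-up numbers: assume a hypothetical detector that
finds 800 correct characters in every plagiarized document but also
reports 50 spurious characters in every document, plagiarized or not.
The sketch below uses a simplified character-level precision pooled
over the whole corpus, which is not the exact PAN measure; all names and
figures are assumptions.

```python
def corpus_precision(n_plagiarized: int, n_clean: int,
                     true_chars: int = 800, false_chars: int = 50) -> float:
    """Simplified character-level precision, pooled over a whole corpus."""
    n_docs = n_plagiarized + n_clean
    correct = n_plagiarized * true_chars           # correctly detected characters
    detected = correct + n_docs * false_chars      # plus spurious ones per document
    return correct / detected if detected else 0.0


balanced = corpus_precision(n_plagiarized=50, n_clean=50)   # 50 % plagiarized
realistic = corpus_precision(n_plagiarized=1, n_clean=99)   # 1 % plagiarized
```

Under these assumptions the same detector scores a much lower precision
on the mostly clean corpus, because the spurious detections accumulate
while the correct ones do not.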

Also, we're not classifying anything here, so the class imbalance has
no impact on what an algorithm does on any given pair of documents.
Either there is a pair of reused passages or there is not. If an
algorithm were able to grasp the semantics of a text perfectly, it
would be able to make a decision regardless of how often such cases occur.

> - assigning and evaluating the confidence value (0..100%) to the detections
> (or -100%..100% to _each_ part of the suspicious document).

I'm not entirely sure what you mean by this.

> - treating intrinsic plagiarism detections (those with no counterpart
> in source documents) as "better than nothing", which was not
> the case in previous years (given the low precision of the
> intrinsic detectors)

Intrinsic plagiarism detection is an extremely difficult task, much
more so than external plagiarism detection. However, what does "better
than nothing" mean? As far as I know, there was a clear distinction
between passages to be detected and others, and those who detected more
passages correctly than not were given higher scores.

But never mind, there probably won't be intrinsic plagiarism detection
next year.

> - maybe enforce better matching of the source and suspicious document
> passages. Currently, the following two results are
> given the same score:
>
> Gold standard:
> src offset=0, src length=1000, susp offset=0, susp length=1000
> src offset=5000, src length=1000, susp offset=5000, susp length=1000
>
> Results 1: the same as Gold standard
>
> Results 2:
> src offset=5000, src length=1000, susp offset=0, susp length=1000
> src offset=0, src length=1000, susp offset=5000, susp length=1000

Hold on; if the reference implementation does not drop these two as
non-detections, then there's a clear error in the implementation. Can
you come up with two XML-files that I can easily input into the
performance measure script to double-check this?
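
A rough sketch of the expected behaviour, using the offsets from the
example above: a detection should count for a gold-standard case only if
it overlaps that case in both the source and the suspicious document.
This is an illustration under that assumption, not the reference
implementation; the names Passage and detects are hypothetical.

```python
from typing import NamedTuple


class Passage(NamedTuple):
    src_offset: int
    src_length: int
    susp_offset: int
    susp_length: int


def _overlap(a_off: int, a_len: int, b_off: int, b_len: int) -> bool:
    """True if two half-open character ranges share at least one character."""
    return a_off < b_off + b_len and b_off < a_off + a_len


def detects(detection: Passage, case: Passage) -> bool:
    """A detection matches a case only if it overlaps it on BOTH sides."""
    return (_overlap(detection.src_offset, detection.src_length,
                     case.src_offset, case.src_length)
            and _overlap(detection.susp_offset, detection.susp_length,
                         case.susp_offset, case.susp_length))


gold = [Passage(0, 1000, 0, 1000), Passage(5000, 1000, 5000, 1000)]
results_1 = list(gold)
results_2 = [Passage(5000, 1000, 0, 1000), Passage(0, 1000, 5000, 1000)]

# For each gold case: is it matched by any detection of the run?
matched_1 = [any(detects(d, c) for d in results_1) for c in gold]
matched_2 = [any(detects(d, c) for d in results_2) for c in gold]
```

With this check, Results 1 matches both gold cases and Results 2 matches
neither, since each swapped detection overlaps a case on one side only.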

> - explicit rules about the passage boundaries: for example:
> - leading and trailing whitespace is never part of the plagiarized
> passage, but
> - leading and trailing punctuation is

Good point!
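
One way such a rule could be applied is to normalize every annotated or
detected passage before scoring. A minimal sketch, assuming passages are
given as character offset and length; the function name is hypothetical:

```python
from typing import Tuple


def trim_passage(text: str, offset: int, length: int) -> Tuple[int, int]:
    """Shrink (offset, length) so the passage neither starts nor ends in whitespace.

    Punctuation is left untouched, so it stays part of the passage.
    """
    start = min(offset, len(text))
    end = min(offset + length, len(text))
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end - start


text = '  "Reused text."  \nOriginal text.'
new_offset, new_length = trim_passage(text, 0, 18)
trimmed = text[new_offset:new_offset + new_length]
```

Here the surrounding blanks are dropped while the quotation marks and
the full stop remain inside the passage.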

> - if the computing speed is included in the results at all (I am not sure
> about it), it should account for possible parallelization, and should
> not prevent the obvious optimizations (like caching the tokenized
> data).

Yes, this year, you'll be given more freedom to do stuff. While the
task will still be as atomic as possible (say, given a pair of
documents...), you'll be able to cache stuff, etc.
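
As an illustration of the kind of caching meant here, the sketch below
memoizes tokenization so that a suspicious document compared against
many sources is tokenized only once. The tokenizer and the toy
comparison are assumptions for the example, not part of any PAN
software.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def tokenize(text: str) -> Tuple[str, ...]:
    """Lowercased whitespace tokens; cached so each distinct text is processed once."""
    return tuple(text.lower().split())


def shared_tokens(suspicious: str, source: str) -> int:
    """Toy pairwise comparison: number of distinct tokens the two texts share."""
    return len(set(tokenize(suspicious)) & set(tokenize(source)))


suspicious = "the quick brown fox"
sources = ["a quick brown dog", "the lazy fox"]
scores = [shared_tokens(suspicious, s) for s in sources]
```

The interface stays pairwise, yet the second comparison reuses the
cached tokens of the suspicious document. For large corpora one would
probably key the cache on a document identifier rather than on the full
text.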

Best,
Martin
