Hi Jan,
thanks for bringing this up. There's always room for discussion!
> As for the detailed (pair-wise?) task, can I suggest that we
> abandon plagdet, which has clear shortcomings for this task?
> See the Cristian's poster from this year for a specific example,
> and our paper for the parameters we had to modify in order to achieve
> good results.
Regarding Christian's comments, we have some counterarguments, which you
can find in last year's overview paper. In fact, the situation is not
as dire as it may seem.
Also, with regard to your parameter tuning, I still believe this was
due to errors in the corpus: some plagiarized passages really should
have been two instead of one.
> We can discuss a scoring system here, or in a private mail,
> or I can create mailing list on our listserver, if needed.
As for alternative scorings: maybe we will not rank systems at all
this year (or only list them in alphabetical order). We may announce the
three best systems, and we're thinking of introducing alternative ways of
combining the measures into one, so there's no single measure people
don't like, but many.
My suggestion is to focus on precision and recall, and granularity (if
you so choose).
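
For reference, plagdet as defined in the PAN overview papers divides the
F1 of precision and recall by log2(1 + granularity). A minimal Python
sketch of that combination, assuming the three measures have already been
computed; the function names are illustrative and this is not the
reference implementation:

```python
import math


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """F1 discounted by log2(1 + granularity).

    Granularity is at least 1; a value of exactly 1 (each case detected
    in one piece) leaves F1 unchanged because log2(2) == 1.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    return f1(precision, recall) / math.log2(1 + granularity)


if __name__ == "__main__":
    # Same precision and recall, increasingly fragmented detections.
    for gran in (1.0, 2.0, 3.0):
        print(gran, round(plagdet(0.8, 0.8, gran), 3))
```

Anyone who prefers a different combination can swap out plagdet() and
keep reporting the three underlying measures unchanged.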
> - no granularity at all
We will certainly continue to measure granularity. However, everyone
is free to choose which measure they prefer.
It's no use making the task easier by just dropping things that we
find too difficult. Rather, we should think about how we can make the
task more demanding and measure more intricate things.
> - algorithms tuned to the new scoring should work the same way
> on a corpus with 50 % of plagiarized documents as well as
> on a real-world data with > 99 % of non-plagiarized
> documents. If possible, the scoring system should not induce
> a dependency on the corpus structure at all.
Every evaluation depends on the underlying corpus used. If some
algorithm detects more than it should (say, detections where there is
nothing to be detected), this harms its precision.
Also, we're not classifying anything here, so the class imbalance has
no impact on what an algorithm does on any given pair of documents.
Either there is a pair of reused passages or there is not. If an
algorithm were able to perfectly grasp the semantics of a text, it
would be able to make a decision regardless of how often such cases occur.
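
To illustrate how spurious detections harm precision, here is a
simplified Python sketch. It assumes precision is averaged over
detections and, for brevity, looks only at character spans in the
suspicious document; the (start, end) span format, the function names,
and the 0.0 returned for an empty detection list are all assumptions of
this sketch, not the reference implementation:

```python
def covered(span, gold_spans):
    """Count characters of `span` that lie inside any gold span.

    Spans are half-open (start, end) character ranges with end > start.
    """
    start, end = span
    chars = set()
    for g_start, g_end in gold_spans:
        # range() is empty when the two spans do not overlap.
        chars.update(range(max(start, g_start), min(end, g_end)))
    return len(chars)


def macro_precision(detections, gold_spans):
    """Average, over detections, of the fraction of each detection
    that is actually plagiarized."""
    if not detections:
        return 0.0  # convention chosen for this sketch only
    fractions = [covered(d, gold_spans) / (d[1] - d[0]) for d in detections]
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    gold = [(0, 1000)]
    print(macro_precision([(0, 1000)], gold))                # exact detection
    print(macro_precision([(0, 1000), (3000, 4000)], gold))  # one spurious
    print(macro_precision([(3000, 4000)], []))               # nothing to find
```

A detection in a pair with nothing to be detected contributes a fraction
of zero, so it lowers the average no matter how many such pairs the
corpus contains.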
> - assigning and evaluating the confidence value (0..100%) to the detections
> (or -100%..100% to _each_ part of the suspicious document).
I'm not entirely sure what you mean by this.
> - treating intrinsic plagiarism detections (those with no counterpart
> in source documents) as "better than nothing", which was not
> the case in previous years (given the low precision of the
> intrinsic detectors)
Intrinsic plagiarism detection is an extremely difficult task, much
more so than external plagiarism detection. However, what does "better
than nothing" mean? As far as I know, there was a clear distinction
between passages to be detected and others, and those who detected more
passages correctly than not were given higher scores.
But never mind, there probably won't be intrinsic plagiarism detection
next year.
> - maybe enforce better matching of the source and suspicious document
> passages. Currently, the following two results are
> given the same score:
>
> Gold standard:
> src offset=0, src length=1000, susp offset=0, susp length=1000
> src offset=5000, src length=1000, susp offset=5000, susp length=1000
>
> Results 1: the same as Gold standard
>
> Results 2:
> src offset=5000, src length=1000, susp offset=0, susp length=1000
> src offset=0, src length=1000, susp offset=5000, susp length=1000
Hold on; if the reference implementation does not drop these two as
non-detections, then there's a clear error in the implementation. Can
you come up with two XML files that I can easily input into the
performance measure script to double-check this?
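
The intended behaviour can be sketched in Python as follows: a detection
should count for a gold case only if it overlaps that case on both the
source side and the suspicious side. The Case tuple and function names
are made up for this illustration, and this is not the code of the
performance measure script:

```python
from typing import NamedTuple


class Case(NamedTuple):
    src_offset: int
    src_length: int
    susp_offset: int
    susp_length: int


def overlap(a_off: int, a_len: int, b_off: int, b_len: int) -> int:
    """Characters shared by two [offset, offset + length) spans."""
    return max(0, min(a_off + a_len, b_off + b_len) - max(a_off, b_off))


def detects(d: Case, g: Case) -> bool:
    """True only if d overlaps g in the source AND in the suspicious text."""
    src = overlap(d.src_offset, d.src_length, g.src_offset, g.src_length)
    susp = overlap(d.susp_offset, d.susp_length, g.susp_offset, g.susp_length)
    return src > 0 and susp > 0


def detected_cases(detections, gold_cases) -> int:
    """Number of gold cases hit by at least one detection."""
    return sum(any(detects(d, g) for d in detections) for g in gold_cases)


GOLD = [Case(0, 1000, 0, 1000), Case(5000, 1000, 5000, 1000)]
RESULTS_1 = list(GOLD)
RESULTS_2 = [Case(5000, 1000, 0, 1000), Case(0, 1000, 5000, 1000)]

if __name__ == "__main__":
    print(detected_cases(RESULTS_1, GOLD))  # both cases matched
    print(detected_cases(RESULTS_2, GOLD))  # swapped pairs match nothing
```

Under this pairing rule, Results 2 from the example above hits no gold
case at all, whereas checking the two sides independently would let it
score the same as Results 1.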
> - explicit rules about the passage boundaries: for example:
> - leading and trailing whitespace are never part of the plagiarized
> passage, but
> - leading and trailing interpunction are
Good point!
> - if the computing speed is included in the results at all (I am not sure
> about it), it should account for possible parallelization, and should
> not prevent the obvious optimizations (like caching the tokenized
> data).
Yes, this year, you'll be given more freedom to do stuff. While the
task will still be as atomic as possible (say, given a pair of
documents...), you'll be able to cache stuff, etc.
Best,
Martin