Hi Paul,
Let me give you a short breakdown of how we reached our decisions:
I am not entirely sure what data is being generated/provided for the
competition, but recently I have been working on building a more
realistic corpus of plagiarism examples (i.e. simulated plagiarism).
Why? Because I feel that something which is artificial may not
provide a suitable benchmark to work on (i.e. not exhibit the range of
plagiarism which is exhibited in "real" examples).
You are right, real plagiarism cases would be preferable.
We have considered assembling the real plagiarism cases we are aware of, but publishing them in a corpus would amount to denouncing the authors permanently. After how long does a plagiarism case become time-barred, anyway?
Also, we don't have the license to publish these texts and therefore would need to ask the original authors/publishers of both the text containing the plagiarism and the source document. You can imagine who would agree and who would decline.
So, simulated plagiarism: We intended to do some crowdsourcing, but we would consider the results unreliable and unrepresentative unless we reached a significant scale. Also, there is the issue that the most we could ask for is to rewrite a short text, and that the texts would have to be easy enough to be understood by an average Web surfer. Finally, this would take quite some time.
Nevertheless, one has to start somewhere: Whether simulated cases would be useful in this competition depends on how many you can offer. We'd be happy to collaborate with you in building a larger corpus of simulated plagiarism.
In addition, I have a collection of news agency and newspaper articles
(the METER corpus) which I have used for studying text reuse in
journalism (an example of benign plagiarism). This might provide a
useful set of examples to study aspects of plagiarism such as text
editing (e.g. paraphrasing).
We have considered news articles (including your corpus) and a number of other potential sources. Our criteria were scale (number of documents), representativeness (document lengths, different topics, etc.) and, last but not least, the amount of manual effort needed to clean and annotate the corpus. Finally, there is again the issue of obtaining the license to publish the texts. Some are free, some are not; some authors may have no problem being part of a plagiarism corpus, some do.
Apropos corpora containing paraphrases: the revisions of Wikipedia articles are a large-scale source of documents which are constantly rewritten. We have used this corpus earlier to evaluate near-duplicate detection algorithms [1,2]. But since Wikipedia articles are not monographs, they are not suitable for the intrinsic plagiarism detection task.
In the end we decided to construct artificial plagiarism, and to take books whose copyright has expired as a basis.
So can someone provide more information on the kind of data which will
be provided?
I'll give you an idea of the corpus statistics:
- Corpus size: 22000 documents (union of source documents and documents containing plagiarism)
- Plagiarism length: short, medium, long
- Fraction of plagiarized text: 0%-100% (higher fractions with decreasing probability)
- Document lengths: small (up to paper size), medium, large (up to book size)
- Plagiarism types:
  - Monolingual: modification degrees: none, low, high
  - Multilingual: translation type: automatic
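To make the list above a little more concrete, here is a minimal sketch of how one parameter set per artificial case might be drawn. It is an illustration only: the function name, the uniform choices, the 10% multilingual share, and the squared-uniform draw for the plagiarized fraction are all assumptions, not a description of how the corpus is actually built.

```python
import random

# Category labels taken from the statistics listed above.
PLAGIARISM_LENGTHS = ["short", "medium", "long"]
DOCUMENT_LENGTHS = ["small", "medium", "large"]
MODIFICATION_DEGREES = ["none", "low", "high"]


def sample_case_parameters(rng, multilingual_share=0.1):
    """Draw one hypothetical parameter set for an artificial plagiarism case.

    `multilingual_share` is an assumed value; the real split between
    monolingual and multilingual cases is not stated.
    """
    # Squaring a uniform draw skews the result toward 0, so higher
    # fractions are less likely. This is only one possible reading of
    # "higher fractions with decreasing probability".
    fraction = rng.random() ** 2

    if rng.random() < multilingual_share:
        plagiarism_type = {"type": "multilingual", "translation": "automatic"}
    else:
        plagiarism_type = {
            "type": "monolingual",
            "modification": rng.choice(MODIFICATION_DEGREES),
        }

    return {
        "document_length": rng.choice(DOCUMENT_LENGTHS),
        "plagiarism_length": rng.choice(PLAGIARISM_LENGTHS),
        "plagiarized_fraction": fraction,
        "plagiarism_type": plagiarism_type,
    }


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the draws are repeatable
    for _ in range(3):
        print(sample_case_parameters(rng))
```

Passing in a seeded `random.Random` instance keeps the draws reproducible, which would matter if the same corpus had to be regenerated later.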
Best,
Martin
[1] http://www.uni-weimar.de/medien/webis/publications/downloads/papers/stein_2008d.pdf
[2] http://www.uni-weimar.de/cms/medien/webis/research/corpora.html#c32888
--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de
If you do things right, people won't be sure you've done anything at all.