I'd be glad to join the programme committee for the workshop.
Regarding the experiment and your observations:
"(i) the corpus will contain only artificial plagiarism (text from a randomly chosen document which is rewritten automatically in a half-random and half-knowledge-based process). So, you can imagine that, for a human, it will be fairly easy to find the plagiarism while reading, and
(ii) the mere amount of text (about 20 000 books) will make a manual approach on the whole corpus almost infeasible."
For (i), analysis of the text style is probably out then, but selecting "key phrases" for manual searches still seems relevant. If I choose phrases, search the book files for them, and log the false positives and negatives, this would provide a comparison with automated systems that use text profiling and thumbnailing.
For (ii), my original approach was to search the web rather than books. Assuming these texts are actually real books, they will to some degree be represented, in whole or in part, on the internet. Copying from books has mistakenly been seen as "safe" when automated detection services are used, since these generally focus on web pages plus some publisher databases. Many cases that I have worked on involved identifying suspect material through web stubs (such as sample chapters or secondary citations) and then tracking back to the paper book used by the student (often in the uni library!). So it might still be a feasible exercise.
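As a rough illustration of the key-phrase idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy documents and the ground-truth set are invented and are not part of the PAN corpus or of any existing tool. It searches each document for the chosen phrases and then logs where the result disagrees with a known annotation.

```python
from typing import Dict, Iterable, Set, Tuple


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks do not hide a match."""
    return " ".join(text.lower().split())


def search_phrases(docs: Dict[str, str], phrases: Iterable[str]) -> Set[str]:
    """Return the ids of documents containing at least one key phrase."""
    needles = [normalise(p) for p in phrases]
    return {
        doc_id
        for doc_id, text in docs.items()
        if any(n in normalise(text) for n in needles)
    }


def log_errors(flagged: Set[str], truth: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split the disagreement with the annotation into (false positives, false negatives)."""
    return flagged - truth, truth - flagged


# Toy data, invented for illustration only.
docs = {
    "book_a": "The quick brown fox\njumps over the lazy dog.",
    "book_b": "An entirely unrelated passage about cooking.",
    "book_c": "A quick brown fox is mentioned here in passing.",
}
truth = {"book_a", "book_b"}  # hypothetical annotation of plagiarised books

flagged = search_phrases(docs, ["quick brown fox"])
false_pos, false_neg = log_errors(flagged, truth)
print(sorted(flagged), sorted(false_pos), sorted(false_neg))
```

The same two error sets could be computed for an automated detector's output, which would give the side-by-side comparison described above.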
I'd appreciate your thoughts.
--
Dr. Mike Reddy, Future Technology, Games Development and A.I., Department of Computing, Newport Business School, University of Wales, Newport, Allt-yr-yn Campus, PO Box 180 Newport South Wales NP20 5DA
Technoleg y Dyfodol, Datblygu Gemau a D.A., Yr Adran Gyfrifiadureg, Ysgol Fusnes Casnewydd, Prifysgol Cymru, Casnewydd, Campws Allt-yr-ynn, Blwch Post 180, Casnewydd, De Cymru NP20 5DA
Tel/Ffôn: +44 (0)1633 432452 Fax/Ffacs: +44 (0)1633 432307 Mobile/Symudol: +44 (0)7971 170 199
Email/Ebost: mike.reddy @ newport.ac.uk (remove spaces/dilëwch y bylchau)
-----Original Message-----
From: martin....@googlemail.com [mailto:martin....@googlemail.com] On Behalf Of Martin Potthast
Sent: 10 March 2009 16:38
To: Mike Reddy
Cc: pa...@webis.de
Subject: Re: PAN workshop
Dear Mike,
thank you for your inquiry!
> I would be very interested in either being a judge/referee for the exciting auto detection competition as well as
We appreciate your offer, and we invite you to join the program committee of the workshop.
> offering a "yardstick" by challenging all the automatic systems by performing an independent manual search. While laborious, this would provide a much needed comparison and contrast to the automated systems and question/justify the need to have them.
We believe this would be a very interesting experiment and we encourage you to do it. However, we see two problems:
(i) the corpus will contain only artificial plagiarism (text from a randomly chosen document which is rewritten automatically in a half-random and half-knowledge-based process). So, you can imagine that, for a human, it will be fairly easy to find the plagiarism while reading, and
(ii) the mere amount of text (about 20 000 books) will make a manual approach on the whole corpus almost infeasible.
Nevertheless, we invite you to perform an analysis of the corpus and perhaps publish your results at the workshop.
Don't hesitate to ask us any questions, and be on the lookout for updates on the workshop Web page.
Best regards,
Martin
--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de
If you do things right, people won't be sure you've done anything at all.