PAN workshop: The Horse v the Car!

Mike Reddy

Mar 11, 2009, 5:27:04 AM

to Martin Potthast, pan09-co...@googlegroups.com

Thanks Martin,

I'd be glad to join the programme committee for the workshop.

Regarding the experiment and your observations:

"(i) the corpus will contain only artificial plagiarism (text from a randomly chosen document which is rewritten automatically in a half-random and half-knowledge-based process). So, you can imagine that, for a human, it will be fairly easy to find the plagiarism while reading, and
(ii) the mere amount of text (about 20 000 books) will make a manual approach on the whole corpus almost infeasible."

For (i), analysis of the text style is probably out then, but selecting "key phrases" for manual searches still seems relevant. If I choose phrases to search the book files and log the false positives and negatives, this would provide a comparison with automated systems that use text profiling and thumbnailing.

For (ii), my original approach was to search the web rather than books. I imagine that these texts, assuming they are actually real books, will to some degree be represented in whole or in part on the internet. Copying from books has, mistakenly, been seen as "safe" when using automated detection services, which generally focus on web pages with some publisher databases. Many cases that I have worked on have involved identifying suspect material through web stubs (such as sample chapters or secondary citations) and then tracking back to the paper book used by the student (often in the uni library!). So, it might still be a feasible exercise.

I'd appreciate your thoughts.

--
Dr. Mike Reddy, Future Technology, Games Development and A.I., Department of Computing, Newport Business School, University of Wales, Newport, Allt-yr-yn Campus, PO Box 180 Newport South Wales NP20 5DA

Technoleg y Dyfodol, Datblygu Gemau a D.A., Yr Adran Gyfrifiadureg, Ysgol Fusnes Casnewydd, Prifysgol Cymru, Casnewydd, Campws Allt-yr-ynn, Blwch Post 180, Casnewydd, De Cymru NP20 5DA

Tel/Ffôn: +44 (0)1633 432452 Fax/Ffacs: +44 (0)1633 432307 Mobile/Symudol: +44 (0)7971 170 199
Email/Ebost: mike.reddy @ newport.ac.uk (remove spaces/dilëwch y bylchau)

-----Original Message-----
From: martin....@googlemail.com [mailto:martin....@googlemail.com] On Behalf Of Martin Potthast
Sent: 10 March 2009 16:38
To: Mike Reddy
Cc: pa...@webis.de
Subject: Re: PAN workshop

Dear Mike,

thank you for your inquiry!

> I would be very interested in either being a judge/referee for the exciting auto detection competition as well as

We appreciate your offer, and we invite you to join the program committee of the workshop.

> offering a "yardstick" by challenging all the automatic systems by performing an independent manual search. While laborious, this would provide a much needed comparison and contrast to the automated systems and question/justify the need to have them.

We believe this would be a very interesting experiment and we encourage you to do it. However, we see two problems:
(i) the corpus will contain only artificial plagiarism (text from a randomly chosen document which is rewritten automatically in a half-random and half-knowledge-based process). So, you can imagine that, for a human, it will be fairly easy to find the plagiarism while reading, and
(ii) the mere amount of text (about 20 000 books) will make a manual approach on the whole corpus almost infeasible.

Nevertheless, we invite you to perform an analysis of the corpus and perhaps publish your results at the workshop.

Don't hesitate to ask us any questions, and be on the lookout for updates on the workshop Web page.

Best regards,
Martin

--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de

If you do things right, people won't be sure you've done anything at all.

Martin Potthast

Mar 11, 2009, 7:00:57 AM

to pan09-co...@googlegroups.com

Hi Mike,

"(i) the corpus will contain only artificial plagiarism (text from a randomly chosen document which is rewritten automatically in a half-random and half-knowledge-based process). So, you can imagine that, for a human, it will be fairly easy to find the plagiarism while reading, and
(ii) the mere amount of text (about 20 000 books) will make a manual approach on the whole corpus almost infeasible."

For (i) analysis of the text style is probably out then, but selection of "key phrases" for manual searches seems still relevant. If we imagine that I choose phrases to search the book files and log the false positives and negatives, then this would provide comparison with automated systems that use text profiling and thumbnailing.

This is exactly what humans or basic algorithms for plagiarism detection do. And it will work if you are lucky enough to select an n-gram from a plagiarized passage and we have not obfuscated this particular n-gram by reordering, replacing, deleting, or inserting words.
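To make the n-gram point concrete, here is a minimal sketch of exact word n-gram matching. It is only an illustration: the function names, the sample sentences, and the choice of word trigrams are assumptions made for this example, not the corpus construction or any detector discussed in this thread.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(suspicious, source, n=3):
    """Return the n-grams that occur verbatim in both texts."""
    return word_ngrams(suspicious, n) & word_ngrams(source, n)


# Hypothetical example texts.
source = "the quick brown fox jumps over the lazy dog"
copied = "he wrote that the quick brown fox jumps over the lazy dog"
# Same passage after reordering and replacing a few words.
obfuscated = "he wrote that the brown quick fox leaps over a lazy dog"

print(len(shared_ngrams(copied, source)))      # verbatim copy: every source trigram matches
print(len(shared_ngrams(obfuscated, source)))  # light obfuscation: no trigram survives
```

With verbatim copying, any trigram picked from the passage finds the source. After a few word swaps, none of the trigrams match, so a search phrase only helps if it happens to fall on an unmodified stretch.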

Apropos: we have done research on the number of n-grams a human or a piece of software has to select from a suspicious document in order to hit a plagiarized passage with high probability [1]. Our model is based on assumptions about the length of the document and the percentage of plagiarism within it. It may be useful for this.
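As a back-of-the-envelope illustration of this kind of estimate (not the model from [1]), assume that n-grams are picked independently and uniformly at random from the document, and that a fixed fraction of them lies inside plagiarized, unobfuscated text. Under those simplifying assumptions the chance of at least one hit after k picks is 1 - (1 - p)^k. The function names below are hypothetical.

```python
import math


def hit_probability(plagiarized_fraction, samples):
    """Probability that at least one of `samples` independent picks is plagiarized."""
    return 1.0 - (1.0 - plagiarized_fraction) ** samples


def samples_needed(plagiarized_fraction, target_probability):
    """Smallest number of picks whose hit probability reaches the target."""
    if not 0.0 < plagiarized_fraction < 1.0:
        raise ValueError("plagiarized_fraction must be strictly between 0 and 1")
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must be strictly between 0 and 1")
    return math.ceil(
        math.log(1.0 - target_probability) / math.log(1.0 - plagiarized_fraction)
    )


# Assumed example: 5% of the document is plagiarized, and we want a 95% chance of a hit.
print(samples_needed(0.05, 0.95))  # → 59
```

So even with only 5% plagiarized text, a few dozen well-spread phrase searches per document would be likely to land on a copied passage, provided the selected phrase has not been obfuscated. Repeating that over thousands of documents is where the manual effort adds up.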

For (ii) my original approach was to search the web, rather than books. If we imagine that these texts, assuming that they are actually real books, will to some degree be represented in whole or part on the internet. Copying from books has, mistakenly, been seen as "safe" when using automated detection services, which generally focus on web pages with some publisher databases. Many cases that I have worked on have involved identifying suspect material, by web stubs (such as sample chapters or secondary citation) and then tracking back to the paper book used by the student (often in the uni library!). So, it might still be a feasible exercise.

Yes, you will find the books online, and I guess they are also indexed by search engines. However, there is still the above issue that you need to find a reliable means to select n-grams from the inserted text.

Anyway, this is a very interesting experiment: we may see humans fail because of the scale of the documents to be analyzed, or we may see that humans can still beat software. We will assist you with this if we can. When the corpus is released you should reconsider whether it is possible to participate like this.

Best,
Martin

[1] http://www.uni-weimar.de/medien/webis/publications/downloads/papers/stein_2007f.pdf
