plagiarized sections


bar...@gmail.com

Apr 17, 2009, 6:21:39 AM
to PAN'09 Competition on Plagiarism Detection
Hi,
It seems to me that plagiarized sections are inserted into the original
text without any regularity.
For example, a plagiarized text can be a whole paragraph, but
sometimes it is only part of a paragraph (several consecutive
sentences within a paragraph).
Furthermore, it can sometimes start in the middle of a word or sentence
(take a look at suspicious-document00003.txt, for example).
This is not something that is common in human plagiarism...
Is that really the case in this corpus, or am I missing something
here?

Barak

Martin Potthast

Apr 17, 2009, 7:02:45 AM
to pan09-co...@googlegroups.com
Hi Barak,
 
> It seems to me that plagiarized sections are inserted into the original
> text without any regularity.

If there were any obvious regularity, then it would be too easy, wouldn't it?
 
> For example, a plagiarized text can be a whole paragraph, but
> sometimes it is only part of a paragraph (several consecutive
> sentences within a paragraph).

We have created plagiarized passages of various lengths, so it may very well be that there are sometimes only a few sentences, and sometimes many paragraphs. Also, the plagiarism was inserted at random into the suspicious documents.
 
> Furthermore, it can sometimes start in the middle of a word or sentence
> (take a look at suspicious-document00003.txt, for example).
> This is not something that is common in human plagiarism...

About plagiarism that starts in the middle of a word: you may have found a bug we haven't. But if this is a bug, I wouldn't consider it critical, since one or two broken words are not much compared to long texts.

@Andreas: Can you check on this?

About plagiarism that starts in the middle of a sentence: a real plagiarist would have written more or less correct grammar, but, alas, we only have a random plagiarist with a limited number of text operations.
 
Best,
Martin


--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de

If you do things right, people won't be sure you've done anything at all.

vladislav....@gmail.com

Apr 17, 2009, 11:03:25 AM
to PAN'09 Competition on Plagiarism Detection
Hello!

Probably similar questions:

1. Is there any threshold on what to consider plagiarism? A single
three-word sentence -- is it plagiarism?
2. In many documents there is a sentence -- "Charles Franks and the
Online Distributed Proofreading Team." Is it considered plagiarism?

Thank you,
Vlad


Martin Potthast

Apr 17, 2009, 6:00:06 PM
to pan09-co...@googlegroups.com
Hi again,

> Probably similar questions:
>
> 1. Is there any threshold on what to consider plagiarism? A single
> three-word sentence -- is it plagiarism?

There is a lower bound, and I think it is not lower than 30 words.
 
> 2. In many documents there is a sentence -- "Charles Franks and the
> Online Distributed Proofreading Team." Is it considered plagiarism?

Ooops, we didn't notice that. No, only the plagiarism cases that we annotated are considered plagiarism; everything else is not.

@Andreas: Please make a note to add some small checks for extreme cases of accidental overlap for the competition corpus.

Best,
Martin
 

vladislav....@gmail.com

Apr 18, 2009, 6:32:48 AM
to PAN'09 Competition on Plagiarism Detection
Hi Martin!

Thanks for your answer.

> > 2. In many documents there is a sentence -- "Charles Franks and the
> > Online Distributed Proofreading Team." Is it considered plagiarism?
>
> Ooops, we didn't notice that. No, only the plagiarism cases which were
> annotated by us are considered plagiarism; everything else is not.

But not all documents have annotations, for example: 00037, 00030,
00066, etc.
This leads to another confusion -- how to calculate the measures?

<?xml version="1.0" encoding="UTF-8"?>
<document
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://www.uni-weimar.de/medien/webis/research/corpora/pan-pc-09/document.xsd"
    reference="suspicious-document00066.txt">
<!-- Use tags like the one below to annotate plagiarism you detected. -->
<!-- <feature name="detected-plagiarism" this_offset="5"
     this_length="1000" source_reference="source-documentx.txt"
     source_offset="100" source_length="1000"/> -->
</document>
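
The commented-out tag in this template shows the shape a reported detection is expected to take. A minimal sketch of producing such a file, assuming detections are held as plain dictionaries; the function name `build_detection_xml` and the `FIELDS` tuple are hypothetical, and the `xsi` schema attributes from the template are left out for brevity:

```python
import xml.etree.ElementTree as ET

# Attribute names copied from the commented-out template tag.
FIELDS = ("this_offset", "this_length", "source_reference",
          "source_offset", "source_length")


def build_detection_xml(suspicious_name, detections):
    """Serialize detections (dicts keyed by FIELDS) as a <document> element."""
    root = ET.Element("document", {"reference": suspicious_name})
    for det in detections:
        attrs = {"name": "detected-plagiarism"}
        # XML attribute values must be strings, so numbers are converted.
        attrs.update((field, str(det[field])) for field in FIELDS)
        ET.SubElement(root, "feature", attrs)
    return ET.tostring(root, encoding="unicode")


example = build_detection_xml(
    "suspicious-document00066.txt",
    [{"this_offset": 5, "this_length": 1000,
      "source_reference": "source-documentx.txt",
      "source_offset": 100, "source_length": 1000}],
)
print(example)
```

Passing an empty list yields a `<document>` element with no `<feature>` children, which mirrors the empty file shown above.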

Thank you,
Vlad

J.A. Palkovskii Plagiarism-Detector Project Leading Programmer

Apr 18, 2009, 7:42:03 AM
to PAN'09 Competition on Plagiarism Detection
Hello Martin,
Hello Vlad,

I have cross-checked this issue on my side:
the files in question DO both exist:

suspicious-document00030.txt
suspicious-document00030.xml

and

suspicious-document00037.txt
suspicious-document00037.xml

But the description files (the *.xml ones) are empty: they contain only
the header, and no plagiarism sections are described.

======================================================
<?xml version="1.0" encoding="UTF-8"?>
<document
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://www.uni-weimar.de/medien/webis/research/corpora/pan-pc-09/document.xsd"
    reference="suspicious-document00030.txt">
<!-- Use tags like the one below to annotate plagiarism you detected. -->
<!-- <feature name="detected-plagiarism" this_offset="5"
     this_length="1000" source_reference="source-documentx.txt"
     source_offset="100" source_length="1000"/> -->
</document>
======================================================

I've come to the logical conclusion that, according to the .xml
description, they do NOT contain any plagiarism from sources and thus
have no plagiarism sections with offsets etc.

Either this is intentional (a document with no plagiarism) and I'm
correct, or Vlad is correct and these files lack descriptions due to
the plagiarism generation programme.

Please clarify the issue!
Looking forward to your reply!



Martin Potthast

Apr 18, 2009, 4:52:53 PM
to pan09-co...@googlegroups.com
> I've come to the logical conclusion that, according to the .xml
> description, they do NOT contain any plagiarism from sources and thus
> have no plagiarism sections with offsets etc.

This is the right conclusion.
 
> Either this is intentional (a document with no plagiarism) and I'm
> correct, or Vlad is correct and these files lack descriptions due to
> the plagiarism generation programme.

It is intentional, so the empty description files show that there is nothing to be found.
Observe that the files to be analyzed are called suspicious documents, but suspicions are not necessarily true.
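
A minimal sketch of how a participant might tell such "nothing to be found" files apart from annotated ones, assuming annotated passages appear as `<feature>` elements carrying `this_offset` and `this_length` attributes as in the template comment. The helper name `annotated_passages` is hypothetical, and the exact `name` value used in the organizers' annotations is not assumed here:

```python
import xml.etree.ElementTree as ET

# Header-only description file as posted in this thread: the only
# <feature> tag sits inside an XML comment, so it is not an element.
EMPTY_ANNOTATION = """<?xml version="1.0" encoding="UTF-8"?>
<document
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://www.uni-weimar.de/medien/webis/research/corpora/pan-pc-09/document.xsd"
    reference="suspicious-document00030.txt">
<!-- <feature name="detected-plagiarism" this_offset="5"
     this_length="1000" source_reference="source-documentx.txt"
     source_offset="100" source_length="1000"/> -->
</document>
"""


def annotated_passages(xml_text):
    """Return (offset, length) for every real <feature> element."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    passages = []
    for feature in root.iter("feature"):
        offset = feature.get("this_offset")
        length = feature.get("this_length")
        if offset is not None and length is not None:
            passages.append((int(offset), int(length)))
    return passages


print(annotated_passages(EMPTY_ANNOTATION))  # prints []
```

The default `ElementTree` parser skips comments, so the template tag inside `<!-- ... -->` is not mistaken for an annotation.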


Best,
Martin


vladislav....@gmail.com

Apr 18, 2009, 5:52:24 PM
to PAN'09 Competition on Plagiarism Detection
Hi Martin!

Huh..

That definitely simplifies the task, but it raises another question.
For example, document 00066 has the following parts in common with the
source documents:
source-document00278.txt s something to have been the
source-document00381.txt s nothing left to do but to
source-document00824.txt nd, there is nothing left to
source-document00878.txt e, it were better we should
source-document01611.txt eet. There is nothing left
source-document04223.txt of the young bridegroom
source-document05053.txt eats the bread and drinks the
source-document05456.txt s nothing left to do but t
source-document05553.txt love "which moves the sun and
source-document05872.txt n stalks among the reeds
source-document06541.txt The dawn is rising from the sea, like a white lady from her bed
source-document06609.txt The universe itself shall

or, even more so, 00030:
source-document00513.txt a general resemblance to the
source-document00524.txt ing in the opposite direction.
source-document00537.txt I should think they would be
source-document00538.txt received the congratulations of the
source-document00538.txt The final arrangements were
source-document00555.txt mission, the most important
source-document00555.txt in the opposite direction
source-document00555.txt it was absolutely necessary to
source-document00561.txt t my first acquaintance
source-document00577.txt is better than the original "
source-document00577.txt you will be glad to hear--th
source-document00606.txt ing in the opposite direction.
source-document00639.txt according to his lights. He
source-document00648.txt in the opposite direction.
source-document00649.txt appeared in the doorway,
source-document00655.txt better than the original
source-document00669.txt the separation between
source-document00670.txt through the darkness. "The
source-document00676.txt in the opposite direction.
source-document00680.txt in the opposite direction,
source-document00680.txt first acquaintance with them
source-document00680.txt final arrangements were made,
source-document00699.txt e as comfortable as circumstances
source-document00712.txt ground, the village street,
source-document00712.txt in the opposite direction.
etc......

So, as I understand you, these shouldn't be considered plagiarism, and
this means we should use the stated lower bound of 30 words. Is that
right?

I have about 10% left, so I need to find out how to refine the results
to make them fit the requirements.

Thanks for your prompt replies,
Vlad


Martin Potthast

Apr 20, 2009, 3:41:59 AM
to pan09-co...@googlegroups.com
Hi Vlad,

> That definitely simplifies the task, but it raises another question.
> For example, document 00066 has the following parts in common with the
> source documents:
> source-document00278.txt        s something to have been the
> source-document00381.txt        s nothing left to do but to
> source-document00824.txt        nd, there is nothing left to
> source-document00878.txt        e, it were better we should
> [....]
>
> So, as I understand you, these shouldn't be considered plagiarism, and
> this means we should use the stated lower bound of 30 words. Is that
> right?

I wouldn't say you should cut off everything below 30 words; that wouldn't be realistic. What if someone copies another person's main idea, which happens to fit into one sentence?
I would say that you should think about how to define "plagiarism": do you consider the above cases plagiarism? Depending on your answer to this question, your system should report these cases or not.

Anyway, you should detect only what we annotated and nothing more; otherwise your precision will likely suffer.
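
Since a hard cut-off is discouraged here, one option is to keep the minimum length as a tunable parameter and see how precision reacts. A minimal sketch, assuming detections are `(offset, length)` character spans into the suspicious text; the function name `filter_short_detections` and the `min_words` parameter are hypothetical, and the default of 30 only echoes the approximate lower bound mentioned earlier in the thread:

```python
def filter_short_detections(text, detections, min_words=30):
    """Keep (offset, length) spans whose text has at least min_words words."""
    kept = []
    for offset, length in detections:
        passage = text[offset:offset + length]
        # Whitespace-separated tokens serve as a rough word count.
        if len(passage.split()) >= min_words:
            kept.append((offset, length))
    return kept


# Toy text: 40 repetitions of "word " (200 characters).
toy_text = "word " * 40
candidates = [(0, 25), (0, 200)]  # 5 words and 40 words respectively
print(filter_short_detections(toy_text, candidates))  # prints [(0, 200)]
```

Lowering `min_words` lets short matches such as "in the opposite direction" through again, so the value is a trade-off to tune rather than a fixed rule.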
 
> I have about 10% left, so I need to find out how to refine the results
> to make them fit the requirements.

Sorry I was late this time, but I was ill.

Best,
Martin

aei...@dsic.upv.es

Apr 20, 2009, 12:36:59 PM
to pan09-co...@googlegroups.com
Hi Martin :)

>> Furthermore, it can sometimes start in the middle of a word or sentence
>> (take a look at suspicious-document00003.txt, for example).
>> This is not something that is common in human plagiarism...
>
> @Andreas: Can you check on this?

I will check this as soon as I'm back in Valencia (April 23).

Best wishes from Granada - Andreas



vladislav....@gmail.com

Apr 20, 2009, 6:46:36 PM
to PAN'09 Competition on Plagiarism Detection
Martin, thanks for your prompt responses! Now it seems it's clear :)

Hope you're feeling better now.


vladislav....@gmail.com

Apr 20, 2009, 7:45:05 PM
to PAN'09 Competition on Plagiarism Detection
I'm sorry Martin. One more question regarding measures calculation --

chars of correctly detected passages

May I assume that a "correctly detected passage" is one that
overlaps with some passage from the annotation?

Thank you,
Vlad


Martin Potthast

Apr 21, 2009, 2:30:55 AM
to pan09-co...@googlegroups.com
Hi Vlad,


> I'm sorry Martin. One more question regarding measures calculation --
>
> chars of correctly detected passages
>
> May I assume that a "correctly detected passage" is one that
> overlaps with some passage from the annotation?

Exactly. However, only the chars which actually overlap with the plagiarized passage will be counted. For instance, if you report a passage from letter 0 to letter 1001 as plagiarized, and there is a plagiarized passage from letter 1000 to letter 2000, then only one letter of plagiarism has been detected correctly. All the other letters from 0 to 1000 are false positives.
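
The character-level counting described in this example can be sketched as follows. This is only an illustration, not the official evaluation script: spans are assumed to be half-open `(start, end)` character ranges, annotated passages are assumed not to overlap one another, and the function names `overlap_chars` and `score_detection` are hypothetical.

```python
def overlap_chars(detected, annotated):
    """Number of characters shared by two half-open (start, end) spans."""
    start = max(detected[0], annotated[0])
    end = min(detected[1], annotated[1])
    return max(0, end - start)


def score_detection(detected, annotations):
    """Return (correct_chars, false_positive_chars) for one detected span.

    Assumes the annotated spans do not overlap each other.
    """
    correct = sum(overlap_chars(detected, a) for a in annotations)
    total = detected[1] - detected[0]
    return correct, total - correct


# The example from the reply: detection 0..1001, annotation 1000..2000.
print(score_detection((0, 1001), [(1000, 2000)]))  # prints (1, 1000)
```

Summing the two values over all detections would give the character counts from which precision can be derived; how the official measures aggregate them is not specified in this thread.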