It seems to me that plagiarized sections are inserted into the original
text without any regularity.
For example, a plagiarized text can be a whole paragraph, but
sometimes it is only part of a paragraph (several consecutive
sentences within a paragraph).
Furthermore, it can sometimes start in the middle of a word/sentence
(take a look at suspicious-document00003.txt for example).
This is not something that is common in human plagiarism...
Probably similar questions:
1. Is there any threshold on what to consider plagiarism? Is a single
three-word sentence plagiarism?
2. In many documents there is a sentence -- "Charles Franks and the
Online Distributed Proofreading Team." Is it considered plagiarism?
I've come to the logical conclusion that, according to the .xml
description,
they do NOT contain any plagiarism from the sources and thus they have
no plagiarism sections with offset etc.
Either this is intentional (i.e. a doc. with no plagiarism) and I'm
correct,
or Vlad is correct and these files lack a description due to the
plagiarism generation programme.
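To make the check above concrete, here is a minimal sketch of how one might test whether an annotation file lists any plagiarized sections. The element name `feature` and the attribute names `this_offset` and `this_length` are assumptions about the .xml layout, not something confirmed in this thread, and the two sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def plagiarism_sections(xml_text):
    """Return (offset, length) pairs for annotated sections.

    An empty list means the annotation describes no plagiarized section.
    """
    root = ET.fromstring(xml_text)
    sections = []
    for feature in root.iter("feature"):
        # Assumed attribute names; adjust them to the real annotation schema.
        if "this_offset" in feature.attrib and "this_length" in feature.attrib:
            sections.append(
                (int(feature.get("this_offset")), int(feature.get("this_length")))
            )
    return sections


# Hypothetical annotation contents, for illustration only.
clean = '<document reference="a.txt"><feature name="about" lang="en"/></document>'
annotated = (
    '<document reference="b.txt">'
    '<feature name="plagiarism" this_offset="120" this_length="800"/>'
    "</document>"
)

print(plagiarism_sections(clean))
print(plagiarism_sections(annotated))
```

An empty result only says that the annotation lists no sections; it cannot tell an intentionally clean document from one whose description is missing.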
That definitely simplifies the task, but it raises another question.
For example, document 0066 has the following parts in common with the
source documents:
source-document00278.txt s something to have been the
source-document00381.txt s nothing left to do but to
source-document00824.txt nd, there is nothing left to
source-document00878.txt e, it were better we should
[....]
So, as I understand you, these shouldn't be considered plagiarism, and
this means we should use the stated lower bound of 30 words? Is that
right?
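A minimal sketch of such a filter, assuming the lower bound really is 30 words and that words are counted by splitting on whitespace (both are assumptions, not confirmed rules). The names `MIN_WORDS`, `long_enough` and `filter_candidates` are hypothetical.

```python
MIN_WORDS = 30  # lower bound mentioned in this thread; an assumption until confirmed


def long_enough(passage, min_words=MIN_WORDS):
    """True if the passage has at least min_words whitespace-separated tokens."""
    return len(passage.split()) >= min_words


def filter_candidates(candidates, min_words=MIN_WORDS):
    """Keep only (source_name, matched_text) pairs that reach the word threshold."""
    return [(src, text) for src, text in candidates if long_enough(text, min_words)]


# Two of the short common parts quoted above; both fall well below 30 words.
short_matches = [
    ("source-document00278.txt", "s something to have been the"),
    ("source-document00381.txt", "s nothing left to do but to"),
]
print(filter_candidates(short_matches))
```

With this threshold, short common phrases like the ones listed for document 0066 are discarded before any scoring.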
I have about 10% left, so I need to find out how to refine the results
to make them fit the requirements.
>> Furthermore, it can sometimes start in the middle of a word/sentence
>> (take a look at suspicious-document00003.txt for example).
>> This is not something that is common in human plagiarism...
>
> @Andreas: Can you check on this?
I will check this as soon as I'm back in Valencia (23.4).
Best wishes from Granada - Andreas
I'm sorry, Martin. One more question regarding the calculation of the
measures -- chars of correctly detected passages.
May I assume that a "correctly detected passage" is one that
overlaps with some passage from the annotation?
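To spell out the reading proposed in this question, here is a minimal sketch that treats a detection as correct if it shares at least one character with an annotated passage. This is only the interpretation being asked about, not a confirmed definition of the measure. Passages are represented as (offset, length) pairs, and the function names are hypothetical.

```python
def overlap_chars(a, b):
    """Number of characters shared by two passages given as (offset, length)."""
    start = max(a[0], b[0])
    end = min(a[0] + a[1], b[0] + b[1])
    return max(0, end - start)


def is_correctly_detected(detection, annotations):
    """True if the detection overlaps at least one annotated passage."""
    return any(overlap_chars(detection, ann) > 0 for ann in annotations)


annotations = [(120, 100)]   # annotated passage covering chars 120..219
detection = (100, 50)        # detected passage covering chars 100..149

print(overlap_chars(detection, annotations[0]))
print(is_correctly_detected(detection, annotations))
```

Under this reading, the overlap count (30 characters in the example) would be the figure that goes into "chars of correctly detected passages", rather than the full length of the detection.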