Re: [PAN'13] Abridged summary of pan-workshop-series@googlegroups.com - 2 Messages in 1 Topic


Shikha Pandey

Aug 30, 2013, 1:15:59 AM
to pan-works...@googlegroups.com
Hi, are there any guidelines or rules available for plagiarism detection in the PAN competition?


On Sun, Aug 25, 2013 at 10:02 PM, <pan-works...@googlegroups.com> wrote:

Group: http://groups.google.com/group/pan-workshop-series/topics

    Shikha Pandey <shikham...@gmail.com> Aug 25 10:35AM -0700  

    Hi all, I recently joined this group. I will try for PAN-14.
    I am doing research on idea plagiarism. Does PAN hold a competition for
    idea plagiarism?

    Martin Potthast <martin....@uni-weimar.de> Aug 25 09:05PM +0200  

    Hi Shikha,
     
    thanks for registering; we will probably hold another PAN competition
    again next year. However, we are currently not studying idea
    plagiarism.
     
    Best,
    Martin
     

--
--
You received this message because you are subscribed to the Google Group "PAN".
Visit this group at http://groups.google.com/group/pan-workshop-series
To unsubscribe send email to pan-workshop-se...@googlegroups.com.
---
You received this message because you are subscribed to the Google Groups "PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse." group.
To unsubscribe from this group and stop receiving emails from it, send an email to pan-workshop-se...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Martin Potthast

Aug 30, 2013, 2:46:47 AM
to pan-workshop-series
Hi Shikha,

you'll find them here:
http://www.uni-weimar.de/medien/webis/research/events/pan-13/pan13-web/plagiarism-detection.html

But note that at PAN'14, the guidelines may be different from this year's.

Martin

Shikha Pandey

Sep 2, 2013, 3:21:14 AM
to pan-works...@googlegroups.com
Hi Martin, by guideline rules I mean: how do we decide that a given portion of a document is plagiarized, and, if it is plagiarized, what the level of plagiarism is? Is there any policy for plagiarism detection at PAN?

Shikha Pandey



Martin Potthast

Sep 2, 2013, 4:32:16 AM
to pan-workshop-series
Hi Shikha,

if you are referring to how to measure the performance of a plagiarism
detector, I can refer you to the corresponding section on the PAN
plagiarism detection web page. Also, we've published about the
measures employed; take a look at the related work section of the web
page. For example, take a look in Section 2 here:
http://www.uni-weimar.de/medien/webis/publications/papers/stein_2010p.pdf

Best,
Martin




--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.org

Shikha Pandey

Sep 2, 2013, 5:29:37 AM
to pan-works...@googlegroups.com
Thank you, Martin, for your quick response. I have already read this paper. I think you misunderstood me: I am not talking about the performance measures of a detector. Let me explain again. Suppose I am going to develop software for plagiarism detection; how will I decide whether a given portion is plagiarized or not, for example when a proper citation is present? A plagiarism policy is required for that. So have you made any such policy for the competition? I ask because paraphrasing, summarizing, and sentence reordering are commonly used forms of plagiarism today.
I am waiting for your reply; please tell me if I am going in the wrong direction.
  
 Shikha 



Martin Potthast

Sep 2, 2013, 3:51:08 PM
to pan-workshop-series
Hi Shikha,

thanks for the clarification; indeed I got you wrong.

In answer to your question, I am not aware of any scientific or
commercial tool that can truly tell apart plagiarism from other kinds
of text reuse. In fact, all algorithms to date only detect the latter
and leave the rest for their users to decide.
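
To make the distinction concrete, here is a minimal sketch of a detector that only measures text reuse. It is not the method of PAN or of any particular tool; the word-trigram containment measure, the function names, and the two example sentences are assumptions chosen for illustration. The output is a reuse score and nothing more: whether the overlap amounts to plagiarism is left to the reader.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(suspicious, source, n=3):
    """Fraction of the suspicious text's n-grams that also occur in the source."""
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)


# Hypothetical example texts, not taken from any corpus.
source = "The quick brown fox jumps over the lazy dog near the river bank."
suspicious = "Yesterday the quick brown fox jumps over the lazy dog again."

# The score says how much text is shared, not whether the reuse is legitimate.
print(f"reuse score: {containment(suspicious, source):.2f}")
```

A high score here would be equally consistent with a properly quoted passage and with an unattributed copy, which is why the decision stays with a human.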

To my mind, asking a computer to judge whether a given case of text
reuse is plagiarism is not the right question, since everything about
plagiarism depends on context, let alone the state of mind of the
author at the time of writing. This raises interesting questions for
automation, and how humans can be further assisted in coming to a
conclusion; however, a fully automatic decision may not be possible.
That is not to say you cannot implement software that outputs a
decision, but that you will have trouble convincing domain experts
that your machine will always come to the right conclusion. As long as
that is not accomplished, there will always be humans double-checking.

I wouldn't say you are on the wrong track with your question, but
perhaps on a track that leads away from computer science.

Martin

Shikha Pandey

Sep 3, 2013, 3:32:49 AM
to pan-works...@googlegroups.com
Thanks, Martin, for your positive and valuable response. You are right; I will definitely think about it.

 Shikha



Imene Bensalem

Sep 3, 2013, 6:27:20 AM
to pan-works...@googlegroups.com
Hi Shikha, hi Martin,

I enjoyed your discussion and I would like to participate, please.

I think citation analysis techniques may help you decide whether a detected portion is actually a case of plagiarism or legitimate text reuse.
Citation analysis checks whether a reference follows the detected case. Moreover, if the reuse is verbatim, it also checks for quotation marks.
However, I think the PAN corpus does not contain references (at least the older corpora; I have not consulted the latest one), because its goal is to evaluate detection methods. Citation analysis could still be applied as a post-processing method.
So the PAN corpora do not allow you to evaluate a detection method that has citation analysis as a post-processing step.
But I know of a corpus built for this purpose. If you are interested, the paper that describes it is:
Solange de L. Pertile, Viviane Pereira Moreira: A Test Collection to Evaluate Plagiarism by Missing or Incorrect References. CLEF 2012: 141-143

Shikha, its author has another paper at CLEF 2013; maybe you will have the opportunity to meet her and discuss this if you are attending the conference.

Best regards
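
The post-processing step described above could be sketched roughly as follows. This is only an illustration under strong assumptions: the two reference patterns (numeric "[3]" and author-year "(Smith, 2010)"), the 40-character look-ahead window, and the names `analyse_case` and `is_quote` are hypothetical, and the detected span is assumed to exclude any surrounding quotation marks.

```python
import re

# Assumed reference markers: numeric "[12]" or author-year "(Smith, 2010)".
# Real documents use many other styles; these two are illustrative only.
REFERENCE = re.compile(r"\[\d+\]|\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}\)")
QUOTE_CHARS = "\"'\u201c\u201d\u2018\u2019"


def is_quote(ch):
    """True if ch is a single quotation-mark character."""
    return len(ch) == 1 and ch in QUOTE_CHARS


def analyse_case(text, start, end, window=40):
    """Report how the detected span text[start:end] is marked up.

    Assumes the span itself excludes any surrounding quotation marks.
    """
    before = text[start - 1:start] if start > 0 else ""
    after = text[end:end + 1]
    return {
        "is_quoted": is_quote(before) and is_quote(after),
        "has_reference": bool(REFERENCE.search(text[end:end + window])),
    }


# Hypothetical documents containing the same reused passage.
passage = "text reuse is not always plagiarism"
cited = f'As noted, "{passage}" [3], which matters.'
uncited = f"We believe that {passage}, which matters."

for doc in (cited, uncited):
    start = doc.index(passage)
    print(analyse_case(doc, start, start + len(passage)))
```

The result is extra information about an already detected case, not a verdict; a reference placed before the passage, or in a style the pattern does not cover, would be missed.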

Martin Potthast

Sep 3, 2013, 7:59:41 AM
to pan-workshop-series
> Hi Martin, could you post this back to the google group in response to this discussion?

Here's a message from Paul Clough on the subject:


Hi,

An interesting question and one that I have discussed with the
legal community - I participated in a workshop about reuse and
copyright at Cambridge University and presented work I had done on
measuring text reuse in journalism. A lawyer was asked to respond to
my work and made this comment on how you could prove text reuse (or
plagiarism) within a (UK) court of law:

"What constitutes a substantial part of the claimant’s work is
difficult to predict but the UK courts have provided general guidance
on this issue. The overriding principle is that the assessment is a
qualitative one and not quantitative. In determining whether a
qualitatively substantial part has been copied, courts have taken into
account various factors. A significant and consistently used factor is
the originality – i.e. skill and labour - of the part that has been
copied. The more skilful or creative the part that is copied, the more
likely it is to be substantial and thus infringing."

If you would like to know more about the legal perspective on text
reuse then please feel free to contact me. Just to also add that when
we determine whether plagiarism has occurred in student work, we are
guided by the Turnitin similarity/originality score but the decision
on whether plagiarism has been committed is always taken after meeting
with the student (often what they do is not really plagiarism - an act
to deceive - but actually poor academic writing). Therefore automated
plagiarism detection tools can play an important role in assisting
humans with making decisions on whether plagiarism has occurred or not
(rather than make the decision for them).

Paul Clough
(Senior Lecturer in IR, University of Sheffield)
-------------------------------------------------------------------------
Dr. Paul Clough (Senior Lecturer)

Information School
University of Sheffield
Regent Court
Sheffield S1 4DP
Tel: +44 (0)114 2222664
Fax: +44 (0)114 2780300
Email: p.d.c...@sheffield.ac.uk
Web: http://ir.shef.ac.uk/cloughie/
-------------------------------------------------------------------------

Martin Potthast

Sep 3, 2013, 8:31:03 AM
to pan-workshop-series
Hi Imene,

> I enjoyed your discussion and I would like to participate please.

Sure, thanks for contributing!

> I think citation analysis techniques may help you to decide if the detected
> portion is actually a plagiarism case or a legitimate text reuse.
> Citation analysis check if there exists a reference after the detected case.
> Moreover, if the reuse is verbatim it checks also the existence of a
> quotation marks.

In very specific genres and very specific circumstances, this may
indeed help. However, there are currently about 500+ different
citation styles used throughout the scientific community, let alone
those used in other genres. Also, there are countless ways of
indicating to a human reader whether a text is reused by means of
formatting, not only quotation marks. In general, citations,
references, and footnotes in a text are highly idiosyncratic, so that
even humans have a lot of difficulties telling which part of a text
belongs to a reference, both in the source document and the
suspicious document. Is it the entire sentence, the entire paragraph,
the entire chapter, before or after or both, etc...? It is not
sufficient to merely find a reference in a text; the author's
intended meaning must also be interpreted.

So, if you can pinpoint a specific style of citations and a specific
style of formatting that must be present (e.g., a teacher may define
how exactly things are to be done), then an automated solution may be
developed to check that. But doing so on a case-by-case basis is
certainly not economical.

Apart from the above, there's still the problem of representing the
required information in a model, so a computer may make sense of it.
People use Word, PDF, ODF, RTF, and many other document types to
format their text. Being able to accurately transfer the formatting
information from all of these formats into a computer model is a
Herculean task, especially if you are working alone. There are some
tools that can extract text, and some that convert many of the
aforementioned formats, but neither preserves formatting information,
let alone figures, tables, formulas, and other peculiar formatting
people come up with in their documents.

Also, as Paul pointed out, at the end of the day, it's not just about
missing references but there are other problems to be considered that
a computer cannot decide (i.e., whether a scholar actually knows what
she's doing).

> But I think PAN corpus does not contain references (at list the old series
> of corpora , actually I did not consult the last one) because its goal is to
> evaluate the detection methods, however the citation analysis could be a
> post processing method.

The reason we omitted references from the PAN corpus is because we
could not find reasonable answers to the following questions: How do
you denote references in a plain text file? Which citation style to
use? And wouldn't it be too easy to parse whichever citation style
we would use out of a plain text file using simple regex patterns?

On the one hand, we did not wish to waste people's time by introducing
problems that can easily be solved, whereas on the other hand, we
found it next to impossible to come up with a representative selection of
reference styles in a representative selection of document formats.
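
As a rough illustration of both halves of this argument, the sketch below parses one citation style out of plain text with a single regex. The bracketed numeric style, the pattern, the function name, and the example sentences are all assumptions made for illustration; they do not describe the PAN corpus, which contains no references.

```python
import re

# Assumption: references are written as bracketed numbers, e.g. "[7]" or "[2, 5]".
NUMERIC_CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def find_numeric_citations(text):
    """Return (offset, cited numbers) for every bracketed numeric citation."""
    return [
        (m.start(), [int(n) for n in m.group(1).split(",")])
        for m in NUMERIC_CITATION.finditer(text)
    ]


# Hypothetical sentences in two different citation styles.
numeric = "Text reuse is common [7]. Detection is hard [2, 5]."
author_year = "Text reuse is common (Author 2013). Detection is hard."

print(find_numeric_citations(numeric))      # the known style is trivial to parse
print(find_numeric_citations(author_year))  # any other style goes unnoticed
```

With a single fixed style the task is nearly trivial, while the same pattern finds nothing in the author-year sentence; covering a representative range of styles would need a separate pattern for each.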

> So PAN corpora do not allow you to evaluate a detection method that has a
> citation analysis as a post processing step.
> But, I know a corpus build for this purpose, if your are interested, the
> paper that describes it is :
> Solange de L. Pertile, Viviane Pereira Moreira: A Test Collection to
> Evaluate Plagiarism by Missing or Incorrect References. CLEF 2012: 141-143

Thanks for the pointer!

Martin



Imene Bensalem

Sep 3, 2013, 11:17:16 AM
to pan-works...@googlegroups.com, martin....@uni-weimar.de
Hi Martin,
Thank you very much for your detailed clarification.
Actually, you pointed out things that I had not paid attention to, like the position of the reference, which is not always after the citation, and the text structure, which also plays a role.
I agree with you that citation analysis is domain- and genre-dependent, and it may be difficult to develop one corpus that covers all kinds of referencing and citing.
I also agree that this task will of course not decide in place of humans whether something is a case of plagiarism; nonetheless, it gives further helpful information on the reused texts that have already been detected with a plagiarism detection method.

I found another paper in this area; it seems interesting (I have not read it yet):

Kind regards

Imene

Imene Bensalem

Sep 3, 2013, 11:46:04 AM
to pan-works...@googlegroups.com, martin....@uni-weimar.de
Hello Prof. Paul,

> In determining whether a 
> qualitatively substantial part has been copied, courts have taken into 
> account various factors. A significant and consistently used factor is the originality 
> – i.e. skill and labour - of the part that has been copied. The more skilful or 
> creative the part that is copied, the more likely it is to be substantial and thus infringing." 

It is really an inspiring idea. So if we want to automate this task,
after detecting the reused passages in the suspicious document, we have to check the originality
of these passages in the source documents and develop methods to gauge the skill used to write them. (I wonder whether such methods, i.e., methods for measuring skill from a piece of writing, already exist?)

Imene