Question about macro-precision and macro-recall for plagiarism-free documents


Imene Bensalem

Dec 26, 2013, 4:29:11 PM
to pan-works...@googlegroups.com
Hi PAN group members,

I'm carrying out some experiments on plagiarism detection, and I would appreciate it if some of you could answer my small question about evaluation.

As you know, in order to compute macro-precision and macro-recall we need to compute the precision and recall of each document in the evaluation corpus.
My question is: how are these measures computed in the case of a plagiarism-free document?
In other words, do plagiarism-free documents affect macro-precision and macro-recall, and if so, how?

Thank you.

Kind regards. 

Imene

Imene Bensalem

Dec 26, 2013, 4:48:39 PM
to pan-works...@googlegroups.com
Hi again,

I just saw that there was a previous, similar post, and I understood that macro-precision is affected by the performance on plagiarism-free documents, but I'm still wondering how. If no plagiarism is detected in a plagiarism-free document, do we consider the precision to be 1?

Martin Potthast

Dec 27, 2013, 5:57:04 AM
to pan-workshop-series
Hi Imene,

thanks for your inquiry.

As discussed previously, in a document without plagiarism, precision
and recall are undefined.

However, if you have TWO documents, where one contains a plagiarized
passage and the other contains no plagiarism, and if you compute
macro-recall and macro-precision over both documents together, then false
positive detections in the document that contains no plagiarism affect
recall and precision overall.

Since we compute overall performances in the competition, this is how
false positive detections in clean documents affect overall
performance, whereas we cannot report performances in the portion of
the corpus that contains no plagiarism based on precision and recall.
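
A minimal sketch of the two-document situation described above, assuming character-level (micro-averaged) precision and recall and passages given as half-open (start, end) character offsets. The helper names `char_set` and `micro_precision_recall` and the example offsets are hypothetical and are not taken from the PAN evaluation script.

```python
def char_set(spans):
    """Expand half-open (start, end) spans into a set of character offsets."""
    return {i for start, end in spans for i in range(start, end)}


def micro_precision_recall(docs):
    """Pool character counts over all documents, then divide once.

    docs: list of (true_spans, detected_spans) pairs, one pair per document.
    Returns (precision, recall); a value is None when its denominator is zero.
    """
    true_positive = detected = actual = 0
    for true_spans, detected_spans in docs:
        t, d = char_set(true_spans), char_set(detected_spans)
        true_positive += len(t & d)
        detected += len(d)
        actual += len(t)
    precision = true_positive / detected if detected else None
    recall = true_positive / actual if actual else None
    return precision, recall


# Assumed toy data: one document with a perfectly detected passage,
# and one clean document with a single false positive detection.
plagiarized_doc = ([(100, 200)], [(100, 200)])
clean_doc = ([], [(0, 100)])

print(micro_precision_recall([plagiarized_doc]))             # (1.0, 1.0)
print(micro_precision_recall([plagiarized_doc, clean_doc]))  # (0.5, 1.0)
```

In this sketch the false positive in the clean document halves the pooled precision. Recall stays the same here because the clean document adds no plagiarized characters to the denominator.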

This year, we will introduce some new measures that make this fact explicit.

Martin
> --
> --
> You received this message because you are subscribed to the Google Group
> "PAN".
> Visit this group at http://groups.google.com/group/pan-workshop-series
> To unsubscribe send email to
> pan-workshop-se...@googlegroups.com.
> ---
> You received this message because you are subscribed to the Google Groups
> "PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software
> Misuse." group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pan-workshop-se...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.



--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.org

Imene Bensalem

Dec 28, 2013, 5:58:48 AM
to pan-works...@googlegroups.com, martin....@uni-weimar.de
Hi Martin,

Thank you for your reply.

It seems I was NOT using the term macro-precision as you define it in your papers/thesis. By macro-precision (or macro-recall) I meant the one computed at the document level, i.e. compute the precision of each document in the corpus and then take the mean of these precision scores.

So let's recapitulate to wrap this discussion up:

Measures are computed for a single document and for a corpus.

In one document:

micro-precision = length of the true positives / (length of the true positives + length of the false positives)     ====> semantically not defined in a plagiarism-free document, because there cannot be any true positives

macro-precision = (sum of the precision scores of the detected cases) / number of detected cases     ====> not defined in a plagiarism-free document, because all detected cases in such a document are false positives

BUT what if the document contains plagiarism and no case (not even a false one) has been detected? Here the precision becomes mathematically undefined: we get a division by zero. So should we consider the precision to be 0, or undefined?
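
The two per-document formulas above can be sketched as follows, under the assumption that passages are non-empty half-open (start, end) character spans and that the precision of one detected case is the fraction of its characters that are truly plagiarized. All function names are hypothetical. Returning `None` for the no-detection case is just one possible convention; it makes the division by zero explicit without deciding between 0 and undefined.

```python
def char_set(spans):
    """Expand half-open (start, end) spans into a set of character offsets."""
    return {i for start, end in spans for i in range(start, end)}


def micro_precision(true_spans, detections):
    """Detected characters that are truly plagiarized / all detected characters."""
    detected = char_set(detections)
    if not detected:
        return None  # nothing detected: the denominator is zero
    return len(detected & char_set(true_spans)) / len(detected)


def macro_precision(true_spans, detections):
    """Mean of the per-case precision scores over all detected cases."""
    if not detections:
        return None  # no detected case: the denominator is zero
    true_chars = char_set(true_spans)
    scores = []
    for start, end in detections:
        case_chars = set(range(start, end))  # assumed non-empty
        scores.append(len(case_chars & true_chars) / len(case_chars))
    return sum(scores) / len(scores)


# Assumed toy document: one plagiarized passage and two detections.
true_spans = [(0, 100)]
detections = [(0, 50), (80, 120)]  # case precisions 1.0 and 0.5

print(macro_precision(true_spans, detections))            # 0.75
print(round(micro_precision(true_spans, detections), 3))  # 0.778
print(macro_precision(true_spans, []))                    # None
```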


In a corpus:
Macro- and micro-precision are computed with the same formulas over the whole corpus     =====> affected by the results on plagiarism-free documents, because their false positives (their length in micro, their number in macro) are counted in the formula

But here I think it is also interesting to speak about a macro-precision at the document level, which is = (sum of the precision scores of the documents) / number of documents,
and because precision is not defined in plagiarism-free documents, this score will not be affected by this kind of document (they are not counted in the denominator).
But I would repeat the question above here: for a document that contains plagiarism but in which no case (not even a false one) has been detected, should we set its precision to 0?
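
To make the effect of that choice concrete, here is a small sketch of the document-level mean described above. It assumes per-document precision scores are already available and that `None` marks a document whose precision is considered undefined. The function name and the `undefined_as` parameter are hypothetical.

```python
def mean_document_precision(per_doc_precision, undefined_as=None):
    """Average per-document precision scores.

    per_doc_precision: list of floats, with None for undefined scores.
    undefined_as: if None, undefined documents are skipped (not counted in
    the denominator); otherwise they are replaced by this value.
    """
    scores = []
    for p in per_doc_precision:
        if p is None:
            if undefined_as is None:
                continue
            p = undefined_as
        scores.append(p)
    return sum(scores) / len(scores) if scores else None


# Assumed scores: two documents with detections, one with no detection at all.
per_doc = [1.0, 0.5, None]

print(mean_document_precision(per_doc))                    # 0.75
print(mean_document_precision(per_doc, undefined_as=0.0))  # 0.5
```

Skipping the undefined document leaves the mean at 0.75, while counting it as 0 pulls the mean down to 0.5, so the two conventions can rank methods differently.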

>  In this year, we will introduce some new measures that make this fact explicit. 

There is really a need for such measures. I was also thinking about a measure along these lines, and I actually use it to better understand how a method I'm developing for intrinsic plagiarism detection behaves on plagiarism-free documents, so maybe by chance it is the same measure :-)

Kind regards 

Imene