Release of the test collections including annotations

Martin Potthast
Jul 13, 2010, 3:01:36 PM
to pan-workshop-series
Dear all,

I have to apologize for delaying the release of the annotated test
collections so long, but now they're finally available. Please visit
the task Web pages to find the respective downloads.

@Task 1:
We have simply recompiled the whole test collection, this time
including the annotations in the XML files. You will find a couple of
new attributes to the feature tags, detailing the variables you read
about in our COLING paper. Please note that this is NOT the final
release of the PAN-PC-10. We will compile this release some time after
the workshop.

Along with the annotations, we are also releasing a new version of the
perfmeasures script. Please use this script instead of the old one.
There was an error in the computation of the micro-averaged performance
values, sometimes resulting in precision and recall values above 1.
Fortunately, the implementation of the macro-averaged performance
values was not affected by this error... but as you can imagine, this
gave me the creeps. ;-)
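
For readers comparing the two averaging schemes, here is a minimal sketch of the difference. It is not the perfmeasures script itself. It assumes detections and the gold standard are given as sets of character offsets; the function names and the toy data are made up for illustration.

```python
def micro_precision(detections, truth):
    """Pool all detected offsets first, then compute a single ratio."""
    detected = set().union(*detections)
    if not detected:
        return 0.0
    return len(detected & truth) / len(detected)


def macro_precision(detections, truth):
    """Compute one ratio per detection, then take the unweighted mean."""
    scored = [d for d in detections if d]  # skip empty detections
    if not scored:
        return 0.0
    return sum(len(d & truth) / len(d) for d in scored) / len(scored)


# Toy data (hypothetical): offsets 0..99 are plagiarized.
truth = set(range(100))
detections = [set(range(0, 50)), set(range(40, 60)), set(range(200, 210))]

micro = micro_precision(detections, truth)  # 60 correct of 70 detected offsets
macro = macro_precision(detections, truth)  # mean of 1.0, 1.0 and 0.0
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Pooling the offsets as sets keeps the micro-averaged value within [0, 1]. Counting overlapping detections separately instead of pooling them is one hypothetical way a micro-average could exceed 1; the thread does not say what the actual error was.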

@Task 2:
The annotations for the test collection are those by which your
algorithms were judged. We release these annotations in the format of
the result submissions for your convenience. Please note that this is
NOT the final release of the PAN-WVC-10. Again, this corpus will be
provided later on.

Best regards,
Martin Potthast


--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de  ---  www.netspeak.cc

Dmitry Chichkov
Jul 13, 2010, 4:52:58 PM
to pan-works...@googlegroups.com
Hi Martin,

Could you possibly release the ROC performance measures script that you have been using for the second task? Or was it Tom Fawcett's ROC_algs.tar.gz (http://home.comcast.net/~tom.fawcett/public_html/ROCCH/index.html) mentioned in the referenced paper?

-- Regards, Dmitry

 

--
You received this message because you are subscribed to the Google Group "PAN".
Visit this group at http://groups.google.com/group/pan-workshop-series
To unsubscribe send email to pan-workshop-se...@googlegroups.com.

Martin Potthast
Jul 13, 2010, 4:56:27 PM
to pan-works...@googlegroups.com
Hi Dmitry,

> Could you possibly release the ROC performance measures script that you have
> been using for the second task? Or was it Tom Fawcett's ROC_algs.tar.gz
> (http://home.comcast.net/~tom.fawcett/public_html/ROCCH/index.html)
> mentioned in the referenced paper?

No problem: it was the implementation used at KDDCUP 2004.
http://kodiak.cs.cornell.edu/kddcup/software.html

Best,
Martin

Alzahrani, Salha
Jul 14, 2010, 1:45:31 AM
to pan-works...@googlegroups.com
Hi Martin,

>Along with the annotations, we are also releasing a new version of the
>perfmeasures script. Please use this script instead of the old one.

Is it the one currently at http://pan.webis.de/, or is there another
download link?

Thanks,

Salha

Martin Potthast
Jul 14, 2010, 3:20:25 AM
to pan-works...@googlegroups.com

Parth Gupta
Jul 14, 2010, 3:22:08 AM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Hello Martin,

I guess the download link contains the test corpus with updated
annotations (XMLs). Could we get only the annotations (XML files
only)? We have very low bandwidth for the next week or so, and the
full download would take around 24 hours. Since only 2 days are left
before the submission, could you provide just the XMLs, zipped, so
that we can include some error analysis in the report? If not on the
web page, an external link to the zipped file would also do. We would
be highly obliged. I hope you can understand the concern.

Thanking you in anticipation.

Regards,
Parth.


Martin Potthast
Jul 14, 2010, 4:01:26 AM
to pan-works...@googlegroups.com
Hi Parth,

> I guess the download link contains the test corpus with updated
> annotations (XMLs). Could we get only the annotations (XML files
> only)? We have very low bandwidth for the next week or so, and the
> full download would take around 24 hours. Since only 2 days are left
> before the submission, could you provide just the XMLs, zipped, so
> that we can include some error analysis in the report? If not on the
> web page, an external link to the zipped file would also do. We would
> be highly obliged. I hope you can understand the concern.

Yes, good idea! In fact, I have changed the downloads so that the
annotations can be downloaded separately, in addition to the full
test collection.

So, visit the Web pages and you'll find everything as requested (mind
your browser cache, though).

Martin

Tartessos
Jul 14, 2010, 4:25:07 AM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Hi, Martin:

Would it be possible to get only the gold standard XML files
separately? This would avoid downloading 1.7 GB just for the roughly
6.0 MB we don't yet have.

Regards,

Diego A.R.Torrejón
I.E.S. "José Caballero" (Spain)
Universidad de Huelva (Spain)

Jan Kasprzak
Jul 14, 2010, 4:45:38 AM
to pan-works...@googlegroups.com
Martin Potthast wrote:
: @Task 1:

: We have simply recompiled the whole test collection, this time
: including the annotations in the XML files. You will find a couple of
: new attributes to the feature tags, detailing the variables you read
: about in our COLING paper. Please note that this is NOT the final
: release of the PAN-PC-10. We will compile this release some time after
: the workshop.

Martin, would it be possible to release the exact version
of the evaluation script (perfmeasures.py) which was used for
computing the official results? I am getting slightly different
output with the current perfmeasures.py
(MD5 202d4a05c71d40dddca07db47082fe9b, downloaded ~12 hours ago),
with the original one I downloaded some time before the deadline
(MD5 822ad6c12c57f18c217f4e9d3202b4b7), and even with my own
implementation in Perl.

If anyone is interested, my detections can be downloaded
from http://www.fi.muni.cz/~kas/tmp/plagiarism-kasprzak-2010-06-23-1418.zip
(MD5 77c3b8bb6654b2b7bda8427483884fa6; the file will expire in ~10 days).

Thanks,

-Jan Kasprzak

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
Please don't top post and in particular don't attach entire digests to your
mail or we'll all soon be using bittorrent to read the list. --Alan Cox

Martin Potthast
Jul 14, 2010, 5:17:05 AM
to pan-works...@googlegroups.com
Diego,

see my earlier answer to Parth's request.

Martin


Parth Gupta
Jul 14, 2010, 6:15:19 AM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Hello Martin,

I got an error while extracting the RAR file, although it still
extracted some files. Are there only 9188 suspicious files in the
/suspicious_document/ directory, or more? In the Source/ directory I
have got 11148 docs, so that must be fine. But please verify the
number of suspicious docs.

Regards,
Parth.


Martin Potthast
Jul 14, 2010, 6:25:17 AM
to pan-works...@googlegroups.com
Hi Parth,

> I got an error while extracting the RAR file, although it still
> extracted some files. Are there only 9188 suspicious files in the
> /suspicious_document/ directory, or more? In the Source/ directory I
> have got 11148 docs, so that must be fine. But please verify the
> number of suspicious docs.

I have just now inflated the archive which is on the Web pages, and it
worked fine. Did you verify the MD5 sum?
There are exactly as many XML files in this RAR as dummy XML files in
the test collection: 15925.

Best,
Martin
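
Verifying a download against a published MD5 sum, as suggested above, could look like the following minimal sketch. The function name is made up for illustration, and the file name and digest mentioned in the usage note are placeholders, not values from the task pages.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5sum("annotations.rar") (a placeholder file name) and comparing the result with the digest listed next to the download shows whether the archive arrived intact; a mismatch usually means the download was truncated or corrupted and should be repeated.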

Martin Potthast
Jul 14, 2010, 6:27:28 AM
to pan-works...@googlegroups.com
Jan, and everyone else,

we got the same error. Some double-checking revealed that the
performance measure script is fine, but that some error must have
occurred while generating the fully annotated test collection. We are
investigating this issue and will get back to you as soon as possible.

Martin


Tartessos
Jul 15, 2010, 3:57:16 AM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Thanks, and sorry.


On 14 Jul, 11:17, Martin Potthast <martin.potth...@uni-weimar.de> wrote:
> Diego,
>
> see my earlier answer to Parth's request.
>
> Martin

As you can see, there was not much time between our messages. Maybe
my cache was stale: when I tried to send the message (much later than
I started writing it), I got an "ERROR: session finished" message from
Google. I immediately refreshed my inbox and tried again (without
checking for new messages), and then it was sent.

Sorry again.

Best,
Diego.