Training datasets updated for author identification and author profiling


Martin Potthast

Mar 3, 2015, 11:17:29 AM
to pan-workshop-series
Hi everyone,

we have updated the training datasets for author identification and
author profiling. Please download them again from the respective task
web pages at http://pan.webis.de

We have also updated the datasets in TIRA.

For author identification, each training dataset now contains a file
contents.json in which the language of each problem instance in the
dataset is revealed. This file will also be present in the test
dataset. In your software, please use only this file to learn the
language of a problem instance found in the dataset. Do not rely on
file names or folder names to do so.
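
A minimal sketch of reading the language from contents.json instead of from folder names. The layout of the file is not described in this message, so the keys "language" and "problems" and the helper name read_dataset_language are assumptions for illustration only; check them against the actual file.

```python
import json
import tempfile
from pathlib import Path


def read_dataset_language(dataset_dir):
    """Return (language, problems) as stated in <dataset_dir>/contents.json.

    ASSUMPTION: the file looks like {"language": "...", "problems": [...]}.
    Adjust the keys if the real file is laid out differently.
    """
    with open(Path(dataset_dir) / "contents.json", encoding="utf-8") as f:
        contents = json.load(f)
    return contents["language"], contents["problems"]


# Usage with a throwaway folder that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "contents.json").write_text(
        json.dumps({"language": "Dutch", "problems": ["DU001", "DU002"]}),
        encoding="utf-8",
    )
    language, problems = read_dataset_language(tmp)
    print(language, problems)
```

The folder name is never inspected here; the language comes only from the file.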

For author profiling, we have added an attribute "lang" to the
"author" tag of each XML which reveals the language of the tweets in
the respective XML file. In your software, please use only this
attribute to learn the language of a problem instance found in the
dataset. Do not rely on file names or folder names to do so.
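
A minimal sketch of reading the "lang" attribute from the "author" tag with Python's standard library. Only the tag name and the attribute name come from this message; the sample XML (the id attribute, the document child) and the helper name read_author_language are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_author_language(xml_text):
    """Return the value of the lang attribute on the <author> tag."""
    root = ET.fromstring(xml_text)
    # The <author> tag may be the root element or nested below it.
    author = root if root.tag == "author" else root.find(".//author")
    if author is None or "lang" not in author.attrib:
        raise ValueError("no <author> tag with a lang attribute found")
    return author.attrib["lang"]


# Hypothetical sample; real files may contain other attributes and children.
sample = '<author id="abc123" lang="en"><document>So tired</document></author>'
print(read_author_language(sample))
```

For a file on disk, ET.parse(path).getroot() yields the same kind of root element.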

Best,
Martin


--
Dr. Martin Potthast
Bauhaus-Universität Weimar
Digital Bauhaus Lab
Bauhausstr. 9a
99423 Weimar
Germany

+49 3643 58 3567
+49 171 809 1945

www.potthast.net

Halvani, Oren

Mar 3, 2015, 3:42:45 PM
to pan-works...@googlegroups.com
Hello Martin,


>> For author identification, each training dataset now contains a file contents.json in which the language of
>> each problem instance in the dataset is revealed. This file will also be present in the test dataset.
>> In your software, please use only this file to learn the language of a problem instance found in the dataset.
>> Do not rely on file names or folder names to do so.

Actually, this is a problem for us, since our software relies on each sub-corpus consisting only of problems in a single language,
e.g. "DU" ---> "all problems are Dutch". The new specification, however, can be interpreted to mean that a sub-corpus can
contain problems in more than one language, which is bad for our software!

Is there any reason why the specification has changed? Why not keep it the same as in PAN13 & PAN14?


best regards,
Oren

Martin Potthast

Mar 3, 2015, 4:20:51 PM
to pan-workshop-series
Hi Oren,

no worries: all the sub-corpora are still in the same language.
However, we've learned the hard way that encoding vital information
into file names and folder names sometimes leads participants astray
in their implementations. Therefore, we'd rather you extract the
language information from the places I pointed out.

If you need all problem instances to be in one language, you can of
course check that before starting to process them, and otherwise
quit with an informative error message.
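
A minimal sketch of such a check, assuming the languages have already been read from the dataset into a dictionary that maps problem ids to languages. The function name require_single_language and the ids are hypothetical.

```python
import sys


def require_single_language(problem_languages):
    """Return the one shared language, or raise ValueError if there is not exactly one.

    problem_languages maps a problem id to the language read from the dataset.
    """
    languages = set(problem_languages.values())
    if len(languages) != 1:
        found = ", ".join(sorted(languages)) or "none"
        raise ValueError(
            "expected all problem instances in one language, found: " + found
        )
    return languages.pop()


if __name__ == "__main__":
    try:
        print(require_single_language({"DU001": "Dutch", "DU002": "Dutch"}))
    except ValueError as err:
        # Quit before any processing starts, with a readable message.
        sys.exit(f"error: {err}")
```

Running the check once at start-up keeps the rest of the software free of language guessing.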

Best,
Martin

PS: The specification has not changed compared to previous years.
Rather, we have now made it consistent with last year for both tasks!

Halvani, Oren

Mar 3, 2015, 4:36:22 PM
to pan-works...@googlegroups.com
Hi Martin,

thanks a lot for the quick response!


>> no worries: all the sub-corpora are still in the same language.

Great, then everything is fine ;-)


Just one organisational question: Are the VMs going to be deployed this week?


best regards,
Oren

Martin Potthast

Mar 3, 2015, 4:48:56 PM
to pan-workshop-series
Hi Oren,

> Just one organisational question: Are the VMs going to be deployed this week?

Many VMs have already been deployed, and I can deploy one for you
right now: Have you replied to the welcome mail asking which OS you
prefer?

Halvani, Oren

Mar 6, 2015, 1:55:55 AM
to pan-works...@googlegroups.com
Hi Martin,


>> Many VMs have already been deployed, and I can deploy one for you right now:

That would be great!


>> Have you replied to the welcome mail asking which OS you prefer?

Well, I can't remember having received any welcome mail... However, I believe the
options are the same as last year, right? If so, I'll choose Windows 7 ;-)


Best regards,


Oren

Lucie Flekova

Mar 10, 2015, 5:52:41 AM
to pan-works...@googlegroups.com
Hi Martin,

I have a question about the author profiling evaluation. The PAN website states it will consider "the average F1 measure of all the 5 traits in the case of classes and the average Root Mean Squared Error for the scores." Can you clarify how the classes are intended to be derived, if at all? As a binary split around the mean value, or around 0 value, or will you use only the RMSE? I'm asking because an "intuitive" binary split on 0 would actually give a pretty challenging baseline on this data set...

Thanks,
Lucie


Paolo Rosso

Mar 12, 2015, 7:54:40 AM
to fabio celli, pan-workshop-series, scott....@xrce.xerox.com, Francisco Rangel, Martin Potthast



fabio celli <fabio...@live.it> wrote:

> Hi Scott,
> We found the source of the ????????????????? sequences:
> they are Twitter stickers, as you can see below.
>
> have a nice day!
>
>
>
> From: francisc...@autoritas.es
> Date: Thu, 12 Mar 2015 09:38:38 +0100
> Subject: Re: Training datasets updated for author identification and
> author profiling
> To: fabio...@live.it
> CC: martin....@uni-weimar.de; pro...@dsic.upv.es
>
> Hi,
> These ? characters correspond to special codes that Twitter uses to represent stickers.
> For example:
> [sticker images not preserved]
>
> Best,
>
> 2015-03-12 9:13 GMT+01:00 fabio celli <fabio...@live.it>:
>
>
>
> Hi Martin and Paolo,
> OK, I missed this one, but I do not know how to answer exactly,
> since there were no such sequences of question marks in the original
> data. I guess it is a problem with formats. Maybe there are some
> parts in Arabic or Japanese that are turned into
> "???????????????????", but Kiko should double-check before answering.
> Have a nice day.
>
>
> ============================
> Fabio Celli - post doc,
> University of Trento.
> Dept. of computer science (DISI): room 133
> Via Sommarive 5, Povo (TN)
> http://clic.cimec.unitn.it/fabio/
> ============================
>
>
>> From: martin....@uni-weimar.de
>> Date: Thu, 12 Mar 2015 01:01:11 +0100
>> Subject: Fwd: Training datasets updated for author identification
>> and author profiling
>> To: pro...@dsic.upv.es; fabio...@live.it
>>
>> Did you mean this one (see 4 below)?
>>
>> Martin
>>
>>
>> ---------- Forwarded message ----------
>> From: Martin Potthast <martin....@uni-weimar.de>
>> Date: Wed, Mar 4, 2015 at 1:12 PM
>> Subject: Re: Training datasets updated for author identification and
>> author profiling
>> To: "NOWSON, Scott" <scott....@xrce.xerox.com>
>> Cc: fabio celli <fabio...@live.it>, "p...@webis.de" <p...@webis.de>
>>
>>
>> Hi Scott,
>>
>> I'll answer your questions as far as I can. Please direct future
>> questions to p...@webis.de, so all the organizers have a chance to
>> answer should I not be around.
>>
>> > 1 - Is it correct that for testing, the input files will be a
>> > single file per author, titled in the format
>> > "<authorid>_<lang>_XX_XX.xml", meaning that the language will be
>> > known?
>>
>> Please update your local copy of the training dataset. There shouldn't
>> be any files named this way anymore. The file naming scheme is now simply
>> <authorid>.xml, whereas the language information is encoded in the
>> lang attribute of the author tag.
>>
>> The test dataset will be formatted the same way as the training
>> dataset. The language will be revealed.
>>
>> > 2 - For submission in the challenge, must we return results for
>> > all languages? Would it be possible, for example, to only work on
>> > Spanish?
>>
>> It is alright if you prefer to work only on one of the languages.
>>
>> > 3 - Similarly to above, must we provide responses for all
>> > demographics? Would it be possible, for example, to only work on
>> > Personality?
>>
>> The overall ranking will be based on the performance of each
>> software for all demographics. However, if you don't mind that your
>> software may not reach the top position overall, it is probably OK if
>> you do not submit predictions for some of the demographics; there will
>> be demographic-specific analyses.
>>
>> > 4 - In the data, there are a lot of "?????????" markers. My
>> > suspicion is they replace Unicode characters that are so prevalent
>> > these days on Twitter, but I wanted to check. They mostly occur in
>> > long strings, for example:
>> >
>> > * User 66 - So tired ????????
>> > * User 66 - My family is completely plastered.
>> >   ????????????????????????????????
>> >
>> > But they do sometimes occur in odd isolation where the use of "?"
>> > doesn't make sense grammatically:
>> >
>> > * User 1, ----- con criterion the term "our system" ? paper.
>> >
>> > I hope these questions make sense.
>>
>> I'm not sure about these artefacts. Perhaps Fabio and Francisco can
>> help out.
>>
>> > On an issue note, it seems that the Author Profiling download
>> > link is out of date. It still points to the 18-feb data set,
>> > rather than 2nd March.
>>
>> Thanks for the heads-up. I've just checked it and found the
>> download link in order. Could it be that your browser has an old
>> version of the web page cached?