PAN 2015: Author Profiling Task (Update)


Martin Potthast

Feb 13, 2015, 5:39:32 AM
to pan-workshop-series
Dear everyone,

This is to let you know that the training data for the PAN 2015 author
profiling task is now online. You can find the dataset on the task web
page at http://www.uni-weimar.de/medien/webis/research/events/pan-15/pan15-web/author-profiling.html

Please note the following about the dataset:
- For lack of data, the age class 65-xx used last year has been merged
with the class 50-64, resulting in the class 50-xx.
- For lack of data, the Dutch and Italian portions of the training
data do not consider age, but only gender and personality.
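
If you combine last year's age labels with this year's, the merge described above could be mirrored in a small helper. This is only a minimal sketch in Python; the function and constant names are hypothetical and not part of any PAN software.

```python
# Hypothetical helper: fold last year's two oldest age classes into 50-xx.
MERGED_AGE_CLASS = "50-xx"
OLD_AGE_CLASSES = {"50-64", "65-xx"}


def normalize_age_class(label: str) -> str:
    """Return the 2015 age class for a 2014 or 2015 age label."""
    cleaned = label.strip().lower()  # also accepts spellings such as "50-XX"
    if cleaned in OLD_AGE_CLASSES or cleaned == MERGED_AGE_CLASS:
        return MERGED_AGE_CLASS
    return cleaned  # all other classes are left unchanged
```

Labels are lowercased so that "50-XX" and "50-xx" are treated as the same class.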

The ZIP archive containing the corpus is password-protected for
privacy reasons. We will share the password with registered
participants in a separate mail.

To process this training dataset locally, you do not need last year's
download software.

However, if you also wish to analyze the PAN 2014 Twitter training
data with regard to age and gender, we have updated the download
software to fix an error that some of you experienced. Please
re-download the software at
http://www.uni-weimar.de/medien/webis/research/events/pan-14/pan14-code/pan14-author-profiling-twitter-downloader.zip

Should you find anything amiss, please don't hesitate to contact us.

Best,
Martin
on behalf of the task organizing committee


--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.org

Paolo Rosso

Feb 17, 2015, 8:45:04 AM
to fabio celli, pan-workshop-series, scott....@xrce.xerox.com, p...@webis.de, Francisco Rangel, Martin Potthast

fabio celli <fabio...@live.it> wrote:

> Hi Scott
>
> coming to your questions:
> · Can you say anything about from where the data came?
>
>
> we collected the data from Twitter by means of an advertising campaign
>
>
>
>
> o In particular, I'm interested to know how the labels were
> obtained. Are the personality scores gold standard manually
> collected from questionnaire (self-assessed or judged), or "silver
> standard," derived from automatic labelling.
>
>
> these labels are gold standard, self-assessed with the short Big Five
> test (BFI-10), and normalized between -0.5 and +0.5.
>
>
>
>
> · The website mentions providing personality in terms of "a)
> scores (between -0.5 and 0.5) and; b) binary classes (y/n that
> correspond to >0 and <=0)." However, there are no binary classes
> provided in any of the truth.txt files.
>
>
> yes, we wanted to release both scores and binary classes, but in the
> end we preferred to distribute only scores, since there is some
> imbalance in the classes.
>
>
>
>
> o Related, binary distinction by 0 will result in unbalanced
> datasets (for example English Openness has only 4 users <=0). Is
> this the intention?
>
>
> well no, for this reason we suggest using scores rather than classes
> for training and testing.
>
>
> · We have only explored the English dataset so far, but in
> truth.txt, we found user213 to be listed twice, with different
> characteristics.
> o user213:::M:::35-49:::0.1:::0.1:::0.1:::0.2:::0.2
> o user213:::M:::18-24:::0.1:::0.4:::0.0:::0.1:::0.2
>
>
> this is surely an error in the anonymization. We will fix this ASAP
>
>
>
>
> · There are no labels in the truth.txt files. Though most
> traits are self-explanatory, can we assume that the personality
> values are in the order as listed in the output example? Ie. E, N,
> A, C, O?
>
>
> yes, I confirm that the labels are in the order E, N, A, C, O. The
> polarity of trait N is: > 0 = stable; < 0 = neurotic
>
>
> o Though gender and age are in reverse order.
> · In the English data, there is one user (72) still labelled
> 50-64 instead of 50-XX
>
>
> We will fix this ASAP.
>
>
> thanks for the feedback!
>
>
>
>
> regards
>
>
> ============================
> Fabio Celli - post doc,
> University of Trento.
> Dept. of computer science (DISI): room 133
> Via Sommarive 5, Povo (TN)
> http://clic.cimec.unitn.it/fabio/
> ============================
>
>
> From: francisc...@autoritas.es
> Date: Tue, 17 Feb 2015 14:10:53 +0100
> Subject: Re: RE: PAN 2015: Author Profiling Task (Update)
> To: martin....@uni-weimar.de
> CC: p...@webis.de
>
> I'll try to fix all the issues with the data, but wrt. the personality
> questions maybe Fabio could explain them better.
>
>
> 2015-02-17 14:05 GMT+01:00 Martin Potthast <martin....@uni-weimar.de>:
> This one is for Francisco.
> Martin
> ---------- Forwarded message ----------
> From: "NOWSON, Scott" <scott....@xrce.xerox.com>
> Date: Feb 17, 2015 12:08 PM
> Subject: RE: PAN 2015: Author Profiling Task (Update)
> To: "Martin Potthast" <martin....@uni-weimar.de>
> Cc: "PEREZ, Julien" <julien...@xrce.xerox.com>
>
> Hi Martin,
>
> Thanks for sending out the details of the data. I have a number of
> questions, if I may. Some about the data generally, and some on
> what seem to be issues.
>
> · Can you say anything about from where the data came?
> o In particular, I'm interested to know how the labels were obtained.
> Are the personality scores gold standard manually collected from
> questionnaire (self-assessed or judged), or "silver standard,"
> derived from automatic labelling.
> · The website mentions providing personality in terms of "a) scores
> (between -0.5 and 0.5) and; b) binary classes (y/n that correspond
> to >0 and <=0)." However, there are no binary classes provided in any
> of the truth.txt files.
> o Related, binary distinction by 0 will result in unbalanced datasets
> (for example English Openness has only 4 users <=0). Is this the
> intention?
> · We have only explored the English dataset so far, but in truth.txt,
> we found user213 to be listed twice, with different characteristics.
> o user213:::M:::35-49:::0.1:::0.1:::0.1:::0.2:::0.2
> o user213:::M:::18-24:::0.1:::0.4:::0.0:::0.1:::0.2
> · There are no labels in the truth.txt files. Though most traits are
> self-explanatory, can we assume that the personality values are in
> the order as listed in the output example? Ie. E, N, A, C, O?
> o Though gender and age are in reverse order.
> · In the English data, there is one user (72) still labelled 50-64
> instead of 50-XX
>
>
> I hope this is helpful feedback. I look forward to hearing from you.
>
> Cheers,
> Scott
>
> --
> Scott Nowson, Ph.D.
> Global Research Lead - Customer Modelling
> Xerox Research Centre Europe
> scott....@xrce.xerox.com
>
> -----Original Message-----
> From: martin....@gmail.com [mailto:martin....@gmail.com] On Behalf Of Martin Potthast
> Sent: 13 February 2015 11:39
> To: pan-workshop-series
> Subject: PAN 2015: Author Profiling Task (Update)
> --
>
> Francisco M. Rangel Pardo
> CTO Autoritas Consulting S.A.
> http://www.autoritas.es
> Twitter: @kicorangel
> tlf. +34 656 493 023
>

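For anyone scripting against the truth.txt files, the answers in the message above could be turned into a small reader along the following lines. It is only a minimal sketch in Python: the names (`TruthRecord`, `parse_truth_line`, `to_binary`, `class_balance`) are hypothetical, and the field layout (user id, gender, age, then E, N, A, C, O, separated by `:::`) is taken from the English examples quoted in the thread. The Dutch and Italian files, which carry no age, may well need a different layout.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple

# Trait order as confirmed in the thread: E, N, A, C, O.
TRAITS = ("E", "N", "A", "C", "O")


class TruthRecord(NamedTuple):
    user_id: str
    gender: str
    age: str
    scores: Dict[str, float]


def parse_truth_line(line: str) -> TruthRecord:
    """Parse one line, assuming the English layout with eight ':::' fields."""
    parts = line.strip().split(":::")
    if len(parts) != 3 + len(TRAITS):
        raise ValueError(f"expected 8 fields, got {len(parts)}: {line!r}")
    user_id, gender, age = parts[:3]
    scores = dict(zip(TRAITS, (float(value) for value in parts[3:])))
    return TruthRecord(user_id, gender, age, scores)


def to_binary(score: float) -> str:
    """Binary class as described on the task page: y for > 0, n for <= 0."""
    return "y" if score > 0 else "n"


def class_balance(records: Iterable[TruthRecord], trait: str) -> Counter:
    """Count y/n classes for one trait to see how unbalanced the split is."""
    return Counter(to_binary(record.scores[trait]) for record in records)
```

Running `class_balance` over all records for, say, trait `"O"` gives a quick view of the imbalance discussed above, which is the reason the organizers suggest working with the scores directly.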


Francisco Rangel

Feb 20, 2015, 2:40:43 AM
to Paolo Rosso, fabio celli, pan-workshop-series, scott....@xrce.xerox.com, p...@webis.de, Martin Potthast
Dear participants,

A new, corrected dataset is available on the task web page. We fixed some issues pointed out by Scott (thank you, Scott, for your valuable feedback). We recommend that you download the new version so that you work with the correct truth files. The password remains the same.
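
To confirm that a local copy is the corrected version, a quick check for the two issues reported in this thread (a user id listed twice, and a leftover 50-64 label) might look like the sketch below. It is written in Python under stated assumptions: the function names are hypothetical, fields are separated by `:::`, and the age label is the third field, as in the English examples quoted earlier.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Age labels that should no longer appear after the merge into 50-xx.
STALE_AGE_LABELS = {"50-64", "65-xx"}


def find_repeated_users(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Return every user id that occurs on more than one line, with its lines."""
    seen = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id = line.split(":::", 1)[0]
        seen[user_id].append(line)
    return {uid: rows for uid, rows in seen.items() if len(rows) > 1}


def find_stale_age_labels(lines: Iterable[str]) -> List[str]:
    """Return user ids whose third field is still a pre-merge age label."""
    stale = []
    for line in lines:
        parts = line.strip().split(":::")
        if len(parts) > 2 and parts[2].lower() in STALE_AGE_LABELS:
            stale.append(parts[0])
    return stale
```

Both functions accept any iterable of lines, so an open file handle for truth.txt can be passed directly; empty results suggest the file no longer contains these two problems.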

Happy researching!


-- 
Francisco M. Rangel Pardo
CTO Autoritas Consulting S.A.
Twitter: @kicorangel

