Re: PAN Author profiling task - blog corpus


Francisco Rangel

Apr 7, 2014, 2:12:00 PM
to jam...@uw.edu, pan-workshop-series
Dear James (and all other participants),

Thank you for your question.

We discussed this and finally decided to leave the dataset as it is,
because this is the best way to ensure that all contents belong to the
author. We considered retrieving the contents from the permalinks, but it
is not feasible to do so automatically while also ensuring that everything
retrieved was written by the author, for example because of ads, iframes
and so on. We also decided not to allow participants to retrieve more
data, so that everyone works with the same data. We manually tried to
ensure that, although some posts may be short, on average every author has
enough content.

Thus, to answer your question: you should work only with the released
data. We will evaluate the proposals on similar data, without allowing
participants to download permalinks from the Internet.

Best regards and good luck!

On behalf of the Author Profiling co-organisers.


-- 
Francisco M. Rangel Pardo
CTO Autoritas Consulting S.A.
Twitter: @kicorangel


2014-04-02 8:41 GMT+02:00 Martin Potthast <martin....@uni-weimar.de>:
Hi Francisco and Paolo,

this one is for you.

Martin


---------- Forwarded message ----------
From: James Andrew Marquardt <jam...@uw.edu>
Date: Wed, Apr 2, 2014 at 6:06 AM
Subject: PAN Author profiling task - blog corpus
To: Martin Potthast <martin....@uni-weimar.de>
Cc: Golnoosh Farnadi <Golnoosh...@ugent.be>, gayathri Vasudevan
<gv...@uw.edu>, Martine De Cock <Martine...@ugent.be>


Mr. Potthast,

I have a question that has just come up today regarding the blog
corpus for the author profiling task.

It appears that the blog posts that have been provided are incomplete,
in that the text is cut off after a certain point in each post.
The full text can be viewed by following the URL in the document XML
tag.

My question is whether the full text will be made available for the
blogs for the final submission or if we should anticipate needing to
retrieve the full text ourselves.

Thank you,
-James Marquardt


--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de  ---  www.netspeak.org



