[Bots and Gender Profiling] Human bias in the bot documents?

Oren Halvani

Feb 20, 2019, 11:40:25 AM
to PAN Workshop Series on Digital Text Forensics
Dear Author Profiling organizers,

I would like to point out something I noticed in the provided training corpus.
Several XML files contain documents that are quotations from famous people.

Here are some examples, taken from the XML file "2855276210aea6dad744fdcbca0e633e.xml":

        <document><![CDATA[Money is the best deodorant. - Elizabeth Taylor]]></document>
        <document><![CDATA[So, where's the Cannes Film Festival being held this year?  Christina Aguilera]]></document>
        <document><![CDATA[We must accept finite disappointment, but we must never lose infinite hope. - Martin Luther King, Jr.]]></document>
        <document><![CDATA[Never, never, never, never give up. - Winston Churchill]]></document>
        <document><![CDATA[You can't win unless you learn how to lose. - Kareem Abdul-Jabbar]]></document>
        <document><![CDATA[Everyone has a right to a university degree in America, even if it's in Hamburger Technology. - Clive James]]></document>
        <document><![CDATA[A lifetime of training for just ten seconds. - Jesse Owens]]></document>

According to "train-truth.txt", this XML file is labeled as a bot:

        2855276210aea6dad744fdcbca0e633e:::bot:::bot

However, the quotations themselves were written by humans.
Hence my question: isn't the given corpus biased in some way?
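
For anyone who wants to reproduce this check, here is a minimal sketch in Python. It assumes only what is visible above: truth lines of the form "id:::label:::label" and XML files containing <document> elements with CDATA text. The function names parse_truth and read_documents, and the <author>/<documents> wrapper in the sample, are hypothetical and only serve the illustration; the real files may be structured differently around the <document> elements.

```python
import xml.etree.ElementTree as ET


def parse_truth(text):
    """Map author id -> tuple of labels, from lines like 'id:::bot:::bot'."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        author_id, *rest = line.split(":::")
        labels[author_id] = tuple(rest)
    return labels


def read_documents(xml_text):
    """Return the text of every <document> element, whatever the root element is."""
    root = ET.fromstring(xml_text)
    return [doc.text or "" for doc in root.iter("document")]


# Hypothetical wrapper elements; only the <document> lines come from the corpus file.
sample_xml = """<author>
  <documents>
    <document><![CDATA[Never, never, never, never give up. - Winston Churchill]]></document>
    <document><![CDATA[A lifetime of training for just ten seconds. - Jesse Owens]]></document>
  </documents>
</author>"""

sample_truth = "2855276210aea6dad744fdcbca0e633e:::bot:::bot\n"

labels = parse_truth(sample_truth)
docs = read_documents(sample_xml)

# Show the label tuple for this author next to the texts it was assigned to.
print(labels["2855276210aea6dad744fdcbca0e633e"])
for text in docs:
    print(text)
```

Printing the label next to the document texts makes it easy to spot accounts whose label is "bot" although every document reads like a human-written sentence.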


Best regards
Oren Halvani

Francisco Rangel

Feb 21, 2019, 9:01:12 AM
to pan-workshop-series
Hi Oren, how are you?

You're right, this account automatically tweets quotations from famous people. By definition, software that automatically republishes content is often considered a bot.

Best regards,


--
--
You received this message because you are subscribed to the Google Group "PAN".
Visit this group at http://groups.google.com/group/pan-workshop-series
To unsubscribe send email to pan-workshop-se...@googlegroups.com.
---
You received this message because you are subscribed to the Google Groups "PAN Workshop Series on Digital Text Forensics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pan-workshop-se...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Francisco M. Rangel Pardo
CTO Autoritas Consulting S.A.
Twitter: @kicorangel

Oren

Feb 24, 2019, 4:26:09 PM
to PAN Workshop Series on Digital Text Forensics
Hola Francisco,



>> Hi Oren, how are you?

Gracias, I'm doing very well ;-)
I hope you are too...



>> By definition, software that automatically republishes content is often considered a bot.

OK, thank you very much for the clarification!
This simplifies the task :-)


Best regards
Oren