Dear Author Profiling organizers,
I would like to highlight an observation I've made regarding the provided training corpus.
Several of the XML files contain documents that are quotations from famous people.
Here are some examples, taken from the XML file "2855276210aea6dad744fdcbca0e633e.xml":
<document><![CDATA[Money is the best deodorant. - Elizabeth Taylor]]></document>
<document><![CDATA[So, where's the Cannes Film Festival being held this year? Christina Aguilera]]></document>
<document><![CDATA[We must accept finite disappointment, but we must never lose infinite hope. - Martin Luther King, Jr.]]></document>
<document><![CDATA[Never, never, never, never give up. - Winston Churchill]]></document>
<document><![CDATA[You can't win unless you learn how to lose. - Kareem Abdul-Jabbar]]></document>
<document><![CDATA[Everyone has a right to a university degree in America, even if it's in Hamburger Technology. - Clive James]]></document>
<document><![CDATA[A lifetime of training for just ten seconds. - Jesse Owens]]></document>
According to "train-truth.txt", this XML file is labeled as a bot:
2855276210aea6dad744fdcbca0e633e:::bot:::bot
However, the quotations originate from humans.
Hence my question: isn't the given corpus biased in some way?
Best regards,
Oren Halvani