authorship identification competition

Tim Snider

Jun 20, 2011, 4:23:42 PM

to pan-works...@googlegroups.com

let me apologize in advance if this comment seems naive and the rest of the competitors realized this all along ... after submitting our results on the 8th, we went exploring on the internet ... the emails were clearly from the enron corpus so we went looking for it ... once we had it, we decided to spend a little time confirming that this was indeed the source for the competition and ultimately compute our own score ... after a bit of scripting work and some manual checking it appears that in a number of cases different authors in the training corpus were in fact the same author in the underlying corpus ... in fact there appear to be 11 cases (that is 11 pairs) in the large set and 5 in the small ... is this correct and intentional ??? if this is true, how will the scoring be done, since an email assigned to either author in a pair would have to be considered correct ... or have we misunderstood the structure of the enron corpus ???
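to make the scoring question concrete, here is a minimal sketch (not the organizers' actual scoring method) of an accuracy measure that counts a prediction as correct when it names either training author in a known same-author pair ... the function names, the document ids e1..e3 and the authors "A" and "B" are hypothetical, and the pair shown is only an illustration

```python
def canonical_map(pairs):
    """Map every author in a same-author pair to one shared representative."""
    parent = {}

    def find(author):
        parent.setdefault(author, author)
        while parent[author] != author:
            author = parent[author]
        return author

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return {author: find(author) for author in list(parent)}


def pair_aware_accuracy(predicted, actual, pairs):
    """Fraction of documents whose predicted author matches the true author,
    treating both members of a same-author pair as interchangeable."""
    canon = canonical_map(pairs)
    hits = 0
    for doc, label in actual.items():
        guess = predicted.get(doc)
        if canon.get(guess, guess) == canon.get(label, label):
            hits += 1
    return hits / len(actual)


# Hypothetical data: one merged pair, three test emails.
pairs = [("640334", "693864")]
actual = {"e1": "640334", "e2": "693864", "e3": "A"}
predicted = {"e1": "693864", "e2": "693864", "e3": "B"}
accuracy = pair_aware_accuracy(predicted, actual, pairs)
```

with a strict comparison only e2 would count as correct here ... with the pair merged, e1 counts as well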

c u o, tim

Shlomo Argamon

Jun 20, 2011, 5:34:52 PM

to pan-works...@googlegroups.com

Dear Tim,

Thanks very much for your feedback. Could you please send me some of the details of the results of your tests, so that I can compare them to our annotated data? We will then be in a better position to evaluate the situation.

Thanks very much,

Shlomo

Sent from my iPad

Tim Snider

Jun 20, 2011, 5:59:41 PM

to pan-works...@googlegroups.com

we originally detected it with some scripts and then went back and did some manual comparisons of emails ... all done by our university of waterloo co-op students ... so i'll put the details together tomorrow and send them off to you

c u o, tim

Shlomo Argamon

Jun 20, 2011, 6:06:00 PM

to pan-works...@googlegroups.com

Thanks!

Sent from my iPad

Tim Snider

Jun 21, 2011, 2:30:40 PM

to pan-works...@googlegroups.com

here's the details of what we found ...

Attached is a zip folder with comparisons between duplicate authors. The LargeTrain folder has two text files:

Author_Pairs, a text file mapping 1 author from the Enron corpus to 2 authors from the training set, and
Text_Matches, a file containing a mapping from every text from both authors in the duplicate training set to a text by the original author in the Enron corpus

There are also 3 subfolders, each named after an Enron author. Each folder contains 2 folders named after the duplicate authors in the Training set. Each of these subfolders contains 2 files ... the xml file is the file that was given to us as part of the training set for the author identified by the parent directory name ... the other file is the 'matching' file from the enron corpus under a directory identified by the grandparent directory name ...

so, for example, taylor-m is a directory in the enron corpus that contains email from 'Mark Taylor' ... under that we have two directories, 640334 and 693864 ... these are two of the training authors ... under 640334 we have the files '20.' and 'a2800.xml' ... a2800.xml is a file from the training corpus for author 640334 ... 20. is the 'matching' file from the enron corpus for taylor_m ... under 693864 we have two files, '231.' and 'a1526.xml' ... a1526.xml is a file from the training corpus for author 693864 ... 231. is the 'matching' file, also from the enron corpus for taylor_m ... reading the enron versions, 20. and 231., makes a strong case that these are in fact from the same author ... the directories lavorato_j and martin_t follow the same pattern ...

in the Text_Matches file you'll see that some enron authors have been split into two training set authors with almost all the mail in one and very little in the other ... but there are also examples such as lavorato_j where there are 50 or more emails in each of the training sets, all of which 'match' emails from the same enron author ...
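as a rough illustration of how split authors can be read off a set of text matches, here is a minimal sketch ... it assumes each match is available as a tuple of (training author, training file, enron author, enron file), which is an assumption for illustration and not the actual layout of the Text_Matches file ... the first two rows reuse the taylor_m example above, the third row is made up

```python
from collections import defaultdict


def find_split_authors(text_matches):
    """Return Enron authors whose matched texts are spread over more than
    one training-set author, with a per-training-author text count."""
    counts = defaultdict(lambda: defaultdict(int))
    for train_author, _train_file, enron_author, _enron_file in text_matches:
        counts[enron_author][train_author] += 1
    return {
        enron_author: dict(per_train)
        for enron_author, per_train in counts.items()
        if len(per_train) > 1
    }


# Assumed row layout: (training author, training file, enron author, enron file).
matches = [
    ("640334", "a2800.xml", "taylor_m", "20."),
    ("693864", "a1526.xml", "taylor_m", "231."),
    ("000001", "a0001.xml", "example_x", "1."),  # hypothetical, unsplit author
]
split = find_split_authors(matches)
```

the per-author counts make it easy to tell a lopsided split (almost all the mail in one training author) from a case like lavorato_j where both training authors hold many emails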

sorry for the complexity but we're not sure how to better organize the cases

c u o, tim

duplicates.zip
