here are the details of what we found ...
Attached is a zip folder with comparisons between duplicate authors. The LargeTrain folder has two text files:
Author_Pairs, a text file mapping 1 author from the Enron corpus to 2 authors from the training set, and
Text_Matches, a file mapping every text from both duplicate authors in the training set to a text by the original author in the Enron corpus.
There are also 3 subfolders, each named after an Enron author. Each of these contains 2 folders named after the duplicate authors in the training set, and each of those contains 2 files ...
the xml file is the one that was given to us as part of the training set for the author named by the parent directory ...
the other file is the 'matching' file from the Enron corpus, taken from the directory named by the grandparent directory ...
so, for example, taylor-m is a directory in the Enron corpus that contains email from 'Mark Taylor' ... under that we have two directories, 640334 and 693864 ... these are two of the training authors ...
under 640334 we have the files '20.' and 'a2800.xml' ... a2800.xml is a file from the training corpus for author 640334 ... 20. is the 'matching' file from the Enron corpus for taylor_m ...
under 693864 we have two files, '231.' and 'a1526.xml' ... a1526.xml is a file from the training corpus for author 693864 ... 231. is the 'matching' file, also from the Enron corpus for taylor_m ...
reading the Enron versions, 20. and 231., makes a strong case that these are in fact from the same author ... the directories lavorato_j and martin_t follow the same pattern ...
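if it helps, the folder layout described above can be walked with a few lines of Python ... this is only a sketch, not something shipped in the zip: the function name collect_pairs and the root path are hypothetical, and it assumes each training-author folder holds exactly one .xml file plus one other file ...

```python
from pathlib import Path


def collect_pairs(root):
    """Walk <root>/<enron author>/<training author>/ and pair up the two files.

    Returns a list of tuples:
    (enron_author, training_author, training_xml_name, enron_file_name).
    Folders that do not hold exactly one .xml file and one other file are
    skipped (an assumption of this sketch).
    """
    rows = []
    root = Path(root)
    # Top level: one folder per Enron author; plain files such as
    # Author_Pairs and Text_Matches are ignored here.
    for enron_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Second level: one folder per duplicate training author.
        for train_dir in sorted(p for p in enron_dir.iterdir() if p.is_dir()):
            files = sorted(p for p in train_dir.iterdir() if p.is_file())
            xml_files = [p for p in files if p.name.endswith(".xml")]
            other_files = [p for p in files if not p.name.endswith(".xml")]
            if len(xml_files) == 1 and len(other_files) == 1:
                rows.append(
                    (enron_dir.name, train_dir.name,
                     xml_files[0].name, other_files[0].name)
                )
    return rows
```

calling collect_pairs on the unzipped LargeTrain folder (wherever it ends up on your machine) should give one row per training author, which makes it easy to open the two files side by side ...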
in the Text_Matches file you'll see that some Enron authors have been split into two training set authors with almost all the mail in one and very little in the other ... but there are also examples such as lavorato_j where each of the two training authors has 50 or more emails, all of which 'match' emails from the same Enron author ...
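the two kinds of split can be told apart by counting matched texts per training author ... a rough sketch, assuming the Text_Matches rows have already been read into (enron author, training author) pairs, one pair per matched text ... the file's exact format isn't spelled out here, so parsing is left out ... the function names and the min_texts parameter are hypothetical, with the default of 50 taken from the lavorato_j case ...

```python
from collections import Counter, defaultdict


def split_sizes(matches):
    """Count matched texts per training author, grouped by Enron author.

    `matches` is an iterable of (enron_author, training_author) pairs,
    one pair per matched text (an assumed, already-parsed form).
    """
    counts = defaultdict(Counter)
    for enron_author, training_author in matches:
        counts[enron_author][training_author] += 1
    return counts


def balanced_splits(matches, min_texts=50):
    """Return Enron authors split into 2+ training authors that each
    have at least `min_texts` matched texts."""
    result = {}
    for enron_author, per_training in split_sizes(matches).items():
        if len(per_training) >= 2 and all(
            n >= min_texts for n in per_training.values()
        ):
            result[enron_author] = dict(per_training)
    return result
```

authors that show up in split_sizes but not in balanced_splits are the lopsided ones, with most of the mail under a single training author ...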
sorry for the complexity but we're not sure how to better organize the cases
c u o, tim