Hello everyone,
I've been working on a project to restore as much original information
to the Enron dataset as possible. This includes attachments, header
information, formatting, proprietary email addresses, etc. Once this is
done, it may be useful to make derivatives, e.g. where all email
addresses are converted consistently between various formats.
I've started posting some initial information and discussion online at
http://enrondata.org and I'm interested to find out if anyone would
like to provide suggestions and/or participate. I've posted some
initial summary files at http://enrondata.org/data and some larger
datasets are becoming available.
My initial goal is to reconstruct the Enron email in MIME and original
native format (350+ PST and NSF files), as close to the original as
possible, since this will give users a dataset that is more similar to
email encountered in the real world. From there, enhanced datasets
can be provided that are more geared towards specific research
requirements. Initially this effort will produce 3 datasets:
1) FERC "PST" MySQL Database: I had to correct some minor file
corruption and inconsistencies to load the 1.3+ million records. This
is my working source for creating the MIME email.
2) Primary MIME Format Email: conversion from the database format to
email format. This is the source for the PST and NSF email. This
format will include attachments as MIME parts. The core work in this
area has been performed but some clean up and outliers still need to
be handled.
3) Primary Native Format Email: PST and NSF files. I'm starting to
look at the MIME to PST conversion now.
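
To illustrate step 2, here is a minimal sketch of turning one database
record plus its attachment rows into a MIME message with Python's
standard email package. The column names (sender, recipients, subject,
date, body, filename, data) and the sample values are assumptions for
illustration only; they are not the actual FERC database schema or the
project's conversion code.

```python
from email import policy
from email.message import EmailMessage


def build_mime_message(record, attachments):
    """Build a MIME message from one database row and its attachment rows.

    The dictionary keys used here are hypothetical column names.
    """
    msg = EmailMessage(policy=policy.SMTP)
    msg["From"] = record["sender"]
    msg["To"] = record["recipients"]
    msg["Subject"] = record["subject"]
    msg["Date"] = record["date"]
    msg.set_content(record["body"])

    # Each attachment becomes its own MIME part; adding the first one
    # converts the message to multipart/mixed.
    for att in attachments:
        msg.add_attachment(
            att["data"],
            maintype="application",
            subtype="octet-stream",
            filename=att["filename"],
        )
    return msg


# Made-up sample data, not taken from the Enron dataset.
record = {
    "sender": "alice@example.com",
    "recipients": "bob@example.com, carol@example.com",
    "subject": "Quarterly report",
    "date": "Mon, 14 May 2001 16:39:00 -0700",
    "body": "Please see the attached file.",
}
attachments = [{"filename": "report.txt", "data": b"quarterly numbers"}]

message = build_mime_message(record, attachments)
raw_bytes = message.as_bytes()  # what would be written to an .eml file
```

A real conversion would also need to carry over the original content
types and any headers preserved in the source data; this sketch treats
every attachment as application/octet-stream for simplicity.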
An interesting related effort is the creation of directory server load
files to store user information including user names, SMTP email
addresses, Exchange Legacy DNs, Domino addresses, etc. For some Enron
employees, additional information is available that can be added to
create a richer dataset including title, department, address, job
role, etc. Some of this is already available (e.g. Jitesh and Jafar's
Ex-Employee Status report). I've assembled other information from
primary sources. This can be made available in one place in multiple
formats including: LDIF, database and flat file formats.
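
As a rough illustration of the LDIF output, the sketch below writes one
directory entry using only the Python standard library. The DN, the
person and the choice of inetOrgPerson attributes are assumptions made
for the example; attributes for Exchange Legacy DNs or Domino addresses
would depend on the directory schema in use and are not shown. Values
that are not plain printable ASCII are base64 encoded and marked with
"::", as LDIF requires; long lines are not folded in this sketch.

```python
import base64


def format_line(name, value):
    """Format one LDIF attribute line, base64-encoding unsafe values."""
    safe = (
        value.isascii()
        and value.isprintable()
        and not value.startswith((" ", ":", "<"))
        and not value.endswith(" ")
    )
    if safe:
        return f"{name}: {value}"
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return f"{name}:: {encoded}"


def ldif_entry(dn, attributes):
    """Return one LDIF entry; `attributes` maps a name to a list of values."""
    lines = [format_line("dn", dn)]
    for name, values in attributes.items():
        for value in values:
            lines.append(format_line(name, value))
    return "\n".join(lines) + "\n"


# Hypothetical person and directory layout, for illustration only.
entry = ldif_entry(
    "cn=Jane Doe,ou=People,dc=example,dc=com",
    {
        "objectClass": ["inetOrgPerson"],
        "cn": ["Jane Doe"],
        "sn": ["Doe"],
        "mail": ["jane.doe@example.com"],
        "title": ["Analyst"],
    },
)
print(entry)
```

The same dictionary of attributes could feed a database insert or a
flat-file writer, which is one way to keep the formats in sync.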
Derivative datasets can be constructed from these using similar email
address creation / mapping and various levels of de-duplication. This
is more similar to the CALO effort and other derivative datasets. For
this effort, I want to label all enhanced information so there's less
of a question as to what information is original vs. enhanced. I'm
also adding this to the naming convention in the enhanced email
headers I'm adding.
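
One way such labeling could look is sketched below: every added header
carries a common prefix, so original and enhanced headers can be
separated mechanically later. The prefix "X-EnronData-Enhanced-" and
the header name "Dedup-Group" are hypothetical, chosen only for this
example; they are not the naming convention the project actually uses.

```python
from email.message import EmailMessage

# Hypothetical prefix marking headers that were not in the original data.
ENHANCED_PREFIX = "X-EnronData-Enhanced-"


def add_enhanced_header(msg, name, value):
    """Add a header that is clearly labeled as enhanced information."""
    msg[ENHANCED_PREFIX + name] = value


def split_headers(msg):
    """Return (original, enhanced) lists of (name, value) header pairs."""
    original, enhanced = [], []
    for name, value in msg.items():
        if name.lower().startswith(ENHANCED_PREFIX.lower()):
            enhanced.append((name, value))
        else:
            original.append((name, value))
    return original, enhanced


# Made-up sample message.
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["Subject"] = "Meeting"
add_enhanced_header(msg, "Dedup-Group", "12345")

original_headers, enhanced_headers = split_headers(msg)
```

Because header names are case-insensitive, the comparison is done on
lowercased names.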
One reason for labeling enhanced information is to reduce the number
of questions about the data. For example, one area I've seen questions
on is the folder naming conventions, e.g. all_documents, sent,
sent_items, _sent_mail, etc. These are folders created by Exchange and
Notes that I'll probably discuss on the project blog.
I've started posting some smaller preliminary information on the blog,
but I also have some larger datasets that I use. Most of the FERC
extracts are from the MySQL database, which can be made available now.
The MIME files should be available soon, followed by the native files.
If anyone is interested in these non-deduplicated datasets, let me
know and I can see about getting them hosted. I may also be able to
get the raw FERC iCONECT files hosted if there is interest in that as
well.
If anyone is interested in participating, please let me know as
there's still a lot of work that can be done. I'm thinking of setting
up EnronData.org as an open source project so multiple people can
collaborate.
Please take a look at the site and let me know what you think.
Thanks,
John Wang
EnronData.org