RE: [Enron-corpus] Using Enron Corpus as a text corpus for Word Sense Disambiguation

Andrew....@csiro.au

May 20, 2009, 7:58:33 PM

to stuart...@cl.cam.ac.uk, enron-...@sgi.nu, email-r...@googlegroups.com

Hi Stuart,

I don't have any WSD-specific comments to make, but I have done a reasonable amount of annotation of other phenomena in the Enron corpus. As you suggest, I have found one of the database versions of the corpus very convenient for this purpose. Specifically, I use the database dump prepared by Andrew Fiore and Jeff Heer at UC Berkeley and add my own tables (with references to ids in the existing tables) for storing the annotations I make on the data. This also lets you add multi-layer annotations over time (i.e., annotations that refer to other annotations, such as classification annotations that refer to text unit annotations).
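
As a rough illustration of the layout described above, here is a minimal sketch in Python with SQLite. It is not the schema of the Berkeley dump: the `messages` table is only a hypothetical stand-in for an existing corpus table, and the `text_units` and `unit_labels` tables, their columns, and the helper functions are all assumed names. The first annotation table points at ids in the existing table, and the second points at the first, which gives the multi-layer arrangement.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a stand-in corpus table and two annotation layers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript("""
        -- Hypothetical stand-in for an existing table in the corpus dump.
        CREATE TABLE messages (
            id   INTEGER PRIMARY KEY,
            body TEXT NOT NULL
        );
        -- Layer 1: text unit annotations, referring to ids in the existing table.
        CREATE TABLE text_units (
            id         INTEGER PRIMARY KEY,
            message_id INTEGER NOT NULL REFERENCES messages(id),
            start_char INTEGER NOT NULL,  -- 0-based, inclusive
            end_char   INTEGER NOT NULL   -- 0-based, exclusive
        );
        -- Layer 2: classification annotations, referring to layer 1.
        CREATE TABLE unit_labels (
            id           INTEGER PRIMARY KEY,
            text_unit_id INTEGER NOT NULL REFERENCES text_units(id),
            label        TEXT NOT NULL
        );
    """)
    return conn


def annotate(conn, message_id, word, label):
    """Mark the first occurrence of `word` in a message body and attach a label to it."""
    body = conn.execute(
        "SELECT body FROM messages WHERE id = ?", (message_id,)
    ).fetchone()[0]
    start = body.index(word)
    cur = conn.execute(
        "INSERT INTO text_units (message_id, start_char, end_char) VALUES (?, ?, ?)",
        (message_id, start, start + len(word)),
    )
    unit_id = cur.lastrowid
    conn.execute(
        "INSERT INTO unit_labels (text_unit_id, label) VALUES (?, ?)",
        (unit_id, label),
    )
    return unit_id


def labelled_spans(conn):
    """Return (annotated text, label) pairs by joining both layers back to the corpus."""
    return conn.execute("""
        SELECT substr(m.body, u.start_char + 1, u.end_char - u.start_char), l.label
        FROM unit_labels AS l
        JOIN text_units AS u ON u.id = l.text_unit_id
        JOIN messages   AS m ON m.id = u.message_id
        ORDER BY l.id
    """).fetchall()
```

Keeping the annotations in separate tables leaves the original corpus tables untouched, so further layers can be added later without altering the dump itself.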

I've cross-posted your query to the Email Research mailing list too, in case anyone has additional advice to offer, particularly in reference to word-sense disambiguation.

Cheers,
Andrew
________________________________________
From: enron-corp...@sgi.nu [enron-corp...@sgi.nu] On Behalf Of Stuart Moore [stuart...@cl.cam.ac.uk]
Sent: Wednesday, 20 May 2009 11:08 PM
To: enron-...@sgi.nu
Subject: [Enron-corpus] Using Enron Corpus as a text corpus for Word Sense Disambiguation

I'm looking to use the Enron corpus as a flat text corpus for Word
Sense Disambiguation research - and hopefully to make my annotations
publicly available. I currently plan just to use the body text of each
email, rather than any of the email headers. Personal emails, list
emails etc. are all useful to me (but Spam isn't).

I'm currently trying to work out the best way to organise my data -
has anyone done anything similar? Does anyone have any suggestions?

My current plan is to use one of the database versions mentioned on
http://sgi.nu/enron/corpora.php and add extra tables for my
annotations.

Many thanks

Stuart Moore
PhD Student, University of Cambridge
_______________________________________________
Enron-corpus mailing list
Enron-...@sgi.nu
http://lists.sgi.nu/mailman/listinfo/enron-corpus

Mark Dredze

May 20, 2009, 9:37:22 PM

to email-r...@googlegroups.com, stuart...@cl.cam.ac.uk, enron-...@sgi.nu

Hi Andrew,

I'm curious to know what specific characteristics of WSD you plan
on exploring in the context of email. How do you feel email differs
from more traditional WSD domains?

Mark

Mark Dredze

May 21, 2009, 7:53:11 AM

to Stuart Moore, email-r...@googlegroups.com, enron-...@sgi.nu

Sorry, meant to write Stuart.

I would imagine that the interesting aspect of email is context from
both the thread and the relationship between the sender/recipient.
Conditioning WSD output on these seems like a reasonable idea. You can
also look at the recent ACL literature on domain adaptation for WSD.

Good luck with your work. Let us know how it turns out.

Mark

On Thu, May 21, 2009 at 1:58 AM, Stuart Moore <stuart...@cl.cam.ac.uk> wrote:
> 2009/5/21 Mark Dredze <ma...@dredze.com>:

>> Hi Andrew,
>>
>> I'm curious to know what specific characteristics of WSD you plan
>> on exploring in the context of email. How do you feel email differs
>> from more traditional WSD domains?
>>
>

> I'll reply since it's me planning to do the research, not Andrew.
>
> Basically I just want to use it as a different domain from newswire
> text, to see how systems cope with it, since emails generally have a
> different written style. Sometimes replies are briefer, or might
> require context from quoted material for disambiguation. As much as
> anything else, email is a common 'real world use' domain and therefore
> deserves looking at in its own right.
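
Since quoted material may be needed as context for disambiguating a brief reply, one possible preprocessing step is to separate the newly written text of a body from the text it quotes. The sketch below is only an assumption about how that might be done: it treats any line starting with ">" as quoted, which matches the quoting style in this thread but not every mail client, and `split_quoted` is a hypothetical helper name.

```python
def split_quoted(body):
    """Split an email body into (new_text, quoted_text).

    Assumes quoted material is marked with leading '>' characters.
    Attribution lines such as "On ... wrote:" are not detected and
    stay in the new text.
    """
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Drop every leading '>' and space, whatever the quote depth.
            quoted_lines.append(stripped.lstrip("> ").rstrip())
        else:
            new_lines.append(line)
    return "\n".join(new_lines).strip(), "\n".join(quoted_lines).strip()
```

The new text could then be the unit that gets annotated, with the quoted text kept alongside it as optional context.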
