Leaked Email Corpus from the Venezuelan Government


Andrew....@csiro.au

Sep 2, 2008, 9:56:33 PM
to enron-...@sgi.nu, email-r...@googlegroups.com

Hi Email Researchers,

I've recently found out about an alleged corpus of 8000 email messages from the Venezuelan government, which I've blogged about at: <http://www.sgi.nu/diary/2008/09/02/venezuelan-government-email-corpus/>

It's currently on offer to the highest bidder, but WikiLeaks (who have obtained the messages) claim they will publicly release it after a period of exclusive access for the winning bidder.

I'm wondering whether anyone knows more about this?

Thanks,
Andrew

--------------
Andrew Lampert
Research Engineer
Information Engineering Laboratory
CSIRO ICT Centre
<http://www.ict.csiro.au/staff/Andrew.Lampert/>

Post: Locked Bag 17, North Ryde, NSW 1670, Australia
Office: Building E6B, Macquarie University, North Ryde, 2113
Tel: +61 2 9325 3129, Fax: +61 2 9325 3200

Mark Dredze

Sep 5, 2008, 2:33:07 PM
to email-r...@googlegroups.com
This is certainly an interesting source of information, especially
since the emails are likely non-English. A major question I have is
the legitimacy of using mail that has not been legally obtained. When
AOL released the search queries and then recalled them, there was a
mix of feelings in the IR community as to whether or not they should
be used. This case is clearly different: this is a government
organization and email may fall under the general ideas about open
access to government information, but these laws likely don't apply to
Venezuela.

Does anyone have thoughts about using this type of data?

Mark

Michael Freed

Sep 5, 2008, 2:53:57 PM
to email-r...@googlegroups.com
Because it's Venezuela.gov it's tempting to apply different standards.  But using this would set a bad precedent and undermine trust we might hope for from legit providers.  I won't use it.

-M

Andrew....@csiro.au

Sep 5, 2008, 6:52:02 PM
to email-r...@googlegroups.com

Hi Mark,

You raise a really interesting question, related to one I raised at the workshop in Chicago about the ethics of using the MediaDefender emails. (See <http://www.sgi.nu/diary/2007/09/18/mediadefender-email-corpus-6600-email-messages-released/> if you're not familiar with the MediaDefender "corpus".)

It seems fairly unambiguous that the MediaDefender emails are unusable for research, given the way they were originally obtained and distributed. Without knowing more, I think it's hard to say whether the same is true for the Venezuela.gov corpus.

Would it make any difference if extracts from the Venezuela.gov emails (or the whole corpus) were published in or by a mainstream newspaper or news outlet? What if the emails were also used as the basis for public-interest investigative journalism? Is the WikiLeaks release of the data significantly different from the emails being released by a news outlet?

Michael, you say that "using this would set a bad precedent and undermine trust we might hope for from legit providers". That might be true - it's hard to know, I think. Is the potential to undermine trust still there once emails are in the public domain? What has to happen to lend legitimacy to a real-world email corpus (other than where an organisation volunteers their data)? Enron is one example, though even there some people raise ethical questions about using it. I guess we can just keep hoping that the Clinton.gov corpus you mentioned comes to fruition, Mark!

Another interesting point: despite some discussion in the IR community, a quick search on Google Scholar shows that the AOL corpus is being used in published IR research.

Cheers,
Andrew

Michael Freed

Sep 7, 2008, 10:02:46 PM
to email-r...@googlegroups.com
On Fri, Sep 5, 2008 at 3:52 PM, <Andrew....@csiro.au> wrote:

Michael, you say that "using this would set a bad precedent and undermine trust we might hope for from legit providers". That might be true - it's hard to know, I think. Is the potential to undermine trust still there once emails are in the public domain? What has to happen to lend legitimacy to a real-world email corpus (other than where an organisation volunteers their data)? Enron is one example, though even there some people raise ethical questions about using it. I guess we can just keep hoping that the Clinton.gov corpus you mentioned comes to fruition, Mark!

Do we ever want to be in the position of justifying use of private data?  I wouldn't want to have that conversation with my colleagues, much less a potential data provider. 

A decent alternative is to have an ethical standard that we can point to.  I think we can easily outline something that covers the obvious cases: permission from native owners of the data, and legal action that renders a corpus public. 

I'd hesitate to treat the press, whistle-blowers, etc. as sources of legitimacy, even when they're exposing those who probably deserve it.

-M



Mark Dredze

Sep 8, 2008, 3:44:39 PM
to email-r...@googlegroups.com
"I think we can easily outline something that covers the obvious
cases: permission from native owners of the data, and legal action
that renders a corpus public."
That makes sense to me. If you treat emails as copyrighted material,
then you want to make sure you respect applicable copyrights. I would
also require that no local law be broken by your use of the data.

That said, there are some tricky cases. Consider:
1) The government subpoenas a large amount of email and releases it,
as with Enron. I think this is ok to use. What if one of the people in
the corpus asks for all of their mail to be removed?
2) I decide to release all my email data. Can my contacts who sent me
the email object and ask that their messages be removed?

I think both of these cases would be ok. The Venezuela case is not,
because the person broke the law by releasing government
information. Even if the people have a right to see it, that doesn't
mean that we can legally use it.

The AOL data is unclear to me. AOL released the data under a license,
but then removed the data and asked people not to use it. Can you
withdraw the license that was originally provided (depends on the
license)?

I certainly do not want to use any data in my research that people
would view as suspect. However, I suspect that other researchers
(journalists and political scientists) would use the released
Venezuelan email. If they would, do we have a different standard?

Mark
