Fwd: LDC holdings at CMU

William Cohen (CMU)

Apr 18, 2012, 8:48:28 AM

to machine-learning-with-large-d...@googlegroups.com

FYI - more information on large datasets

---------- Forwarded message ----------
From: Alex Rudnicky <Alex.R...@cs.cmu.edu>
Date: Tue, Apr 17, 2012 at 7:56 PM
Subject: LDC holdings at CMU
To: "lti-fac...@cs.cmu.edu" <lti-fac...@cs.cmu.edu>,
"lti-st...@cs.cmu.edu" <lti-st...@cs.cmu.edu>,
"lti-...@cs.cmu.edu" <lti-...@cs.cmu.edu>
Cc: "Angela Brookins [ang...@cs.cmu.edu]" <ang...@cs.cmu.edu>

This is a reminder to all that CMU has a (more-or-less) complete
collection of LDC corpora available to everyone, for educational or
research purposes, within the University. You can view our holdings at
http://www.speech.cs.cmu.edu/inner/LDC/table.html

Note that the collection is only accessible from CMU IP addresses, due
to licensing issues.

Please read the intro material, and remember that hovering over the
icons next to the corpus entry will inform you of their status. You
can browse the corpora directly or pull them down using ‘wget’ and its
friends. If you need ongoing access, it can be arranged.

The collection is not complete. One reason is that in the early days,
the LDC allowed members only a fixed number of corpora. We acquired
only those that were relevant to ongoing projects. If you need one of
the missing ones, you should be prepared to contribute the cost from
your project.

Before disk storage became cheap, we only had CDs of corpora. People
borrowed these; not everyone turned them in. This is the other reason
some corpora are missing. In many cases we have the borrower’s name; it’s in
square brackets in the entry description. Feel free to
hunt them down or otherwise vocally bring this issue up in their
presence. The end goal is to get the disc(s) back into the collection
so that the data are available to all.

We have some other corpora in the collection. Most of these are in
speech (since that’s nominally my area). If you have corpora that
might be of interest to others (believe me, they will be), please feel
free to contribute copies to this collection.

Angela Brookins (ang...@cs.cmu.edu) is the official librarian for the
collection. You should contact her for lending and other issues.

LDC corpora were originally focused on the needs of the speech
community, but over time have come to include materials of interest to
the text, video and other communities. Until a few years ago,
acquisition was directly subsidized by the Speech Group (but still
available to all). More recently this role has been transferred to the
LTI, which assesses ongoing contracts that use LDC corpora to meet the
cost of our LDC membership (btw, we’re a charter member!). While I’m
at it I would like to acknowledge Brian MacWhinney, a fellow corpus
geek, for his support of various needs of the collection.

----

Alexander Rudnicky

--
William W. Cohen
wco...@cs.cmu.edu
http://www.wcohen.com
Research Professor
Machine Learning Department, CMU
