I'm considering benchmarking some clustered data processing tools, and
I'm in need of a nice huge dataset that is reasonably interesting and
preferably publicly available.
Obviously I could just crawl the web and make a large collection of
pages. But I'd rather do something a little different, if possible.
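(If I did go the crawling route, the collector itself wouldn't be much
code. A rough sketch in Python -- the seed URL, page limit, and delay
below are all placeholders, and a real crawl should also honor
robots.txt:)

    # Minimal breadth-first page collector (sketch only).
    import re
    import time
    import urllib.request
    from collections import deque

    def crawl(seed, max_pages=1000, delay=1.0):
        seen, queue, fetched = {seed}, deque([seed]), 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue
            fetched += 1
            yield url, html
            # Naive link extraction; a real HTML parser would be more robust.
            for link in re.findall(r'href="(http[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            time.sleep(delay)  # be polite: pause between requests

    for url, html in crawl("http://example.com/"):  # placeholder seed
        pass  # write each page into the collection here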
Some examples would be the AOL logs, but they are a bit small (only
0.5 GB compressed). Tim Bray has 64 GB of (I think compressed) Apache
logs (search for his Wide Finder posts), but he has no plans to share.
Neither of those is terribly interesting, though (except that the
latter could be compared against his multi-core benchmarking).
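(For reference, the Wide Finder task itself is tiny -- roughly the
sketch below. The /ongoing path pattern follows Bray's posts; the log
filename is a placeholder.)

    # Wide Finder in miniature: count fetches of /ongoing articles in an
    # Apache access log and print the top ten.
    import re
    from collections import Counter

    ARTICLE = re.compile(r'GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ')

    counts = Counter()
    with open("access_log") as log:  # placeholder filename
        for line in log:
            m = ARTICLE.search(line)
            if m:
                counts[m.group(1)] += 1

    for article, n in counts.most_common(10):
        print(n, article)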
So, does anyone know of interesting, really large datasets lying around out there?
The various genome projects house some pretty large data sets too.
Most of these are publicly available for download.
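(If you pull sequence data, it usually arrives as FASTA files, which
stream nicely without loading everything into memory. A sketch; the
filename is a placeholder:)

    # Stream records out of a FASTA file one at a time.
    def read_fasta(path):
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line[1:], []
                elif line:
                    chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    for name, seq in read_fasta("genome.fa"):  # placeholder filename
        print(name, len(seq))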
Hope that helps,
Daylife and Flickr both have open APIs; from talking to people at each,
they don't mind how broadly you spider as long as you respect their
request rates (i.e. don't hammer them at 5 reqs/second for a week;
they'll notice and mind).
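(Throttling to stay under a rate limit is a couple of lines of care.
A sketch; the endpoint, parameters, and API key below are placeholders
shaped like Flickr's REST interface:)

    # Polite API spidering: cap the request rate instead of hammering.
    import json
    import time
    import urllib.parse
    import urllib.request

    MIN_INTERVAL = 1.0  # seconds between requests; stay well under the limit

    def fetch_pages(base_url, params, pages):
        last = 0.0
        for page in range(1, pages + 1):
            wait = MIN_INTERVAL - (time.time() - last)
            if wait > 0:
                time.sleep(wait)  # throttle before each request
            last = time.time()
            query = urllib.parse.urlencode(dict(params, page=page))
            with urllib.request.urlopen(base_url + "?" + query) as resp:
                yield json.load(resp)

    # Example call shape (the API key is a placeholder, not a real one):
    for batch in fetch_pages(
            "https://api.flickr.com/services/rest/",
            {"method": "flickr.photos.search", "api_key": "YOUR_KEY",
             "tags": "landscape", "format": "json", "nojsoncallback": 1},
            pages=10):
        pass  # process each JSON batch here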
There's also the Enron email database.
I've heard that some people have the MediaDefender email database
(http://torrentfreak.com/mediadefender-emails-leaked-070915/). If
you're interested, please email me; maybe I know one of them. There
are also a variety of public mailing-list archive tarballs out there
-- the Linux kernel, etc.
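(Those archives are typically mbox files, which the Python standard
library walks directly. A sketch; the filename is a placeholder:)

    # Iterate over a mailing-list archive in mbox format.
    import mailbox

    for msg in mailbox.mbox("lkml.mbox"):  # placeholder filename
        subject = msg["subject"] or ""
        sender = msg["from"] or ""
        body = msg.get_payload(decode=True)  # None for multipart messages
        print(sender, "|", subject)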
If you want a great big pile of server logs, see Connected Open Free Data.
Another example is Wikipedia. Collobert and Weston (2008) induce good
low-dimensional vector representations for words using Wikipedia
(snowbird.djvuzone.org/abstracts/158.pdf + forthcoming work).
Visualizing these embeddings would be interesting.
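(Once you have the vectors, the visualization itself is short. A sketch
assuming a plain-text file of "word v1 v2 ..." rows -- that format is my
assumption, not anything the paper specifies:)

    # Project word embeddings to 2-D via PCA (SVD) and scatter-plot them.
    import numpy as np
    import matplotlib.pyplot as plt

    words, vecs = [], []
    with open("embeddings.txt") as fh:  # assumed "word v1 v2 ..." format
        for line in fh:
            parts = line.split()
            words.append(parts[0])
            vecs.append([float(x) for x in parts[1:]])

    X = np.array(vecs)
    X -= X.mean(axis=0)  # center before PCA
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    xy = X @ Vt[:2].T  # project onto the first two principal components

    plt.scatter(xy[:, 0], xy[:, 1], s=4)
    for w, (x, y) in zip(words[:50], xy[:50]):  # label a handful of points
        plt.annotate(w, (x, y), fontsize=7)
    plt.show()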