I'm considering benchmarking some clustered data processing tools, and
am in need of a nice huge dataset that is reasonably interesting and
preferably publicly available.
Obviously I could just crawl the web and make a large collection of
pages. But I'd rather do something a little different, if possible.
Some examples would be the AOL logs, but they are a bit small (only
0.5G compressed). Tim Bray has 64G of (I think compressed) Apache logs
(search for his widefinder posts), but he has no plans to share. But
neither of those is terribly interesting (except that the latter could be
compared to his multi-core benchmarking).
So, any known interesting really large datasets lying around out there?
ckw
--
Chris K Wensel
ch...@wensel.net
http://chris.wensel.net/
http://www.cascading.org/
The various genome projects house some pretty large data sets too.
Most of these are publicly available for download:
http://www.ensembl.org/info/downloads/index.html
Hope that helps,
~ Matt
Lucas
--
Where was my car manufactured?
http://cars.lucasmanual.com/vin
TurboGears Manual-Howto
http://lucasmanual.com/pdf/TurboGears-Manual-Howto.pdf
<http://www.ckan.net/tag/read/size-large>
May be useful.
Regards,
Rufus
Three great blog corpus datasets:
http://stuff.metafilter.com/infodump/
http://news.ycombinator.com/item?id=213891
http://www.cs.biu.ac.il/~koppel/BlogCorpus.htm
Daylife and Flickr both have open APIs; from talking to people at each,
they don't mind how broadly you spider as long as you respect their
request rate (i.e. don't hammer them with 5 reqs/second for a week; that
they'll notice and mind).
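For what it's worth, here is a minimal sketch of a polite spider loop that
spaces out its requests. Everything in it is an assumption for illustration:
the names RateLimiter and crawl are made up, fetch is whatever function you
use to call the API, and the one-second interval is a placeholder, not a rate
that either service has published or agreed to.

```python
import time


class RateLimiter:
    """Enforce a minimum gap (in seconds) between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def crawl(urls, fetch, limiter):
    """Fetch each URL in order, waiting on the limiter before every request."""
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Placeholder interval: one request per second (an assumed value).
    limiter = RateLimiter(min_interval=1.0)
    pages = crawl(["page-1", "page-2"], fetch=lambda u: "fetched " + u,
                  limiter=limiter)
    print(pages)
```

Swap the lambda for a real HTTP call and set min_interval to whatever rate
the service actually asks for.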
Enron email database:
http://bailando.sims.berkeley.edu/enron_email.html
I've heard that some people have the MediaDefender email database
(http://torrentfreak.com/mediadefender-emails-leaked-070915/ ). If
you're interested please email me, maybe I know one of them. There
are also a variety of public mailing list archive tarballs out there
-- Linux kernel, etc.
If you want a great big pile of server logs, see
http://waxy.org/2008/05/star_wars_kid_the_data_dump/
flip
--
http://www.infochimps.org
Connected Open Free Data
Another example is Wikipedia. Collobert and Weston (2008) induce good
low-dimensional vector representations for words using Wikipedia
(snowbird.djvuzone.org/abstracts/158.pdf + forthcoming work).
Visualizing these embeddings would be interesting.
--
Academic: http://www-etud.iro.umontreal.ca/~turian/
Business: http://www.metaoptimize.com/