I'm considering benchmarking some clustered data processing tools, and
I'm in need of a nice huge dataset that is reasonably interesting and
preferably publicly available.
Obviously I could just crawl the web and make a large collection of
pages. But I'd rather do something a little different, if possible.
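(If I did go the crawling route, the collector itself wouldn't be much
code. A rough sketch in Python -- the seed URL, page limit, and delay
below are all placeholders, and a real crawl should also honor
robots.txt:)

    # Minimal breadth-first page collector (sketch only).
    import re
    import time
    import urllib.request
    from collections import deque

    def crawl(seed, max_pages=1000, delay=1.0):
        seen, queue, fetched = {seed}, deque([seed]), 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue
            fetched += 1
            yield url, html
            # Naive link extraction; a real HTML parser would be more robust.
            for link in re.findall(r'href="(http[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            time.sleep(delay)  # be polite: pause between requests

    for url, html in crawl("http://example.com/"):  # placeholder seed
        pass  # write each page into the collection here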
Some examples would be the AOL logs, but they are a bit small (only
0.5 GB compressed). Tim Bray has 64 GB of (I think compressed) Apache
logs (search for his Wide Finder posts), but he has no plans to share.
Neither of those is terribly interesting, though (except that the
latter could be compared against his multi-core benchmarking).
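(For reference, the Wide Finder task itself is tiny -- roughly the
sketch below. The /ongoing path pattern follows Bray's posts; the log
filename is a placeholder.)

    # Wide Finder in miniature: count fetches of /ongoing articles in an
    # Apache access log and print the top ten.
    import re
    from collections import Counter

    ARTICLE = re.compile(r'GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ')

    counts = Counter()
    with open("access_log") as log:  # placeholder filename
        for line in log:
            m = ARTICLE.search(line)
            if m:
                counts[m.group(1)] += 1

    for article, n in counts.most_common(10):
        print(n, article)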
So, does anyone know of interesting, really large datasets lying around out there?
The various genome projects house some pretty large data sets too.
Most of these are publicly available for download.
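(If you pull sequence data, it usually arrives as FASTA files, which
stream nicely without loading everything into memory. A sketch; the
filename is a placeholder:)

    # Stream records out of a FASTA file one at a time.
    def read_fasta(path):
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line[1:], []
                elif line:
                    chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    for name, seq in read_fasta("genome.fa"):  # placeholder filename
        print(name, len(seq))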
Hope that helps,
Daylife and Flickr both have open APIs; from talking to people at each,
they don't mind how broadly you spider as long as you respect their
request rates (i.e. don't hammer them at 5 reqs/second for a week;
they'll notice and mind).
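(Throttling to stay under a rate limit is a couple of lines of care.
A sketch; the endpoint, parameters, and API key below are placeholders
shaped like Flickr's REST interface:)

    # Polite API spidering: cap the request rate instead of hammering.
    import json
    import time
    import urllib.parse
    import urllib.request

    MIN_INTERVAL = 1.0  # seconds between requests; stay well under the limit

    def fetch_pages(base_url, params, pages):
        last = 0.0
        for page in range(1, pages + 1):
            wait = MIN_INTERVAL - (time.time() - last)
            if wait > 0:
                time.sleep(wait)  # throttle before each request
            last = time.time()
            query = urllib.parse.urlencode(dict(params, page=page))
            with urllib.request.urlopen(base_url + "?" + query) as resp:
                yield json.load(resp)

    # Example call shape (the API key is a placeholder, not a real one):
    for batch in fetch_pages(
            "https://api.flickr.com/services/rest/",
            {"method": "flickr.photos.search", "api_key": "YOUR_KEY",
             "tags": "landscape", "format": "json", "nojsoncallback": 1},
            pages=10):
        pass  # process each JSON batch here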
There's also the Enron email database.
I've heard that some people have the MediaDefender email database
(http://torrentfreak.com/mediadefender-emails-leaked-070915/). If
you're interested, please email me; maybe I know one of them. There
are also a variety of public mailing-list archive tarballs out there
-- the Linux kernel, etc.
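(Those archives are typically mbox files, which the Python standard
library walks directly. A sketch; the filename is a placeholder:)

    # Iterate over a mailing-list archive in mbox format.
    import mailbox

    for msg in mailbox.mbox("lkml.mbox"):  # placeholder filename
        subject = msg["subject"] or ""
        sender = msg["from"] or ""
        body = msg.get_payload(decode=True)  # None for multipart messages
        print(sender, "|", subject)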
If you want a great big pile of server logs, see Connected Open Free Data.
Another example is Wikipedia. Collobert and Weston (2008) induce good
low-dimensional vector representations for words using Wikipedia
(snowbird.djvuzone.org/abstracts/158.pdf + forthcoming work).
Visualizing these embeddings would be interesting.
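(Once you have the vectors, the visualization itself is short. A sketch
assuming a plain-text file of "word v1 v2 ..." rows -- that format is my
assumption, not anything the paper specifies:)

    # Project word embeddings to 2-D via PCA (SVD) and scatter-plot them.
    import numpy as np
    import matplotlib.pyplot as plt

    words, vecs = [], []
    with open("embeddings.txt") as fh:  # assumed "word v1 v2 ..." format
        for line in fh:
            parts = line.split()
            words.append(parts[0])
            vecs.append([float(x) for x in parts[1:]])

    X = np.array(vecs)
    X -= X.mean(axis=0)  # center before PCA
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    xy = X @ Vt[:2].T  # project onto the first two principal components

    plt.scatter(xy[:, 0], xy[:, 1], s=4)
    for w, (x, y) in zip(words[:50], xy[:50]):  # label a handful of points
        plt.annotate(w, (x, y), fontsize=7)
    plt.show()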