I'm considering benchmarking some clustered data processing tools, and am in need of a nice huge dataset that is reasonably interesting and preferably publicly available.
Obviously I could just crawl the web and make a large collection of pages, but I'd rather do something a little different, if possible.
Some examples would be the AOL logs, but they are a bit small (only 0.5G compressed). Tim Bray has 64G of (I think compressed) Apache logs (search for his Wide Finder posts), but he has no plans to share. Neither of those is terribly interesting, though (except that the latter could be compared to his multi-core benchmarking).
So, any known interesting really large datasets lying around out there?
> On Mon, Jun 16, 2008 at 9:34 PM, Chris K Wensel <ch...@wensel.net> wrote:
>> Hey all
>> I'm considering benchmarking some clustered data processing tools. And am in need of a nice huge dataset that is reasonably interesting, and preferably publicly available.
>> Obviously I could just crawl the web and make a large collection of pages. But I'd rather do something a little different, if possible.
>> Some examples would be the AOL logs, but they are a bit small (only .5G compressed). Tim Bray has 64G of (I think compressed) apache logs (search for his widefinder posts), but he has no plans to share. But neither of those are terribly interesting (except the later could be compared to his multi-core benchmarking).
>> So, any known interesting really large datasets lying around out there?
Daylife and Flickr both have open APIs. From talking to people at each, they don't mind how broadly you spider as long as you respect their request rate (i.e., don't hammer them with 5 reqs/second for a week; that they'll notice and mind).
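For what it's worth, here is a rough sketch of what "respecting the request rate" could look like in a spider. It is not tied to either API: `RateLimiter`, `spider`, and the `fetch` callable are hypothetical names made up for illustration, and the actual limit each service tolerates is something you would have to ask them about.

```python
import time
from typing import Callable, Iterable, List, Optional


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        # Earliest time the next request may start; None until the first call.
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


def spider(urls: Iterable[str],
           fetch: Callable[[str], object],
           limiter: RateLimiter) -> List[object]:
    """Fetch each URL in turn, pausing so the rate stays under the limit."""
    results = []
    for url in urls:
        limiter.wait()              # sleep if the previous request was too recent
        results.append(fetch(url))  # fetch is supplied by the caller
    return results
```

You would pass in your own `fetch` (say, a thin wrapper around `urllib.request.urlopen`) and a limiter set well below whatever rate the service says it is comfortable with. The clock and sleep functions are injectable only so the pacing can be checked without actually waiting.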
On Mon, Jun 16, 2008 at 1:37 PM, Aaron Swartz <m...@aaronsw.com> wrote:
> bulk.resource.org usually has some interesting large data sets...
If you are interested in text corpora, you can also use the WAC (web-as-corpora) datasets. I forget the exact URL, but http://www.drni.de/wac-tk/ should be a start. The SIGWAC chair sent me the datasets a while ago.
Another example is Wikipedia. Collobert and Weston (2008) induce good low-dimensional vector representations for words using Wikipedia (snowbird.djvuzone.org/abstracts/158.pdf + forthcoming work). Visualizing these embeddings would be interesting.