Large Data
From: Chris K Wensel <ch...@wensel.net>
Date: Mon, 16 Jun 2008 13:34:43 -0700
Local: Mon, Jun 16 2008 4:34 pm
Subject: Large Data
Hey all

I'm considering benchmarking some clustered data processing tools, and
I need a nice huge dataset that is reasonably interesting and
preferably publicly available.

Obviously I could just crawl the web and make a large collection of  
pages. But I'd rather do something a little different, if possible.

Some examples would be the AOL logs, but they are a bit small (only
0.5G compressed). Tim Bray has 64G of (I think compressed) Apache logs
(search for his widefinder posts), but he has no plans to share. But
neither of those is terribly interesting (except that the latter could be
compared to his multi-core benchmarking).

So, any known interesting really large datasets lying around out there?

ckw

--
Chris K Wensel
ch...@wensel.net
http://chris.wensel.net/
http://www.cascading.org/


From: "Aaron Swartz" <m...@aaronsw.com>
Date: Mon, 16 Jun 2008 13:37:54 -0700
Local: Mon, Jun 16 2008 4:37 pm
Subject: Re: [get.theinfo] Large Data
bulk.resource.org usually has some interesting large data sets...

On Mon, Jun 16, 2008 at 1:34 PM, Chris K Wensel <ch...@wensel.net> wrote:


From: "Matt Wood" <matt.j.w...@gmail.com>
Date: Tue, 17 Jun 2008 09:29:06 +0100
Local: Tues, Jun 17 2008 4:29 am
Subject: Re: [get.theinfo] Large Data
Hello,

The various genome projects house some pretty large data sets too.

Most of these are publicly available for download:
http://www.ensembl.org/info/downloads/index.html

Hope that helps,

~ Matt

On Mon, Jun 16, 2008 at 9:34 PM, Chris K Wensel <ch...@wensel.net> wrote:


From: "Lukasz Szybalski" <szybal...@gmail.com>
Date: Tue, 17 Jun 2008 08:58:26 -0500
Local: Tues, Jun 17 2008 9:58 am
Subject: Re: [get.theinfo] Re: Large Data
Books: (14.5GB)
http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Acces...

Lucas

--
Where was my car manufactured?
http://cars.lucasmanual.com/vin
TurboGears Manual-Howto
http://lucasmanual.com/pdf/TurboGears-Manual-Howto.pdf

From: Chris K Wensel <ch...@wensel.net>
Date: Tue, 17 Jun 2008 12:53:38 -0700
Local: Tues, Jun 17 2008 3:53 pm
Subject: Re: [get.theinfo] Re: Large Data
thanks all.

On Jun 17, 2008, at 6:58 AM, Lukasz Szybalski wrote:

--
Chris K Wensel
ch...@wensel.net
http://chris.wensel.net/
http://www.cascading.org/

From: Rufus Pollock <rufus.poll...@okfn.org>
Date: Thu, 26 Jun 2008 13:40:22 +0100
Local: Thurs, Jun 26 2008 8:40 am
Subject: Re: [get.theinfo] Large Data
Missed this when it first came through but here are all the 'packages'
on CKAN tagged with size-large:

   <http://www.ckan.net/tag/read/size-large>

May be useful.

Regards,

Rufus

On 16/06/08 21:34, Chris K Wensel wrote:


From: "Philip (flip) Kromer" <f...@infochimps.org>
Date: Thu, 26 Jun 2008 14:50:56 -0500
Local: Thurs, Jun 26 2008 3:50 pm
Subject: Re: [get.theinfo] Re: Large Data
Bunch of junk piled here:
  http://infochimp.info/ics/huge/
  http://infochimp.info/ics/data/

Three great blog corpus datasets:
  http://stuff.metafilter.com/infodump/
  http://news.ycombinator.com/item?id=213891
  http://www.cs.biu.ac.il/~koppel/BlogCorpus.htm

Daylife and Flickr both have open APIs; from talking to people at each,
they don't mind how broadly you spider as long as you respect their
request rate (i.e. don't hammer 5 reqs/second for a week; that
they'll notice and mind).
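
For anyone spidering an API like these, here is a minimal sketch of a client-side throttle in Python. It is only an illustration: the `Throttle` and `crawl` names are made up for this example, the `fetch` callable is whatever HTTP client you already use, and the rate you pass in is an assumption you should replace with whatever limit the service actually asks for.

```python
import time


class Throttle:
    """Space out calls so they never exceed a maximum request rate."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds that must separate two requests.
        self.min_interval = 1.0 / max_per_second
        # clock and sleep are injectable so the class can be tested
        # without actually waiting.
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


def crawl(urls, fetch, throttle):
    """Fetch each URL in turn, pausing between requests as needed."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

Something like `crawl(urls, my_fetch, Throttle(max_per_second=1))` would then keep a long-running spider to at most one request per second, where `my_fetch` is your own download function.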

Enron email database:
  http://bailando.sims.berkeley.edu/enron_email.html
I've heard that some people have the MediaDefender email database
(http://torrentfreak.com/mediadefender-emails-leaked-070915/). If
you're interested, please email me; maybe I know one of them. There
are also a variety of public mailing list archive tarballs out there
-- linux kernel, etc.

If you want a great big pile of server logs, see
  http://waxy.org/2008/05/star_wars_kid_the_data_dump/

flip

--
http://www.infochimps.org
Connected Open Free Data

From: "Joseph Turian" <tur...@gmail.com>
Date: Thu, 26 Jun 2008 17:26:36 -0400
Local: Thurs, Jun 26 2008 5:26 pm
Subject: Re: [get.theinfo] Re: Large Data
If you are interested in text corpora, you can also use the WAC
(web-as-corpora) datasets.
I forget the exact URL, but http://www.drni.de/wac-tk/ should be a
start. The SIGWAC chair sent me the datasets a while ago.

Another example is Wikipedia. Collobert and Weston (2008) induce good
low-dimensional vector representations for words using Wikipedia
(snowbird.djvuzone.org/abstracts/158.pdf + forthcoming work).
Visualizing these embeddings would be interesting.

--
Academic: http://www-etud.iro.umontreal.ca/~turian/
Business: http://www.metaoptimize.com/
