Total size of just metadata?

phantom...@gmail.com

Apr 3, 2014, 6:32:26 PM

to common...@googlegroups.com

What's the approximate total size of all the metadata in the latest Common Crawl?

I'm not an active Amazon developer, and Amazon currently has a bug with my S3 account that I'm waiting for them to fix...

Besides that, knowing the number up front would save me a bunch of time.

My goal is to just download all the metadata if it's reasonably sized and process it within our cluster.

David Parks

Apr 13, 2014, 12:29:29 PM

to common...@googlegroups.com

Just a very cursory look here, but I see that there are 100 WAT files for each segment, and they look to be roughly 225 MB each (eyeballed average). It looks like there are 557 segments in the most recent crawl. So 100 files x 225 MB x 557 segments gets me about 12.5 TB of total compressed metadata.
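For anyone who wants to redo the arithmetic with their own numbers, here is a minimal sketch of the estimate. The three inputs are the eyeballed figures from this post, not measured values, and the conversion assumes decimal units (1 TB = 1,000,000 MB).

```python
# Back-of-the-envelope size estimate using the eyeballed figures from the post.
WAT_FILES_PER_SEGMENT = 100
AVG_WAT_FILE_MB = 225   # eyeballed average, compressed
SEGMENTS = 557          # segments seen in the most recent crawl

total_mb = WAT_FILES_PER_SEGMENT * AVG_WAT_FILE_MB * SEGMENTS
total_tb = total_mb / 1_000_000  # assumption: decimal units, 1 TB = 1,000,000 MB

print(f"{total_mb:,} MB ~= {total_tb:.1f} TB")
# prints: 12,532,500 MB ~= 12.5 TB
```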

I once downloaded 300 GB from S3. It took me about a week and a reasonable vocabulary of swear words. I'd suggest sending them (snail mail) a few disks and having them load the data directly onto disk (AWS offers a service for loading S3 data onto physical media you mail them). That'll run you a few hundred dollars, I guess. Or suck it up and just use their Hadoop cluster.
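To put the download option in perspective, here is a rough sketch that scales the one-week, 300 GB anecdote up to the roughly 12.5 TB estimate. It assumes the same sustained throughput, which is only one data point from one connection, so treat the result as an order-of-magnitude guess.

```python
# Rough transfer-time estimate, assuming the throughput from the 300 GB anecdote.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

past_download_gb = 300     # size of the earlier S3 download
past_download_weeks = 1    # roughly how long it took
metadata_tb = 12.5         # estimated compressed metadata size

# Implied sustained rate in MB/s (decimal units: 1 GB = 1000 MB).
rate_mb_per_s = past_download_gb * 1000 / (past_download_weeks * SECONDS_PER_WEEK)

# Time to pull the full metadata set at that same rate.
weeks_needed = metadata_tb * 1000 / past_download_gb * past_download_weeks

print(f"{rate_mb_per_s:.2f} MB/s, about {weeks_needed:.0f} weeks")
# prints: 0.50 MB/s, about 42 weeks
```

At that rate a straight download is clearly impractical, which is why mailing disks or processing the data in place looks more attractive.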