Hi Akshay,
Thanks for your reply! I've looked at your Python project before, and
it was helpful :-). But I'm hoping for a more official response,
because there's a lot of misinformation floating around -- in fact I
think you've fallen prey to some of it below.
On Mon, Jul 7, 2014 at 3:42 AM, Akshay Bhat <
aksha...@gmail.com> wrote:
> Hi,
>
> The common crawl data is divided into individual crawls according to when
> they were conducted, e.g. 2012, 2013, 2014, etc.
> To my knowledge there are four crawls, with the recent three having the
> same format (WARC).
My list above includes official data releases going back to 2008, see e.g.:
https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
This is potentially pretty interesting for some use cases, e.g. if you
want to see how the web has changed over time.
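Since the recent crawls are distributed as WARC files, here's a little
sketch of what reading one record looks like. The record bytes are faked
in memory so the example is self-contained; real .warc.gz files (as I
understand it) concatenate many gzip members, typically one per record:

```python
import gzip
import io

# A fake single WARC record, for illustration only:
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"Hello, crawl!")
compressed = gzip.compress(record)

def read_warc_headers(fileobj):
    """Read the version line and header fields of one WARC record."""
    version = fileobj.readline().rstrip(b"\r\n").decode("ascii")
    headers = {}
    for line in fileobj:
        line = line.rstrip(b"\r\n")
        if not line:  # blank line ends the header block
            break
        name, _, value = line.decode("ascii").partition(":")
        headers[name.strip()] = value.strip()
    return version, headers

with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
    version, headers = read_warc_headers(f)
    # The payload follows the headers; its length is in Content-Length
    payload = f.read(int(headers["Content-Length"]))

print(version)                     # WARC/1.0
print(headers["WARC-Target-URI"])  # http://example.com/
```

For real work you'd probably want one of the existing WARC-parsing
libraries rather than hand-rolling this, but the format really is that
simple at heart.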
> Now regarding downloading an entire crawl, e.g. the latest: it's going to
> be very difficult, since a single crawl consists of terabytes of
> compressed data (I think around 100 terabytes).
The CC-MAIN-2014-10 crawl's .warc.gz files are "only" 34 terabytes
:-). My impression is that this is typical for the recent crawls, but
I haven't checked the older ones.
> Thus the standard way of accessing crawl data is by using individual
> servers / a Hadoop cluster (EC2/EMR) inside an Amazon Web Services data
> center; this way you don't incur costs for transferring data (approx.
> $0.10 per GB).
I've seen docs that talk about how the crawl data used to be stored in
an S3 "requester pays" bucket, which did require special
authentication, and incurred download charges if you wanted to get the
data outside of AWS. But this isn't true anymore. Now that the data is
in the aws-publicdatasets bucket, anyone can download it from
anywhere, anonymously and for free.
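For example, any object in a public S3 bucket can be fetched over plain
HTTPS with no AWS credentials at all. A sketch (the key path here is
made up for illustration -- real keys come from listing the bucket, e.g.
with the script in the P.S. below):

```python
BUCKET = "aws-publicdatasets"

def public_url(key):
    """Build the anonymous HTTPS URL for an object in a public S3 bucket."""
    return "https://%s.s3.amazonaws.com/%s" % (BUCKET, key)

# Hypothetical key, for illustration only:
key = "common-crawl/crawl-data/CC-MAIN-2014-10/segments/SEG/warc/FILE.warc.gz"
print(public_url(key))
# You can then fetch that URL with urllib, curl, wget, whatever --
# no signing, no credentials.
```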
-n
P.S.: Little script I used for computing the size of the 2014-10
warc.gz files, in case anyone finds it useful:
import boto

# Anonymous connection -- the bucket is public, so no credentials needed
conn = boto.connect_s3(anon=True)
pds = conn.get_bucket("aws-publicdatasets")
total = 0
# List each segment "directory", then sum the sizes of its warc files
for segment in pds.list("common-crawl/crawl-data/CC-MAIN-2014-10/segments/",
                        delimiter="/"):
    for warc in pds.list(segment.name + "warc/"):
        total += warc.size
print(total)
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org