download ALL the data?


n...@vorpus.org

Jul 6, 2014, 7:22:01 PM
to common...@googlegroups.com
Hi all,

Somewhat to my surprise, I haven't been able to find any canonical document describing what data is available and how to access it. Suppose I wanted to download ALL the raw crawl data. From various hints and poking around, I've found the following. Eliding the s3://aws-publicdatasets/common-crawl/ prefix at the beginning of each URL, the dumps are:

crawl-001/.../*.arc.gz
  -- where ... is any arbitrary chain of directories

crawl-002/.../*.arc.gz
  -- where ... is any arbitrary chain of directories

parse-output/segment/XXX/*.arc.gz
  -- where XXX is one of the lines in parse-output/valid-segments.txt

crawl-data/XXX/segments/*/warc/*.warc.gz
  -- where for the "main" datasets, XXX should be one of CC-MAIN-2013-20, CC-MAIN-2013-48, or CC-MAIN-2014-10. (And we should pretend for now that CC-MAIN-2014-15 doesn't exist.)

Is this list correct? Is it complete? Is there a canonical, up-to-date source of this information that I've missed?

-n

Akshay Bhat

Jul 6, 2014, 10:42:31 PM
to common...@googlegroups.com
Hi,

The common crawl data is divided into individual crawls according to when they were conducted, e.g. 2012, 2013, 2014, etc.
To my knowledge there are four crawls, with the most recent three using the same format (WARC).

Now, regarding downloading an entire crawl (e.g. the latest one): it's going to be very difficult, since a single crawl consists of terabytes of compressed data (I think around 100 terabytes).
Thus the standard way of accessing crawl data is to use individual servers or a Hadoop cluster (EC2/EMR) inside an Amazon Web Services data center; this way you don't incur costs for transferring data out (approximately $0.10 per GB).

I have a small Python library that lists, and can be used to retrieve, files from the latest three crawls.


To actually use an entire crawl for analysis requires significant knowledge of setting up a distributed system in the AWS environment, and is quite challenging.
If, however, you want a single file, it can be downloaded (at least temporarily) by clicking on the following S3 link.
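
In case the link doesn't come through, something along these lines also works for grabbing a single file over the public HTTP endpoint (untested sketch; the key below is only a placeholder, substitute a real .warc.gz path from the segment listings):

import shutil
import urllib2  # Python 2; on Python 3 use urllib.request instead

# Placeholder key, for illustration only; substitute a real .warc.gz
# path taken from one of the crawl's segment listings.
key = "common-crawl/crawl-data/CC-MAIN-2014-10/segments/<segment>/warc/<file>.warc.gz"
url = "https://aws-publicdatasets.s3.amazonaws.com/" + key

response = urllib2.urlopen(url)
with open("sample.warc.gz", "wb") as out:
    # Stream to disk rather than reading the whole file into memory.
    shutil.copyfileobj(response, out)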



Best,
Akshay

Nathaniel Smith

Jul 7, 2014, 7:11:47 AM
to common...@googlegroups.com
Hi Akshay,

Thanks for your reply! I've looked at your python project before, and
it was helpful :-). But I'm kind of hoping for some more official
response, because there's a lot of misinformation floating around --
in fact I think you've fallen prey to some of it below.

On Mon, Jul 7, 2014 at 3:42 AM, Akshay Bhat <aksha...@gmail.com> wrote:
> Hi,
>
> The common crawl data is divided into individual crawls according to when
> they were conducted, e.g. 2012, 2013, 2014, etc.
> To my knowledge there are four crawls, with the most recent three using the
> same format (WARC).

My list above includes official data releases going back to 2008, see e.g.:
https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
This is potentially pretty interesting for some use cases, e.g. if you
want to see how the web has changed over time.

> Now, regarding downloading an entire crawl (e.g. the latest one): it's going
> to be very difficult, since a single crawl consists of terabytes of
> compressed data (I think around 100 terabytes).

The CC-MAIN-2014-10 crawl's .warc.gz files are "only" 34 terabytes
:-). My impression is that this is typical for the recent crawls, but
I haven't checked the older ones.

> Thus the standard way of accessing crawl data is to use individual servers
> or a Hadoop cluster (EC2/EMR) inside an Amazon Web Services data center;
> this way you don't incur costs for transferring data out (approximately
> $0.10 per GB).

I've seen docs that talk about how the crawl data used to be stored in
an S3 "requester pays" bucket, which did require special
authentication, and incurred download charges if you wanted to get the
data outside of AWS. But this isn't true anymore. Now that the data is
in the aws-publicdatasets bucket, anyone can download it from
anywhere, anonymously and for free.

-n

P.S.: Little script I used for computing the size of the 2014-10
warc.gz files, in case anyone finds it useful:

import boto

# Connect to S3 anonymously; the crawl data lives in a public bucket,
# so no credentials are needed.
conn = boto.connect_s3(anon=True)
pds = conn.get_bucket("aws-publicdatasets")

total = 0
# List each segment prefix, then sum the sizes of its .warc.gz files.
for segment in pds.list("common-crawl/crawl-data/CC-MAIN-2014-10/segments/",
                        delimiter="/"):
    for warc in pds.list(segment.name + "warc/"):
        total += warc.size
print(total)

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

Stephen Merity

Jul 7, 2014, 5:15:26 PM
to common...@googlegroups.com
Hey =]

As you've pointed out, there is no good canonical source currently. This is because we're in the middle of revamping our website, which includes content updates, but the revamp has taken longer than expected.

Akshay filled in many of the points (and with a very quick response as well!), so I will just fill in the edges and provide some more detail for anyone else who stumbles onto this discussion.

If you do want access to the files, you can get them freely thanks to the Amazon Public Datasets program. You can either access them via S3 with null credentials (there's an example in Java, or you can use boto.connect_s3(anon=True) as you've done in your listing code) or use the HTTP endpoints that S3 makes available (as shown by Akshay above).
Using the anonymous S3 credentials is probably easiest, as it allows for ls'ing of directories and other niceties.
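
For example, here's a quick ls-style listing of the crawl directories with anonymous boto (untested sketch, but it gives the gist):

import boto

# Anonymous connection; no AWS account or credentials required.
conn = boto.connect_s3(anon=True)
pds = conn.get_bucket("aws-publicdatasets")

# Listing with a delimiter behaves like "ls": it returns the common
# prefixes (directories) rather than every individual key underneath.
for entry in pds.list("common-crawl/crawl-data/", delimiter="/"):
    print(entry.name)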

To extend the data list mentioned earlier, there are six crawls (with a seventh coming imminently):
  • [ARC] Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
  • [ARC] Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
  • [ARC] Archived Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
  • [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
  • [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/
  • [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/
The S3 file format for all crawls since the switch to WARC is described in the blog post New Crawl Data.
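
For anyone curious about what's inside those files, here's a rough hand-rolled sketch of walking the records in a locally downloaded .warc.gz (just an illustration of the record layout; a proper WARC library is the better choice for real work):

import gzip

# Walk the records of a locally downloaded WARC file and print the
# target URI of each "response" record. This hand-rolls just enough of
# the format: a "WARC/1.0" line, header lines up to a blank line, then
# Content-Length bytes of payload.
with gzip.open("sample.warc.gz", "rb") as f:
    while True:
        line = f.readline()
        if not line:
            break  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank padding between records
        headers = {}
        for header in iter(f.readline, b"\r\n"):  # headers end at a blank line
            name, _, value = header.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        if headers.get("warc-type") == "response":
            print(headers.get("warc-target-uri"))
        f.read(int(headers.get("content-length", 0)))  # skip over the payload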

Recent WARC crawls are smaller in size than the older crawls, as the older crawls ran for longer. We're aiming to increase the frequency of our crawls to allow for experiments over time periods, though that does impact the size of each individual crawl.

Downloading the full set of data is a sizable task, but quite doable if you have the storage and the bandwidth. I'd be interested in hearing about your experiences, as they might hold valuable insights for others trying to pull down a large portion of the corpus to private clusters.
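
If you do attempt it, the bulk pull itself can be fairly mechanical; roughly the untested sketch below (anonymous boto again, the destination path is only an example, and you'd want retries and parallelism on top):

import os
import boto

conn = boto.connect_s3(anon=True)
pds = conn.get_bucket("aws-publicdatasets")

crawl = "common-crawl/crawl-data/CC-MAIN-2014-10/"
dest_root = "/data/commoncrawl"  # example destination; adjust to taste

# Mirror every .warc.gz key in the crawl to local disk.
for key in pds.list(crawl):
    if not key.name.endswith(".warc.gz"):
        continue
    dest = os.path.join(dest_root, key.name)
    if not os.path.isdir(os.path.dirname(dest)):
        os.makedirs(os.path.dirname(dest))
    key.get_contents_to_filename(dest)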



