Download CDX files from S3: Forbidden

Alex

Dec 11, 2015, 11:56:10 PM
to Common Crawl
Hi,

I was trying to get a list of all the URLs in a particular crawl from its CDX files.

AFAIK, as noted elsewhere in this group, the CDX files used to build the URL index are available on S3.
Indeed, one can list s3://aws-publicdatasets/common-crawl/cc-index/cdx/CC-MAIN-2015-40/segments/*/warc/*.cdx.gz, but a download attempt results in "Forbidden":

$ aws --no-sign-request s3 cp s3://aws-publicdatasets/common-crawl/cc-index/cdx/CC-MAIN-2015-40/segments/1443736678409.42/warc/CC-MAIN-20151001215758-00253-ip-10-137-6-227.ec2.internal.cdx.gz .
A client error (403) occurred when calling the HeadObject operation: Forbidden

Could you please help me understand:
 - whether this is indeed the simplest way to get all the URLs in a crawl?
 - if so, whether these access problems are caused by an S3 permissions issue that can be fixed, or whether this is the expected behavior?

Thanks in advance!

--
Alex

Scott

Jan 5, 2016, 8:12:59 PM
to Common Crawl
Alex,

I found them at

/common-crawl/cc-index/collections/CC-MAIN-2015-40/indexes/

There is a series of 300 files for 2015-40.

I used the Cyberduck application (https://cyberduck.io/?l=en) to browse and download a sample file.

I also wrote some Python code to download the files. I've been able to match 1.32B records. Not all records in the cdx files are separated by newlines. I have yet to confirm uniqueness. cdx-00220.gz has a greater-than-average record count.
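
In case it's useful to anyone, a minimal sketch of that kind of download script (not my exact code) using boto3 with unsigned requests might look like this:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client -- the bucket is public, so no AWS credentials are needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

BUCKET = "aws-publicdatasets"
PREFIX = "common-crawl/cc-index/collections/CC-MAIN-2015-40/indexes/"

# Pull down the 300 shards one at a time; each file is large, so this is slow
# unless you run it from EC2 or parallelize it.
for i in range(300):
    name = "cdx-{:05d}.gz".format(i)
    s3.download_file(BUCKET, PREFIX + name, name)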

Scott

Greg Lindahl

Jan 5, 2016, 11:08:03 PM
to common...@googlegroups.com
On Tue, Jan 05, 2016 at 05:12:59PM -0800, Scott wrote:
> Alex,
>
> I found them at
>
> /common-crawl/cc-index/collections/CC-MAIN-2015-40/indexes/
>
> There is a series of 300 files for 2015-40.

If the process for handling CDX is similar to what goes on inside the
Internet Archive,

1) A .cdx is generated for every WARC file
2) The contents of these files are globally sorted into a set of output files
3) These sorted files are used to power the index.

I think you're looking at (2), and Alex was getting permission denied
for the files in (1). And Alex probably wants the files from (2) instead.

-- greg

Tom Morris

Jan 7, 2016, 10:39:36 AM
to common...@googlegroups.com
Scott is on the right track.  More comments inline below: 

On Tue, Jan 5, 2016 at 8:12 PM, Scott <scott....@gmail.com> wrote:

I found them at

/common-crawl/cc-index/collections/CC-MAIN-2015-40/indexes/

Yup, that's the correct location.  The filenames are of the form cdx-00000.gz through cdx-00299.gz and there's a master index at cluster.idx.
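
If you want to sanity-check what's actually there, a quick unsigned listing with boto3 (just a sketch, using the aws-publicdatasets paths above) does the trick:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned/anonymous access works because the bucket is public.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
prefix = "common-crawl/cc-index/collections/CC-MAIN-2015-40/indexes/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="aws-publicdatasets", Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])

You should see the 300 cdx-*.gz shards plus cluster.idx.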
 
Not all records in cdx are separated by newline. 

This is caused by a bug in the index creation program, which is fixed in the week 48 (November) crawl. In earlier crawl indexes there's no newline at the end of each chunk of 3000 URLs. It's easy enough to work around in Python, but a real nuisance if you're trying to process the files with command-line tools.
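
Here's a sketch of the Python workaround I have in mind. It assumes the "urlkey timestamp {json}" line layout that the jq pipeline below also relies on; json.JSONDecoder.raw_decode reports where each JSON object ends, so glued records split apart cleanly:

import json

_decoder = json.JSONDecoder()

def split_records(line):
    # Yield (urlkey, timestamp, fields) tuples from one physical line.
    # Handles the pre-2015-48 bug where the last record of a 3000-URL chunk
    # is glued to the following record with no newline in between.
    rest = line.strip()
    while rest:
        urlkey, timestamp, tail = rest.split(" ", 2)
        fields, end = _decoder.raw_decode(tail)  # parse exactly one JSON object
        yield urlkey, timestamp, fields
        rest = tail[end:].lstrip()

A plain sum(1 for _ in split_records(line)) per line then gives you a record count that isn't thrown off by the missing newlines.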
 
Have yet to confirm uniqueness. 

URLs are not unique in a crawl.  Each page may be fetched multiple times.  In the 2015-48 crawl, it looks like each page was fetched about 1.05 times on average.

Here's a one-liner which will extract all the URLs from the latest crawl (although you probably want to do this in parallel and on an EC2 machine where there's more bandwidth available):

$ for i in $(seq -f "%05g" 0 299); do echo $i; aws s3 --no-sign-request cp s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cdx-$i.gz  - | gunzip | cut -f 3-9999 -d ' ' | jq .url | gzip > url-$i.gz; done
 
It requires the very cool jq utility (https://stedolan.github.io/jq/). If you've got GNU Parallel, here's a parallel version for you:

parallel -j 4 "echo {}; aws s3 --no-sign-request cp s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cdx-{}.gz - | gunzip | cut -f 3-9999 -d ' ' | jq .url | gzip > url-{}.gz" ::: $(seq -f "%05g" 0 299)

Adjust the job parameter (-j) to suit your system if necessary, or just use the default.  I think the CPU time in this pipe is dominated by the cut command, but you'll also want to balance that against the available network bandwidth.

If you just want the count, the 2015-48 Common Crawl index contains 1,824,170,527 (1.8B) pages.  The full (compressed) URL list weighs in at under 15 GB.

Tom

Tom Morris

Jan 7, 2016, 10:58:51 AM
to common...@googlegroups.com
P.S. Yes, the distribution of records is kind of "lumpy."

On Tue, Jan 5, 2016 at 8:12 PM, Scott <scott....@gmail.com> wrote:
 cdx-00220.gz has greater than average record count.

Most (234) of the files are in the 5-7M record range, but there are three files with fewer than 4 million records, while cdx-00182 has almost 10 million and cdx-00220 has a whopping 13.8M records.  The splits are determined using a reservoir sampling algorithm, which won't give perfect accuracy, but I suspect that the distribution has drifted since the original sampling was done and the sampling hasn't been re-run.

The uneven distribution isn't really a big deal, though, since cluster.idx contains an index record for each group of 3000 records in the main index.
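
To make that concrete, here's a rough sketch of a lookup against cluster.idx (using the 2015-40 index, whose cluster.idx is readable). I'm assuming the usual zipnum-style layout here -- each cluster.idx line gives "urlkey timestamp", then tab-separated shard name, byte offset, and length of an independently gzipped block of ~3000 records -- so double-check the columns against the real file before trusting it:

import bisect
import gzip
import io

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "aws-publicdatasets"
PREFIX = "common-crawl/cc-index/collections/CC-MAIN-2015-40/indexes/"

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

def load_cluster_idx():
    # Assumed layout: "<urlkey> <timestamp>\t<shard>\t<offset>\t<length>\t<seq>"
    body = s3.get_object(Bucket=BUCKET, Key=PREFIX + "cluster.idx")["Body"]
    return [line.split("\t") for line in body.read().decode("utf-8").splitlines()]

def fetch_block(entries, urlkey):
    # Fetch the ~3000-record block whose key range covers `urlkey`.
    keys = [e[0] for e in entries]
    i = max(bisect.bisect_right(keys, urlkey) - 1, 0)
    _, shard, offset, length = entries[i][:4]
    start, end = int(offset), int(offset) + int(length) - 1
    raw = s3.get_object(
        Bucket=BUCKET,
        Key=PREFIX + shard,
        Range="bytes={}-{}".format(start, end),
    )["Body"].read()
    # Each block is its own gzip member, so it decompresses independently.
    return gzip.GzipFile(fileobj=io.BytesIO(raw)).read().decode("utf-8")

# e.g. print(fetch_block(load_cluster_idx(), "com,example)/")[:500])

That's also why the ragged shard sizes don't matter much in practice: the secondary index always narrows you down to a small byte range, no matter how big any individual cdx-*.gz happens to be.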

Brainstorm: cluster.idx provides a quick proxy for a rough count: 549,054 * 3,000 ≈ 1.6B, which is reasonably close to the actual 1.8B.

Tom

Tom Morris

Jan 7, 2016, 11:39:54 AM
to common...@googlegroups.com
Oops, that's the wrong cluster.idx file.

On Thu, Jan 7, 2016 at 10:58 AM, Tom Morris <tfmo...@gmail.com> wrote:

Brainstorm - cluster.idx provides a quick proxy for a rough count : 549054*3000 = 1.6B which is reasonably close to the actual 1.8B.

The cluster.idx file from the current crawl is inaccessible (protected), but reconstructing its contents from the part-00nnn files gives a count of 608,213.

608,213 * 3,000 = 1,824,639,000, which is pretty darn close to the actual count of 1,824,170,527.

Tom