Mistake in Common Crawl Index announcement blog post

Tom Morris

Aug 25, 2015, 11:22:46 AM
to common...@googlegroups.com
There's a small but critical mistake in the blog post announcing the Common Crawl Index. The S3 location given for the index data files is missing the first segment of the path (i.e., /common-crawl).

The correct location for the top-level index is:

s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/cluster.idx

and the individual index files are in 300 chunks (currently) at:

s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-00000.gz
...
s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-00299.gz

Because they're part of the AWS public data sets, you don't need to pay to fetch them, so you can use the --no-sign-request switch on your copy commands:

$ aws --no-sign-request s3 cp s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cluster.idx .
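If you want all 300 chunk paths rather than a single file, a short script can generate them. This is only a sketch: the function name cdx_chunk_uris and its num_chunks parameter are made up for illustration, and the chunk count of 300 is simply the current figure mentioned above, so it may differ for other crawls.

```python
# Sketch: build the S3 URIs of the per-crawl index chunks.
# cdx_chunk_uris and num_chunks are hypothetical names used for illustration.
BASE = "s3://aws-publicdatasets/common-crawl/cc-index/collections"


def cdx_chunk_uris(crawl_id, num_chunks=300):
    """Return the URIs cdx-00000.gz .. cdx-<num_chunks - 1>.gz for one crawl."""
    prefix = "{}/{}/indexes".format(BASE, crawl_id)
    return ["{}/cdx-{:05d}.gz".format(prefix, i) for i in range(num_chunks)]


uris = cdx_chunk_uris("CC-MAIN-2015-14")
print(uris[0])
print(uris[-1])
```

Each URI can then be passed to the same aws s3 cp command shown above.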

Tom

Stephen Merity

Aug 25, 2015, 5:56:17 PM
to common...@googlegroups.com
Hi Tom,

Thanks for pointing out the error in the CC Index path! I've fixed the relevant path on the blog post.

If there is any other information you think would be useful to include, I'd be interested to hear it. The Common Crawl Index created by Ilya Kreymer is not only a hugely useful tool in its own right, but also a really interesting way to perform analysis over the crawl archives without having to download or process large amounts of data. Encouraging even more use and exploration of it is of great interest :)


--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Tom Morris

Aug 26, 2015, 1:27:22 PM
to common...@googlegroups.com
Thanks for the quick fix!

On Tue, Aug 25, 2015 at 5:55 PM, Stephen Merity <ste...@commoncrawl.org> wrote:
If there is any other information you think that would be useful to include, I'd be interested to hear. The Common Crawl Index created by Ilya Kreymer is not only a hugely useful tool in its own right, but also a really interesting way to perform analysis over the crawl archives without having to download or process large amounts of data. Encouraging even more use and exploration from it is of great interest :)

I agree it's a great resource. Some more concrete documentation, perhaps with a graphical treatment, might help people understand what's available and how to access it, e.g.:

cluster.idx --> cdx-00100.gz --> .../segments/<segmentID>/warc/xxx.warc.gz

==cluster.idx==
com,idakoos)/tshirt+ringer/love-ann-arbor-barcode,1033180 20150330012118 cdx-00100.gz 0 205937 181894
com,idakoos)/tshirt+ringer/uneven-bars 20150327090812 cdx-00100.gz 205937 197382 181895

==cdx-00100.gz==
com,idakoos)/tshirt+ringer/love-ann-arbor-barcode,1033180 20150330012118 {"url": "http://www.idakoos.com/tshirt+ringer/love-ann-arbor-barcode,1033180", "mime": "text/html", "status": "200", "digest": "IVLXZS767CUTQ3QCNS6IZGYBVMDC3UKJ", "length": "15588", "offset": "561878738", "filename": "common-crawl/crawl-data/CC-MAIN-2015-14/segments/1427131298871.15/warc/CC-MAIN-20150323172138-00093-ip-10-168-14-71.ec2.internal.warc.gz"}
com,idakoos)/tshirt+ringer/love-annaba,976331 20150401225526 {"url": "http://www.idakoos.com/tshirt+ringer/love-annaba,976331", "mime": "text/html", "status": "200", "digest": "7YBMEXSTSTXFOZLQO2VXNFQU7VQQ4EWY", "length": "15495", "offset": "574204490", "filename": "common-crawl/crawl-data/CC-MAIN-2015-14/segments/1427131309963.95/warc/CC-MAIN-20150323172149-00186-ip-10-168-14-71.ec2.internal.warc.gz"}
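The two sample listings above could be read with something like the following. Treat it as a sketch under assumptions: the function names are invented for illustration, the fourth and fifth cluster.idx fields are taken to be a byte offset and length within the named cdx file (the samples are consistent with that, since 0 + 205937 is the next line's offset), and the last field is assumed to be a running sequence number.

```python
import json


def parse_cluster_line(line):
    """Split one cluster.idx line into named fields (field meanings are assumed)."""
    key, timestamp, chunk_file, offset, length, seq = line.split()
    return {
        "key": key,                # canonicalized URL key of the chunk's first entry
        "timestamp": timestamp,
        "chunk_file": chunk_file,  # e.g. cdx-00100.gz
        "offset": int(offset),     # assumed: byte offset of the chunk in that file
        "length": int(length),     # assumed: compressed length of the chunk in bytes
        "seq": int(seq),           # assumed: running sequence number
    }


def chunk_byte_range(entry):
    """HTTP Range header value covering one chunk (inclusive end byte)."""
    start = entry["offset"]
    return "bytes={}-{}".format(start, start + entry["length"] - 1)


def parse_cdx_line(line):
    """Split one cdx line into (key, timestamp, JSON fields as a dict)."""
    key, timestamp, payload = line.split(" ", 2)
    return key, timestamp, json.loads(payload)


cluster_lines = [
    "com,idakoos)/tshirt+ringer/love-ann-arbor-barcode,1033180 20150330012118 cdx-00100.gz 0 205937 181894",
    "com,idakoos)/tshirt+ringer/uneven-bars 20150327090812 cdx-00100.gz 205937 197382 181895",
]
entries = [parse_cluster_line(line) for line in cluster_lines]

cdx_line = (
    'com,idakoos)/tshirt+ringer/love-ann-arbor-barcode,1033180 20150330012118 '
    '{"url": "http://www.idakoos.com/tshirt+ringer/love-ann-arbor-barcode,1033180", '
    '"mime": "text/html", "status": "200", "digest": "IVLXZS767CUTQ3QCNS6IZGYBVMDC3UKJ", '
    '"length": "15588", "offset": "561878738", '
    '"filename": "common-crawl/crawl-data/CC-MAIN-2015-14/segments/1427131298871.15/warc/'
    'CC-MAIN-20150323172138-00093-ip-10-168-14-71.ec2.internal.warc.gz"}'
)
key, timestamp, record = parse_cdx_line(cdx_line)

print(entries[0]["chunk_file"], chunk_byte_range(entries[0]))
print(record["filename"], record["offset"], record["length"])
```

The offset and length inside the JSON are strings in the sample, so they need converting with int() before being used to locate the record in the WARC file.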

The CC-MAIN-2015-14 index totals 104 GB split across 300 gzipped files (cdx-*.gz). The top-level index, cluster.idx, is 67 MB uncompressed and lists the starting URL of each of the 549K compressed chunks stored in those 300 files, which together contain 1.65B URLs.


Or something along those lines...

The information is all there now, but presenting it in a less abstract form might help people who need/prefer concrete examples.
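As one more concrete illustration, the size figures quoted above imply some rough averages. The inputs are the rounded numbers from this message, so the results are approximate and only meant to give a feel for the layout.

```python
# Rough averages derived from the rounded figures quoted in this thread.
total_urls = 1_650_000_000   # 1.65B URLs
total_chunks = 549_000       # 549K compressed chunks
total_files = 300            # cdx-*.gz files
total_size_gb = 104          # total size of the gzipped index files

urls_per_chunk = total_urls / total_chunks
chunks_per_file = total_chunks / total_files
gb_per_file = total_size_gb / total_files

print("URLs per chunk:  about {:.0f}".format(urls_per_chunk))
print("Chunks per file: about {:.0f}".format(chunks_per_file))
print("GB per file:     about {:.2f}".format(gb_per_file))
```

So each line of cluster.idx points at a block of roughly three thousand index entries, which is what makes a lookup cheap compared with scanning a whole cdx file.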

One derivative asset which might be useful is a list of domain names, perhaps including a URL count.  This would help people see coverage at a glance (and might also be useful for spotting anomalies).
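A host-level version of that list could be derived from the index keys alone. The sketch below assumes the key format seen in the samples (reversed, comma-separated host labels followed by a closing parenthesis) and uses made-up function names. It counts host names rather than registered domains, since collapsing hosts into domains would need something like the public suffix list.

```python
from collections import Counter


def surt_key_to_host(key):
    """Turn a key such as 'com,idakoos)/path' back into 'idakoos.com'."""
    host_part = key.split(")", 1)[0]          # text before the closing parenthesis
    return ".".join(reversed(host_part.split(",")))


def count_hosts(cdx_lines):
    """Count index entries per host; the key is the first space-separated field."""
    counts = Counter()
    for line in cdx_lines:
        key = line.split(" ", 1)[0]
        counts[surt_key_to_host(key)] += 1
    return counts


# Keys and timestamps from the sample cdx lines; JSON payloads omitted for brevity.
sample_lines = [
    "com,idakoos)/tshirt+ringer/love-ann-arbor-barcode,1033180 20150330012118",
    "com,idakoos)/tshirt+ringer/love-annaba,976331 20150401225526",
]
counts = count_hosts(sample_lines)
for host, n in counts.most_common():
    print(host, n)
```

Note that in the samples the key has no "www" label even though the URL does, so the counts are per canonicalized host.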

Tom