Hi,
I was trying to get a list of all URLs in a particular crawl from its CDX files.
AFAIK, as noted elsethread, CDX files used for creation of the URL index are available on S3.
Indeed, one can see s3://aws-publicdatasets/common-crawl/cc-index/cdx/CC-MAIN-2015-40/segments/*/warc/*.cdx.gz, but a download attempt results in "Forbidden":
$ aws --no-sign-request s3 cp s3://aws-publicdatasets/common-crawl/cc-index/cdx/CC-MAIN-2015-40/segments/1443736678409.42/warc/CC-MAIN-20151001215758-00253-ip-10-137-6-227.ec2.internal.cdx.gz .
A client error (403) occurred when calling the HeadObject operation: Forbidden
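For reference, the object key I am requesting follows the per-WARC naming visible in the listing above. A minimal sketch of how I am constructing these keys (the helper name is mine, and I am assuming, unverified, that every WARC file has a matching .cdx.gz):

```python
# Hypothetical helper: build the CDX object key for a given crawl,
# segment, and WARC basename, following the key layout observed above.
# Assumption (unverified): every WARC has a corresponding .cdx.gz.
def cdx_key(crawl, segment, warc_basename):
    return ("common-crawl/cc-index/cdx/{}/segments/{}/warc/{}.cdx.gz"
            .format(crawl, segment, warc_basename))

# The key used in the failing `aws s3 cp` attempt above:
print(cdx_key("CC-MAIN-2015-40", "1443736678409.42",
              "CC-MAIN-20151001215758-00253-ip-10-137-6-227.ec2.internal"))
```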
Could you please help me to understand:
- whether this is indeed the simplest way to get all the URLs in the crawl?
- if so, are these access problems caused by an S3 permissions issue that can be fixed, or is the 403 expected?
Thanks in advance!
--
Alex