Hi Martin,
As I don't have the aws s3 tool installed, I'm not able to replicate your results exactly; the listings below come from s3cmd with my personal, non Common Crawl credentials. By default, not every directory in the bucket allows you to list the files or directories it contains. If you run the same command on one of the crawl archive directories, as listed on the
Getting Started page, you should be able to see all of the segments.
Additionally, a list of paths for the WARC, WAT, and WET files is released alongside each crawl archive. An example can be seen in the
July 2015 blog post.
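As a rough sketch of how one of those path listings might be used once downloaded: the helper below expands each entry into a full S3 URL. The function name paths_to_s3_urls is made up for illustration, and it assumes each line of the listing is a key relative to the aws-publicdatasets bucket (for example common-crawl/crawl-data/...), which is worth checking against the file itself.

```shell
# Hypothetical helper (not part of any tool): expand a *.paths.gz listing
# into full S3 URLs.
# Assumption: each line of the listing is a key relative to the
# aws-publicdatasets bucket.
paths_to_s3_urls() {
    # $1 is the local path to a downloaded *.paths.gz file
    gzip -dc "$1" | sed 's|^|s3://aws-publicdatasets/|'
}

# Possible usage, after fetching the listing with s3cmd:
#   s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/warc.paths.gz
#   paths_to_s3_urls warc.paths.gz | head -n 3
```

Each resulting URL can then be passed to s3cmd ls or s3cmd get directly, without needing to list the intermediate directories.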
Using my personal AWS account and s3cmd, I'm able to see:
smerity@pegasus:~$ s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/
DIR s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/segments/
2015-08-14 00:45 632 s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/segment.paths.gz
2015-08-14 00:45 104599 s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/warc.paths.gz
2015-08-14 00:45 104321 s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/wat.paths.gz
2015-08-14 00:45 104322 s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/wet.paths.gz
and:
smerity@pegasus:~$ s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438044271733.81/warc/ | head -n 1
2015-08-07 18:41 935345553 s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438044271733.81/warc/CC-MAIN-20150728004431-00000-ip-10-236-191-2.ec2.internal.warc.gz
Could you try using "aws s3 ls --summarize" on the direct crawl archive path and report back if you have further issues?
Thanks!