common-crawl folder is empty


Martin Thurn

Aug 18, 2015, 11:37:51 AM
to Common Crawl
I cannot see anything in the common-crawl folder.  How do I get the data?

$ aws s3 ls --summarize s3://aws-publicdatasets/
                           PRE common-crawl/
                           PRE stats/
                           PRE tcga/
                           PRE trec/

Total Objects: 0
   Total Size: 0


$ aws s3 ls --summarize s3://aws-publicdatasets/common-crawl
                           PRE common-crawl/

Total Objects: 0
   Total Size: 0

Stephen Merity

Aug 18, 2015, 5:05:01 PM
to common...@googlegroups.com
Hi Martin,

As I don't have the aws s3 tool installed, I'm not able to replicate your results. By default, not every directory will list the files or directories it contains. If you run the same command on one of the crawl archive directories, as listed on the Getting Started page, you should be able to see all of the segments.

Additionally, a list of paths for the WARC, WAT, and WET files is released alongside each crawl archive. An example of that can be seen in the July 2015 blog post.
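
As a rough sketch of how those path lists might be used: the snippet below assumes each line in a file such as warc.paths.gz is an object key relative to the bucket root, so that prefixing it with the bucket gives a full S3 URL. The function name paths_to_s3_urls is made up for illustration and is not part of any Common Crawl tooling.

```shell
# Hypothetical helper: read object keys on stdin, one per line, and print
# them as full S3 URLs. Assumes each key is relative to the bucket root.
paths_to_s3_urls() {
  bucket="${1%/}"                      # drop one trailing slash, if present
  # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
  while IFS= read -r key || [ -n "$key" ]; do
    # Skip blank lines; print everything else with the bucket in front.
    [ -n "$key" ] && printf '%s/%s\n' "$bucket" "$key"
  done
  return 0
}
```

Assuming warc.paths.gz has already been downloaded, something like `gzip -dc warc.paths.gz | paths_to_s3_urls s3://aws-publicdatasets` would then print one URL per WARC file.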

Using my personal AWS account and s3cmd, I'm able to see:
smerity@pegasus:~$ s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/
                       DIR   s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/segments/
2015-08-14 00:45       632   s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/segment.paths.gz
2015-08-14 00:45    104599   s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/warc.paths.gz
2015-08-14 00:45    104321   s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/wat.paths.gz
2015-08-14 00:45    104322   s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/wet.paths.gz
and:
smerity@pegasus:~$ s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438044271733.81/warc/ | head -n 1
2015-08-07 18:41 935345553   s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438044271733.81/warc/CC-MAIN-20150728004431-00000-ip-10-236-191-2.ec2.internal.warc.gz

Could you try using "aws s3 ls --summarize" on the direct crawl archive path and report back if you have further issues?

Thanks!

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Tom Morris

Aug 18, 2015, 7:12:05 PM
to common...@googlegroups.com
Try adding --recursive.  It's not the default.
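
A minimal sketch of what that could look like, using the bucket layout shown earlier in this thread. The function names crawl_prefix and list_crawl_recursive are hypothetical. The listing function is only defined, not called, since it needs the AWS CLI and network access.

```shell
# Hypothetical helper: build the S3 prefix for one crawl archive.
# The trailing slash matters for a plain `aws s3 ls`: without it, the
# command prints the prefix itself (PRE ...) rather than its contents.
crawl_prefix() {
  printf 's3://aws-publicdatasets/common-crawl/crawl-data/%s/\n' "$1"
}

# List every object under a crawl archive and print the totals.
# --recursive descends into all sub-prefixes; --summarize adds the
# "Total Objects" and "Total Size" lines at the end.
list_crawl_recursive() {
  aws s3 ls --recursive --summarize "$(crawl_prefix "$1")"
}
```

For example, `list_crawl_recursive CC-MAIN-2015-32` would walk the whole archive, which is a very long listing; pointing the command at a single segment's warc/ directory keeps the output manageable.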

Tom

Martin Thurn

Aug 19, 2015, 8:21:50 AM
to Common Crawl
Thank you guys.  Apparently "directory browsing" is disabled at the path levels I tried.  Using the lists Stephen referenced, I am able to go deeper and see the files.  Problem solved.   

