https access

José González-Brenes

Oct 18, 2016, 8:47:05 PM
to Common Crawl
Hello there,

Thank you for the fabulous project! Unfortunately, I'm getting stuck.

I mounted the commoncrawl bucket in Spark, but the server stalls when I try to open a file. So I tried retrieving the files over HTTPS instead, but I only get a 404 :( For example:

The response says that the key doesn't exist. I thought I could download the data through HTTPS?

Thanks,
JPG

José González-Brenes

Oct 18, 2016, 8:48:15 PM
to Common Crawl
PS: I also tried https://commoncrawl.s3.amazonaws.com/CC-MAIN-2016-40 with no better luck :(

Sebastian Nagel

Oct 19, 2016, 3:42:37 AM
to common...@googlegroups.com
Hi José,

Unfortunately, the S3 web/HTTP endpoints do not list objects for a given prefix or "directory".
Either use an AWS S3 library (e.g., boto for Python) or the provided list of WARC files:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-40/warc.paths.gz
See also:
http://commoncrawl.org/the-data/get-started/
http://commoncrawl.org/2016/10/september-2016-crawl-archive-now-available/

Best,
Sebastian
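
The second option in the reply above (using the provided list of WARC files) could be sketched roughly as follows. This is a minimal sketch, not an official client. It assumes that each line of warc.paths.gz is an object key relative to the bucket root, and that the endpoint quoted in the thread is still the one to use. The function names paths_to_urls and fetch_warc_urls are made up for this illustration.

```python
import gzip
import urllib.request

# Endpoint and listing URL as quoted in the thread; they may have changed since.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"
PATHS_URL = BASE_URL + "crawl-data/CC-MAIN-2016-40/warc.paths.gz"


def paths_to_urls(gz_bytes, base_url=BASE_URL):
    """Decompress a gzipped path listing and prefix every key with base_url.

    Assumption: the listing holds one object key per line, relative to the
    bucket root. Blank lines are skipped.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [base_url + line.strip() for line in text.splitlines() if line.strip()]


def fetch_warc_urls(paths_url=PATHS_URL):
    """Download the listing over HTTPS and return full URLs for each WARC file."""
    with urllib.request.urlopen(paths_url) as response:
        return paths_to_urls(response.read())
```

Calling fetch_warc_urls() should then return a list of full HTTPS URLs, each of which can be downloaded individually. The point is that a file has to be requested by its complete key, since a request for a bare prefix such as the crawl "directory" is answered with an error instead of a listing.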