HTTPS access

José González-Brenes

Oct 18, 2016, 8:47:05 PM

to Common Crawl

Hello there,

Thank you for the fabulous project! Unfortunately, I'm getting stuck.

I mounted the commoncrawl bucket in Spark, but the server stalls when I try to open a file. So I tried retrieving the files over HTTPS instead, but I only get 404 errors :( For example:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-40

The response says that the key doesn't exist. I thought I could download the data over HTTPS?

Thanks,

JPG

José González-Brenes

Oct 18, 2016, 8:48:15 PM

to Common Crawl

PS: I also tried: https://commoncrawl.s3.amazonaws.com/CC-MAIN-2016-40 with no better luck :(

Sebastian Nagel

Oct 19, 2016, 3:42:37 AM

to common...@googlegroups.com

Hi José,

Unfortunately, the S3 web/HTTP endpoints do not list objects for a given prefix or "directory", so a URL that names only a prefix is reported as a missing key.
Either use an AWS S3 library (e.g., boto for Python) or the provided list of WARC files:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-40/warc.paths.gz
See also:
http://commoncrawl.org/the-data/get-started/
http://commoncrawl.org/2016/10/september-2016-crawl-archive-now-available/

Best,
Sebastian
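
For readers following along, here is a minimal Python sketch of the second option from the reply above: download the warc.paths.gz listing over HTTPS and turn each entry into a full URL. It assumes the listing holds one object key per line, relative to the bucket root, and that the URLs quoted in this thread still resolve. The function names parse_paths and fetch_warc_urls are made up for illustration and are not part of any Common Crawl tooling.

```python
import gzip
import urllib.request

# URLs as quoted in the thread; assumed to be still reachable.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"
PATHS_URL = BASE_URL + "crawl-data/CC-MAIN-2016-40/warc.paths.gz"


def parse_paths(gz_bytes, base_url=BASE_URL):
    """Turn a gzipped, newline-separated list of object keys into full URLs.

    Assumes each non-empty line is a key relative to the bucket root.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [base_url + line.strip() for line in text.splitlines() if line.strip()]


def fetch_warc_urls(paths_url=PATHS_URL):
    """Download the path listing over HTTPS and return one URL per WARC file."""
    with urllib.request.urlopen(paths_url) as response:
        return parse_paths(response.read())
```

Calling fetch_warc_urls() should return a list of URLs that can each be downloaded directly, since every entry names a complete object key rather than a prefix.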
