PRE WARC Indexes (2008-2012)

Sergey Ivanov

Apr 8, 2021, 3:13:17 AM
to common...@googlegroups.com
Hi All,

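I tried to download the indexes for the pre-WARC crawls (2008-2012), but Amazon returns a 404 error for the requested paths:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2012/cc-index.paths.gz
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2009-2010/cc-index.paths.gz
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2008-2009/cc-index.paths.gz

What is the correct path to these indexes? Thank you.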

Sebastian Nagel

Apr 8, 2021, 3:37:20 AM
to common...@googlegroups.com
Hi Sergey,

Good point. Sorry, these listings are not yet available. To list all CDX index files yourself,
install the AWS CLI (https://aws.amazon.com/cli/) and run (shown here for the 2012 index):

aws --no-sign-request s3 ls --recursive s3://commoncrawl/cc-index/collections/CC-MAIN-2012/
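
If you save the output of that command to a file, a small script can pull out just the cdx-*.gz keys. This is only a sketch: it assumes the usual four-column output of `aws s3 ls --recursive` (date, time, size, key), and the function name and the sample lines are made up for illustration.

```python
from typing import Iterable, List


def cdx_keys_from_listing(lines: Iterable[str]) -> List[str]:
    """Extract the keys of cdx-*.gz files from `aws s3 ls --recursive` output."""
    keys = []
    for line in lines:
        # Object lines have four fields: date, time, size, key.
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue  # skip blank or unexpected lines
        key = parts[3].strip()
        name = key.rsplit("/", 1)[-1]
        if name.startswith("cdx-") and name.endswith(".gz"):
            keys.append(key)
    return keys


# Invented sample lines in the shape the AWS CLI prints (dates, sizes and
# file names are placeholders, not real listing output).
sample = [
    "2021-04-08 09:00:00   12345678 cc-index/collections/CC-MAIN-2012/indexes/cdx-00000.gz",
    "2021-04-08 09:00:01       2048 cc-index/collections/CC-MAIN-2012/indexes/cluster.idx",
]
print(cdx_keys_from_listing(sample))
```

The keys it returns are relative to the bucket root, so each one can be prefixed with s3://commoncrawl/ to get a full object location.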

The problem is that the old crawls (2008 - 2012) have a different location (path prefix) on the
bucket s3://commoncrawl/:
- crawl-001/ : 2008 - 2009
- crawl-002/ : 2009 - 2010
- parse-output/ : 2012
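
For scripting, the prefixes above could be kept in a small lookup table. This is a sketch only: pairing each prefix with the crawl IDs CC-MAIN-2008-2009, CC-MAIN-2009-2010 and CC-MAIN-2012 (the IDs used in the requested URLs) is an assumption based on the year ranges, and `legacy_prefix` is a hypothetical helper name.

```python
from typing import Optional

# Assumed pairing of crawl ID and legacy path prefix on s3://commoncrawl/,
# inferred from the year ranges in the list above.
LEGACY_PREFIXES = {
    "CC-MAIN-2008-2009": "crawl-001/",
    "CC-MAIN-2009-2010": "crawl-002/",
    "CC-MAIN-2012": "parse-output/",
}


def legacy_prefix(crawl_id: str) -> Optional[str]:
    """Return the legacy prefix for an old crawl, or None for any other ID."""
    return LEGACY_PREFIXES.get(crawl_id)


print(legacy_prefix("CC-MAIN-2012"))
```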

I'll prepare the missing listings over the next few days.

Thanks,
Sebastian


On 4/8/21 9:13 AM, Sergey Ivanov wrote:
> Hi All,
>
> I tried to download the indexes for the pre-WARC crawls (2008-2012), but Amazon returns a 404 error for the requested paths:
>
> https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2012/cc-index.paths.gz
> https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2009-2010/cc-index.paths.gz
> https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2008-2009/cc-index.paths.gz
>
> What is the correct path to these indexes? Thank you.

Sergey Ivanov

Apr 8, 2021, 3:57:48 AM
to Common Crawl
Thank you for the reply.

I'm not in a hurry, so I'll wait until the indexes are ready.

On Thursday, April 8, 2021 at 13:37:20 UTC+6, Sebastian Nagel wrote:

Sebastian Nagel

Apr 19, 2021, 6:20:10 AM
to common...@googlegroups.com
Hi Sergey,

Done. You'll find the path listings for the cdx-*.gz files here:

- s3://commoncrawl/crawl-data/CC-MAIN-2012/cc-index.paths.gz
  (or over HTTPS: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2012/cc-index.paths.gz)
- s3://commoncrawl/crawl-data/CC-MAIN-2009-2010/cc-index.paths.gz
- s3://commoncrawl/crawl-data/CC-MAIN-2008-2009/cc-index.paths.gz
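
A listing like these can also be read from Python using only the standard library. The sketch below rewrites the s3:// location to the HTTPS form shown above and unpacks the gzipped listing. It assumes the file is plain text with one path per line and that the HTTPS endpoint still allows anonymous reads; all function names are made up for illustration.

```python
import gzip
import urllib.request
from typing import List

HTTPS_ENDPOINT = "https://commoncrawl.s3.amazonaws.com/"


def s3_to_https(uri: str) -> str:
    """Rewrite an s3://commoncrawl/... URI to its HTTPS equivalent."""
    prefix = "s3://commoncrawl/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a commoncrawl S3 URI: {uri}")
    return HTTPS_ENDPOINT + uri[len(prefix):]


def parse_paths_listing(gz_bytes: bytes) -> List[str]:
    """Decompress a *.paths.gz payload and return its non-empty lines."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_index_paths(crawl_id: str) -> List[str]:
    """Download and parse cc-index.paths.gz for one crawl (needs network access)."""
    url = s3_to_https(f"s3://commoncrawl/crawl-data/{crawl_id}/cc-index.paths.gz")
    with urllib.request.urlopen(url) as response:
        return parse_paths_listing(response.read())
```

Calling `fetch_index_paths("CC-MAIN-2012")` would then return the listed paths as a list of strings, provided the download succeeds.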

Best,
Sebastian


