Notification of new crawl archives


catheri...@hotmail.com

Jun 11, 2018, 8:26:11 AM
to Common Crawl
Can anyone suggest a way to automatically detect when a new crawl archive becomes available? My only thoughts at the moment are to parse each new blog or group entry for the title "Month Year Crawl Archive Now Available", or to check the bucket listing at https://commoncrawl.s3.amazonaws.com/ for new results. Any other ideas gratefully received.

Thanks,
Cat

Sebastian Nagel

Jun 11, 2018, 9:47:06 AM
to common...@googlegroups.com
Hi Cat,

The URL index server lists all available collections.
There is also a hook to get a JSON list of collections:
https://index.commoncrawl.org/collinfo.json
See also:
https://groups.google.com/forum/#!msg/common-crawl/o_MuZViu0O0/HvUV82avAQAJ
https://groups.google.com/forum/#!topic/common-crawl/3QmQjFA_3y4
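
A minimal sketch of polling that JSON list for new collections might look like the following. It assumes each entry in collinfo.json is an object with an "id" field (for example "CC-MAIN-2018-22"); the function names and the idea of keeping a set of already seen ids are illustrative only, not part of any Common Crawl tooling.

```python
import json
import urllib.request

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def fetch_collections(url=COLLINFO_URL, timeout=30):
    """Download and parse the JSON list of collections."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def new_collection_ids(collections, seen_ids):
    """Return the ids in `collections` that are not in `seen_ids`.

    Assumes every entry is a dict with an "id" key; the order of the
    input list is preserved.
    """
    return [c["id"] for c in collections if c["id"] not in seen_ids]
```

A periodic job could call fetch_collections(), pass the result to new_collection_ids() together with the ids stored from the previous run, and send a notification whenever the returned list is non-empty.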

Please note that the file lists (warc.paths.gz), WAT and WET files may not yet be ready
when the URL index is updated. If you need these, you may also check
https://commoncrawl.s3.amazonaws.com/crawl-data/index.html

Checking the subfolders of crawl-data/ on S3 with
  aws s3 ls s3://commoncrawl/crawl-data/
is quite efficient but not ideal, because WARC files are uploaded as soon as
content has been crawled, so a new subfolder appears before the crawl is finished.
It may take up to 14 days until a monthly crawl, including WAT/WET files, is completed.
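
One way to guard against picking up a crawl too early is to test whether its file lists exist yet. The sketch below rests on assumptions that should be verified before relying on it: that the lists live at crawl-data/<crawl-id>/warc.paths.gz, wat.paths.gz and wet.paths.gz under the bucket URL mentioned above, and that the presence of all three is a reasonable signal that the crawl is complete. All function names are hypothetical.

```python
import urllib.error
import urllib.request

BASE_URL = "https://commoncrawl.s3.amazonaws.com/"
# Assumed names of the per-crawl file lists.
PATH_LISTS = ("warc.paths.gz", "wat.paths.gz", "wet.paths.gz")


def path_list_urls(crawl_id, base_url=BASE_URL):
    """Build the assumed URLs of the file lists for one crawl."""
    return [f"{base_url}crawl-data/{crawl_id}/{name}" for name in PATH_LISTS]


def url_exists(url, timeout=30):
    """Return True if a HEAD request for `url` answers with HTTP 200."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # 403/404 and similar: treat the list as not published yet.
        return False


def crawl_is_complete(crawl_id, exists=url_exists):
    """True only if every assumed file list is reachable."""
    return all(exists(url) for url in path_list_urls(crawl_id))
```

The `exists` parameter is there so the check can be exercised without network access, for example crawl_is_complete("CC-MAIN-2018-22", exists=lambda url: True).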

Best,
Sebastian

catheri...@hotmail.com

Jun 11, 2018, 10:17:13 AM
to Common Crawl
Many thanks, Sebastian.