to Common Crawl
Is there a way to automatically detect when a new crawl archive becomes available? My only ideas so far are to watch new blog or group posts for the title "Month Year Crawl Archive Now Available", or to check the bucket listing at https://commoncrawl.s3.amazonaws.com/ for new entries. Any other ideas gratefully received.
Thanks,
Cat
Sebastian Nagel
Jun 11, 2018, 9:47:06 AM
Checking the subfolders of crawl-data/ on S3
aws s3 ls s3://commoncrawl/crawl-data/
although quite efficient, isn't ideal because WARC files are uploaded immediately
after content has been crawled, so a new subfolder appears before the crawl is
finished. It may take up to 14 days until a monthly crawl, including its WAT/WET
files, is completed.
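
A minimal sketch of how the subfolder check could be automated, in Python. It assumes the AWS CLI is installed and configured, that `aws s3 ls` prints each subfolder as a `PRE <name>/` line, and that crawl subfolders are named like `CC-MAIN-2018-22`. The function names and the sample listing are hypothetical, chosen only for illustration. Note that it only reports new subfolders; per the caveat above, a new subfolder does not mean the crawl is complete.

```python
import re
import subprocess

# Assumption: crawl subfolders follow the pattern "CC-MAIN-<year>-<week>".
CRAWL_ID = re.compile(r"^CC-MAIN-\d{4}-\d{2}$")


def parse_crawl_ids(listing):
    """Extract crawl ids from `aws s3 ls` output ('PRE <name>/' lines)."""
    ids = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "PRE":
            name = parts[1].rstrip("/")
            if CRAWL_ID.match(name):
                ids.append(name)
    return sorted(ids)


def new_crawls(listing, seen):
    """Return crawl ids in the listing that are not in the `seen` set."""
    return [crawl for crawl in parse_crawl_ids(listing) if crawl not in seen]


def fetch_listing():
    """Run the command from the reply above and return its raw output."""
    result = subprocess.run(
        ["aws", "s3", "ls", "s3://commoncrawl/crawl-data/"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative sample only, not real command output.
sample = """\
                           PRE CC-MAIN-2018-17/
                           PRE CC-MAIN-2018-22/
                           PRE CC-MAIN-2018-26/
"""
fresh = new_crawls(sample, seen={"CC-MAIN-2018-17", "CC-MAIN-2018-22"})
print(fresh)
```

In practice one would call `new_crawls(fetch_listing(), seen)` on a schedule and persist `seen` between runs. Given the upload delay described above, a newly reported id is probably best treated as "crawl started" rather than "crawl available".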