Are index files of older crawls changing?


brano199

Sep 3, 2017, 2:44:34 PM
to Common Crawl
Hello, I downloaded the index files for CC-MAIN-2017-26 a while ago.

However, I can say that one file is missing: it has not been uploaded to S3.


Thank you for your help.

brano199

Sep 3, 2017, 4:09:48 PM
to Common Crawl
I have written a simple script that checks whether any other files are missing. It checks whether the files listed in warc_paths are numbered consecutively within each dataset.

It seems that only this file
crawl-data/CC-MAIN-2017-26/segments/1498128320063.74/warc/CC-MAIN-20170623133357-20170623153357-00226.warc.gz
doesn't have a preceding file, which is the one I mentioned above.
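
A check along these lines could be sketched as follows. This is not the script from the post, only a minimal illustration. It assumes every path follows the layout of the example above (`crawl-data/<crawl-id>/segments/<segment>/warc/<name>-<5-digit serial>.warc.gz`) and that serial numbers run consecutively across a whole crawl. The names `PATH_RE` and `find_gaps` are made up for this example.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the single example path in this thread:
# crawl-data/<crawl-id>/segments/<segment>/warc/<name>-<5-digit serial>.warc.gz
PATH_RE = re.compile(
    r"^crawl-data/(?P<crawl>[^/]+)/segments/[^/]+/warc/.*-(?P<serial>\d{5})\.warc\.gz$"
)


def find_gaps(paths):
    """Map each crawl ID to the serial numbers missing between its min and max."""
    serials = defaultdict(set)
    for line in paths:
        match = PATH_RE.match(line.strip())
        if match:  # lines that do not fit the assumed layout are skipped
            serials[match.group("crawl")].add(int(match.group("serial")))

    gaps = {}
    for crawl, seen in serials.items():
        expected = set(range(min(seen), max(seen) + 1))
        missing = sorted(expected - seen)
        if missing:
            gaps[crawl] = missing
    return gaps
```

Calling `find_gaps(open("warc.paths"))` on a local copy of the path listing would return a dictionary of gaps per crawl. Note that a check like this cannot notice a file missing at the very start or end of the numbering.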

Sebastian Nagel

Sep 4, 2017, 1:14:55 PM
to common...@googlegroups.com

Hi,

Thanks, you got me. This single WARC file was deleted by accident after the index had been created; see the announcement of the June crawl on this list. Normally, all data is write-once, except in case of errors, where some cleanup might be necessary.

Thanks,
Sebastian

