No. There is some overlap between the crawls but a previous crawl is never a subset of a newer one:
- pages disappear
- get excluded by robots.txt
- are detected as duplicates and (re)fetched less often
- low-ranking pages are not scheduled for refetch every month
You'll find the numbers here:
https://commoncrawl.github.io/cc-crawl-statistics/
(overlap, size, number of new URLs/pages, etc.)
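
If you want to check the relationship between two crawls yourself, here is a minimal sketch of the set arithmetic involved. It is only an illustration: the function name compare_crawls and the example URLs are made up, and for real crawls you would have to build the URL lists yourself (for example from the URL indexes), which this sketch does not do.

```python
def compare_crawls(older_urls, newer_urls):
    """Compare two collections of URLs (hypothetical helper, not a Common Crawl API)."""
    older, newer = set(older_urls), set(newer_urls)
    shared = older & newer   # URLs present in both crawls
    union = older | newer    # URLs present in at least one crawl
    return {
        "shared": len(shared),
        "dropped": len(older - newer),   # in the older crawl only
        "new": len(newer - older),       # in the newer crawl only
        "jaccard": len(shared) / len(union) if union else 0.0,
        "older_is_subset": older <= newer,
    }


# Made-up URL lists standing in for two monthly crawls.
older = ["http://a.example/", "http://b.example/", "http://c.example/"]
newer = ["http://a.example/", "http://c.example/",
         "http://d.example/", "http://e.example/"]

stats = compare_crawls(older, newer)
print(stats)
```

In this toy example one URL was dropped, so the older list is not a subset of the newer one even though the two overlap.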
Best,
Sebastian
On 07/21/2017 12:39 PM, brano199 wrote:
> Hello,
>
> I have a question: does, let's say, CC-MAIN-2017-26 contain all data from all previous crawls (I mean all
> URLs that were there before, with refreshed timestamps) plus the new URLs contained in this crawl?
>
> Thank you for the answer.
>