Hi Hadar,
that's an interesting question. For now, the rough answer is: yes, there is a large overlap
because the seed list hasn't changed over the last crawls. Of course, there are
also smaller differences:
- documents (temporarily) disappear
- fetch lists are shuffled when crawling and cut off after some time
(differences are expected to be bigger for hosts with long fetch lists
or slowly responding hosts)
I'm not aware of any comprehensive analysis (though I may have missed one).
I hope to have exact numbers available next week - overlaps (URLs and domains)
and also other metrics, at least for the last 5 crawls. The analysis will be done
on the Common Crawl index. If you want to know the overlap for only a few domains,
you can get the URL lists from http://index.commoncrawl.org/ and calculate
the overlaps yourself.
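
Once you have the URL lists for two crawls, the overlap calculation itself could look roughly like the sketch below. It assumes the URLs for one domain have already been exported from the index into two plain Python lists; the function name url_overlap and the choice of Jaccard similarity as the overlap measure are only for illustration, not what the upcoming analysis will necessarily use.

```python
def url_overlap(urls_a, urls_b):
    """Compare two URL lists (e.g. one domain in two different crawls).

    Hypothetical helper: URLs are compared as exact strings, with no
    normalization (scheme, trailing slash, etc.) applied.
    """
    a, b = set(urls_a), set(urls_b)  # drop duplicates within each crawl
    common = a & b                   # URLs present in both crawls
    union = a | b                    # URLs present in at least one crawl
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "common": len(common),
        # Jaccard similarity: shared URLs divided by all distinct URLs
        "jaccard": len(common) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    crawl_1 = ["http://example.com/", "http://example.com/a"]
    crawl_2 = ["http://example.com/a", "http://example.com/b"]
    print(url_overlap(crawl_1, crawl_2))
```

The same function works for domain overlap if you pass lists of host or domain names instead of full URLs.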
Regards,
Sebastian