> I wondered why the implication from these notices, that around 60% of
> the URLs _have_ been crawled before, doesn't contradict the
> "URL overlap between crawls" plot. Is it because you're reporting the
> overlap with the _union_ of all previously crawled URLs, right?
Yes, it means URLs which have never before been captured by Common Crawl.
Of course, that's no guarantee that the content behind such a URL is new:
a duplicate of the content may already have been captured under a different URL.
> Have you done a similar union-based calculation on the content digests?
Yes, see the points/lines of "digest estim." in plots "Crawl Size
Cumulative" and "New Items per Crawl" on
But for the SHA-1 digests of the binary payload (WARC-Payload-Digest)
there is little overlap with previous crawls.
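Just to illustrate what is being compared: a WARC-Payload-Digest value is
a Base32-encoded SHA-1 hash of the raw payload bytes. A minimal sketch
(not the crawler's actual code) for computing one:

```python
import base64
import hashlib

def warc_payload_digest(payload: bytes) -> str:
    """Return the digest in the form used by WARC-Payload-Digest headers:
    'sha1:' followed by the Base32-encoded SHA-1 of the raw payload bytes."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

# Two byte-identical payloads yield the same digest, regardless of their URLs.
print(warc_payload_digest(b"<html><body>Hello</body></html>"))
```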
Also note that the overlap and unique-item numbers across multiple crawls
are calculated via HyperLogLog cardinality estimates with a 1% error rate.
Only the number of unique URLs within a single crawl is an exact count,
not an estimate.
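As a rough illustration of the union-based estimation, here is a sketch
using the third-party datasketch library (this is not the actual code
behind those plots, and the URLs are placeholders):

```python
from datasketch import HyperLogLog  # pip install datasketch

def url_sketch(urls, p=14):
    """Build a HyperLogLog sketch over URLs (p=14 gives roughly 1% error)."""
    hll = HyperLogLog(p=p)
    for url in urls:
        hll.update(url.encode("utf-8"))
    return hll

current = url_sketch(["http://example.com/a", "http://example.com/b"])
previous_union = url_sketch(["http://example.com/b", "http://example.com/c"])

# Union of the two sketches; overlap via inclusion-exclusion.
union = HyperLogLog(p=14)
union.merge(current)
union.merge(previous_union)
overlap = current.count() + previous_union.count() - union.count()
new_urls = current.count() - overlap

print(f"estimated URLs seen before: {overlap:.0f}")
print(f"estimated new URLs:         {new_urls:.0f}")
```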
> And, just wondering about a curious artefact revealed by the URL
> similarity plot: why is the peak similarity for almost every crawl N
> with crawl N-2?
Yes, I'm aware of this. The decision about when a page is refetched is
made as follows:
- start with the page score (derived from hyperlink centrality)
- for every day elapsed since the page was last fetched,
  increment the score
- when a crawl is started and the score has reached a certain threshold,
  the URL is scheduled for refetch
The page scores follow an exponential distribution, and most scores
are close to zero. Many of the low-score pages therefore reach the threshold
at the same time, and in most cases this happens only after the following
crawl (see the sketch below).
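A minimal sketch of that effect; the per-day increment, threshold and
crawl interval below are made-up values for illustration, not the real
configuration:

```python
REFETCH_THRESHOLD = 2.0     # assumed score threshold
DAILY_INCREMENT = 0.04      # assumed per-day score increment
CRAWL_INTERVAL_DAYS = 30    # assumed spacing between crawls

def is_due_for_refetch(page_score: float, days_since_last_fetch: int) -> bool:
    """A page is refetched once its score plus the elapsed-days bonus
    reaches the threshold at the start of a crawl."""
    return page_score + DAILY_INCREMENT * days_since_last_fetch >= REFETCH_THRESHOLD

# Because most page scores are close to zero, nearly all of them need about
# the same number of elapsed days to cross the threshold -- here they become
# due two crawls later, which shows up as the N vs. N-2 similarity peak.
for score in (0.01, 0.05, 1.9):
    crawls_later = next(
        n for n in range(1, 10)
        if is_due_for_refetch(score, n * CRAWL_INTERVAL_DAYS)
    )
    print(f"page score {score}: refetched {crawls_later} crawl(s) later")
```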
I remember the discussion about the refetch scheduling; back then we decided
not to introduce any random factor here.