"URLs ... not visited in any .. prior crawls"


Henry S. Thompson

Jul 27, 2022, 8:53:56 AM
to common...@googlegroups.com
These words have appeared in release notices since, I think, September
2020. Thinking about this, and looking again at the crawl statistics
page [1], I wondered why the implication of these notices (that around
60% of the URLs _have_ been crawled before) doesn't contradict the "URL
overlap between crawls" plot. Presumably that's because you're
reporting overlap with the _union_ of all previously crawled URLs,
right?

Have you done a similar union-based calculation on the content
digests?

And, just wondering about a curious artefact revealed by the URL
similarity plot: why is the peak similarity for almost every crawl N
with crawl N-2?

Thanks,

ht

[1] https://commoncrawl.github.io/cc-crawl-statistics/plots/crawloverlap
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Sebastian Nagel

Jul 28, 2022, 8:45:36 AM
to common...@googlegroups.com
Hi Henry,

> I wondered why the implication of these notices (that around 60% of
> the URLs _have_ been crawled before) doesn't contradict the "URL
> overlap between crawls" plot. Presumably that's because you're
> reporting overlap with the _union_ of all previously crawled URLs,
> right?

Yes, it means URLs which were never before captured by Common Crawl.
Of course, that's no guarantee that the content behind such a URL is
not a duplicate of content already captured under a different URL.
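
For illustration, here is a minimal set-based sketch of what "new URLs"
means under these union semantics (the function and data are
hypothetical, and the real statistics are computed on HyperLogLog
sketches rather than exact sets):

    # A URL counts as new only if it was captured in NO earlier crawl,
    # not just in the immediately preceding one. Names are illustrative.
    def new_urls(current_crawl, prior_crawls):
        seen_before = set().union(*prior_crawls) if prior_crawls else set()
        return current_crawl - seen_before

    # Example: a URL seen two crawls ago still does not count as new.
    prior = [{"http://a/", "http://b/"}, {"http://c/"}]
    print(new_urls({"http://a/", "http://d/"}, prior))  # {'http://d/'}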

> Have you done a similar union-based calculation on the content
> digests?

Yes, see the points/lines of "digest estim." in the plots "Crawl Size
Cumulative" and "New Items per Crawl" on
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize
But for the SHA-1 digests of the binary payload (WARC-Payload-Digest)
there is little overlap with previous crawls.

Also note that the overlap and unique-item numbers across multiple
crawls are calculated via HyperLogLog cardinality estimates with a 1%
error rate. Only the number of unique URLs within a single crawl is an
exact count, not an estimate.
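
For illustration, such a union-based overlap estimate can be computed
roughly like this; the `datasketch` package is my choice for the
sketch, not necessarily what the statistics code actually uses:

    # Union-based overlap estimation with HyperLogLog (illustrative).
    from datasketch import HyperLogLog

    def hll_of(urls, p=14):  # p=14 gives roughly 1% relative error
        hll = HyperLogLog(p=p)
        for url in urls:
            hll.update(url.encode("utf-8"))
        return hll

    crawl_a = hll_of("https://example.com/%d" % i for i in range(100000))
    crawl_b = hll_of("https://example.com/%d" % i
                     for i in range(50000, 150000))

    # The union is taken directly on the merged registers; the overlap
    # then follows from inclusion-exclusion, so its error compounds.
    union = HyperLogLog(p=14)
    union.merge(crawl_a)
    union.merge(crawl_b)
    overlap = crawl_a.count() + crawl_b.count() - union.count()
    print("estimated overlap: %.0f (true value: 50000)" % overlap)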

> And, just wondering about a curious artefact revealed by the URL
> similarity plot: why is the peak similarity for almost every crawl N
> with crawl N-2?

Yes, I know about this. The decision when a page is refetched is made
as follows:
- start with the page score (derived from hyperlink centrality)
- for every day elapsed since the page was last visited, increment
  the score
- when a crawl is started and the score has reached a certain
  threshold, the URL is scheduled for refetch
The page scores follow an exponential distribution, so most scores are
close to zero. Many of the low-score pages therefore reach the
threshold at the same time, and in most cases that is not the next
crawl but the one after it (see the sketch below).
I remember the discussion about the refetch scheduling; we decided
not to introduce any random factor here.
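
As a toy model with assumed numbers (the threshold and crawl cadence
below are illustrative, not the crawler's actual settings), the
two-crawl period falls out like this:

    THRESHOLD = 45.0          # assumed refetch threshold
    DAYS_BETWEEN_CRAWLS = 30  # assumed, roughly monthly crawls

    def due_for_refetch(page_score, days_since_last_fetch):
        # the score grows by one per elapsed day; refetch once the
        # incremented score reaches the threshold
        return page_score + days_since_last_fetch >= THRESHOLD

    # A page with score ~0 last fetched in crawl N:
    #   crawl N+1 (day 30): 0 + 30 < 45   -> skipped
    #   crawl N+2 (day 60): 0 + 60 >= 45  -> refetched
    # Most scores are near zero, so these pages cross the threshold in
    # lockstep two crawls after their last fetch, which produces the
    # peak similarity of crawl N with crawl N-2.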

Best,
Sebastian