Which percentage of the web is contained in common crawl?


Michael Behrendt

Jan 8, 2022, 7:01:55 AM
to Common Crawl
Is there a way to express a rough percentage (or a similar number) of how much of the 'entire' web the Common Crawl content covers? I think using total text content as the metric should be fine (i.e. leaving the size of images and videos out of scope)?

Henry S. Thompson

Jan 10, 2022, 5:17:46 AM
to common...@googlegroups.com
[with Academic Pedant hat on]

Actually, this question has no answer, because the Web is, and has
been for something like 20 years, unbounded: that is, there _is_ no
'entire' Web. A significant number of the pages you see are being
created on the fly, by servers, customised by parameters in the
requesting URI, cookie-based material, advertisements, the date and
time of day, the client IP address, ...

It's not even clear that it makes sense to ask how many
'retrieval-enabled' URIs there are, for similar reasons.

The amount of text content in all the type-200 HTTP response messages
sent during the same period as a particular Common Crawl sample is, of
necessity, finite. But there's no practical way to measure that, as
far as I can see.

I and an MSc student have done some work on trying to come up with
ways of quantifying the extent to which, in what respects, Common
Crawl can be taken to be _representative_ of the Web as a whole, which
may be what you are actually interested in. If so, let me know.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Michael Behrendt

Jan 10, 2022, 5:29:48 AM
to Common Crawl
thanks a lot for the feedback -- some really good points.

I think it would be very interesting to better understand the results of your studies; I'd very much appreciate it if you could share more details.

In addition -- maybe a different way of thinking about my question is: are there big known gaps in what is being downloaded (gaps we're consciously accepting)? A trivial example: how much of github.com, youtube.com, amazon.com, etc. is left out? What share of all registered domains is left out?

In other words -- maybe even more basic -- is there a way for us to express the order of magnitude of the gap? E.g. is the gap 0.01%, 1%, 99%? I can imagine that would already go a long way (and I'd guess you used some number like this for the studies you mentioned?).

Greg Lindahl

Jan 10, 2022, 11:48:19 AM
to common...@googlegroups.com
On Mon, Jan 10, 2022 at 02:29:48AM -0800, Michael Behrendt wrote:

> In addition -- maybe a different way of thinking about my question is --
> are there big known gaps in what is being downloaded (and we're consciously
> accepting these gaps) -- e.g. some trivial example would be -- how much of
> github.com, youtube.com, amazon.com, etc. is left out, which number of all
> registered domains are being left out, etc.

I'm with Henry -- I've been telling people for over a decade that
the web is infinite. And at one point I was supposed to figure out a
scheme to measure how good the blekko search engine crawler's choices
were -- without any success.

The suggestions you're making are good ones, but they get fuzzy as you
go deeper. Does it matter if an unpopular YouTube video is in the
crawl? Does it matter if an out-of-stock Amazon product page is in the crawl?
Does it matter if low-value parked domains are in the crawl?

Sebastian Nagel has tackled this problem with his choices for Common
Crawl -- for example, he crawls a percentage of the links found in
sitemaps, but only for "important" domains. This involves making some
choices, and it would be great if we had independent measurements of
these choices.

So to go back to my list: is there a way to quantify YouTube video
popularity, and measure that in the crawl? Is there a way to measure
Amazon product page quality? Is there a way to score the value of
registered domains, and see what % are crawled at all and what % of
their sitemaps are used?

This last one is already addressed by Sebastian's computation of host
and domain rank by harmonic centrality and PageRank. This is exactly
the kind of feedback that search engine crawlers use. But as in my
blekko example, it would be nice to have an independent measure of
whether this algorithm is working well or not.
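(For readers unfamiliar with the metric: harmonic centrality scores a page by summing the reciprocal shortest-path distances from every page that can reach it by following links. A toy sketch in Python -- illustrative only, not Common Crawl's actual implementation, which runs over the full webgraph:)

```python
from collections import deque

def harmonic_centrality(graph):
    """Harmonic centrality on a directed link graph {page: [linked pages]}.

    Score of v = sum of 1/d(u, v) over all pages u that can reach v,
    where d is the shortest-path distance along links.
    """
    # Collect all nodes and build the reverse graph, so a BFS from v
    # follows links pointing *into* v.
    nodes = set(graph) | {t for outs in graph.values() for t in outs}
    rev = {n: [] for n in nodes}
    for u, outs in graph.items():
        for t in outs:
            rev[t].append(u)

    scores = {}
    for v in nodes:
        dist = {v: 0}
        queue = deque([v])
        while queue:
            x = queue.popleft()
            for u in rev[x]:
                if u not in dist:
                    dist[u] = dist[x] + 1
                    queue.append(u)
        scores[v] = sum(1.0 / d for u, d in dist.items() if u != v)
    return scores

# Tiny link graph: a -> b, a -> c, b -> c, c -> a.
# c is linked from both a and b at distance 1, so it scores highest (2.0).
print(harmonic_centrality({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```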

-- greg


Sebastian Nagel

Jan 10, 2022, 4:44:58 PM
to common...@googlegroups.com
Hi Michael, Henry, Greg,

thanks for the great discussion and for the question, of course!


> Does it matter if an unpopular YouTube video is in the crawl?
> Does it matter if an out-of-stock Amazon product page is in the crawl?
> Does it matter if low-value parked domains are in the crawl?

I remember a discussion with Gil Elbaz years ago about whether it makes
sense at all to crawl a few hundred thousand pages (see [1] for the
numbers) from YouTube: it's a ridiculously small fraction of the site's
content, and we only archive the HTML page (i.e. some metadata, a little
text and maybe a few hyperlinks) but not the video, which is core to
YouTube. We dropped the discussion because if you continue, you'll find
reasons not to include many sites. Also: there are definitely more
serious issues affecting the relevancy of the content, for example
(near-)duplicates.


> and what % of their sitemaps are used?

Good idea! I'll have a look, as the number of URLs listed in the
sitemaps is known, at least where there are not too many sitemaps for a
site. Using sitemaps would avoid most of the "duplicate noise" you get
when using links/URLs (optional query parameters, etc.).

Unfortunately, using only sitemaps for crawling is not an option,
because not all sites provide sitemaps and there's never a guarantee
that the sitemaps are regularly updated. Via sitemaps the crawler gets
URLs from about 12 million domains, compared to 30+ million crawled
domains.
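(The per-site coverage metric discussed above could be sketched like this -- a hypothetical helper, where `sitemap_urls` and `crawled_urls` stand in for URL sets extracted from a site's sitemaps and from the crawl index:)

```python
def sitemap_coverage(sitemap_urls, crawled_urls):
    """Fraction of a site's sitemap-listed URLs that appear in the crawl.

    Returns None when the site publishes no sitemap, since the metric is
    undefined there (as noted above, only ~12M of 30+M crawled domains
    provide one).
    """
    listed = set(sitemap_urls)
    if not listed:
        return None
    return len(listed & set(crawled_urls)) / len(listed)

# 2 of the 4 sitemap URLs were actually fetched -> coverage 0.5
print(sitemap_coverage(
    ["https://example.org/a", "https://example.org/b",
     "https://example.org/c", "https://example.org/d"],
    ["https://example.org/a", "https://example.org/c",
     "https://example.org/z"]))
```

Using sets also sidesteps duplicate listings of the same URL across several sitemap files.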


> But as in my blekko example, it would be nice to have an independent
> measure of whether this algorithm was working well or not.

For sure!

There are other publicly available site rankings, and I've tried to
compare them with CC's harmonic centrality ranks in [2]. But I feel
unable to say whether our HC ranks are better or worse than the other
site rankings. It doesn't really matter, because the others share only
their top million, and we need the metrics for many more sites/domains
to "steer" the crawler.
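(One simple way to compare two such rankings is Spearman's rank correlation -- a sketch assuming both rankings cover the same domains with integer ranks 1..n and no ties; the notebook in [2] does a more careful comparison:)

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation between two domain rankings.

    rank_a, rank_b map domain -> rank (1 = most central). Assumes both
    rankings cover the same domains with ranks 1..n and no ties.
    +1 means perfect agreement, -1 means the rankings are reversed.
    """
    assert set(rank_a) == set(rank_b), "rankings must cover the same domains"
    n = len(rank_a)
    d2 = sum((rank_a[d] - rank_b[d]) ** 2 for d in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

hc = {"a.com": 1, "b.com": 2, "c.com": 3, "d.com": 4}
other = {"a.com": 4, "b.com": 3, "c.com": 2, "d.com": 1}
print(spearman_rho(hc, hc))     # identical rankings -> 1.0
print(spearman_rho(hc, other))  # fully reversed -> -1.0
```

In practice the rankings only partially overlap (the top million vs. 30+ million domains), so the comparison is restricted to, and re-ranked within, the shared domains.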

> I and an MSc student have done some work ...

@Henry: I'm interested. Thanks!


Best,
Sebastian

[1] https://commoncrawl.github.io/cc-crawl-statistics/plots/domains.html
[2] https://github.com/commoncrawl/cc-notebooks/blob/master/cc-webgraph-statistics/comparison_domain_ranks.ipynb