CC as % of the web?

Tim Allison

Mar 13, 2025, 1:15:00 PM
to Common Crawl
I realize this is somewhat of a crazy question. My answer so far has been ¯\_(ツ)_/¯.

Is there any way to estimate the coverage of CC over the "open web"?

That the crawl added 1 billion pages in the Feb crawl (with a total of 2.6 billion) suggests the % captured is smallish (???).

Thank you.

Best,

         Tim

Henry S. Thompson

Mar 13, 2025, 2:34:51 PM
to common...@googlegroups.com
Tim Allison writes:

> I realize this is somewhat of a crazy question. My answer so far has been ¯\_(ツ)_/¯.
>
> Is there any way to estimate the coverage of CC over the "open web"?

0%, to quite a few decimal places :-).

Because the open web is, these days, unbounded in size. It's mostly
generated just-in-time in response to, and parameterised by,
properties of HTTP requests (cookies, country of origin, request
parameters). You could in principle ask how many HTTP responses there
are on the wires within any particular interval, indeed even the
interval that a crawl was being conducted, but I'm not aware of any
attempt to even estimate such a number.
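
To make the point about parameterised responses concrete, here is a minimal sketch of how the number of distinct responses for a single URL multiplies as request properties vary. The property values and the helper name `response_variants` are hypothetical, chosen only for illustration; real servers key their responses on far more than this.

```python
from itertools import product

# Hypothetical request properties a server might key its response on.
cookie_states = ["none", "logged_in", "ab_test_bucket_7"]
countries = ["US", "DE", "JP", "BR"]
query_params = ["", "?page=1", "?page=2", "?sort=price"]

def response_variants(url, cookies, origins, params):
    """Enumerate every (url + params, cookie, country) combination."""
    return [(url + p, c, o) for c, o, p in product(cookies, origins, params)]

variants = response_variants(
    "https://example.com/", cookie_states, countries, query_params
)
print(len(variants))  # 3 * 4 * 4 = 48 potential responses for one URL
```

Even with three tiny, invented lists, one URL yields 48 candidate responses, and each added property multiplies the count again.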

And even that would not be what you really want, because by design CC
is only looking at what we might call the 'headline' or 'landing page'
responses to the requests, which these days are mostly very different
from what you might think of as the resulting web page, which is
likely to be built by dozens if not hundreds of json-scripted further
requests.

So, apologies, but what seems like a simple question turns out to need
a lot more detail before you can even begin to get a useful answer.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
e-mail: h...@inf.ed.ac.uk
URL: https://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

Rich Skrenta

Mar 13, 2025, 2:50:29 PM
to common...@googlegroups.com
Great answer. I'm sure Sebastian could give a very good response. I'll give my high-level "exec" answer to the question.

The web is fractally infinite. CCbot tries to sort the frontier of discovered URLs and crawl a sample of them each month. This is trying to skim a sample of some useful cream off of the top of the frontier.

It is by no means comprehensive. We would like to crawl more - deeper, and with more coverage of under-represented languages and segments - but we are very conscious of wanting to maintain dataset quality. Naive attempts to crawl aggressively often easily fall into endless pools of unhelpful content.
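
As a rough illustration of sorting a frontier and crawling only a sample of it, here is a minimal sketch. It is not CCBot's actual selection logic: the scores, the `sample_frontier` helper, and the per-host cap used to avoid one endless site swallowing the budget are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit

def sample_frontier(scored_urls, budget, per_host_cap):
    """Pick up to `budget` URLs in descending score order,
    taking at most `per_host_cap` URLs from any single host."""
    taken = Counter()
    selected = []
    # Visit candidates from highest to lowest score.
    for score, url in sorted(scored_urls, reverse=True):
        host = urlsplit(url).hostname
        if taken[host] >= per_host_cap:
            continue  # this host has already used its share of the budget
        taken[host] += 1
        selected.append(url)
        if len(selected) == budget:
            break
    return selected

# Hypothetical frontier: (score, url) pairs with invented scores.
frontier = [
    (0.90, "https://a.example/"),
    (0.80, "https://trap.example/cal?d=1"),
    (0.79, "https://trap.example/cal?d=2"),
    (0.78, "https://trap.example/cal?d=3"),
    (0.50, "https://b.example/about"),
]
picked = sample_frontier(frontier, budget=3, per_host_cap=1)
print(picked)
```

Without the cap, the three highest-scoring URLs would include two near-identical calendar pages from one host; with it, the sample spreads across three hosts.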

Rich

Tim Allison

Mar 13, 2025, 3:07:01 PM
to Common Crawl
Hahahaha. Right. You'll notice that I described the question as "crazy". I was interested in exactly this kind of setting out of the dimensions along which it was crazy. Thank you.

Tim Allison

Mar 13, 2025, 3:10:16 PM
to Common Crawl
>Naive attempts to crawl aggressively often easily fall into endless pools of unhelpful content

Yes, yes indeed. Thank you Rich!