[with Academic Pedant hat on]
Actually, this question has no answer, because the Web is, and has
been for something like 20 years, unbounded: that is, there _is_ no
'entire' Web. A significant number of the pages you see are being
created on the fly by servers, customised by parameters in the
requesting URI, cookie-based material, advertisements, the date and
time of day, the client IP address, ...
It's not even clear that it makes sense to ask how many
'retrieval-enabled' URIs there are, for similar reasons.
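To make the 'on the fly' point concrete, here is a toy sketch (plain
Python standard library; the handler and names are mine, not any real
site's code) of the kind of server behaviour I mean: what comes back
for the 'same' URI depends on the query parameters, your cookies, the
time of day and your IP address, so there is no fixed set of pages
waiting to be counted.

# Toy illustration only: a handler whose response text depends on the
# request URI's query parameters, the Cookie header, the wall-clock
# time and the client's IP address.  Two requests for the "same" URL
# need not yield the same page, so "number of pages" is ill-defined.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
from datetime import datetime

class DynamicHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)   # URI parameters
        cookie = self.headers.get('Cookie', 'none')    # cookie-based material
        body = (f"Hello {params.get('user', ['anonymous'])[0]}\n"
                f"Your cookies: {cookie}\n"
                f"Generated at: {datetime.now().isoformat()}\n"
                f"For client: {self.client_address[0]}\n")
        data = body.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == '__main__':
    HTTPServer(('localhost', 8000), DynamicHandler).serve_forever()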
The amount of text content in all the HTTP 200 (OK) responses
sent during the same period as a particular Common Crawl sample is, of
necessity, finite. But there's no practical way to measure that, as
far as I can see.
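To be precise about what that finite quantity is not, here is a rough
sketch (using the warcio library; the filename is a placeholder) of how
you could total the text bytes in the 200-status responses that a
single Common Crawl WARC file happens to contain. That measures what
the crawl captured, which is a sample, not everything actually served
during the crawl period -- and the latter is the thing I see no
practical way to measure.

# Rough sketch: total the body bytes of HTTP 200 text responses in one
# Common Crawl WARC file.  This counts what the crawl captured, not
# everything served on the Web during the crawl period.
# Requires: pip install warcio.  The filename below is a placeholder.
from warcio.archiveiterator import ArchiveIterator

def total_200_text_bytes(warc_path):
    total = 0
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            if record.http_headers is None:
                continue
            if record.http_headers.get_statuscode() != '200':
                continue
            ctype = record.http_headers.get_header('Content-Type') or ''
            if 'text' in ctype or 'html' in ctype:   # crude "text content" filter
                total += len(record.content_stream().read())
    return total

if __name__ == '__main__':
    print(total_200_text_bytes('CC-MAIN-example.warc.gz'))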
An MSc student and I have done some work on trying to come up with
ways of quantifying the extent to which, and in what respects, Common
Crawl can be taken to be _representative_ of the Web as a whole, which
may be what you are actually interested in. If so, let me know.
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.