Monthly crawl size and content generation

Pablo Villalobos

Jul 17, 2022, 12:19:10 PM
to Common Crawl
Hello CC!

Looking at https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize, it seems the monthly crawl size has been relatively stable at around 3B items over the past 4 years.

My question is, why is that the case? Is ~3B items the maximum amount you can crawl in a month with your available resources? Or could you crawl more but decide not to for some other reason? If you received a large enough donation would you increase the monthly crawl size?

The reason I'm asking this is that I'm interested in the rate of web content generation. I know the web is "infinite" but the number of published blog posts in a given year (for example) is finite and increasing. I'm trying to figure out if the number of new items in CC is a good proxy for that.

Thank you,
Pablo

Sebastian Nagel

Jul 28, 2022, 7:56:14 AM
to common...@googlegroups.com
Hi Pablo,

sorry for the delayed response...

> it seems the monthly crawl size has been relatively stable at around
> 3B items over the past 4 years.

Yes, since October 2016 the crawls have been between 2.5 and 3.5 billion
pages (successfully fetched pages only, not counting 404s, redirects, or
robots.txt captures).
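
If you want to reproduce these per-crawl counts yourself, a sketch like
the one below should work. It assumes you have registered our columnar
URL index as an Athena table ccindex.ccindex (as described in the
cc-index-table documentation) and that you replace YOUR-BUCKET with an
S3 bucket you own for the query output; the 'warc' subset contains only
the successfully fetched pages, while 404s, redirects and robots.txt
captures live in the 'crawldiagnostics' and 'robotstxt' subsets.

import time
import boto3

# Pages per monthly crawl, counting only successfully fetched pages
# (subset 'warc'). Adjust the region and output bucket as needed.
QUERY = """
SELECT crawl, COUNT(*) AS pages
FROM "ccindex"."ccindex"
WHERE subset = 'warc'
GROUP BY crawl
ORDER BY crawl
"""

athena = boto3.client("athena", region_name="us-east-1")
qid = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://YOUR-BUCKET/athena-results/"},
)["QueryExecutionId"]

# Poll until Athena finishes, then print one line per monthly crawl.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
for row in rows[1:]:  # the first row is the column header
    crawl, pages = (col.get("VarCharValue", "") for col in row["Data"])
    print(crawl, pages)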

> Is ~3B items the maximum amount you can crawl in a month with your
> available resources? Or could you crawl more but decide not to for
> some other reason?

This size is a compromise in several directions, given that
- the main crawls are released as more or less closed collections, and
- each crawl runs in a relatively short amount of time (two weeks of
crawling plus preparation and post-processing).
To crawl more data we'd need to extend the time a crawl is running,
because crawling individual sites faster is not really an option.
Alternatively, we'd need to switch to a continuous release of crawl
data, as done for our news collection.


> If you received a large enough donation would you increase the monthly
> crawl size?

Putting more machinery into crawling would be possible, of course.
However, we'd likely first invest the resources to improve
- the crawls themselves (fewer duplicates, avoiding poor-quality
pages, etc.)
- secondary data formats and metadata
- documentation and examples

> the number of published blog posts in a given year

> I'm trying to figure out if the number of new items in CC is
> a good proxy for that.

I'm not sure whether you'd get reliable estimates by looking at absolute
numbers in CC. Maybe comparing selected blogging domains across multiple
snapshots would give a better estimate?
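
For a small domain, a quick sketch of that comparison could look like
the following: it pulls the URL list for one domain from two snapshots
through our CDX index API at index.commoncrawl.org and counts the URLs
that appear only in the newer one. The domain example-blog.com is just a
placeholder, as are the two crawl labels; a domain with many captures
would additionally need the API's page parameter to paginate through
the results.

import json
import urllib.request

CDX = "https://index.commoncrawl.org/{crawl}-index?url={pattern}&output=json&fl=url"

def urls_in_crawl(crawl, pattern="example-blog.com/*"):
    # Return the set of captured URLs matching `pattern` in one crawl.
    req = urllib.request.Request(
        CDX.format(crawl=crawl, pattern=pattern),
        headers={"User-Agent": "cc-growth-estimate/0.1 (research)"},
    )
    with urllib.request.urlopen(req) as resp:
        return {json.loads(line)["url"] for line in resp.read().splitlines()}

old = urls_in_crawl("CC-MAIN-2021-25")
new = urls_in_crawl("CC-MAIN-2022-27")
print(len(new - old), "URLs appear only in the newer snapshot")

Even then, the difference conflates what the site actually published
with what our crawler happened to cover, so I'd treat it as a rough
bound rather than a direct measurement.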


Best,
Sebastian


