sorry for the delayed response...
> it seems the monthly crawl size has been relatively stable at around
> 3B items over the past 4 years.
Yes, since October 2016 the crawls have been between 2.5 and 3.5 billion
pages (successfully fetched pages, not counting 404s, redirects,
robots.txt fetches, etc.).
> Is ~3B items the maximum amount you can crawl in a month with your
> available resources? Or could you crawl more but decide not to for
> some other reason?
This size is a compromise in several directions, given that
- the main crawls are released as a kind of closed collection and are
  crawled in a relatively short amount of time (two weeks of crawling
  plus preparation and post-processing)
- to crawl more data we'd need to extend the time a crawl is running,
  because crawling individual sites faster is not really an option
- alternatively, we'd need to switch to a continuous release of
  crawl data, as is done for our news collection
> If you received a large enough donation would you increase the monthly
> crawl size?
Spending more machinery on crawling would be possible, of course.
However, we'd likely first invest the resources to improve
- the crawls themselves (fewer duplicates, avoidance of poor-quality pages)
- secondary data formats and metadata
- documentation and examples
> the number of published blog posts in a given year
> I'm trying to figure out if the number of new items in CC is
> a good proxy for that.
I'm not sure whether you'd get reliable estimates by looking at absolute
numbers in CC. Maybe comparing selected blogging domains across multiple
snapshots would give a better estimate?
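
For what it's worth, here is a rough, untested sketch (Python) of what
such a comparison could look like, counting index records per crawl via
the CDX API at index.commoncrawl.org. The crawl IDs and the domain below
are just placeholders, and the pagination parameters follow the CDX
server API:

    import requests

    def count_captures(crawl_id, domain):
        # Count index records (captured URLs) for one domain in one crawl.
        # Note: this counts fetched URLs, not unique posts; one might
        # additionally filter by HTTP status or deduplicate by digest.
        api = "https://index.commoncrawl.org/%s-index" % crawl_id
        # ask the index server how many result pages the query spans
        info = requests.get(api, params={"url": domain + "/*",
                                         "output": "json",
                                         "showNumPages": "true"}).json()
        total = 0
        for page in range(info["pages"]):
            resp = requests.get(api, params={"url": domain + "/*",
                                             "output": "json",
                                             "page": page})
            # one JSON record per line
            total += sum(1 for line in resp.text.splitlines() if line.strip())
        return total

    # placeholder crawl IDs (the full list is on index.commoncrawl.org)
    for crawl in ["CC-MAIN-2018-51", "CC-MAIN-2019-51", "CC-MAIN-2020-50"]:
        print(crawl, count_captures(crawl, "blog.example.com"))

Comparing those per-crawl counts for a handful of known blogging domains
should say more about trends than the total size of the crawls.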