Hi,
there is a huge database of URLs together with each page's fetch status, fetch time, score, and
content signature. It contains about 15 billion URLs right now. The fetch lists of the monthly
crawls are sampled from these 15 billion URLs:
- select URLs for which
threshold >= ((score * time_elapsed_since_last_fetch) + status_penalty)
- penalties are applied to 404s, robots.txt exclusions, duplicates, not-modified pages, etc.
- the penalties reflect that there is a good chance we get the same status again on a refetch
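
To make the selection rule above concrete, here is a minimal sketch in Python. It is only an illustration, not the actual crawler code: the record fields, the penalty values, the time unit (days), and the function names are all assumptions made for this example. The inequality is taken literally as written above.

```python
from dataclasses import dataclass

# Hypothetical penalty values, for illustration only. The real values
# used for the crawl database are not stated in this message.
STATUS_PENALTY = {
    "fetched": 0.0,
    "not_modified": 5.0,
    "duplicate": 5.0,
    "robots_denied": 10.0,
    "not_found": 10.0,  # 404
}


@dataclass
class UrlRecord:
    """One row of the (hypothetical) crawl database."""
    url: str
    score: float
    days_since_last_fetch: float  # assumed time unit: days
    status: str


def selection_value(rec: UrlRecord) -> float:
    """Right-hand side of the rule: (score * elapsed time) + status penalty."""
    penalty = STATUS_PENALTY.get(rec.status, 0.0)
    return rec.score * rec.days_since_last_fetch + penalty


def is_selected(rec: UrlRecord, threshold: float) -> bool:
    """Literal reading of the rule: threshold >= value."""
    return threshold >= selection_value(rec)


def build_fetch_list(records, threshold: float):
    """Return the URLs that pass the selection rule."""
    return [rec.url for rec in records if is_selected(rec, threshold)]
```

With these made-up numbers, two records that have the same score and the same time since the last fetch can end up on different sides of the threshold purely because of the status penalty.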
Last month we added one billion new URLs to the crawl database using three different approaches.
One of them is a breadth-first crawl, see
http://commoncrawl.org/2017/09/september-2017-crawl-archive-now-available/
> news crawling
is different. The news crawler uses RSS and Atom feeds and news sitemaps to find links to articles,
see
https://groups.google.com/d/topic/common-crawl/eQC0nLVqmQs/discussion
Best,
Sebastian