Hi all,
the October 2017 crawl archive is now available. The crawl was run from Oct 16 to Oct 24, 2017
and covers 3.65 billion web pages or more than 300 TiB of uncompressed content. As usual, details
on how to access and use the data can be found on our blog [1].
To improve coverage and freshness, we added over 900 million new URLs (not contained in any previous
crawl archive):
- 350 million URLs are a random sample extracted from sitemaps [2], where provided by any of the top
80 million hosts from the May/June/July 2017 webgraph data set [3]
- 250 million URLs were found by a side crawl reaching at most 3 links (“hops”) away from the
home pages of the top 80 million hosts
- 150 million URLs were randomly chosen from the WAT files of the September crawl
- 180 million URLs are links donated by mixnode.com
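For readers wondering what "at most 3 hops away from the home pages" means in practice, here is a minimal sketch of a hop-limited breadth-first traversal. It is only an illustration and not the crawler's actual code: the function name `urls_within_hops`, the in-memory `links` mapping and the example.com URLs are all hypothetical.

```python
from collections import deque


def urls_within_hops(home_page, links, max_hops=3):
    """Return {url: hops} for every URL reachable from home_page
    by following at most max_hops links (breadth-first)."""
    hops = {home_page: 0}
    queue = deque([home_page])
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # hop limit reached: do not follow links from this page
        for target in links.get(url, ()):
            if target not in hops:  # first visit is the shortest path in BFS
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical toy link graph: a chain of pages, each linking to the next.
toy_links = {
    "http://example.com/": ["http://example.com/a"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": ["http://example.com/d"],
}

found = urls_within_hops("http://example.com/", toy_links)
for url, distance in found.items():
    print(distance, url)
```

In this toy graph the page /c is exactly 3 hops from the home page and is kept, while /d would need a fourth hop and is never visited.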
About 3% of the crawl archive's 3.65 billion URLs overlap with the preceding September 2017 crawl.
The last two monthly archives (September and October) taken together cover more than 6 billion URLs [4].
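As a quick sanity check of the numbers above, the following sketch adds up the four sources of new URLs and converts the "about 3%" overlap into an absolute count. All inputs are the rounded figures from this announcement, so the results are approximate; the variable names are purely illustrative.

```python
# Rounded figures from this announcement, in millions of URLs.
new_url_sources_millions = {
    "sitemap sample": 350,
    "side crawl (max. 3 hops)": 250,
    "September WAT files": 150,
    "mixnode.com donation": 180,
}

# Sum of the four sources; consistent with "over 900 million new URLs".
total_new_millions = sum(new_url_sources_millions.values())

crawl_size_millions = 3650  # 3.65 billion URLs in the October archive
overlap_share = 0.03        # "about 3%" overlap with September

# Approximate number of URLs also present in the September crawl.
overlap_millions = crawl_size_millions * overlap_share

print(f"new URLs added: {total_new_millions} million")
print(f"overlap with September: about {overlap_millions:.0f} million")
```

So the four sources add up to 930 million new URLs, and the overlap with the September crawl comes to roughly 110 million URLs.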
Best,
Sebastian
[1] http://commoncrawl.org/2017/10/october-2017-crawl-archive-now-available/
[2] http://www.sitemaps.org/
[3] http://commoncrawl.org/2017/08/webgraph-2017-may-june-july/
[4] https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize