October 2017 Crawl Archive Now Available

Sebastian Nagel

Oct 29, 2017, 3:58:31 PM
to common...@googlegroups.com
Hi all,

The October 2017 crawl archive is now available. The crawl was run from Oct 16 to Oct 24, 2017
and covers 3.65 billion web pages, or more than 300 TiB of uncompressed content. As usual, details
on how to access and use the data can be found on our blog [1].

To improve coverage and freshness we added over 900 million new URLs (not contained in any crawl
archive before):

- 350 million URLs are a random sample extracted from the sitemaps [2], where provided, of the top 80
million hosts in the May/June/July 2017 webgraph data set [3]

- 250 million URLs were found by a side crawl at most 3 links (“hops”) away from the
home pages of the top 80 million hosts

- 150 million URLs are randomly chosen from WAT files of the September crawl

- 180 million URLs are links donated by mixnode.com
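
For readers unfamiliar with the idea of a hop-limited side crawl, here is a minimal sketch of a breadth-first traversal that stops following links after 3 hops. It is only an illustration, not the crawler's actual implementation: the function name `urls_within_hops`, the in-memory `link_graph` dictionary and the example URLs are all hypothetical, and a real crawl would fetch pages rather than look them up in a dictionary.

```python
from collections import deque


def urls_within_hops(link_graph, home_pages, max_hops=3):
    """Return {url: hops} for every URL reachable from a home page
    by following at most `max_hops` links (breadth-first search).

    `link_graph` is a hypothetical stand-in for fetched pages:
    it maps a URL to the list of URLs that page links to.
    """
    distance = {url: 0 for url in home_pages}
    queue = deque(distance)
    while queue:
        url = queue.popleft()
        if distance[url] == max_hops:
            continue  # hop limit reached: do not follow links from here
        for target in link_graph.get(url, ()):
            if target not in distance:  # first visit is the shortest path
                distance[target] = distance[url] + 1
                queue.append(target)
    return distance


# Toy link chain: home page -> p1 -> p2 -> p3 -> p4
toy_graph = {
    "http://a.example/": ["http://a.example/p1"],
    "http://a.example/p1": ["http://a.example/p2"],
    "http://a.example/p2": ["http://a.example/p3"],
    "http://a.example/p3": ["http://a.example/p4"],
}
found = urls_within_hops(toy_graph, ["http://a.example/"])
for url, hops in sorted(found.items(), key=lambda item: item[1]):
    print(hops, url)
```

In the toy chain, `p3` is exactly 3 hops from the home page and is kept, while `p4` would need a fourth hop and is never reached.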


About 3% of the crawl archive's 3.65 billion URLs overlap with the preceding September 2017 crawl.
The last two monthly archives (September and October) taken together cover more than 6 billion URLs [4].
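
The figures above can be cross-checked with a little arithmetic. The sketch below treats the rounded numbers from this message as exact, so the results are rough estimates. The share of new URLs is an upper bound under the assumption (not stated above) that every newly added URL ended up in the archive.

```python
# Rounded figures from the announcement, in millions of URLs.
new_url_sources = {
    "sitemap sample": 350,
    "side crawl (max. 3 hops)": 250,
    "September WAT files": 150,
    "mixnode.com donation": 180,
}
crawl_size = 3650        # 3.65 billion pages in the October crawl
overlap_fraction = 0.03  # "about 3%" overlap with September

total_new = sum(new_url_sources.values())
new_share = total_new / crawl_size
overlap_urls = overlap_fraction * crawl_size

print(f"new URLs added: {total_new} million")
print(f"share of crawl (upper bound): {new_share:.1%}")
print(f"overlap with September: about {overlap_urls:.0f} million URLs")
```

The four sources sum to 930 million, consistent with "over 900 million", and a 3% overlap corresponds to roughly 110 million URLs.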


Best,
Sebastian


[1] http://commoncrawl.org/2017/10/october-2017-crawl-archive-now-available/
[2] http://www.sitemaps.org/
[3] http://commoncrawl.org/2017/08/webgraph-2017-may-june-july/
[4] https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize