July 2017 crawl archive now available


Sebastian Nagel

Aug 1, 2017, 6:13:04 AM

to common...@googlegroups.com

Hi all,

The July 2017 crawl archive is now available. The crawl ran from July 20 to July 29, 2017, and
covers 2.89 billion web pages, or more than 240 TiB of uncompressed content. Details on how to
access and use the data can be found on our blog [1].

To extend the crawl we used the top 50 million most "popular" hosts from the recently released
host-level webgraph data set [2] and added

- 300 million new URLs at most 4 links ("hops") away from the 50 million home pages

- another 250 million pages selected by randomly sampling URLs found in sitemaps [3] "announced"
in the robots.txt files of these 50 million hosts
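
For readers unfamiliar with how sitemaps are "announced", the second bullet above can be illustrated roughly as follows. This is only a minimal sketch, not the crawler's actual seed-selection code: the function names (sitemaps_from_robots, urls_from_sitemap, sample_urls) are hypothetical, fetching is left out, and sitemap index files and compressed sitemaps are not handled.

```python
import random
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset>, <url>, <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs announced via 'Sitemap:' lines in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop comments; this simplification also cuts URLs containing '#'.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def urls_from_sitemap(sitemap_xml):
    """Return the page URLs listed in the <loc> elements of a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def sample_urls(urls, k, seed=None):
    """Draw up to k distinct URLs uniformly at random (seed is for reproducibility)."""
    rng = random.Random(seed)
    unique = sorted(set(urls))
    return rng.sample(unique, min(k, len(unique)))
```

In use, one would fetch each host's robots.txt, pass its text to sitemaps_from_robots, fetch and parse the listed sitemaps, and then sample from the pooled URLs.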

About 18% of the crawl archive's 2.89 billion URLs overlap with the preceding June 2017 crawl. The
last two monthly archives (June and July) taken together cover over 5 billion URLs; the last three
archives (May, June, July) contain content from more than 6 billion unique URLs (cf. [4]).
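
As a quick back-of-the-envelope check of the overlap figure, the snippet below turns the rounded numbers from this announcement into absolute counts. Both inputs are approximate ("about 18%", "2.89 billion"), so the results are rough estimates, not exact statistics; see [4] for the measured values.

```python
# Rounded figures taken from the announcement above.
TOTAL_URLS = 2_890_000_000   # URLs in the July 2017 crawl
OVERLAP_SHARE = 0.18         # share also present in the June 2017 crawl

overlap_urls = round(TOTAL_URLS * OVERLAP_SHARE)
new_urls = TOTAL_URLS - overlap_urls

print(f"shared with June 2017: ~{overlap_urls / 1e9:.2f} billion")
print(f"not in June 2017:      ~{new_urls / 1e9:.2f} billion")
```

That is, roughly 0.52 billion URLs are shared with June, and about 2.37 billion were not in the June archive.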

Best,
Sebastian

[1] http://commoncrawl.org/2017/07/july-2017-crawl-archive-now-available/
[2] http://commoncrawl.org/2017/05/hostgraph-2017-feb-mar-apr-crawls/
[3] http://www.sitemaps.org/
[4] https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize
