June 2017 crawl archive now available


Sebastian Nagel

Jul 4, 2017, 4:22:09 AM
to common...@googlegroups.com
Hi all,

The June 2017 crawl archive is now available. The crawl ran from June 22 to June 29, 2017 and
covers 3.16 billion web pages, or more than 260 TiB of uncompressed content. Details on how to
access and use the data can be found on our blog [1].


To extend the crawl we used the top 40 million most "popular" hosts from the recently released
host-level webgraph data set [2] and added

- 500 million new URLs within a maximum of 3 links (“hops”) away from the 40 million home pages

- another 300 million pages selected by randomly sampling URLs found in sitemaps [3] provided by
these 40 million hosts

About 33% of the URLs overlap with the preceding April crawl, and about 800 million URLs were not
contained in any previous crawl archive.
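
For a rough sense of scale, the figures quoted in this message can be turned into a few derived numbers. This is only a back-of-the-envelope sketch: it assumes one fetched page per URL, treats "more than 260 TiB" as a lower bound, and the constant and function names are made up for illustration.

```python
# Figures quoted in the announcement.
PAGES = 3.16e9            # web pages in the June 2017 crawl
UNCOMPRESSED_TIB = 260    # "more than 260 TiB", used here as a lower bound
OVERLAP_SHARE = 0.33      # share of URLs also found in the April crawl
NEW_URLS = 800e6          # URLs not contained in any earlier crawl archive

TIB = 2 ** 40             # bytes per tebibyte


def summarize():
    """Derive rough per-page and per-URL numbers (assumes one page per URL)."""
    avg_page_kib = UNCOMPRESSED_TIB * TIB / PAGES / 1024
    overlap_urls = OVERLAP_SHARE * PAGES
    new_share = NEW_URLS / PAGES
    return avg_page_kib, overlap_urls, new_share


if __name__ == "__main__":
    avg_kib, overlap, new_share = summarize()
    print(f"average uncompressed page size: at least {avg_kib:.0f} KiB")
    print(f"URLs shared with the April crawl: about {overlap / 1e9:.2f} billion")
    print(f"share of never-before-crawled URLs: about {new_share:.0%}")
```

Under these assumptions the average page comes to a bit under 90 KiB uncompressed, and roughly a quarter of the URLs are new to the archives.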


The crawl workflow has been changed so that WARC files are written immediately after the content is
fetched [4]. WAT and WET files are still generated in a post-processing step. In the future, we plan to
announce the start of the monthly crawls, so that "impatient" users can use the WARC data before the
final announcement.

Unfortunately, one WARC and one WET file (out of 71840) have been irrecoverably lost due to an
operational mistake. We will try to avoid such errors in the future [5].


Best,
Sebastian

[1] http://commoncrawl.org/2017/07/june-2017-crawl-archive-now-available/
[2] http://commoncrawl.org/2017/05/hostgraph-2017-feb-mar-apr-crawls/
[3] http://www.sitemaps.org/
[4] https://github.com/commoncrawl/nutch/commit/3ac27446286394cd7d77afecc0b2b071173eb435
https://github.com/commoncrawl/nutch/commit/ccc558a5a081dd83dea103ec0e343d76721951a0
[5] https://github.com/commoncrawl/ia-hadoop-tools/issues/2