to common...@googlegroups.com
Hi all,
the June 2017 crawl archive is now available. The crawl was run from June 22 to June 29, 2017 and
covers 3.16 billion web pages or more than 260 TiB of uncompressed content. Details on how to
access and use the data can be found on our blog [1].
To extend the crawl we used the top 40 million most "popular" hosts from the recently released
host-level webgraph data set [2] and added
- 500 million new URLs within a maximum of 3 links (“hops”) away from the 40 million home pages
- another 300 million pages selected by random sampling URLs found in sitemaps [3] provided by
these 40 million hosts
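
The two selection steps above can be illustrated with a small sketch. This is not the crawler's actual code: the function names, the in-memory link graph, and the toy URLs are all hypothetical, and a real run at this scale would be distributed rather than held in a Python dict. It only shows what "within a maximum of 3 hops" and "random sampling of sitemap URLs" could mean in practice.

```python
import random
from collections import deque


def urls_within_hops(home_pages, links, max_hops=3):
    """Breadth-first expansion: map each reachable URL to its hop distance.

    `links` is a hypothetical adjacency mapping {url: [linked urls]}.
    URLs more than `max_hops` links away from every home page are left out.
    """
    hops = {url: 0 for url in home_pages}
    queue = deque(hops)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, ()):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


def sample_sitemap_urls(sitemap_urls, k, seed=0):
    """Draw up to k distinct URLs uniformly at random from the sitemap URLs."""
    rng = random.Random(seed)  # fixed seed only to keep the sketch repeatable
    pool = sorted(set(sitemap_urls))
    return rng.sample(pool, min(k, len(pool)))


# Toy data (made up): a chain of pages starting at one home page.
toy_links = {
    "a.example/": ["a.example/1"],
    "a.example/1": ["a.example/2"],
    "a.example/2": ["a.example/3"],
    "a.example/3": ["a.example/4"],
}
reachable = urls_within_hops(["a.example/"], toy_links, max_hops=3)
sampled = sample_sitemap_urls(["a.example/x", "a.example/y", "a.example/x"], k=1)
```

In the toy graph, `a.example/4` is four links away from the home page and is therefore not selected.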
About 33% of the URLs overlap with the preceding April crawl, and about 800 million URLs were not
contained in any earlier crawl archive.
The crawl workflow has been changed so that WARC files are written immediately after the content is
fetched [4]. WAT and WET files are still generated in a post-processing step. In the future, we plan
to announce the start of each monthly crawl, so that "impatient" users can use the WARC data before
the final announcement.
Unfortunately, one WARC and one WET file (out of 71840) have been irrecoverably lost due to an
operational mistake. We will try to avoid such errors in the future [5].