June 2016 crawl archive now available


Sebastian Nagel

Jul 14, 2016, 8:23:17 AM
to common...@googlegroups.com
Hi all,

the June 2016 crawl archive is now available. It contains 1.23 billion web pages.
Details on how to access and use the data can be found on our blog [1].

The June crawl is based on the same URL seed list as the preceding May crawl.
The crawler configuration was modified so that, in addition to the rank/score of
a page, it prefers pages that have not been fetched for a longer time. This only
affects hosts with so many pages that not all of them can be fetched within the
8-9 days a monthly crawl runs.
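
To illustrate the idea of combining a page's rank/score with the time since its last fetch, here is a minimal sketch. It is not the actual crawler configuration, which is not shown in this thread: the names `Page`, `fetch_priority`, `select_for_host`, the linear formula and the `staleness_weight` value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    score: float             # rank/score of the page (higher is better)
    days_since_fetch: float  # time since the page was last fetched


def fetch_priority(page, staleness_weight=0.1):
    """Hypothetical priority: page score plus a bonus for staleness."""
    return page.score + staleness_weight * page.days_since_fetch


def select_for_host(pages, budget, staleness_weight=0.1):
    """Pick at most `budget` pages of one host, highest priority first."""
    ranked = sorted(
        pages,
        key=lambda p: fetch_priority(p, staleness_weight),
        reverse=True,
    )
    return ranked[:budget]


if __name__ == "__main__":
    host_pages = [
        Page("http://example.com/a", score=1.0, days_since_fetch=0),
        Page("http://example.com/b", score=0.5, days_since_fetch=30),
        Page("http://example.com/c", score=0.9, days_since_fetch=5),
    ]
    for page in select_for_host(host_pages, budget=2):
        print(page.url)
```

In this sketch the staleness bonus only changes the outcome when `budget` is smaller than the number of pages on the host. If every page fits into the budget, all of them are selected regardless of the weight, which matches the remark that only hosts with a large number of pages are affected.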

Pavel Smrz

Jul 14, 2016, 8:45:36 AM
to common...@googlegroups.com
Dear Sebastian,

Could you please share the current configuration files?

Many thanks

Pavel

--
Pavel Smrz
Associate professor
Faculty of Information Technology
Brno University of Technology
Bozetechova 2, 61266 Brno
Czech Republic
Phone: +420 541 141 282
Fax: +420 541 141 290
http://www.fit.vutbr.cz/~smrz


Sebastian Nagel

Jul 14, 2016, 9:43:58 AM
to common...@googlegroups.com
Dear Pavel,

unfortunately, the crawler configuration is kept in the same repository as private scripts
and configuration files.  I'll see how to get these separated.  I don't see a problem with
making the crawler-specific configuration (thresholds, fetch intervals, etc.) public.

Best,
Sebastian