July 2016 crawl archive now available


Sebastian Nagel

Aug 9, 2016, 1:15:13 PM
to common...@googlegroups.com
Hi all,

the July 2016 crawl archive is now available. It contains 1.73 billion
web pages. Details on how to access and use the data can be found in
our blog post [1].

The July crawl is based on the same URL seed list as the preceding
June crawl. There is one important change to the crawler configuration:
the crawler now follows redirects without delay, up to three hops. This
was the behavior in all crawls up to and including February 2016; it
had been changed for the last three crawls to avoid URL-level duplicates.
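
For illustration, here is a minimal Python sketch of such a hop limit
(hypothetical code, not the crawler's actual implementation; the
function name and the MAX_HOPS constant are made up for this example):

    import http.client
    from urllib.parse import urlsplit, urljoin

    MAX_HOPS = 3  # mirrors the "three hops" policy described above

    def fetch_following_redirects(url, max_hops=MAX_HOPS):
        # Fetch url, immediately following at most max_hops redirects.
        # Returns (final_url, status, body).
        for _ in range(max_hops + 1):
            parts = urlsplit(url)
            conn_cls = (http.client.HTTPSConnection
                        if parts.scheme == "https"
                        else http.client.HTTPConnection)
            conn = conn_cls(parts.netloc, timeout=10)
            path = (parts.path or "/") + \
                   ("?" + parts.query if parts.query else "")
            conn.request("GET", path)
            resp = conn.getresponse()
            if resp.status in (301, 302, 303, 307, 308):
                location = resp.getheader("Location")
                conn.close()
                if location is None:
                    raise RuntimeError("redirect without Location header")
                url = urljoin(url, location)  # resolve relative targets
                continue
            body = resp.read()
            conn.close()
            return url, resp.status, body
        raise RuntimeError("more than %d redirect hops: %s"
                           % (max_hops, url))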

Redirects are typically used for canonicalization. The crawler has to
check whether a redirect target has already been fetched in order to
avoid duplicates. The same is true if web servers send a redirect
(e.g., to the homepage) instead of a 404 "not found" error. On the
other hand, webmasters often use redirects to set session IDs or
cookies, or to track which outgoing links are followed. The crawler
has to follow those redirects promptly to successfully fetch the page
content of the redirect target. If it does not, we miss a significant
amount of content.
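
As a toy illustration of that duplicate check (all names here are
hypothetical, and, as explained below, the real crawler cannot perform
this check synchronously):

    def handle_redirect(target_url, fetched_urls, frontier):
        # Toy synchronous check: if the redirect target was already
        # fetched, following the redirect again would put the same URL
        # into the archive twice (a URL-level duplicate), so drop it;
        # otherwise queue the target for a prompt fetch.
        if target_url in fetched_urls:
            return False
        frontier.append(target_url)
        return True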

To keep the number of URL-level duplicates within acceptable bounds,
redirects are deduplicated: if two redirects point to the same target,
one of them is discarded. This is done in advance, based on the data
from the previous month. The distributed architecture of the crawler
does not allow synchronous online deduplication, and the crawler also
cannot differentiate between "must follow now" and "follow later and
dedup" redirects.

The July crawl archives contain 6% URL-level redirects, down from 10%
in February. We are working to reduce this number further.

Sebastian

[1] http://commoncrawl.org/2016/08/july-2016-crawl-archive-now-available/