Duplicates


Olexiy Lytvynenko
Nov 29, 2016, 5:33:22 AM
to Common Crawl
Hello,

Maybe this matter has already been discussed. I noticed that there are a lot of duplicates in the robots.txt WARC files; many hosts appear in several files.
Just curious, what is the reason for this?

Sebastian Nagel
Nov 29, 2016, 6:16:52 AM
to common...@googlegroups.com
Hi Olexiy,

Yes, there are duplicate robots.txt responses for a couple of reasons:

Robots.txt responses should only be cached for a certain amount of time.
At present Common Crawl caches them for 2 hours, i.e. the time used for
fetching one segment. For large sites there may therefore be robots.txt
responses in many or even all 100 segments.
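To illustrate how these duplicates arise, here is a minimal Python sketch of
such a time-limited cache. It is not the actual Nutch/Common Crawl code; the
class and the fetch_rules callable are purely illustrative.

import time

# Minimal sketch (not the actual Nutch/Common Crawl code): a robots.txt cache
# whose entries expire after roughly the time it takes to fetch one segment.
SEGMENT_TTL_SECONDS = 2 * 60 * 60  # ~2 hours, i.e. one segment

class RobotsCache:
    def __init__(self, ttl=SEGMENT_TTL_SECONDS):
        self.ttl = ttl
        self._entries = {}  # (scheme, host, port) -> (fetch_time, parsed_rules)

    def get(self, key, fetch_rules):
        """Return the cached rules for `key`, refetching if the entry expired.

        `fetch_rules` is a hypothetical callable that downloads and parses
        <scheme>://<host>:<port>/robots.txt and returns the parsed rules.
        """
        now = time.time()
        entry = self._entries.get(key)
        if entry is None or now - entry[0] > self.ttl:
            # Expired or never seen: a fresh fetch happens, so a (possibly
            # duplicate) robots.txt response lands in this segment's WARC files.
            self._entries[key] = (now, fetch_rules(key))
        return self._entries[key][1]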

If there are both http and https URLs for one host, the robots.txt is fetched
over both protocols. In rare cases the rules for http and https may differ.
The original robots.txt RFC draft [1] does not mention protocols and ports.
However, Google's spec says that a robots.txt location "is not valid for
other subdomains, protocols or port numbers." [2]
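As a rough sketch of what that rule implies for a fetcher (illustrative
Python, not Common Crawl's actual code), the cache key has to include scheme,
host, and port, so the http and https copies of a site are fetched and stored
separately:

from urllib.parse import urlsplit

# Sketch only: derive the robots.txt URL and cache key for a page URL,
# following Google's rule that a robots.txt file is only valid for its own
# scheme, host, and port (the RFC draft is silent on protocols and ports).
def robots_location(page_url):
    parts = urlsplit(page_url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    key = (parts.scheme, parts.hostname, port)
    return key, f"{parts.scheme}://{parts.netloc}/robots.txt"

# http and https pages of the same host map to two different robots.txt
# fetches, hence two (possibly differing) records in the WARC files:
print(robots_location("http://example.com/page"))
# (('http', 'example.com', 80), 'http://example.com/robots.txt')
print(robots_location("https://example.com/page"))
# (('https', 'example.com', 443), 'https://example.com/robots.txt')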

A smaller number of duplicates may also be caused by redirects.


Some duplicates are unavoidable for politeness and technical reasons:
2 hours of caching is quite short. From web server logs I know that Googlebot
refetches the robots.txt after 12 hours. But the 2 hours is dictated by
technical constraints: because we use AWS EC2 spot instances, we need
to checkpoint the data at reasonably short intervals. That's why the crawl
is split into 100 segments, each fetched within 2 hours.


For smaller sites the distribution of URLs over segments is at present not optimal
with regard to robots.txt caching: if a host has 100 URLs, they are distributed over
all 100 segments (as far apart as possible). This will be improved in the next
crawl. In October we fetched 180 million robots.txt files for 50 million hosts.
The target is to bring the number of robots.txt fetches down to 90 million, or
1.5 to 2 times the number of hosts.
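
A rough sketch of the planned improvement (hypothetical Python, with made-up
function names rather than the real segment generator): packing a small host's
URLs into a single segment instead of spreading them over all 100 means its
robots.txt is fetched once per crawl rather than up to 100 times.

import hashlib

NUM_SEGMENTS = 100

def spread_over_segments(urls):
    """Current behaviour, simplified: a host's URLs land in as many segments as possible."""
    return {i % NUM_SEGMENTS for i in range(len(urls))}

def pack_by_host(host):
    """Improved behaviour, simplified: all URLs of a small host share one segment."""
    digest = int(hashlib.md5(host.encode("utf-8")).hexdigest(), 16)
    return {digest % NUM_SEGMENTS}

urls = ["http://smallsite.example/page%d" % i for i in range(100)]
print(len(spread_over_segments(urls)))         # 100 segments -> up to 100 robots.txt fetches
print(len(pack_by_host("smallsite.example")))  # 1 segment -> a single robots.txt fetch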

Best,
Sebastian


[1] http://www.robotstxt.org/norobots-rfc.txt
[2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt#file-location--range-of-validity

Olexiy Lytvynenko
Nov 29, 2016, 8:39:14 AM
to Common Crawl
Hi Sebastian,

Thank you for your comprehensive answer.

Regards,