Hi Olexiy,
yes, there are duplicate robots.txt responses, for a couple of reasons:
- robots.txt responses should only be cached for a limited amount of time.
  At present, Common Crawl caches them for 2 hours, or the time used
  for fetching one segment. For large sites there may be robots.txt responses
  in many or even all 100 segments.
- If there are http and https URLs for one host, the robots.txt is fetched
  for both. In rare cases the rules for http and https may differ.
  The original robots.txt RFC draft [1] does not mention protocols and ports.
  However, Google's spec says that a robots.txt location "is not valid for
  other subdomains, protocols or port numbers." [2]
- A smaller number of duplicates may also be caused by redirects.
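
To illustrate the first two points, here is a minimal Python sketch of a robots.txt cache that is keyed by protocol, host and port and expires entries after 2 hours. This is not the crawler's actual code: `robots_cache_key`, `RobotsCache` and the `fetch` callback are hypothetical names made up for the example.

```python
import time
from urllib.parse import urlsplit

CACHE_TTL_SECONDS = 2 * 60 * 60  # 2 hours, the caching time mentioned above
DEFAULT_PORTS = {"http": 80, "https": 443}


def robots_cache_key(url):
    """Return (scheme, host, port), so http and https are cached separately."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


class RobotsCache:
    """Hypothetical cache: one robots.txt entry per (scheme, host, port)."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # key -> (time of fetch, rules)
        self.fetches = 0   # number of robots.txt requests actually sent

    def get(self, url, fetch):
        """Return cached rules for url, calling fetch(key) if missing or expired."""
        key = robots_cache_key(url)
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]
        rules = fetch(key)
        self.fetches += 1
        self.entries[key] = (now, rules)
        return rules
```

With such a cache, two URLs on http://example.com/ within 2 hours cause one robots.txt fetch, while the same paths on https://example.com/ cause a second one, and any request after the entry has expired causes another.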
There must be some duplicates for politeness and technical reasons:
- Caching for 2 hours is quite short. From webserver logs I know that Googlebot
  refetches the robots.txt after 12 hours. But the 2 hours are dictated by
  technical constraints: because we use AWS EC2 spot instances, we need
  to checkpoint the data at reasonably short intervals. That's why the crawl
  is split into 100 segments, each fetched within 2 hours.
- For smaller sites the distribution of URLs over segments is at present not optimal
  regarding robots.txt caching: if there are 100 URLs, they're distributed over
  100 segments (as far apart as possible). This will be improved in the next
  crawl. In October we had 180 million fetched robots.txt for 50 million hosts.
  The target is to get the number of robots.txt fetches down to 90 million, or
  1.5 to 2 times the number of hosts.
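
A rough sketch of why the distribution matters, under the simplifying assumption that a host needs one robots.txt fetch per segment in which it has URLs (one protocol, no redirects). The `grouped` assignment is only one conceivable alternative for illustration, not necessarily what the next crawl will do, and all function names are made up.

```python
NUM_SEGMENTS = 100


def robots_fetches(segments_of_urls):
    """Assumed cost model: one robots.txt fetch per distinct segment."""
    return len(set(segments_of_urls))


def spread_out(num_urls, num_segments=NUM_SEGMENTS):
    """Current behaviour as described above: URLs as far apart as possible."""
    return [i % num_segments for i in range(num_urls)]


def grouped(num_urls, urls_per_segment, num_segments=NUM_SEGMENTS):
    """Hypothetical alternative: pack a host's URLs into few segments."""
    return [(i // urls_per_segment) % num_segments for i in range(num_urls)]


# A host with 100 URLs: 100 fetches when spread out, 2 when grouped by 50.
print(robots_fetches(spread_out(100)))
print(robots_fetches(grouped(100, urls_per_segment=50)))

# Ratios from the numbers above: fetches per host in October and for the target.
october_ratio = 180_000_000 / 50_000_000  # 3.6
target_ratio = 90_000_000 / 50_000_000    # 1.8, within the 1.5 to 2 range
print(october_ratio, target_ratio)
```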
Best,
Sebastian
[1] http://www.robotstxt.org/norobots-rfc.txt
[2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt#file-location--range-of-validity