What does `isDeferVisits` mean?


Laurence Gonsalves

Apr 5, 2018, 2:54:02 PM
to crawler-commons

What is the intended use of `BaseRobotRules.isDeferVisits()`? I tried searching on GitHub to see how open source crawlers use it, but I couldn't find any real uses.

Looking at the code for `SimpleRobotRulesParser`, I see that it's set at exactly the same times that `failedFetch()` returns `ALLOW_NONE`, so I can infer that it should probably be checked after either `failedFetch()` or `parseContent()`. But then what does its value mean?
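To make the coupling concrete, here is a toy model of the pattern I'm describing. `RobotRulesModel` and this `failedFetch()` are stand-ins of my own, not the real crawler-commons classes; I'm assuming that a server error is the case that produces allow-none rules with the defer flag set:

```java
// Toy model of the observed behavior; RobotRulesModel and failedFetch are
// illustrative stand-ins, NOT the real crawler-commons implementation.
public class RobotsFailureModel {

    static class RobotRulesModel {
        final boolean allowNone;
        final boolean deferVisits;

        RobotRulesModel(boolean allowNone, boolean deferVisits) {
            this.allowNone = allowNone;
            this.deferVisits = deferVisits;
        }

        boolean isAllowed(String url) { return !allowNone; }
        boolean isDeferVisits()       { return deferVisits; }
    }

    // Mirrors the coupling observed in the source: deferVisits is true
    // exactly when the failed fetch yields allow-none rules (here modeled
    // as any 5xx response).
    static RobotRulesModel failedFetch(int httpStatusCode) {
        if (httpStatusCode >= 500) {
            return new RobotRulesModel(true, true);   // allow none, and defer
        }
        return new RobotRulesModel(false, false);     // e.g. 404: allow all
    }

    public static void main(String[] args) {
        System.out.println(failedFetch(503).isDeferVisits()); // true
        System.out.println(failedFetch(404).isDeferVisits()); // false
    }
}
```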

Ken Krugler

Apr 5, 2018, 5:37:38 PM
to crawler...@googlegroups.com
Hi Laurence,

If the result of fetching robots.txt is a failure, then we don’t know whether or not any URLs should (or shouldn’t) be fetched.

Which means it doesn't make sense to blindly allow or disallow URLs, since a crawler often wants to treat an explicit "disallow" as "set status to blocked, and check again in a really long time".

So if isDeferVisits() returns true, then typically a crawler would want to:

(a) set the fetch time of URLs for that domain to the current time plus some recheck interval (though only for the URLs currently being checked, not every URL, of course), and

(b) ensure that the robots file is refetched within that recheck interval.
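Steps (a) and (b) might be sketched like this. This is a minimal illustration, not code from any real crawler: `RobotRules` is a one-method stand-in for crawler-commons' `BaseRobotRules`, and the 24-hour recheck interval and method names are assumptions of mine:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the deferral logic described above (illustrative only).
public class DeferVisitsSketch {

    // Stand-in for crawler-commons' BaseRobotRules; only the defer flag
    // matters for this sketch.
    interface RobotRules {
        boolean isDeferVisits();
    }

    // Hypothetical recheck interval: how long to wait before trying again.
    static final Duration RECHECK_INTERVAL = Duration.ofHours(24);

    /**
     * Returns the next-fetch time for each URL that was about to be crawled.
     * When robots.txt could not be fetched (isDeferVisits() == true), every
     * checked URL is pushed out by RECHECK_INTERVAL instead of being
     * allowed or disallowed outright.
     */
    static Map<String, Instant> scheduleVisits(RobotRules rules,
                                               List<String> urlsBeingChecked,
                                               Instant now) {
        Map<String, Instant> nextFetch = new HashMap<>();
        Instant deferred = now.plus(RECHECK_INTERVAL);
        for (String url : urlsBeingChecked) {
            // (a) defer only the URLs being checked, not every URL for the domain
            nextFetch.put(url, rules.isDeferVisits() ? deferred : now);
        }
        // (b) the caller must also schedule a robots.txt refetch before
        //     'deferred', so the next pass has real rules to apply.
        return nextFetch;
    }

    public static void main(String[] args) {
        RobotRules failed = () -> true;   // robots.txt fetch failed
        Instant now = Instant.parse("2018-04-05T00:00:00Z");
        Map<String, Instant> plan = scheduleVisits(failed,
                List.of("https://example.com/a"), now);
        System.out.println(plan.get("https://example.com/a")); // 2018-04-06T00:00:00Z
    }
}
```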

— Ken


