The spider I am building needs to crawl sites but only follow URLs that point to external sites.
The rule I am using is as follows:
Rule(SgmlLinkExtractor(allow=("((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)", ),), callback="parse_items", follow=True),
The regex does work to filter out and return a list of only those URLs that begin with mailto, news, http, etc. But I also need to remove any fully qualified links that share a domain with the request URL. For example, if the current request URL is www.dmoz.org, the spider must not follow any links whose domain contains dmoz.org.
I would like to use process_links, which is defined as a Rule parameter for filtering links, but I have been unable to find an example of such a method in action. Primarily: what method signature works here, and what parameters do I need to pass in?
What would a method look like to which I can assign the process_links parameter in the rule? I already have the code to filter the unwanted links, I just need to get it into the rule.
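For anyone landing here with the same question, here is a minimal sketch of the kind of filter a Rule's process_links can call. Per the Scrapy docs, process_links receives the list of Link objects extracted by the Rule's link extractor and must return the (filtered) list; as a spider method its signature is (self, links). The Link stand-in class, the filter_external name, and the hard-coded dmoz.org domain below are all illustrative assumptions, not part of the original question:

```python
from urllib.parse import urlparse


class Link:
    """Hypothetical stand-in for scrapy.link.Link, which exposes a .url attribute."""
    def __init__(self, url):
        self.url = url


def filter_external(links, own_domain):
    """Return only the links whose domain is NOT own_domain (or a subdomain of it).

    This mirrors what a spider's process_links method would do: take the
    list of extracted Link objects and return a filtered list.
    """
    kept = []
    for link in links:
        domain = urlparse(link.url).netloc
        # endswith catches both dmoz.org and subdomains like www.dmoz.org
        if not domain.endswith(own_domain):
            kept.append(link)
    return kept


links = [Link("http://www.dmoz.org/Arts/"), Link("http://example.com/")]
external = filter_external(links, "dmoz.org")
```

In the spider itself you would pass the method by name, e.g. Rule(SgmlLinkExtractor(...), callback="parse_items", process_links="filter_links", follow=True), with def filter_links(self, links): defined on the spider and returning the filtered list.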
Much thanks.
** full disclosure: I am learning Python and simultaneously trying to unlearn years of static-language programming