Example of process_links in Rule

Michael Pastore

Jan 7, 2014, 6:45:03 PM
to scrapy...@googlegroups.com
The spider I am building needs to crawl sites but follow only URLs that point to external sites.

The rule I am using is as follows:

        Rule(SgmlLinkExtractor(allow=(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)",)), callback="parse_items", follow=True),

and the regex does work to filter out and return a list of only those URLs that begin with mailto, news, http, and so on. But I also need to remove any fully qualified links with the same domain as the request URL. For example, if the current request URL is www.dmoz.org, the spider must not follow any links whose domain is dmoz.org.
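
To make the requirement concrete, the check I have in mind looks roughly like this (a rough sketch; the helper name is just a placeholder):

        from urlparse import urlparse

        def same_domain(link_url, request_url):
            # True when the link points back at the domain of the page being crawled
            link_host = urlparse(link_url).netloc.lower()
            request_host = urlparse(request_url).netloc.lower()
            return link_host == request_host or link_host.endswith('.' + request_host)

        same_domain('http://www.dmoz.org/about', 'http://www.dmoz.org/')  # True  -> skip
        same_domain('http://example.com/page', 'http://www.dmoz.org/')    # False -> follow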

I would like to use process_links, which is defined as a Rule parameter for filtering links, but I have been unable to find an example of such a method in action. Primarily: what method signature works here, and what parameters do I need to pass in?

What would a method look like that I can assign to the process_links parameter in the rule? I already have the code to filter the unwanted links; I just need to get it into the rule.

Much thanks.

** Full disclosure: I am learning Python and simultaneously trying to unlearn years of statically typed language programming.

Paul Tremberth

Jan 8, 2014, 4:54:16 AM
to scrapy...@googlegroups.com
Hi Michael,

"process_links" takes a list of scrapy.link.Link objects and is expected to return a list of scrapy.link.Link objects
(see scrapy.link.Link class definition at https://github.com/scrapy/scrapy/blob/master/scrapy/link.py#L8
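
A minimal, do-nothing version shows the shape of it (just an illustration; nothing requires you to name it this way):

def process_links(links):
    # links is a list of scrapy.link.Link objects,
    # e.g. Link(url='http://example.com/', text='Example')
    # return the (possibly filtered) list you want the spider to follow
    return links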

You can look for inspiration in SgmlLinkExtractor around the _link_allowed method, which works on one link at a time,
and how it uses "deny_domains".
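
In fact, if the domains to exclude are known up front, the extractor's own "deny_domains" argument may be all you need (a sketch, reusing your allow pattern):

SgmlLinkExtractor(
    allow=(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)",),
    deny_domains=['dmoz.org'],  # drop links to these domains at extraction time
)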

You can define "process_links" as a standalone function or as a spider method (the latter is probably what you want, so it can hold the domains to filter out).

Try something like:

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.url import url_is_from_any_domain

class MySpider(CrawlSpider):
    ...
    filtered_domains = ['dmoz.org']
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)",)),
            process_links='link_filtering',
            callback="parse_items",
            follow=True,
        ),
    )

    def link_filtering(self, links):
        # keep only the links whose domain is not in filtered_domains
        ret = []
        for link in links:
            parsed_url = urlparse(link.url)
            if not url_is_from_any_domain(parsed_url, self.filtered_domains):
                ret.append(link)
        return ret
    ...
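
Note that CrawlSpider calls "process_links" with the whole list of links extracted from each response, before any requests are scheduled, so anything you drop here is never followed.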

Hope this helps.

/Paul.