Nested Rules for CrawlSpider

scrapysnax

Aug 30, 2012, 1:02:33 PM
to scrapy...@googlegroups.com
I need to use different rules at each level of the crawl for a site I'm spidering. For instance, Rule 1 allows all .asp links except for things like the sitemap and about 12 other links I've defined. Rule 2 needs to apply only to the 2nd-level links, as there's a ton of cruft there I don't want, and Rule 3 needs to apply only to the 3rd level, as the links on those pages are where the data I want to extract lies.

I can't use this 3rd level as my starting point because there are about 2.5 million paths from the root site I need to follow to extract data: 75 links on the first level, up to 50 on the 2nd level, and then up to 700 or so links on the 3rd level, and those 700 for each of the 50 for each of the 75 (roughly 75 x 50 x 700 = 2.6 million pages) are what I actually need to parse. Yes, that means I'm pulling data from over 2 million pages. How do I tell the spider to apply a different set of rules at each link depth?

Thanks!

Pablo Hoffman

Sep 1, 2012, 2:31:09 PM
to scrapy...@googlegroups.com
You can't do that with CrawlSpider, in case you were wondering. CrawlSpider is specifically designed for shallow/flat rules, so that every page on the site is treated the same way in terms of which links to follow.

What you can do is subclass BaseSpider (instead of CrawlSpider) and implement your custom logic there. If you think this will be a common pattern among the spiders of your project, the next step would be to implement a new base/generic spider that abstracts this logic and provides a way to declare multi-level rules. If you reach that point (you wrote the generic spider and are using it successfully with a few spiders in your project) and you believe the spider could be useful in general (outside your project), you can consider documenting it and contributing it to scrapy.contrib, so that it becomes a new built-in generic spider (like CrawlSpider is). This is just to illustrate the lifetime of a generic spider. As you can see, many spiders don't get enough momentum to make it into scrapy.contrib, but who knows... maybe MultiLevelCrawlSpider could be the next one :)
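
For illustration only, here is a minimal sketch of that idea, assuming the Scrapy API of that era (BaseSpider, SgmlLinkExtractor from scrapy.contrib) and using made-up start URLs, allow/deny patterns, and a hypothetical 'level' meta key to track depth. It is not a ready spider for your site, just one way to pick a different link extractor per crawl level:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MultiLevelSpider(BaseSpider):
    name = 'multilevel'
    start_urls = ['http://www.example.com/']  # hypothetical root page

    # One link extractor per crawl depth; the patterns below are placeholders
    # standing in for your "Rule 1/2/3" allow/deny expressions.
    extractors = {
        0: SgmlLinkExtractor(allow=r'\.asp$', deny=[r'sitemap']),  # links on the root
        1: SgmlLinkExtractor(allow=r'/section/'),                  # 2nd-level links
        2: SgmlLinkExtractor(allow=r'/detail/'),                   # 3rd-level links
    }
    max_depth = 3  # after 3 levels of following, parse the data pages

    def parse(self, response):
        depth = response.meta.get('level', 0)
        if depth >= self.max_depth:
            # Pages reached through the 3rd-level links: this is where the
            # actual item extraction (not shown here) would happen.
            return
        # Follow only the links that the extractor for this depth allows.
        for link in self.extractors[depth].extract_links(response):
            yield Request(link.url, callback=self.parse,
                          meta={'level': depth + 1})

Whether you track depth with your own meta key, as above, or reuse Scrapy's built-in depth tracking is a matter of taste; the point is simply that the callback decides which set of rules to apply based on how deep the response is.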

Best,
Pablo.
