How do I follow links in RSS feeds


John

Nov 9, 2012, 5:26:11 AM
to scrapy...@googlegroups.com
I'm not sure what the best way to do this is. Scrapy comes with a feed spider, but according to the docs it is regex based. I would prefer a real parser, such as the feedparser Python module.

Specifically, what I need to do is parse RSS feeds and, depending on the content field, follow some of the links, not all of them.

How would I do this? I could do all of this manually, but of course I would like to take advantage of Scrapy's built-in scheduling and deployment features.

Which spider should I start with? Which methods should I implement?

Steven Almeroth

Nov 10, 2012, 2:26:31 PM
to scrapy...@googlegroups.com
The XMLFeedSpider attribute iterator can be set to use either a regex engine or a markup parser:
  • iterator = 'iternodes'  # regex based: fast, doesn't load the whole DOM into memory at once
  • iterator = 'html'  # markup parser, but needs to load the entire DOM into memory
  • iterator = 'xml'  # markup parser, but needs to load the entire DOM into memory
Try subclassing XMLFeedSpider and overriding the process_results and parse_node methods.

John

Nov 12, 2012, 8:33:32 AM
to scrapy...@googlegroups.com
OK, maybe I didn't use the most accurate phrasing.

The XMLFeedSpider has three possibilities for parsing, but I would prefer to use the feedparser module instead: it is a more full-featured parser.

It appears to me that I cannot extend the XMLFeedSpider, as it only offers a choice between three predefined parsers.
Maybe I will need to extend the BaseSpider instead.

I looked into the CrawlSpider implementation, but it pulls in functionality from quite a few places, and it also feels like too involved an example for a beginner.

Let's say I start with the base spider. What's confusing me is that I don't understand how link following is achieved. Is it by yielding Request objects from the parse() method? Which spider will then handle them?

What I take from the documentation is that my parse() method should return a Request object if I want to continue spidering, an Item object if I want to save something, and nothing if I do not want to do anything further. Is this correct?

Steven Almeroth

Nov 12, 2012, 4:14:41 PM
to scrapy...@googlegroups.com
On Monday, November 12, 2012 7:33:32 AM UTC-6, John wrote:

It appears to me that I cannot extend the XMLFeedSpider, as it only offers a choice between three predefined parsers.

Of course you can extend the feed spider:

class MyFeedSpider(XMLFeedSpider):
 
Maybe I will need to extend the BaseSpider instead. I looked into the CrawlSpider implementation, but it pulls in functionality from quite a few places, and it also feels like too involved an example for a beginner.

It's not a bad idea to start with the basics.

Let's say I start with the base spider. What's confusing me is that I don't understand how link following is achieved. Is it by yielding Request objects from the parse() method?

Yes, that's right: Requests are generated automatically for the URLs in start_urls, and parse() is the default response callback.
 
Which spider will then handle them?

Well, we are talking about only one spider here: the Requests are made from, and the callback functions are all defined in, this one spider.
 
What I take from the documentation is that my parse() method should return a Request object if I want to continue spidering, an Item object if I want to save something, and nothing if I do not want to do anything further. Is this correct?

This is exactly correct.

John

Nov 14, 2012, 11:42:15 AM
to scrapy...@googlegroups.com
Thank you Steven. This was not obvious to me at first.