How do I follow links in RSS feeds


John

Nov 9, 2012, 5:26:11 AM
to scrapy...@googlegroups.com
I'm not sure what the best way to do this is. Scrapy comes with a feed spider, but according to the docs it is regex based. I would prefer a real parser, such as the feedparser Python module.

Specifically, what I need to do is parse RSS feeds and, depending on the content field, follow some of the links, not all of them.

How would I do this? I could do all of this manually, but of course I would like to take advantage of Scrapy's built-in scheduling and deployment features.

Which spider should I start with? Which methods should I implement?

Steven Almeroth

Nov 10, 2012, 2:26:31 PM
to scrapy...@googlegroups.com
The XMLFeedSpider attribute iterator can be set to use either a regex engine or a markup parser:
  • iterator = 'iternodes'  # regex based: fast, doesn't load the whole DOM into memory at once
  • iterator = 'html'  # markup parser, but needs to load the entire DOM into memory
  • iterator = 'xml'  # markup parser, but needs to load the entire DOM into memory
Try subclassing XMLFeedSpider and overriding the process_results and parse_node methods.

John

Nov 12, 2012, 8:33:32 AM
to scrapy...@googlegroups.com
OK, maybe I didn't use the most accurate phrasing.

The XMLFeedSpider has three possibilities for parsing, but I would prefer to use the feedparser module instead: it is a more full-featured parser.

It appears to me that I cannot extend the XMLFeedSpider, as it only offers a choice between three predefined parsers.
Maybe I will need to extend the BaseSpider instead.

I looked into the CrawlSpider implementation, but it pulls in functionality from quite a few places, and it also feels like too involved an example for a beginner.

Let's say I start with the base spider. What's confusing me is that I don't understand how link following is achieved. Is it by yielding Request objects from the parse() method? Which spider will then handle them?

What I take from the documentation is that my parse() method should return a Request object if I want to continue spidering, an Item object if I want to save something, and nothing if I do not want to do anything further. Is this correct?

Steven Almeroth

Nov 12, 2012, 4:14:41 PM
to scrapy...@googlegroups.com
On Monday, November 12, 2012 7:33:32 AM UTC-6, John wrote:

It appears to me that I cannot extend the XMLFeedSpider, as it only offers a choice between three predefined parsers.

Of course you can extend the feed spider:

class MyFeedSpider(XMLFeedSpider):
 
Maybe I will need to extend the BaseSpider instead. I looked into the CrawlSpider implementation, but it pulls in functionality from quite a few places, and it also feels like too involved an example for a beginner.

It's not a bad idea to start with the basics.

Let's say I start with the base spider. What's confusing me is that I don't understand how link following is achieved. Is it by yielding Request objects from the parse() method?

Yes, that's right: Requests are generated automatically for the URLs in start_urls, and parse() is the default response callback.
 
Which spider will then handle them?

Well, we are talking about only one spider here: the Requests are made from, and the callback functions are all defined in, this one spider.
 
What I take from the documentation is that my parse() method should return a Request object if I want to continue spidering, an Item object if I want to save something, and nothing if I do not want to do anything further. Is this correct?

This is exactly correct.

John

Nov 14, 2012, 11:42:15 AM
to scrapy...@googlegroups.com
Thank you Steven. This was not obvious to me at first.