request getting filtered?


Bill Ebeling

Oct 2, 2013, 8:50:58 AM
to scrapy...@googlegroups.com
Hi all,

I've got a method in a spider that recursively walks from the top menu down to a product index page and, when it hits bottom, sends that page off to another method.  The thing is, that other method doesn't seem to get the response.

Here's the line that I think is causing the problem:

                if self.is_index(response):
                        yield Request(url=response.url, callback=self.get_products_from_index)


I think submitting response.url is getting filtered by something that prevents multiple requests for the same page?

I tried previously to send the response directly to the method:

                if self.is_index(response):
                        yield self.get_products_from_index(response)


but that produces this error: ERROR: Spider must return Request, BaseItem or None, got 'generator'

Anyone try something like this?  Is there a good way of getting a response, then passing it to another method that I'm simply overlooking?

Thanks,

Bill

Paul Tremberth

Oct 2, 2013, 8:55:20 AM
to scrapy...@googlegroups.com
Hi,

You could try setting dont_filter=True on your request.

If that doesn't work, you'll probably need to share more code (the type of spider, maybe your settings) and your console log.
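
For context, here is a rough sketch of what a duplicate filter does and why dont_filter gets around it. It is a simplified model in plain Python, not Scrapy's own code (the real filter compares request fingerprints rather than bare URLs); ToyRequest, ToyScheduler and the example URL are hypothetical names made up for illustration.

```python
class ToyRequest:
    """Hypothetical stand-in for a request: a URL plus a dont_filter flag."""

    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter


class ToyScheduler:
    """Simplified model of a scheduler with a duplicate filter (not Scrapy's code)."""

    def __init__(self):
        self.seen = set()   # URLs that have already been accepted
        self.queue = []     # requests waiting to be downloaded

    def enqueue(self, request):
        # Drop a request whose URL was already seen,
        # unless the request explicitly opts out of filtering.
        if not request.dont_filter and request.url in self.seen:
            return False
        self.seen.add(request.url)
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
url = "http://example.com/index"  # assumed URL, for illustration only

first = scheduler.enqueue(ToyRequest(url))                     # accepted
repeat = scheduler.enqueue(ToyRequest(url))                    # silently dropped
forced = scheduler.enqueue(ToyRequest(url, dont_filter=True))  # accepted again

print(first, repeat, forced)
```

In this model the second request for the same URL never reaches the queue, which would match the symptom of a callback that is never called. The forced request is queued again, so the page would be downloaded a second time.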

Paul.

Bill Ebeling

Oct 2, 2013, 9:01:53 AM
to scrapy...@googlegroups.com
dont_filter did the trick!

I wish there was a way to do it that didn't mean making the request twice.  Seems hacky to me.

Thanks though!

Paul Tremberth

Oct 2, 2013, 9:15:21 AM
to scrapy...@googlegroups.com
dont_filter is useful when you *know* the same URL will return different content, most typically when the content is dynamically generated.

If it's the same content/page that you want to parse with 2 methods, to extract different items or requests,
you could have one method that calls the 2 other methods.

We'd probably need to see more of your spider code to guide you through this, or to spot anything fishy.
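
A minimal sketch of that idea, in plain Python so it runs without a crawl: one callback passes the same response to two helper generators and re-yields whatever they produce. ToyResponse, extract_links, extract_items and parse_page are hypothetical names, and the tuples stand in for the requests and items a real spider would yield.

```python
class ToyResponse:
    """Hypothetical stand-in for a downloaded page."""

    def __init__(self, url, links, prices):
        self.url = url
        self.links = links
        self.prices = prices


def extract_links(response):
    # First helper: yields one "follow this link" result per link on the page.
    for link in response.links:
        yield ("follow", link)


def extract_items(response):
    # Second helper: yields one "item" result per price found on the page.
    for price in response.prices:
        yield ("item", response.url, price)


def parse_page(response):
    # Single callback: both helpers work on the same response object,
    # so the page only has to be downloaded once.
    for result in extract_links(response):
        yield result
    for result in extract_items(response):
        yield result


page = ToyResponse("http://example.com/p", ["http://example.com/p2"], [9.99])
results = list(parse_page(page))
print(results)
```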

Cheers,
Paul.

Bill Ebeling

Oct 2, 2013, 9:25:40 AM
to scrapy...@googlegroups.com
Well, sure.

Here's the heart of the spider:

        def parse(self, response):
                '''Hits the landing page sends it straight for menu crawling'''
                print "in parse for %s" % response.url
                hxs = HtmlXPathSelector(response)
                top_menus = hxs.select( initVals['nav_xpath']['follow']).extract()

                for top_menu in top_menus:
                        next_menu=urljoin(self.base_url, self.mid_url +  top_menu)
                        print 'next_menu: %s' % next_menu
                        yield Request(url=next_menu, callback=self.drill_down)


        def drill_down(self, response):
                '''receives a response, looks for menus to follow'''
                print "drilling down in %s" % response.url
                hxs=HtmlXPathSelector(response)
                if self.is_index(response):# we've struck oil!
                        # dont_filter here because we've made this request once already.  There's probably a better way.
                        yield Request(url=response.url, callback=self.get_products_from_index, dont_filter=True)
                else:
                        if not initVals['nav_xpath']['submenu_xpath']:
                                pass #we've reached a dead end
                        else:
                                # this tries every submenu xpath on every page..  might want to restrict it to prevent a lot of redundant crawling
                                # .extract() turns the selected nodes into strings so they can be joined into URLs
                                next_pages = hxs.select(initVals['nav_xpath']['submenu_xpath']).extract()
                                # try to join everything, won't harm absolute paths if domain is the same (and it should be)
                                for page in next_pages:
                                        follow_url = urljoin(self.base_url, self.mid_url + page)
                                        print "Following %s" % follow_url
                                        yield Request(url=follow_url, callback=self.drill_down)

        def get_products_from_index(self, response):
                '''determines if this is the only page with products, if not, gets the next page, too'''
                print 'getting products from index page: %s' % response.url
                hxs = HtmlXPathSelector(response)
                # gather items from page, send to harvest
                for product_page in hxs.select(initVals['nav_xpath']['product_pages']).extract():
                        yield Request(url=product_page, callback=self.harvest)
                # if there are more pages, get them
                if self.has_next(response):
                        next_index=initVals['nav_xpath']['next_index']
                        follow_url=urljoin(self.base_url, self.mid_url + next_index)
                        yield Request(url=follow_url, callback=self.get_products_from_index)


What happens:

parse gets the landing page, takes the top-level menus and requests each one with drill_down as the callback.
drill_down then checks whether there are products on the page. If yes, it sends that page (via a Request) to get_products_from_index; otherwise it tries to find another level of menus and requests those pages with itself as the callback.
get_products_from_index grabs all the product links and sends those pages to harvest (which just fills in the item()), then checks for a 'next' option. If there is one, it requests the next page with itself as the callback.
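
The flow above can be sketched as a small simulation. This is only a model under stated assumptions: the SITE dict is a made-up site map, pages are looked up directly instead of being downloaded, and the recursive calls stand in for the Requests the real spider would yield.

```python
# Hypothetical site map: menu pages list their submenus,
# index pages list products and optionally a "next" page.
SITE = {
    "/": {"menus": ["/tools", "/toys"]},
    "/tools": {"menus": ["/tools/hammers"]},
    "/tools/hammers": {"products": ["claw", "sledge"], "next": "/tools/hammers?page=2"},
    "/tools/hammers?page=2": {"products": ["mallet"]},
    "/toys": {"products": ["robot"]},
}


def is_index(page):
    # A page counts as a product index if it lists products.
    return "products" in SITE[page]


def get_products_from_index(page):
    # Yield every product on this index page, then continue with the next page if any.
    for product in SITE[page]["products"]:
        yield product
    next_page = SITE[page].get("next")
    if next_page:
        for product in get_products_from_index(next_page):
            yield product


def drill_down(page):
    # Index pages go to get_products_from_index; menu pages are walked one level deeper.
    if is_index(page):
        for product in get_products_from_index(page):
            yield product
    else:
        for menu in SITE[page].get("menus", []):
            for product in drill_down(menu):
                yield product


found = list(drill_down("/"))
print(found)
```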

The concept:

What I'm trying to do is write a sort of 'omni' crawler that will take a predefined dict and crawl a bunch of 'easy' pages.  This version should be able to crawl mostly static pages without ajax.  If I get this working, I'll look at dynamic pages in the next iteration.

Thanks for taking a look!

B

Paul Tremberth

Oct 2, 2013, 9:36:41 AM
to scrapy...@googlegroups.com
It seems to me you could simply call the callback with the response
and yield whatever it generates:

                if self.is_index(response):# we've struck oil!
                        for something in self.get_products_from_index(response):
                            yield something
                else:
                        if not initVals['nav_xpath']['submenu_xpath']:
                                pass #we've reached a dead end
                        else:

Does that work?
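
To show why the loop matters, here is a small plain-Python sketch (no Scrapy involved; the dict-based response and the function names are hypothetical). Yielding the result of a generator function hands back a single generator object, which is what the "got 'generator'" error was complaining about, while looping over it yields the individual results. On Python 3.3 and later, `yield from` is a shorthand for the same loop.

```python
def get_products(response):
    # Stand-in callback: a generator that yields one result per product.
    for name in response["products"]:
        yield "request for " + name


def drill_down_wrong(response):
    # Yields the generator object itself: one value that is
    # neither a request nor an item.
    yield get_products(response)


def drill_down_loop(response):
    # Re-yields each result produced by the other callback.
    for something in get_products(response):
        yield something


def drill_down_yield_from(response):
    # Python 3.3+ shorthand for the loop above.
    yield from get_products(response)


response = {"products": ["a", "b"]}
wrong = list(drill_down_wrong(response))
looped = list(drill_down_loop(response))
delegated = list(drill_down_yield_from(response))
print(len(wrong), looped, delegated)
```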

Bill Ebeling

Oct 2, 2013, 9:55:49 AM
to scrapy...@googlegroups.com
Brilliant!

That did do it, thanks so much.