How to login and then start a CrawlSpider with rules?


Fer

Jul 11, 2013, 1:48:57 PM
to scrapy...@googlegroups.com
Hi everyone!
I'm trying to combine the LoginSpider with a CrawlSpider, but I can't find a way. The idea is to log in first and then parse using the rules, but the LoginSpider example overrides the parse method, and the CrawlSpider docs say "if you override the parse method, the crawl spider will no longer work". I would be grateful if you could help me.

Paul Tremberth

Jul 11, 2013, 4:02:31 PM
to scrapy...@googlegroups.com
Hi
CrawlSpider has an overridable method parse_start_url() that could be used in your case (I think)

It's not mentioned in the docs for 0.16 (the links you provided), but it's in the code for 0.16 and 0.17.

It's called from CrawlSpider's parse() method when the first URLs are fetched and processed (that is, the start_urls you will define for your LoginSpider).

So I would try to define parse_start_url() just like in the LoginSpider example:

    def parse_start_url(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

Note: another user in the group recently had issues with this parse_start_url() method being called several times,
so be sure to define a callback that is NOT parse() for your Rules().
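
Roughly, the pieces could fit together like this (untested sketch; the URLs, form fields, link-extractor pattern and callback names are placeholders, and the plain Request returned from after_login() falls back to CrawlSpider's parse(), which is where the Rules are applied):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import FormRequest, Request

class MyLoginCrawlSpider(CrawlSpider):
    name = 'logincrawl'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/login']

    # the Rule callback is deliberately NOT parse()
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/items/'),
             callback='parse_item', follow=True),
    )

    # called for the responses to start_urls, before the Rules kick in
    def parse_start_url(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed")
            return
        # no explicit callback: this Request goes through CrawlSpider's
        # parse(), where the Rules extract and follow further links
        return Request('http://example.com/start_page')

    def parse_item(self, response):
        self.log("Parsing %s" % response.url)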

Tell us how it goes.

Paul.

Fer

Jul 12, 2013, 1:40:36 PM
to scrapy...@googlegroups.com
Thank you for your idea. I tried this option, but for some reason the login happened at the end, so the spider first parsed the pages and only then logged in.
As a second option, I tried writing a custom spider based on this example: https://scrapy.readthedocs.org/en/latest/topics/debug.html, and it worked well :)
"yield" worked like magic, and I could log in and parse all the pages. 

Paul Tremberth

Jul 12, 2013, 3:23:56 PM
to scrapy...@googlegroups.com
That's good news.
Would you mind sharing your working spider code, in case someone has the same use case in the future?
You can replace URLs and anonymize the Rules() if you want.

I'm curious why the suggested method would not work. Probably due to the way Requests are queued.
Do you still have the code for this initial trial?

Paul.

Fer

Jul 12, 2013, 5:47:17 PM
to scrapy...@googlegroups.com
This is the code:

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from myproject.items import MyItem

class MySpiderLogin(BaseSpider):
    name = 'spiderlogin'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/login']

    def parse(self, response):
        # submit the login form found on the start URL
        return [FormRequest.from_response(response, formnumber=1,
                    formdata={'username': 'user', 'password': 'pass'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        return Request("http://example.com/page_to_parse", callback=self.parse_logged)

    def parse_logged(self, response):
        self.log("\nHi, I'm collecting links in this page! %s\n" % response.url)

        hxs = HtmlXPathSelector(response)
        # the "Rules": extract the links to follow by hand
        url_list = hxs.select('//somethingtoget/@href').extract()

        for item_url in url_list:
            yield Request(url=item_url, callback=self.parse_item)

    def parse_item(self, response):
        self.log("\nHi, I'm parsing in this page! %s\n" % response.url)
        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item['desc'] = hxs.select('//somethingtogetdescription').extract()
        return item
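
Assuming a standard project layout with MyItem defined in myproject/items.py, it runs the usual way:

scrapy crawl spiderlogin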

Capi Etheriel

Jul 16, 2013, 7:12:07 PM
to scrapy...@googlegroups.com

Karen Oganesyan

Mar 26, 2014, 5:37:33 PM
to scrapy...@googlegroups.com
Hi everyone,
I faced the same situation (login and then crawling with a CrawlSpider) and solved it with an overridden parse_start_url (all the code below should be placed in your-spider-file.py):

# Here you set the start page, where your spider can log in
start_urls = ["http://forums.website.com"]

# Here you override parse_start_url() to log in on the website, setting a
# callback that checks whether everything went OK after the login
def parse_start_url(self, response):
    return [FormRequest.from_response(response,
                formdata={'login': 'myUsername', 'password': 'myPassword'},
                callback=self.after_login)]

# Here you do the after-login check, and if everything is OK, build a Request
# for the real start page from where your spider can start to crawl and parse
def after_login(self, response):
    if "Incorrect login or password" in response.body:
        self.log("### Login failed ###", level=log.ERROR)
        exit()
    else:
        self.log("### Successfully logged in! ###")
        request = Request(lnk)  # lnk holds the URL of the page to start crawling from
        return request

To make it work, don't forget to import the Request classes at the beginning of your spider file (plus scrapy's log module, which is used for log.ERROR above):
from scrapy.http import Request, FormRequest
from scrapy import log
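
For completeness, the skeleton these snippets sit in looks roughly like this (the class name, domain and link-extractor pattern below are just placeholders rather than my real spider, and note the Rule callback must not be parse()):

from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest

class ForumSpider(CrawlSpider):
    name = 'forum'
    allowed_domains = ['forums.website.com']
    start_urls = ["http://forums.website.com"]

    # The Rules are applied by CrawlSpider's parse(); the Request returned
    # by after_login() has no explicit callback, so it ends up there and
    # the crawl continues from that page
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/viewtopic'),
             callback='parse_topic', follow=True),
    )

    # parse_start_url() and after_login() from the snippets above go here

    def parse_topic(self, response):
        self.log("Parsing %s" % response.url)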

Hope it helps someone



On Wednesday, July 17, 2013 at 3:12:07 AM UTC+4, Capi Etheriel wrote:

Paul Tremberth

Mar 26, 2014, 5:46:39 PM
to scrapy...@googlegroups.com
Hi,

I recently posted an answer on StackOverflow with a way to combine login and CrawlSpider:

Feedback is welcome.

/Paul.