How to login and then start a CrawlSpider with rules?


Fer

Jul 11, 2013, 1:48:57 PM
to scrapy...@googlegroups.com
Hi everyone!
I'm trying to combine the LoginSpider with a CrawlSpider, but I can't find a way. The idea is to log in first and then parse using the rules, but the LoginSpider example overrides the parse method, and the CrawlSpider docs say "if you override the parse method, the crawl spider will no longer work". I would be grateful if you could help me.

Paul Tremberth

Jul 11, 2013, 4:02:31 PM
to scrapy...@googlegroups.com
Hi
CrawlSpider has an overridable method parse_start_url() that could be used in your case (I think)

It's not mentioned in the docs for 0.16 (the links you provided), but it's in the code for 0.16 and 0.17.

It's called from CrawlSpider's parse() method when the first URLs are fetched and processed (that is, the start_urls you will define for your LoginSpider).

So I would try to define parse_start_url() just like in the LoginSpider example:

    def parse_start_url(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

Note: another user in the group recently had issues with this parse_start_url() method being called several times,
so be sure to define a callback that is NOT parse() for your Rules().
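
Roughly, the pieces could fit together like this (untested sketch; the URLs, form fields, link-extractor pattern and callback names are placeholders, and the plain Request returned from after_login() falls back to CrawlSpider's parse(), which is where the Rules are applied):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import FormRequest, Request

class MyLoginCrawlSpider(CrawlSpider):
    name = 'logincrawl'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/login']

    # the Rule callback is deliberately NOT parse()
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/items/'),
             callback='parse_item', follow=True),
    )

    # called for the responses to start_urls, before the Rules kick in
    def parse_start_url(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed")
            return
        # no explicit callback: this Request goes through CrawlSpider's
        # parse(), where the Rules extract and follow further links
        return Request('http://example.com/start_page')

    def parse_item(self, response):
        self.log("Parsing %s" % response.url)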

Tell us how it goes.

Paul.

Fer

Jul 12, 2013, 1:40:36 PM
to scrapy...@googlegroups.com
Thank you for your idea. I tried this option, but for some reason the login happened at the end, so the spider first parsed the pages and only then logged in.
As a second option, I tried writing a custom spider based on this example: https://scrapy.readthedocs.org/en/latest/topics/debug.html, and it worked well :)
"yield" worked like magic, and I could log in and parse all the pages. 

Paul Tremberth

Jul 12, 2013, 3:23:56 PM
to scrapy...@googlegroups.com
That's good news.
Would you mind sharing your working spider code, in case someone has the same use case in the future?
You can replace URLs and anonymize the Rules() if you want.

I'm curious why the suggested method would not work. Probably due to the way Requests are queued.
Do you still have the code for this initial trial?

Paul.

Fer

Jul 12, 2013, 5:47:17 PM
to scrapy...@googlegroups.com
This is the code:

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from myproject.items import MyItem

class MySpiderLogin(BaseSpider):
    name = 'spiderlogin'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/login']

    def parse(self, response):
        # submit the login form found on the start URL
        return [FormRequest.from_response(response, formnumber=1,
                    formdata={'username': 'user', 'password': 'pass'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        return Request("http://example.com/page_to_parse", callback=self.parse_logged)

    def parse_logged(self, response):
        self.log("\nHi, I'm collecting links in this page! %s\n" % response.url)

        hxs = HtmlXPathSelector(response)
        # the "Rules": extract the links to follow by hand
        url_list = hxs.select('//somethingtoget/@href').extract()

        for item_url in url_list:
            yield Request(url=item_url, callback=self.parse_item)

    def parse_item(self, response):
        self.log("\nHi, I'm parsing in this page! %s\n" % response.url)
        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item['desc'] = hxs.select('//somethingtogetdescription').extract()
        return item
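
Assuming a standard project layout with MyItem defined in myproject/items.py, it runs the usual way:

scrapy crawl spiderlogin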

Capi Etheriel

Jul 16, 2013, 7:12:07 PM
to scrapy...@googlegroups.com

Karen Oganesyan

Mar 26, 2014, 5:37:33 PM
to scrapy...@googlegroups.com
Hi everyone,
I faced the same situation (login and then crawling with a CrawlSpider) and solved it with an overridden parse_start_url (all the code below should be placed in your-spider-file.py):

# Here you set the start page, where your spider can log in
start_urls = ["http://forums.website.com"]

# Here you override parse_start_url() to log in on the website, setting a
# callback that checks whether everything went OK after the login
def parse_start_url(self, response):
    return [FormRequest.from_response(response,
                formdata={'login': 'myUsername', 'password': 'myPassword'},
                callback=self.after_login)]

# Here you do the after-login check, and if everything is OK, build a Request
# for the real start page from where your spider can start to crawl and parse
def after_login(self, response):
    if "Incorrect login or password" in response.body:
        self.log("### Login failed ###", level=log.ERROR)
        exit()
    else:
        self.log("### Successfully logged in! ###")
        request = Request(lnk)  # lnk holds the URL of the page to start crawling from
        return request

To make it work, don't forget to import the Request classes at the beginning of your spider file (plus scrapy's log module, which is used for log.ERROR above):
from scrapy.http import Request, FormRequest
from scrapy import log
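
For completeness, the skeleton these snippets sit in looks roughly like this (the class name, domain and link-extractor pattern below are just placeholders rather than my real spider, and note the Rule callback must not be parse()):

from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest

class ForumSpider(CrawlSpider):
    name = 'forum'
    allowed_domains = ['forums.website.com']
    start_urls = ["http://forums.website.com"]

    # The Rules are applied by CrawlSpider's parse(); the Request returned
    # by after_login() has no explicit callback, so it ends up there and
    # the crawl continues from that page
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/viewtopic'),
             callback='parse_topic', follow=True),
    )

    # parse_start_url() and after_login() from the snippets above go here

    def parse_topic(self, response):
        self.log("Parsing %s" % response.url)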

Hope it helps someone



On Wednesday, July 17, 2013 at 3:12:07 AM UTC+4, Capi Etheriel wrote:

Paul Tremberth

Mar 26, 2014, 5:46:39 PM
to scrapy...@googlegroups.com
Hi,

I recently posted an answer on StackOverflow with a way to combine login and CrawlSpider:

Feedback is welcome.

/Paul.