crawling after login page with scrapy


Ana Carolina Assis Jesus

Sep 10, 2013, 3:17:08 AM
to scrapy...@googlegroups.com
Hello!

I am new to the group and to Scrapy.
Last week I posted saying that I wasn't able to scrape an https page... but by Friday I managed to get past the login page!
YEY! :-D

Well, now that I am able to log in, I want to continue crawling, but the code complains that the parse function I call doesn't exist.
Do I need to define the function beforehand, or at least mention it in some settings page?
As far as I can see I did exactly the same thing as in a discussion page I was following, but then I get this error.

Below you can see my code and the error message!
Thanks for the help!
Cheers!
Ana


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request, FormRequest
from scrapy.item import Item, Field
from scrapy import log  # used below for log.ERROR

# Data Definition
class BlaItem(Item):
    title = Field()
    name_type = Field()
    name_value = Field()
    
# Spider Definition (Login Page Spider)
class LoginSpider(BaseSpider):
    name="blalogin"
    allowed_domains=["secure.blablabla.com"]
    
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body) 

        hxs = HtmlXPathSelector(response)
        bla_title = hxs.select("//title/text()").extract()
        bla_name = hxs.select("//h1/text()").extract() 
        bla_form = hxs.select("//form[@id = 'MainForm']/text()").extract()  
        print bla_title
        print bla_name
        print bla_form
        print ''

        return [FormRequest.from_response(response,
                formdata={'USERID': 'Username', 'Password': 'password'},
                callback=self.after_login)]

    def after_login(self, response):
        # check that the login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        else:
            print 'Login worked!!! YEY'
            return Request(url="https://secure.blablabla.com//Prod/BackOffice/Impersonation/MerchantList?",
                           callback=self.parse_impersonate)

def parse_impersonate(self, response):
    message = 'yes, I am here'
    print 'Impersonate here!'
    return message


ERROR MESSAGE
[...]
--- <exception caught here> ---
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
   current.result = callback(current.result, *args, **kw)
 File "/Users/ANA/Documents/Scraper/blablablaProject/blablablaProject/spiders/blablablaScrapy.py", line 73, in after_login
   return Request(url="https://secure.blablabla.com/Prod/BackOffice/Impersonation/MerchantList?",callback=self.parse_impersonate)
exceptions.AttributeError: 'LoginSpider' object has no attribute 'parse_impersonate'

Ana Carolina Assis Jesus

Sep 10, 2013, 3:20:04 AM
to scrapy...@googlegroups.com
Another question.

When I ask 

def after_login(self, response):
        # check that the login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        else:
            print 'Login worked!!! YEY'

Should the webpage I am working with give me the message "authentication failed", or is this generic for ANY Scrapy login?
Meaning, since I didn't get the message but went into the else branch, I assumed I had logged in successfully... Is it possible that I just think I did, but in fact did not?

Thanks!
Ana

Ana Carolina Assis Jesus

Sep 10, 2013, 3:30:49 AM
to scrapy...@googlegroups.com
Sorry guys!

Me again!
I forgot to add the top of the error message:

[...]
--- <exception caught here> ---
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/ANA/Documents/Scraper/blablablaProject/blablablaProject/spiders/blablablaScrapy.py", line 73, in after_login
    return Request(url="https://secure.blablabla.com/Prod/BackOffice/Impersonation/MerchantList?",callback=self.parse_impersonate)
exceptions.AttributeError: 'LoginSpider' object has no attribute 'parse_impersonate'



Paul Tremberth

Sep 10, 2013, 4:53:17 AM
to scrapy...@googlegroups.com
Hi Ana Carolina,

I'm not sure how well Google Groups preserves code formatting,
so make sure def parse_impersonate is at the same indentation level as the other "def after_login..." and "def parse..." inside the class.
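Roughly, the skeleton should look like this, with all three def lines indented the same way under the class (just a sketch, method bodies trimmed):

class LoginSpider(BaseSpider):
    name = "blalogin"
    allowed_domains = ["secure.blablabla.com"]

    def parse(self, response):
        # submit the login form
        return [FormRequest.from_response(response,
                formdata={'USERID': 'Username', 'Password': 'password'},
                callback=self.after_login)]

    def after_login(self, response):
        # check the login, then request the next page
        return Request(url="https://secure.blablabla.com//Prod/BackOffice/Impersonation/MerchantList?",
                       callback=self.parse_impersonate)

    def parse_impersonate(self, response):
        # indented like the methods above, so it becomes an attribute of LoginSpider
        print 'Impersonate here!'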

You can also post your code to Pastebin or gist.github.com and share the link here; that's usually better for code review.

Paul.



Paul Tremberth

Sep 10, 2013, 5:04:34 AM
to scrapy...@googlegroups.com
I highly doubt "authentication failed" is the only message you could get after a failed login attempt; it is just an example.

You should adapt the check to your specific case, and maybe print part of the response body to see what you actually got back;
for example, in after_login, print response.body[:512] to start with.

It's usually safer to check for a specific HTML element, or a specific piece of text (note that "Authentication Failed" is a possible variation).
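
For instance, something along these lines (just a sketch; the XPath and the marker are placeholders you would have to adapt to the actual page):

    def after_login(self, response):
        # dump the start of the page so you can see what actually came back
        print response.body[:512]

        hxs = HtmlXPathSelector(response)
        # placeholder check: look for something that only exists when logged in,
        # e.g. a logout link -- adapt the XPath to the real page
        if not hxs.select("//a[contains(@href, 'logout')]"):
            self.log("Login failed", level=log.ERROR)
            return
        return Request(url="https://secure.blablabla.com//Prod/BackOffice/Impersonation/MerchantList?",
                       callback=self.parse_impersonate)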

Hope this helps.
Paul.

Ana Carolina Assis Jesus

Sep 11, 2013, 2:15:40 AM
to scrapy...@googlegroups.com
Hi Paul,

Thanks for the answer.
But the indentation was ok.

I made a few small changes, trying to print a specific message in case I didn't log in, but nothing worked. It still looks as if I logged in.
It even gives me the next path in the HTML as if I had logged in, but I can't do anything.

After changing it to just request the page directly from outside, however, it now gives me a new error message:


Traceback (most recent call last):
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
   self.runUntilCurrent()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
   call.func(*call.args, **call.kw)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 368, in callback
   self._startRunCallbacks(result)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 464, in _startRunCallbacks
   self._runCallbacks()
--- <exception caught here> ---
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
   current.result = callback(current.result, *args, **kw)
 File "/Users/ajesus/Documents/Scraper/ogoneProject/ogoneProject/spiders/ogoneTest.py", line 36, in parse
   return [FormRequest.from_response(response,formdata={'userid': 'BLABLABLA', 'PSWD': 'titititi'},callback=self.after_login)]
 File "/Library/Python/2.7/site-packages/Scrapy-0.18.2-py2.7.egg/scrapy/http/request/form.py", line 36, in from_response
   form = _get_form(response, formname, formnumber, formxpath)
 File "/Library/Python/2.7/site-packages/Scrapy-0.18.2-py2.7.egg/scrapy/http/request/form.py", line 55, in _get_form
   raise ValueError("No <form> element found in %s" % response)


So, in the end it didn't log in? I still can't completely understand it.

Someone also told me that I might need to grab the &CSRFTS and &CSRFKEY parameters that you see in the URL.
Any idea how to do this?
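Would it be something along these lines? (Just a guess, assuming the values also show up as query parameters of the login-page URL; FormRequest.from_response already copies hidden <input> fields from the form by itself, so this may not even be needed.)

    # at the top of the file: import urlparse

    def parse(self, response):
        # parameter names taken from the URL mentioned above; adapt as needed
        params = urlparse.parse_qs(urlparse.urlparse(response.url).query)
        csrfts = params.get('CSRFTS', [''])[0]
        csrfkey = params.get('CSRFKEY', [''])[0]
        return [FormRequest.from_response(response,
                formdata={'userid': 'BLABLABLA', 'PSWD': 'titititi',
                          'CSRFTS': csrfts, 'CSRFKEY': csrfkey},
                callback=self.after_login)]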

Thanks!
Ana