crawling after login page with scrapy


Ana Carolina Assis Jesus

Sep 10, 2013, 3:17:08 AM
to scrapy...@googlegroups.com
Hello!

I am new to the group and to Scrapy.
Last week I posted saying that I wasn't able to scrape an https page... but by Friday I managed to get past the login page!
YEY! :-D

Well, now that I am able to log in, I want to continue crawling, but the code complains that the parse function I call doesn't exist.
Do I need to define the function beforehand, or at least mention it in some settings page?
As far as I can see I did exactly the same thing as in a discussion page I was following, but then I get this error.

Below you can see my code and the error message!
Thanks for the help!
Cheers!
Ana


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request, FormRequest
from scrapy.item import Item, Field
from scrapy import log  # used below for log.ERROR

# Data Definition
class BlaItem(Item):
    title = Field()
    name_type = Field()
    name_value = Field()
    
# Spider Definition (Login Page Spider)
class LoginSpider(BaseSpider):
    name="blalogin"
    allowed_domains=["secure.blablabla.com"]
    
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body) 

        hxs = HtmlXPathSelector(response)
        bla_title = hxs.select("//title/text()").extract()
        bla_name = hxs.select("//h1/text()").extract() 
        bla_form = hxs.select("//form[@id = 'MainForm']/text()").extract()  
        print bla_title
        print bla_name
        print bla_form
        print ''

        return [FormRequest.from_response(response,
                formdata={'USERID': 'Username', 'Password': 'password'},
                callback=self.after_login)]

    def after_login(self, response):
        # check that the login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        else:
            print 'Login worked!!! YEY'
            return Request(url="https://secure.blablabla.com//Prod/BackOffice/Impersonation/MerchantList?",
                           callback=self.parse_impersonate)

def parse_impersonate(self, response):
    message = 'yes, I am here'
    print 'Impersonate here!'
    return message


ERROR MESSAGE
[...]
--- <exception caught here> ---
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
   current.result = callback(current.result, *args, **kw)
 File "/Users/ANA/Documents/Scraper/blablablaProject/blablablaProject/spiders/blablablaScrapy.py", line 73, in after_login
   return Request(url="https://secure.blablabla.com/Prod/BackOffice/Impersonation/MerchantList?",callback=self.parse_impersonate)
exceptions.AttributeError: 'LoginSpider' object has no attribute 'parse_impersonate'

Ana Carolina Assis Jesus

Sep 10, 2013, 3:20:04 AM
to scrapy...@googlegroups.com
Another question.

When I ask 

def after_login(self, response):
        # check that the login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        else:
            print 'Login worked!!! YEY'

Should the webpage I am working with give me the message "authentication failed", or is this generic for ANY Scrapy login?
Meaning, since I didn't get the message but went into the else branch, I assumed I had logged in successfully... Is it possible that I just think I did, but in fact did not?

Thanks!
Ana

Ana Carolina Assis Jesus

Sep 10, 2013, 3:30:49 AM
to scrapy...@googlegroups.com
Sorry guys!

Me again!
I forgot to add the top of the error message:

[...]
--- <exception caught here> ---
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/ANA/Documents/Scraper/blablablaProject/blablablaProject/spiders/blablablaScrapy.py", line 73, in after_login
    return Request(url="https://secure.blablabla.com/Prod/BackOffice/Impersonation/MerchantList?",callback=self.parse_impersonate)
exceptions.AttributeError: 'LoginSpider' object has no attribute 'parse_impersonate'



Paul Tremberth

Sep 10, 2013, 4:53:17 AM
to scrapy...@googlegroups.com
Hi Ana Carolina,

I'm not sure how well Google Groups preserves code formatting,
so make sure def parse_impersonate is at the same indentation level as the other "def after_login..." and "def parse..." inside the class.
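Roughly, the skeleton should look like this, with all three def lines indented the same way under the class (just a sketch, method bodies trimmed):

class LoginSpider(BaseSpider):
    name = "blalogin"
    allowed_domains = ["secure.blablabla.com"]

    def parse(self, response):
        # submit the login form
        return [FormRequest.from_response(response,
                formdata={'USERID': 'Username', 'Password': 'password'},
                callback=self.after_login)]

    def after_login(self, response):
        # check the login, then request the next page
        return Request(url="https://secure.blablabla.com//Prod/BackOffice/Impersonation/MerchantList?",
                       callback=self.parse_impersonate)

    def parse_impersonate(self, response):
        # indented like the methods above, so it becomes an attribute of LoginSpider
        print 'Impersonate here!'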

You can also post your code to Pastebin or gist.github.com and share the link here; that's usually better for code review.

Paul.



Paul Tremberth

Sep 10, 2013, 5:04:34 AM
to scrapy...@googlegroups.com
I highly doubt "authentication failed" is the only message you could get after a failed login attempt; it is just an example.

You should adapt the check to your specific case, and maybe print part of the response body to see what you actually got back;
for example, in after_login, print response.body[:512] to start with.

It's usually safer to check for a specific HTML element, or a specific piece of text (note that "Authentication Failed" is a possible variation).
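
For instance, something along these lines (just a sketch; the XPath and the marker are placeholders you would have to adapt to the actual page):

    def after_login(self, response):
        # dump the start of the page so you can see what actually came back
        print response.body[:512]

        hxs = HtmlXPathSelector(response)
        # placeholder check: look for something that only exists when logged in,
        # e.g. a logout link -- adapt the XPath to the real page
        if not hxs.select("//a[contains(@href, 'logout')]"):
            self.log("Login failed", level=log.ERROR)
            return
        return Request(url="https://secure.blablabla.com//Prod/BackOffice/Impersonation/MerchantList?",
                       callback=self.parse_impersonate)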

Hope this helps.
Paul.

Ana Carolina Assis Jesus

Sep 11, 2013, 2:15:40 AM
to scrapy...@googlegroups.com
Hi Paul,

Thanks for the answer.
But the indentation was ok.

I made a few small changes, trying to print a specific message in case I didn't log in, but nothing worked. It still looks as if I logged in.
It even gives me the next path in the HTML as if I had logged in, but I can't do anything.

After changing it to just request the page directly from outside, however, it now gives me a new error message:


Traceback (most recent call last):
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
   self.runUntilCurrent()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
   call.func(*call.args, **call.kw)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 368, in callback
   self._startRunCallbacks(result)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 464, in _startRunCallbacks
   self._runCallbacks()
--- <exception caught here> ---
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
   current.result = callback(current.result, *args, **kw)
 File "/Users/ajesus/Documents/Scraper/ogoneProject/ogoneProject/spiders/ogoneTest.py", line 36, in parse
   return [FormRequest.from_response(response,formdata={'userid': 'BLABLABLA', 'PSWD': 'titititi'},callback=self.after_login)]
 File "/Library/Python/2.7/site-packages/Scrapy-0.18.2-py2.7.egg/scrapy/http/request/form.py", line 36, in from_response
   form = _get_form(response, formname, formnumber, formxpath)
 File "/Library/Python/2.7/site-packages/Scrapy-0.18.2-py2.7.egg/scrapy/http/request/form.py", line 55, in _get_form
   raise ValueError("No <form> element found in %s" % response)


So, in the end it didn't log in? I still can't completely understand it.

Someone also told me that I might need to grab the &CSRFTS and &CSRFKEY parameters that you see in the URL.
Any idea how to do this?
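Would it be something along these lines? (Just a guess, assuming the values also show up as query parameters of the login-page URL; FormRequest.from_response already copies hidden <input> fields from the form by itself, so this may not even be needed.)

    # at the top of the file: import urlparse

    def parse(self, response):
        # parameter names taken from the URL mentioned above; adapt as needed
        params = urlparse.parse_qs(urlparse.urlparse(response.url).query)
        csrfts = params.get('CSRFTS', [''])[0]
        csrfkey = params.get('CSRFKEY', [''])[0]
        return [FormRequest.from_response(response,
                formdata={'userid': 'BLABLABLA', 'PSWD': 'titititi',
                          'CSRFTS': csrfts, 'CSRFKEY': csrfkey},
                callback=self.after_login)]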

Thanks!
Ana