I want Scrapy to crawl pages where the link to go to the next page looks
like this:
<a href="#" onclick="return gotoPage('2');"> Next </a>
Will Scrapy be able to interpret the JavaScript code in that?
With the Live HTTP Headers extension I found out that clicking Next
generates a POST with a really huge piece of "garbage" starting like
this: encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n
I am trying to build my spider on the CrawlSpider class, but I can't
really figure out how to code it. With BaseSpider I used the parse()
method to process the first URL, which happens to be a login form,
where I did a POST with:
def logon(self, response):
    login_form_data = {'email': 'us...@example.com', 'password': 'mypass22',
                       'action': 'sign-in'}
    return [FormRequest.from_response(response, formnumber=0,
                                      formdata=login_form_data,
                                      callback=self.submit_next)]
And then I defined submit_next() to say what to do next. What I can't
figure out is how to tell CrawlSpider which method to use on the first
URL.
All requests in my crawl, except the first one, are POST requests.
They alternate between two types of requests: pasting some data, and
clicking "Next" to go to the next page.
2) Not sure what you mean by `first URL` in the penultimate paragraph.
Is it the start_urls or something else?
3) If you want to set the method of a request, you can always pass it
as a parameter to the Request constructor,
e.g. Request(url, method='POST') (see the sketch after this list).
4) I don't really understand what problem you're running into with
submit_next. Could you please provide an example?
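For example, here is a minimal sketch of both ways to issue such a POST
(the URL, field names and callback are placeholders, not taken from the
actual site):

from scrapy.http import Request, FormRequest

def submit_data(self, response):
    # Hypothetical endpoint and field names, just to show the two options.
    # Raw POST: set the method and body yourself.
    yield Request("http://domain.com/paste", method='POST',
                  body="field=value",
                  headers={'Content-Type': 'application/x-www-form-urlencoded'},
                  callback=self.after_paste)
    # Or let FormRequest urlencode a dict of form fields for you
    # (it defaults to POST when formdata is given).
    yield FormRequest("http://domain.com/paste",
                      formdata={'field': 'value'},
                      callback=self.after_paste)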
On 16 Mar, 16:12, Miernik <pub...@public.miernik.name> wrote:
> Hello,
>
> I want Scrapy to crawl pages where the link to go to the next page looks
> like this:
> <a href="#" onclick="return gotoPage('2');"> Next </a>
Let me explain again: I know how to use BaseSpider, but I don't know
how to use CrawlSpider. With BaseSpider, I define a parse() method,
which processes the URL from start_urls, and it goes on from there. But
the CrawlSpider examples in the documentation don't show a parse()
method being defined, so where do I specify which method is used for
start_urls? I do not understand how a spider can be built without
having a parse() method to start with.
But in http://doc.scrapy.org/intro/overview.html#what-else it says:
Scrapy provides a lot of powerful features for making scraping easy
and efficient, such as:
* Built-in support for parsing HTML, XML, CSV, and Javascript
What can that JavaScript support do, then, if it cannot handle
onclick="return gotoPage('2');"?
Scrapy supports JavaScript URLs:
$ python scrapy-ctl.py shell http://www.google-analytics.com/ga.js
...
>>> response.headers['Content-Type']
'text/javascript'
>>> hxs.re('www.+?\.com')
[u'www.google-analytics.com', u'www.google.com']
But Scrapy doesn't know the URL that a JavaScript function like
gotoPage('2') opens.
> What can that JavaScript support do, then, if it cannot handle
> onclick="return gotoPage('2');"?
What I did in this situation was to inspect the JavaScript source to
find out what the gotoPage() function does and reproduce it in Scrapy,
e.g. in your parse function:
# extract gotoPage links
for page_id in hxs.re(r'onclick="return gotoPage\(\'(\d+)\'\);"'):
    # build the real URL from page_id here
    url = "http://domain.com/goto?page=%s" % page_id
    yield Request(url, callback=self.parse_js_page)
You can also use CrawlSpider with a custom link extractor to do the work
of transforming the onclick JS function into a page URL:
http://doc.scrapy.org/topics/spiders.html#crawlspider
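For what it's worth, here is a rough sketch of that approach. The spider
name, the rewritten URL and the callback are placeholders, and it assumes
SgmlLinkExtractor's tags/attrs/process_value arguments are available in
your Scrapy version:

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

def onclick_to_url(value):
    # Turn onclick="return gotoPage('2');" into a crawlable URL. The URL
    # pattern below is a guess -- use whatever livehttpheaders shows the
    # site actually requests when you click Next.
    m = re.search(r"gotoPage\('(\d+)'\)", value)
    if m:
        return "http://domain.com/goto?page=%s" % m.group(1)

class NextPageSpider(CrawlSpider):
    name = 'nextpage'
    start_urls = ['http://domain.com/']

    rules = (
        # Look at the onclick attribute instead of href and rewrite it.
        Rule(SgmlLinkExtractor(tags=('a',), attrs=('onclick',),
                               process_value=onclick_to_url),
             callback='parse_js_page', follow=True),
    )

    def parse_js_page(self, response):
        # extract data from each paginated page here
        pass

Keep in mind that requests generated by rules are plain GETs, so if the
site really requires the POST that livehttpheaders showed, you will have
to build that request yourself as in the earlier snippets.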
Regards,
Rolando
Here are excerpts from the scrapy.contrib.spiders.crawl module:
class CrawlSpider(InitSpider):

    def parse(self, response):
        """This function is called by the framework core for all the
        start_urls. Do not override this function, override
        parse_start_url instead."""
        return self._response_downloaded(response, self.parse_start_url,
                                         cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        """Overrideable callback function for processing start_urls.
        It must return a list of BaseItem and/or Requests"""
        return []
-------------------------------------------------------------------------------------------------------------
The rest is currently processed with rules but, hopefully, will be done
with CrawlSpider2 in the future.
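To tie this back to the login question at the top of the thread, here is
a minimal sketch of how that logon POST could go into parse_start_url
instead of parse(). The URLs, field names and the rule pattern are
placeholders, not taken from the actual site:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import FormRequest

class LoginCrawlSpider(CrawlSpider):
    name = 'login_crawl'
    start_urls = ['http://domain.com/login']

    # Pages reached after logging in are handled by the rules as usual.
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/items/'),
             callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # This runs for the start_urls response (the login form), so
        # there is no need to override parse().
        return [FormRequest.from_response(response, formnumber=0,
                                          formdata={'email': 'user@example.com',
                                                    'password': 'mypass22',
                                                    'action': 'sign-in'},
                                          callback=self.after_login)]

    def after_login(self, response):
        # Requests yielded with your own callback are not run through the
        # rules; only responses handled by parse() are. Yield further
        # Requests from here as needed.
        pass

    def parse_item(self, response):
        # extract items here
        pass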
Good point about "Javascript parsing" being included in the features
list; here's some explanation:
There used to be a Javascript parser in Scrapy which used some ctypes-based
spidermonkey bindings we made, but we had to remove it (r1769) because it was
buggy and unmaintained.
It was removed before the first stable release (0.7), but we forgot to update
the docs accordingly, so I've just removed those Javascript references in r1953.
Btw, based on our previous experience, adding native support for interpreting
and executing Javascript won't be trivial (if possible at all) and would
certainly prove to be overkill, as Victor suggested.
If you need a project that runs Javascript, you could take a look at Piggy Bank
from MIT: http://simile.mit.edu/wiki/Piggy_Bank
It's a screen scraper that runs in a Firefox add-on, where you write your
spiders in Javascript. It uses a sort of "headless" Firefox to process all
requests, which is quite a bit slower than Scrapy, but it does process
Javascript. It's also a bit messy to install, IMHO.
I made a "Firefox HTML cleanup" downloader middleware once, by writing a
"Firefox proxy" inspired by their code, which basically parsed all downloaded
responses with Firefox. The purpose was to make XPaths extracted with Firebug
work unmodified in Scrapy spiders, but the slowness and unreliability (it
wouldn't work with DOMs modified after the page was loaded) led me to abandon
that project.
Pablo.