is scrapy able to click a link with javascript onclick?


Miernik

Mar 16, 2010, 10:12:42 AM
to scrapy-users
Hello,

I want scrapy to crawl pages where going to the next one link looks
like this:
<a href="#" onclick="return gotoPage('2');"> Next </a>

Will Scrapy be able to interpret javascript code like that?

With the LiveHTTPHeaders extension I found out that clicking Next
generates a POST with a really huge piece of "garbage" starting like
this: encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n

I am trying to build my spider on the CrawlSpider class, but I can't
really figure out how to code it. With BaseSpider I used the parse()
method to process the first URL, which happens to be a login form,
where I did a POST with:

def logon(self, response):
    login_form_data = {'email': 'us...@example.com', 'password': 'mypass22',
                       'action': 'sign-in'}
    return [FormRequest.from_response(response, formnumber=0,
        formdata=login_form_data, callback=self.submit_next)]

And then I defined submit_next() to say what to do next. But I can't
figure out how to tell CrawlSpider which method to use on the first
URL.

All requests in my crawl, except the first one, are POST requests.
They alternate between two types: posting some data, and clicking
"Next" to go to the next page.

Victor Mireyev

Mar 16, 2010, 6:24:22 PM
to scrapy-users
1) Scrapy doesn't interpret javascript `on-the-fly` and probably never will,
because it would simply be overkill, IMHO.

2) Not sure what you mean by `first URL` in the penultimate paragraph.
Is it the start_urls or something else?

3) If you want to set the method of a request, you can always pass it as a
param to the Request constructor, e.g. Request(url, method='POST')
(see the short sketch after this list).

4) I don't really understand what problem you're running into with
submit_next. Could you please provide an example?
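
To illustrate 3), a minimal sketch (the urls, body and callback name are
made-up placeholders, use whatever LiveHTTPHeaders showed the site actually
sends):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class ExampleSpider(BaseSpider):
    name = 'example'
    start_urls = ['http://www.example.com/']  # placeholder

    def parse(self, response):
        # build the POST by hand, setting method and body explicitly
        return [Request('http://www.example.com/goto', method='POST',
                        body='encoded_session_hidden_map=...&page=2',
                        callback=self.parse_next_page)]

    def parse_next_page(self, response):
        pass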


Miernik

Mar 17, 2010, 8:37:36 AM
to scrapy-users
On Mar 16, 11:24 pm, Victor Mireyev <ambientligh...@gmail.com> wrote:
> 2) Not sure what do you mean by `first URL` in the penultimate
> paragraph.
> Is it the start_urls or smth else?

Let me explain again: I know how to use BaseSpider. But I don't know
how to use CrawlSpider.

With BaseSpider, I define a parse() method, which processes the URL
from start_urls, and then it goes on from there. But the CrawlSpider
examples in the documentation don't show a parse() method being
defined, so where do I specify which method is used for start_urls? I
do not understand how a spider can be built without having a parse()
method to start with.

Miernik

Mar 17, 2010, 8:48:48 AM
to scrapy-users
On Mar 16, 11:24 pm, Victor Mireyev <ambientligh...@gmail.com> wrote:
> 1) Scrapy doesn't interpret javascript `on-the-fly` and, probably,
> never will, because it's simply overkill, IMHO.

But in http://doc.scrapy.org/intro/overview.html#what-else it says:

Scrapy provides a lot of powerful features for making scraping easy
and efficient, such as:

* Built-in support for parsing HTML, XML, CSV, and Javascript

What can that JavaScript support do then, if it cannot handle
onclick="return gotoPage('2');"?

Rolando Espinoza La Fuente

Mar 17, 2010, 2:45:34 PM
to scrapy...@googlegroups.com
On Wed, Mar 17, 2010 at 8:48 AM, Miernik <pub...@public.miernik.name> wrote:
> On Mar 16, 11:24 pm, Victor Mireyev <ambientligh...@gmail.com> wrote:
>> 1) Scrapy doesn't interpret javascript `on-the-fly` and, probably,
>> never will, because it's simply overkill, IMHO.
>
> But in http://doc.scrapy.org/intro/overview.html#what-else it says:
>
> Scrapy provides a lot of powerful features for making scraping easy
> and efficient, such as:
>
>  * Built-in support for parsing HTML, XML, CSV, and Javascript

Scrapy supports javascript urls:

$ python scrapy-ctl.py shell http://www.google-analytics.com/ga.js
...
>>> response.headers['Content-Type']
'text/javascript'
>>> hxs.re('www.+?\.com')
[u'www.google-analytics.com', u'www.google.com']

But Scrapy doesn't know which URL a javascript call like gotoPage('2')
opens.

> What can that JavaScript support do then, if it can not do
> onclick="return gotoPage('2');"?

What I did in this situation was inspect the javascript source to find out
what the gotoPage() function does and reproduce it in Scrapy, e.g. in your
parse function:

# extract gotoPage links
for page_id in hxs.re(r'onclick="return gotoPage\(\'(\d+)\'\);"'):
    # build the real url from page_id here
    url = "http://domain.com/goto?page=%s" % page_id
    yield Request(url, callback=self.parse_js_page)

You can also use CrawlSpider with a custom link extractor to do the work of
transforming the onclick js function into a page url:
http://doc.scrapy.org/topics/spiders.html#crawlspider
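
I haven't run this against your site, but the idea would be roughly this
(the goto?page= url is a guess, build whatever the javascript really builds):

import re

from scrapy.link import Link
from scrapy.contrib.spiders import CrawlSpider, Rule

class GotoPageLinkExtractor(object):
    """Turn onclick="return gotoPage('N');" into ordinary links."""

    def extract_links(self, response):
        page_ids = re.findall(r"gotoPage\('(\d+)'\)", response.body)
        return [Link("http://domain.com/goto?page=%s" % page_id)
                for page_id in page_ids]

class MySpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://domain.com/']

    rules = (
        Rule(GotoPageLinkExtractor(), callback='parse_js_page', follow=True),
    )

    def parse_js_page(self, response):
        pass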

Regards,

Rolando

Victor Mireyev

Mar 17, 2010, 6:14:06 PM
to scrapy-users
Let's use the source code of Scrapy as a reference.

Here's an excerpt from the scrapy.contrib.spiders.crawl module:

class CrawlSpider(InitSpider):

    def parse(self, response):
        """This function is called by the framework core for all the
        start_urls. Do not override this function, override parse_start_url
        instead."""
        return self._response_downloaded(response, self.parse_start_url,
            cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        """Overrideable callback function for processing start_urls. It must
        return a list of BaseItem and/or Requests"""
        return []
-------------------------------------------------------------------------------------------------------------
The rest is currently processed with rules, but in the future, hopefully, it
will be done with CrawlSpider2.
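
So for your case, the login could go into parse_start_url. A rough, untested
sketch (I reused the form data from your first message, the login url is a
placeholder):

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import FormRequest

class MySpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com/login']  # placeholder
    rules = ()  # plus whatever Rule(...) definitions you need

    def parse_start_url(self, response):
        # plays the role that parse()/logon() played in your BaseSpider
        login_form_data = {'email': 'us...@example.com', 'password': 'mypass22',
                           'action': 'sign-in'}
        return [FormRequest.from_response(response, formnumber=0,
            formdata=login_form_data, callback=self.submit_next)]

    def submit_next(self, response):
        # continue your POST chain from here
        return []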

Pablo Hoffman

Mar 20, 2010, 7:53:17 PM
to scrapy...@googlegroups.com

Good point about "Javascript parsing" being included in the features list;
here's some explanation:

There used to be a Javascript parser in Scrapy which used some ctypes-based
spidermonkey bindings we made, but we had to remove it (r1769) because it was
buggy and unmaintained.

It was removed before the first stable release (0.7), but we forgot to update
the docs accordingly. So I've just removed those Javascript references in r1953.

Btw, based on our previous experience, adding native support for interpreting
and executing Javascript won't be trivial (if possible at all) and would
certainly prove to be overkill, as Victor suggested.

If you need a project that runs Javascript, you could take a look at Piggy Bank
from MIT: http://simile.mit.edu/wiki/Piggy_Bank

It's a screen scraper that runs in a Firefox add-on, where you write your
spiders in Javascript. It uses a sort of "headless" Firefox to process all
requests, which is quite a bit slower than Scrapy but it does process
Javascript. It's also a bit messy to install IMHO.

I made a "Firefox HTML cleanup" downloader middleware once, by writing a
"Firefox proxy" inspired on their code which basically parsed all downloaded
responses with Firefox. The purpose was to make XPaths extracted with Firebug
work unmodified in Scrapy spiders, but the slowness and unreliability (it
wouldn't work with DOMs modified after the page was loaded) lead me to abandon
that project.

Pablo.
