Follow and scrape javascript link in scrapy


rsthythm

Sep 3, 2010, 5:01:49 AM
to scrapy-users
Hi:

I am a newbie to Scrapy; thanks for making such a great program. I
was wondering if there is a way for CrawlSpider to follow and
scrape a JavaScript POST link in the HTML (e.g. <a
href="javascript:__doPostBack('DG_Name$ctl07$ctl00','')">Next</a>). I
have not had much success trying to use SgmlLinkExtractor to do this.

Thanks

bruce

Sep 3, 2010, 10:46:31 AM
to scrapy...@googlegroups.com
hey...

not that i'm using scrapy, but i'm not sure that scrapy can handle javascript.

can it guys?

i would imagine that you'd have to use one of the headless node
processing apps, like htmlunit.

basically, you want to emulate the browser function, without the
browser overhead.

> --
> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
> To post to this group, send email to scrapy...@googlegroups.com.
> To unsubscribe from this group, send email to scrapy-users...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.
>
>

Pablo Hoffman

Sep 3, 2010, 5:27:46 PM
to scrapy...@googlegroups.com
Yeah, you need to use Firebug (or some similar tool) to inspect what the
browser is doing, and replicate that in your spider.

Pablo.

bruce

Sep 4, 2010, 11:49:16 AM
to scrapy...@googlegroups.com
hi pablo, and others.

it's more than just firing up firebug/firefox and inspecting the javascript.

for simple cases, that approach will work. but if you're doing serious
crawling where you're dealing with potentially dynamic javascript
that's returning data from a backend/database, you can't anticipate
what it will return, so you can't "hardcode" into your app what you see
in firebug when you examine it. in these cases you really need to use
a headless browser, which more or less replicates/emulates
some of the browser functionality. this then gives you the ability to
"run" the javascript in the same way a browser would.

you'd basically do the analysis of the site/page you're looking to
parse, set it up using the headless (htmlunit) app, and then call the
htmlunit app from the crawling app, returning the data back to the
crawling python app.

hope this makes sense..


rsthythm

Sep 4, 2010, 2:25:50 PM
to scrapy-users
Hi bruce, pablo, others:

Thanks for your replies. Perhaps I should be specific:-

In Firebug, after I click on the javascript link I get the following
relevant HTML:-

... <form name="Form1" method="post" action="Index.aspx?qstr_level=05" onsubmit="javascript:return WebForm_OnSubmit();" id="Form1">
<div>
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTY1OTY3ODQwNA9kFgICAQ9kFhA......[I cut-off LONG ENCRYPTED STREAM for clarity]" />
</div>

<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['Form1'];
if (!theForm) {
    theForm = document.Form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
} ...

Thus if I call:-

FormRequest.from_response(response, formdata = {'_EVENTTARGET' : 'DG_Name$ctl03$ctl00',
    '_EVENTVALIDATION': _EVENTVALIDATION,
    '_EVENTARGUMENT' : "",
    '_VIEWSTATE' : "/wEPDwUKLTY1OTY3ODQwNA9kFgICAQ9k....." }....

(I also tried including subsets of these id:value pairs in formdata.) I only
get the original webpage, as if no JavaScript POST had happened. Am
I doing something obviously wrong, or should I be trying to emulate
JavaScript with python-spidermonkey?

I'd be grateful for your help.
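
One detail worth checking in the call above: the hidden inputs in the quoted HTML are named with two leading underscores (__EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE), while the formdata keys use only one. The following is a minimal sketch, using only the Python standard library, of the form data that __doPostBack effectively submits. The HiddenInputCollector class and the postback_formdata helper are hypothetical names introduced for illustration, and SAMPLE_HTML is a shortened stand-in for the page quoted above, not the real page.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def postback_formdata(html, event_target, event_argument=""):
    """Mimic __doPostBack: keep the hidden fields, then set the two event fields."""
    collector = HiddenInputCollector()
    collector.feed(html)
    formdata = dict(collector.fields)
    # Note the double leading underscore, matching the names in the HTML.
    formdata["__EVENTTARGET"] = event_target
    formdata["__EVENTARGUMENT"] = event_argument
    return formdata


# Shortened, made-up stand-in for the page; the __VIEWSTATE value is a placeholder.
SAMPLE_HTML = """
<form name="Form1" method="post" action="Index.aspx?qstr_level=05" id="Form1">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="PLACEHOLDER" />
</form>
"""

formdata = postback_formdata(SAMPLE_HTML, "DG_Name$ctl03$ctl00")
```

In Scrapy itself, FormRequest.from_response already copies the form's existing fields from the response, so passing only __EVENTTARGET and __EVENTARGUMENT in formdata may be enough. Hard-coding __VIEWSTATE risks sending a stale value.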

Victor Mireyev

Sep 5, 2010, 5:50:27 PM
to scrapy-users
Hey rsthythm!

I see you're dealing with the notoriously nasty ASP.NET postback problem.
I don't believe this is a job for python-spidermonkey.
Recently, I successfully parsed a pager built on top of ASP.NET
postback using only standard Scrapy tools.
This was for an open government data scraper
(http://github.com/AmbientLighter/rpn-fas).
I also hope you will find these explanations useful:
http://search.cpan.org/~ecarroll/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
though that library is written in Perl.

Happy scraping!
Victor Mireyev,

Ken Cochrane

Sep 6, 2010, 11:15:36 AM
to scrapy-users
Victor,

Thanks for posting those links. I just finished working on a project
which had the same issue, and your solution is so much better than the
one I had to use.

Ken


Pablo Hoffman

Sep 6, 2010, 12:20:23 PM
to scrapy...@googlegroups.com
Hi Victor,

Very nice tips about the __VIEWSTATE parameter.

I've taken the opportunity to add it to the FAQ, with a link to your spider:
http://doc.scrapy.org/0.10/faq.html#what-s-this-huge-cryptic-viewstate-parameter-used-in-some-forms

Pablo.

Surajit Nundy

Sep 9, 2010, 9:05:40 AM
to scrapy...@googlegroups.com
Dear Victor:

Thank you very much for your post. I looked through your code at http://github.com/AmbientLighter/rpn-fas, but I am having trouble understanding where you get around the postback problem in the file "rnp.py" (I also don't speak Russian, but I am trying to understand http://rnp.fas.gov.ru/Default.aspx). I was wondering if you might do me the favor of explaining exactly how you solved it.

Again, thank you.

Shuro

Victor Mireyev

Sep 10, 2010, 4:35:08 AM
to scrapy-users
Hi Surajit!

As you can see, the rnp.fas.gov.ru site contains a paginated table.
Each page contains 10 items (rows) by default.
There is also a paginator below the table, which contains a
`from` field (ctl00_phWorkZone_rnpList_datapgr_tbRecNumFrom).

To move to the next page, we can override the `from` field
with its current value increased by, for example, 10, and submit
(POST) the form.

Note, however, that this approach is not universal for all cases where
postback is used.
It's rather a special case for paginated tables. That's why I
posted the link to CPAN,
where a general explanation is given.
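
The pagination step described above can be sketched roughly as follows. This is only an illustration, not the code from the rpn-fas repository: the next_page_formdata helper is a hypothetical name, the starting value "1" and the __VIEWSTATE placeholder are made up, and whether the identifier below is also the field's POST name on the real site is an assumption.

```python
def next_page_formdata(current_fields, from_field, page_size=10):
    """Return a copy of the form fields with the 'from' field advanced by one page."""
    updated = dict(current_fields)
    updated[from_field] = str(int(current_fields[from_field]) + page_size)
    return updated


# Identifier quoted in the message above; assumed here to be the form field key.
FROM_FIELD = "ctl00_phWorkZone_rnpList_datapgr_tbRecNumFrom"

# Made-up current state of the form (first row shown is 1).
fields = {FROM_FIELD: "1", "__VIEWSTATE": "PLACEHOLDER"}
page2 = next_page_formdata(fields, FROM_FIELD)
```

In a spider, a dict like page2 could then be passed as the formdata argument of FormRequest.from_response, built from the response for the current page so that the hidden fields stay in sync with the server.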


Surajit Nundy

Sep 12, 2010, 8:14:03 PM
to scrapy...@googlegroups.com
Hi Victor:

Thanks for your reply. I understand (I think) what you mean and have a slightly more complex problem. The webpage I am interested in has 2 "submit" fields:-

1. javascript:return WebForm_OnSubmit() requires a numeric value (year)

and

2. javascript:__doPostBack('DG_Name$ctl03$ctl00',''). When you look at this in Firebug, it appears to do a POST operation with some arguments (in this case 'DG_Name$ctl03$ctl00' and '').

I need to do the 2nd option while not activating the first one. Is it your experience that both 1) and 2) are POST operations, just with different arguments?

Thanks very much for your help.
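
Going by the __doPostBack source quoted earlier in this thread, the link fills in __EVENTTARGET and __EVENTARGUMENT and then calls submit() on Form1, so it appears to be a POST of the same form that differs only in those two fields. A small sketch for pulling the two arguments out of such a link, using only the standard library; POSTBACK_RE and parse_postback are hypothetical names:

```python
import re

# Matches __doPostBack('target','argument') with single-quoted arguments.
POSTBACK_RE = re.compile(r"__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack link, or None."""
    match = POSTBACK_RE.search(href)
    return match.groups() if match else None


target, argument = parse_postback("javascript:__doPostBack('DG_Name$ctl03$ctl00','')")
```

The two values would then go into __EVENTTARGET and __EVENTARGUMENT of the form data. In current Scrapy versions, FormRequest.from_response also accepts dont_click=True, which submits the form without simulating a click on a submit control; that may help when the first submit must not be triggered.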


embedded

Dec 27, 2012, 7:37:24 AM
to scrapy...@googlegroups.com, nun...@gmail.com
victor are you still there?
I have similar problem I can't solve.

mind helping out?

lin di

Jan 27, 2013, 11:55:53 PM
to scrapy...@googlegroups.com

it's pretty useful, thanks.

