Scraping HTML contained in a web page that makes AJAX calls

alberto

Sep 22, 2009, 1:24:18 PM
to scrapy-users
I'm a beginner in Python, and so in Scrapy too. What can I do? At first I
supposed I could find a way to scrape the AJAX calls using Scrapy. Then I
realized that a browser emulator or a browser class in Python can do
that.
Can anyone help me?

Pablo Hoffman

Sep 22, 2009, 11:55:00 PM
to scrapy...@googlegroups.com
Hi Alberto,

I don't understand. What do you need to do?

Alberto Priore

Sep 23, 2009, 4:40:33 AM
to scrapy...@googlegroups.com
Take the example of a page, http://www.site.com/example.html, that makes some AJAX calls. If I view the source of this page, I can't see all the HTML, because some of it is loaded by the AJAX calls. If I inspect the page in Firebug, I can see all the HTML. How can I get that?

I believe one solution could be to download all the AJAX calls automatically, but how can I do that? Is there a solution? I hope I explained it well.

 
2009/9/23 Pablo Hoffman <pabloh...@gmail.com>

Mark Ellul

Sep 23, 2009, 7:13:41 AM
to scrapy...@googlegroups.com
Hi Alberto,

If the AJAX calls are static, i.e. their URLs don't change, you can just create the requests in the start_requests method of your spider and parse the results.

If they are not static, i.e. they pass in parameters that are specific to the main page, you need to parse out the script tags that make the calls, create the corresponding requests in your parse_item code, and return all of those requests in the results list of your spider's parse_item method.
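As a rough illustration of "parsing out the script tags", here is a minimal sketch using only the Python standard library. It assumes the page makes its calls with jQuery-style `$.get("...")` or `$.post("...")`; that pattern, the function name `find_ajax_urls`, and the `/ajax/details` URL are hypothetical, so the regular expression would need adapting to whatever the real page does.

```python
import re
from urllib.parse import urljoin

# Contents of each <script>...</script> block in the page.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)

# Assumption: calls look like $.get("...") or $.post("..."); adjust for the real site.
AJAX_CALL = re.compile(r"""\$\.(?:get|post)\(\s*["']([^"']+)["']""")


def find_ajax_urls(page_url, html):
    """Return absolute URLs of the AJAX calls found inside script tags."""
    urls = []
    for script in SCRIPT_BLOCK.findall(html):
        for path in AJAX_CALL.findall(script):
            # Resolve relative paths against the page that contains the script.
            urls.append(urljoin(page_url, path))
    return urls


if __name__ == "__main__":
    sample = '<script>$.get("/ajax/details?id=42", show);</script>'
    print(find_ajax_urls("http://www.site.com/example.html", sample))
```

Each URL returned this way would then become one request issued from the spider's parse callback.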

If, say, you need 3 AJAX requests to fill an item, add the item to the request metadata, so that in the callback for each request you fill in the bits you can. When the item has been filled, you return it in a list, and it will then go into the pipeline.
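The pattern of carrying a half-filled item through a chain of callbacks can be sketched without Scrapy itself. The `Request`, `Response`, and `crawl` names below are small stand-ins written for this example, not Scrapy's classes, and the URLs and field names are made up. In Scrapy the same idea is usually expressed with `Request(url, callback=..., meta={"item": item})` and `response.meta["item"]`. The sketch chains two follow-up calls instead of three to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Request:
    """Stand-in for a crawler request (not Scrapy's Request class)."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class Response:
    """Stand-in for a downloaded response; meta is copied from the request."""
    url: str
    body: str
    meta: dict


def parse_page(response):
    # Start the item on the main page, then hand it to the first AJAX call.
    item = {"url": response.url}
    yield Request("http://www.site.com/ajax/price", parse_price, meta={"item": item})


def parse_price(response):
    item = response.meta["item"]
    item["price"] = response.body
    yield Request("http://www.site.com/ajax/stock", parse_stock, meta={"item": item})


def parse_stock(response):
    item = response.meta["item"]
    item["stock"] = response.body
    yield item  # every field is filled, so the item itself is returned


def crawl(start, fetch):
    """Tiny scheduler: run callbacks, queue new requests, collect items."""
    queue, items = [start], []
    while queue:
        request = queue.pop(0)
        response = Response(request.url, fetch(request.url), request.meta)
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items
```

Passing a dictionary's `get` method as `fetch` is enough to try the chain offline, with canned bodies standing in for the real AJAX responses.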

Regards

Mark

Pablo Hoffman

Sep 23, 2009, 8:14:49 AM
to scrapy...@googlegroups.com
Yeah, you can use the Firebug network monitoring feature to inspect AJAX calls
being performed through the page. Here's the doc: http://getfirebug.com/net.html

And then construct those same requests in Scrapy.
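To make "construct those same requests" concrete, here is a small sketch that rebuilds a call as it might appear in the Firebug Net panel. It uses the standard library's `urllib.request.Request` rather than Scrapy so that it runs anywhere. The endpoint, the `page` parameter, and the helper name `build_ajax_request` are assumptions for illustration. The `X-Requested-With: XMLHttpRequest` header is one that many JavaScript libraries send with AJAX calls, and some servers check for it. In Scrapy, the same method, URL, headers, and form data would go into a `Request` or `FormRequest`.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_ajax_request(url, params, referer):
    """Rebuild a POST request with the details copied from the network monitor."""
    headers = {
        # Commonly sent by JavaScript libraries to mark a request as AJAX.
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode(params).encode("ascii")
    return Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    request = build_ajax_request(
        "http://www.site.com/ajax/comments",  # hypothetical endpoint
        {"page": 2},
        "http://www.site.com/example.html",
    )
    print(request.get_method(), request.full_url, request.data)
```

The request is only built here, not sent. Comparing its method, headers, and body against the entry in the network monitor is a quick way to check nothing was missed.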


Btw, I'm getting the feeling that we should put together an AJAX scraping
HOWTO, as it's a commonly asked topic. Perhaps we should add it to the Firebug
page: http://doc.scrapy.org/topics/firebug.html

Pablo.

Matthias Buehlmaier

Sep 24, 2009, 4:24:51 AM
to scrapy...@googlegroups.com
Hi Pablo,

> Btw, I'm getting the feeling that we should put together an AJAX scraping
> HOWTO, as it's a commonly asked topic. Perhaps we should add it to the Firebug
> page: http://doc.scrapy.org/topics/firebug.html

That would be a great idea. It might also be interesting to show the
conceptual differences between Scrapy and other Python-based approaches,
for example

http://www.packtpub.com/article/web-scraping-with-python
http://www.packtpub.com/article/web-scraping-with-python-part-2

Thanks,

Matthias


Ismael Carnales

Sep 28, 2009, 11:34:52 AM
to scrapy...@googlegroups.com
I've started a small AJAX howto in the Scrapy Wiki, you can see it here:

http://dev.scrapy.org/wiki/ScrapingAjaxSites

Suggestions are welcome :)

bye!