Roundabout way of scraping dynamic content.


mitch

Apr 24, 2014, 5:58:03 AM
to scrapy...@googlegroups.com
Hi everyone,

I'm just a hack when it comes to this stuff, so this solution is by no means elegant.

I have some dynamic content I want to scrape. I have a small number of actual pages (< 50), but I want to parse many different page elements. Because of this, I thought I'd just manually visit the pages, download the HTML source after JS does its work, then put the files on my own private webserver and do a quick crawl so that I can have the parsing benefits of Scrapy.

The problem I'm running into is that, even after the page has been saved as an HTML file, much of the information I want is still hidden inside these "hidden_elem" tags and wrapped in HTML comment markers ("<!--" and "-->"), making it invisible to Scrapy. However, the information IS in the code; I can open the file and see it plain as day. How can I make Scrapy give it to me?

Thanks so much!

Bill Ebeling

Apr 28, 2014, 10:17:47 AM
to scrapy...@googlegroups.com
Hey Mitch,

At the risk of stating the obvious, Scrapy handles dynamic content quite well. The general approach is to scrape the page, submit requests for the AJAX endpoints, stitch the item together, and submit it to the pipeline.

That said, it's not complicated, but not trivial, either.

To your specific point, the solution is either to regex it out, or to start fiddling with the underlying HTML. I would not personally download someone else's page and then put it on a server, since the JS is still going to be running and logging things and all that.
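
As a rough illustration of the "regex it out" route, here is a minimal sketch using only Python's standard library. It assumes the saved file wraps the wanted markup in HTML comments inside "hidden_elem" elements; the sample HTML, the `code` tag, and the `price` class are made up for illustration, and the real page will differ.

```python
import re

# Hypothetical sample of a saved page: the useful markup sits inside an
# HTML comment within a "hidden_elem" element, so parsers skip over it.
SAVED_HTML = """
<html><body>
<code class="hidden_elem" id="c1"><!-- <div class="price">42</div> --></code>
<code class="hidden_elem" id="c2"><!-- <div class="price">17</div> --></code>
</body></html>
"""

# Non-greedy match of everything between <!-- and -->, across newlines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def comment_bodies(html):
    """Return the stripped text of every HTML comment in html."""
    return [body.strip() for body in COMMENT_RE.findall(html)]


def uncomment(html):
    """Return html with the comment markers removed and the bodies kept."""
    return COMMENT_RE.sub(lambda match: match.group(1), html)


if __name__ == "__main__":
    for body in comment_bodies(SAVED_HTML):
        print(body)
```

Running something like `uncomment` over the saved files before crawling them would turn the commented markup into ordinary elements that regular selectors can see.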

If you want to look into writing a crawler that gets the dynamic content, start here: http://doc.scrapy.org/en/latest/topics/request-response.html and pay special attention to the meta dict.

If you want more help with the specific site, provide a link so we can see it.

Hope that helps.

bruce

Apr 28, 2014, 1:01:48 PM
to scrapy-users
I didn't think Scrapy had the ability to run remote AJAX, similar to
casperjs/phantom/nodejs...

Does Scrapy run a headless browser process to accomplish this?

thanks

Bill Ebeling

Apr 28, 2014, 1:13:53 PM
to scrapy...@googlegroups.com
Scrapy sends a request to the ajax address just like it does for the normal webpage. You maintain data from one request to the other with the meta dict.

There was a tutorial on it a while back about scraping the NASA website for its pic of the day. Can't seem to find it now, though. If you take a look at the link above, you can read all about it.



bruce

Apr 28, 2014, 1:30:31 PM
to scrapy-users
bill...

not sure that's the same... ie, I don't think scrapy has a way to
"wait" for an element to show up on a given page, based on the
underlying ajax functions...

I had talked to Pablo about this a while ago and he was saying Scrapy
couldn't handle this. Are you saying it now can?

This would be cool if it really can.

Bill Ebeling

Apr 28, 2014, 1:34:52 PM
to scrapy...@googlegroups.com
Hey Bruce,

I'm not sure what your exact situation is, and of course Pablo is far more knowledgeable about Scrapy cans and can'ts than I am.

But I will say that I have many spiders that crawl AJAX powered sites.

bruce

Apr 28, 2014, 1:35:48 PM
to scrapy-users
Hey Bill.

I found what I think are the articles discussing the NASA image/Scrapy
tutorial. Yeah, it's not really running a headless browser at all. It's
"simulating" a piece of what the JavaScript returns from that given
page. But for a complex dynamic site, that still isn't a "real"
headless browser.

thanks



Nikolaos-Digenis Karagiannis

Apr 29, 2014, 6:41:58 AM
to scrapy...@googlegroups.com
Bill implies that you will have to yield those Ajax requests yourself (though that misses the point of "dynamic"). Nothing stands in your way of doing this, provided you have the headers and body for the request.
Regarding the "hidden" information, the top-level scrapy package cannot reach it, but XPath selectors (scrapy.selector.Selector) can:
Selector(response).xpath("//comment()")
When constructing XPath expressions that describe elements, open the files with a plain text editor rather than a browser, which may alter the HTML to comply with the standard and/or evaluate any JavaScript leftovers.

William Kinaan

Apr 30, 2014, 2:42:17 PM
to scrapy...@googlegroups.com
Hi,
Scrapy downloads the HTML source of any page. If you want to extract data that doesn't come with that source (i.e., Ajax data), you can make a new Ajax request and set the correct headers and cookies.
I would scrape the page, then make the Ajax calls.