Parse a page (partly generated by JavaScript) by using Selenium


Maren Sonnenschein

Aug 28, 2014, 8:54:59 AM
to seleniu...@googlegroups.com

I've got a problem: I want to parse a page (e.g. this one) to collect information about the offered apps and save this information in a database.

I am also using crawler4j to visit every available page. The problem, as far as I can see, is that crawler4j needs links to follow in the page source.

In this case, however, the hrefs are generated by JavaScript, so crawler4j never gets new links to visit or pages to crawl.

My idea was to use Selenium so that I can inspect elements as I would in a real browser such as Chrome or Firefox (I'm quite new to this).

But, to be honest, I don't know how to get the "generated" HTML instead of the source code.

Can anybody help me?

Andreas Tolfsen

Aug 29, 2014, 9:59:58 AM
to seleniu...@googlegroups.com
Maren Sonnenschein <maren.s...@gmail.com>:
> But, to be honest, I don't know how to get the "generated" HTML instead of
> the source code.

Presumably you mean a snapshot of the DOM which you can parse.

The get_page_source command returns the markup and contents of the
documentElement as a string, which you can pass along to an HTML5
parser.
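
For reference, here is a minimal sketch of what that could look like from Java (since crawler4j is a Java library), assuming Selenium's ChromeDriver and jsoup as the HTML parser. The URL, the CSS selector, and the wait condition are placeholders you would adapt to the target page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RenderedPageExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            // Load the page; the browser executes the JavaScript that builds the links.
            driver.get("https://example.com/apps"); // placeholder URL

            // Wait until at least one link appears in the rendered DOM
            // (the selector depends on the actual page).
            new WebDriverWait(driver, 10)
                .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("a[href]")));

            // getPageSource() serialises the current DOM, i.e. the "generated"
            // HTML rather than the raw source delivered over the wire.
            String html = driver.getPageSource();

            // Hand the snapshot to jsoup and pull out the hrefs, which could
            // then be fed back into the crawler or stored in the database.
            Document doc = Jsoup.parse(html, driver.getCurrentUrl());
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.absUrl("href"));
            }
        } finally {
            driver.quit();
        }
    }
}

In the Java bindings the command is exposed as driver.getPageSource() (page_source in Python); because it reflects the current DOM, the JavaScript-injected links are included in the string you parse.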