jsdom, I need to understand the concept

Jorge Ventura

Dec 8, 2010, 11:49:15 AM
to NodeJS
I was trying to use jsdom as a headless browser to scrape data, but I think my understanding of this library may not be correct.
Consider the example below.

I first tried it on a test page whose source is:
<html>
    <head>
        <script type="text/javascript">
            document.write("Test");
        </script>
    </head>
    <body>
    </body>
</html>
 
When I load this page using jsdom, I get back exactly the same markup. When I do the same in Firefox or Chrome, I get the new DOM structure shown below, which makes perfect sense to me: at parse time, the parser found document.write and appended a text node to the body.

<html>
    <head>
        <script type="text/javascript">
            document.write("Test");
        </script>
    </head>
<body>
Test
</body>
</html>

How can I scrape data in this simple case?

Ventura

Dean Mao

Dec 8, 2010, 2:37:27 PM
to nod...@googlegroups.com
I don't think jsdom actually does anything with inline javascript tags.  It just treats them like any other xml element.  I'm not sure if there are any projects out there right now that will render the dom & execute inline javascript -- that's probably a much harder problem.  I've heard of projects like htmlunit successfully executing certain javascript scenarios, but I don't believe anyone has come up with a great headless renderer that doesn't involve an existing browser.


--
You received this message because you are subscribed to the Google Groups "nodejs" group.
To post to this group, send email to nod...@googlegroups.com.
To unsubscribe from this group, send email to nodejs+un...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nodejs?hl=en.

Jorge Ventura

Dec 8, 2010, 3:09:46 PM
to nod...@googlegroups.com
The point is that all of this software (htmlunit and ...) is based on Java, which has very poor performance.

I made an imperfect change (patch below) to version 0.1.20 and got the right answer for the example code.

I would like some direction on this problem. I want to scrape data from an ajax application, and without this feature it becomes impossible.

diff -r -u tmpvar-jsdom-462ef4a/lib/jsdom/browser/index.js tmpvar-jsdom-ventura/lib/jsdom/browser/index.js
--- tmpvar-jsdom-462ef4a/lib/jsdom/browser/index.js    2010-11-25 13:53:29.000000000 -0600
+++ tmpvar-jsdom-ventura/lib/jsdom/browser/index.js    2010-12-05 08:34:49.000000000 -0600
@@ -196,7 +197,7 @@
   }
   if (!dom.HTMLDocument.write) {
     dom.HTMLDocument.prototype.write = function(html) {
-      this.innerHTML = html;
+      htmltodom.appendHtmlToElement(html, this.body);
     };
   }
 
diff -r -u tmpvar-jsdom-462ef4a/lib/jsdom/level2/html.js tmpvar-jsdom-ventura/lib/jsdom/level2/html.js
--- tmpvar-jsdom-462ef4a/lib/jsdom/level2/html.js    2010-11-25 13:53:29.000000000 -0600
+++ tmpvar-jsdom-ventura/lib/jsdom/level2/html.js    2010-12-05 08:26:56.000000000 -0600
@@ -328,7 +328,7 @@
   },
 
   writeln : function(text) {
-    this.innerHTML = text + '\n';
+    this.write(text + '\n');
   },
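
The effect of the patch can be seen in a toy model. The sketch below is not jsdom code: `makeDoc`, `writeByAssign`, and `writeByAppend` are hypothetical names, and the body is modeled as a plain string rather than a DOM tree. It only shows why assigning the written markup loses earlier output while appending keeps it.

```javascript
// Toy model of a document whose body is just a string of markup.
function makeDoc() {
  return { body: "" };
}

// Pre-patch style: each write replaces whatever was there before.
function writeByAssign(doc, html) {
  doc.body = html;
}

// Patched style: each write is appended to the existing body content.
function writeByAppend(doc, html) {
  doc.body += html;
}

const before = makeDoc();
writeByAssign(before, "<p>one</p>");
writeByAssign(before, "<p>two</p>");

const after = makeDoc();
writeByAppend(after, "<p>one</p>");
writeByAppend(after, "<p>two</p>");

console.log(before.body); // <p>two</p>
console.log(after.body);  // <p>one</p><p>two</p>
```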

Mikeal Rogers

Dec 8, 2010, 3:45:51 PM
to nod...@googlegroups.com
First of all, jsdom is pretty slow, even though it's about 8x faster than it used to be. I have a suspicion that this has to do with the excessive use of getters and setters you're forced into when implementing the DOM API.

Also, jsdom *does* run inline script tags, which is how you use jQuery and other libraries server side. Some stuff works, some doesn't, tmpvar loves patches :)

-Mikeal
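
One way to picture "running inline script tags" on the server is with Node's built-in `vm` module. This is only a sketch under stated assumptions, not how jsdom does it: `inlineScripts` and `runInlineScripts` are hypothetical helpers, the scripts are found with a naive regular expression instead of an HTML parser, and the `document` exposed to the script is a stub that just records what was written.

```javascript
const vm = require("vm");

// The test page from the first message, as a string.
const page = [
  "<html>",
  "  <head>",
  '    <script type="text/javascript">',
  '      document.write("Test");',
  "    </script>",
  "  </head>",
  "  <body>",
  "  </body>",
  "</html>",
].join("\n");

// Collect the source text of every inline <script> element (naive regex).
function inlineScripts(html) {
  const pattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  const sources = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    sources.push(match[1]);
  }
  return sources;
}

// Run each script in a fresh context that exposes only a stub `document`.
function runInlineScripts(html) {
  const written = [];
  const sandbox = {
    document: {
      write(text) { written.push(String(text)); },
      writeln(text) { written.push(String(text) + "\n"); },
    },
  };
  const context = vm.createContext(sandbox);
  for (const source of inlineScripts(html)) {
    vm.runInContext(source, context);
  }
  return written.join("");
}

console.log(runInlineScripts(page)); // Test
```

Note that the `vm` module is not a security boundary, so an approach like this is only reasonable for pages you trust.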

Marco Rogers

Dec 8, 2010, 3:49:12 PM
to nodejs

> used to be. I have this suspicion that it had to do with the excessive use
> of getters and setters you're forced to use implementing the DOM API.

I wonder if it would be significantly faster using C++ accessor
methods through v8. I'm gonna have to tackle this at some point with
libxmljs. Although it has been languishing of late.

:Marco
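
For readers unfamiliar with the getter/setter point: a DOM written in JavaScript has to expose many properties whose values are computed on every access. The sketch below uses a hypothetical `ToyNode` class (not taken from jsdom or libxmljs) to show what such accessor-heavy code looks like. It makes no claim about how fast or slow this is.

```javascript
// Hypothetical toy node: `children` is real data, the rest is derived on demand.
class ToyNode {
  constructor(name) {
    this.name = name;
    this.children = [];
    this.parent = null;
  }

  appendChild(child) {
    child.parent = this;
    this.children.push(child);
    return child;
  }

  // Accessors: each read runs a function instead of loading a stored field.
  get firstChild() {
    return this.children.length > 0 ? this.children[0] : null;
  }

  get lastChild() {
    const n = this.children.length;
    return n > 0 ? this.children[n - 1] : null;
  }

  get nextSibling() {
    if (this.parent === null) return null;
    const siblings = this.parent.children;
    const index = siblings.indexOf(this);
    return index + 1 < siblings.length ? siblings[index + 1] : null;
  }
}

const body = new ToyNode("body");
const first = body.appendChild(new ToyNode("p"));
const second = body.appendChild(new ToyNode("div"));

console.log(body.firstChild.name);   // p
console.log(first.nextSibling.name); // div
console.log(second.nextSibling);     // null
```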

Mikeal Rogers

Dec 8, 2010, 3:56:02 PM
to nod...@googlegroups.com
We've got a lot of browser vendors who already implemented the DOM in C and C++ and include lots of good optimizations. If we can decouple the code that actually displays stuff from one of them I think it could be a better solution.

I know Tom Hughes-Croucher has advocated this as well. At first I was resistant because I thought it might be easier/faster to do it in pure js, but the more I think about it the more I realize how many modern optimizations there are around things like css selectors that are already in the browser implementations.

Someone wanna report on how hard this might be if we wanted to rip code out of WebKit or Firefox?

-Mikeal


Jorge Ventura

Dec 8, 2010, 4:39:01 PM
to nod...@googlegroups.com
I guess that decoupling the rendering parts is hard.
This link is a good reference that explains something about it.

http://taligarsiel.com/Projects/howbrowserswork1.htm

One point worth mentioning is that Envjs already does this: when it loads the page, you get the whole DOM tree with the inline javascript processed. Unfortunately, today it runs only on the slow Envjs/rhino/java stack.  The maintainer is working on a new version that runs on nodejs. In my opinion this is the most promising option for this problem in the near future.

Maybe it would be an idea to port the parser from envjs to jsdom.

Ventura

Nick Husher

Dec 8, 2010, 3:21:01 PM
to nod...@googlegroups.com
That only covers document.write, which isn't used very much anymore. It definitely isn't used for AJAX data, since using document.write after the page is loaded (IIRC) will wipe the document out and replace it with the string passed as an argument.

I think you're looking at the wrong tool for the job. A company I worked for once used Selenium to do screen scraping, since it lets you instantiate a browser and run it against a website. I seem to remember that this was a bit of a hack--Selenium is meant as a browser testing framework--but a similar approach might work.

To do it right, you will probably need to figure out how to spawn a headless browser, which might involve fooling around with webkit or gecko bindings, and then operate on the DOM trees it returns. I don't think this is something you can do in nodeJS alone.

What data are you trying to get ahold of? A lot of websites have APIs you can use to access their AJAX data.
____________________

Nicholas Husher
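
The point about calling document.write after the page has loaded can be illustrated with a small state model. This is a sketch, not browser or jsdom code: `ToyDocument` and its `loading` flag are hypothetical, and the content is a plain string. It assumes the behaviour described above, where a write to a finished document reopens it and discards what was there.

```javascript
// Hypothetical toy document: `loading` is true while the page is still being parsed.
class ToyDocument {
  constructor() {
    this.loading = true;
    this.content = "";
  }

  write(html) {
    if (!this.loading) {
      // Writing to a finished document implicitly reopens it,
      // which discards the existing content.
      this.content = "";
      this.loading = true;
    }
    this.content += html;
  }

  close() {
    this.loading = false;
  }
}

const doc = new ToyDocument();
doc.write("<p>written while parsing</p>");
doc.write("<p>also kept</p>");
const duringParse = doc.content;
doc.close(); // page finished loading

doc.write("<p>late write</p>");
console.log(doc.content); // <p>late write</p>
```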

Jorge Ventura

Dec 8, 2010, 5:20:12 PM
to nod...@googlegroups.com
Envjs does what I need; the only problem is rhino/java. The scraper will be installed on a small ARM processor, so I can't use something like Selenium. Even a JVM on such a machine is a nightmare. jsdom/nodejs has already been used successfully for static pages, and I guess it may be close to working with javascript.

I think you are right about document.write; I used it only as an example.

Justin Cormack

Dec 8, 2010, 5:33:25 PM
to nod...@googlegroups.com
But why can't you use an embedded headless webkit browser?

Justin

Jorge Ventura

Dec 8, 2010, 5:42:02 PM
to nod...@googlegroups.com
This is part of the back end of an SNMP agent: it collects the data from a console (an ajax application) and fills out a MIB record to be transmitted to another remote application for monitoring and maintenance requests.

Jorge Ventura

Dec 8, 2010, 5:45:56 PM
to nod...@googlegroups.com
Sorry Justin, I think I misunderstood your question.
I hadn't considered a headless webkit browser simply because I didn't know it was possible.

Do you have any references?


Marco Rogers

Dec 8, 2010, 6:12:20 PM
to nodejs
Mikeal, In spirit I agree with you. But everything I've read makes it
sound like it would be a nightmare to do that kind of extraction. v8
is pretty bloated primarily because it's designed to be embedded in a
browser environment. The DOM is even worse. I'll try to find some of
the stuff I'm remembering. The other thing is that the xml parsing in
browsers is iffy and it doesn't give you the level of control in
parsing and error feedback that libxml2 does. On the other hand, the
html4 support in libxml2 is pretty decent, but not as good as browser
engines. Plus it's not really being updated. There don't seem to be
any plans to implement an html5 parser.

In a perfect world, we'd have two separate parsing engines that are
geared towards xml or html and the same api around both. Who's
volunteering for that ;) ?

:Marco


Jorge Ventura

Dec 8, 2010, 6:56:21 PM
to nod...@googlegroups.com
Justin,
I found libqt4-webkit. I think that can help me. I will check.

Thank you.


Dean Mao

Dec 8, 2010, 7:16:39 PM
to nod...@googlegroups.com
I believe you can spawn a headless firefox instance (often used for selenium testing) with a firefox plugin that opens a websocket connection to your scraping engine, and then have the engine send firefox a bunch of commands to scrape content.  This is probably overengineering the solution though...

Jorge Ventura

Dec 8, 2010, 7:29:07 PM
to nod...@googlegroups.com
"Headless firefox instance": I had never heard of that before. This is another restriction I have: I don't think I can have the X libraries, because the small box I am using has no graphics card, and I can reach it only over the network or serial lines. I guess that firefox doesn't start without the X libraries.

I would appreciate any references you have.

Thanks.

Dean Mao

Dec 8, 2010, 7:43:35 PM
to nod...@googlegroups.com
This is approximately what I did the last time I ran selenium headless using firefox:


You won't need a graphics card.  However, it's probably memory intensive if you have a bunch of firefox instances running for your scraping needs.  

Jorge Ventura

Dec 8, 2010, 7:49:01 PM
to nod...@googlegroups.com
That's something I have to check. No, I will have only one instance at a time: the MIB library is not re-entrant, so I have to run one process at a time. I have 512 MB, which may be enough.

Thank you.

Justin Cormack

Dec 9, 2010, 3:14:08 AM
to nod...@googlegroups.com
qt-webkit can run without a real X server; you can run it e.g. backed by just an image file.

Justin

Jorge Ventura

Dec 9, 2010, 6:24:33 AM
to nod...@googlegroups.com

Yes, I was checking. It's all I need. You can use QApplication(argc, argv, false); the third parameter specifies that you are not using a GUI, so the library doesn't connect to X.
I am using an ARM box running Debian Lenny and everything is there; I installed it last night.
I had been looking for something like this for over seven months.

Thank you so much.


Rob Colburn

Dec 9, 2010, 11:22:52 AM
to nodejs
Jorge,

Looks like you've found your solution.

Though, I'm just a bit curious.

I think you described a situation where there is an existing ajax
application in place. You're building a series of small ARM devices
that can perform the necessary actions without a GUI. However, you
are not able to extend the existing application to provide an API.
So, the solution is to have the devices spawn a Node process, spawn a
headless browser, and perform the necessary actions on the ajax app.

Assuming this is the case, would it be possible instead to add a
new server just to interact with the existing ajax app? Then you can
have the new server spawn the headless browser, and provide an API for
the devices. I say this because the original solution seems a little
fragile. If the maintainers of the existing ajax application
refactor, then you'll need to patch all of the devices, right?


greim

Dec 10, 2010, 6:01:59 PM
to nodejs