Message from discussion Bug when optimisation is not set to -1
From: Cefn <google....@cefn.com>
Date: Thu, 6 May 2010 03:46:57 -0700 (PDT)
Local: Thurs, May 6 2010 6:46 am
Subject: [env-js] Re: Bug when optimisation is not set to -1
I should have been clearer. Strictly speaking, the load and parse time
is around a second. It takes another second and a half if JavaScript in
the page is turned on with...

   scriptTypes: {
      'text/javascript': true
   }

...and indeed this is the scenario we need env.js for.

Some pages sadly operate ONLY when JavaScript is active, so they can
only be crawled properly by evaluating that too. I'm guessing this
timing would be cut down massively by JavaScript compilation.

I've been testing this timing by simply re-running the lines from the
Envjs call onwards. That is, I load the libraries only once with one
evaluate call, then evaluate the 'load and scrape' half of the script
over and over in a Java while loop, which gives an indication of
timings for the load and scrape part.
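
As a rough sketch of that measurement, assuming nothing about the actual harness: the class and method names below are made up for illustration, and the two Runnables are placeholders for the real Rhino evaluate calls (one for the library load, one for the 'load and scrape' half). The one-off setup is deliberately left out of the timed region.

```java
// Hypothetical sketch: ScrapeTimer and meanMillis are illustrative names,
// not part of env.js, Rhino, or the pastebin harness.
class ScrapeTimer {

    // Runs setupOnce a single time (untimed), then times `runs` repetitions
    // of `repeated` and returns the mean wall-clock time per run in ms.
    static double meanMillis(Runnable setupOnce, Runnable repeated, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        // Placeholder for evaluating env.rhino.js and sizzle.js once.
        setupOnce.run();

        long start = System.nanoTime();
        int done = 0;
        while (done < runs) {
            // Placeholder for evaluating the Envjs(...) + Sizzle(...) half.
            repeated.run();
            done++;
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        double mean = meanMillis(
                () -> System.out.println("libraries loaded (placeholder)"),
                () -> {
                    // Dummy work standing in for one page load and scrape.
                    StringBuilder sb = new StringBuilder();
                    for (int i = 0; i < 10_000; i++) {
                        sb.append(i);
                    }
                },
                10);
        System.out.printf("mean per run: %.3f ms%n", mean);
    }
}
```

Averaging over several runs smooths out JVM warm-up, although the first few iterations will still be slower than the rest.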

I'm usually on a dual-core machine, so I might get up to a twofold
speedup if Java can use both cores. Other than multicore, I don't see
how threading is going to improve results, since it's actually
processing time which is the bottleneck.
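
For what it's worth, here is a minimal sketch of what one worker per core could look like, using only the standard java.util.concurrent classes. ParallelScraper, scrapeAll and the scrapeOne function are hypothetical names, and scrapeOne is a stand-in for the real per-page work. In Rhino each worker thread would need to enter its own Context, which this sketch does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from env.js or Rhino.
class ParallelScraper {

    // Applies scrapeOne to every URL on a fixed pool of `workers` threads
    // and returns the results in the same order as the input list.
    static <R> List<R> scrapeAll(List<String> urls,
                                 Function<String, R> scrapeOne,
                                 int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<CompletableFuture<R>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(CompletableFuture.supplyAsync(
                        () -> scrapeOne.apply(url), pool));
            }
            List<R> results = new ArrayList<>();
            for (CompletableFuture<R> future : pending) {
                results.add(future.join()); // blocks until that page is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // One worker per core, since the work is CPU-bound.
        int cores = Runtime.getRuntime().availableProcessors();
        List<String> urls = Arrays.asList(
                "http://alistapart.com/",
                "http://example.com/page1",   // made-up URLs for illustration
                "http://example.com/page2");
        // Placeholder scrape step; a real one would evaluate the script.
        List<String> results = scrapeAll(urls, url -> "scraped: " + url, cores);
        for (String line : results) {
            System.out.println(line);
        }
    }
}
```

Sizing the pool to the core count reflects the point above: with CPU-bound work, extra threads beyond the number of cores should not buy anything.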

If there's a SpiderMonkey alternative in the long run, it would mean I
can keep developing on this and switch when it's critical. It would be
great to hear more about that alternative.

Cefn

On May 6, 3:56 am, chris thatcher <thatcher.christop...@gmail.com>
wrote:

> How big is the page?  At some point soon we may have the optimization issue
> worked out, but in reality the optimization won't substantially change the
> parse time; it might shave off a second for an 'average' sized page.  The
> catch-22 is that the HTML5 parser is the reason we can't optimize yet, and
> if you are really crawling HTML in the wild you wouldn't get very far
> without it.  Rhino is slower than its C counterparts, which is why envjs
> developers have a number of other supported platforms at various stages of
> readiness.

> In general the platform you choose will be based on the project
> requirements. If you don't need access to Java APIs, you might want to ask
> Steven Parkes or nickg about their integrations with SpiderMonkey.

> For env.rhino.js your best solution in the near term is simply to thread a
> number of crawlers/scrapers to optimize processing time.

> Thatcher

> On Wed, May 5, 2010 at 3:26 PM, Cefn <google....@cefn.com> wrote:
> > I'm using the rhino/js.jar distributed with env.js, and I found some
> > bizarre behaviour from invocation with any optimization turned on.

> > If I run the following script, js/bt/test.js

> > load("js/envjs/dist/env.rhino.js");
> > load("js/libs/sizzle.js");
> > Envjs("http://alistapart.com/", {
> >    logLevel: Envjs.DEBUG,
> >    scriptTypes: {
> >       'text/javascript': true
> >    }
> > });
> > (function(){
> >        var results = Sizzle("h4.title a");
> >        for(var count = 0; count < results.length; count++){
> >                print(results[count].childNodes[0].data);
> >        }
> > })();

> > using the harness at http://pastebin.com/zAmJpKNC (note the line
> > setOptimizationLevel(-1);) it correctly reports the titles of
> > Alistapart. A typical invocation is...

> > java -cp bin:js.jar Scraper

> > ...where bin contains the Scraper class, js.jar is rhino, and js/envjs/
> > dist/ contains the contents of the dist/ directory from envjs.

> > If I comment out the setOptimizationLevel(-1); line, then it throws errors
> > about missing JavaScript references, and doesn't even get to load the
> > page. Here's an example error it throws, just from changing that line.

> > java -cp bin:js.jar Scraper
> > Exception in thread "main" org.mozilla.javascript.EcmaError:
> > ReferenceError: "Envjs" is not defined. (test.js#2)

> > This is obviously fatal to using the framework with any acceleration.
> > The problem with running it unoptimized as a scraper is that it takes
> > 2.5 seconds to complete the parse operation alone (not counting env.js
> > and sizzle.js startup costs).

> > It would mean a lot to have it run optimized, which is why I tried
> > Sizzle instead of jQuery - to see if I could come in under the 64KB
> > limit and turn optimization back on.

> > Perhaps I can avoid the env.js and sizzle.js startup costs for page+1,
> > page+2 at different URLs by just reusing the same Rhino Context.
> > Experts on this list may know a good way to do this, and I'd welcome
> > suggestions, especially given I may have to run this framework using
> > its slowest possible setting.

> > Does anyone know what my options are to get jQuery-style selectors in
> > env.js without taking more than two seconds to process a single page?
> > I have hundreds of pages to process.

> > --
> > You received this message because you are subscribed to the Google Groups
> > "Env.js" group.
> > To post to this group, send email to envjs@googlegroups.com.
> > To unsubscribe from this group, send email to
> > envjs+unsubscribe@googlegroups.com <envjs%2Bunsubscribe@googlegroups.com>.
> > For more options, visit this group at
> >http://groups.google.com/group/envjs?hl=en.

> --
> Christopher Thatcher
