Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Parsing perf research
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  7 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Peter van der Zee  
View profile  
 More options Jul 15 2012, 12:01 pm
From: Peter van der Zee <qfo...@gmail.com>
Date: Sun, 15 Jul 2012 18:01:31 +0200
Local: Sun, Jul 15 2012 12:01 pm
Subject: Parsing perf research
I've been doing some parsing research.

I've set up a jsperf that tries to test various ways of parsing input,
character by character, and acting accordingly.

http://jsperf.com/parsing-tests/4

This is a tricky test as it covers a few things. As such it's not a
micro-unit test that tests the difference between ++x or x++. Here's
what I've tried to cover:

"Get char from string", as string or number

(these are tested on all comparison tests, where applicable)
str[index]
str.charAt(index)
str.charCodeAt(index)

(these are only tested once, with the if-else, for completeness)
str.substring(index, index+1)
str.substr(index, 1)
str.slice(index, index+1)
implicit coercion string comparison
dynamic regex: new RegExp('.{index}(.)')[1] match and exec
regex per group
str.split('')
split foreach
foreach.apply(str, func)
str.replace

"Is char part of a set?"

switch
if-else
object as hash
array as hash
in object
one big regex
array.indexOf
string.indexOf

The results are certainly interesting, though it's fairly safe to say
that the "manual if-else" case wins.

If you have any other relevant tests to add in either category, please
tell me and I'll add them. I want to be as complete as possible on
this one.

In case you didn't recognize the main gist of the test, it's to
determine the start of the next token in JS source (ex the identifiers
and punctuators).

Preliminary results show that the manual if-else case is the best way
to go for char-by-char parsing. You would expect switch to perform
similar, but that really depends on the platform being tested. There's
no clear general second place to me.

Oddities:
Safari on the iPad (2 and 3) as well as on windows (5.1) prefer using
an array as a hash (by number). Kind of surprising since it makes a
sparse array with length=0xfff but only contains four elements. Safari
on the mac does not seem to have this bias.
IE9 performed kind of pathetic (on an idle decent machine). IE10 did
much better, although there are clear cases of optimization focus
there.
Firefox needs to work on switches.
JSperf needs bigger charts if there are many test cases :p

Any additions, comments and test runs are most welcome. Blog post coming soon.

Thanks for your help,

- peter


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ariya Hidayat  
View profile  
 More options Jul 15 2012, 2:04 pm
From: Ariya Hidayat <ariya.hida...@gmail.com>
Date: Sun, 15 Jul 2012 11:04:58 -0700
Local: Sun, Jul 15 2012 2:04 pm
Subject: Re: [js-tools] Parsing perf research
Manual range checking is what I found out to be the fastest for
Esprima. This leads to a function like this:

    function isWhiteSpace(ch) {
        return (ch === ' ') || (ch === '\u0009') || (ch === '\u000B') ||
            (ch === '\u000C') || (ch === '\u00A0') ||
            (ch.charCodeAt(0) >= 0x1680 &&
             '\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u 200A\u202F\u205F\u3000\uFEFF'.indexOf(ch)

>= 0);

    }

(Note that once you want to support Unicode, things get more complicated).

As a perspective, rather than a microbenchmark, I prefer running the
real benchmark suite (which involves popular libraries such as jQuery
in the test corpus) and monitor the speed impact on the total running
time of the parser. This helps me to avoid getting trapped in some
local maximum and also permits the analysis at the lower-level, close
to the JavaScript engine. Often this is more entertaining, instead of
shooting in the dark I got to learn why some code constructs provokes
the slow code path.

Some of my own discovery can be read in the following blog posts:

http://ariya.ofilabs.com/2012/02/javascript-branching-and-code-shuffl...
http://ariya.ofilabs.com/2012/02/javascript-object-structure-speed-ma...
http://ariya.ofilabs.com/2012/04/javascript-switch-case-deoptimizatio...

Thanks!

Regards,

--
Ariya Hidayat, http://ariya.ofilabs.com
http://twitter.com/ariyahidayat


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Peter van der Zee  
View profile  
 More options Jul 15 2012, 3:37 pm
From: Peter van der Zee <qfo...@gmail.com>
Date: Sun, 15 Jul 2012 21:37:32 +0200
Local: Sun, Jul 15 2012 3:37 pm
Subject: Re: [js-tools] Parsing perf research

On Sun, Jul 15, 2012 at 8:04 PM, Ariya Hidayat <ariya.hida...@gmail.com> wrote:
> Manual range checking is what I found out to be the fastest for
> Esprima. This leads to a function like this:

>     function isWhiteSpace(ch) {
>         return (ch === ' ') || (ch === '\u0009') || (ch === '\u000B') ||
>             (ch === '\u000C') || (ch === '\u00A0') ||
>             (ch.charCodeAt(0) >= 0x1680 &&
>              '\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u 200A\u202F\u205F\u3000\uFEFF'.indexOf(ch)
>>= 0);
>     }

From the tests it seems like manual else-ifs are still faster than
indexOf. Optimization is a tricky business of course.

> (Note that once you want to support Unicode, things get more complicated).

> As a perspective, rather than a microbenchmark, I prefer running the
> real benchmark suite (which involves popular libraries such as jQuery

While I appreciate that, it's a lot of work to rewrite the code to try
every method. I think this is a fairly representative set of tests.

Interesting stuff, cheers :)

- peter


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ariya Hidayat  
View profile  
 More options Jul 16 2012, 10:59 am
From: Ariya Hidayat <ariya.hida...@gmail.com>
Date: Mon, 16 Jul 2012 07:59:56 -0700
Local: Mon, Jul 16 2012 10:59 am
Subject: Re: [js-tools] Parsing perf research

> From the tests it seems like manual else-ifs are still faster than
> indexOf. Optimization is a tricky business of course.

The isWhiteSpace() is just part of the equation. If it is about to be
plugged in a similar scenario like your benchmark, there will be the
conditional:

  if (isWhiteSpace(ch)) ++whitespace;

There are similar functions like that in Esprima, e.g.
isIdentifierStart, isHexDigit, etc.

Seems that your benchmark does not include identifier check yet. I
would have added that, checking for identifier (and then checking it
again whether it is a keyword or not) is quite important for a parser.

> While I appreciate that, it's a lot of work to rewrite the code to try
> every method. I think this is a fairly representative set of tests.

Depending on the parser code. In the above example, I can play with
different implementations of isWhiteSpace (or isIdentifierStart or
other functions for that matter), run the benchmark with V8 shell to
see the complete running time, rerun it again with V8 in debug to
watch for possible bail out and other deoptimizations.

Regards,

--
Ariya Hidayat, http://ariya.ofilabs.com
http://twitter.com/ariyahidayat


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Peter van der Zee  
View profile  
 More options Jul 17 2012, 6:37 pm
From: Peter van der Zee <qfo...@gmail.com>
Date: Wed, 18 Jul 2012 00:37:46 +0200
Local: Tues, Jul 17 2012 6:37 pm
Subject: Re: [js-tools] Parsing perf research
What do you reckon an engine would do when I'd translate the huge
unicode identifier checker to an even bigger (but equivalent) if-else
structure? Would it cope or would it choke?

I just rewrote my punctuator checker and got some huge savings
(http://qfox.nl/weblog/259). But I'm afraid that there's a ceiling to
how much optimization one might get from this, especially with the
sheer number of checks you'd have to do for all the valid unicode
chars in an identifier. Even if you took care of checking certain
ranges, or even smarter stuff (there are ranges of "every-other"). At
some point, regex should win.

I'm gonna have a go at converting the unicode regex to an if-else
block automatically later, see if that gets me anywhere.

Especially if you make sure to check ascii (and then maybe "popular"
unicodes) first, the overhead should be minimal anyways.

- peter


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ariya Hidayat  
View profile  
 More options Jul 17 2012, 11:17 pm
From: Ariya Hidayat <ariya.hida...@gmail.com>
Date: Tue, 17 Jul 2012 20:17:19 -0700
Local: Tues, Jul 17 2012 11:17 pm
Subject: Re: [js-tools] Parsing perf research

> What do you reckon an engine would do when I'd translate the huge
> unicode identifier checker to an even bigger (but equivalent) if-else
> structure? Would it cope or would it choke?

> I just rewrote my punctuator checker and got some huge savings
> (http://qfox.nl/weblog/259).

Been there, done that. This tiered comparison is precisely how
Esprima's scanPunctuator() works:
https://github.com/ariya/esprima/blob/master/esprima.js#L539.

It's been a while but last time I checked, the code generated by V8
for such construct is pretty sensible.

> But I'm afraid that there's a ceiling to
> how much optimization one might get from this, especially with the
> sheer number of checks you'd have to do for all the valid unicode
> chars in an identifier. Even if you took care of checking certain
> ranges, or even smarter stuff (there are ranges of "every-other"). At
> some point, regex should win.

It's pretty much my (gross) assumption. We just have to guess that it
is not that common and sacrifice it in the slow path. Here's an
example from Esprima:

  function isIdentifierStart(ch) {
        return (ch === '$') || (ch === '_') || (ch === '\\') ||
            (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
            ((ch.charCodeAt(0) >= 0x80) &&
Regex.NonAsciiIdentifierStart.test(ch));
    }

The regex will be hit only in that uncommon (rare) case. Because of
the shortcut behavior of the logical OR operator, the fast path is not
impacted unless there is a character is that very specific code
points.

--
Ariya Hidayat, http://ariya.ofilabs.com
http://twitter.com/ariyahidayat


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Peter van der Zee  
View profile  
 More options Jul 18 2012, 4:42 am
From: Peter van der Zee <qfo...@gmail.com>
Date: Wed, 18 Jul 2012 10:42:00 +0200
Subject: Re: [js-tools] Parsing perf research

On Wed, Jul 18, 2012 at 5:17 AM, Ariya Hidayat <ariya.hida...@gmail.com> wrote:
> is not that common and sacrifice it in the slow path. Here's an
> example from Esprima:

>   function isIdentifierStart(ch) {
>         return (ch === '$') || (ch === '_') || (ch === '\\') ||
>             (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
>             ((ch.charCodeAt(0) >= 0x80) &&
> Regex.NonAsciiIdentifierStart.test(ch));
>     }

> The regex will be hit only in that uncommon (rare) case. Because of
> the shortcut behavior of the logical OR operator, the fast path is not
> impacted unless there is a character is that very specific code
> points.

Yeah that's how I did it too. I guess the case is rare enough not to
worry about it too much. By the way, you should really start comparing
numbers rather than strings. It's so much faster.

Cheers

- peter


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »