Parsing perf research

38 views
Skip to first unread message

Peter van der Zee

unread,
Jul 15, 2012, 12:01:31 PM7/15/12
to js-t...@googlegroups.com
I've been doing some parsing research.

I've set up a jsperf that tries to test various ways of parsing input,
character by character, and acting accordingly.

http://jsperf.com/parsing-tests/4

This is a tricky test as it covers a few things. As such it's not a
micro-unit test that tests the difference between ++x or x++. Here's
what I've tried to cover:

"Get char from string", as string or number

(these are tested on all comparison tests, where applicable)
str[index]
str.charAt(index)
str.charCodeAt(index)

(these are only tested once, with the if-else, for completeness)
str.substring(index, index+1)
str.substr(index, 1)
str.slice(index, index+1)
implicit coercion string comparison
dynamic regex: new RegExp('.{index}(.)')[1] match and exec
regex per group
str.split('')
split foreach
foreach.apply(str, func)
str.replace

"Is char part of a set?"

switch
if-else
object as hash
array as hash
in object
one big regex
array.indexOf
string.indexOf

The results are certainly interesting, though it's fairly safe to say
that the "manual if-else" case wins.

If you have any other relevant tests to add in either category, please
tell me and I'll add them. I want to be as complete as possible on
this one.

In case you didn't recognize the main gist of the test, it's to
determine the start of the next token in JS source (ex the identifiers
and punctuators).

Preliminary results show that the manual if-else case is the best way
to go for char-by-char parsing. You would expect switch to perform
similar, but that really depends on the platform being tested. There's
no clear general second place to me.

Oddities:
Safari on the iPad (2 and 3) as well as on windows (5.1) prefer using
an array as a hash (by number). Kind of surprising since it makes a
sparse array with length=0xfff but only contains four elements. Safari
on the mac does not seem to have this bias.
IE9 performed kind of pathetic (on an idle decent machine). IE10 did
much better, although there are clear cases of optimization focus
there.
Firefox needs to work on switches.
JSperf needs bigger charts if there are many test cases :p

Any additions, comments and test runs are most welcome. Blog post coming soon.

Thanks for your help,

- peter

Ariya Hidayat

unread,
Jul 15, 2012, 2:04:58 PM7/15/12
to js-t...@googlegroups.com
Manual range checking is what I found out to be the fastest for
Esprima. This leads to a function like this:

function isWhiteSpace(ch) {
return (ch === ' ') || (ch === '\u0009') || (ch === '\u000B') ||
(ch === '\u000C') || (ch === '\u00A0') ||
(ch.charCodeAt(0) >= 0x1680 &&
'\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\uFEFF'.indexOf(ch)
>= 0);
}

(Note that once you want to support Unicode, things get more complicated).

As a perspective, rather than a microbenchmark, I prefer running the
real benchmark suite (which involves popular libraries such as jQuery
in the test corpus) and monitor the speed impact on the total running
time of the parser. This helps me to avoid getting trapped in some
local maximum and also permits the analysis at the lower-level, close
to the JavaScript engine. Often this is more entertaining, instead of
shooting in the dark I got to learn why some code constructs provokes
the slow code path.

Some of my own discovery can be read in the following blog posts:

http://ariya.ofilabs.com/2012/02/javascript-branching-and-code-shuffling.html
http://ariya.ofilabs.com/2012/02/javascript-object-structure-speed-matters.html
http://ariya.ofilabs.com/2012/04/javascript-switch-case-deoptimization.html


Thanks!

Regards,

--
Ariya Hidayat, http://ariya.ofilabs.com
http://twitter.com/ariyahidayat

Peter van der Zee

unread,
Jul 15, 2012, 3:37:32 PM7/15/12
to js-t...@googlegroups.com
On Sun, Jul 15, 2012 at 8:04 PM, Ariya Hidayat <ariya....@gmail.com> wrote:
> Manual range checking is what I found out to be the fastest for
> Esprima. This leads to a function like this:
>
> function isWhiteSpace(ch) {
> return (ch === ' ') || (ch === '\u0009') || (ch === '\u000B') ||
> (ch === '\u000C') || (ch === '\u00A0') ||
> (ch.charCodeAt(0) >= 0x1680 &&
> '\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\uFEFF'.indexOf(ch)
>>= 0);
> }

From the tests it seems like manual else-ifs are still faster than
indexOf. Optimization is a tricky business of course.

> (Note that once you want to support Unicode, things get more complicated).
>
> As a perspective, rather than a microbenchmark, I prefer running the
> real benchmark suite (which involves popular libraries such as jQuery

While I appreciate that, it's a lot of work to rewrite the code to try
every method. I think this is a fairly representative set of tests.
Interesting stuff, cheers :)

- peter

Ariya Hidayat

unread,
Jul 16, 2012, 10:59:56 AM7/16/12
to js-t...@googlegroups.com
> From the tests it seems like manual else-ifs are still faster than
> indexOf. Optimization is a tricky business of course.

The isWhiteSpace() is just part of the equation. If it is about to be
plugged in a similar scenario like your benchmark, there will be the
conditional:

if (isWhiteSpace(ch)) ++whitespace;

There are similar functions like that in Esprima, e.g.
isIdentifierStart, isHexDigit, etc.

Seems that your benchmark does not include identifier check yet. I
would have added that, checking for identifier (and then checking it
again whether it is a keyword or not) is quite important for a parser.

> While I appreciate that, it's a lot of work to rewrite the code to try
> every method. I think this is a fairly representative set of tests.

Depending on the parser code. In the above example, I can play with
different implementations of isWhiteSpace (or isIdentifierStart or
other functions for that matter), run the benchmark with V8 shell to
see the complete running time, rerun it again with V8 in debug to
watch for possible bail out and other deoptimizations.

Peter van der Zee

unread,
Jul 17, 2012, 6:37:46 PM7/17/12
to js-t...@googlegroups.com
What do you reckon an engine would do when I'd translate the huge
unicode identifier checker to an even bigger (but equivalent) if-else
structure? Would it cope or would it choke?

I just rewrote my punctuator checker and got some huge savings
(http://qfox.nl/weblog/259). But I'm afraid that there's a ceiling to
how much optimization one might get from this, especially with the
sheer number of checks you'd have to do for all the valid unicode
chars in an identifier. Even if you took care of checking certain
ranges, or even smarter stuff (there are ranges of "every-other"). At
some point, regex should win.

I'm gonna have a go at converting the unicode regex to an if-else
block automatically later, see if that gets me anywhere.

Especially if you make sure to check ascii (and then maybe "popular"
unicodes) first, the overhead should be minimal anyways.

- peter

Ariya Hidayat

unread,
Jul 17, 2012, 11:17:19 PM7/17/12
to js-t...@googlegroups.com
> What do you reckon an engine would do when I'd translate the huge
> unicode identifier checker to an even bigger (but equivalent) if-else
> structure? Would it cope or would it choke?
>
> I just rewrote my punctuator checker and got some huge savings
> (http://qfox.nl/weblog/259).

Been there, done that. This tiered comparison is precisely how
Esprima's scanPunctuator() works:
https://github.com/ariya/esprima/blob/master/esprima.js#L539.

It's been a while but last time I checked, the code generated by V8
for such construct is pretty sensible.

> But I'm afraid that there's a ceiling to
> how much optimization one might get from this, especially with the
> sheer number of checks you'd have to do for all the valid unicode
> chars in an identifier. Even if you took care of checking certain
> ranges, or even smarter stuff (there are ranges of "every-other"). At
> some point, regex should win.

It's pretty much my (gross) assumption. We just have to guess that it
is not that common and sacrifice it in the slow path. Here's an
example from Esprima:

function isIdentifierStart(ch) {
return (ch === '$') || (ch === '_') || (ch === '\\') ||
(ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
((ch.charCodeAt(0) >= 0x80) &&
Regex.NonAsciiIdentifierStart.test(ch));
}

The regex will be hit only in that uncommon (rare) case. Because of
the shortcut behavior of the logical OR operator, the fast path is not
impacted unless there is a character is that very specific code
points.

Peter van der Zee

unread,
Jul 18, 2012, 4:42:00 AM7/18/12
to js-t...@googlegroups.com
On Wed, Jul 18, 2012 at 5:17 AM, Ariya Hidayat <ariya....@gmail.com> wrote:
> is not that common and sacrifice it in the slow path. Here's an
> example from Esprima:
>
> function isIdentifierStart(ch) {
> return (ch === '$') || (ch === '_') || (ch === '\\') ||
> (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
> ((ch.charCodeAt(0) >= 0x80) &&
> Regex.NonAsciiIdentifierStart.test(ch));
> }
>
> The regex will be hit only in that uncommon (rare) case. Because of
> the shortcut behavior of the logical OR operator, the fast path is not
> impacted unless there is a character is that very specific code
> points.

Yeah that's how I did it too. I guess the case is rare enough not to
worry about it too much. By the way, you should really start comparing
numbers rather than strings. It's so much faster.

Cheers

- peter
Reply all
Reply to author
Forward
0 new messages