best performing way to write a scanner's state machine?

145 views
Skip to first unread message

Brian Slesinsky

unread,
Dec 14, 2014, 5:01:22 PM12/14/14
to mi...@dartlang.org
Has anyone done any benchmarking on techniques for writing state machines in Dart?

I'm writing a template engine for fun and I've been copying scanning code from the html5lib package as a way to get started. Its state machine represents states as functions (tear-off methods, actually) and it uses a trampoline to go to the next state.

This seems well-organized, but since the html5lib package was ported from Python, I'm wondering if this is actually the best-performing way to represent a state machine in Dart? (And here I care about performance of JavaScript emitted by dart2js, not the VM.)

I have similar questions about how best to scan the characters in a string, if anyone wants to chime in.

The equivalent code in Go's html parser tries pretty hard to avoid unnecessary allocations, but perhaps that's not as much of an improvement in Dart-generated JavaScript?

- Brian

Greg Lowe

unread,
Dec 14, 2014, 10:25:06 PM12/14/14
to mi...@dartlang.org
I wrote a simple mustache template library a couple of years ago. I didn't use an explicit state machine. But feel free to steal ideas - or come up with new ideas of how to make it faster. I haven't done any benchmarking myself - though it seems to do alright in the TechEmpower benchmarks. My main concern was to make sure that the template rendering was efficient, as in an application the results of the template parse can be cached.

Bob Nystrom

unread,
Dec 15, 2014, 11:51:34 AM12/15/14
to General Dart Discussion, Peter von der Ahé, Brian Wilkerson
+ahe, +brian

Peter and Brian have both written high-performance lexers in Dart, so they probably have some expertise on what's fast and what isn't.

- bob

Brian Wilkerson

unread,
Dec 15, 2014, 12:05:46 PM12/15/14
to Bob Nystrom, General Dart Discussion, Peter von der Ahé
Peter and Brian have both written high-performance lexers in Dart, so they probably have some expertise on what's fast and what isn't.

Actually, I wrote my lexer in Java and it was machine translated to Dart. But it doesn't use an explicit state machine, so unfortunately I don't really have any experience implementing one in Dart.

Brian

Peter Ahé

unread,
Dec 15, 2014, 1:20:26 PM12/15/14
to General Dart Discussion
I would start by making a class for each state and try to avoid polymorphism. I would not use tear-offs and trampolines if I could avoid it as I assume it would lead to polymorphism.

If your scanner is written well, you won't need jit'ing to get close to peak performance. Your users may never use your template library long enough for it to become fully optimized. So try to strive for dead simple code with repetitions, not well-structured code with abstractions. Generate the code if you feel you need abstractions.

Sorry, all this is vague. If you send me a code review, I'll try to dump as many ideas into it as I can think of :-)

Brian Slesinsky

unread,
Dec 15, 2014, 9:53:21 PM12/15/14
to General Dart Discussion
Thanks, that's somewhat useful, but I don't quite get how to avoid polymorphism.

Since a lexer is a kind of iterator over tokens, it seems like I need to save the current state and then branch on it to the proper next-state handler when the parser asks for the next token. 

An enum and a switch statement would be one way to go. Or perhaps each state could be a function instead of a method like it is now. I'm not sure how I'd use a class per state if there's no polymorphism?

Or maybe it shouldn't be an iterator? I could generate a list of tokens before running the parser, but I'll still need a loop somewhere so it seems like it amounts to the same thing.

- Brian

--
For other discussions, see https://groups.google.com/a/dartlang.org/
 
For HOWTO questions, visit http://stackoverflow.com/tags/dart
 
To file a bug report or feature request, go to http://www.dartbug.com/new

Greg Lowe

unread,
Dec 15, 2014, 11:08:03 PM12/15/14
to mi...@dartlang.org, br...@slesinsky.org
I'm looking forward to having sync* functions, as it means it should be possible to write scanners like this:

Iterable<Token> nextToken() sync* {
 
switch(_peek()) {
   
case _EOF:
     
return;
   
case _OPEN_MUSTACHE:
     
yield _scanMustacheTag();
   
default:
     
yield _scanText();
 
}
}

Not sure about the sync* syntax yet - that example is probably wrong. And I have no idea what the performance will be like.

Greg Lowe

unread,
Dec 15, 2014, 11:12:10 PM12/15/14
to mi...@dartlang.org, br...@slesinsky.org
Actually that example is hopelessly broken, but hopefully you get the idea.

Take #2.

Token nextToken() sync* {
 
int c;
 
while ((c = _peek()) != _EOF) {
   
switch(c) {

Brian Slesinsky

unread,
Dec 16, 2014, 12:06:00 AM12/16/14
to Greg Lowe, General Dart Discussion
It seems like that will compile down to something complicated as JavaScript, though?

Another question is whether to work with runes or code points. The html5lib library uses code points and translates each code point back to one-character strings, but there's a TODO change this.

It is actually too early to think seriously about performance since I'm still in the "make it work" phase. I'm just hoping to pick an overall structure that I won't have to rewrite later.

- Brian

Justin Fagnani

unread,
Dec 16, 2014, 2:18:29 PM12/16/14
to General Dart Discussion, Greg Lowe
On Mon, Dec 15, 2014 at 9:05 PM, Brian Slesinsky <br...@slesinsky.org> wrote:
It seems like that will compile down to something complicated as JavaScript, though?

Another question is whether to work with runes or code points. The html5lib library uses code points and translates each code point back to one-character strings, but there's a TODO change this.

To unsubscribe from this group and stop receiving emails from it, send an email to misc+uns...@dartlang.org.
Reply all
Reply to author
Forward
0 new messages