Help needed - a faster parser for Cucumber (Gherkin, Ragel)

389 views
Skip to first unread message

aslak hellesoy

unread,
Aug 31, 2009, 11:45:32 PM8/31/09
to cu...@googlegroups.com
The part of Cucumber that parses .feature files is rather slow. If you have several thousands of steps you have to wait around a minute just for the parsing of all your .feature files - and this is before the execution even begins.

This is no good. The main reason it's so slow is because Cucumber uses Treetop, which is not a very fast parser. Treetop is also a memory hog - it generates a *lot* of objects while parsing. I did some profiling with ruby-prof and it wasn't pretty.

We need a faster parser. I'm currently looking at Ragel (http://www.complang.org/ragel/) - a super fast state machine compiler. It's used by Mongrel, Thin, RedCloth and Hpricot to name a few, so it has a good track record in the Ruby community.

I have started a new project on GitHub: http://github.com/aslakhellesoy/gherkin
When/if this project is able to parse features in the same way that Cucumber's Treetop parser does today, we'll switch. Users would notice a (massive) speed improvement while Cucumber's behaviour stays the same.

There is only a minor catch. I don't have bandwidth to do the heavy lifting myself, so I'm reaching out. Who wants to take this new project and evolve it into a parser that understands the current Gherkin syntax? You would earn tremendous kudos from the Cucumber community for doing this. Previous experience with Ragel is not a must, but definitely a plus.

Let us know if you want to help. The more the merrier.

Cheers,
Aslak

Mike Sassak

unread,
Sep 1, 2009, 12:58:36 AM9/1/09
to cu...@googlegroups.com

I don't know Ragel well enough to lead anything, but I'm not a
complete stranger to it, either. I'd love to help in any way I can.

Mike

aslak hellesoy

unread,
Sep 1, 2009, 2:07:14 AM9/1/09
to cu...@googlegroups.com

Great! I think a good first step would be to implement the table parser. There is a failing spec in the gherkin project. Do you think you could make that pass?

Aslak
 
Mike



Wincent Colaiuta

unread,
Sep 1, 2009, 3:13:26 AM9/1/09
to Cukes
I'm a big fan of Ragel so it's great to hear that you're exploring
this.

One limitation to bare in mind is that as a state machine compiler --
and not a parser generator like Treetop -- out of the box Ragel
doesn't give you the tools you need to parse recursive structures; it
can only "parse" regular languages (ie. what you could pass with a
single regular expression, albeit a large or even huge one).

I haven't looked at the Gherkin grammar yet so I am not sure whether
this is going to present much of a hurdle or not. What this basically
means is that if you have to move beyond regular languages you need to
use your Ragel-generated state machine as a lexer/tokenizer and then
provide your own parser-level machinery for handling the recursive
parts. This is typical done using either recursion (like a recursive
descent parser) or manually managing and tracking a "stack" (Array)
indicating "where you are".

One example of this from my Wikitext module (a wikitext markup to HTML
translator, C Ruby extension, Ragel-generated lexer, see
http://git.wincent.com/wikitext.git), where the Ragel part is
basically just a tokenizer and the C part is a parser/translator that
uses arrays to track "where we are" in the scope/nesting sense, and
therefore which symbols are allowed at any given moment.

Depending on how complex the non-Ragel side of things is will affect
how fast this thing is. Obviously the bigger it is the less speed gain
you'll see. Also, I have no experience with Ruby code generation in
Ragel. I know that the generated C code is _insanely_ fast, but I also
know that you want Cucumber to run everywhere (JRuby etc) so targeting
C isn't an option.

At this stage I don't have any real time to contribute to this but
I'll be watching this thread and try to offer as much help and input
as I can. If more time becomes available later to do real work I'd
certainly be interested in getting on board.

Cheers,
Wincent

Brent Snook

unread,
Sep 1, 2009, 9:37:44 AM9/1/09
to Cukes
I wouldn't mind lending a hand. No experience with Ragel but I can
check it out.

Brent.

aslak hellesoy

unread,
Sep 1, 2009, 9:52:52 AM9/1/09
to cu...@googlegroups.com

Wonderful! Mike has already made my intentionally broken spec pass on his fork (separate branch).
I think Mike will continue that work to complete table parsing. Right, Mike?

The next step after that would be to start parsing features. Starting with a simple feature with 1 scenario and 1 step would be a good start. Then 2 steps. Then 2 scenarios and 2 steps etc. Something you'd want to take a look at?

Mike's current implementation of the table parser *returns* an Array of Array of String. I'm thinking we should maybe use an event listener approach further down the line. This means passing a listener object to the parse method, and have the parser invoke various methods on it while parsing (instead of returning values).

We don't have to worry about i18n translations of the grammar yet - we can handle that later.

Aslak
 
Brent.


Brent Snook

unread,
Sep 1, 2009, 11:24:23 AM9/1/09
to Cukes
I'll take a stab at the simple feature.

Brent.

Mike Sassak

unread,
Sep 1, 2009, 12:08:23 PM9/1/09
to cu...@googlegroups.com
That's correct. I'll let everyone know how it goes.

Mike

<snip>

> Aslak
>
>>
>> Brent.
>>
>
>
> >
>

Mike Sassak

unread,
Sep 1, 2009, 6:00:26 PM9/1/09
to cu...@googlegroups.com
Hi Wincent,

Good to hear someone with Real Ragel Experience is listening. It can
get a bit... hairy, at times. :-)

From what I understand, the intent of Gherkin is to produce a
reference implementation of the parser in Ruby that is used for
development, and then use C for MRI and Java for JRuby.

Mike

> Cheers,
> Wincent
> >
>

Gregory Hnatiuk

unread,
Sep 1, 2009, 7:18:48 PM9/1/09
to cu...@googlegroups.com
I'm going to take a crack at this as well.   I'll probably be working off of Mike's fork for a bit since we're both looking at tables for now and I'm just getting familiar with Ragel.

Greg

aslak hellesoy

unread,
Sep 1, 2009, 7:37:40 PM9/1/09
to cu...@googlegroups.com

The intent is to produce a fast implementation that will replace the treetop parser in Cucumber. Cucumber will depend on the gherkin gem.
The gherkin gem will be packaged for both MRI and JRuby, using a C and Java parser respectively. These will be generated from the same .rl files.

The purpose of the Ruby parser is just to have something that is easier to work with while we're developing. Once we have a decent parser we can scratch the Ruby parser.

A secondary goal would be to produce parsers in all the languages supported by Ragel, so that other BDD tools (like for example JBehave) can use the same parser.

I'm glad to see there is so much interest in helping out with this. If progress happens at the same rate as it has done today we should have something usable within a few weeks I think! Let's hope we don't run into any roadblocks.

Aslak

Mike

> Cheers,
> Wincent
> >
>



Brent Snook

unread,
Sep 2, 2009, 10:28:13 AM9/2/09
to Cukes


On Sep 2, 12:37 am, aslak hellesoy <aslak.helle...@gmail.com> wrote:
A few things have come up and I've been stalled working on the feature
parsing side of things. If someone else wants to take this on then
please go ahead. In the meantime I'll try and get up to speed with
Ragel.

Cheers,

Brent.

Mike Sassak

unread,
Sep 2, 2009, 12:20:42 PM9/2/09
to cu...@googlegroups.com
On Wed, Sep 2, 2009 at 10:28 AM, Brent Snook<br...@fuglylogic.com> wrote:
>
> <snip>

>
> A few things have come up and I've been stalled working on the feature
> parsing side of things. If someone else wants to take this on then
> please go ahead. In the meantime I'll try and get up to speed with
> Ragel.
>
>

Greg and I have a decent-ish handle on the table parsing, so I should
have some time soon to look at the features. Because of the way Ragel
composes state machines, it should be possible to decompose feature
parsing into smaller parts and work on those separately. I have some
ideas for ways to do that, but haven't tried implementing them yet.
I'll let everyone know if/when I get around to testing them out.

Mike

aslak hellesoy

unread,
Sep 3, 2009, 10:40:29 AM9/3/09
to cu...@googlegroups.com

Thanks Mike,

I have brought the JBehave into the loop: http://comments.gmane.org/gmane.comp.java.jbehave.devel/915

One thing that came up is the desire to have a BNF. It would be nice if the gherkin project could produce a BNF once we get a little further.

Aslak
 
Mike



Mike Sassak

unread,
Sep 3, 2009, 11:51:35 AM9/3/09
to cu...@googlegroups.com

A BNF is a great idea--and needed after taking a look at the Gherkin page on the wiki. :-) I saw that JBehave opened up a ticket about providing one. Would you like us to do the same in Lighthouse (for the BNF and also for other outstanding features / issues)? 

Mike
 
Aslak
 
Mike






aslak hellesoy

unread,
Sep 3, 2009, 12:01:03 PM9/3/09
to cu...@googlegroups.com

I think tickets is a great idea to keep track of work and todos. Please use the Github issue tracker in my gherkin repo to register them so we can keep this apart from Cucumber.

Cheers,
Aslak
 
Mike
 
Aslak
 
Mike









Joseph Wilk

unread,
Sep 3, 2009, 12:41:33 PM9/3/09
to cu...@googlegroups.com
aslak hellesoy wrote:
>
>
> On Thu, Sep 3, 2009 at 5:51 PM, Mike Sassak <msa...@gmail.com
> <mailto:msa...@gmail.com>> wrote:
>
> On Thu, Sep 3, 2009 at 10:40 AM, aslak hellesoy
> <aslak.h...@gmail.com <mailto:aslak.h...@gmail.com>> wrote:
>
>
>
> On Wed, Sep 2, 2009 at 6:20 PM, Mike Sassak <msa...@gmail.com
> <mailto:msa...@gmail.com>> wrote:
>
>
> On Wed, Sep 2, 2009 at 10:28 AM, Brent
> Snook<br...@fuglylogic.com <mailto:br...@fuglylogic.com>>
> wrote:
> >
> > <snip>
> >
> > A few things have come up and I've been stalled working
> on the feature
> > parsing side of things. If someone else wants to take
> this on then
> > please go ahead. In the meantime I'll try and get up to
> speed with
> > Ragel.
> >
> >
>

I spent quite a bit of time working with the treetop grammar of Cucumber
and would be happy to help out with the Ragel grammar.

--
Joseph Wilk
http://blog.josephwilk.net
mob: +44(0)7812816431

Mike Sassak

unread,
Sep 3, 2009, 4:51:47 PM9/3/09
to cu...@googlegroups.com

Cool! There are three tasks, I think, that need priority right now: the BNF, feature parsing and test support. Of these, I think it would be awesome to have a BNF, especially considering that it would allow people to play around with this: http://code.haskell.org/bnfc/. Feature parsing depends on the BNF, albeit tangentially, but whatever the case it would be a good help to both of us if we could point to it to resolve ambiguities.

Greg has some basic feature parsing working already, but I don't think there's a good way to divide the work on the feature parser yet without causing conflicts on every merge. That being said, neither of us know enough about Ragel to say what way is the the right way to do anything, so any suggestions, improvements or alternate approaches would be very much appreciated. I've pushed the rudimentary feature parsing to my fork, so you can take a look there if you'd like.

The testing support is just to make development of the parsers easier. It's certainly not as painless right now as it could be. I'll probably be taking a look at that as soon as I'm done with this email.

I've created issues in the official fork for all of the above, plus a few more, and voted up the ones I thought were important. (I think only Aslak can set priority.) I'll add more issues as I can think of them.

Mike
 

aslak hellesoy

unread,
Sep 3, 2009, 7:20:33 PM9/3/09
to cu...@googlegroups.com

Just pulled it down and I'm really impressed by the fast progress here! I have a few comments (you may already have thought about this).

I'm not sure this project should actually provide an implementation of the Listener. The way I have pictured it, Listener is an object that the users of the gherkin library will provide. To use Java terminology - Listener is only an interface. Since Ruby doesn't have interfaces, I think it's enough to just document the interface in RDoc. So for example - Cucumber would provide a AstBuilder object that conforms to that interface.

About the methods on Listener: I'm not crazy about the *_found naming of methods. Instead of calling it step_found - why not just call it step? (They are all "found" so it seems a little verbose).

One last thing. Would it make sense to give the programming language-specific .rl files a "sub" extension? feature.rb.rl instead of feature.rl etc?

I know I'm being picky here, but this is turning out to be so nice that it's not hard to make it "perfect". Great job so far!

but I don't think there's a good way to divide the work on the feature parser yet without causing conflicts on every merge. That being said, neither of us know enough about Ragel to say what way is the the right way to do anything, so any suggestions, improvements or alternate approaches would be very much appreciated.

You look like you have a pretty good grasp on it.
 
I've pushed the rudimentary feature parsing to my fork, so you can take a look there if you'd like.

The testing support is just to make development of the parsers easier. It's certainly not as painless right now as it could be. I'll probably be taking a look at that as soon as I'm done with this email.

I've created issues in the official fork for all of the above, plus a few more, and voted up the ones I thought were important. (I think only Aslak can set priority.) I'll add more issues as I can think of them.


I have added you and Greg as collaborators in my repo. Hopefully that gives you more karma in my tracker. I'll follow a commit bit policy for future contributors.

Should we set up a gherkin user to "own" this project? Or do one of you guys want to take over ownership? It feels wrong that the official repo is mine when you are doing all the hard work.

Aslak

Mike Sassak

unread,
Sep 3, 2009, 10:07:30 PM9/3/09
to cu...@googlegroups.com
I see you already made most of the changes. Good stuff! I put the listener in there, really, so I could see a method in Gherkin returning a Cucumber::Ast::Table object, for the geeky thrill of it all. 

I think projects with their own user only make sense when they're big--Rails, for example. I would be glad to take over ownership if you'd like to be relieved of the official pushing/pulling duties, but I don't mind you having the official repo in the slightest. The only thing that matters to me is that it's as easy on everyone as possible. 

Mike

Joseph Wilk

unread,
Sep 4, 2009, 4:13:30 PM9/4/09
to cu...@googlegroups.com


Sent from my iPhone
One suggestion, I found it useful in Cuke to commit the generated parser code and the Ragel grammar seperatly.

It took quite a while to find a good way to test treetop. Aslak's decision to use sexp to check the correct AST was generated made life a lot easier. 

Of these, I think it would be awesome to have a BNF, especially considering that it would allow people to play around with this: http://code.haskell.org/bnfc/. Feature parsing depends on the BNF, albeit tangentially, but whatever the case it would be a good help to both of us if we could point to it to resolve ambiguities.


Cool I'll start putting together a BNF cuke file.


Greg has some basic feature parsing working already, but I don't think there's a good way to divide the work on the feature parser yet without causing conflicts on every merge. That being said, neither of us know enough about Ragel to say what way is the the right way to do anything, so any suggestions, improvements or alternate approaches would be very much appreciated. I've pushed the rudimentary feature parsing to my fork, so you can take a look there if you'd like.

The testing support is just to make development of the parsers easier. It's certainly not as painless right now as it could be. I'll probably be taking a look at that as soon as I'm done with this email.

I've created issues in the official fork for all of the above, plus a few more, and voted up the ones I thought were important. (I think only Aslak can set priority.) I'll add more issues as I can think of them.

Mike
 

Gregory Hnatiuk

unread,
Sep 4, 2009, 5:01:02 PM9/4/09
to cu...@googlegroups.com

Thanks!  I was thinking about using something like sexp in a TestListener.  Message expection is working, though getting a bit cumbersome to do correctly as the features being parsed grow.  It also seems less explicit then the way treetop is being tested.

We're currently ignoring the generated parser code and relying on rake to generate it when we run the tests.  We found out right quick that merging those files was pretty unbearable.

Of these, I think it would be awesome to have a BNF, especially considering that it would allow people to play around with this: http://code.haskell.org/bnfc/. Feature parsing depends on the BNF, albeit tangentially, but whatever the case it would be a good help to both of us if we could point to it to resolve ambiguities.


Cool I'll start putting together a BNF cuke file.


Mike's putting up some of our notes on a BNF in the gherkin wiki.  I think that would be a great place to work on this.  If you see any glaring errors, please don't hesitate to correct us.


Greg

 
Reply all
Reply to author
Forward
0 new messages