Google Groups

Re: Dead project?

rsms Jan 17, 2011 10:07 AM
Posted in group: gazelle-users
On Mon, Jan 17, 2011 at 04:32, Joshua  Haberman <> wrote:
> On Jan 16, 4:45 pm, Rasmus Andersson <> wrote:
>> @Scott Taylor: Oh, so you mean that the current (haberman/master)
>> implementation uses PB? Are you sure? Looks like BitCode all the way
> That's correct; the port hasn't happened yet.
>> Anyhow, yes I did some research earlier and read an old post (from
>> 2009 or possibly early 2010) where you (Joshua) said that Gazelle
>> was getting refactored internally to use PB instead of BC for data
>> serialization. That sounds nice, but what are the reasons (beyond PB
>> being easier to debug) for using Protocol Buffers instead of BitCode?
>> I mean, the versioning features are basically useless in the case of
>> Gazelle since the canonical representation of a grammar is in the gzl
>> file. And is the use-case of having a version 1 runtime reading in a
>> version 2 grammar really something that would occur often (that is,
>> controlling the runtime version is out of control for the user)? I'm
>> asking, not criticizing :)
> The biggest reason for using Protocol Buffers instead of BitCode is
> that Protocol Buffers includes a schema language.  Having an explicit
> schema opens up lots of possibilities and solves lots of problems.
> As one example, with BitCode Gazelle has to have a text file that
> informally describes its schema:
> The code to write BitCode using this schema is very manual and
> tedious.  I essentially have to describe the schema again, and keep it
> in sync with the schema described in FILEFORMAT:
> The code to load the bytecode in the runtime is again very manual, and
> must be kept in sync with FILEFORMAT:
> With Protocol Buffers I can describe this schema once and then get a
> nice API in both languages.  I get a text format for free for
> debugging or human creation/modification.  It will make a huge
> difference.
> So far I've only talked about Gazelle's internal bytecode.  The
> benefits of Protocol Buffers are again apparent when it comes to ASTs:
> most notably, they are typed (so, for example, a double can be
> represented as a double instead of a string).

Very good arguments. Thanks for clearing that up. I agree that PB is
a better option.

> It's less clear how PBs are useful for a parse tree.  Parse trees
> aren't really typed (every leaf is a string) and as you noted, the
> "schema" for a parse tree is mostly implicit from the structure of the
> grammar.

> My current thought is that the streaming API of protobufs could still
> be a useful model for parse trees, where all parse trees use a PB
> schema like:
> message ParseTreeNode {
>  // The name of the terminal or nonterminal this node represents.
>  required string name = 1;
>  union {
>    // For terminals: the text represented by the terminal.
>    string text = 2;
>    // For nonterminals: the sequence of nodes underneath this one.
>    repeated ParseTreeNode children = 3;
>  }
>  // Info about where in the source text this node appears.
>  required int64 offset = 4;
>  required int64 length = 5;
>  required int64 line_start = 6;
>  required int64 column_start = 7;
>  required int64 line_end = 8;
>  required int64 column_end = 9;
> }
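
For what it's worth, assuming that schema compiles (with the union
expressed in whatever way protobuf allows) and with the repeated field
named "children", the generated C++ API would let a consumer walk the
tree roughly like this (untested sketch, header name hypothetical):

// Rough sketch: recursively print a ParseTreeNode tree. Accessor names
// follow the usual protoc C++ code-generation conventions.
#include <cstdio>
#include "parse_tree.pb.h"  // hypothetical generated header

static void dump(const ParseTreeNode& node, int depth) {
  // Indent by depth, then print the rule/terminal name and source range.
  std::printf("%*s%s [%lld..%lld)\n", depth * 2, "",
              node.name().c_str(),
              (long long)node.offset(),
              (long long)(node.offset() + node.length()));
  if (node.children_size() == 0) {
    // Leaf: a terminal carrying the matched text.
    std::printf("%*s\"%s\"\n", depth * 2 + 2, "", node.text().c_str());
  }
  for (int i = 0; i < node.children_size(); ++i)
    dump(node.children(i), depth + 1);
}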


>> Further, I've not spent too much time with Gazelle yet (about 10 hours
>> in total), but my primary goal is to use Gazelle as a foundation for
>> building a syntax-aware program editor. In this case I need to handle
>> both error recovery and incremental parsing, both of which I believe
>> can be modelled on top of Gazelle.
> One thing I can definitely say is that this is exactly the kind of use
> case that I always wanted Gazelle to excel at.

Sweet. That's excellent news.

>> I've written an implementation in C++ during the day which generates an
>> AST where any branch (not leaf) can halt and/or resume the parsing.
>> For this to be feasible in the editor I'm writing (
>> the state stored needs to be very small — here, there's much room for
>> improvement. Currently the parser state is rather large (a few hundred
>> bytes for a few levels deep stack). Given a source file AST with 1000
>> branches, that's a lot of data to be stored, most of which is
>> redundant. Storing incremental/parity data instead, and later
>> traversing the tree to build a complete state, would be a much more
>> efficient solution IMHO (but of course also more complex and would
>> imply the existence of some sort of semantic tree of the stack).
> I'm not sure I completely follow when you say "later traversing the
> tree to build a complete state."  My vision for how this would work in
> an editor is that you would store a reasonable number of parse states
> -- say one for the beginning of each line of the file being edited --
> and then whenever a character is typed you can resume from the most
> recent saved state.  The number of states you save is a memory/CPU
> tradeoff.
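
If I understand you correctly, the model you describe is roughly the
following (a sketch with made-up names, just to be sure we mean the
same thing):

// Sketch: keep one saved parser state per line (or per N bytes) and,
// when a character is typed at `edit_offset`, resume from the closest
// checkpoint at or before that offset. All names here are hypothetical.
#include <map>

struct SavedState { /* opaque copy of the parser state */ };

std::map<long, SavedState> checkpoints;  // keyed by byte offset

const SavedState* find_resume_point(long edit_offset) {
  auto it = checkpoints.upper_bound(edit_offset);
  if (it == checkpoints.begin()) return nullptr;  // no checkpoint: reparse
  --it;                                           // last checkpoint <= offset
  return &it->second;
}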

We are deliberately trying to avoid assuming that the language we are
parsing is line based (or making any other assumptions, like for
instance the use of braces/brackets to denote scope). Imagine opening
a large JSON file in an editor. A minified JSON file may not contain
any linebreaks at all and would thus require a full re-parse for every
keystroke in such a case.

What we are looking at doing is something along the lines of what's
presented in the paper "Lazy Functional Incremental Parsing" by
Jean-Philippe Bernardy (presentation here: I also have a PDF of the paper I can
share on request).

Imagine the following AST:

01. root
02. --object
03. ----pair
04. ------key
05. --------string
06. ------value
07. --------number
08. ----pair
09. ------key
10. --------string
11. ------value
12. --------number

Now, the state of the stack at any given node is inherently related to
the branch that node lives in. The stack state for node 04. in the
example above might look like this:

rule:key, range:x,y, ...
rule:pair, range:x,y, ...
rule:object, range:x,y, ...
rule:root, range:x,y, ...

When we need to resume parsing at node 04 we could either load a
complete saved stack stored in/at that node, OR load a partial stack
and then walk up the AST until we reach the root, at which point we
have a complete stack and can resume parsing.
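
In rough pseudo-C++ (hypothetical structures, just to illustrate the
second option):

// Sketch: each AST node stores only its own stack frame plus a parent
// pointer; the full stack is rebuilt on demand by walking to the root.
#include <vector>

struct Frame { /* rule id, source range, ... */ };

struct AstNode {
  AstNode* parent;   // nullptr for the root
  Frame frame;       // this node's contribution to the stack
  // ... children, text range, etc.
};

std::vector<Frame> rebuild_stack(const AstNode* node) {
  std::vector<Frame> stack;
  for (const AstNode* n = node; n != nullptr; n = n->parent)
    stack.push_back(n->frame);  // innermost first: key, pair, object, root
  return stack;
}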

> By the way, a few hundred bytes per parser state is already doing
> pretty good; I can't guarantee Gazelle will be much better than that.
> Last time I computed it, the current code in Gazelle uses 24 bytes per
> stack entry and 12 bytes per token of lookahead.  So with a stack just
> 4 deep and 1 token of lookahead you've already hit 100 bytes.

And from what I can see by quickly viewing the gzl_parse_state struct,
the struct itself constitutes almost 100 bytes (I'm on x86_64). In my
rudimentary C++ program on top of Gazelle, I have a "ParserSnapshot"
which only saves the crucial parts of the gzl_parse_state struct,
namely leaving out things like max_stack_depth, max_lookahead and
user_data (none of which, in my case, is bound to a specific parse).

I think we can streamline the state load/save API to only copy the
bare minimum required for a given application (e.g. passing bitflags
to a copy_state function describing what parts you need).
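
Something along these lines (the names here are made up, not an
existing Gazelle API):

// Sketch of a selective state-copy API: the caller says which parts of
// the parse state it actually needs to persist. Hypothetical names.
struct gzl_parse_state;  // as defined in Gazelle's headers

enum {
  GZL_COPY_STACK     = 1 << 0,  // the rule stack
  GZL_COPY_LOOKAHEAD = 1 << 1,  // buffered lookahead tokens
  GZL_COPY_OFFSETS   = 1 << 2   // current source offsets
};

// Copies only the requested parts of `src` into `dst`; fields such as
// max_stack_depth, max_lookahead and user_data are left untouched.
void gzl_copy_parse_state_partial(struct gzl_parse_state* dst,
                                  const struct gzl_parse_state* src,
                                  unsigned flags);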

>> What are your goals and ideas about the future of Gazelle? For parsing
>> start-to-end strictly well-formed input (e.g. building compilers) or
>> for parsing incomplete parts of online input (e.g. from human input)?
> My goal is definitely to accommodate both.  I may discover at some
> point that I can't do both without making the tool sub-par at both
> jobs; if so I'll have to cross that bridge when I come to it.

> As far as parsing incomplete parts of online input, the way I
> currently think about that problem is that it is the same as error
> recovery for a compiler.  In both cases you're dealing with incomplete/
> inaccurate input, and in both cases you want to make as much sense of
> the input as you can.  I believe that Clang from LLVM is taking the
> same approach and using the compiler's parser interactively also.

Agreed — incremental parsing and error recovery are indeed very much
related and kind of "intersect". However, in the case of clang (which
has a rather rich and nice C API in libclang), error recovery can be
handled efficiently since the parser is very much purpose-built. If
you find an extra "}" after another "}" which doesn't match up with
the parent scope you can take a specific error recovery action. In our
case with a generalized approach this is AFAIK a much trickier
business. A very interesting problem indeed!

>> I would very much appreciate some guiding directions to where Gazelle
>> is moving and how soon (to know if we should invest more time with
>> building tools around Gazelle or go for something different). Thanks!
> The one answer I can give you with the most certainty is "probably not
> soon enough to make you happy."  If there's one thing that has become
> apparent to me over the last several years of developing Gazelle and
> upb, it's that I don't work fast.  I'm really trying to make a tool
> that will stand the test of time and be useful for lots of different
> things, so my process involves reading a lot of literature, looking at
> a lot of existing software, considering a lot of different use cases,
> etc.
> That said, I'm really keen on getting feedback and information from
> people who are focused on one particular problem, like your editor
> which looks very cool.  I'd love to hear more about your
> requirements.  And if you think Gazelle is a useful enough basis for
> doing your own work, I'd even be happy for you to fork it and hack it
> up to your heart's delight, solving the problems you encounter in
> whatever way is most convenient for you.  It would give me a lot of
> good ideas about the best ways to generalize.  I'd be happy to answer
> any questions you have about the code.  Of the stuff in Gazelle, I'd
> say the most valuable and hard to recreate is the LL lookahead
> generation in compiler/ll.lua.

I've decided to invest more time in Gazelle and have briefly talked
with some of my co-developers of Kod who seem to agree this looks like
a very interesting path to walk down. You can find my fork at (however, I've not yet pushed the C++
library additions nor published my AST-machine-stuff, but will soon).


> Thanks for writing!
> Josh
>> BTW, I've been researching incremental parsing of concrete languages
>> for about two weeks (full time) now and of everything I've dug into,
>> Gazelle is among the top 3 most interesting projects. You might also
>> find the now stagnant Harmonia project at Berkeley CS interesting:
>> (the server hosting the files is behind a mis-configured firewall at
>> the moment, but they're fixing it this week).
>> On Sun, Jan 16, 2011 at 17:42, Joshua  Haberman <> wrote:
>> > That's exactly right Scott.  It's upb actually (
>> > haberman/upb/wiki).
>> > I've known from the beginning that I wanted Gazelle to support a
>> > streaming interface where it would deliver values to you one-by-one,
>> > sort of like SAX for XML.  Here is the Gazelle interface I wrote for
>> > doing this, before I started working on upb:
>> >
>> > After I started tinkering with upb for a few months, I realized that
>> > Protocol Buffers were an ideal schema language and data model for
>> > streaming parse trees or ASTs.  I had the epiphany that parsing is
>> > just tree traversal.
>> > upb has taken longer than I expected, but I have made what I expect to
>> > be the last round of major interface changes.  I'm still extremely
>> > interested in returning to Gazelle once upb is released.
>> > Josh
>> > On Jan 16, 8:33 am, Scott Taylor <> wrote:
>> >> As far as I understand it, Haberman has basically put it on hold until upb is complete.
>> >> As you may or may not know, Gazelle uses protocol buffers internally - so it's a logical decision.
>> >> Gazelle is such an interesting project that I'd like to know how we could help Joshua out to speed things up (either on upb *or* gazelle).
>> >> Scott
>> >> On Jan 16, 2011, at 5:53 AM, rsms wrote:
>> >> > Hi everyone (someone?),
>> >> > Gazelle is very interesting but apparently still in its early stages
>> >> > of life. Has Gazelle become a dead project or simply put on hold?
>> >> > Thanks
>> --
>> Rasmus Andersson

Rasmus Andersson