Dead project?


rsms

Jan 16, 2011, 5:53:20 AM
to gazelle-users
Hi everyone (someone?),

Gazelle is very interesting but apparently still in its early stages
of life. Has Gazelle become a dead project or simply been put on hold?

Thanks

Scott Taylor

Jan 16, 2011, 11:33:22 AM
to gazell...@googlegroups.com

As far as I understand it, Haberman has basically put it on hold until udp is complete.

As you may or may not know, Gazelle uses protocol buffers internally - so it's a logical devision.

Gazelle is such an interesting project, that I'd like to know how we could help Joshua out to speed things up (either on udp *or* gazelle).

Scott


Scott Taylor

Jan 16, 2011, 11:39:04 AM
to gazell...@googlegroups.com

Oops - how embarrassing. s/devision/division/p

Scott

Joshua Haberman

Jan 16, 2011, 11:42:42 AM
to gazelle-users
That's exactly right, Scott. It's upb actually
(https://github.com/haberman/upb/wiki).

I've known from the beginning that I wanted Gazelle to support a
streaming interface where it would deliver values to you one-by-one,
sort of like SAX for XML. Here is the Gazelle interface I wrote for
doing this, before I started working on upb:
https://github.com/haberman/gazelle/blob/master/runtime/include/gazelle/parse.h
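
In spirit, that kind of interface looks something like this (a
simplified sketch for illustration only; the real header differs in
detail):

// Simplified sketch of a SAX-style (streaming) parse interface; the
// real interface lives in runtime/include/gazelle/parse.h.
#include <cstddef>

struct StreamingCallbacks {
  void* user;  // opaque pointer handed back to every callback

  // Entered a grammar rule -- analogous to SAX's startElement.
  void (*start_rule)(void* user, const char* rule);

  // Left a grammar rule -- analogous to SAX's endElement.
  void (*end_rule)(void* user, const char* rule);

  // Matched a terminal: its text plus where it sits in the source.
  void (*terminal)(void* user, const char* text,
                   std::size_t offset, std::size_t length);
};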

After I started tinkering with upb for a few months, I realized that
Protocol Buffers were an ideal schema language and data model for
streaming parse trees or ASTs. I had the epiphany that parsing is
just tree traversal.

upb has taken longer than I expected, but I have made what I expect to
be the last round of major interface changes. I'm still extremely
interested in returning to Gazelle once upb is released.

Josh

Rasmus Andersson

Jan 16, 2011, 7:45:42 PM
to gazell...@googlegroups.com
@Scott Taylor: Oh, so you mean that the current (haberman/master)
implementation uses PB? Are you sure? Looks like BitCode all the way
AFAIK.

Anyhow, yes, I did some research earlier and read an old post (from
2009 or possibly early 2010) where you (Joshua) said that Gazelle
was getting refactored internally to use PB instead of BC for data
serialization. That sounds nice, but what are the reasons (beyond PB
being easier to debug) for using Protocol Buffers instead of BitCode?
I mean, the versioning features are basically useless in the case of
Gazelle since the canonical representation of a grammar is in the gzl
file. And is the use-case of having a version 1 runtime reading in a
version 2 grammar really something that would occur often (that is,
the runtime version is out of the user's control)? I'm
asking, not criticizing :)

Further, I've not spent too much time with Gazelle yet (about 10 hours
in total), but my primary goal is to use Gazelle as a foundation for
building a syntax-aware program editor. In this case I need to handle
both error recovery and incremental parsing, both of which I believe
can be modelled on top of Gazelle.

I've written an implementation in C++ during the day which generates an
AST where any branch (not leaf) can halt and/or resume the parsing.
For this to be feasible in the editor I'm writing (http://kodapp.com/)
the state stored needs to be very small — here, there's much room for
improvement. Currently the parser state is rather large (a few hundred
bytes for a stack a few levels deep). Given a source file AST with 1000
branches, that's a lot of data to be stored, most of which is
redundant. Storing incremental/parity data instead, and later
traversing the tree to build a complete state, would be a much more
efficient solution IMHO (but of course also more complex, and would
imply the existence of some sort of semantic tree of the stack).

What are your goals and ideas about the future of Gazelle? For parsing
start-to-end strictly well-formed input (e.g. building compilers) or
for parsing incomplete parts of online input (e.g. from human input)?

I would very much appreciate some guidance on where Gazelle
is heading and how soon (to know if we should invest more time in
building tools around Gazelle or go for something different). Thanks!

BTW, I've been researching incremental parsing of concrete languages
for about two weeks (full time) now, and of everything I've dug into,
Gazelle is among the top 3 most interesting projects. You might also
find the now-stagnant Harmonia project at Berkeley CS interesting:
http://harmonia.cs.berkeley.edu/harmonia/ (the server hosting the
files is currently behind a mis-configured firewall, but
they're fixing it this week).

--
Rasmus Andersson

Joshua Haberman

Jan 16, 2011, 10:32:56 PM
to gazelle-users
On Jan 16, 4:45 pm, Rasmus Andersson <ras...@notion.se> wrote:
> @Scott Taylor: Oh, so you mean that the current (haberman/master)
> implementation uses PB? Are you sure? Looks like BitCode all the way
> AFAIK.

That's correct; the port hasn't happened yet.

> Anyhow, yes, I did some research earlier and read an old post (from
> 2009 or possibly early 2010) where you (Joshua) said that Gazelle
> was getting refactored internally to use PB instead of BC for data
> serialization. That sounds nice, but what are the reasons (beyond PB
> being easier to debug) for using Protocol Buffers instead of BitCode?
> I mean, the versioning features are basically useless in the case of
> Gazelle since the canonical representation of a grammar is in the gzl
> file. And is the use-case of having a version 1 runtime reading in a
> version 2 grammar really something that would occur often (that is,
> the runtime version is out of the user's control)? I'm
> asking, not criticizing :)

The biggest reason for using Protocol Buffers instead of BitCode is
that Protocol Buffers includes a schema language. Having an explicit
schema opens up lots of possibilities and solves lots of problems.

As one example, with BitCode Gazelle has to have a text file that
informally describes its schema:
https://github.com/haberman/gazelle/blob/f8034e85f7d9b6b5749a9c7f622766337a569dc1/docs/FILEFORMAT

The code to write BitCode using this schema is very manual and
tedious. I essentially have to describe the schema again, and keep it
in sync with the schema described in FILEFORMAT:
https://github.com/haberman/gazelle/blob/f8034e85f7d9b6b5749a9c7f622766337a569dc1/compiler/bytecode.lua

The code to load the bytecode in the runtime is again very manual, and
must be kept in sync with FILEFORMAT:
https://github.com/haberman/gazelle/blob/f8034e85f7d9b6b5749a9c7f622766337a569dc1/runtime/load_grammar.c

With Protocol Buffers I can describe this schema once and then get a
nice API in both languages. I get a text format for free for
debugging or human creation/modification. It will make a huge
difference.

So far I've only talked about Gazelle's internal bytecode. The
benefits of Protocol Buffers are again apparent when it comes to ASTs:
most notably, they are typed (so, for example, a double can be
represented as a double instead of a string).

It's less clear how PBs are useful for a parse tree. Parse trees
aren't really typed (every leaf is a string) and as you noted, the
"schema" for a parse tree is mostly implicit from the structure of the
grammar.

My current thought is that the streaming API of protobufs could still
be a useful model for parse trees, where all parse trees use a PB
schema like:

message ParseTreeNode {
  // The name of the terminal or nonterminal this node represents.
  required string name = 1;

  union {
    // For terminals: the text represented by the terminal.
    string text = 2;

    // For nonterminals: the sequence of nodes underneath this one.
    repeated ParseTreeNode children = 3;
  }

  // Info about where in the source text this node appears.
  required int64 offset = 4;
  required int64 length = 5;
  required int64 line_start = 6;
  required int64 column_start = 7;
  required int64 line_end = 8;
  required int64 column_end = 9;
}

> Further, I've not spent too much time with Gazelle yet (about 10 hours
> in total), but my primary goal is to use Gazelle as a foundation for
> building a syntax-aware program editor. In this case I need to handle
> both error recovery and incremental parsing, both of which I believe
> can be modelled on top of Gazelle.

One thing I can definitely say is that this is exactly the kind of use
case that I always wanted Gazelle to excel at.

> I've written an implementation in C++ during the day which generates an
> AST where any branch (not leaf) can halt and/or resume the parsing.
> For this to be feasible in the editor I'm writing (http://kodapp.com/)
> the state stored needs to be very small — here, there's much room for
> improvement. Currently the parser state is rather large (a few hundred
> bytes for a stack a few levels deep). Given a source file AST with 1000
> branches, that's a lot of data to be stored, most of which is
> redundant. Storing incremental/parity data instead, and later
> traversing the tree to build a complete state, would be a much more
> efficient solution IMHO (but of course also more complex, and would
> imply the existence of some sort of semantic tree of the stack).

I'm not sure I completely follow when you say "later traversing the
tree to build a complete state." My vision for how this would work in
an editor is that you would store a reasonable number of parse states
-- say one for the beginning of each line of the file being edited --
and then whenever a character is typed you can resume from the most
recent saved state. The number of states you save is a memory/CPU
tradeoff.
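
Concretely, the bookkeeping could look something like this (a sketch
only; gzl_parse_state is real, but the snapshot map and
reparse_from() are invented here for illustration):

// Sketch of checkpoint-based resumption for an editor.
#include <cstddef>
#include <map>

struct gzl_parse_state;  // opaque; defined by the Gazelle runtime

struct IncrementalParser {
  // One saved parser state per checkpoint, keyed by byte offset.
  // Invariant: a snapshot at offset 0 always exists.
  std::map<std::size_t, gzl_parse_state*> snapshots;

  void on_edit(std::size_t edit_offset) {
    // Find the most recent snapshot at or before the edit.
    auto it = snapshots.upper_bound(edit_offset);
    --it;  // safe: the offset-0 snapshot guarantees a predecessor
    // Snapshots past the edit are stale now; drop them.
    snapshots.erase(snapshots.upper_bound(edit_offset), snapshots.end());
    // Re-parse forward from the checkpoint to the cursor.
    reparse_from(it->second, it->first);
  }

  // Stand-in: restores `state` and parses from `offset` onward.
  void reparse_from(gzl_parse_state* state, std::size_t offset);
};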

By the way, a few hundred bytes per parser state is already doing
pretty good; I can't guarantee Gazelle will be much better than that.
Last time I computed it, the current code in Gazelle uses 24 bytes per
stack entry and 12 bytes per token of lookahead. So with a stack just
4 deep and 1 token of lookahead you've already hit 100 bytes.

> What are your goals and ideas about the future of Gazelle? For parsing
> start-to-end strictly well-formed input (e.g. building compilers) or
> for parsing incomplete parts of online input (e.g. from human input)?

My goal is definitely to accommodate both. I may discover at some
point that I can't do both without making the tool sub-par at both
jobs; if so I'll have to cross that bridge when I come to it.

As far as parsing incomplete parts of online input, the way I
currently think about that problem is that it is the same as error
recovery for a compiler. In both cases you're dealing with incomplete/
inaccurate input, and in both cases you want to make as much sense of
the input as you can. I believe that Clang from LLVM is taking the
same approach and using the compiler's parser interactively also.

> I would very much appreciate some guidance on where Gazelle
> is heading and how soon (to know if we should invest more time in
> building tools around Gazelle or go for something different). Thanks!

The one answer I can give you with the most certainty is "probably not
soon enough to make you happy." If there's one thing that has become
apparent to me over the last several years of developing Gazelle and
upb, it's that I don't work fast. I'm really trying to make a tool
that will stand the test of time and be useful for lots of different
things, so my process involves reading a lot of literature, looking at
a lot of existing software, considering a lot of different use cases,
etc.

That said, I'm really keen on getting feedback and information from
people who are focused on one particular problem, like your editor
which looks very cool. I'd love to hear more about your
requirements. And if you think Gazelle is a useful enough basis for
doing your own work, I'd even be happy for you to fork it and hack it
up to your heart's delight, solving the problems you encounter in
whatever way is most convenient for you. It would give me a lot of
good ideas about the best ways to generalize. I'd be happy to answer
any questions you have about the code. Of the stuff in Gazelle, I'd
say the most valuable and hard to recreate is the LL lookahead
generation in compiler/ll.lua.

Thanks for writing!
Josh

Rasmus Andersson

Jan 17, 2011, 1:07:58 PM
to gazell...@googlegroups.com

Very good arguments. Thanks for clearing that up. I agree that PB is
a better option.

>
> It's less clear how PBs are useful for a parse tree.  Parse trees
> aren't really typed (every leaf is a string) and as you noted, the
> "schema" for a parse tree is mostly implicit from the structure of the
> grammar.

>
> My current thought is that the streaming API of protobufs could still
> be a useful model for parse trees, where all parse trees use a PB
> schema like:
>
> message ParseTreeNode {
>  // The name of the terminal or nonterminal this node represents.
>  required string name = 1;
>
>  union {
>    // For terminals: the text represented by the terminal.
>    string text = 2;
>
>    // For nonterminals: the sequence of nodes underneath this one.
>    repeated ParseTreeNode children = 3;
>  }
>
>  // Info about where in the source text this node appears.
>  required int64 offset = 4;
>  required int64 length = 5;
>  required int64 line_start = 6;
>  required int64 column_start = 7;
>  required int64 line_end = 8;
>  required int64 column_end = 9;
> }

Interesting.

>
>> Further, I've not spent too much time with Gazelle yet (about 10 hours
>> in total), but my primary goal is to use Gazelle as a foundation for
>> building a syntax-aware program editor. In this case I need to handle
>> both error recovery and incremental parsing, both of which I believe
>> can be modelled on top of Gazelle.
>
> One thing I can definitely say is that this is exactly the kind of use
> case that I always wanted Gazelle to excel at.

Sweet. That's excellent news.

>
>> I've written an implementation in C++ during the day which generates an
>> AST where any branch (not leaf) can halt and/or resume the parsing.
>> For this to be feasible in the editor I'm writing (http://kodapp.com/)
>> the state stored needs to be very small — here, there's much room for
>> improvement. Currently the parser state is rather large (a few hundred
>> bytes for a stack a few levels deep). Given a source file AST with 1000
>> branches, that's a lot of data to be stored, most of which is
>> redundant. Storing incremental/parity data instead, and later
>> traversing the tree to build a complete state, would be a much more
>> efficient solution IMHO (but of course also more complex, and would
>> imply the existence of some sort of semantic tree of the stack).
>
> I'm not sure I completely follow when you say "later traversing the
> tree to build a complete state."  My vision for how this would work in
> an editor is that you would store a reasonable number of parse states
> -- say one for the beginning of each line of the file being edited --
> and then whenever a character is typed you can resume from the most
> recent saved state.  The number of states you save is a memory/CPU
> tradeoff.

We are deliberately trying to avoid assuming that the language we are
parsing is line-based (or making any other assumptions, like for
instance the use of braces/brackets to denote scope). Imagine opening
a large JSON file in an editor: compact (minified) JSON may contain no
line breaks at all and would thus require a full re-parse for every
keystroke in such a case.

What we are looking at doing is something along the lines of what's
presented in the paper "Lazy Functional Incremental Parsing" by
Jean-Philippe Bernardy (there's a presentation at
http://www.vimeo.com/6683816, and I have a PDF of the paper I can
share on request).

Imagine the following AST:

01. root
02. --object
03. ----pair
04. ------key
05. --------string
06. ------value
07. --------number
08. ----pair
09. ------key
10. --------string
11. ------value
12. --------number

Now, the state of the stack at any given node is inherently related to
the branch that node lives in. The stack state for node 04. in the
example above might look like this:

rule:key, range:x,y, ...
rule:pair, range:x,y, ...
rule:object, range:x,y, ...
rule:root, range:x,y, ...

When we need to resume parsing at node 04 we could either load a
complete saved stack stored at that node, or we could load a
partial stack and then walk up the AST until we reach the root, at
which point we have a complete stack and can finally resume parsing.
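
In code, the walk-to-root idea is roughly this (a simplified sketch;
real stack frames would carry more than a rule name and a source
range):

// Each AST branch stores only its own stack frame; the full parser
// stack is the chain of frames from the node back up to the root.
#include <cstddef>
#include <string>
#include <vector>

struct AstNode {
  std::string rule;                    // e.g. "key", "pair", "object", "root"
  std::size_t range_start, range_end;  // source range covered by this node
  AstNode* parent;                     // null for the root
};

// Rebuild the complete stack for `node` by collecting frames leaf-to-root.
std::vector<const AstNode*> rebuild_stack(const AstNode* node) {
  std::vector<const AstNode*> stack;
  for (const AstNode* n = node; n != nullptr; n = n->parent)
    stack.push_back(n);  // innermost frame first, root last
  return stack;
}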

>
> By the way, a few hundred bytes per parser state is already doing
> pretty good; I can't guarantee Gazelle will be much better than that.
> Last time I computed it, the current code in Gazelle uses 24 bytes per
> stack entry and 12 bytes per token of lookahead.  So with a stack just
> 4 deep and 1 token of lookahead you've already hit 100 bytes.

From what I can see by quickly viewing the gzl_parse_state struct, the
struct itself constitutes almost 100 bytes (I'm on x86_64). In my
rudimentary C++ program on top of Gazelle, I have a "ParserSnapshot"
which only saves the crucial parts of the gzl_parse_state struct,
namely leaving out things like max_stack_depth, max_lookahead and
user_data (none of which in my case is bound to a specific parse
state).

I think we can streamline the state load/save API to only copy the
bare minimum required for a given application (e.g. passing bitflags
to a copy_state function describing what parts you need).
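
As a sketch of the shape I have in mind (gzl_copy_state and these
flags don't exist in Gazelle today; they're purely illustrative):

// Purely illustrative: a proposed signature, not an existing API.
enum gzl_state_parts {
  GZL_COPY_STACK     = 1 << 0,  // the rule stack itself
  GZL_COPY_LOOKAHEAD = 1 << 1,  // buffered lookahead tokens
  GZL_COPY_POSITIONS = 1 << 2   // source-offset bookkeeping
};

// Copy only the parts of `src` named in `parts` into `dst`, so callers
// that never touch max_stack_depth, user_data, etc. don't pay for them.
void gzl_copy_state(struct gzl_parse_state* dst,
                    const struct gzl_parse_state* src,
                    unsigned parts);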

>
>> What are your goals and ideas about the future of Gazelle? For parsing
>> start-to-end strictly well-formed input (e.g. building compilers) or
>> for parsing incomplete parts of online input (e.g. from human input)?
>
> My goal is definitely to accommodate both.  I may discover at some
> point that I can't do both without making the tool sub-par at both
> jobs; if so I'll have to cross that bridge when I come to it.

>
> As far as parsing incomplete parts of online input, the way I
> currently think about that problem is that it is the same as error
> recovery for a compiler.  In both cases you're dealing with incomplete/
> inaccurate input, and in both cases you want to make as much sense of
> the input as you can.  I believe that Clang from LLVM is taking the
> same approach and using the compiler's parser interactively also.

Agreed — incremental parsing and error recovery are indeed closely
related and kind of intersect. However, in the case of clang (which
has a rather rich and nice C API in libclang), error recovery can be
handled efficiently since the parser is very much purpose-built. If
you find an extra "}" after another "}" which doesn't match up with
the parent scope, you can take a specific error-recovery action. In our
case, with a generalized approach, this is AFAIK a much trickier
business. A very interesting problem indeed!

>
>> I would very much appreciate some guidance on where Gazelle
>> is heading and how soon (to know if we should invest more time in
>> building tools around Gazelle or go for something different). Thanks!
>
> The one answer I can give you with the most certainty is "probably not
> soon enough to make you happy."  If there's one thing that has become
> apparent to me over the last several years of developing Gazelle and
> upb, it's that I don't work fast.  I'm really trying to make a tool
> that will stand the test of time and be useful for lots of different
> things, so my process involves reading a lot of literature, looking at
> a lot of existing software, considering a lot of different use cases,
> etc.
>
> That said, I'm really keen on getting feedback and information from
> people who are focused on one particular problem, like your editor
> which looks very cool.  I'd love to hear more about your
> requirements.  And if you think Gazelle is a useful enough basis for
> doing your own work, I'd even be happy for you to fork it and hack it
> up to your heart's delight, solving the problems you encounter in
> whatever way is most convenient for you.  It would give me a lot of
> good ideas about the best ways to generalize.  I'd be happy to answer
> any questions you have about the code.  Of the stuff in Gazelle, I'd
> say the most valuable and hard to recreate is the LL lookahead
> generation in compiler/ll.lua.

I've decided to invest more time in Gazelle and have briefly talked
with some of my co-developers of Kod who seem to agree this looks like
a very interesting path to walk down. You can find my fork at
https://github.com/rsms/gazelle (however, I've not yet pushed the C++
library additions nor published my AST-machine-stuff, but will soon).

Cheers

Joshua Haberman

Jan 19, 2011, 12:43:47 PM
to gazelle-users
On Jan 17, 10:07 am, Rasmus Andersson <ras...@notion.se> wrote:
> On Mon, Jan 17, 2011 at 04:32, Joshua  Haberman <jhaber...@gmail.com> wrote:
> > It's less clear how PBs are useful for a parse tree.  Parse trees
> > aren't really typed (every leaf is a string) and as you noted, the
> > "schema" for a parse tree is mostly implicit from the structure of the
> > grammar.
>
> > My current thought is that the streaming API of protobufs could still
> > be a useful model for parse trees, where all parse trees use a PB
> > schema like:
>
> > message ParseTreeNode {
> >  // The name of the terminal or nonterminal this node represents.
> >  required string name = 1;
>
> >  union {
> >    // For terminals: the text represented by the terminal.
> >    string text = 2;
>
> >    // For nonterminals: the sequence of nodes underneath this one.
> >    repeated ParseTreeNode children = 3;
> >  }
>
> >  // Info about where in the source text this node appears.
> >  required int64 offset = 4;
> >  required int64 length = 5;
> >  required int64 line_start = 6;
> >  required int64 column_start = 7;
> >  required int64 line_end = 8;
> >  required int64 column_end = 9;
> > }
>
> Interesting.

To add one thought to this, in the case where you want an AST, but
also want to have a parse tree so you can trace specific parts of the
tree back to the source text, I was thinking of having one of these
ParseTreeNode messages for each leaf of the AST.

> 01. root
> 02. --object
> 03. ----pair
> 04. ------key
> 05. --------string
> 06. ------value
> 07. --------number
> 08. ----pair
> 09. ------key
> 10. --------string
> 11. ------value
> 12. --------number
>
> Now, the state of the stack at any given node is inherently related to
> the branch that node lives in. The stack state for node 04. in the
> example above might look like this:
>
> rule:key, range:x,y, ...
> rule:pair, range:x,y, ...
> rule:object, range:x,y, ...
> rule:root, range:x,y, ...
>
> When we need to resume parsing at node 04 we could either load a
> complete saved stack stored at that node, or we could load a
> partial stack and then walk up the AST until we reach the root, at
> which point we have a complete stack and can finally resume parsing.

I think I see what you're saying -- you want to share entries in the
parse stack that don't vary between consecutive snapshots. One
possible implementation of this would be to make the stack a linked
list, where many consecutive saved states could point to the same
previous stack entry.
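
A minimal sketch of that idea (illustrative types, not Gazelle's
actual structures):

// A shared ("persistent") stack: pushing allocates one new node that
// points at the old top, so consecutive snapshots share every entry
// below the point where they diverge.
#include <memory>
#include <string>
#include <utility>

struct StackEntry {
  std::string rule;                         // frame payload (illustrative)
  std::shared_ptr<const StackEntry> below;  // shared tail; null at the bottom
};

using Stack = std::shared_ptr<const StackEntry>;

Stack push(Stack top, std::string rule) {
  return std::make_shared<const StackEntry>(
      StackEntry{std::move(rule), std::move(top)});
}

Stack pop(const Stack& top) { return top ? top->below : nullptr; }

// Two snapshots one push apart share all entries below the newer top:
//   Stack s1 = push(nullptr, "root");
//   Stack s2 = push(s1, "object");  // s2->below == s1: shared, not copied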

> > By the way, a few hundred bytes per parser state is already doing
> > pretty good; I can't guarantee Gazelle will be much better than that.
> > Last time I computed it, the current code in Gazelle uses 24 bytes per
> > stack entry and 12 bytes per token of lookahead.  So with a stack just
> > 4 deep and 1 token of lookahead you've already hit 100 bytes.
>
> From what I can see by quickly viewing the gzl_parse_state struct, the
> struct itself constitutes almost 100 bytes (I'm on x86_64). In my
> rudimentary C++ program on top of Gazelle, I have a "ParserSnapshot"
> which only saves the crucial parts of the gzl_parse_state struct,
> namely leaving out things like max_stack_depth, max_lookahead and
> user_data (none of which in my case is bound to a specific parse
> state).

Interesting; one alternative you might consider is moving some of
these things to gzl_bound_grammar, since that is a place to put things
that are essentially immutable over the life of the parse.

> Agreed — incremental parsing and error recovery are indeed closely
> related and kind of intersect. However, in the case of clang (which
> has a rather rich and nice C API in libclang), error recovery can be
> handled efficiently since the parser is very much purpose-built. If
> you find an extra "}" after another "}" which doesn't match up with
> the parent scope, you can take a specific error-recovery action. In our
> case, with a generalized approach, this is AFAIK a much trickier
> business. A very interesting problem indeed!

Yes, I definitely think that custom error recovery actions will be
required; my intention is to allow these to be written by the grammar
developer in Lua.

> I've decided to invest more time in Gazelle and have briefly talked
> with some of my co-developers of Kod who seem to agree this looks like
> a very interesting path to walk down. You can find my fork at
> https://github.com/rsms/gazelle (however, I've not yet pushed the C++
> library additions nor published my AST-machine-stuff, but will soon).

Cool, I look forward to hearing about your progress, and I'll answer
whatever questions I can.

Are you planning to have end-users write grammars for your editor? I
ask because I do want to retain the right to define the Gazelle
grammar language. If your end users are writing grammars and your
grammar language diverges from mine, it may be beneficial to have you
use a different file extension or something like that.

Cheers,
Josh

Rasmus Andersson

Feb 1, 2011, 7:37:29 AM
to gazell...@googlegroups.com
I've made some progress with Gazelle during the last few weeks and we
now have a somewhat crude but functional incremental AST parser.

Here's a short screencast demo: http://www.youtube.com/watch?v=nSM41hToa0w

My fork of Gazelle with a few fixes and changes: https://github.com/rsms/gazelle

The AST stuff is implemented in Kod:
https://github.com/rsms/kod/tree/master/src (AST*)

Josh Haberman

Feb 1, 2011, 12:57:27 PM
to gazell...@googlegroups.com
Wow, that's really cool!

It's always been my dream to have a syntax-aware editor like this -- it's one of the reasons I started working on Gazelle.  It's really exciting to see this.

Also, for some reason I was under the impression that your editor was closed-source, which is fine because of Gazelle's BSD license, but it's cool to know that it's open source also.

I see you added a new callback for will_start_rule in addition to did_start_rule; out of curiosity, how do you use this?  It seems like you could just as easily look one stack frame up when your did_start_rule callback gets called.

Josh

Rasmus Andersson

Feb 1, 2011, 5:47:23 PM
to gazell...@googlegroups.com
On Tue, Feb 1, 2011 at 18:57, Josh Haberman <jhab...@gmail.com> wrote:
> Wow, that's really cool!
> It's always been my dream to have a syntax-aware editor like this -- it's
> one of the reasons I started working on Gazelle.  It's really exciting to
> see this.

:)

> Also, for some reason I was under the impression that your editor was
> closed-source, which is fine because of Gazelle's BSD license, but it's cool
> to know that it's open source also.

Yeah, the source is released under an MIT license.

> I see you added a new callback for will_start_rule in addition to
> did_start_rule; out of curiosity, how do you use this?  It seems like you
> could just as easily look one stack frame up when your did_start_rule
> callback gets called.

The reason for this is that I save parser state per AST node. Consider
the following tree:

a
--b
--c
----d
----e
--f

Now, if "e" is the node that need to be replaced, I resuscitate the
parser state from the closest sibling, which in this example is "d".
As the state is saved after "d" has been processed but before "e" has
begun to "process", I can use that as an entry point for the parser.
When I saved the state on "rule [will] end", the state saved would be
that of just before a rule ended, which would cause a "frame pop" when
resuming the parser.

I don't fully understand the Gazelle runtime just yet, so there might
be an easier/prettier way to solve this :)

Joshua Haberman

Mar 8, 2011, 9:08:30 PM
to gazelle-users
Hello again Rasmus. I've been thinking more about this and I still
wonder whether the better solution isn't to store a smaller number of
intermediate parse states. The approach you are describing is much
more complicated to implement an API for.

Last time I checked Gazelle could parse JSON at ~30MB/s on Core2.
With a JIT I'd expect this to be much faster, but we can start with
this number. You could store parser states every 5kb (and always have
a saved state for the current location of the cursor). The worst case
is that you'd have to re-parse 5kb of text when the cursor moves, but
this is only 162 usec of CPU time, and you only pay it when the cursor
moves. And if each parse snapshot was 200 bytes, your total memory
spent on parser snapshots would be 4% the size of the file itself.
You could tweak this constant differently for a different CPU/memory
tradeoff: for example, snapshotting every 10kb would give you a worst
case of 325us of CPU time for cursor moves, and make the snapshots 2%
the size of the file.
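
Spelled out, the model is just two ratios (a sketch using the rough
constants above, so the microsecond figures come out slightly
different from my rounded ones):

// Back-of-the-envelope model for the snapshot-interval tradeoff.
#include <cstdio>

int main() {
  const double throughput = 30e6;  // bytes/sec: Gazelle parsing JSON on Core2
  const double snapshot   = 200;   // bytes per saved parser state
  for (double interval : {5e3, 10e3}) {             // bytes between snapshots
    double worst_us = interval / throughput * 1e6;  // re-parse cost per move
    double overhead = snapshot / interval * 100;    // snapshot memory vs. file
    std::printf("every %.0fkb: worst case %.0fus, snapshots %.0f%% of file\n",
                interval / 1e3, worst_us, overhead);
  }
  return 0;
}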

What do you think?

Rasmus Andersson

Mar 11, 2011, 9:48:10 AM
to gazell...@googlegroups.com, Joshua Haberman
On Wed, Mar 9, 2011 at 03:08, Joshua Haberman <jhab...@gmail.com> wrote:
<snip>

>
> Hello again Rasmus.  I've been thinking more about this and I still
> wonder whether the better solution isn't to store a smaller number of
> intermediate parse states.  The approach you are describing is much
> more complicated to implement an API for.
>
> Last time I checked Gazelle could parse JSON at ~30MB/s on Core2.
> With a JIT I'd expect this to be much faster, but we can start with
> this number.  You could store parser states every 5kb (and always have
> a saved state for the current location of the cursor).  The worst case
> is that you'd have to re-parse 5kb of text when the cursor moves, but
> this is only 162 usec of CPU time, and you only pay it when the cursor
> moves.  And if each parse snapshot was 200 bytes, your total memory
> spent on parser snapshots would be 4% the size of the file itself.
> You could tweak this constant differently for a different CPU/memory
> tradeoff: for example, snapshotting every 10kb would give you a worst
> case of 325us of CPU time for cursor moves, and make the snapshots 2%
> the size of the file.
>
> What do you think?

I definitely agree that saving fewer state snapshots is the way to go.
Due to some other things in life (moving to another country, starting
a new job, etc.) I haven't had much time to hack on either Gazelle or Kod.
I'm hoping to find some time soon and will investigate this idea.

One thing though: how/where would you save state? I mean, what is a
"unit" (at the finest level of granularity)? A branch (i.e. not a
leaf) in the AST? At the moment my implementation saves a state
snapshot for each branch in the AST; then one can simply walk
leaf-to-root, stopping at the first branch that completely encompasses
the change needing re-parsing. If we save state every Nth branch
instead of every branch, the cost could be high (i.e. walking two or
three branches closer to the root might cause _everything_ to be
re-parsed). Just a thought.

Cheers

Will Uther

May 6, 2011, 9:12:55 PM
to gazelle-users
Hi,
I've just been re-evaluating my editor choice, and Kod led me here.
One of the things that I've used occasionally in an editor is the
ability to edit *huge* files, i.e. files that are many times larger
than the available RAM in the machine. vim does this quite well. But
I've not done that with syntax parsing turned on in the editor I'm
using. I started thinking about how to make these things play well
together.

It seems to me that what you'd really like is something partway
between what rsms suggested and the lazy ideas that he provided a link
to. Specifically, you'd store parser state at each node in the AST,
but (using java terminology) each node in the AST would only have a
soft reference to its children. This means that sub-trees of the AST
could be freed when you're running low on RAM. If you need that
subtree later, then you re-parse. This includes subtrees both before
and after the 'current location' in the file.

The only part of the subtree that is really required at any time is
that being displayed (or processed). With other subtrees of the AST
pruned away, you could get a significant memory saving. Moreover,
there is no need prune subtrees if memory isn't tight. This could be
an extra optimisation with little downside. (e.g. LRU caching: Keep
an extra linked list through the AST nodes. Move any node (+
ancestors?) to the front of the list when touched. Free nodes from
the tail of the list when memory is tight.)
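
A rough sketch of that LRU bookkeeping (illustrative types only;
registering newly created nodes in the list is omitted):

// Touching a node moves it to the front of the list; when memory is
// tight, the children of the least-recently-touched node are freed
// and re-parsed later on demand.
#include <list>
#include <vector>

struct AstNode {
  AstNode* parent = nullptr;
  std::vector<AstNode*> children;         // empty if pruned; re-parse to refill
  std::list<AstNode*>::iterator lru_pos;  // this node's slot in the LRU list
};

struct AstCache {
  std::list<AstNode*> lru;  // front = most recently touched

  // Call whenever a node is displayed or processed.
  void touch(AstNode* n) {
    lru.erase(n->lru_pos);
    n->lru_pos = lru.insert(lru.begin(), n);
  }

  // Call when memory is tight: free the coldest node's subtree.
  void prune_one() {
    AstNode* victim = lru.back();
    for (AstNode* child : victim->children) free_subtree(child);
    victim->children.clear();  // a later access triggers a re-parse
  }

  // Recursively frees `n` and removes it from the LRU list (omitted).
  void free_subtree(AstNode* n);
};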

Anyway, just thought I'd throw the idea out there. Both Kod and
Gazelle look like cool projects. I wish you luck with them.

Cheers,

Will :-}


Rasmus Andersson

May 13, 2011, 9:39:21 PM
to gazell...@googlegroups.com
Interesting, although I believe editing huge files is an extremely
small use-case for code editors.

Could you really purge ascending/parent nodes? The way Kod performs
incremental parsing is that it traverses the AST starting at the
affected node(s), walking upward (in the direction of the root) until
it finds a node/branch that owns all affected child nodes, then
resuscitates the parser to that state (and parses the code + patches
the AST).

Maybe I'm just a bit slow from Friday beers, but I think the
effort-to-win ratio is something like 9999:1 (i.e. tiny use-case
and huge amount of work).

I'd love to get some help on the Kod parser BTW. Seems most people
involved in the project want to either change the icon, work some
fancy Ruby-ish sidebar UI/magic or implement reasonable UI/UX features
— but no one seems to be interested in the core parts of the editor
(parser, syntax styling, etc).

Cheers

Will Uther

May 17, 2011, 11:47:49 PM
to gazelle-users
Hi,
My reply is inline. :)

On May 14, 11:39 am, Rasmus Andersson <ras...@notion.se> wrote:
> Interesting, although I believe editing huge files is an extremely
> small use-case for code editors.

Yes and no. It isn't hugely common, but I don't think it is that hard
to support either.

I don't want to get too much into that debate here - I'll follow up on
the Kod mailing list.

> Could you really purge ascending/parent nodes?

No, but you could purge sibling nodes, including nodes prior to the
nodes currently displayed.

> The way Kod performs
> incremental parsing is that it traverses the AST starting at the
> affected node(s), walking upward (in the direction of the root) until
> it finds a node/branch that owns all affected child nodes, then
> resuscitates the parser to that state (and parses the code + patches
> the AST).
>
> Maybe I'm just a bit slow from Friday beers, but I think the
> effort-to-win ratio is something like 9999:1 (i.e. tiny use-case
> and huge amount of work).

At the moment you re-parse when something changes. If you also
re-parse when you need to access part of the AST and it has been freed
and the pointer set to NULL, then that's the only extra functionality
required. Deciding which nodes to free could come later.

> I'd love to get some help on the Kod parser BTW. Seems most people
> involved in the project want to either change the icon, work some
> fancy Ruby-ish sidebar UI/magic or implement reasonable UI/UX features
> — but no one seems to be interested in the core parts of the editor
> (parser, syntax styling, etc).

I'll respond to this on the Kod mailing list. :)

Be well,

Will :-}