As you may or may not know, Gazelle uses protocol buffers internally - so it's a logical decision.
Gazelle is such an interesting project that I'd like to know how we could help Joshua out to speed things up (on either upb *or* Gazelle).
Scott
Scott,
Anyhow, yes, I did some research earlier and read an old post (from
2009 or possibly early 2010) where you (Joshua) said that Gazelle
was being refactored internally to use PB instead of BC for data
serialization. That sounds nice, but what are the reasons (beyond PB
being easier to debug) for using Protocol Buffers instead of BitCode?
I mean, the versioning features are basically useless in the case of
Gazelle, since the canonical representation of a grammar is the gzl
file. And is the use case of a version 1 runtime reading in a
version 2 grammar really something that would occur often (that is,
the runtime version is out of the user's control)? I'm
asking, not criticizing :)
Further, I've not spent too much time with Gazelle yet (about 10 hours
in total), but my primary goal is to use Gazelle as a foundation for
building a syntax-aware program editor. For this I need to handle
both error recovery and incremental parsing, both of which I believe
can be modelled on top of Gazelle.
I've written an implementation in C++ during the day which generates an
AST where any branch (not leaf) can halt and/or resume the parsing.
For this to be feasible in the editor I'm writing (http://kodapp.com/),
the stored state needs to be very small — here, there's much room for
improvement. Currently the parser state is rather large (a few hundred
bytes for a stack a few levels deep). Given a source file AST with 1000
branches, that's a lot of data to store, most of which is
redundant. Storing incremental/parity data instead, and later
traversing the tree to build a complete state, would be a much more
efficient solution IMHO (but of course also more complex, and would
imply the existence of some sort of semantic tree of the stack).
What are your goals and ideas about the future of Gazelle? For parsing
start-to-end, strictly well-formed input (e.g. building compilers), or
for parsing incomplete parts of online input (e.g. from human input)?
I would very much appreciate some guidance on where Gazelle
is moving and how soon (to know whether we should invest more time in
building tools around Gazelle or go for something different). Thanks!
BTW, I've been researching incremental parsing of concrete languages
for about two weeks (full time) now, and of everything I've dug into,
Gazelle is among the top 3 most interesting projects. You might also
find the now-stagnant Harmonia project at Berkeley CS interesting:
http://harmonia.cs.berkeley.edu/harmonia/ (the server hosting the
files is currently behind a mis-configured firewall, but
they're fixing it this week).
--
Rasmus Andersson
Very good arguments. Thanks for clearing that up. I agree that PB is
a better option.
>
> It's less clear how PBs are useful for a parse tree. Parse trees
> aren't really typed (every leaf is a string) and as you noted, the
> "schema" for a parse tree is mostly implicit from the structure of the
> grammar.
>
> My current thought is that the streaming API of protobufs could still
> be a useful model for parse trees, where all parse trees use a PB
> schema like:
>
> message ParseTreeNode {
>   // The name of the terminal or nonterminal this node represents.
>   required string name = 1;
>
>   union {
>     // For terminals: the text represented by the terminal.
>     string text = 2;
>
>     // For nonterminals: the sequence of nodes underneath this one.
>     repeated ParseTreeNode children = 3;
>   }
>
>   // Info about where in the source text this node appears.
>   required int64 offset = 4;
>   required int64 length = 5;
>   required int64 line_start = 6;
>   required int64 column_start = 7;
>   required int64 line_end = 8;
>   required int64 column_end = 9;
> }
Interesting.
>
>> Further, I've not spent too much time with Gazelle yet (about 10 hours
>> in total), but my primary goal is to use Gazelle as a fundament for
>> building a syntax-aware program editor. In this case I need to handle
>> both error recovery and incremental parsing, both of which I believe
>> can be modelled on top of Gazelle.
>
> One thing I can definitely say is that this is exactly the kind of use
> case that I always wanted Gazelle to excel at.
Sweet. That's excellent news.
>
>> I've written an implementation in C++ during the day which generate an
>> AST where any branch (not leaf) can halt and/or resume the parsing.
>> For this to be feasible in the editor I'm writing (http://kodapp.com/)
>> the state stored need to be very small — here, there's much room for
>> improvement. Currently the parser state is rather large (a few hundred
>> bytes for a few levels deep stack). Given a source file AST with 1000
>> branches, that's a lot of data to be stored, most of which is
>> redundant. By storing incremental/parity data instead, and later
>> traversing the tree to build a complete state, would be a much more
>> efficient solution IMHO (but of course also more complex and would
>> imply the existence of some sort of semantic tree of the stack).
>
> I'm not sure I completely follow when you say "later traversing the
> tree to build a complete state." My vision for how this would work in
> an editor is that you would store a reasonable number of parse states
> -- say one for the beginning of each line of the file being edited --
> and then whenever a character is typed you can resume from the most
> recent saved state. The number of states you save is a memory/CPU
> tradeoff.
We are deliberately trying to avoid assuming that the language we are
parsing is line-based (or making any other assumptions, like the use
of braces/brackets to denote scope). Imagine opening a large JSON
file in an editor: a minified JSON document may contain no linebreaks
at all, and would thus require a full re-parse for every keystroke.
What we are looking at doing is something along the lines of what's
presented in the paper "Lazy Functional Incremental Parsing" by
Jean-Philippe Bernardy (presentation here:
http://www.vimeo.com/6683816 I also have a PDF of the paper I can
share on request).
Imagine the following AST:
01. root
02. --object
03. ----pair
04. ------key
05. --------string
06. ------value
07. --------number
08. ----pair
09. ------key
10. --------string
11. ------value
12. --------number
Now, the state of the stack at any given node is inherently related to
the branch that node lives in. The stack state for node 04. in the
example above might look like this:
rule:key, range:x,y, ...
rule:pair, range:x,y, ...
rule:object, range:x,y, ...
rule:root, range:x,y, ...
When we need to resume parsing at node 04, we could either load a
saved complete stack stored in/at that node, OR we could load a
partial stack and then walk up the AST until we reach the root, thus
obtaining a complete stack, and finally resume parsing.
>
> By the way, a few hundred bytes per parser state is already doing
> pretty good; I can't guarantee Gazelle will be much better than that.
> Last time I computed it, the current code in Gazelle uses 24 bytes per
> stack entry and 12 bytes per token of lookahead. So with a stack just
> 4 deep and 1 token of lookahead you've already hit 100 bytes.
From what I can see by quickly viewing the gzl_parse_state struct, the
struct itself constitutes almost 100 bytes (I'm on x86_64). In my
rudimentary C++ program on top of Gazelle, I have a "ParserSnapshot"
which only saves the crucial parts of the gzl_parse_state struct,
leaving out things like max_stack_depth, max_lookahead and
user_data (none of which, in my case, is bound to a specific parse
state).
I think we can streamline the state load/save API to copy only the
bare minimum required for a given application (e.g. passing bitflags
to a copy_state function describing which parts you need).
>
>> What are your goals and ideas about the future of Gazelle? For parsing
>> start-to-end strictly well-formed input (e.g. building compilers) or
>> for parsing incomplete parts of online input (e.g. from human input)?
>
> My goal is definitely to accommodate both. I may discover at some
> point that I can't do both without making the tool sub-par at both
> jobs; if so I'll have to cross that bridge when I come to it.
>
> As far as parsing incomplete parts of online input, the way I
> currently think about that problem is that it is the same as error
> recovery for a compiler. In both cases you're dealing with incomplete/
> inaccurate input, and in both cases you want to make as much sense of
> the input as you can. I believe that Clang from LLVM is taking the
> same approach and using the compiler's parser interactively also.
Agreed — incremental parsing and error recovery are indeed very much
related and kind of intersect. However, in the case of clang (which
has a rather rich and nice C API in libclang), error recovery can be
handled efficiently since the parser is very much purpose-built. If
you find an extra "}" after another "}" which doesn't match up with
the parent scope, you can take a specific error recovery action. In our
case, with a generalized approach, this is AFAIK a much trickier
business. A very interesting problem indeed!
>
>> I would very much appreciate some guiding directions to where Gazelle
>> is moving and how soon (to know if we should invest more time with
>> building tools around Gazelle or go for something different). Thanks!
>
> The one answer I can give you with the most certainty is "probably not
> soon enough to make you happy." If there's one thing that has become
> apparent to me over the last several years of developing Gazelle and
> upb, it's that I don't work fast. I'm really trying to make a tool
> that will stand the test of time and be useful for lots of different
> things, so my process involves reading a lot of literature, looking at
> a lot of existing software, considering a lot of different use cases,
> etc.
>
> That said, I'm really keen on getting feedback and information from
> people who are focused on one particular problem, like your editor
> which looks very cool. I'd love to hear more about your
> requirements. And if you think Gazelle is a useful enough basis for
> doing your own work, I'd even be happy for you to fork it and hack it
> up to your heart's delight, solving the problems you encounter in
> whatever way is most convenient for you. It would give me a lot of
> good ideas about the best ways to generalize. I'd be happy to answer
> any questions you have about the code. Of the stuff in Gazelle, I'd
> say the most valuable and hard to recreate is the LL lookahead
> generation in compiler/ll.lua.
I've decided to invest more time in Gazelle and have briefly talked
with some of my co-developers on Kod, who seem to agree that this looks
like a very interesting path to walk down. You can find my fork at
https://github.com/rsms/gazelle (however, I've not yet pushed the C++
library additions nor published my AST-machine-stuff, but will soon).
Cheers
Here's a short screencast demo: http://www.youtube.com/watch?v=nSM41hToa0w
My fork of Gazelle with a few fixes and changes: https://github.com/rsms/gazelle
The AST stuff is implemented in Kod:
https://github.com/rsms/kod/tree/master/src (AST*)
:)
> Also, for some reason I was under the impression that your editor was
> closed-source, which is fine because of Gazelle's BSD license, but it's cool
> to know that it's open source also.
Yeah, the source is released under an MIT license.
> I see you added a new callback for will_start_rule in addition to
> did_start_rule; out of curiosity, how do you use this? It seems like you
> could just as easily look one stack frame up when your did_start_rule
> callback gets called.
The reason for this is that I save parser state per AST node. Consider
the following tree:
a
--b
--c
----d
----e
--f
Now, if "e" is the node that need to be replaced, I resuscitate the
parser state from the closest sibling, which in this example is "d".
As the state is saved after "d" has been processed but before "e" has
begun to "process", I can use that as an entry point for the parser.
When I saved the state on "rule [will] end", the state saved would be
that of just before a rule ended, which would cause a "frame pop" when
resuming the parser.
I don't fully understand the Gazelle runtime just yet, so there might
be an easier/prettier way to solve this :)
I definitely agree that saving fewer state snapshots is the way to go.
Due to some other things in life (moving to another country, starting
a new job, etc.) I haven't had much time to hack on either Gazelle or
Kod. I'm hoping to find some time soon and will investigate this idea.
One thing though: how/where would you save state? I mean, what is a
"unit" (at the finest level of granularity)? A branch (i.e. not a
leaf) in the AST? At the moment my implementation saves a state
snapshot for each branch in the AST; one can then simply walk
leaf-to-root, stopping at the first branch that completely encompasses
the change needing re-parsing. If we save state every N branches
instead of every branch, the cost could be high (i.e. walking two or
three branches closer to the root might cause _everything_ to be
re-parsed). Just a thought.
Cheers
Could you really purge ascending/parent nodes? The way Kod performs
incremental parsing is that it traverses the AST starting at the
affected node(s), walking upward (in the direction of the root) until
it finds a node/branch that owns all affected child nodes, then
resuscitates the parser to that state (and parses the code + patches
the AST).
Maybe I'm just a bit slow from Friday beers, but I think the
effort-to-win ratio is something like 9999:1 (i.e. tiny use-case
and a huge amount of work).
I'd love to get some help on the Kod parser, BTW. It seems most people
involved in the project want to either change the icon, work on some
fancy Ruby-ish sidebar UI/magic, or implement reasonable UI/UX features
— but no one seems to be interested in the core parts of the editor
(parser, syntax styling, etc).
Cheers