Named terminal documentation

Matt Brubeck

Jul 14, 2008, 3:10:04 PM
to gazelle-users
Some minor things in the manual that confused me when trying to use
named terminals:

Unlike nonterminals, named terminals *do* need to precede their use.
This is gently implied ("can be referred to *later*") but wasn't clear
enough for me. (Would it be hard to remove this limitation?)

The manual says that you can name a string or regex, but only regexes
work.

If I'm reading it right, the manual uses "term" to mean any grammar
expression, but it also uses "term" and "nonterm" as abbreviations for
"terminal" and "nonterminal" at some point.

The manual says that "the named terminal syntax avoids creating a
trivial and useless nonterminal graph." Would it be possible (or even
desirable) for Gazelle to detect that a rule is trivial, and handle it
like a named terminal automatically?

On another documentation-related note, here's a trivial change to
README for us poor Debian users with our confusing multiple "lua"
packages:
http://github.com/mbrubeck/gazelle/commit/f4a847ed7e2a02274345d469c8b86a4bc0216d0f

Joshua Haberman

Jul 15, 2008, 3:16:26 AM
to gazelle-users
Hi Matt, thanks for the feedback!

On Jul 14, 12:10 pm, Matt Brubeck <mbrub...@limpet.net> wrote:
> Unlike nonterminals, named terminals *do* need to precede their use.
> This is gently implied ("can be referred to *later*") but wasn't clear
> enough for me.  (Would it be hard to remove this limitation?)

Removing this limitation wouldn't be too hard. I think you're right
that it should be removed.

I wish I had an issue tracker where I could put this. I'm trying to
set up hosting on code.google.com, but it won't let me register
because there's already a project on SourceForge named "Gazelle."
I've already talked with the guy who runs that project and he's OK
with sharing the name. Unfortunately I can't get hold of him now to
get the explicit OK that code.google.com requires before it will let
you register.

> The manual says that you can name a string or regex, but only regexes
> work.

Good catch. My first inclination is to say that the manual is wrong
and that naming strings should not be supported. I don't think
anything is gained by writing:

let: "let";
equal: "=";
declaration -> let var equal value;

vs. just saying:

declaration -> "let" var "=" value;

On the other hand, I wouldn't be surprised if someone could come up
with a case where naming strings has a real benefit. Do you have one?

> If I'm reading it right, the manual uses "term" to mean any grammar
> expression, but it also uses "term" and "nonterm" as abbreviations for
> "terminal" and "nonterminal" at some point.

That's a good point. I should come up with less ambiguous
terminology. There are a lot of terms whose names I'm not crazy
about. For example, "RTN" would probably be better called a
"rule graph" or something like that.

Maybe I should make a glossary of the terms I'm currently using,
identify the ones I don't like, and try to get better names for them
before they become even more ingrained.

> The manual says that "the named terminal syntax avoids creating a
> trivial and useless nonterminal graph."  Would it be possible (or even
> desirable) for Gazelle to detect that a rule is trivial, and handle it
> like a named terminal automatically?

I had this same thought at one point. Gazelle could certainly detect
this case. The problem is that the method for hooking up a host
program to a grammar is going to inextricably tie the host program to
the structure of the grammar. So optimizing parts of the grammar away
isn't really an option. If it looks like a rule in the grammar, it's
got to be a rule in the output.

> On another documentation-related note, here's a trivial change to
> README for us poor Debian users with our confusing multiple "lua"
> packages:
> http://github.com/mbrubeck/gazelle/commit/f4a847ed7e2a02274345d469c8b...

How exciting -- I just got to merge my first outside contribution
using Git!

http://github.com/haberman/gazelle/commit/82b93dfeb6629b0b1e8d7bc9b651addc4ca4365c

I rebased your change before merging it -- hope you don't mind. If
you rebase on your own prior to submitting, it helps keep the history
cleaner.

Josh

Matt Brubeck

Jul 15, 2008, 8:48:22 AM
to gazelle-users
On Jul 15, 12:16 am, Joshua Haberman <jhaber...@gmail.com> wrote:
> I wish I had an issue tracker where I could put this.

Maybe ticgit or git-issues as a temporary solution (or just keep using
TODO)?

> My first inclination is to say that the manual is wrong
> and that naming strings should not be supported. [...]
> On the other hand, I wouldn't be surprised if someone could come up
> with a case where naming strings has a real benefit.  Do you have one?

The only reason I've come up with would be a long string that is
repeated often, or likely to change. But I don't have any real-world
examples. And anyways, you can always use a trivial regex to match a
single string if you really want to. So I don't see any harm in
keeping (and documenting) the current functionality.

Joshua Haberman

Jul 15, 2008, 8:04:48 PM
to gazelle-users
On Jul 15, 5:48 am, Matt Brubeck <mbrub...@limpet.net> wrote:
> On Jul 15, 12:16 am, Joshua  Haberman <jhaber...@gmail.com> wrote:
>
> > I wish I had an issue tracker where I could put this.
>
> Maybe ticgit or git-issues as a temporary solution (or just keep using
> TODO)?

Ok, I managed to finally get registered at code.google.com, so now I
have an issue tracker:

http://code.google.com/p/gazelle/issues/list

Now that I have it though, I'm a little bit less sure that removing
this limitation is for the best. Allowing forward references for
rules is a requirement, since rules can be mutually recursive. If you
have:

a -> "X" b?;
b -> "Y" a?;

This is a perfectly valid grammar, but cannot be expressed as such
without allowing rules to be referenced before they are defined.
Named terminals, on the other hand, don't reference anything else and
can always come before everything else.
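
To illustrate the asymmetry described above, here is a minimal sketch
in Python. It is not Gazelle's actual implementation: the statement
tuples and the check_references function are hypothetical, made up
for this example. It accepts forward references to rules but reports
a named terminal that is used before its definition:

```python
# Hypothetical model of a grammar file, NOT Gazelle's real data structures.
# Each statement is (kind, name, referenced_names), listed in source order.

def check_references(statements):
    """Return a list of error strings for the given grammar statements.

    Rules may be referenced before they are defined (mutual recursion
    needs this); named terminals must be defined before their first use.
    """
    # First pass: collect every name so forward references can be resolved.
    rule_names = {name for kind, name, _ in statements if kind == "rule"}
    terminal_names = {name for kind, name, _ in statements if kind == "terminal"}

    # Second pass: walk in source order, tracking terminals defined so far.
    seen_terminals = set()
    errors = []
    for kind, name, refs in statements:
        if kind == "terminal":
            seen_terminals.add(name)
            continue
        for ref in refs:
            if ref in rule_names:
                continue  # forward references to rules are always allowed
            if ref in terminal_names:
                if ref not in seen_terminals:
                    errors.append(
                        f"{name}: named terminal '{ref}' used before definition"
                    )
            else:
                errors.append(f"{name}: undefined symbol '{ref}'")
    return errors


# a -> "X" b?;  b -> "Y" a?;   (string literals are not tracked here)
mutually_recursive = [
    ("rule", "a", ["b"]),
    ("rule", "b", ["a"]),
]

# A rule that uses a named terminal defined only afterwards.
terminal_after_use = [
    ("rule", "declaration", ["ident"]),
    ("terminal", "ident", []),
]

print(check_references(mutually_recursive))
print(check_references(terminal_after_use))
```

The first grammar yields no errors even though "a" refers to "b"
before "b" is defined; the second yields one error, because "ident"
is a named terminal that appears after its first use.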

I'm trying to walk a fine line between having the language's
limitations encourage good style and having them be a straight-
jacket. Is it oppressive to have to list named terminals before their
use? Difficulty of implementation is not a significant issue; I'm
just thinking there may be a benefit in knowing, when you read a
grammar, that any symbol used before it has been defined must be a
rule and not a named terminal.

Another possibility is to syntactically enforce a convention like:
"named terminals are in all caps, anything else is a nonterminal." On
one hand the consistency that would provide appeals to me, on the
other hand I think it could be nice to allow grammar files to follow
the conventions of the standard they are implementing, to make it
easier to compare the two.

I read an essay a while back that I wish I could find now, where the
author argues that languages like C would be better if they were so
stringent about style that code would fail to compile if it wasn't,
e.g., indented properly. On one hand that sounds extreme, but on the other
hand most significant projects end up establishing a convention anyway
and trying to make sure everyone follows it. Consistency encourages
readability. Why make each project do this work of creating and
establishing a convention? If the convention is a part of the
language, then there will be consistency automatically across everyone
who uses the language.

So to bring this back to a more concrete discussion, would requiring
named terminals to be defined before their use be a gentle nudge in
the right direction that encourages everyone to structure their
grammars with named terminals first, or would it be a draconian
limitation that makes the language more temperamental than it's
worth?

> > My first inclination is to say that the manual is wrong
> > and that naming strings should not be supported.  [...]
> > On the other hand, I wouldn't be surprised if someone could come up
> > with a case where naming strings has a real benefit.  Do you have one?
>
> The only reason I've come up with would be a long string that is
> repeated often, or likely to change.  But I don't have any real-world
> examples.  And anyways, you can always use a trivial regex to match a
> single string if you really want to.  So I don't see any harm in
> keeping (and documenting) the current functionality.

Cool, I've documented it:

http://github.com/haberman/gazelle/commit/4f9352caaed6fcc6b8c70f2ec8cb053d2f64e3cb

Josh

Matt Brubeck

Jul 16, 2008, 9:42:33 AM
to gazelle-users
On Jul 15, 5:04 pm, Joshua Haberman <jhaber...@gmail.com> wrote:
> I'm trying to walk a fine line between having the language's
> limitations encourage good style and having them be a straight-
> jacket.  Is it oppressive to have to list named terminals before their
> use?

I do think it's too oppressive. My main argument is that there is a
strong convention in grammars to do exactly the opposite. Normally,
grammars are specified starting with the highest-level rules, and
working down to the details. (See sketches/regex.gzl for a perfect
example.) The current named terminal implementation is in direct
tension with this, forcing you to do the exact opposite for a
basically arbitrary subset of the grammar. It really feels like the
implementation showing through.

Joshua Haberman

Jul 16, 2008, 11:55:56 AM
to gazelle-users
Thanks for weighing in. What about an enforced convention like "named
terminals start with a capital letter, rules start with a lower-case
letter"? Sort of like constant names in Ruby. I feel more
comfortable with that level of enforced convention.
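
As a rough sketch of what such an enforced convention might look like
(Python, purely illustrative; classify_symbol is a hypothetical
helper, not part of Gazelle), the kind of a symbol could be read off
its first character alone, with no need to see its definition:

```python
def classify_symbol(name):
    """Classify a grammar symbol by the case of its first letter.

    Assumed convention (hypothetical, not implemented in Gazelle):
    an initial capital means a named terminal, an initial lower-case
    letter means a rule. Anything else is rejected.
    """
    if not name or not name[0].isalpha():
        raise ValueError(f"symbol must start with a letter: {name!r}")
    return "named terminal" if name[0].isupper() else "rule"


print(classify_symbol("Whitespace"))   # named terminal
print(classify_symbol("declaration"))  # rule
```

With a check like this, definition order stops mattering for
readability: the reader can tell which kind of symbol is being used
from the name alone.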

Josh

Joshua Haberman

Jul 16, 2008, 12:34:17 PM
to gazelle-users
FWIW, I find your visceral reaction more convincing than your actual
arguments. Bison also requires you to define token names before rules
-- they go in the "declarations" section, which must precede all the
rules:

http://www.gnu.org/software/bison/manual/html_mono/bison.html#Grammar-Outline

Requiring named regexes to come before rules doesn't impose any
limitation on the order of your *rules*, and therefore doesn't keep
you from specifying a grammar starting at the high level and working
down. It just requires that any named terminals come before any
rules. It's sort of like requiring that your "lexer" be defined
before your "parser", but the parser can still be defined top-down.

On the other hand, maybe there is a difference between requiring you
just to *name* your tokens prior to their use (like Bison does) and
requiring you to actually *define* them prior to their use.

sketches/regex.gzl works just fine under this rule -- its only named
terminal is "whitespace" (well, it should be a named terminal; I just
haven't modified it since the named terminal syntax was introduced),
which is only used after it is defined.

Josh

Joshua Haberman

Jul 16, 2008, 12:56:48 PM
to gazelle-users
By the way, sketches/lua.gzl is sort of what I had in mind as being
idiomatic wrt. this rule -- the grammar very closely matches the
grammar as specified by the Lua manual. We just predefine a few
regexes and then we work top-down.

Josh

On Jul 16, 9:34 am, Joshua Haberman <jhaber...@gmail.com> wrote:
> FWIW, I find your visceral reaction more convincing than your actual
> arguments.  Bison also requires you to define token names before rules
> -- they go in the "declarations" section, which must precede all the
> rules:
>
> http://www.gnu.org/software/bison/manual/html_mono/bison.html#Grammar...