About finding the start symbol of a grammar

Eduardo Costa

unread,

May 21, 2021, 9:46:25 AM5/21/21

to

Hey guys,

I've been lately dealing with a parser generator for LL grammars, and since
it's inception I've always been blindy assuming the first element read from
within the input file is going to be the start symbol or starting rule.

So I've been wondering all this time, just out of curiosity, if there exists a
method or algorithm to find out the start symbol of a given grammar?

I guess the answer is no.

While there would exist grammars we could recursively check to find out which
it's start symbol is (i.e.: it's the only rule that used the rest of them,
where checking every other resulted in dangling rules that weren't even called
in), there might be other grammars for which more than one rule yields full
coverage (all of these obviously defining different languages) and so leading
to ambiguity.

I only contemplate a simple coverage test, even though other techniques could
exist, again, all of them leading to a point where we couldn't ascertain if
one or the other is what the user meant.

So I'm wondering if this is even an issue in production-grade
parser-generators out there?

Regards,
[yacc and its descendants have an explicit %start declaration, usually defaulting to
the first rule in the file. -John]

Kaz Kylheku

unread,

May 21, 2021, 11:17:17 AM5/21/21

to

On 2021-05-21, Eduardo Costa <ecost...@gmail.com> wrote:
> Hey guys,
>
> I've been lately dealing with a parser generator for LL grammars, and since
> it's inception I've always been blindy assuming the first element read from
> within the input file is going to be the start symbol or starting rule.
>
> So I've been wondering all this time, just out of curiosity, if there exists a
> method or algorithm to find out the start symbol of a given grammar?
>
> I guess the answer is no.

Surely, the start symbol of a context-free grammar is one which appears
only in the left hand side of a rule. If there is such a unique symbol,
it must be /the/ start symbol.

> While there would exist grammars we could recursively check to find out which
> it's start symbol is (i.e.: it's the only rule that used the rest of them,
> where checking every other resulted in dangling rules that weren't even called
> in), there might be other grammars for which more than one rule yields full
> coverage (all of these obviously defining different languages) and so leading
> to ambiguity.

Ambiguity doesn't imply there is no algorithm to find a start symbol,
but that the algorithm must be able to report situations like the
presence of multiple start symbols, or no start symbols.

On the face of it, this problem does not smell of undecidability, or
even NP completeness. Where do you suspect is the difficulty?

It seems like this is a fairly trivial property of a graph, type of
thing.

Whether rules are dangling is also a graph property: is the graph
connected.

> I only contemplate a simple coverage test, even though other techniques could
> exist, again, all of them leading to a point where we couldn't ascertain if
> one or the other is what the user meant.

But tha seems like an identifiable point where the algorithm can stop
and announce a decision. Then diagnostics can be issued.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[I have seen useful grammars where the start symbol also appears in the RHS of a rule.
Think of the standard expression grammar.

You pick the start symbol that gives you the language you want to parse. -John]

Hans-Peter Diettrich

unread,

May 21, 2021, 11:17:49 AM5/21/21

to

On 5/21/21 12:49 PM, Eduardo Costa wrote:
> I've been lately dealing with a parser generator for LL grammars, and since
> it's inception I've always been blindy assuming the first element read from
> within the input file is going to be the start symbol or starting rule.
>
> So I've been wondering all this time, just out of curiosity, if there exists a
> method or algorithm to find out the start symbol of a given grammar?

Graph analysis methods exist to find unreachable nodes which can become
start symbols. In short any node that is a predecessor of *all* nodes
can be a start symbol. If no such node exists then the grammar is faulty
(discontiguous).

> While there would exist grammars we could recursively check to find out which
> it's start symbol is (i.e.: it's the only rule that used the rest of them,
> where checking every other resulted in dangling rules that weren't even called
> in), there might be other grammars for which more than one rule yields full
> coverage (all of these obviously defining different languages) and so leading
> to ambiguity.

IMO this problem can be solved by introduction of an artificial start
symbol that allows to reach all other symbols but can not be reached
itself. Please note that this solution solves a syntactic problem but
may not prevent or even cause semantic problems.

> I only contemplate a simple coverage test, even though other techniques could
> exist, again, all of them leading to a point where we couldn't ascertain if
> one or the other is what the user meant.
>
> So I'm wondering if this is even an issue in production-grade
> parser-generators out there?

A useful parser generator should include checks for grammar sanity.

DoDi

Anton Ertl

unread,

May 22, 2021, 1:23:40 PM5/22/21

to

Kaz Kylheku <563-36...@kylheku.com> writes:
>Surely, the start symbol of a context-free grammar is one which appears
>only in the left hand side of a rule. If there is such a unique symbol,
>it must be /the/ start symbol.

It could be a now-unused nonterminal, while the start symbol is part
of a strongly connected component of the graph.

If you have one nonterminal that dominates all other nonterminals (the
other nonterminals can be derived from it, but not the other way
round), it probably is the start symbol. Why "probably"? There is
still the possibility that there is a wrapper rule around the real
start symbol that was used for debugging and is left in the grammar.

>On the face of it, this problem does not smell of undecidability, or
>even NP completeness. Where do you suspect is the difficulty?

It's easy to find nodes with particular properties in a graph. But
there is no guarantee that the result really is the start symbol.

There is a reason why you specify the start symbol in some way.

>Whether rules are dangling is also a graph property: is the graph
>connected.

"Connected" is an undirected-graph property. If a nonterminal is
unreachable from the start symbol, it can still be connected to the
reachable graph through a RHS-nonterminal.

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

Ev. Drikos

unread,

May 22, 2021, 1:23:57 PM5/22/21

to

On 21/05/2021 13:49, Eduardo Costa wrote:
> While there would exist grammars we could recursively check to find out which
> it's start symbol is (i.e.: it's the only rule that used the rest of them,
> where checking every other resulted in dangling rules that weren't even called
> in), there might be other grammars for which more than one rule yields full
> coverage (all of these obviously defining different languages) and so leading
> to ambiguity.

IMHO, it can be so simple as you describe here without important overhead.
Typically, a parser will reduce the start symbol and finish. All rules
that yield full coverage can be ie alternatives of a single root symbol:

RootSymbol -> R1 | R2 | ... | Rn

Ev. Drikos

Christopher F Clark

unread,

May 22, 2021, 1:24:24 PM5/22/21

to

As Dodi noted, this is basically a graph analysis problem and the
graph may be disconnected (a forest). And our moderator has added
several insightful comments. E.g. you can "declare" a start symbol
and if not present default to some symbol, either the first one in the
grammar, or some symbol from which all other symbols are reachable
(presuming the graph isn't disconnected), and the start symbol can be
recursively defined, etc.

However, there is one particular curious aspect if you are writing a
translator to a recursive descent parser, one generally makes a
function of each rule, as a result one can consider each symbol a
start symbol for whatever sub-graph is reachable from it. With a
table driven parser, one has to make a table of entries into the
parsing table to achieve the same effect, but that is not difficult to
do, although that may require additional table rows if the symbol
behaves slightly differently when used as a start symbol rather than
in the context of other rules (e.g. follow symbols).

So, in that sense, a start symbol is simply what one wants to parse.

--
******************************************************************************
Chris Clark email: christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

gah4

unread,

May 27, 2021, 10:40:23 AM5/27/21

to

On Saturday, May 22, 2021 at 10:24:24 AM UTC-7, Christopher F Clark wrote:
> As Dodi noted, this is basically a graph analysis problem and the
> graph may be disconnected (a forest). And our moderator has added
> several insightful comments. E.g. you can "declare" a start symbol
> and if not present default to some symbol, either the first one in the
> grammar, or some symbol from which all other symbols are reachable
> (presuming the graph isn't disconnected), and the start symbol can be
> recursively defined, etc.

Seems to me that this should be related to the problem of finding the
root of a phylogenetic tree.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7149615/

Unlike CS trees, there is no necessary directionality to the links, and so
finding the root is more difficult. Yet biologists have some desire to
know where the root is.

But as also noted above, in the case of recursion, it depends on the language.

In most languages, <expression> is recursive, allowing for

'(' <expression> ')'

however, a language (though I don't know of any) could require all expressions
to be parenthesized, in which case the start would be the parenthesized form.

[I think previous messagees have made it clear that while this is an
interesting exercise, its only practical use is to see if the start
symbol declared in the grammar is different from the computed one, in
which case the grammar is broken. -John]