Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

A finite state transducer specification for yacc's syntax suitable for utilities like "lex".

10 views
Skip to first unread message

Rock Brentwood

unread,
Mar 14, 2021, 8:52:12 PM3/14/21
to
It's a recurrent question that's come up in other forums "can yacc be bootstrapped in yacc?" Now, I'm adding a twist.

I'll repeat one of my recent replies here. In the syntax for yacc files, laid out by the POSIX standard, there is no mandatory semi-colon at the ends of rules, so an extra look-ahead is required to determine whether an identifier is followed by a colon. If so, then this indicates the left-hand side of a new rule.

A grammar rule has the form

left-hand-side ":" stuff on the right optional ";"'s.

If you see a ":" in the middle of the rules on the right, then you've actually sneaked on over into the *next* rule.

Bison hacks the syntax, by making left-hand-side + ":" into a single token.

It may, in fact, be possible to parse yacc *even with* this issue, without having to hack yacc grammar like bison did - by just simply not using yacc at all, but using only lex!

The grammar specified by POSIX is actually a *regular* grammar that specifies a finite state transducer. The transducer is deterministic, precisely because the identifier-colon ambiguity can be resolved. Therefore, if you set up the right kind of finite state machine in lex making use of lex's start condition facility in the right way, then you should be able to parse a yacc grammar with lex and to bootstrap an implementation of yacc with just lex.

Apologies if any of the columns spill over.

// Definitions:
key: token { α₀ } | left { α₁ } | right { α₂ } | binary { α₃ } | type { α₄ };
name: literal { β₀ } | id { β₁ };
def₀: start name { κ₀ } | union UnionC { κ₁ } | en CodeC de { κ₂ };
tag: { μ₀ } ('<' name '>' { μ₁ })?;
idn: name ({ θ₀ } | num { θ₁ });
act: '{' ActionC '}' { ι };
item: name | act;

// A Context-Free Grammar (It's actually a regular grammar), based on the POSIX standard.
// A glossary of the actions is listed below.

%start Yacc

Yacc: Defs mark Rules Tail { ω };
Def: def₀ | key tag { ε₀ } Decs { κ₃ };
Decs: idn { ε₁ } | Decs ','? idn { ε₁ };
Defs: { γ₀ } | Defs Def { γ₁ };
Rules: { ζ₀ } Rule { ζ₁ } | Rules Rule { ζ₁ };
Rule: id ':' { ν₀ } Body Prec { ν₁ } | Rule '|' Body Prec { ν₁ };
Body: { δ₀ } | Body item { δ₁ };
Prec: { λ₀ } (prec name { λ₁ } (act { λ₂ })?)? | Prec ';' { λ₃ };
Tail: { η₀ } (mark ProgramC { η₁ })?;
%start Yacc

// A DFA version of the grammar, since it's a regular grammar.
// All of these transitions can be done unambiguously with a look-ahead of 1.
// For each state, earlier alternatives take precedence over later ones.
Yacc:
γ₀ Defs
Defs:
def₀ NextDef |
key tag ε₀ Decs |
mark Rules
NextDef:
γ₁ Defs
Decs:
idn ε₁ NextDec
NextDec:
','? Decs |
κ₃ NextDef
Rules:
ζ₀ id Rule
Rule:
':' ν₀ Body
Body:
δ₀ Items
Items: // Non-deterministic without the explicit transition on "id", because id requires an extra look-ahead for ':'.
id CheckColon |
item δ₁ Items |
λ₀ (prec name λ₁ (act λ₂)?)? (';' λ₃)* ν₁ NextItem
CheckColon: // Extracts out the transition on "id" and handles it separately.
λ₀ ν₁ ζ₁ ':' ν₀ Body |
β₁ δ₁ Items
NextItem:
'|' Body |
ζ₁ NextRule
NextRule:
id Rule |
η₀ Tail
Tail:
(mark ProgramC η₁)? ω

// Actions:
∙ α₀,α₁,α₂,α₃,α₄: Identify and store the "key" word.
∙ β₀,β₁: Identitify and store the "name".
∙ γ₀: Start an empty set of definitions
γ₁: Add the last definition processed.
∙ δ₀: Set up a new rule, as the empty list.
δ₁: Append the last "item" processed to the current rule.
∙ ε₀: Set up an empty name list for the key word and (optional tag) just processed.
ε₁: Add the most recently processed name (and optional number) to the key word list.
∙ ζ₀: Start the rules section and set up an empty rules list.
ζ₁: Add the most recently processed rule to the rules list.
∙ η₀: Close the Rules section, process the rules.
η₁: Process the Tail section.
∙ θ₀: Add the most recently processed "name" to the key word list with no number or default number.
θ₁: Add the most recently processed "name" to the key word list with the most recently processed number.
∙ ι: Store the most recently processed rule action.
∙ κ₀: Issue/store the most recently processed start declaration.
κ₁: Issue/store the most recently processed union declaration.
κ₂: Issue/store the most recently processed code snippet.
κ₃: Issue/store the most recently processed key word declaration on the key word list just processed.
∙ λ₀: Close out the most recently processed rule and prepare for a possible precedence declaration.
λ₁: Attach a precedence associated with the most recently processed "name" to the most recently processed rule.
λ₂: Attach a final rule action to the most recently processed rule (basically this is just δ₁).
λ₃: Handle a trailing semicolon.
∙ μ₀: Start a key word declaration with a possibly empty/default tag.
μ₁: Attached the most recently processed "name" as a tag to the current key word declaration.
∙ ν₀: Retrieve the rules list for the most recently processed "id" or initialize it to an empty list, if new.
ν₁: Add the most recently processed rule body (and optional precedence) to the current rule list.
∙ ω: Close out the grammar file and compile the grammar just read in.

Rock Brentwood

unread,
Mar 14, 2021, 9:16:52 PM3/14/21
to
On Sunday, March 14, 2021 at 7:52:12 PM UTC-5, Rock Brentwood wrote:
> The grammar specified by POSIX is actually a *regular* grammar that specifies a finite state transducer.
> Apologies if any of the columns spill over.

Correction: union UnionC { κ₁ } should read union '{' UnionC '}' { κ₁ }.

The list of tokens used by the spec:
token - "%token" or "%0"
left - "%left" or "%<"
right - "%right" or "%>"
binary - "%binary" or "%nonassoc" or "%2"
type - "%type"
prec - "%prec"
start - "%start"
union - "%union"
mark - "%%"
en - "%{"
de - "%}"
id - An alphanumeric identifier [a-zA-Z_.][a-zA-Z_.0-9]*
num - A numeral [0-9][0-9]*.
literal - A literal identifier: any valid C character enclosed in single quotes.
Multi-character literals may be allowed, I haven't read the POSIX standard closely on this issue.

UnionC, CodeC, ProgramC, ActionC: C code snippets consisting of items in a format that would be accepted by C preprocessor.
CodeC: has no "%}" in it, other than those appearing in comments or string literals or character literals.
ActionC, UnionC: cannot have unbalanced "{" and "}" characters since the curly brackets are being used to determine when the snippet starts and ends.
ProgramC: can have anything in it and consists of whatever appears after the "%%" mark up to the end of file.
ActionC: May contain the macros $$, $n, $<tag>, $<tag>n which are to be converted. Parentheses '(' and ')' cannot be unbalanced in ActionC.

In the Unix and BSD versions of Yacc, code snippets are stowed away in temporary files, rather than in memory. These days, this is no longer necessary: contemporary OS's *already* cache excess memory to secondary storage. But it would be exceedingly rare for an actual yacc file to have code snippets so large as to require so much memory as to make this necessary - not even for a natural language grammar!
0 new messages