Extending the Standard Yacc Grammar to scripts

Jens Schweikhardt

unread,

May 30, 2012, 10:50:28 AM5/30/12

to

hello, world\n

POSIX specified the shell grammar up to a "complete_command" as the
start symbol. See
http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_10
which starts with

complete_command : list separator
| list
;
list : list separator_op and_or
| and_or
;
and_or : pipeline
| and_or AND_IF linebreak pipeline
| and_or OR_IF linebreak pipeline
;
pipeline : pipe_sequence
| Bang pipe_sequence
;
pipe_sequence : command
| pipe_sequence '|' linebreak command
;
command : simple_command
| compound_command
| compound_command redirect_list
| function_definition
;
[...]
newline_list : NEWLINE
| newline_list NEWLINE
;
linebreak : newline_list
| /* empty */
;
separator_op : '&'
| ';'
;
separator : separator_op linebreak
| newline_list
;
sequential_sep : ';' linebreak
| newline_list
;

A complete command is, e.g.,

if foo; zap & zonk; then bar; baz; zap
else
ls || cat /dev/null && echo
fi > x &

but this is not

ls
ls

because these are two simple_commands not part of a list.

How would the yacc grammar have to be modified or extended in order to
accept a complete script file, with several complete_commands separated
by appropriate separators? I played around a little bit making a script
a naive sequence of complete_commands with separators but it turns out I
either create many additional shift/reduce conflicts (adding 4 to 117
more to the exisiting 5) or it doesn't even parse the double ls :-)

Regards,

Jens
--
Jens Schweikhardt http://www.schweikhardt.net/
SIGSIG -- signature too long (core dumped)

Janis Papanagnou

unread,

May 30, 2012, 12:14:34 PM5/30/12

to

> if foo; zap& zonk; then bar; baz; zap

> else
> ls || cat /dev/null&& echo
> fi> x&
>
> but this is not
>
> ls
> ls
>
> because these are two simple_commands not part of a list.
>
> How would the yacc grammar have to be modified or extended in order to
> accept a complete script file, with several complete_commands separated
> by appropriate separators? I played around a little bit making a script
> a naive sequence of complete_commands with separators but it turns out I
> either create many additional shift/reduce conflicts (adding 4 to 117
> more to the exisiting 5) or it doesn't even parse the double ls :-)

IIUC...
Just introduce a command_sequence (if not already present in the
grammar) definition, similar to the pipe_sequence definition in
the above quoted grammar, but without the terminal symbol '|' of
course. Whether that would introduce above mentioned conflicts is
hard to tell (for me) without trying that with the yacc parser
generator. What have you tried so far that gave you the conflicts?

Janis

>
> Regards,
>
> Jens

Kaz Kylheku

unread,

May 30, 2012, 12:16:56 PM5/30/12

to

On 2012-05-30, Jens Schweikhardt <use...@schweikhardt.net> wrote:
> hello, world\n

Hi Jens,

> POSIX specified the shell grammar up to a "complete_command" as the
> start symbol.

Yes; probably because Yacc parsing is not well suited to getting interactive
input, and it is easy to call ypparse() repeatedly.

> How would the yacc grammar have to be modified or extended in order to
> accept a complete script file, with several complete_commands separated
> by appropriate separators?

I would start by looking at this production:

subshell : '(' compound_list ')'
;

This compound_list generates a complete script. I suspect that compound_list
can simply be reused as a whole-script recognizer. The parentheses are not
essential since the LALR(1) logic will reduce on the EOF symbol just as well as
on ')'.

> I played around a little bit making a script
> a naive sequence of complete_commands with separators but it turns out I
> either create many additional shift/reduce conflicts (adding 4 to 117
> more to the exisiting 5) or it doesn't even parse the double ls :-)

Existing 5? Hmm, I wonder whether those are true ambiguities or whether
they can be factored out.

--
If you ever need any coding done, I'm your goto man!

Jens Schweikhardt

unread,

May 30, 2012, 4:56:37 PM5/30/12

to

Kaz Kylheku <k...@kylheku.com> wrote
in <20120530...@kylheku.com>:
# On 2012-05-30, Jens Schweikhardt <use...@schweikhardt.net> wrote:
#> hello, world\n
#
# Hi Jens,

Hi Kaz,

#> POSIX specified the shell grammar up to a "complete_command" as the
#> start symbol.
#
# Yes; probably because Yacc parsing is not well suited to getting interactive
# input, and it is easy to call ypparse() repeatedly.
#
#> How would the yacc grammar have to be modified or extended in order to
#> accept a complete script file, with several complete_commands separated
#> by appropriate separators?
#
# I would start by looking at this production:
#
# subshell : '(' compound_list ')'
# ;
#
# This compound_list generates a complete script. I suspect that compound_list
# can simply be reused as a whole-script recognizer. The parentheses are not
# essential since the LALR(1) logic will reduce on the EOF symbol just as well as
# on ')'.

That's a good idea. I have added

commands : compound_list ;

and removed the now useless top rules (as reported by yacc/bison)

complete_command : list separator
| list
;
list : list separator_op and_or
| and_or
;

This parses command sequences well and keeps the s/r conflicts at 5.
Now what appears to remain are extra lone newlines and semicolons, e.g.
a script with just

;

(semi or semi newline or newline) is still a syntax error. I have
appended a selfcontained shell.y for everyone to play with. No lexer
needed, simply initialize the token[] array near the bottom to your
liking. Compile with

bison -y -o shell.c shell.y
cc -o shell shell.c

Enjoy, Jens

------- shell.y: ------

%{
#include <stdio.h>
#include <stdlib.h>

extern int yyerror (char *s);
extern int yylex (void);

%}

%token WORD
%token ASSIGNMENT_WORD
%token NAME
%token NEWLINE
%token IO_NUMBER

/* The following are the operators mentioned above. */

%token AND_IF OR_IF DSEMI
/* '&&' '||' ';;' */

%token DLESS DGREAT LESSAND GREATAND LESSGREAT DLESSDASH
/* '<<' '>>' '<&' '>&' '<>' '<<-' */

%token CLOBBER
/* '>|' */

/* The following are the reserved words. */

%token If Then Else Elif Fi Do Done
/* 'if' 'then' 'else' 'elif' 'fi' 'do' 'done' */

%token Case Esac While Until For
/* 'case' 'esac' 'while' 'until' 'for' */

/* These are reserved words, not operator tokens, and are
recognized when reserved words are recognized. */

%token Lbrace Rbrace Bang
/* '{' '}' '!' */

%token In
/* 'in' */

/* -------------------------------------------------------
The Grammar
------------------------------------------------------- */
/*commands : complete_command
| commands sequential_sep
; */
/*script : complete_command separator
| complete_command
; */

%%

commands : compound_list ;

compound_command : brace_group
| subshell
| for_clause
| case_clause
| if_clause
| while_clause
| until_clause
;
subshell : '(' compound_list ')'
;
compound_list : term
| newline_list term
| term separator
| newline_list term separator
;
term : term separator and_or
| and_or
;
for_clause : For name linebreak do_group
| For name linebreak in sequential_sep do_group
| For name linebreak in wordlist sequential_sep do_group
;
name : NAME /* Apply rule 5 */
;
in : In /* Apply rule 6 */
;
wordlist : wordlist WORD
| WORD
;
case_clause : Case WORD linebreak in linebreak case_list Esac
| Case WORD linebreak in linebreak case_list_ns Esac
| Case WORD linebreak in linebreak Esac
;
case_list_ns : case_list case_item_ns
| case_item_ns
;
case_list : case_list case_item
| case_item
;
case_item_ns : pattern ')' linebreak
| pattern ')' compound_list linebreak
| '(' pattern ')' linebreak
| '(' pattern ')' compound_list linebreak
;
case_item : pattern ')' linebreak DSEMI linebreak
| pattern ')' compound_list DSEMI linebreak
| '(' pattern ')' linebreak DSEMI linebreak
| '(' pattern ')' compound_list DSEMI linebreak
;
pattern : WORD /* Apply rule 4 */
| pattern '|' WORD /* Do not apply rule 4 */
;
if_clause : If compound_list Then compound_list else_part Fi
| If compound_list Then compound_list Fi
;
else_part : Elif compound_list Then else_part
| Else compound_list
;
while_clause : While compound_list do_group
;
until_clause : Until compound_list do_group
;
function_definition : fname '(' ')' linebreak function_body
;
function_body : compound_command /* Apply rule 9 */
| compound_command redirect_list /* Apply rule 9 */
;
fname : NAME /* Apply rule 8 */
;
brace_group : Lbrace compound_list Rbrace
;
do_group : Do compound_list Done /* Apply rule 6 */
;
simple_command : cmd_prefix cmd_word cmd_suffix
| cmd_prefix cmd_word
| cmd_prefix
| cmd_name cmd_suffix
| cmd_name
;
cmd_name : WORD /* Apply rule 7a */
;
cmd_word : WORD /* Apply rule 7b */
;
cmd_prefix : io_redirect
| cmd_prefix io_redirect
| ASSIGNMENT_WORD
| cmd_prefix ASSIGNMENT_WORD
;
cmd_suffix : io_redirect
| cmd_suffix io_redirect
| WORD
| cmd_suffix WORD
;
redirect_list : io_redirect
| redirect_list io_redirect
;
io_redirect : io_file
| IO_NUMBER io_file
| io_here
| IO_NUMBER io_here
;
io_file : '<' filename
| LESSAND filename
| '>' filename
| GREATAND filename
| DGREAT filename
| LESSGREAT filename
| CLOBBER filename
;
filename : WORD /* Apply rule 2 */
;
io_here : DLESS here_end
| DLESSDASH here_end
;
here_end : WORD /* Apply rule 3 */
;

newline_list : NEWLINE
| newline_list NEWLINE
;
linebreak : newline_list
| /* empty */
;
separator_op : '&'
| ';'
;
separator : separator_op linebreak
| newline_list
;
sequential_sep : ';' linebreak
| newline_list
;

%%

static int token[] = {
NEWLINE,
#if 0
If,
WORD,
NEWLINE,
Then,
WORD,
NEWLINE,
Fi,
NEWLINE,
If,
WORD,
NEWLINE,
Then,
WORD,
NEWLINE,
Fi,
#endif

0 /* Yacc's logical end-of-input. */
};

int yylex (void) {
static int i = 0;
return token[i++];
}

int yyerror (char *s)
{
fprintf (stderr, "%s\n", s);
return 0;
}

int main (void)
{
yyparse();
return EXIT_SUCCESS;
}

/* End of shell.y */

bsh

unread,

May 30, 2012, 9:15:17 PM5/30/12

to

Jens Schweikhardt <use...@schweikhardt.net> wrote:
> ...

> How would the yacc grammar have to be modified or extended

> in order to accept a complete script file? ...
> ...

Congratulations to all for banging your way through a yacc(1)
script; however, you should be aware that I deem that the
referenced BNF specification is fundamentally broken in more
ways than just that.

I go into the matter in excruciating detail in the thread:

"Announcement: bashcritic", 2007-10-24
http://groups.google.com/group/comp.unix.shell/msg/4e4cf34fc5c9e99e

"... But there is a problem. According this grammar that you
cite as the de facto justification for the variant bourne function
syntax, the token stream 'foo' '(' ')' 'cmds' has no production!
It is broken grammar."

As explained in my OP, parsing of shell grammar is problematic
insofar as the _multi-pass_, recursive-descent nature of the
original ad-hoc grammar is difficult to munge into an equivalent
LR grammar. Also indicated is the fact that the parse tree is
produced by inspection only. Anyone care to confirm?

If you want pointers to vetted shell parsers, here are a few
leads:

"jbash.java": parser for bash scripts: implements CC
http://code.google.com/p/jbash/

"Cocktail.tcl": GUI shell script parser
ftp://www.cocolab.com/products/cocktail/demo/{ct-linux.tgz,ct-win32.exe}
http://www.cocolab.com/en/parsers.html -- parser for Unix Shell
"SHELL.EXE": builds a syntax tree which can be visualized graphically
http://www.cocolab.com/parse_shell.html
ftp://www.cocolab.com/products/cocktail/demo/shell-win32.zip (Not
Currently Accessible)

Incidentally, after about a decade of starting and stopping,
trying and failing, deciding and then changing my mind on
the nature and feasibility of a native parser written in shell,
I have "cracked the code" (so to speak) having made the
recent discovery of an obscure parser algorithm which is
particularly suitable.

To see how many (sophisticated!) parser-parsers are written
in javascript (javascript!), it's about time.

=Brian

Kaz Kylheku

unread,

Jun 1, 2012, 4:23:32 AM6/1/12

to

On 2012-05-30, Jens Schweikhardt <use...@schweikhardt.net> wrote:

> Kaz Kylheku <k...@kylheku.com> wrote

> This parses command sequences well and keeps the s/r conflicts at 5.
> Now what appears to remain are extra lone newlines and semicolons, e.g.
> a script with just
>
> ;
>
> (semi or semi newline or newline) is still a syntax error.

The cases with semi are in fact syntax errors. There is no such thing as an empty
statement terminated by a semicolon in the shell language.

The null command is spelled : (colon):

# bad
if condition ; then
;
fi

# good
if condition ; then
:
fi

As for handling an empty script, here is a stab in the dark:

commands : compound_list | newline_list ;

> ------- shell.y: ------

I'm afraid that this shell grammar is a bit of a committee-designed screwup, at
least in one particular regard.

Every single one of the five shift-reduce conflicts revolves around a silly
NEWLINE token, in spite of the numerous convoluted rules surrounding newlines
and separators.

Scary to think that essentially same people who committed this hack job
standardized yacc itself, LOL.

(If you can't handle freaking *whitespace* in a grammar without five
shift-reduce conflicts, it's time to retire and give the job to an undergrad
co-op who just took a course in the stuff.)

In fact the situation is quite simple. Commands are separated by a mixture
of one or more separators and newlines, which contains at most one separator.

This obviously constitutes a regular language, which could be reduced in a
regex-driven scanner instead of the parser, thereby relieving the parser of
those conflicted convolutions.

The regular language is stupider in that it doesn't handle recursion, but it
has the virtue of more lookahead than the one-symbol of the LALR(1) generated
parser. So a scanner can eat a whole newline sequence as one token, without
running into silly conflicts about what to do with the next character, and in
parallel with that it can match another pattern which expresses the "at most
one separator" idea.