Best version yet of my parser combinators

80 views
Skip to first unread message

luser droog

unread,
Jun 11, 2022, 12:53:37 AMJun 11
to
I've just debugged a few simple regexes through my regex->parser parser
and as usual every 6 months or so, it's time to show off.

The latest version is greatly simplified compared to the previous. I've squeezed
it all down to 2 "modules" where the headers have to be included in a
particular order ... but I've enforced that with header guards ... but I'm
using "McIllroy convention" for the guards so they go on the outside.

I'll just show the testing file and the output. That ought to give a sense
of what it's trying to do. For more, see
https://github.com/luser-dr00g/pcomb
or
https://github.com/luser-dr00g/pcomb/archive/1c6d8a346483f522c8fea8da2ee26314bad5bbf4.zip
for this exact contemporaneous version that works /on my machine/.

$ make count
wc pc11*[ch]
359 1567 9185 pc11object.c
92 459 3106 pc11object.h
398 1531 9652 pc11parser.c
54 186 1237 pc11parser.h
70 278 1860 pc11test.c
973 4021 25040 total
$ cat pc11test.c
#include "pc11parser.h"
static int test_basics();
static int test_parsers();
static int test_regex();

int main( void ){
return test_basics()
|| test_parsers()
|| test_regex();
}

static int
test_basics(){
puts( "test_basics" );
list ch = chars_from_str( "abcdef" );
print( ch ), puts("");
print_list( ch ), puts("");
drop( 6, ch );
print( ch ), puts("");
print_list( ch ), puts("");
drop( 7, ch );
print( ch ), puts("");
print_list( ch ), puts("");
puts("");
return 0;
}

static int
test_parsers(){
puts( "test_parsers" );
list ch = chars_from_str( "a b c d 1 2 3 4" );
parser p = result_is( Int(42) );
print_list( parse( p, ch ) ), puts("");
parser q = fails();
print_list( parse( q, ch ) ), puts("");
parser r = item();
print_list( parse( r, ch ) ), puts("");
parser s = either( alpha(), item() );
print_list( parse( s, ch ) ), puts("");
parser t = literal( Int('a') );
print_list( parse( t, ch ) ), puts("");
puts("");
return 0;
}

static int
test_regex(){
puts( "test_regex" );
parser a = regex( "." );
print_list( a ), puts("");
print_list( parse( a, chars_from_str( "a" ) ) ), puts("");
print_list( parse( a, chars_from_str( "." ) ) ), puts("");
print_list( parse( a, chars_from_str( "\\." ) ) ), puts("");
puts("");

parser b = regex( "\\." );
print_list( b ), puts("");
print_list( parse( b, chars_from_str( "a" ) ) ), puts("");
print_list( parse( b, chars_from_str( "." ) ) ), puts("");
print_list( parse( b, chars_from_str( "\\." ) ) ), puts("");
puts("");

parser c = regex( "\\\\." );
print_list( c ), puts("");
print_list( parse( c, chars_from_str( "a" ) ) ), puts("");
print_list( parse( c, chars_from_str( "." ) ) ), puts("");
print_list( parse( c, chars_from_str( "\\." ) ) ), puts("");
puts("");
return 0;
}
$ make
cc -std=c99 -g -Wall -Wpedantic -Wextra -Wno-unused-function -Wno-unused-parameter -Wno-switch -Wno-return-type -Wunused-variable -c -o pc11object.o pc11object.c
cc -std=c99 -g -Wall -Wpedantic -Wextra -Wno-unused-function -Wno-unused-parameter -Wno-switch -Wno-return-type -Wunused-variable -c -o pc11parser.o pc11parser.c
cc -std=c99 -g -Wall -Wpedantic -Wextra -Wno-unused-function -Wno-unused-parameter -Wno-switch -Wno-return-type -Wunused-variable -c -o pc11test.o pc11test.c
cc -std=c99 -g -Wall -Wpedantic -Wextra -Wno-unused-function -Wno-unused-parameter -Wno-switch -Wno-return-type -Wunused-variable -o pc11 pc11object.o pc11parser.o pc11test.o
./pc11
test_basics
...(force_chars_from_string)
...(force_chars_from_string)
('a' .('b' .('c' .('d' .('e' .('f' ....(force_chars_from_string) ))))))
('a' 'b' 'c' 'd' 'e' 'f' ...(force_chars_from_string) )
('a' .('b' .('c' .('d' .('e' .('f' .(EOF .() )))))))
('a' 'b' 'c' 'd' 'e' 'f' EOF )

test_parsers
(OK ((VALUE '*' ) ) ...(force_chars_from_string) )
(FAIL () ...(force_chars_from_string) )
(OK 'a' ...(force_chars_from_string) )
(OK 'a' ...(force_chars_from_string) )
(OK 'a' ...(force_chars_from_string) )

test_regex
Parser(parse_satisfy)
(OK 'a' ...(force_chars_from_string) )
(OK '.' ...(force_chars_from_string) )
(OK '\' ...(force_chars_from_string) )

Parser(parse_satisfy)
(FAIL ("predicate not satisfied" Oper(is_literal) ) 'a' ...(force_chars_from_string) )
(OK '.' ...(force_chars_from_string) )
(FAIL ("predicate not satisfied" Oper(is_literal) ) '\' ...(force_chars_from_string) )

Parser(parse_sequence)
(FAIL ("predicate not satisfied" Oper(is_literal) ) 'a' ...(force_chars_from_string) )
(FAIL ("predicate not satisfied" Oper(is_literal) ) '.' ...(force_chars_from_string) )
(OK ('\' '.' ) ...(force_chars_from_string) )

luser droog

unread,
Jun 11, 2022, 10:53:12 AMJun 11
to
On Friday, June 10, 2022 at 11:53:37 PM UTC-5, luser droog wrote:
> I've just debugged a few simple regexes through my regex->parser parser
> and as usual every 6 months or so, it's time to show off.
>
> The latest version is greatly simplified compared to the previous. I've squeezed
> it all down to 2 "modules" where the headers have to be included in a
> particular order ... but I've enforced that with header guards ... but I'm
> using "McIllroy convention" for the guards so they go on the outside.
>
> I'll just show the testing file and the output. That ought to give a sense
> of what it's trying to do. For more, see
> https://github.com/luser-dr00g/pcomb
> or
> https://github.com/luser-dr00g/pcomb/archive/1c6d8a346483f522c8fea8da2ee26314bad5bbf4.zip
> for this exact contemporaneous version that works /on my machine/.
>

Hmm. This is all fine and dandy so far as it goes, but I think there's a lot from
Recursive Programming Techniques by Burge that I have failed to apply.
In a recent thread I was complaining about the way the standard documents
the C language grammar because all the left recursion is too hard to handle
with naive recursive parsing techniques. So ... the obvious solution sitting
inside of that very fact is ... to disabuse myself of some of that naiveté.

IIRC Burge does the combinators differently, so that it builds an internal
representation of the grammar and then traverses that representation
with some functions. That makes a natural interface for swapping out a
different algorithm or applying optimizations or simplifications beforehand.
But, ... too much toofast ... slow brain. Think: what to do. Another edition of
Thunderdome.

fir

unread,
Jun 12, 2022, 9:28:58 AMJun 12
to
you may go may way - split on lines' atomize ich line on atoms and then plant and grow a forest of ifs

it worx and grow naturarelly...the worst part as to me was the expression microcompiler as you need some static ifs dont sufice

i done it somewhat but im not happy how i did it..right now for a month im resting and not coding but maybe will back to it after summer

luser droog

unread,
Jun 12, 2022, 10:41:54 PMJun 12
to
That does seem like a useful shortcut. But it doesn't do what I'm going for
which is parsing the for-real languages (excluding C++, of course).
But this may not be the final version, assuming I can squeeze some gold
out of Burge and maybe by re-reading Hutton's "Higher Order Functions
for Parsing" a few more times.

There's a lot of little extra features that I'm waiting to pursue until I'm
more certain that the foundation is solid. It's not enough for it to (appear to)
work, but it should "make sense" too. As the adage goes, the two hard
tasks in programming are: cache invalidation and naming things.
Since getting the right names is so hard, I've chosen in this code to re-use
the same name in as many places as possible for the same kind of thing
and let the name spaces take care the details. So I have

typedef union object * object;

and I will defend this choice with both pen and sword.

"But it makes it too hard to talk precisely about the thing you're talking about!"
Baloney. Phony baloney. Pony crony phony baloney. There's "the union" or the
"union object" and there's "the object" which a pointer to the union.
I do not accept the arguments I've heard so far that say you should
use different names here. We have all the words already to describe
whatevz.

"But it breaks the abstraction." Yep. Oops. I broke it. Shucks. Did it break
anything important when that happened? The abstraction is a semipermeable
membrane. Stuff leaks. This is the avant garde. Some eggs iz gonna crack.
Wherever practicable, the fiddly stuff is mediated by functions. But inside
the functions, sometimes they gotta just do their business and get it done.

I think I forgot what the question was.

luser droog

unread,
Jun 16, 2022, 9:32:28 AMJun 16
to
Got a simple compiler for EBNF notation to (appear to) work. You feed it a
string containing the productions, a list of any supplemental parsers that
might be easier to build with the combinators instead of EBNF, and a list
of handlers or callbacks which process the results of various sub-parsers.

The ebnf function returns a list of (name parser) pairs including all the
parsers specified by the ebnf string and the supplemental ones. Then you
can use assoc() or assoc_symbol to grab the start parser from the list and
bobsyerunkle. This turned out a little more complicated to do in C than PostScript
because of the gyrations needed to deal with all the potential mutual recursion
among the productions. So, rather than do a dependency analysis or some kind
of topographical sort, I take the list of symbols to-be-defined -- the left hand sides --
and create a "forward declaration" for each one, just an empty parser structure.
These forwarding nodes can then be composed together to build the right
hand side expressions and then the resulting parser for the right hand side
is used to fill in the contents of the forwarding node.

The extra complication in C was that I had to cook up an internal
representation for the expressions. In PostScript this wasn't necessary because
you can dynamically construct a new procedure in PostScript and just pass
around the unevaluated code. To do the same in C meant I needed a simple
internal data language where an expression can be

->compiles to->
a parser -> the same parser
or a symbol -> assoc( symbol, forwards )
or a list
(SEQ expr... ) -> collapse( then, map(compile, expr...) )
(ANY expr...) -> collapse( either, map( compile, expr... )
(MANY expr) -> many( compile( expr ) )
(MAYBE expr) -> maybe( compile( expr ) )
(EPSILON) -> succeeds()
(expr...) -> map( compile, expr... ) //this one shouldn't be necessary but it's a safety case

So, when it's all polished up, you should be able to instantiate a
syntax directed compiler by writing a handful of handler functions,
a single EBNF grammar and making a single call to

list ebnf( char *productions, list supplements, list handlers );

and a call to assoc() or assoc_symbol() to grab the start parser from the list.
A one-liner compiler-compiler (assuming some generosity in the definitions).

One unresolved difficulty I've discovered is that I seem to have a few
extra "implicit types" that I'm having to keep track of without any help
from the type system or my typedefs. Parser, operators, and suspension
functions all receive 2 arguments: an environment and an input. But in many
places, I felt I could get away with not creating a real environment structure
-- which is an association list. For predicate operators and some other uses
of operators, I just use the environment as POD (plain old data, or position
oriented data) and grab the pieces with first() and rest() (ie. (car) and (cdr)).
It seems like I can get away with this, because there are only a few
circumstances where there actually needs to be propagation of environment
data. The two places are the into(parser p, symbol id, parser q) parser which
sequences p and q but makes the result from p available in q -- so into()
has to generate an environment and append it to q's save environment,
and bind( parser p, operator op ) which calls the handler op() upon a successful
result from p -- so it has to append a new environment to the saved context in op.
I'm assuming that the into's q will always be a bind, and therefore only these
2 functions actually need to propagate environments. But that might limit
the usability.... Lots to think through.

luser droog

unread,
Jun 16, 2022, 9:54:44 AMJun 16
to
I forgot to hit 'paste'. Here's the code I tested and its output.


static object
stringify( object env, object it ){
return to_string( it );
}

static int
test_ebnf(){
puts( __func__ );
Symbol(postal_address);
Symbol(name_part);
Symbol(street_address);
Symbol(street_name);
Symbol(zip_part);

list parsers = ebnf(
"postal_address = name_part street_address zip_part ;\n"
"name_part = personal_part SP last_name SP opt_suffix_part EOL\n"
" | personal_part SP name_part ;\n"
"personal_part = initial '.' | first_name ;\n"
"street_address = house_num SP street_name opt_apt_num EOL ;\n"
"zip_part = town_name ',' SP state_code SP zip_code EOL ;\n"
"opt_suffix_part = 'Sr.' | 'Jr.' | roman_numeral | ;\n"
"opt_apt_num = [ apt_num ] ;\n"
"apt_num = NUMBER ;\n"
"town_name = NAME ;\n"
"state_code = UPPER UPPER ;\n"
"zip_code = DIGIT DIGIT DIGIT DIGIT DIGIT ;\n"
"initial = 'Mrs' | 'Mr' | 'Ms' | 'M' ;\n"
"roman_numeral = 'I' [ 'V' | 'X' ] { 'I' } ;\n"
"first_name = NAME ;\n"
"last_name = NAME ;\n"
"house_num = NUMBER ;\n"
"street_name = NAME ;\n",
env( NIL_, 6,
Symbol(EOL), chr('\n'),
Symbol(DIGIT), digit(),
Symbol(UPPER), upper(),
Symbol(NUMBER), some( digit() ),
Symbol(NAME), some( alpha() ),
Symbol(SP), many( anyof( " \t\n" ) )
),
env( NIL_, 2,
Symbol(name_part), Operator( NIL_, stringify ),
Symbol(street_name), Operator( NIL_, stringify ) )
);
//print_list( parsers ), puts("\n"); //output gets really long when showing the innards

parser start = first( assoc_symbol( postal_address, parsers ) );
print_list( start ), puts("\n");

print_list( parse( start,
chars_from_str( "Mr. luser droog I\n2357 Streetname\nAnytown, ST 00700\n" ) ) ),
puts("");

return 0;
}



test_ebnf
Parser(sequence, ((SEQUENCE_P .Parser(bind) ).((SEQUENCE_Q .Parser(sequence, ((SEQUENCE_P .Parser(sequence) ).((SEQUENCE_Q .Parser(sequence) ).((SEQUENCE_OP .Oper(concat, () ) ).() )))) ).((SEQUENCE_OP .Oper(concat, () ) ).() ))))

(OK ("Mr. luser droog I
" '2' '3' '5' '7' ' ' "Streetname" '
' 'A' 'n' 'y' 't' 'o' 'w' 'n' ',' ' ' 'S' 'T' ' ' '0' '0' '7' '0' '0' '
' ) ...(force_chars_from_string) )

luser droog

unread,
Jun 16, 2022, 11:09:32 AMJun 16
to
On Thursday, June 16, 2022 at 8:32:28 AM UTC-5, luser droog wrote:

> So, when it's all polished up, you should be able to instantiate a
> syntax directed compiler by writing a handful of handler functions,
> a single EBNF grammar and making a single call to
>
> list ebnf( char *productions, list supplements, list handlers );
>
> and a call to assoc() or assoc_symbol() to grab the start parser from the list.
> A one-liner compiler-compiler (assuming some generosity in the definitions).

Sigh, I suppose it's really a compiler-interpreter. There's no present ability to
produce a stand-alone compiler from the EBNF definition. It would need to
decompile its own C functions to do that.

luser droog

unread,
Jun 22, 2022, 11:11:19 AMJun 22
to

Ben Bacarisse

unread,
Jun 22, 2022, 11:42:06 AMJun 22
to
I hope you get some helpful reviews.

The idea is interesting (I love combinators) but I am still of the view
that's not really the sort of thing that fits into using C.

--
Ben.

luser droog

unread,
Jun 22, 2022, 1:48:53 PMJun 22
to
On Wednesday, June 22, 2022 at 10:42:06 AM UTC-5, Ben Bacarisse wrote:
> luser droog <luser...@gmail.com> writes:
>
> > On Thursday, June 16, 2022 at 10:09:32 AM UTC-5, luser droog wrote:
> >> On Thursday, June 16, 2022 at 8:32:28 AM UTC-5, luser droog wrote:
> >>
> >> > So, when it's all polished up, you should be able to instantiate a
> >> > syntax directed compiler by writing a handful of handler functions,
> >> > a single EBNF grammar and making a single call to
> >> >
> >> > list ebnf( char *productions, list supplements, list handlers );
> >> >
> >> > and a call to assoc() or assoc_symbol() to grab the start parser from the list.
> >> > A one-liner compiler-compiler (assuming some generosity in the definitions).
> >> Sigh, I suppose it's really a compiler-interpreter. There's no present ability to
> >> produce a stand-alone compiler from the EBNF definition. It would need to
> >> decompile its own C functions to do that.
> >
> > Cleaned up some. Posted to:
> >
> > https://codereview.stackexchange.com/questions/277525/5912/parser-combinators-in-c-redux
> I hope you get some helpful reviews.

Thanks. I may have made the task more difficult than necessary for potential
reviewers by supplying no comments in the .c files. But I felt pretty confident
about the naming choices throughout, so I hope I can get away with asking for
comments while offering no comments.

> The idea is interesting (I love combinators) but I am still of the view
> that's not really the sort of thing that fits into using C.
>

Yes, there's definitely an impedance mismatch. But that ought to be a
surmountable challenge with appropriate programming skill, right?
I suppose time will tell if it actually turns out using for building a
larger system.

The mismatch does keep popping up in odd places. I commented upthread
"when it's all polished up ....". One of the places where the lack of polish
shows is in the test_ebnf() function I showed. Wrapped around the call
to assoc_symbol() is a call to first() because for some reason I haven't
yet discovered, that entry in the association list is ending up with an extra
list wrapper around the right-hand-side, whereas in most places association
lists are a list of just dotted pairs. I still don't know where that extra wrapper
is coming from or how to deal with it more gracefully.

So, to use this code, there's a whole lot of "needing to know what it's doing"
to make it work. And that's not ideal, possibly an irrevocable consequence
of the bad fit.

The positive is it lets me do Lisp-y kind of stuff where ever there's a C99
compiler. And that feels like a win.
Reply all
Reply to author
Forward
0 new messages