
Introduction to Parser Combinators


M Joshua Ryan

Apr 22, 2019, 9:44:35 PM
It has been alleged that I keep rambling on without actually explaining
what I'm talking about in these threads. And I balked at the suggestion
when it happened partly due to how keenly it cut to the truth. So herein
I will attempt to 'splain.

A parser combinator is, as the name suggests,

1. a parser. It can be called like a function to examine input and
return a result to indicate success or failure.
2. a combinator. It is created by calling a constructor and passing
in one or more /child/ parsers whose behavior the combinator
will *combine* in some useful manner.

The classic parser combinators correspond closely to the meta
characters in regular expressions and EBNF, but here we use short words
instead of symbols.

Regex or EBNF        parser combinator
x*                   many
x+                   some or many1
x? or [ x ]          maybe
x | y                alt or plus
x y                  seq or then

So ignoring for the moment how to create a *leaf* parser which is not a
combinator and doesn't require existing parsers to be able to create it,
if we *have* a parser p for something like a string or a single character,

parser p = ...;

we can create a parser to match many occurrences by calling the many()
combinator.

parser ps = many( p );

Or we could create a parser which matches 0 or 1 occurrences by calling
the maybe() combinator.

parser pq = maybe( p );

If we have two parsers, then we can use one of the binary combinators
like alt or seq. So, if we have a parser to match a comma we could
construct a parser which matches a comma separated list of p's (whatever
they might be).

parser comma = ...;
parser pcs = seq( p, many( seq( comma, p ) ) );

And that's really the pudding right there. Parser Combinators (in
implementations where they are performant) allow you to build complex
parsers with function calls so the expression itself reads like
the EBNF grammar it corresponds to.

For example, the C grammar circa 1975 has this production for the
'statement' nonterminal:

statement:
expression ;
{ statement-list }
if ( expression ) statement
if ( expression ) statement else statement
while ( expression ) statement
for ( expression(opt) ; expression(opt) ; expression(opt) ) statement
switch ( expression ) statement
case constant-expression : statement
default : statement
break ;
continue ;
return ;
return ( expression ) ;
goto expression ;
identifier : statement
;

With parser combinators, the code to implement this looks like the
following. A little more /punctuation/, but the correspondence is
clear. As long as the functions do what they claim (in reasonable
time and space), the parser is *obviously* correct.

parser statement = forward();
*statement =
    *PLUS(
        seq( expression, semi_ ),
        SEQ( lbrace_, statement_list, rbrace_ ),
        SEQ( k_if_, lparen_, expression, rparen_, statement ),
        SEQ( k_if_, lparen_, expression, rparen_, statement, k_else_, statement ),
        SEQ( k_do_, statement, k_while_, lparen_, expression, rparen_, semi_ ),
        SEQ( k_while_, lparen_, expression, rparen_, statement ),
        SEQ( k_for_, lparen_,
             maybe( expression ), semi_, maybe( expression ), semi_, maybe( expression ),
             rparen_, statement ),
        SEQ( k_switch_, lparen_, expression, rparen_, statement ),
        SEQ( k_case_, constant_expression, colon_, statement ),
        SEQ( k_default_, colon_, statement ),
        seq( k_break_, semi_ ),
        seq( k_continue_, semi_ ),
        seq( k_return_, semi_ ),
        SEQ( k_return_, lparen_, expression, rparen_, semi_ ),
        SEQ( k_goto_, expression, semi_ ),
        SEQ( identifier, colon_, statement ),
        semi_
    );

Other virtues should also be fairly obvious, I feel. It is compact
without being overly terse and it is very expressive. But the part
about being *obviously* correct is where the power lies.


This concludes the basic introduction to Parser Combinators.

Summary: They are parsers, functions that can consume input and
perform pattern matching and return success or failure. They are
combinators, functions that receive other functions as arguments
and produce functions as results.


Any questions?

Thiago Adams

Apr 23, 2019, 8:39:02 AM
- How do you insert actions? In other words, how do you add
the code that must be executed when you parse something?


- I believe you think that the main advantage is being able
to use a descriptive language to implement the parser.
How does this compare with other alternatives that also have
descriptive ways to create parsers, like parser generators?

- How many tokens of lookahead do you need/use?
Can we say that you have something like an LL(1) parser?

- Is there any impact on performance compared with what
you would have if doing it by hand?

- If you like this style of embedding the parsing description directly
in code, have you considered using C++? The difference is that C++ can
use operator overloading, and it can practically create a new language
inside itself.
(https://en.wikipedia.org/wiki/Spirit_Parser_Framework)

Some people create regexes at compile time in C++.
(https://youtu.be/QM3W36COnE4)

The alternative in C could be a generator for regexes. This is also
the alternative in C for any usage of C++ constexpr.


My comments:

I believe this style can be more useful when the parser action is
pre-defined. For instance, to create a regex: the action a regex
performs is always the same, namely to build a parser for searching.

This style can also be applied not only to parsers but to building
search algorithms.

C# has something called LINQ
https://docs.microsoft.com/pt-br/dotnet/csharp/programming-guide/concepts/linq/introduction-to-linq-queries

LINQ allows the user to create a query that basically accumulates
algorithms:

var numQuery =
    from num in numbers
    where (num % 2) == 0
    select num;

I think the technique you are using could also be applied to create
these combinations of algorithms in C. Just like regex, the output, or
the action performed, is pre-determined. LINQ takes a sequence as input
and produces a sequence as output.


I have done something similar, but in C++.

Pattern matching
http://thradams.com/pattern.htm

Then I can search and replace numbers like this:

std::string s = "a 123 b 45.67 c";
using namespace PatternMatching;

find_replace(s,
    Range('0', '9') * OneOrMore & ('.' & Range('0', '9') * OneOrMore) * Optional,
    std::string("NUMBER"));

The interesting line is:
Range('0', '9') * OneOrMore & ('.' & Range('0', '9') * OneOrMore) * Optional

that builds the parser in a descriptive way.

Container Queries
http://thradams.com/queries.htm


I have never used any of these techniques in production code because
I think they are easy to do but too "dynamic", using more memory and
not giving the best performance.

For the same reason, I use C++ std::regex only for tools or "toys"
where performance is not important.


(This is an entirely new subject.)
My last comment is that combining algorithm 1 + algorithm 2, etc., is
sometimes possible and gives the correct result. However, keeping all
the steps in mind together, we can write a new algorithm that is better
than the mechanical combination of each algorithm separately.

For instance, you can merge and join strings with one algorithm that
allocates more memory when necessary. But at compile time, if you can
see which strings you are adding in sequence (having a view of the
entire algorithm, not only its parts), then you can reserve memory in
advance and make it faster.

The same for math computation evaluating expressions.

I have never implemented an SQL database, but I believe they have the
same problem to solve. Given a descriptive query, they need to build an
algorithm that does the search. Do they just stack one algorithm on top
of another, or do they need to join the algorithms in a clever way?





fir

Apr 23, 2019, 9:23:29 AM
On Tuesday, April 23, 2019 at 3:44:35 AM UTC+2, luser droog wrote:
> It has been alleged that I keep rambling on without actually explaining
> what I'm talking about in these threads. And I balked at the suggestion

Good attempt, as some topics just need an introduction; without one
they are practically useless (since nearly no one can follow).

However, one problem is that such an introduction should be good:
preferably interesting, complete, and also not too long. Not everybody
has the talent for writing that kind of introduction, but they are
often needed.

So this is a kind of practical problem.


luser droog

Apr 23, 2019, 1:50:55 PM
On Tuesday, April 23, 2019 at 7:39:02 AM UTC-5, Thiago Adams wrote:
> On Monday, April 22, 2019 at 10:44:35 PM UTC-3, luser droog wrote:
> > It has been alleged that I keep rambling on without actually explaining
> > what I'm talking about in these threads. And I balked at the suggestion
> > when it happened partly due to how keenly it cut to the truth. So herein
> > I will attempt to 'splain.
> >
<snip 'splaining>
> > This concludes the basic introduction to Parser Combinators.
> >
> > Summary: They are parsers, functions that can consume input and
> > perform pattern matching and return success or failure. They are
> > combinators, functions that receive other functions as arguments
> > and produce functions as results.
> >
> >
> > Any questions?
>
> - How do you insert actions? In other words, how do you add
> the code that must be executed when you parse something?

Good question. It was dealing with this very issue that led
to my first post in this series, "Propagating results from the
callbacks". The simple answer is "callbacks". You add another
function pointer in the parser structure and call it from the
handler function.

But, the solution I came to prefer once I began to understand it
is by using 'bind()', the Monadic sequencing combinator. Bind has
extra powers that become useful when the parsers are enhanced
with a few extra features I haven't mentioned yet.

An important aspect I ought to have explained earlier is how
the Parser itself works. The parser accepts input and produces
output. All I said earlier was that this output indicates
success or failure, somehow. My earliest version defined the
parser signature like this:

int parse( char *input, parser p );

So, the input can be just a pointer into the string (initially
the start of the string, but the parser can advance the pointer
when matching sequences). The output here was the length that
was matched. This appeared to me at the time to be the simplest
way to get a minimally-useful behavior. But this arrangement
is not flexible enough. Particularly if the parser graph has
alternates, there may be more than one possible way to parse
the input. So we need a vector or tuple or list of results.
In the PC literature, this idea was heralded by a paper entitled,
"Replacing failure with a list of successes".

But when I rewrote everything to return lists of matched lengths,
it still wasn't flexible enough. The handler for the callbacks had
lots of extra work to do to extract strings for the callbacks to use.
So, the next refinement that I learned and applied was having the
parser return the matched input chars themselves.

With this arrangement, the parser signature is like this:

result_list parse( parser p, char *input );

The result_list can be NULL which indicates failure, or it can be
a pointer to the start of a linked list of result records.

... probably skipping a few steps here ... (questions welcome)

The magical answer is to use 'bind()' to attach callbacks and
also for connecting parsers together.

Bind attaches your callback to the parser but also handles
the possible multiple results from the parser. The callback
is called by doing a map over the list of results. If there
are 3 results, the callback gets called 3 times, getting one
result at a time. If there are zero results, the map doesn't
call the callback at all.

In my current code, the callback has to have the form of an
"operator function" as I define them:

typedef object fOperator( void *, object );

The 'void *' supplies any context the function needs (it was
bundled up with the function in an 'oper' structure). The 'object'
argument is a (single) result from the (child) parser, and the function
must return an 'object' which will be (one of the) return value(s)
from the (compound or composed) parser which 'bind' returns.

For a very short example of a callback, this is a function from
my regular expression parser/constructor.

static parser on_chr( void *v, object o ){ return lit( o ); }

This is the callback or handler for regular characters found in
the regular expression. For each one, we pass the char (the 'object
o') to the 'lit()' constructor which creates a parser to match
a single character.

The callback gets installed or registered by calling 'bind', or
a simpler front-end to it I call 'using'.

parser using( parser p, fOperator *f );

In regex builder, the call looks like this:

parser chr_ = using( noneof( "*+?.|()" ), on_chr );


> - I believe you think that the main advantage is being able
> to use a descriptive language to implement the parser.
> How does this compare with other alternatives that also have
> descriptive ways to create parsers, like parser generators?

That would probably be a more productive way to go about making
a product quickly. But that's not my primary goal. I /am/ working
towards a result, but I'm also attempting to reinvent the world
and learn some fun stuff in the process.

But the downside of external tools is learning the tool itself,
and depending upon it. Having everything /in/ the language reduces
external dependencies (and the potential to become enslaved by
their support schemes).

> - How many tokens of lookahead do you need/use?
> Can we say that you have something like an LL(1) parser?

Hmm. I haven't researched the definition of LL(1) recently but
with the input coming from a lazy list, parsers could easily
look ahead as far as they wish. It would affect the behavior
of other parsers in the graph. Again, without having checked my
definitions, I think the behavior I have implemented is more
like LL(0). Parsers grab at most one item off of the input list
and then do their work with it.


> - Is there any impact on performance compared with what
> you would have if doing it by hand?

Yes, most certainly. My parsers are examining every possible
path through the graph. I believe it's somewhat like a large
Prolog program with no *cuts*. Everything backtracks all over
its own toes to deliver every result at every stage.

Handwritten code would likely start out several orders of magnitude
more efficient, even with no special optimizations by the programmer.

But I think the code wouldn't look as nice. :)

[snip C++ comments]

C++ frightens me. I got burned once by STL code that wasn't
working and we couldn't figure it out.

>
> My comments:
>
> I believe this style can be more useful when the parser action is
> pre-defined. For instance, to create a regex: the action a regex
> performs is always the same, namely to build a parser for searching.
>
> This style can also be applied not only to parsers but to building
> search algorithms.

Yes, indeed. Erik Meijer was one of the authors of the Hutton/Meijer
paper that was my latest, best guide to parser combinators. He was
involved in designing and building LINQ.

It's Monads.
Sadly yes, that is my experience. But there's got to be a way to speed
them up and cut the fat.

jacobnavia

Apr 23, 2019, 3:06:32 PM
Thanks for this very interesting post

jacob

Thiago Adams

Apr 23, 2019, 4:21:45 PM
On Tuesday, April 23, 2019 at 2:50:55 PM UTC-3, luser droog wrote:
...
> > - Is there any impact on performance compared with what
> > you would have if doing it by hand?
>
> Yes, most certainly. My parsers are examining every possible
> path through the graph. I believe it's somewhat like a large
> Prolog program with no *cuts*. Everything backtracks all over
> its own toes to deliver every result at every stage.
>
> Handwritten code would likely begin several orders of magnitude
> more efficient even with no special optimizations by the programmer.

One alternative approach is to compile the parser at runtime. By
compiling I mean generating state tables (not generating code).
For instance, let's say I want to create a regex parser.

struct Regex regex = REGEX_INIT;
Regex_Load(&regex, "[abc]");
//matches a string that contains an a, a b, or a c

The parser description can be a string, "[abc]". Then this string is
parsed, and the state tables are computed and generated in memory.

After this first step, which can take a lot of time, the parser can
be used, and it will run fast.

The difference between this approach and a parser generator is that
the parser generator runs before compilation, and its tables can be
code or static data.

The approach you took (I don't know the details), I imagine, doesn't
need compilation, but then the runtime is not optimized.

Something intermediary would be to accept a language description in
string form and then convert it into your data structures.

One example where you don't need to specify actions would be
scanner/tokenizer generators.

Imagine something like:

struct Scanner scanner = SCANNER_INIT;
Scanner_Add(&scanner, 1 /*token no*/, "if");
Scanner_Add(&scanner, 2, "[_A-Za-z]+");

int token = Scanner_Match(&scanner);

The task here is to recognize these tokens and return the number
associated with the rule, so you can create a scanner by describing
the rules using regexes.




