The Parser Combinators I described in the "Introduction ..." thread are
not possible to directly implement in C. It requires a sort of support
layer facilitating construction of lists, using functions as values,
and lazy evaluation. The simplest way I could come up with to do this
is with a tag-union data structure that merges all the various types
I needed into a single type.
The system also needs to manage its memory and not wildly leak untold
gigabytes in the calculation of relatively small outputs. And so it
needs a garbage collector. And it needs the capability of lazy
evaluation of function application. For the parser combinators like
'plus' or 'alt' which handles alternates, it really isn't efficient
if every possibility is exhaustively searched when only one successful
path is needed.
So, to start with I define a few things in the object module's public
header file. The goal here is a Lispy/Haskellish loosey-goosy type
schenanigan in which all of these pieces are composable.
typedef union uobject POINTER_TO object;
typedef object list;
typedef object parser;
typedef object oper;
typedef oper predicate;
typedef object boolean;
typedef object fSuspension( void * );
typedef list fParser( void *, list );
typedef object fOperator( void *, object );
typedef boolean fPredicate( void *, object );
typedef object fBinOper( object, object );
and a few things in the object module's private header file.
typedef enum object_tag {
INVALID, INTEGER, LIST, SUSPENSION, PARSER, OPERATOR, SYMBOL, STRING,
} tag;
union uobject { tag t;
struct { tag t; int i; } Int;
struct { tag t; object a, b; } List;
struct { tag t; void *v; fSuspension *f; } Suspension;
struct { tag t; void *v; fParser *f; } Parser;
struct { tag t; void *v; fOperator *f; } Operator;
struct { tag t; int symbol; char *pname; object data; } Symbol;
struct { tag t; char *string; int disposable; } String;
struct { tag t; object next; } Header;
};
object new_( object a );
#define OBJECT(...) new_( (union uobject[]){{ __VA_ARGS__ }} )
So the gist of all this is that an 'object' is a pointer to a
union of structs. An object could be a NULL pointer to represent
"nothing" or false an empty list. Or an object could be a valid
pointer, in which case the 't' member will indicate which type
of struct it is.
There are many ways one could build such a tag-union in C. I like
this way because, even the definition is a little weird looking
the usage in nicer. If you define the object as a struct with
an inner union, the you have the extra '.u.' noise in every
expression which accesses it.
The 'new_' function is the allocation routine used by all the
typed constructor functions. In order to support garbage collection,
all allocated objects will require an associated allocation record.
This allocation record has a pointer which links up all allocations
into a big ol' list for the /sweep/ phase of garbage collection.
And it also has space we can use for a 'mark' for the mark phase.
The 'new_' function allocates two objects, one for the 'header'
or allocation record, and one for the object itself. By making the
header to be the same size as the union lets the compiler take
care of all alignment issues.
static list allocation_list = NULL;
object
new_( object a ){
object p = calloc( 2, sizeof *p );
return p ? p[0] = (union uobject){ .Header = { 0, allocation_list } },
allocation_list = p,
p[1] = *a,
&p[1]
: 0;
}
I know it's bit of a weird style to write this function, but I think
it's pretty. The allocation list is maintained in reverse chronological
order.
We've discussed the typed constructor functions in the thread
"Simulating Monads..." so I'll skip those and tell you about 'valid()'.
'valid()' is a function to replace the simple test for NULL that many
functions would use to determine whether to treat the object as real
data or not. So NULL was my Lisp 'NIL'. But now a valid pointer to an
/empty/ object is also considered a NIL. I needed this behavior to
support lazy evaluation which I'll get into further on. The upshot
is that the 'test for NULL' now needed to be a slightly more complicated
check, and so it became a function.
int
valid( object a ){
switch( a ? a->t : 0 ){
default:
return 0;
case INTEGER:
case LIST:
case SUSPENSION:
case PARSER:
case OPERATOR:
case SYMBOL:
case STRING:
return 1;
}
}
The next interesting part of the object module is the garbage collector.
The external api is two functions:
void add_global_root( object a );
int garbage_collect( object local_roots );
The global roots are of course just a simple list, easily managed
with one function (unless we ever need to delete from the list).
static list global_roots = NULL;
void
add_global_root( object a ){
global_roots = cons( a, global_roots );
}
Then the garbage collector has just a few different things to do:
int
garbage_collect( object local_roots ){
mark_objects( local_roots );
mark_objects( global_roots );
return sweep_objects( &allocation_list );
}
Marking objects involves accessing the [-1] object which is where
the header info is stored. Any objects with internal sub-structure
need to be recursively enumerated. We make sure to set the mark
for the current node /before/ recursing upon the children so that
lists with loops or other recursive data structures won't trip up
the process.
void
mark_this( object a, int m ){
a[-1].Header.t = m;
}
int
mark( object a ){
return a[-1].Header.t;
}
void
mark_objects( list a ){
if( !valid(a) || mark( a ) ) return;
mark_this( a, 1 );
switch( a->t ){
case LIST: mark_objects( a->List.a );
mark_objects( a->List.b ); break;
case PARSER: mark_objects( a->Parser.v ); break;
case OPERATOR: mark_objects( a->Operator.v ); break;
case SYMBOL: mark_objects( a->Symbol.data ); break;
}
}
And finally we need to /sweep/ through the allocation list,
unlinking and freeing anything with the mark unset.
int
sweep_objects( list *po ){
int count = 0;
while( *po )
if( (*po)->t ){
(*po)->t = 0;
po = &(*po)->Header.next;
} else {
object z = *po;
*po = (*po)->Header.next;
if( z[1].t == STRING && z[1].String.disposable )
free( z[1].String.string );
free( z );
++count;
}
return count;
}
And that's the whole garbage collector. Give the gc some roots
and it will free everything not reachable by those roots.
The next basic tool we need is a 'print' function to easily
see what the intermediate results look like and to debug this
whole farce.
With one function we can print a simple representation for
simple objects and a thing known in Lisp as 'dot notation'
but without actually printing dots. What this means is that
simple lists with chains of LIST objects on the .b member
(the cdr) will be printed with lots of parens. Parens around
every cons cell.
void
print( object o ){
if( !o ){ printf( "() " ); return; }
switch( o->t ){
case INTEGER: printf( "%d ", o->Int.i ); break;
case LIST: printf( "(" );
print( o->List.a );
print( o->List.b );
printf( ") " ); break;
case SUSPENSION: printf( "... " ); break;
case PARSER: printf( "Parser " ); break;
case OPERATOR: printf( "Oper " ); break;
case STRING: printf( "\"%s\"", o->String.string ); break;
case SYMBOL: printf( "%s ", o->Symbol.pname ); break;
case INVALID: printf( "_ " ); break;
default: printf( "INVALID " ); break;
}
}
The copious parentheses are usable but distracting and irritating.
But I needed two more functions to run through a list with parens
only on the outside or around anything on the LIST object's .a
member (the car). So, print_list() is the master function to
print any object.
void
print_listn( list a ){
switch( a ? a->t : 0 ){
default: print( a ); return;
case LIST: print_list( x_( a ) ), print_listn( xs_( a ) ); return;
}
}
void
print_list( list a ){
switch( a ? a->t : 0 ){
default: print( a ); return;
case LIST: printf( "(" ), print_list( x_( a ) ), print_listn( xs_( a ) ), printf( ")" ); return;
}
}
And for my own use, I've found it convenient to use this function with a
macro that prints the name of the calling function and the expression.
#define PRINT(__) printf( "%s: %s = ", __func__, #__ ), print_list( __ ), puts("")
Next, the support layer needs some code for traversing lists. But
these lists might contain suspensions which are the handles for lazy
values that must be evaluated to produce a value.
The function 'at_' is used to force evaluation of a suspension.
And it keeps on evaluating the result until it gets something that
is not a suspension.
object
at_( object a ){
return valid( a ) && a->t == SUSPENSION ? at_( a->Suspension.f( a->Suspension.v ) ) : a;
}
Next, the functions 'x_' and 'xs_' are like lisp's car and cdr, or
first() and rest(). If the argument to either of these functions
is a suspension, then it wraps a new suspension around it. This new
suspension will perform the first() or rest() action when it is
forced to do so.
object
px_( void *v ){
list a = v;
*a = *at_( a );
return x_( a );
}
object
x_( list a ){
return valid( a ) ?
a->t == LIST ? a->List.a :
a->t == SUSPENSION ? Suspension( a, px_ ) : NIL_
: NIL_;
}
object
pxs_( void *v ){
list a = v;
*a = *at_( a );
return xs_( a );
}
object
xs_( list a ){
return valid( a ) ?
a->t == LIST ? a->List.b :
a->t == SUSPENSION ? Suspension( a, pxs_ ) : NIL_
: NIL_;
}
The 'p*' functions which implement the suspended actions have calls to
'at_()' to prevent an infinite loop. 'at_()' is not part of the API
however. The two functions provided for client code which can force
evaluation are 'take' and 'drop'.
list
take( int n, list o ){
if( n == 0 ) return NIL_;
*o = *at_( o );
return cons( x_( o ), take( n-1, xs_( o ) ) );
}
list
drop( int n, list o ){
if( n == 0 ) return NIL_;
*o = *at_( o );
return drop( n-1, xs_( o ) );
}
One more functino and the basic support layer is relatively complete.
'chars_from_string()' takes a string argument and produces a lazy
list of chars. The entry point 'chars_from_string' just checks if
the string is empty and either returns EOF or returns a suspension.
The suspended action 'pchars_from_string' extracts one char from
the string and returns the char and a suspension to itself for the
rest of the string.
list
pchars_from_string( void *v ){
char *p = v;
return *p ? cons( Int( *p ), Suspension( p+1, pchars_from_string ) ) : Symbol(EOF);
}
list
chars_from_string( void *v ){
char *p = v;
return *p ? Suspension( p, pchars_from_string ) : Symbol(EOF);
}
And that completes the fancy code for my *support layer* for Parser
Combinators. A simple, concise Lisp-like object system.
The code in this message is all from my latest draft, pc9, from
the object module, which is comprised of the three files:
$ wc pc9obj*[ch]
255 958 5760 pc9obj.c
46 201 1201 pc9obj.h
22 126 902 pc9objpriv.h
323 1285 7863 total
Browse current commit:
https://github.com/luser-dr00g/pcomb/tree/a00fdd9657d0ae0c844dfbd84765748e5a4c028c
Download zip:
https://github.com/luser-dr00g/pcomb/archive/a00fdd9657d0ae0c844dfbd84765748e5a4c028c.zip