Got the regular expression parser working again hurray!

17 views
Skip to first unread message

luser droog

unread,
Nov 4, 2021, 10:04:44 PM11/4/21
to
For the umpteenth time I've rewritten my parser combinators.
With any luck, they finally have all the necessary bells and
whistles added to be useful for other stuff.

The code is in these 3 files. Trimmed of testing gibberish,
it's about 200 lines of code.
https://github.com/luser-dr00g/pcomb/blob/5efeca34f410e8855eff9d38ba21f80983b45afd/ps/pc11are.ps
https://github.com/luser-dr00g/pcomb/blob/5efeca34f410e8855eff9d38ba21f80983b45afd/ps/pc11a.ps
https://github.com/luser-dr00g/pcomb/blob/5efeca34f410e8855eff9d38ba21f80983b45afd/ps/struct2.ps

A parser is a function that takes an <input-stream> type and
yields a <result-structure> type. The <input-stream> is a lazy
list of [char [row col]] structures. The <result-structure> is a
two element array, the first element of which will be /OK /Fail or
/Error. For an /OK result, the second element will be
[result remainder]. For a /Fail or /Error result, the second element
will be [message remainder].

So here's the regular expression parser with this new setup.
And some simple testing code and output, which should be
comprehensible using the above description of the input/output.
In effect, this is a syntax-directed compiler from regular expressions
to the syntax for constructing parsers using these same combinators.

errordict/typecheck{ps pe quit}put
%errordict/stackunderflow{pe quit}put
%errordict/stackunderflow{pq}put
%errordict/undefined{pq}put
%(../../debug.ps/db5.ps) run
(pc11a.ps)run {
fix { flatten clean }
clean { { dup zero eq { pop } if } map }
? { {maybe} compose }
+ { {some} compose }
* { {many} compose }
} pairs-begin

/Dot (.) char {pop {item} one} using def
/Meta (*+?) anyof def
/Character (*+?.|()) noneof {first {literal} curry one} using def
/Expression {-777 exec} def
/Atom //Dot
(\() char //Expression executeonly xthen (\)) char thenx alt
//Character alt def
/Factor //Atom /A
//Meta {/A load first exch first load exec one } using
maybe {dup first zero eq {pop /A load} if } using
into def
/Term //Factor //Factor many then
{ fix { {then} compose compose } reduce one } using def
//Expression 0 //Term (|) char //Term xthen many then
{ fix { {plus} compose compose } reduce one } using put

/regex { 0 0 3 2 roll string-input //Expression exec report } def

{
0 0 (ab) string-input //Dot maybe exec pc
0 0 (ab) string-input //Meta exec pc
0 0 (*) string-input //Meta exec pc
0 0 (ab) string-input //Character maybe exec pc
0 0 (ab) string-input //Atom maybe exec pc
0 0 (.) string-input //Atom maybe exec pc
%0 0 (a*) string-input //Atom //Meta then ==
0 0 (a*) string-input //Atom //Meta then exec pc
0 0 (ab) string-input Factor pc
0 0 (a*) string-input Factor pc
0 0 (ab) string-input Term pc
0 0 (ab|c) string-input Expression pc
(ab) regex
} exec
{
} pop
quit


$ gsnd -dNOSAFER pc11are.ps
GPL Ghostscript 9.52 (2020-03-19)
Copyright (C) 2020 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
stack:
[/OK [[[] []] [[(a) [0 0]] {0 1 (b) string-input}]]]
:stack
stack:
[/Fail [[{(*+?) within} (not satisfied)] [[(a) [0 0]] {0 1 (b) string-input}]]]
:stack
stack:
[/OK [[(*) []] {0 1 () string-input}]]
:stack
stack:
[/OK [[{(a) literal} []] {0 1 (b) string-input}]]
:stack
stack:
[/OK [[{(a) literal} []] {0 1 (b) string-input}]]
:stack
stack:
[/OK [[{item} []] {0 1 () string-input}]]
:stack
stack:
[/OK [[{(a) literal} [(*) []]] {0 2 () string-input}]]
:stack
stack:
[/OK [[{(a) literal} []] [[(b) [0 1]] {0 2 () string-input}]]]
:stack
stack:
[/OK [[{(a) literal many} []] {0 2 () string-input}]]
:stack
stack:
[/OK [[{(a) literal (b) literal then} []] []]]
:stack
stack:
[/OK [[{(a) literal (b) literal then (c) literal plus} []] []]]
:stack
OK
[{(a) literal (b) literal then}]
remainder:[]

Jeffrey H. Coffield

unread,
Nov 5, 2021, 11:15:55 AM11/5/21
to



On 11/04/2021 07:04 PM, luser droog wrote:
> For the umpteenth time I've rewritten my parser combinators.
> With any luck, they finally have all the necessary bells and
> whistles added to be useful for other stuff.


I spent a significant amount of time writing several versions of a
generalized parser and never got all the "bells and whistles" that I
wanted. Then I came across Antlr4 which is in Java and while I admire
your intent to parse Postscript in Postscript, Antlr4 has tons of "bells
and whistles" and can be applied to practically any language, although I
don't see that anyone has posted a PostScript grammar yet. There about
250 different languages currently available at :

https://github.com/antlr/grammars-v4

There is a grammar for something called RPN which sounds like it could
be a starting point.

https://github.com/antlr/grammars-v4/tree/master/rpn

Antlr4 takes some time to learn and I'm not sure how much Java you would
need to know (I program about 80% in Java now) but if it sounded like
something you are interested in, I would be willing to help. There is a
great book "The Definitive Antlr 4 Reference".

I am not associated directly with Antlr4 but have submitted some patches.

Jeff Coffield
www.digitalsynergyinc.com

luser droog

unread,
Nov 5, 2021, 8:21:12 PM11/5/21
to
Thanks, I'll check those out. I've had Antlr suggested to me many times before
but I've shied away thus far, probably from squeamish feelings about Java.
(If only Sun NeWS had been successful, we might never have needed Java.)

But I had to use Java for several school projects last year while finally
completing my degree. It's not so scary anymore. I'm pretty happy with
the composability of my functions the way they work now, but there are
probably ways to optimize the execution or use a different model behind
the scenes and maintain a similar interface. I'm also going to need a
nice syntax for describing grammars so I can write a parser for /that/
and have it translate to the combinators. Then you can have any color
you want as long as it's black.

Which bells and whistles were you still missing? My latest version adds
error messages which were sorely lacking in previous ones. But I've also
gotten a handle on how to do lazy execution. And bizarrely enough, I've
been able to prototype in PostScript and then re-code in C, despite the
dissimilarity between those (the C one kinda pretends to be Lisp, tho).

For PostScript Level 1, the grammar is actually super simple.

<Object> :: <Single>
|| { <Object> * }

<Single> :: <Integer> | <Real> | <Name> | <String>

Some fiddly stuff further down the tree. But it's just the one production
for procedures that needs a CFG. The rest of the language is strictly
Regular. Level 2 adds a few more kinds of Single, like binary name
tokens and binary object encodings.

So probably no one has written a grammar for PS for popular parsing
libraries because it doesn't get you anywhere. You get just [Object
Object Object ...]. It would get you like 2% closer to being able to
write a translator or interpreter for PostScript. Whereas with almost
any other structured language the grammar gets you roughly 50% there.

Reply all
Reply to author
Forward
0 new messages