Discard extraneous structure returned by parser combinators?

luserdroog

unread,

Nov 13, 2021, 11:54:09 PM11/13/21

to

Is there a systematic way to discard the extra noise that can occur
when using parser combinators? For example, the `many` combinator
which matches zero or more instances of its argument parser.
In the case of zero matches, it still needs to return a value.

In my situation, I'm simulating everything in PostScript because it's
my favorite language. I'm simulating Lisp cons cells as 2-element
arrays. So for this JSON string,

( [ 3, [ 4, [ 5 ] ] ] ) JSON-parse report

if I make no special effort, I get a resulting value that looks like this:

OK
[[3 [[4 [[5 [[] []]] []]] []]] []]
remainder:[]

All those little empty arrays need to just go away, but not any of the
important array structure. `many` and `maybe` seem to be the chief
culprits, but then their results are propagated back by `alt`s and
`then`s all the way back to the top.

Do I need to make some kind of out-of-band signal for these "zeros"
that I can filter out later? The obvious problem here is that the array
type is being used for too many things. But there's a paucity of
types in PostScript, sigh. For the JSON application, I have nametype
objects available that don't have a JSON corollary.

Do I need to rewrite all the combinators to filter out noise values at
every turn?.

Ben Bacarisse

unread,

Nov 14, 2021, 11:01:16 AM11/14/21

to

luserdroog <mij...@yahoo.com> writes:

> Is there a systematic way to discard the extra noise that can occur
> when using parser combinators? For example, the `many` combinator
> which matches zero or more instances of its argument parser.
> In the case of zero matches, it still needs to return a value.
>
> In my situation, I'm simulating everything in PostScript because it's
> my favorite language. I'm simulating Lisp cons cells as 2-element
> arrays. So for this JSON string,
>
> ( [ 3, [ 4, [ 5 ] ] ] ) JSON-parse report
>
> if I make no special effort, I get a resulting value that looks like this:
>
> OK
> [[3 [[4 [[5 [[] []]] []]] []]] []]
> remainder:[]
>
> All those little empty arrays need to just go away, but not any of the
> important array structure.

So you want

[[3 [[4 [[5 []]]]]]]

?

> `many` and `maybe` seem to be the chief
> culprits, but then their results are propagated back by `alt`s and
> `then`s all the way back to the top.
>
> Do I need to make some kind of out-of-band signal for these "zeros"
> that I can filter out later? The obvious problem here is that the array
> type is being used for too many things. But there's a paucity of
> types in PostScript, sigh. For the JSON application, I have nametype
> objects available that don't have a JSON corollary.
>
> Do I need to rewrite all the combinators to filter out noise values at
> every turn?.

It's odd to call something that you are returning (presuambly) as
noise. Are you using lists as a sort of Maybe monad with [] as Nothing?

I think you'd have to show the code to get anything more concrete as a
reply.

--
Ben.

Paul Rubin

unread,

Nov 14, 2021, 6:11:37 PM11/14/21

to

luserdroog <mij...@yahoo.com> writes:
> Is there a systematic way to discard the extra noise that can occur
> when using parser combinators? For example, the `many` combinator
> which matches zero or more instances of its argument parser.
> In the case of zero matches, it still needs to return a value.

I'd expect 'many' to return a list, an empty list in the case of zero
matches. What is the extra noise? Your PostScript example is
confusing. I'd expect ( [ 3, [ 4, [ 5 ] ] ] ) JSON-parse to give
something like [3, [4, [5]]], using square brackets to denote lists.

I didn't know parser combinators were even a thing in PostScript: or are
you trying to implement them? You could look at the Parsec paper to see
how they traditionally worked:

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/parsec-paper-letter.pdf

luserdroog

unread,

Nov 14, 2021, 10:28:00 PM11/14/21

to

On Sunday, November 14, 2021 at 10:01:16 AM UTC-6, Ben Bacarisse wrote:
> luserdroog <mij...@yahoo.com> writes:
>
> > Is there a systematic way to discard the extra noise that can occur
> > when using parser combinators? For example, the `many` combinator
> > which matches zero or more instances of its argument parser.
> > In the case of zero matches, it still needs to return a value.
> >
> > In my situation, I'm simulating everything in PostScript because it's
> > my favorite language. I'm simulating Lisp cons cells as 2-element
> > arrays. So for this JSON string,
> >
> > ( [ 3, [ 4, [ 5 ] ] ] ) JSON-parse report
> >
> > if I make no special effort, I get a resulting value that looks like this:
> >
> > OK
> > [[3 [[4 [[5 [[] []]] []]] []]] []]
> > remainder:[]
> >
> > All those little empty arrays need to just go away, but not any of the
> > important array structure.
> So you want
>
> [[3 [[4 [[5 []]]]]]]
>
> ?

I guess that's the big problem here. I'm not sure what I want. I keep having
to add extra code to clean up and delete the extra stuff. Ultimately the
result should be

[ 3 [ 4 [ 5 ] ] ]

The parser for arrays looks for the left bracket, then ...

/Jarray //begin-array
//value executeonly xthen
//value-separator //value executeonly xthen many then %{ps flatten ps} using
maybe
//end-array thenx
{ %filter-zeros first %ps
} using def

The `executeonly` are in there to prevent infinite recursion if the expanded
code ever gets printed (like in a stack dump while debugging). The /Jarray
parser is one of the components of the //value parser.

Hmm. Initially I had the `then` combinator doing a Lisp-style (append)
operation on my simulated lists, so something like

(a) char (b) char then

would -- if matched by the input -- return

[ (a) [ (b) [] ] ]

which I could then easily massage into

[ (a) (b) ]

But that led me into problems when I wanted to use the combinators
`xthen` and `thenx` which discard one of the two pieces. If the results
are just appended together in a list, then I've lost the information to
peel them back apart. So I changed `then` to just (cons) the pieces
together, and now `xthen` and `thenx` have an easy job.

And extra noise values pop up if I use combinators like `maybe`
and `many` which might succeed with zero matches. So, ...

> > `many` and `maybe` seem to be the chief
> > culprits, but then their results are propagated back by `alt`s and
> > `then`s all the way back to the top.
> >
> > Do I need to make some kind of out-of-band signal for these "zeros"
> > that I can filter out later? The obvious problem here is that the array
> > type is being used for too many things. But there's a paucity of
> > types in PostScript, sigh. For the JSON application, I have nametype
> > objects available that don't have a JSON corollary.
> >
> > Do I need to rewrite all the combinators to filter out noise values at
> > every turn?.
> It's odd to call something that you are returning (presuambly) as
> noise. Are you using lists as a sort of Maybe monad with [] as Nothing?
>

Yes, I think that's what I'm doing, clumsily.

luserdroog

unread,

Nov 14, 2021, 10:29:55 PM11/14/21

to

It's something I've been trying to do in PostScript for a while now.
A lot of the saga is detailed in comp.lang.postscript.
Code is at github.com/luser-dr00g/pcomb/ps

Ben Bacarisse

unread,

Nov 15, 2021, 5:45:12 AM11/15/21

to

I think you need to pin down what you want. You made a remark about
using two-element arrays as cons cells. In that case, parsing (a) and
(b) in sequence /should/ give [ (a) [ (b) [] ] ]. Massaging that into
something else seems like the wrong strategy.

--
Ben.

luserdroog

unread,

Nov 15, 2021, 5:13:31 PM11/15/21

to

Yes, I think I may have presented an X/Y problem or started the story
from the middle. In all of my test cases for this version (regexs,
Postscript scanner, JSON parser) I had to write a `fix` function
to convert these lists into arrays that I can work with more simply.

But with each test case, I've had to re-write the `fix`. And it keeps
running into more problems. The final one that "broke the camel's back"
and sent me searching for the deeper design flaw was this:

( [ {"a":4,"b":5}, 6, {"c":"7"}] ) dup JSON-parse report

which yields this:

OK
[[<< /a 4 /b 5 >> [6 << /c (7) >>]] []]
remainder:[]

I can remove the trailing [] easily. But the extra array in the middle
grouping the 6 and the dictionary, I don't know where to patch in
to grammar to remove that one. So that seems to show that I'm
doing something wrong deeper down in the organization of the
program.

So, it appears I need to back up and rebuild the pieces more slowly,
making sure that the regex example doesn't need any kind of `fix`
before moving on to the more complicated ones.

On a similar front, I can rewrite `xthen` and `thenx` to do their
jobs without composing off of `then`. That way I don't need to
worry about the result of `then` needing to be taken apart again into
2 pieces.

luserdroog

unread,

Nov 20, 2021, 7:49:41 PM11/20/21

to

[snip]

Ugh. I think it wasn't even an X/Y problem. It was a "doctor it hurts when
I move my arm like this; ... so don't move your arm like that" problem.

I want the "result" part of the "reply" structure (using new terms following
usage from the Parsec document) to be any of the /usual/ PostScript types:
integer, real, string, boolean, array, dictionary.

But I also need some way to arbitrarily combine or concatenate two
objects regardless of type. My `then` (aka `seq`) combinator needs to
do this. So I made a hack-y function that does the combining. If it has
two arrays, it composes the contents into a longer array. If it has one
array and some other object it extends the array by one and stuffs
the object in the front or back as appropriate. If it has two non-array
objects it makes a new 2-element array to contain them.

So, instead of building `xthen` and `thenx` off of `then` and needing
to cons, car, and cdr the stuff, I can write all 3 of these as a more
general parameterized function.

sequence{ p q u }{
{ /p exec +is-ok {
next x-xs force /q exec +is-ok {
next x-xs 3 1 roll /u exec exch consok
}{
x-xs 3 2 roll ( after ) exch cons exch cons cons
} ifelse
} if } ll } @func
then { {append} sequence }
xthen { {exch pop} sequence }
thenx { {pop} sequence }

append { 1 index zero eq { exch pop }{
dup zero eq { pop }{
1 index type /arraytype eq {
dup type /arraytype eq { compose }{ one compose } ifelse
}{ dup type /arraytype eq { curry }{ cons } ifelse } ifelse } ifelse } ifelse }

(`@func` is my own non-standard extension to PostScript that takes
a procedure body and list of parameters and wraps the procedure with
code that defines the arguments in a local dictionary. `ll` is my hack-y
PostScript way of making lambdas with hard-patched parameters, it's
short for `load all literals.`)

luserdroog

unread,

Nov 20, 2021, 9:17:24 PM11/20/21

to

On Saturday, November 13, 2021 at 10:54:09 PM UTC-6, luserdroog wrote:
> Is there a systematic way to discard the extra noise that can occur
> when using parser combinators? For example, the `many` combinator
> which matches zero or more instances of its argument parser.
> In the case of zero matches, it still needs to return a value.
>

[snip]
Sigh. I already had this same problem before. It came up when I googled it. [sad trombone]

https://stackoverflow.com/q/55346600/733077

luserdroog

unread,

Nov 20, 2021, 10:55:27 PM11/20/21

to

Ooops. Bad link. Here it at:

https://github.com/luser-dr00g/pcomb/tree/master/ps

luserdroog

unread,

Dec 4, 2021, 12:21:49 AM12/4/21

to

Coda: The rewrite is proceeding well. I got the basic parsers typed up fresh
and simplified from the previous round. And I got the regular expression
parser (test case) written with less gyrations than before. And I got the PostScript
scanner (test case) written but there I did end up needing a `fix` function.

Precisely because of my earlier decision that the result could be any type
or an array of any number of any type, when processing this result ... I do
kinda need to know whether the thing is an array or not and handle them
differently.

So, it's the same but different, you know? I need to call `fix` but it makes
sense why I have to do it, and importantly *where* it will need to be done
to get the thing to work right.

I suppose the moral is, in this situation I can follow the Lisp stuff a
little less literally. And possibly more broadly, as I've lamented in both
comp.lang.c and comp.lang.postscript.

Every few rewrites I try to clean up the lazy evaluation and be more general
and strategic with it. But every time I just end up sprinkling in calls to `force`
until it works right.