PS Level 1 grammar

luser droog, Nov 7, 2021, 7:40:43 PM

Here's a rough draft of the grammar for a PS tokenizer using my new functions.
This is almost exactly the same code as the previous version, pc9token.ps,
with just a few function names changed and no handlers yet to transform
the data. My previous code didn't do the recursion for procedures; this
one does, or should, assuming it works.

I'm building recursive parsers by starting with a "forwarding" proc
/myparser {-777 exec} def
which can be composed with other parsers and filled
in later by doing
//myparser 0 //composed-parser put
This is the simplest way I've found so far after struggling with more
complicated ways.
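
A minimal sketch of the forwarding trick in plain PostScript, with no parser library involved. The names `countdown` and `countdown-body` are made up for illustration; the only point is that `//name` embeds the forwarding array itself, so patching slot 0 later makes the recursion work.

```postscript
% Forwarding proc: a two-element procedure whose slot 0 is a placeholder.
/countdown {-777 exec} def

% The real body. //countdown embeds the forwarding array object itself,
% so the body refers to the very array that is patched below.
/countdown-body {
    dup 0 gt { dup = 1 sub //countdown exec } { pop (done) = } ifelse
} def

% Patch slot 0: the forwarding proc is now { <countdown-body> exec }.
//countdown 0 //countdown-body put

3 countdown    % prints 3, 2, 1, done (one per line)
```

This relies on procedure arrays being writable, which is what the `put` in the grammar also assumes.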

It's missing some stuff like e notation, hex strings, ASCII85.
pc11atoken.ps:

(pc11a.ps)run

/delimiters ( \t\n()/%[]<>{}) def
/delimiter delimiters anyof def
/octal (0)(7) range def
/digit (0)(9) range def
/alpha (a)(z) range (A)(Z) range alt def
/regular delimiters noneof def

/number //digit some def
/opt-number //digit many def
/rad-digits //digit //alpha plus some def
/rad-integer //digit //digit maybe then (#) char then //rad-digits then def
/integer (+-) anyof maybe //number then def
/real (+-) anyof maybe
        //number (.) char then //opt-number then
        (.) char //number then alt then def
/name //regular some def

/ps-char {-777 exec} def            % forwarding proc, filled in below
/escape (\\) char
        (\\) char
        (\() char alt
        (\)) char alt
        (n) char alt
        (r) char alt
        (t) char alt
        (b) char alt
        (f) char alt
        //octal //octal maybe then //octal maybe then alt
    then def
/substring (\() char //ps-char many then (\)) char then def
//ps-char 0 //escape
    //substring alt
    (()) noneof alt put
/ps-string (\() char //ps-char many then (\)) char then def

/spaces ( \t\n) anyof many def
/object {-777 exec} def             % forwarding proc, filled in below
/ps-token //spaces //object xthen def
//object 0 //rad-integer
    //real alt
    //integer alt
    //name alt
    (/) char //name then alt
    (/) char (/) char then //name then alt
    //ps-string alt
    ({) char //ps-token many then //spaces (}) char xthen then alt
    //delimiter alt put

luser droog, Nov 8, 2021, 12:49:30 PM

/hex-char //digit (a)(f) range (A)(F) range alt alt def
/non-hex-char //hex-char none def
/hex-string (<) char //non-hex-char many //hex-char xthen many then (>) char then def


> /spaces ( \t\n) anyof many def
> /object {-777 exec} def
> /ps-token //spaces //object xthen def
> //object 0 //rad-integer
> //real alt
> //integer alt
> //name alt
> (/) char //name then alt
> (/) char (/) char then //name then alt
> //ps-string alt

//hex-string alt


> ({) char //ps-token many then //spaces (}) char xthen then alt
> //delimiter alt put

Adding hex strings needed a new combinator `none` that I'd been able to avoid
until now. In earlier versions it had been a factor of `noneof` which matches
the inverse of a set of characters.

pc9.ps:
noneof { anyof none }
none {p} { { dup /p exec [] ne { zero }{ item } ifelse exec } ll } @func

But I found a simpler way to write `anyof` and `noneof` since this version
builds everything on top of `pred satisfy`. So they can both use a factor
`within` that checks a character against a string.

pc11.ps:
anyof { {within} curry satisfy }
noneof { {within not} curry satisfy }
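
For readers without pc11.ps at hand, here is a rough sketch of what such a `within` test could look like in plain PostScript. This is an assumption, not the library's definition: it takes a one-character string and a set string, and the predicates are written out by hand instead of being built with `curry`.

```postscript
% Hypothetical stand-in for `within`:  (c) (set) within -> bool
/within { exch search { pop pop pop true } { pop false } ifelse } def

% The kind of predicate `anyof` and `noneof` would hand to `satisfy`.
/is-delimiter { ( \t\n()/%[]<>{}) within } def
/is-regular   { ( \t\n()/%[]<>{}) within not } def

([) is-delimiter =    % true
(a) is-delimiter =    % false
(a) is-regular =      % true
```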

But to do the inverse of a parser built out of 3 ranges, I really need the
more general `none` now.

So this function takes a parser as a named parameter then constructs
a new procedure with this parameter substituted inside (like a primitive
'lambda') and yields this procedure as its result.

none{ p }{
{ dup /p exec +is-ok { pop [ /p ( succeeded) ] fail }{ pop item } ifelse exec } ll
} @func

It also just includes the parameter parser as part of the error message,
which could result in a very unhelpful message. But I think it's the best
that can be done here with the information available. It might be nicer
if `none` had access to a higher-level description of its parameter.
But I'm not sure how to orchestrate that right now.
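
The "substitute a parameter into a new procedure" idea can be shown at the predicate level with standard operators only. This is a simplified analogue, not the `none` combinator itself: `negate`, `is-hex`, and `is-non-hex` are hypothetical names, and `within` is the same assumed helper as a plain `search` wrapper.

```postscript
/within { exch search { pop pop pop true } { pop false } ifelse } def
/is-hex { (0123456789abcdefABCDEF) within } def

% negate: pred -> pred'   builds the new procedure { pred exec not }
/negate { [ exch /exec cvx /not cvx ] cvx } def

/is-non-hex //is-hex negate def

(g) is-non-hex =    % true
(F) is-non-hex =    % false
//is-non-hex ==     % shows the constructed body with is-hex substituted in
```

The real `none` works on parsers and result values rather than booleans, so it has to inspect the result with `+is-ok` instead of applying `not`.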

luser droog, Nov 9, 2021, 11:56:46 AM

On Monday, November 8, 2021 at 11:49:30 AM UTC-6, luser droog wrote:
> On Sunday, November 7, 2021 at 6:40:43 PM UTC-6, luser droog wrote:
> > Here's a rough draft of the grammar for a PS tokenizer using my new functions.

A little more fleshed out, formatted, and slightly tested. I've been having
to futz around with the innards of several parsers like `then` and `many`
to get `xthen` and `thenx` to work reliably. I had been using `append` to
combine the results of two sequential parsers, and `append` works like
a Lisp list append; i.e., it scans to the end of the cdr chain and then replaces
the last null with the new element. That all works for the most part.

It fails when you try to do fancy stuff like `xthen` and `thenx`. These
are sequencing combinators like `then` which runs one parser and
then the other on the remainder from the first. But `xthen` has the
extra trick of discarding the result of the first parser, and `thenx`
discards the result of the second parser.

These are great for discarding stuff during the parse. Like when
processing escape codes, some simple ones like (\\) (\() (\)) are
completely handled by simply discarding the first slash. And all
the escape handling is simplified by just doing that wholesale
in all cases.

But if you're appending results into a long list, then you've lost
the <first> vs. <second> structure! The obvious solution for that
was to replace the calls to `append` with calls to `cons` which
just groups the two parts into a 2-element array: easy to grab
the two pieces out later.
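
A small sketch of why the pair form helps, using stand-in definitions of `cons`, `first`, and `second` (assumed here as plain 2-element arrays; the library versions may differ). Once the two results are kept as a pair, discarding one side is a single selector.

```postscript
/cons   { 2 array astore } def    % a b -> [a b]
/first  { 0 get } def
/second { 1 get } def

% Pretend two sequential parsers matched a backslash and then an n.
(\\) (n) cons            % -> [(\\) (n)]
dup second =             % what xthen would keep: n
first =                  % what thenx would keep: \
```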

Doing that caused a bug that took a while to track down. It caused
a problem in the handlers, all the procedures composed into the
parsers with `using`. In all of them I was calling a function called
`flatten` that only knew how to deal with 1-D Lisp lists. So it went
wild with a weird non-list cons structure.

So now it all works by using a more powerful function called
`unwrap` which can tease apart whatever weird cons tangle
is thrown at it. But you can't see any of this here; it's all inside
the `fix` function from pc11a.ps.
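
The real `unwrap` lives in pc11a.ps and is not shown here. As a rough idea of what such a function has to do, this sketch (with hypothetical names `leaves` and `unwrap-sketch`) walks any nesting of literal arrays and collects the non-array leaves into one flat array. Executable arrays are left alone, on the assumption that a lazy remainder should not be opened up.

```postscript
% leaves: any -> leaf1 leaf2 ...   (recursively spills literal arrays)
/leaves {
    dup type /arraytype eq { dup xcheck not } { false } ifelse
    { { leaves } forall } if
} def

/unwrap-sketch { [ exch leaves ] } def

[ (a) [ (b) [ [ (c) (d) ] [] ] ] ] unwrap-sketch ==    % [(a) (b) (c) (d)]
```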

With this fix, I only just now got hex strings to appear to work,
discarding any non-hex characters found. I still need to interpret
the hex characters, do some handling for procedures, and add
e notation.

Then a further challenge if I actually want to emulate the
`token` operator. I'll need to reliably recreate the remainder substring.
This string may or may not be reliably tucked into the lazy
remainder list still in string form. So some fiddly business may
be needed to reconstruct this string.
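
For reference, this is the behaviour to match, followed by one way the remainder could be rebuilt if the number of unconsumed characters is known. The helper name `tail` is made up for this sketch; how that count is obtained from the lazy list is the open question.

```postscript
% The built-in operator:  string token -> post any true   (or false)
(name[delim) token
{ == == } { (no token) = } ifelse      % prints: name   then   ([delim)

% tail: s n -> the last n characters of s, as a substring of s
/tail { 1 index length 1 index sub exch getinterval } def

(name[delim) 6 tail ==                 % ([delim)
```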

%errordict/typecheck{pq}put
(pc11a.ps)run <<
    /interpret-octal { 0 exch { first 48 sub exch 8 mul add } forall }
    /to-char { 1 string dup 0 4 3 roll put }
>> begin

/delimiters ( \t\n()/%[]<>{}) def
/delimiter delimiters anyof def
/octal (0)(7) range def
/digit (0)(9) range def
/alpha (a)(z) range (A)(Z) range alt def
/regular delimiters noneof def

/rad-digit //digit //alpha alt def
/rad-integer //digit //digit maybe then (#) char then //rad-digit some then def
/number //digit some def
/opt-number //digit many def
/integer (+-) anyof maybe //number then def
/real (+-) anyof maybe
        //number (.) char then //opt-number then
        (.) char //number then alt then def

/name //regular some def

/ps-char {-777 exec} def
/escape (\\) char
        (\\) char
        (\() char alt
        (\)) char alt
        (n) char { pop (\n) one } using alt
        (r) char { pop (\r) one } using alt
        (t) char { pop (\t) one } using alt
        (b) char { pop (\b) one } using alt
        (f) char { pop (\f) one } using alt
        //octal //octal maybe then //octal maybe then
            { fix interpret-octal to-char one } using alt
    xthen def
/ps-string (\() char //ps-char executeonly many then (\)) char then def
//ps-char 0 //escape
    //ps-string alt
    (()) noneof alt put

/hex-char //digit (a)(f) range (A)(F) range alt alt def
/non-hex-char //hex-char (>) char alt none def
/hex-string (<) char
        //non-hex-char many //hex-char xthen many then //non-hex-char many thenx
        (>) char then def

/spaces ( \t\n) anyof many def
/object {-777 exec} def
/ps-token //spaces //object executeonly xthen def

//object 0 //rad-integer { fix to-string cvi } using
    //real { fix to-string cvr } using alt
    //integer { fix to-string cvi } using alt
    //name { fix to-string cvn cvx } using alt
    (/) char //name then { fix to-string rest cvn cvlit } using alt
    (/) char (/) char then //name then { fix to-string rest rest cvn load } using alt
    //ps-string { fix to-string 1 1 index length 2 sub getinterval } using alt
    //hex-string { fix 1 1 index length 2 sub getinterval } using alt
    ({) char //ps-token many then //spaces (}) char xthen then alt
    //delimiter { fix to-string cvn cvx } using alt
put

/mytoken {
    dup length 0 gt {
        0 0 3 2 roll string-input //ps-token exec
    }{ pop false } ifelse
} def

{
0 0 (47) string-input //integer exec pc
0 0 (47) string-input //number exec pc
0 0 (8#117) string-input
//digit //digit maybe then (#) char then //rad-digit some then exec pc
%quit
0 0 (8#117) string-input //rad-integer exec pc
0 0 (1.17) string-input //real exec pc
} pop

(8#117) mytoken pc
(47) mytoken pc
(string) mytoken pc
([stuff) mytoken pc
(/litname) mytoken pc
(42.42) mytoken pc
((a\\117 \\\\string\\n)) mytoken ps second first print clear
/thing 12 def
(//thing) mytoken pc
(<abc defg>) mytoken pc

quit

$ gsnd -dNOSAFER pc11atoken.ps
GPL Ghostscript 9.52 (2020-03-19)
Copyright (C) 2020 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
stack:
[/OK [79 []]]
:stack
stack:
[/OK [47 {0 2 () string-input}]]
:stack
stack:
[/OK [string []]]
:stack
stack:
[/OK [[ {0 1 (stuff) string-input}]]
:stack
stack:
[/OK [/litname []]]
:stack
stack:
[/OK [42.42 []]]
:stack
stack:
[/OK [(aO \\string\n) {0 18 () string-input}]]
:stack
aO \string
stack:
[/OK [12 []]]
:stack
stack:
[/OK [[(a) (b) (c) (d) (e) (f)] {0 10 () string-input}]]
:stack
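
The `\117` in the test string shows up as `O` in the output above because octal 117 is 79, the character code for `O`. The arithmetic can be checked with a standalone variant of `interpret-octal` that runs over a plain string of digits instead of the parser's result list (`octal-value` is a name invented for this sketch; `to-char` is copied from the listing).

```postscript
% octal-value: string of octal digits -> integer
/octal-value { 0 exch { 48 sub exch 8 mul add } forall } def

% to-char: character code -> 1-character string
/to-char { 1 string dup 0 4 3 roll put } def

(117) octal-value dup =    % 79
to-char =                  % O
```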

luser droog, Nov 11, 2021, 6:02:01 PM

On Tuesday, November 9, 2021 at 10:56:46 AM UTC-6, luser droog wrote:
> On Monday, November 8, 2021 at 11:49:30 AM UTC-6, luser droog wrote:
> > On Sunday, November 7, 2021 at 6:40:43 PM UTC-6, luser droog wrote:
> > > Here's a rough draft of the grammar for a PS tokenizer using my new functions.
> A little more fleshed out, formatted, and slightly tested.
[snip]
> Still need to interpret
> the hex characters and do some handling for procedures.
> And e notation.
>
> Then a further challenge if I actually want to emulate the
> `token` operator. I'll need to reliably recreate the remainder substring.
> This string may or may not be reliably tucked into the lazy
> remainder list still in string form. So some fiddly business may
> be needed to reconstruct this string.
>

It is done. All that stuff.

https://github.com/luser-dr00g/pcomb/blob/f2d20f01a4a4a0fb28e184143f66d0d0f0584bdb/ps/struct2.ps
https://github.com/luser-dr00g/pcomb/blob/f2d20f01a4a4a0fb28e184143f66d0d0f0584bdb/ps/pc11a.ps
https://github.com/luser-dr00g/pcomb/blob/f2d20f01a4a4a0fb28e184143f66d0d0f0584bdb/ps/pc11atoken.ps

$ cat pc11atoken.ps
%errordict/typecheck{pq}put
(pc11a.ps)run <<
    /middle { 1 1 index length 2 sub getinterval }   % drop first and last element
    /interpret-octal { 0 exch { first 48 sub exch 8 mul add } forall }
    /interpret-hex {
        { dup (9) le { first 48 sub }{ first 55 sub dup 15 gt { 32 sub } if } ifelse } map
        dup length 2 mod 1 eq { [ 0 ] compose } if
        [ exch 2 { aload pop exch 16 mul add to-char } fortuple ]
        to-string }
    /interpret-ascii85 {
        { dup (z) eq { pop (!)(!)(!)(!)(!) }{ dup ( \t\n) within { pop } if } ifelse } map
        [ 1 index length 5 mod { (u) } repeat ] compose
        [ exch 5 {
            0 exch { first 33 sub exch 85 mul add } forall
            4 { dup 256 mod exch 256 idiv } repeat pop 4 aa reverse
            { to-char } forall
        } fortuple ]
        to-string }
>> begin

/delimiters ( \t\n()/%[]<>{}) def
/initials ([]) anyof def
/delimiter delimiters anyof def
/octal (0)(7) range def
/digit (0)(9) range def
/alpha (a)(z) range (A)(Z) range alt def
/regular delimiters noneof def
/spaces ( \t\n) anyof many def

/rad-digit //digit //alpha alt def
/rad-integer //digit //digit maybe then (#) char then //rad-digit some then def
/number //digit some def
/opt-number //digit many def
/eE (eE) anyof (+-) anyof maybe then //number then def
/integer (+-) anyof maybe //number then def
/real (+-) anyof maybe
        //number (.) char then //opt-number then //eE maybe then
        (.) char //number then //eE maybe then alt
        //number //eE then alt
    then def

/name //regular some def

/ps-char {-777 exec} def
/escape (\\) char
        (\\) char
        (\() char alt
        (\)) char alt
        (n) char { pop (\n) one } using alt
        (r) char { pop (\r) one } using alt
        (t) char { pop (\t) one } using alt
        (b) char { pop (\b) one } using alt
        (f) char { pop (\f) one } using alt
        //octal //octal maybe then //octal maybe then
            { fix interpret-octal to-char one } using alt
    xthen def
/ps-string (\() char //ps-char executeonly many then (\)) char then def
//ps-char 0 //escape
    //ps-string alt
    (()) noneof alt put

/hex-char //digit (a)(f) range (A)(F) range alt alt def
/hex-string (<) char
//spaces //hex-char xthen many then //spaces thenx
(>) char then def

/ascii85-char ( )(z) range (\t\n) anyof alt def
/ascii85-string (<~) str
//spaces //ascii85-char xthen many xthen //spaces thenx
(~>) str thenx def

/object {-777 exec} def
/ps-token //spaces //object xthen def

//object 0 //rad-integer { fix to-string cvi } using
    //real { fix to-string cvr } using alt
    //integer { fix to-string cvi } using alt
    (/) char (/) char then //name then { fix to-string rest rest cvn load } using alt
    (/) char //name maybe then { fix to-string rest cvn cvlit } using alt
    //name { fix to-string cvn cvx } using alt
    //ps-string { fix to-string middle } using alt
    //hex-string { fix middle interpret-hex } using alt
    //ascii85-string { fix interpret-ascii85 } using alt
    ({) char
        //ps-token many executeonly xthen
        //spaces %{(s)= ps}using
            (}) char %{(b)= ps}using
            xthen
        thenx { first cvx } using alt
    //initials (<<) str alt (>>) str alt { fix to-string cvn cvx } using alt
put

/remainder-length {
    dup zero eq { pop 0 }{
        dup type /arraytype ne { what? }{
            dup xcheck {
                2 get length
            }{
                second remainder-length 1 add
            } ifelse
        } ifelse } ifelse
} def

/mytoken {
    dup length 0 gt {
        dup 0 0 3 2 roll string-input //ps-token exec +is-ok { % s result=ok
            second aload pop % s res rem
            %dup zero eq { 3 -1 roll pop pop () exch true }{
            %dup type /arraytype eq 1 index xcheck and { 3 -1 roll pop 2 get exch true }{
            3 -1 roll exch remainder-length 1 index length 1 index sub exch getinterval
            exch true
            %} ifelse } ifelse
        }{ % s result=not-ok
            pop pop false
        } ifelse
    }{ pop false } ifelse
} def

/test-mytoken {
    /s exch def
    s token %1 index type =
    s mytoken %1 index type =
} def

{
0 0 (47) string-input //integer exec pc
0 0 (47) string-input //number exec pc
0 0 (8#117) string-input
//digit //digit maybe then (#) char then //rad-digit some then exec pc
%quit
0 0 (8#117) string-input //rad-integer exec pc
0 0 (1.17) string-input //real exec pc
} pop

(8#117) test-mytoken pc
(47) test-mytoken pc
(string) test-mytoken pc
([stuff) test-mytoken pc
(/litname) test-mytoken pc
(42.42) test-mytoken pc
((a\\117 \\\\string\\n)) test-mytoken pc
/thing 12 def
(//thing) test-mytoken pc
(<abc def >) test-mytoken pc
(name[delim) test-mytoken pc
({a proc}) test-mytoken pc
(/(str)) test-mytoken pc
(2e5) test-mytoken pc
(<~ 9jq o^~>) test-mytoken pc

quit

$ gsnd -dNOSAFER pc11atoken.ps
GPL Ghostscript 9.52 (2020-03-19)
Copyright (C) 2020 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
stack:
true
79
()
true
79
()
:stack
stack:
true
47
()
true
47
()
:stack
stack:
true
string
()
true
string
()
:stack
stack:
true
[
(stuff)
true
[
(stuff)
:stack
stack:
true
/litname
()
true
/litname
()
:stack
stack:
true
42.42
()
true
42.42
()
:stack
stack:
true
(aO \\string\n)
()
true
(aO \\string\n)
()
:stack
stack:
true
12
()
true
12
()
:stack
stack:
true
(\253\315\357)
()
true
(\253\315\357)
()
:stack
stack:
true
name
([delim)
true
name
([delim)
:stack
stack:
true
{a proc}
()
true
{a proc}
()
:stack
stack:
true
/
(\(str\))
true
/
(\(str\))
:stack
stack:
true
200000.0
()
true
200000.0
()
:stack
stack:
true
(Man )
()
true
(Man )
()
:stack
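
Two of the results above can be checked by hand: `ab` in the hex string is 171 (octal 253), and the ASCII85 group `9jqo^` decodes to `Man ` followed by a space. The sketch below redoes that arithmetic over plain strings with standard operators only. The names `nibble`, `decode-hex-pair`, and `decode-group` are invented for illustration and are not the library's `interpret-hex` or `interpret-ascii85`.

```postscript
% nibble: character code of a hex digit -> 0..15  (same test as interpret-hex)
/nibble {
    dup 57 le { 48 sub } { 55 sub dup 15 gt { 32 sub } if } ifelse
} def

% decode-hex-pair: 2-character string -> byte value
/decode-hex-pair {
    dup 0 get nibble 16 mul exch 1 get nibble add
} def

% decode-group: 5-character ASCII85 group -> 4-character string
/decode-group {
    0 exch { 33 sub exch 85 mul add } forall    % n  (base-85 value)
    4 string exch                               % s n
    3 -1 0 {                                    % s n i
        2 index exch                            % s n s i
        2 index 256 mod put                     % s n      s[i] = n mod 256
        256 idiv                                % s n/256
    } for
    pop                                         % s
} def

(ab) decode-hex-pair =       % 171
(9jqo^) decode-group ==      % (Man )
```

A full group can reach 2^32 - 1, so this sketch assumes an interpreter whose integers are wide enough; the example group stays below 2^31.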
