
remove block of text containing a pattern


alb

Jul 13, 2017, 5:21:18 PM
Hi there,

I have a PostScript file and I want to remove some content from it before
converting it to PDF.
I'd like to remove every block of code delimited by gsave and grestore that
contains a certain PATTERN.

Here is an example:

newpath
100 100 moveto
0 100 rlineto
100 0 rlineto
0 -100 rlineto
-100 0 rlineto
closepath
gsave
0.5 1 0.5 setrgbcolor
fill
grestore
1 0 0 setrgbcolor
4 setlinewidth
stroke

Assume I'm interested in removing the block that contains 'fill'. The expected
result will look like this:

newpath
100 100 moveto
0 100 rlineto
100 0 rlineto
0 -100 rlineto
-100 0 rlineto
closepath
1 0 0 setrgbcolor
4 setlinewidth
stroke


To add a little bit more to the fun, gsave and grestore can be nested with
other gsave and grestore and this scenario should not break our ps.

I've tried following the approach from here:
https://stackoverflow.com/questions/13670009/remove-block-of-text-between-two-lines-based-on-content

It manages to detect the block of code, but it seems it removes a lot more than
anticipated.

Any pointer is appreciated,

Al

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Kaz Kylheku

Jul 13, 2017, 6:51:37 PM
On 2017-07-13, alb <a...@notmymail.com> wrote:
> To add a little bit more to the fun, gsave and grestore can be nested with
> other gsave and grestore and this scenario should not break our ps.

But which nesting is of interest?

gsave
gsave
gsave
...
fill
...
grestore
grestore
grestore

Here there are three gsave/grestore blocks which contain a fill.

Is it required to remove all of them, or just the innermost one?

alb

Jul 14, 2017, 12:20:28 AM
Kaz Kylheku <686-67...@kylheku.com> Wrote in message:
Indeed just the innermost.
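
One way to read "just the innermost" is: drop only the block whose own lines include the target word, and leave the enclosing blocks alone. A minimal Python sketch of that reading follows. It is not from this thread: strip_blocks is a made-up name, and it assumes gsave, grestore and the target each sit alone on a line, as in the example.

```python
def strip_blocks(lines, target="fill"):
    """Drop each gsave..grestore block that directly contains `target`."""
    stack = [[]]        # stack[0] collects top-level output
    tainted = [False]   # tainted[i]: block i has seen `target` at its own level
    for line in lines:
        word = line.strip()
        if word == "gsave":
            stack.append([line])
            tainted.append(False)
        elif word == "grestore" and len(stack) > 1:
            block = stack.pop() + [line]
            if not tainted.pop():
                stack[-1].extend(block)   # keep the block inside its parent
        else:
            if word == target:
                tainted[-1] = True
            stack[-1].append(line)
    # Assumption: unterminated blocks are passed through unchanged.
    return [line for block in stack for line in block]
```

A target found outside any block is left alone, since there is no block to remove.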

Dave Sines

Jul 14, 2017, 8:16:36 AM
#! /usr/bin/nawk -f

BEGIN {
    fill = "(^|[[:space:]])fill([[:space:]]|$)"
    gsave = "(^|[[:space:]])gsave([[:space:]]|$)"
    grestore = "(^|[[:space:]])grestore([[:space:]]|$)"
}

function buf(lead,    rc, a, f, nl) {
    rc = lead
    nl = "\n"
    f = 0
    while (getline a > 0) {
        if (a ~ fill)
            f = 1
        if (a ~ gsave) {
            a = buf(a)
            if (a != "") {
                rc = rc nl a
            }
        } else {
            rc = rc nl a
            if (a ~ grestore)
                break
        }
    }
    return f ? "" : rc
}

$0 ~ gsave {
    h = buf($0)
    if (h != "")
        print h
    next
}

{ print }

Ed Morton

Jul 14, 2017, 8:44:55 AM
On 7/13/2017 4:21 PM, alb wrote:
> Hi there,
>
> I have a postscript file and I want to remove some content in it before
> converting it in pdf.
> I'd like to remove every block of code delimited by gsave and grestore that
> cointains a certain PATTERN.

Avoid using the word "pattern" as it's ambiguous - use "string" or "regexp" (or
for shell file operations "globbing pattern"), whichever you really mean.
You should have provided sample input that had those characteristics (and others
such as one of the 3 strings you care about being part of another word) so we'd
be testing a potential solution against that instead of just your simplest case.

>
> I've tried following the approach from here:
> https://stackoverflow.com/questions/13670009/remove-block-of-text-between-two-lines-based-on-content
>
> It manages to detect the block of code, but it seems it removes a lot more than
> anticipated.
>
> Any pointer is appreciated,
>
> Al
>

With GNU awk for word boundaries:

$ cat tst.awk
/\<gsave\>/ {
    printf "%s", block    # flush any enclosing block collected so far
    inBlock = 1
    foundTgt = 0
    block = ""
}
inBlock {
    block = block $0 ORS
    if (/\<fill\>/) {
        foundTgt = 1
    }
    else if (/\<grestore\>/) {
        if (!foundTgt) {
            printf "%s", block
        }
        inBlock = 0
        block = ""        # clear it so the next gsave does not re-emit this block
    }
    next
}
{ print }

$ awk -f tst.awk file
newpath
100 100 moveto
0 100 rlineto
100 0 rlineto
0 -100 rlineto
-100 0 rlineto
closepath
1 0 0 setrgbcolor
4 setlinewidth
stroke

With other awks, replace \< with ^[[:space:]]* and \> with [[:space:]]*$ or
similar, or, if you don't really have leading white space, just use $0 == "gsave", etc.
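
To see why the delimiters matter: PostScript also has an eofill operator, so a plain substring test for "fill" would hit it too. Below is a rough Python sketch of a whitespace-delimited match, the same idea as a regexp of the form (^|[[:space:]])fill([[:space:]]|$). It is an illustration only, not part of the awk script, and the helper name word_re is made up.

```python
import re

def word_re(word):
    """Match `word` only when it is delimited by whitespace or the line ends."""
    return re.compile(r"(?:^|\s)" + re.escape(word) + r"(?:\s|$)")

fill = word_re("fill")

# Show which sample lines count as containing the word "fill".
for line in ["fill", "  fill  ", "0.5 1 0.5 setrgbcolor fill", "eofill", "refill now"]:
    print(repr(line), bool(fill.search(line)))
```

The first three lines match and the last two do not, whereas a substring test such as "fill" in line would accept all five.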

Ed.

Ed Morton

Jul 14, 2017, 9:25:14 AM
On 7/14/2017 7:15 AM, Dave Sines wrote:
> while (getline a > 0) {

IIRC the above is ambiguous and you need to write

while ( (getline a) > 0 ) {

instead.

Not that I'm suggesting using getline for this would be a good idea of course
since, among other things, it'd make it harder to enhance the code later to do
anything else like print all lines containing "foo" to file "bar" or count all
lines containing "stuff" or.... (since you're introducing a 2nd "work loop"
reading lines you'd have to duplicate any code you want to run on all lines -
once in the normal body of the script and then again inside the getline loop)

Ed.

Janis Papanagnou

Jul 14, 2017, 10:12:43 AM
On 14.07.2017 15:24, Ed Morton wrote:
> On 7/14/2017 7:15 AM, Dave Sines wrote:
>> alb <a...@notmymail.com> wrote:
>>> Hi there,
>>>
>>> I have a postscript file and I want to remove some content in it before
>>> converting it in pdf.
>>> I'd like to remove every block of code delimited by gsave and grestore that
>>> cointains a certain PATTERN.
>>>
>>> Here is an example:
>>>
[...]
>>>
>>> Assume I'm interested in removing the block that cointains 'fill'. The
>>> expected
>>> result will look like this:
>>>
[...]
>>>
>>> To add a little bit more to the fun, gsave and grestore can be nested with
>>> other gsave and grestore and this scenario should not break our ps.
>>
>>
>> #! /usr/bin/nawk -f
>>
>> BEGIN {
>> fill = "(^|[[:space:]])fill([[:space:]]|$)"
>> gsave = "(^|[[:space:]])gsave([[:space:]]|$)"
>> grestore = "(^|[[:space:]])grestore([[:space:]]|$)"
>> }
>>
>> function buf(lead, rc, a, f, nl) {
>> rc = lead
>> nl = "\n"
>> f = 0
>> while (getline a > 0) {
>
> IIRC the above is ambiguous and you need to write
>
> while ( (getline a) > 0 ) {
>
> instead.

You are certainly confusing that with the print/printf commands.

>
> Not that I'm suggesting using getline for this would be a good idea of course
> since, among other things, it'd make it harder to enhance the code later to do
> anything else like print all lines containing "foo" to file "bar" or count all
> lines containing "stuff" or.... (since you're introducing a 2nd "work loop"
> reading lines you'd have to duplicate any code you want to run on all lines -
> once in the normal body of the script and then again inside the getline loop)

You may have missed that Dave has written a recursive function to solve the
task. (A recursive function is the straightforward way to handle a task that
involves parsing recursively defined data as in the OP's case.) I don't see
how to write a recursive function with awk's implicit getline; if you think
that's possible you might want to elaborate.

It's worth mentioning, though, that the *mix* of explicit and implicit
getlines is not a reliable and easily extensible approach - if that's what
you were criticizing, the mix, then I'm with you -, but it's also unnecessary
since you can write a recursive solution without the implicit getlines (which
would, BTW, be the straightforward approach anyway).
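
A rough sketch of that shape, written in Python rather than awk: every read goes through one explicit iterator, and the recursion follows the nesting. The function names are invented for illustration, and it assumes gsave, grestore and fill each sit alone on a line.

```python
def read_level(lines, opener=None):
    """Read one nesting level from the iterator `lines`.

    Returns the lines to keep, or [] when this level is a gsave block
    that contains "fill" at its own depth.
    """
    kept = [] if opener is None else [opener]
    has_fill = False
    for line in lines:                  # explicit read from the shared iterator
        word = line.strip()
        if word == "gsave":
            kept.extend(read_level(lines, line))   # recurse into the nested block
        elif word == "grestore" and opener is not None:
            kept.append(line)
            break                       # this level is complete
        else:
            has_fill = has_fill or word == "fill"
            kept.append(line)
    return [] if (has_fill and opener is not None) else kept


def filter_ps(text_lines):
    """Filter a list of lines; iter() makes all levels share one read position."""
    return read_level(iter(text_lines))
```

Because there is only one reading loop, anything that has to happen for every line can be added in one place.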

Janis

Kaz Kylheku

Jul 14, 2017, 10:32:11 AM
On 2017-07-14, alb <al.b...@gmail.com> wrote:
> Kaz Kylheku <686-67...@kylheku.com> Wrote in message:
>> On 2017-07-13, alb <a...@notmymail.com> wrote:
>>> To add a little bit more to the fun, gsave and grestore can be nested with
>>> other gsave and grestore and this scenario should not break our ps.
>>
>> But which nesting is of interest?
>>
>> gsave
>> gsave
>> gsave
>> ...
>> fill
>> ...
>> grestore
>> grestore
>> grestore
>>
>> Here are three gsave/grestore blocks here which contain a fill.
>>
>> Is it required to remove all of them, or just the innermost one?
>>
>
> Indeed just the innermost.

There are lots of ways to attack this. I'm taking the approach of
parsing the structure into a simple "abstract syntax tree" (AST)
represented as a nested list data structure in Lisp, and walking
that structure to delete the unwanted blocks.

The tree structure is this:

- a line is represented as a character string
- the lines in the PS file are a list of character strings
- a gsave/grestore block, however, is a nested list.

Given these functions for printing the AST as the original syntax:

(defun print-ast-rec (ast)
  (cond
    ((stringp ast) (put-line ast))
    ((consp ast) (put-line "gsave")
                 [mapdo print-ast-rec ast]
                 (put-line "grestore"))))

(defun print-ast (ast)
  [mapdo print-ast-rec ast])

We can explore this interactively:

$ txr -i ast.tl
1> (print-ast '("a" "b" "c"))
a
b
c
nil
2> (print-ast '("a" "b" "c" ("d" "e") "f"))
a
b
c
gsave
d
e
grestore
f
nil
3> (print-ast '("a" "b" "c" ("d" "e") "f" ("g" ("h" "i"))))
a
b
c
gsave
d
e
grestore
f
gsave
g
gsave
h
i
grestore
grestore
nil

Here, nil is the result value of the evaluation, not part of the stream output.
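
For readers who do not know Lisp, the same printing step can be sketched in Python, with a nested Python list standing in for the AST. This is an illustrative port, not TXR code: render is a made-up name, and it returns the lines instead of printing them.

```python
def render(ast):
    """Flatten a nested-list AST back into lines of text.

    A string is an ordinary line; a nested list is a gsave/grestore block.
    """
    out = []
    for node in ast:
        if isinstance(node, str):
            out.append(node)
        else:
            out.append("gsave")
            out.extend(render(node))
            out.append("grestore")
    return out


print("\n".join(render(["a", "b", "c", ["d", "e"], "f", ["g", ["h", "i"]]])))
```

The print call reproduces the text of the third interactive example above.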

With this data structure, we can remove every gsave block which directly
contains "fill" by defining a function remove-fills which takes an AST
and returns a filtered AST:

(defun has-fill (ast)
  (and (consp ast) (find "fill" ast)))

(defun remove-fills-rec (ast)
  (cond
    ((atom ast) ast)
    ((has-fill ast) nil)
    (t (let ((ast-rm [remove-if has-fill ast]))
         [mapcar remove-fills-rec ast-rm]))))

(defun remove-fills (ast)
  (let ((ast-rm [remove-if has-fill ast]))
    [mapcar remove-fills-rec ast-rm]))

Interactive:

1> (remove-fills '("a" "b" "c" ("d" "e") "f" ("g" ("h" "i"))))
("a" "b" "c" ("d" "e") "f" ("g" ("h" "i")))
2> (remove-fills '("a" "b" "c" ("d" "fill" "e") "f" ("g" ("h" "i"))))
("a" "b" "c" "f" ("g" ("h" "i")))
3> (remove-fills '("a" "b" "c" ("d" "e") "f" ("g" "fill" ("h" "i"))))
("a" "b" "c" ("d" "e") "f")
4> (remove-fills '("a" "b" "c" ("d" "e") "f" ("g" ("h" "fill" "i"))))
("a" "b" "c" ("d" "e") "f" ("g"))
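
The same filtering can be sketched in Python on nested lists. This is an assumed equivalent written for illustration, not a translation checked against TXR; the names simply mirror the Lisp ones.

```python
def has_fill(node):
    """True for a block (a list) that has "fill" among its own lines."""
    return isinstance(node, list) and "fill" in node


def remove_fills(ast):
    """Return a copy of the AST without the blocks that directly contain "fill"."""
    return [remove_fills(node) if isinstance(node, list) else node
            for node in ast
            if not has_fill(node)]


print(remove_fills(["a", "b", "c", ["d", "e"], "f", ["g", ["h", "fill", "i"]]]))
```

Only the block that holds "fill" at its own level goes away; here the enclosing block survives as ['g'].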

Now we just need a parser for the file format which produces the AST.
One way to do it is via procedural list building using the build macro,
and some recursion:

(defun parse ()
  (build
    (whilet ((next-line (get-line)))
      (casequal next-line
        ("grestore" (return))
        ("gsave" (add (parse)))
        (t (add next-line))))))

Interactive test:

1> (parse)
a
b
c
("a" "b" "c")
2> (parse)
a
gsave
b
grestore
c
("a" ("b") "c")
3> (parse)
a
gsave
gsave
b
gsave
c
grestore
grestore
d
grestore
("a" (("b" ("c")) "d"))
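
A Python sketch of the same recursive-descent idea, for comparison. It takes a list of lines instead of reading standard input, and the names are assumptions made for this illustration.

```python
def parse(lines):
    """Turn a flat list of lines into a nested-list AST.

    Each gsave opens a nested list and each grestore closes it. As in the
    Lisp version, a grestore with no matching gsave simply ends the parse.
    """
    it = iter(lines)

    def block():
        nodes = []
        for line in it:                 # all levels share the same iterator
            if line == "grestore":
                break
            if line == "gsave":
                nodes.append(block())
            else:
                nodes.append(line)
        return nodes

    return block()


print(parse(["a", "gsave", "b", "grestore", "c"]))
```

The comparison is exact, so a line with extra whitespace or other tokens around gsave would be treated as an ordinary line.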

To solve the overall problem now, we have all the pieces. We just need
to tie them together with this simple expression:

(print-ast (remove-fills (parse)))

The complete solution:

;; Printing

(defun print-ast-rec (ast)
  (cond
    ((stringp ast) (put-line ast))
    ((listp ast) (put-line "gsave")
                 [mapdo print-ast-rec ast]
                 (put-line "grestore"))))

(defun print-ast (ast)
  [mapdo print-ast-rec ast])

;; Remove gsave blocks containing fill

(defun has-fill (ast)
  (and (consp ast) (find "fill" ast)))

(defun remove-fills-rec (ast)
  (cond
    ((atom ast) ast)
    ((has-fill ast) nil)
    (t (let ((ast-rm [remove-if has-fill ast]))
         [mapcar remove-fills-rec ast-rm]))))

(defun remove-fills (ast)
  (let ((ast-rm [remove-if has-fill ast]))
    [mapcar remove-fills-rec ast-rm]))

;; Parse input stream

(defun parse ()
  (build
    (whilet ((next-line (get-line)))
      (casequal next-line
        ("grestore" (return))
        ("gsave" (add (parse)))
        (t (add next-line))))))

;; parse, remove fills, output:

(print-ast (remove-fills (parse)))

On your simple test case, the run looks like this:

$ txr gsave.tl < data
newpath
100 100 moveto
0 100 rlineto
100 0 rlineto
0 -100 rlineto
-100 0 rlineto
closepath
1 0 0 setrgbcolor
4 setlinewidth
stroke

A case with nesting, typed from tty:

$ txr gsave.tl
gsave
x
y
z
gsave
a
b
gsave
1
fill
2
grestore
c
grestore
grestore
[Ctrl-D][Enter]
gsave
x
y
z
gsave
a
b
c
grestore
grestore

Another case:

$ txr gsave.tl
gsave
gsave
a
b
c
grestore
fill
grestore
[Ctrl-D][Enter]

In this case there is no output; everything is deleted since the whole file is
a gsave block and it contains a fill at the top level.

Dave Sines

Jul 14, 2017, 10:48:39 AM
Ed Morton <morto...@gmail.com> wrote:
> On 7/14/2017 7:15 AM, Dave Sines wrote:

>> while (getline a > 0) {
>
> IIRC the above is ambiguous and you need to write
>
> while ( (getline a) > 0 ) {

You may be thinking of reading input from a file:

while (getline a < 0) {

Kaz Kylheku

Jul 14, 2017, 10:56:04 AM
On 2017-07-14, alb <al.b...@gmail.com> wrote:
> Kaz Kylheku <686-67...@kylheku.com> Wrote in message:
>> On 2017-07-13, alb <a...@notmymail.com> wrote:
>>> To add a little bit more to the fun, gsave and grestore can be nested with
>>> other gsave and grestore and this scenario should not break our ps.
>>
>> But which nesting is of interest?
>>
>> gsave
>> gsave
>> gsave
>> ...
>> fill
>> ...
>> grestore
>> grestore
>> grestore
>>
>> Here are three gsave/grestore blocks here which contain a fill.
>>
>> Is it required to remove all of them, or just the innermost one?
>>
>
> Indeed just the innermost.

Now here is a "crazy" solution, using regular expressions.

In TXR, we have the regular expression "complement/negation" operator
which is spelled ~ in the regex syntax: ~foo means match all strings
except for the string "foo". We also have the "intersection/match-both"
operator, written &.

Thanks to these we can have a solution which takes the entire input as a
giant string and iterates on it, removing occurrences of gsave ... fill
... grestore which do NOT contain gsave or grestore.

Ready?

The whole solution:

(defvar regex #/gsave\n(([^\n]*\n)*fill\n([^\n]*\n)*&~.*(gsave|grestore).*)grestore\n/)

(let ((whole-input (get-string)))
  (while t
    (let ((try-remove (regsub regex "" whole-input)))
      (when (equal try-remove whole-input)
        (put-string whole-input)
        (return))
      (set whole-input try-remove))))



The regex basically has this overall pattern:

PRE(INNER&~NOTTHIS)POST

That is, match text delimited by PRE and POST, with some INNER thing in between,
but not containing NOTTHIS. We can read &~ as "and not".

PRE and POST are the line matches "gsave\n" and "grestore\n".

INNER is ([^\n]*\n)*fill\n([^\n]*\n)*: a match for any mixture of lines
containing at least one "fill\n" line.

NOTTHIS is .*(gsave|grestore).*: the set of all strings containing
a submatch for either gsave or grestore. The complement of this is all
other strings, i.e. the set of all strings not containing such a submatch,
and that is what we match.

Ed Morton

Jul 14, 2017, 11:07:33 AM
Yes, I think I'm thinking of:

while ( (getline a < file) > 0 ) {

Per POSIX: The getline operator can form ambiguous constructs when there are
unparenthesized binary operators (including concatenate) to the right of the '<'
(up to the end of the expression containing the getline). The result of
evaluating such a construct is unspecified, and conforming applications shall
parenthesize properly all such usages.

Ed.

Ed Morton

Jul 14, 2017, 11:12:55 AM
No, just a slightly different flavor of getline call:

while ( (getline a < file) > 0 ) {

>>
>> Not that I'm suggesting using getline for this would be a good idea of course
>> since, among other things, it'd make it harder to enhance the code later to do
>> anything else like print all lines containing "foo" to file "bar" or count all
>> lines containing "stuff" or.... (since you're introducing a 2nd "work loop"
>> reading lines you'd have to duplicate any code you want to run on all lines -
>> once in the normal body of the script and then again inside the getline loop)
>
> You may have missed that Dave has written a recursive function to solve the
> task. (A recusive function is the straigtforward way to handle a task that
> involves parsing recusively defined data as in the OP's case.) I don't see
> how to write a recursive function with awk's implicit getline; if you think
> that's possible you might want to elaborate.
>
> It's noteworthy to mention, though, that the *mix* of explicit and implicit
> getline's is not a reliable and well extensible approach - if that's what
> you were criticizing, the mix, then I'm with you -, but it's also unnecessary
> since you can write a recursive solution without the implicit getlines (which
> would, BTW, be the straightforward approach anyway).

I was just commenting that it's not necessary or useful to use getline to solve
this problem and it would make potential future enhancements more difficult.

Ed.

Janis Papanagnou

Jul 14, 2017, 12:09:59 PM
On 14.07.2017 17:12, Ed Morton wrote:
> On 7/14/2017 9:12 AM, Janis Papanagnou wrote:
>> On 14.07.2017 15:24, Ed Morton wrote:
>>> On 7/14/2017 7:15 AM, Dave Sines wrote:
>>>> alb <a...@notmymail.com> wrote:
>>>>
>>>> while (getline a > 0) {
>>>
>>> IIRC the above is ambiguous and you need to write
>>>
>>> while ( (getline a) > 0 ) {
>>>
>>> instead.
>>
>> You are certainly confusing that with the print/printf commands.
>
> No, just a slightly different flavor of getline call:
>
> while ( (getline a < file) > 0 ) {

A similar construct _without_ parentheses is even given in A., K., & W.'s book;
IOW, you obviously don't need the parentheses here, as I said.

while ( getline a < "file" > 0 ) # literal file name
while ( getline a < file > 0 ) # file name in a variable


>
>>>
>>> Not that I'm suggesting using getline for this would be a good idea of course
>>> since, among other things, it'd make it harder to enhance the code later to do
>>> anything else like print all lines containing "foo" to file "bar" or count all
>>> lines containing "stuff" or.... (since you're introducing a 2nd "work loop"
>>> reading lines you'd have to duplicate any code you want to run on all lines -
>>> once in the normal body of the script and then again inside the getline loop)
>>
>> You may have missed that Dave has written a recursive function to solve the
>> task. (A recusive function is the straigtforward way to handle a task that
>> involves parsing recusively defined data as in the OP's case.) I don't see
>> how to write a recursive function with awk's implicit getline; if you think
>> that's possible you might want to elaborate.
>>
>> It's noteworthy to mention, though, that the *mix* of explicit and implicit
>> getline's is not a reliable and well extensible approach - if that's what
>> you were criticizing, the mix, then I'm with you -, but it's also unnecessary
>> since you can write a recursive solution without the implicit getlines (which
>> would, BTW, be the straightforward approach anyway).
>
> I was just commenting that it's not necessary or useful to use getline to
> solve this problem and it would make potential future enhancements more
> difficult.

Yes, and I've disagreed with your statements. Recursive structures are best
handled by recursive algorithms. And using recursive functions makes it
impossible to use the built-in getline loop; you have to use an explicit
getline instead, as you'd do in other programming languages that support
recursive functions, which is most languages nowadays. Extending a recursive
algorithm is easy as long as you still work on the same recursive data structures.

Janis

> [...]


Kenny McCormack

Jul 14, 2017, 12:51:36 PM
In article <okaqch$ltv$1...@news-1.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...
>>> You are certainly confusing that with the print/printf commands.
>>
>> No, just a slightly different flavor of getline call:
>>
>> while ( (getline a < file) > 0 ) {
>
>A similar construct _without_ parenthesis is even given in A.,K.,&W.'s book,
>or IOW; obviously you don't need the parenthesis here, as I said.
>
> while ( getline a < "file" > 0 ) # literal file name
> while ( getline a < file > 0 ) # file name in a variable

As Dave pointed out, the one ambiguous case is:

% gawk 'BEGIN { print getline a < 0;print }'
-1

%

Now do: touch 0
and re-run the above command.
Or, parenthesize "getline a".

Although I haven't tested them all (only some of them!), I'm pretty sure
that all the other cases resolve themselves (i.e., are unambiguous).

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Aspergers

alb

Jul 14, 2017, 4:26:33 PM
Hi Kaz,

Kaz Kylheku <686-67...@kylheku.com> wrote:
> On 2017-07-14, alb <al.b...@gmail.com> wrote:
>> Kaz Kylheku <686-67...@kylheku.com> Wrote in message:
[]
>
> There are lots of ways to attack this. I'm taking the approach of
> parsing the structure to a simple "abstract syntax tree" (AST)
> represented as a nested list data structure in Lisp, and walking
> that structure to delete it.
[]

Wow, I've always told myself I should learn LISP one day. Since I know
nothing about LISP I cannot judge the code, but the algorithm you described
is actually pretty straightforward and I appreciated the ride!
And I suppose that if I have two 'character strings' to remove from the blocks,
then I can simply add another remove-somethingelse function and combine them
in a final call like this:

(print-ast (remove-fills (remove-somethingelse (parse))))

Correct?

I'm extremely interested in learning LISP, would you recommend a starting point
for learning?

Anyway, thanks again!

Al

Kaz Kylheku

Jul 16, 2017, 10:52:09 AM
On 2017-07-14, alb <a...@notmymail.com> wrote:
> And I suppose that if I have two 'characters strings' to remove from the blocks
> thand I can simply add another remove-somethingelse function and combine the
> have a final call like this:
>
> (print-ast (remove-fills (remove-somethingelse (parse))))
>
> Correct?

Yes; of course. There is also syntactic sugar in TXR for expressing
pipelining in a linear left-to-right way, but that is a more advanced topic.
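
The composition idea carries over to any language with first-class functions. Here is a small Python sketch on nested lists, offered as an illustration under assumed names (remove_blocks, remove_all), with the target string passed as a parameter instead of writing one function per string:

```python
from functools import reduce


def remove_blocks(ast, target):
    """Drop every block (nested list) that has `target` among its own lines."""
    return [remove_blocks(node, target) if isinstance(node, list) else node
            for node in ast
            if not (isinstance(node, list) and target in node)]


def remove_all(ast, targets):
    """Apply remove_blocks once per target, feeding each result to the next."""
    return reduce(remove_blocks, targets, ast)


ast = ["a", ["x", "fill"], ["y", "clip"], ["z"]]
print(remove_all(ast, ["fill", "clip"]))   # same as nesting the two calls by hand
```

Each pass returns a new list, so the passes can be nested or chained without side effects on the input.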

> I'm extremely interested in learning LISP, would you recommend a starting point
> for learning?

These examples are in a dialect called TXR Lisp, which I developed
myself, in combination with a whole-document pattern-based extraction
language. It's a tool geared toward modern scripting. The home page
is http://www.nongnu.org/txr/ .

In the Lisp world, the two mainstream dialects, much more widely
known and used, are ANSI Common Lisp ("CL") and Scheme.

There are lots of resources for learning Lisp. Numerous tutorials; forums
(comp.lang.lisp, reddit/{lisp,learnlisp}); books, some of which
are free, and so on.