Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Lines between last START and next FINISH, exclusively

40 views
Skip to first unread message

Robert Mesibov

unread,
Dec 25, 2015, 12:04:55 AM12/25/15
to
Hi and best wishes for the 2015-2016 holiday season! I have a holiday brain-buster. While writing an explainer for how AWK uses flags, I used a file similar to this, called 'test', in order to show how two flags might be used in one command:

$ cat test
no1
START
START
yes1
FINISH
FINISH
no2
no3
START
no4
START
yes2
yes3
FINISH
no5

To get just the 'yes*' lines between START and FINISH (that is, lines between a last START and a next FINISH) for this particular file:

$ awk '/FINISH/ {f=g=0}; f && g; /START/ && !f {f=1; next}; /START/ && f {g=1}' test
yes1
yes2
yes3

This won't work if there's another START, say after 'no3' in 'test'.

$ cat testA
no1
START
START
yes1
FINISH
FINISH
no2
no3
START
START
no4
START
yes2
yes3
FINISH
no5

$ awk '/FINISH/ {f=g=0}; f && g; /START/ && !f {f=1; next}; /START/ && f {g=1}' testA
yes1
no4
START
yes2
yes3

Is there a _general_ solution to extracting the lines between a last START and a next FINISH, if those lines exist? Can this be done with a single pass through the file in AWK?

I haven't been able to think of a such a single-pass AWK solution, so I've been fiddling (unsuccessfully) with a part-AWK workaround. The plan is to convert the file to a single line:

$ paste -s -d ' ' testA
no1 START START yes1 FINISH FINISH no2 no3 START START no4 START yes2 yes3 FINISH no5

then replace START [anything but FINISH] START with a single START, then convert back to multi-line, then do a simple 1-flag AWK pass:

awk '/START/ {f=1; next} /FINISH/ {f=0} f'

No luck so far. Any thoughts?

Joe User

unread,
Dec 25, 2015, 1:53:02 AM12/25/15
to
Robert Mesibov wrote:

> Is there a _general_ solution to extracting the lines between a last START
> and a next FINISH, if those lines exist? Can this be done with a single
> pass through the file in AWK?

Some constraints of the problem are not completely specified, but this should
work (with gawk and linux record seperators; untested):

BEGIN { RS = '^$' } # Read entire files
# Isolate lines between first START and last FINISH
match($0, /(^|\n)START\n(.*)\nFINISH(\n|$)/, a) {
# Isolate text after last START
s = gensub(/.*(^|\n)START\n/, "", 1, a[2])
# Isolate text before first FINISH after last START
s = gensub(/(^|\n)FINISH(\n|$).*/, "", 1, s)
print s
}


Luuk

unread,
Dec 25, 2015, 7:44:21 AM12/25/15
to
On 25-12-15 06:04, Robert Mesibov wrote:
> Hi and best wishes for the 2015-2016 holiday season!
> I have a holiday brain-buster.
> While writing an explainer for how AWK uses flags,
> I used a file similar to this, called 'test',
>
>
> $ cat test
> no1
> START
> START
> yes1
> FINISH
> FINISH
> no2
> no3
> START
> no4
> START
> yes2
> yes3
> FINISH
> no5
>

> Is there a _general_ solution to extracting the lines between a last START and a next FINISH, if those lines exist? Can this be done with a single pass through the file in AWK?


$ awk '/START/{ c++; next }/FINISH/{ c--; next }c!=0{ print $0 }' test
yes1
no4
yes2
yes3
no5


Kenny McCormack

unread,
Dec 25, 2015, 8:31:05 AM12/25/15
to
In article <567d3a23$0$23845$e4fe...@news.xs4all.nl>,
Luuk <lu...@invalid.lan> wrote:
>On 25-12-15 06:04, Robert Mesibov wrote:
...
>$

While OP's intent is certainly unclear - it could mean just about anything -
it seems clear based on the choice of "yes" and "no", that the desired
output must be:

yes1
yes2
yes3

The question then is how to get there. So, your solution must be wrong.
I'm not sure about why your program generates "no4", but it seems clear to
me that outputting "no5" /has/ to be wrong.

Now, you can fix the "no5" problem by simplifying your code to:

/START/ { c=1; next }
/FINISH/ { c=0; next }
c

but that still leaves "no4" in the output. It is beginning to look to me
like the answer to OP's question is "no" - that it is not possible to do
this without, in some sense, passing the file multiple times. That is,
either by reading the whole file into memory (as the first responder did),
or by explicitly passing it twice [but see below - *]. There doesn't seem
to be any way to avoid outputting "no4" until you see that there's another
"START" coming up with no intervening "FINISH".

[*] Of course, you probably could avoid reading the whole file either into
memory or twice by setting up a small window that would only buffer up
what's necessary in order to implement the algorithm. Coding this is left
as an exercise for the OP.

--
Ted Cruz sounds like every straight man's first wife.

Ted Cruz is such a closet case his first name should have been Tom.
(Note: The "Ted" moniker is a fictional creation)

Luuk

unread,
Dec 25, 2015, 8:44:42 AM12/25/15
to
The 'unclearness' is mostly about how to handle nested START/FINISH
situations

Your simplification will not handle nested START/FINISH

My solution does not complain about a missing FINISH matching the START
just before 'no4'


The line 'no5' is in the START/FINISH which starts at line#9, but does
not finish because of a missing FINISH.... ;)

$ nl test
1 no1
2 START
3 START
4 yes1
5 FINISH
6 FINISH
7 no2
8 no3
9 START
10 no4
11 START
12 yes2
13 yes3
14 FINISH
15 no5


pk

unread,
Dec 25, 2015, 9:35:26 AM12/25/15
to
On Thu, 24 Dec 2015 21:04:53 -0800 (PST), Robert Mesibov
<robert....@gmail.com> wrote:

> Hi and best wishes for the 2015-2016 holiday season! I have a holiday
> brain-buster. While writing an explainer for how AWK uses flags, I used a
> file similar to this, called 'test', in order to show how two flags might
> be used in one command:
>
> $ cat test
> no1
> START
> START
> yes1
> FINISH
> FINISH
> no2
> no3
> START
> no4
> START
> yes2
> yes3
> FINISH
> no5

IIUC, this is trivially done as follows:

awk '
/START/{ inb = 1; b = s = ""; next }
/FINISH/ && inb { inb = 0; print b; b = s = ""; next }
inb { b = b s $0; s = RS }
'

Ed Morton

unread,
Dec 25, 2015, 10:59:41 AM12/25/15
to
The general solution is to use a buffer of whatever lines are between the last
START and the subsequent FINISH, clear it on START and print it on FINISH:

$ cat tst.awk
/FINISH/ { if (f) printf "%s", buf; f=0 }
f { buf = buf $0 ORS }
/START/ { buf=""; f=1 }

$ awk -f tst.awk file
yes1
yes2
yes3

Regards,

Ed.

Ed Morton

unread,
Dec 25, 2015, 11:04:32 AM12/25/15
to
That would print a blank line if there was a START line immediately followed by
a FINISH line.

Ed.

Robert Mesibov

unread,
Dec 25, 2015, 3:36:17 PM12/25/15
to
Many thanks Ed and pk! Those do look like general solutions.

Apologies if I wasn't perfectly clear. You can think of the problem this way: there's a pool of strings from which you can randomly draw a list of lines, one string per line. The pool contains 'START', 'FINISH' and a bunch of other strings not containing 'START' or 'FINISH'.


Any list of lines drawn from this pool which contains one or more 'START' and one or more 'FINISH' will always have somewhere in it one or more sub-lists with START followed by zero or more non-START, non-FINISH lines followed by FINISH, unless the first START in the list follows the last FINISH in the list.

The problem was to use AWK in a single pass to extract all such sub-lists, minus the starting START and the ending FINISH.

Ed Morton

unread,
Dec 25, 2015, 3:54:44 PM12/25/15
to
I thought I understood before but I now don't have the faintest idea what you're
talking about. Please provide sample input and expected output that demonstrates
in what way this solution:

/FINISH/ { if (f) printf "%s", buf; f=0 }
f { buf = buf $0 ORS }
/START/ { buf=""; f=1 }

does not solve your problem.

Ed.

Robert Mesibov

unread,
Dec 25, 2015, 4:02:26 PM12/25/15
to
Your solution _does_ solve the problem, and you did understand what I meant by 'general solution'.

What you didn't understand in my reply was my mathematical way of explaining what the universe of possible cases would look like. Sorry for that, and thanks again.



Kaz Kylheku

unread,
Dec 26, 2015, 12:01:11 PM12/26/15
to
What you were trying to explain is that there is no need to handle the
START/FINISH nesting, because what you're trying to extract is
a subset recognizable by a regular language.

It's a parenthesization problem. If we want to know whether something
like this is balanced:

((a()(((b)()))c)(de))

we can do that with a counter, or a push-down automaton or recursion.
But if we just want to extract zero or more non-parentheses between
parentheses--namely the (), (b) and (de), the (b) and (de)--we can
completely ignore the nesting aspect of the syntax and just scan for
matches for the regular pattern "[(][^)]*[)]": an opening paren followed
by zero or more non-parens, followed by a closing paren.

Robert Mesibov

unread,
Dec 26, 2015, 3:52:12 PM12/26/15
to
> It's a parenthesization problem. If we want to know whether something
> like this is balanced:
>
> ((a()(((b)()))c)(de))
>
> we can do that with a counter, or a push-down automaton or recursion.
> But if we just want to extract zero or more non-parentheses between
> parentheses--namely the (), (b) and (de), the (b) and (de)--we can
> completely ignore the nesting aspect of the syntax and just scan for
> matches for the regular pattern "[(][^)]*[)]": an opening paren followed
> by zero or more non-parens, followed by a closing paren.

I suppose that's what I was trying to do with my workaround, as I wrote

"then replace START [anything but FINISH] START with a single START"

which is a negative lookahead, but I haven't yet got a negative lookahead to work for me.
0 new messages