question: grouping regex

mss

Jan 15, 2011, 8:39:04 AM

In gawk, I'm trying to reduce this:

while (($0 ~ /^[[:blank:]]+/) || ($0 !~ /^./)) {...}

To something like:

while ($0 ~ /^([[[:blank:]]|^.])+/) {...}

What am I doing wrong?

--
later on,
Mike

http://www.topcat.hypermart.net/index.html

Janis Papanagnou

Jan 15, 2011, 8:49:11 AM

On 15.01.2011 14:39, mss wrote:
> In gawk, I'm trying to reduce this:
>
> while (($0 ~ /^[[:blank:]]+/) || ($0 !~ /^./)) {...}

What do you plan to achieve to match with /^./ ?

>
>
> To something like:
>
> while ($0 ~ /^([[[:blank:]]|^.])+/) {...}

Three '[' - what do you think that means?

>
>
> What am I doing wrong?
>

Depends on what you want to do.

Janis

mss

Jan 15, 2011, 8:59:41 AM

Janis Papanagnou wrote:

> What do you plan to achieve to match with /^./ ?

Detect an empty line.

> Three '[' - what do you think that means?

Two groupings...

[[[:group1:]]|^group2]

But because the posix class [[:blank:]] already has square
brackets, I'm unsure how to group all together...

Kenny McCormack

Jan 15, 2011, 9:19:37 AM

In article <igs8u3$4vt$1...@news.m-online.net>,

Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 15.01.2011 14:39, mss wrote:
>> In gawk, I'm trying to reduce this:
>>
>> while (($0 ~ /^[[:blank:]]+/) || ($0 !~ /^./)) {...}
>
>What do you plan to achieve to match with /^./ ?

It is an alternative to "length".

E.g., you could do:

while (($0 ~ /^[[:blank:]]+/) || !length)

(Yes, GAWK still recognizes "length" alone as equivalent to length($0))

I think what OP is really going for is:

while (!NF)

or maybe:

while ($0 ~ /^[[:blank:]]*$/)

--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.

- John Kenneth Galbraith -
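
The two alternatives above are not quite the same test as the original condition: `!NF` and `/^[[:blank:]]*$/` match lines that are empty or whitespace-only, while the original also accepts a line with leading whitespace followed by text. A minimal sketch to see the difference, assuming an awk with POSIX character classes (gawk has them); the sample input is made up:

```sh
# Made-up input: an empty line, a spaces-only line, an indented line, plain text.
# Columns: line number, !NF, whitespace-only regex, "leading blank or empty" regex.
out=$(printf '\n   \n text\nword\n' |
  awk '{ print NR, (!NF), ($0 ~ /^[[:blank:]]*$/), ($0 ~ /^([[:blank:]]|$)/) }')
printf '%s\n' "$out"
```

Only the third column reports 1 for the indented line " text".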

mss

Jan 15, 2011, 9:30:06 AM

Kenny McCormack wrote:

>>What do you plan to achieve to match with /^./ ?
>
> It is an alternative to "length".

Yes, that's it Kenny.

Here's the written definition of the problem:

'while $0 contains leading whitespace OR $0 has no length then loop'

this also works fine:

while (($0 ~ /^[[:blank:]]+/) || (! $0)) {...}

Just seeing how I could shorten it with regex.

pk

Jan 15, 2011, 9:17:38 AM

On Sat, 15 Jan 2011 14:30:06 +0000 (UTC)
mss <m...@dev.null> wrote:

> Kenny McCormack wrote:
>
> >>What do you plan to achieve to match with /^./ ?
> >
> > It is an alternative to "length".
>
> Yes thats it Kenny.
>
> Here's the written definition of the problem:
>
> 'while $0 contains leading whitespace OR $0 has no length then loop'
>
> this also works fine:
>
> while (($0 ~ /^[[:blank:]]+/) || (! $0)) {...}
>
> Just seeing how I could shorten it with regex.

Well, maybe too obvious, but of course

while (/^([[:blank:]]+|$)/) { ... }
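
A quick way to try this pattern outside a loop. This is a minimal sketch with made-up sample lines, assuming an awk that supports POSIX character classes:

```sh
# Four made-up lines: leading tab, leading space, empty, plain text.
out=$(printf '\tindented\n spaced\n\nplain\n' |
  awk '{ print NR ": " (/^([[:blank:]]+|$)/ ? "match" : "no match") }')
printf '%s\n' "$out"
```

Lines 1 to 3 are reported as "match" and line 4 as "no match".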

mss

Jan 15, 2011, 9:40:03 AM

pk wrote:

>> Here's the written definition of the problem:
>>
>> 'while $0 contains leading whitespace OR $0 has no length then loop'
>>
>> this also works fine:
>>
>> while (($0 ~ /^[[:blank:]]+/) || (! $0)) {...}
>>
>> Just seeing how I could shorten it with regex.
>
> Well, maybe too obvious, but of course
>
> while (/^([[:blank:]]+|$)/) { ... }

Spot on pk. In fact so obvious I missed it...
(seems I made the problem harder than it needed to be)

Thank you.

mss

Jan 15, 2011, 9:47:51 AM

pk wrote:

> while (/^([[:blank:]]+|$)/) { ... }

Noticed too that this regex implies a default
of $0, since it's not explicitly stated, just
like a bare 'length' or 'print' (as Kenny noted)
will default to using $0.

I'm sure there's others as well...
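
The $0 defaults can be checked side by side. A minimal sketch with made-up input, assuming an awk with POSIX character classes; the variable names are only for illustration:

```sh
# For each line, compare the implicit-$0 forms with the explicit ones.
out=$(printf 'abc\n\n x\n' | awk '
  {
    bare_re  = /^[[:blank:]]/          # bare regex: shorthand for $0 ~ /.../
    full_re  = ($0 ~ /^[[:blank:]]/)   # the explicit form
    bare_len = length                  # bare length: shorthand for length($0)
    print NR, bare_re, full_re, bare_len, length($0)
  }')
printf '%s\n' "$out"
```

In each row the second and third columns agree, and so do the fourth and fifth.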

Ed Morton

Jan 15, 2011, 9:49:18 AM

On 1/15/2011 8:30 AM, mss wrote:
<snip>

> Here's the written definition of the problem:
>
> 'while $0 contains leading whitespace OR $0 has no length then loop'
>
> this also works fine:
>
> while (($0 ~ /^[[:blank:]]+/) || (! $0)) {...}

No, it doesn't.

a) !$0 would evaluate to true if $0 was the number zero which isn't what you want.
b) No need for "+" if 1 leading space is enough.
c) No need for "$0 ~".

while (/^([[:blank:]]|$)/) {...}

as pk pointed out should do it.

Ed.

Janis Papanagnou

Jan 15, 2011, 10:04:04 AM

On 15.01.2011 15:30, mss wrote:
> Kenny McCormack wrote:
>
>>> What do you plan to achieve to match with /^./ ?
>>
>> It is an alternative to "length".
>
> Yes thats it Kenny.
>
> Here's the written definition of the problem:
>
> 'while $0 contains leading whitespace OR $0 has no length then loop'

Aha. So probably this pattern will do

!/^[^[:blank:]]/

>
> this also works fine:
>
> while (($0 ~ /^[[:blank:]]+/) || (! $0)) {...}

BTW, this looks like you plan to use getline in {...}.
Not that good an idea.

Janis

Ed Morton

Jan 15, 2011, 10:04:31 AM

On 1/15/2011 7:59 AM, mss wrote:
> Janis Papanagnou wrote:
>
>> What do you plan to achieve to match with /^./ ?
>
> Detect an empty line.
>
>> Three '[' - what do you think that means?
>
> Two groupings...
>
> [[[:group1:]]|^group2]
>
> But because the posix class [[:blank:]] already has square
> brackets, I'm unsure how to group all together...
>

[[:blank:]] is NOT a class. [:blank:] is. [[:blank:]] is a character list that
contains the class [:blank:].

[:group1:] is a character class.
[...] is a character list. Anything within those operators (unless the first and
last character are ":") are characters, some with special meanings depending
where they appear in the list (e.g. ^ at the start of the list means negation).
group2 is the character $ in your specific example. Within a character list $
just means the character $, it does not mean the "end of string" RE operator.
[^group2] is a character list of all the characters except group2.

[[:group1:]] is a character list of the characters in the character class [:group1:]
[[:group1:]$] is a character list of the characters in the character class
[:group1:] plus the character $

There is no way to mid-way through a character list throw in the negation
operator (^). That MUST appear at the start of the character list, if it appears
elsewhere in the character list then it just means the character ^, not the
negation operator. Inside a character list "^" NEVER means the start-of-string
RE operator, just like "$" never means the end-of-string RE operator.

So, [[:group1:]^$] is a character list of the characters in the character class
[:group1:] plus the character ^ plus the character $.

[[[:group1:]^group2]] is a character list of the character [ plus the characters
in the character class [:group1:] plus the character ^ plus the character $,
followed by the character ] outside the list, because the first ] after the
class ends the list.

To describe "a character class OR a negated set of characters" you need to use
an RE, not a single character list:

([[:group1:]]|[^group2])

but if group2 is really "$" and you want "$" to represent the end-of-string RE
operator and "^" to represent the start-of-string operator then you need:

([[:group1:]]|^$)

Regards,

Ed.
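
The difference between "^" and "$" inside a character list and the same characters used as anchors can be seen with a small test. A minimal sketch with made-up input, assuming an awk with POSIX character classes; the variable names are illustrative only:

```sh
# Made-up lines: one with a literal ^, one with a literal $, plain text,
# an empty line, and a line with a leading tab.
out=$(printf 'a^b\ncost$\nplain\n\n\tx\n' | awk '
  {
    in_list  = ($0 ~ /[[:blank:]^$]/)      # list: a blank, a literal ^ or a literal $
    anchored = ($0 ~ /^([[:blank:]]|$)/)   # anchors: leading blank, or empty line
    print NR, in_list, anchored
  }')
printf '%s\n' "$out"
```

The character list finds the literal ^ and $ in lines 1 and 2 but not the empty line 4; the anchored alternation does the opposite.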

mss

Jan 15, 2011, 10:04:59 AM

Ed Morton wrote:

>> this also works fine:
>>
>> while (($0 ~ /^[[:blank:]]+/) || (! $0)) {...}
>
> No, it doesn't.
>
> a) !$0 would evaluate to true if $0 was the number zero which isn't what you want.

Er... huh? I mean I understand what you're saying,
but how could that happen in the data?

Just a single zero on a line by itself?

mss

Jan 15, 2011, 10:13:20 AM

Janis Papanagnou wrote:

> BTW, this looks like you plan to use getline in {...}.
> Not that good an idea.

Chuckle it does use getline (very good guess Janis).

*But how do I avoid it?*

Here's the whole script, please critique if you want to:

BEGIN {RS="\n"; TAG = trim(tolower(TAG))}

TAG == "-i" {gettags(); next}

tagscan() {readblock()}

# ---------------------------------------------------------------------------

function gettags() {if (! /^([[:blank:]]|$)/) {print}}

# ---------------------------------------------------------------------------

function tagscan( x, y, cell) {

    if (/^[[:blank:]]/) {return 0}

    x = split(tolower($0), cell, ",")

    for (y = 1; y <= x; y++) {
        if (trim(cell[y]) == TAG) {return 1}
    }

    return 0
}

# ---------------------------------------------------------------------------

function readblock( x) {

    x = 1

    do {
        print (x ? "Tags:" : ""), $0; x = 0
        if (! getline) {break}
        if (tagscan($0)) {readblock()}
    } while (/^([[:blank:]]|$)/)
}

# ---------------------------------------------------------------------------

function trim(s) {return rtrim(ltrim(s))}
function ltrim(s) {sub(/^[[:blank:]]+/, "", s); return s}
function rtrim(s) {sub(/[[:blank:]]+$/, "", s); return s}

# eof

Ed Morton

Jan 15, 2011, 10:13:40 AM

On 1/15/2011 9:04 AM, Janis Papanagnou wrote:
> On 15.01.2011 15:30, mss wrote:
>> Kenny McCormack wrote:
>>
>>>> What do you plan to achieve to match with /^./ ?
>>>
>>> It is an alternative to "length".
>>
>> Yes thats it Kenny.
>>
>> Here's the written definition of the problem:
>>
>> 'while $0 contains leading whitespace OR $0 has no length then loop'
>
> Aha. So probably this pattern will do
>
> !/^[^[:blank:]]/

Of course!

This is yet another example of how once someone posts a type of solution (In
this case "(($0 ~ /^[[:blank:]]+/) || ($0 !~ /^./))") and asks how to improve on
it, it gets us focused on improving that solution rather than just thinking
about what we'd do to solve the requirements. We typically want people to post
what they've attempted but it sometimes just derails the thread. Good thing
there's enough of us around it usually gets back on track eventually - the
benefit of posting in a topical newsgroup. Nice recovery Janis!

Ed.

Ed Morton

Jan 15, 2011, 10:18:04 AM

On 1/15/2011 9:04 AM, mss wrote:
> Ed Morton wrote:
>
>>> this also works fine:
>>>
>>> while (($0 ~ /^[[:blank:]]+/) || (! $0)) {...}
>>
>> No, it doesn't.
>>
>> a) !$0 would evaluate to true if $0 was the number zero which isn't what you want.
>
> Er... huh? I mean I understand what you're saying,
> but how could that happen in the data?
>
> Just a single zero on a line by itself?
>

That'd be one way. Since you're testing $0 in a loop you must be modifying $0
within that loop or you'd never break out of it, so all it'd take to end up with
$0 numerically equal to "0" (so 0 or 000 or 0.0 or...) would be the result of
one of those operations stripping away everything else.

Ed.
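
The point about `!$0` can be checked directly. A minimal sketch with made-up input, relying on the usual awk rule that input which looks like a number is compared numerically:

```sh
# Columns: line number, !$0, and whether the line is really empty.
out=$(printf '0\n000\n0.0\n\nabc\n' |
  awk '{ print NR, !$0, (length($0) == 0) }')
printf '%s\n' "$out"
```

`!$0` is true for the first four lines, but only line 4 is actually empty, so a length-based or regex-based test is the safer way to detect an empty line.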

Janis Papanagnou

Jan 15, 2011, 10:22:18 AM

On 15.01.2011 16:13, Ed Morton wrote:
> On 1/15/2011 9:04 AM, Janis Papanagnou wrote:
>> On 15.01.2011 15:30, mss wrote:
>>> Kenny McCormack wrote:
>>>
>>>>> What do you plan to achieve to match with /^./ ?
>>>>
>>>> It is an alternative to "length".
>>>
>>> Yes thats it Kenny.
>>>
>>> Here's the written definition of the problem:
>>>
>>> 'while $0 contains leading whitespace OR $0 has no length then loop'
>>
>> Aha. So probably this pattern will do
>>
>> !/^[^[:blank:]]/
>
> Of course!
>
> This is yet another example of how once someone posts a type of solution (In
> this case "(($0 ~ /^[[:blank:]]+/) || ($0 !~ /^./))") and asks how to improve
> on it, it gets us focused on improving that solution rather than just thinking
> about what we'd do to solve the requirements.

Indeed.

Actually, that was the reason why I asked questions in my first posting of
the thread instead of guessing. Kenny courteously answered my question but
the intention (maybe too subtle) was rather to make the OP think about it.

> We typically want people to post
> what they've attempted but it sometimes just derails the thread. Good thing
> there's enough of us around it usually gets back on track eventually - the
> benefit of posting in a topical newsgroup. Nice recovery Janis!

Thanks.

Janis

>
> Ed.

Ed Morton

Jan 15, 2011, 10:25:15 AM

On 1/15/2011 9:13 AM, mss wrote:
> Janis Papanagnou wrote:
>
>> BTW, this looks like you plan to use getline in {...}.
>> Not that good an idea.
>
> Chuckle it does use getline (very good guess Janis).
>
> *But how do I avoid it?*
>
> Here's the whole script, please critique if you want to:

Rather than having us try to figure out what the script does, could you post a
brief paragraph telling us what it does along with a small set of sample input
and the output it produces from that input?

Ed.

mss

Jan 15, 2011, 10:26:20 AM

Janis Papanagnou wrote:

>> 'while $0 contains leading whitespace OR $0 has no length then loop'
>
> Aha. So probably this pattern will do
>
> !/^[^[:blank:]]/

Okay, just did some quick tests with it.
Works great.

That is very nice!
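
One way to run such a quick test is to evaluate the original condition and the two suggested replacements on the same lines. A minimal sketch with made-up input, assuming an awk with POSIX character classes; the variable names are illustrative:

```sh
# Made-up lines: leading tab, leading space, empty, plain word, a lone zero.
out=$(printf '\ttab\n space\n\nword\n0\n' | awk '
  {
    orig  = (($0 ~ /^[[:blank:]]+/) || ($0 !~ /^./))  # the original two-part test
    pk    = ($0 ~ /^([[:blank:]]+|$)/)                 # leading blank OR empty
    janis = !($0 ~ /^[^[:blank:]]/)                    # does NOT start with a non-blank
    print NR, orig, pk, janis
  }')
printf '%s\n' "$out"
```

All three columns agree on every line, including the lone "0" that trips up `!$0`.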

mss

Jan 15, 2011, 10:32:03 AM

Janis Papanagnou wrote:

> the intention (maybe too subtle) was rather to make the OP think about it.

Not too subtle, but consider:

By asking for help, it points me in the right direction.

Subtlety is sometimes not so good, because meanings can be
misconstrued - directness is better IMO.

Janis Papanagnou

Jan 15, 2011, 10:36:16 AM

On 15.01.2011 16:13, mss wrote:
> Janis Papanagnou wrote:
>
>> BTW, this looks like you plan to use getline in {...}.
>> Not that good an idea.
>
> Chuckle it does use getline (very good guess Janis).

Not so much a guess; some constructs would not appear if done right.
;-)

>
> *But how do I avoid it?*

Just make use of what's already in awk (awk's builtin loop and field
splitting), and let the data control the flow.

>
> Here's the whole script, please critique if you want to:

Refactoring a getline-based program written by someone else is mind
numbing. Sorry. You should avoid those getlines in the first place;
I'd start from scratch, from the requirements - but I haven't seen
any documentation in the code.

You should browse the Web; Ed Morton wrote a paper about the getline
topic, maybe it helps to understand the inherent problem, and you can
design your code appropriately.

Janis

>
>
> [snip code]

mss

Jan 15, 2011, 10:45:58 AM

Janis Papanagnou wrote:

> You should browse the Web; Ed Morton wrote a paper about the getline
> topic, maybe it helps to understand the inherent problem, and you can
> design your code appropriately.

I honestly don't understand sometimes...

- I don't want anyone to refactor anything, just add their
thoughts (if so compelled)...

- If getline is so bad then why is it included w/ awk to begin with?

Does anyone have a link to Ed's getline paper?

Ed Morton

Jan 15, 2011, 10:52:41 AM

On 1/15/2011 9:45 AM, mss wrote:
> Janis Papanagnou wrote:
>
>> You should browse the Web; Ed Morton wrote a paper about the getline
>> topic, maybe it helps to understand the inherent problem, and you can
>> design your code appropriately.
>
> I honestly don't understand sometimes...
>
> - I don't want anyone to refactor anything, just add their
> thoughts (if so compelled)...

But we'll mostly have the same thought - it needs to be refactored to get rid of
getline. There's several things we could point out off the bat (e.g. setting RS
unnecessarily, using that "x" variable in readblock() instead of just moving the
print outside the loop, etc.), but it's pointless when the script should be
refactored.

>
> - If getline is so bad then why is it included w/ awk to begin with?

It's not bad at all when used correctly but it has very specific uses. Using it
in the wrong context is bad, just like using anything else in the wrong context
is bad, but getline allows you to completely miss the point of using awk in the
first place so misusing it to stay in a procedural paradigm instead of learning
how to really use awk is REALLY bad.

>
> Does anyone have a link to Ed's getline paper?
>

http://awk.info/?tip/getline

Ed
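
To illustrate the non-getline style in general terms: awk reads every line itself, and a flag remembers whether the current block is wanted. This is a minimal sketch, not the OP's script; the `want` variable, the `inblock` flag and the sample data are assumptions, and the substring test via index() is deliberately naive:

```sh
# Header lines start in column 1; block lines start with a tab (made-up data).
out=$(printf 'peas, carrots\n\tblock A1\n\tblock A2\nbeans\n\tblock B1\n' |
  awk -v want='peas' '
    # Header line (starts with a non-blank): decide whether its block is wanted.
    /^[^[:blank:]]/ { inblock = (index($0, want) > 0); next }
    # Every other line belongs to the current block; print it only if wanted.
    inblock { print }
  ')
printf '%s\n' "$out"
```

No explicit read loop is needed: the state carried in `inblock` replaces the getline-driven inner loop.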

Kenny McCormack

Jan 15, 2011, 12:40:12 PM

In article <igsf6s$7vj$1...@news.m-online.net>,

Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 15.01.2011 16:13, mss wrote:
>> Janis Papanagnou wrote:
>>
>>> BTW, this looks like you plan to use getline in {...}.
>>> Not that good an idea.
>>
>> Chuckle it does use getline (very good guess Janis).
>
>Not so much a guess; some constructs would not appear if done right.
>;-)
>
>>
>> *But how do I avoid it?*
>
>Just make use of what's already in awk (awk's builtin loop and field
>splitting), and let the data control the flow.

I think that there is a basic divide in programming between the kind of
programming that we were taught in school was good and morally of high
standing vs. the "just throw it together" method. AWK lends itself to
the latter method, which can be a detriment to teaching to some kinds of
people (i.e., to their willingness to adopt it as a tool).

I speak from experience on this last topic - there have been people who
are so schooled in the "structured programming" ethic, that they just
can't wrap their brains around AWK's "automatic input loop" - because it
just smacks of "quick and dirty" (which is, of course, precisely the
reason why a lot of us love AWK)

That said, looking briefly at OP's code, it looks to me like if one's
coding ethic (whether it be self-imposed or imposed from without)
requires the use of meaningfully named functions for every little thing
(like we do in C and other such languages), then it becomes difficult to
use the automatic input loop.

Putting the whole program in BEGIN {} and using getline fits more in
with the programming paradigms of other languages.

--
Just for a change of pace, this sig is *not* an obscure reference to
comp.lang.c...

mss

Jan 15, 2011, 12:57:41 PM

Ed Morton wrote:

> But we'll mostly have the same thought - it needs to be refactored to get rid of
> getline. There's several things we could point out off the bat (e.g. setting RS
> unnecessarily, using that "x" variable in readblock() instead of just moving the
> print outside the loop, etc.), but it's pointless when the script should be
> refactored.
>
>>
>> - If getline is so bad then why is it included w/ awk to begin with?
>
> It's not bad at all when used correctly but it has very specific uses. Using it
> in the wrong context is bad, just like using anything else in the wrong context
> is bad, but getline allows you to completely miss the point of using awk in the
> first place so misusing it to stay in a procedural paradigm instead of learning
> how to really use awk is REALLY bad.
>
>>
>> Does anyone have a link to Ed's getline paper?
>>
>
> http://awk.info/?tip/getline

Okay, this is as concise as I can manage till I learn gawk better,
fire-away (please) if you find anything funky Ed...

# Michael S. Sanders 2011
#
# invocation: gawk --posix -v TAG='corn' -f program.awk data.file
#
# data format:
#
# 'tags' - are comma-delimited
# 'blocks' - begin with a tab
#
# example:
#
# corn, squash, beans...
#
# block 1
# block 2
# block n...
#
# isbn, title, author...
#
# block 1
# block 2
# block n...
#

BEGIN {TAG = trim(tolower(TAG))}

TAG == "-i" {gettags(); next}

s = scan()

s {p = p ? s : 0; if(p) {print}}

# ---------------------------------------------------------------------------

function gettags() {if (/^[^[:blank:]]/) {print}}

# ---------------------------------------------------------------------------

function scan( x, y, cell) {

if (s && (!/^[^[:blank:]]/)) {return 1}

x = split(tolower($0), cell, ",")

for (y = 1; y <= x; y++) {if (trim(cell[y]) == TAG) {return 1}}

return 0
}

# ---------------------------------------------------------------------------

function trim(s) {return rtrim(ltrim(s))}

function ltrim(s) {sub(/^[[:blank:]]+/, "", s); return s}
function rtrim(s) {sub(/[[:blank:]]+$/, "", s); return s}

# eof

--

mss

Jan 15, 2011, 1:06:46 PM

Kenny McCormack wrote:

> I think that there is a basic divide in programming between the kind of
> programming that we were taught in school was good and morally of high
> standing vs. the "just throw it together" method. AWK lends itself to
> the latter method, which can be a detriment to teaching to some kinds of
> people (i.e., to their willingness to adopt it as a tool).
>
> I speak from experience on this last topic - there have been people who
> are so schooled in the "structured programming" ethic, that they just
> can't wrap their brains around AWK's "automatic input loop" - because it
> just smacks of "quick and dirty" (which is, of course, precisely the
> reason why a lot of us love AWK)
>
> That said, looking briefly at OP's code, it looks to me like if one's
> coding ethic (whether it be self-imposed or imposed from without)
> requires the use of meaningfully named functions for every little thing
> (like we do in C and other such languages), then it becomes difficult to
> use the automatic input loop.
>
> Putting the whole program in BEGIN {} and using getline fits more in
> with the programming paradigms of other languages.

Mighty subjective post Kenny...

I do in fact understand that awk iterates over the data,
and to subvert that gums up the works.

But... (please just think about this)

Does different really mean 'evil' (I added that in defiance <grin>),
or 'thrown together', or 'ethics' or whatever, imply less than
genuine effort from someone just starting out?

C'mon guys, I'm seeking input & trying to learn (from you all),
no need to psychoanalyze the confounded thing - I just want to learn
awk, honest =)

mss

Jan 15, 2011, 1:26:27 PM

Ed Morton wrote:

Many thanks for the insight Ed, I missed this post the first time around
(sometimes posts arrive out of order on my end).

I'll add these tips to my knowledgebase ASAP.

mss

Jan 15, 2011, 4:40:29 PM

mss wrote:

> Okay, this is as concise as I can manage till I learn gawk better,
> fire-away (please) if you find anything funky Ed...

Ruther refined...

# ---------------------------------------------------------------------------
#
# Michael S. Sanders 2011
#
# invocation: gawk --posix -v TAG='peas' -f program.awk data.file
#
# data format:
#
# 'tags' - are comma-delimited
# 'blocks' - begin with a tab
#
# example:
#
# peas, carrots, tag n...
#
# block 1
# block 2
# block n...
#
# beans, squash, tag n...
#
# block 1
# block 2
# block n...
#
# ---------------------------------------------------------------------------

BEGIN{TAG = trim(tolower(TAG))}

# ---------------------------------------------------------------------------

TAG == "-i" {if (/^[^[:blank:]]/) {print}; next}

{
    if (s && /^([[:blank:]]|$)/) {
        print; next
    } else {
        x = split(tolower($0), cell, ",")
        for (y = 1; y <= x; y++) {
            if (trim(cell[y]) == TAG) {s = 1; print; next} else {s = 0}
        }
    }
}

# ---------------------------------------------------------------------------

function trim(s) {return rtrim(ltrim(s))}
function ltrim(s) {sub(/^[[:blank:]]+/, "", s); return s}
function rtrim(s) {sub(/[[:blank:]]+$/, "", s); return s}

# eof

mss

Jan 15, 2011, 5:01:16 PM

mss wrote:

> Ruther refined...

Umm... 'further'

Ed Morton

Jan 15, 2011, 5:04:05 PM

> if (s && /^([[:blank:]]|$)/) {
> print; next
> } else {
> x = split(tolower($0), cell, ",")
> for (y = 1; y <= x; y++) {
> if (trim(cell[y]) == TAG) {s = 1; print; next} else {s = 0}
> }
> }
> }
>
> # ---------------------------------------------------------------------------
>
> function trim(s) {return rtrim(ltrim(s))}
> function ltrim(s) {sub(/^[[:blank:]]+/, "", s); return s}
> function rtrim(s) {sub(/[[:blank:]]+$/, "", s); return s}
>
> # eof
>
>

What does it do - look for a tag in a header line and print the subsequent block
of text? If so and there's always a blank line between tag lines and blocks then
I'd start with something like this:

BEGIN{ RS="" }
NR%2 && ( $0 ~ "(^|,)TAG(,|$)" ) { found=1; next }
found { print; found=0 }

If that doesn't help, provide some more info on what it does, the input format,
some small sample input set and the expected output from that input.

Regards,

Ed.
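
For readers unfamiliar with RS="": it puts awk in paragraph mode, where each blank-line-separated chunk is one record. A minimal sketch of the idea above, under two assumptions that are not confirmed by the thread: that TAG was meant as the awk variable (so it is concatenated into the regex rather than written inside the string), and that the made-up data strictly alternates tag paragraphs and block paragraphs:

```sh
# Made-up data: tag line, blank line, block, blank line, tag line, blank line, block.
out=$(printf 'peas,carrots\n\n\tblock 1\n\tblock 2\n\nbeans,squash\n\n\tblock 3\n' |
  awk -v TAG='peas' '
    BEGIN { RS = "" }   # paragraph mode: records are separated by blank lines
    # Odd records are tag lines; the regex is built by concatenating TAG.
    NR % 2 && ($0 ~ ("(^|,)" TAG "(,|$)")) { found = 1; next }
    # The record after a matching tag line is the block to print.
    found { print; found = 0 }
  ')
printf '%s\n' "$out"
```

Only the two lines of the first block are printed. The approach depends on the strict alternation, which is why it stops fitting once the data has irregular blank lines.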

mss

Jan 15, 2011, 6:37:42 PM

Ed Morton wrote:

> What does it do - look for a tag in a header line and print the subsequent block
> of text? If so and there's always a blank line between tag lines and blocks then
> I'd start with something like this:
>
> BEGIN{ RS="" }
> NR%2 && ( $0 ~ "(^|,)TAG(,|$)" ) { found=1; next }
> found { print; found=0 }
>
> If that doesn't help, provide some more info on what it does, the input format,
> some small sample input set and the expected output from that input.

Yes, that's very close to what the script does.
Here's some example data:

peas, carrots, tag n...
block 1
block 2
block n...

beans,squash, tag n...

block 1

block 2
block n...

The irregularities in the above are intentional btw...

Ed, I believe I've got it now, it handles everything
I've thrown at it so long as the blocks begin with a
tab & the tags do not. I really learned a lot on this
project, thanks to all who helped, I appreciate your
input & patience. Slowly but surely, my efforts are
maturing. Here's the latest version of the script:

# Michael S. Sanders 2011
#
# invocation: gawk --posix -v TAG='peas' -f program.awk data.file
#
# data format:
#
# 'tags' - are comma-delimited
# 'blocks' - begin with a tab
#
# example:
#
# peas, carrots, tag n...
#
# block 1
# block 2
# block n...
#
# beans, squash, tag n...
#
# block 1
# block 2
# block n...
#

BEGIN{TAG = trim(tolower(TAG))}

TAG == "-i" {if (/^[^[:blank:]]/) {print}; next}

{
    if (z && /^([[:blank:]]|$)/) {
        print; next
    } else {
        y = split(tolower($0), cell, ",")
        for (x = 1; x <= y; x++) {
            if (trim(cell[x]) == TAG) {z = 1; print; next} else {z = 0}
        }
    }
}

function trim(s) {return rtrim(ltrim(s))}
function ltrim(s) {sub(/^[[:blank:]]+/, "", s); return s}
function rtrim(s) {sub(/[[:blank:]]+$/, "", s); return s}

# eof

--

Ed Morton

Jan 15, 2011, 7:21:54 PM

On 1/15/2011 5:37 PM, mss wrote:
> Ed Morton wrote:
>
>> What does it do - look for a tag in a header line and print the subsequent block
>> of text? If so and there's always a blank line between tag lines and blocks then
>> I'd start with something like this:
>>
>> BEGIN{ RS="" }
>> NR%2 && ( $0 ~ "(^|,)TAG(,|$)" ) { found=1; next }
>> found { print; found=0 }
>>
>> If that doesn't help, provide some more info on what it does, the input format,
>> some small sample input set and the expected output from that input.
>
> Yes, thats very close to what the script does.
> Here's some example data:
>
> peas, carrots, tag n...
> block 1
> block 2
> block n...
>
> beans,squash, tag n...
>
>
> block 1
>
>
> block 2
> block n...
>
>
> The irregularities in the above are intentional btw...

Then block lines either start with tabs or are empty while tag lines are
non-empty and do not start with tabs (but can start with other spaces). Right?
You didn't post the expected output given the above but if it's the full list of
block lines between the tag lines then I'd approach this with just something like:

BEGIN { block=++i; tag=++i; TAG=tolower(TAG); gsub(/[[:space:]]/,"",TAG) }
{ type = (/^(\t|$)/ ? block : tag) }
type == tag { found = ( tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)" ) }
(type == block) && found

Regards,

Ed.
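
A minimal sketch of running this approach on sample input. The data below is made up, with tabs written explicitly because the thread's sample data does not show them; the program is the one above with the wrapped line joined:

```sh
# Made-up data: a matching tag line, a block containing an empty line,
# then a non-matching tag line with its own block.
out=$(printf 'peas, carrots\n\tblock 1\n\n\tblock 2\nbeans,squash\n\tblock 3\n' |
  awk -v TAG='Peas' '
    BEGIN { block = ++i; tag = ++i; TAG = tolower(TAG); gsub(/[[:space:]]/, "", TAG) }
    # A line that starts with a tab or is empty is a block line, anything else a tag line.
    { type = (/^(\t|$)/ ? block : tag) }
    # On a tag line, remember whether TAG is one of the comma-separated tags.
    type == tag { found = (tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)") }
    # Print block lines that follow a matching tag line.
    (type == block) && found
  ')
printf '%s\n' "$out"
```

The output is the first block, including its empty line. The tag line itself is not printed by this version, and the tag comparison is case-insensitive because both sides are lowercased.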


mss

Jan 15, 2011, 10:34:14 PM

Ed Morton wrote:

> Then block lines either start with tabs or are empty while tag lines are
> non-empty and do not start with tabs (but can start with other spaces). Right?

Yes sir *exactly*.

> You didn't post the expected output given the above but if it's the full list of
> block lines between the tag lines then I'd approach this with just something like:

Block lines can be any number of lines...

> BEGIN { block=++i; tag=++i; TAG=tolower(TAG); gsub(/[[:space:]]/,"",TAG) }
> { type = (/^(\t|$)/ ? block : tag) }
> type == tag { found = ( tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)" ) }
> (type == block) && found

Darn! Now that's compact Ed. Must study your work...

I'm not too far behind you, "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)",
was what I could not get right (so I lifted yours... thank you).

Here's where I'm at, not as tight as yours, but getting there:

BEGIN{TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)}

TAG == "-i" {if (/^[^[:blank:]]/) {print}; next} # indices

{
if (tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)") {
z = 1} else if (z && /^([[:blank:]]|$)/) {z = 1} else { z = 0}

if (z) {print}

}

mss

Jan 15, 2011, 11:14:29 PM

mss wrote:

> BEGIN{TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)}
>
> TAG == "-i" {if (/^[^[:blank:]]/) {print}; next} # indices
>
> {
> if (tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)") {
> z = 1} else if (z && /^([[:blank:]]|$)/) {z = 1} else { z = 0}
>
> if (z) {print}
>
> }

z is killing me '{}'...

BEGIN{TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)}
TAG == "-i" {if (/^[^[:blank:]]/) {print}; next}

{if (tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)") {z = 1}
else if (z && /^([[:blank:]]|$)/) {} else {z = 0}; if (z) {print}}

Janis Papanagnou

Jan 16, 2011, 3:46:03 AM

On 16.01.2011 05:14, mss wrote:
> mss wrote:
>
>> BEGIN{TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)}
>>
>> TAG == "-i" {if (/^[^[:blank:]]/) {print}; next} # indices
>>
>> {
>> if (tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)") {
>> z = 1} else if (z && /^([[:blank:]]|$)/) {z = 1} else { z = 0}
>>
>> if (z) {print}
>>
>> }
>
> z is killing me '{}'...

This code formatting is killing me... :-)

>
> BEGIN{TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)}
> TAG == "-i" {if (/^[^[:blank:]]/) {print}; next}
> {if (tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)") {z = 1}
> else if (z && /^([[:blank:]]|$)/) {} else {z = 0}; if (z) {print}}
>

Okay, I'm not sure I've seen a specification yet, that describes what
you intend to do. So in this posting I will take your program, and the
data that you provided as comment in your program, see how it behaves
if fed with "-i", some existing tag, and some trash. And with the help
of the output as regression test base I'll start reorganizing the code
in steps. Here we go...

1. do a clean formatting, a few comments to have intention documented

BEGIN {
    TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)
}

# print only headers
TAG == "-i" {
    if (/^[^[:blank:]]/) print
    next
}

# print the selected header and data block
{   if (tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)")
        z = 1
    else if (z && /^([[:blank:]]|$)/)   # that's the same pattern as above?
        {}
    else
        z = 0
    if (z) print
}

2. some function decoupling, handling of invariants, find common constructs

BEGIN {
    TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)
    tag_i = (TAG == "-i")
}

# print only headers
tag_i {
    if (/^[^[:blank:]]/) print
    next
}

# print the selected header and data block
{   if (tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)")
        z = 1
    else if (z && !/^[^[:blank:]]/)
        {}
    else
        z = 0
}

z { print }

3. isolate the conditions, organize the FSM, join print condition

BEGIN {
    TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)
    tag_i = (TAG == "-i")
}

{ a_tag = ($0 ~ /^[^[:blank:]]/) }

a_tag {
    matching_blk = (tolower($0)~"(^|,)[[:space:]]*"TAG"[[:space:]]*(,|$)")
}

(tag_i && a_tag) || matching_blk

Note that the print condition in the last line resembles the specification!

4. reformat and cryptify code to resemble the OP's version

BEGIN{TAG=tolower(TAG);gsub(/[[:space:]]/,"",TAG);i=(TAG=="-i")}
{a=($0~/^[^[:blank:]]/)}
a{m=(tolower($0)~"(^|,)[[:space:]]*"TAG"[[:space:]]*(,|$)")}
(i&&a)||m

I suggest to omit step 4, though. ;-)

The one thing that still incommodes me is the bulky pattern that defines the
matching block; I haven't spent a minute to see whether it can be improved.
(Disclaimer: while I tested the steps against your test data there might be
copy-paste errors.)

A final remark. The code analysis and code refactoring required non-trivial
effort. Two things help to avoid imposing that effort on the helpful souls
here: a clear and unambiguous sample of original data and expected output, and
a clear specification. A clear specification lets us ignore what has been done
and focus on the original problem, if necessary, and the data shows how the
program should behave and serves as a basis for testing the suggestions.

Janis

Janis Papanagnou

Jan 16, 2011, 4:35:24 AM

On 16.01.2011 09:46, Janis Papanagnou wrote:
> On 16.01.2011 05:14, mss wrote:
>> mss wrote:
>>
>>> BEGIN{TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)}
>>>
>>> TAG == "-i" {if (/^[^[:blank:]]/) {print}; next} # indices
>>>
>>> {
>>> if (tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)") {
>>> z = 1} else if (z && /^([[:blank:]]|$)/) {z = 1} else { z = 0}
>>>
>>> if (z) {print}
>>>
>>> }
>>
>> z is killing me '{}'...
>

[...]

I wrote:
>
> BEGIN {
> TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)

> tag_i = (TAG == "-i")
> }
>
> { a_tag = ($0 ~ /^[^[:blank:]]/) }
>
> a_tag {
> matching_blk = (tolower($0)~"(^|,)[[:space:]]*"TAG"[[:space:]]*(,|$)")
> }
>
> (tag_i && a_tag) || matching_blk
>
>

[...]

I missed that Ed already provided a specification that you confirmed:

> Ed Morton wrote:
>> Then block lines either start with tabs or are empty while tag lines are
>> non-empty and do not start with tabs (but can start with other spaces).
>> Right?
>
> Yes sir *exactly*.

The code above doesn't handle a possible blank at the start of a tag line.

That means that you have to change the condition that defines "a_tag".
Since the refactoring has isolated that definition on one single line,
it's simple to adjust.

According to the definitions of the character classes

    [:blank:]   [ \t]            Space and tab
    [:space:]   [ \t\r\n\v\f]    Whitespace characters

what we need is just "not a TAB" according to our specification.

The respective line in the code above thus becomes even simpler:

{ a_tag = ($0 ~ /^[^\t]/) }

Everything else remains unchanged.
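
To see what the change does, here is a small sketch comparing the two
definitions on three made-up lines: a tag with a leading space, a
tab-indented block line, and an empty line. The sample lines and the
helper name classify are assumptions for illustration only.

```shell
# Hypothetical input lines: " tag1, tag2", "<TAB>block text", "" (empty).
classify() {   # $1 selects the definition to report: "old" or "new"
    printf ' tag1, tag2\n\tblock text\n\n' |
    awk -v which="$1" '
    {
        old_def = ($0 ~ /^[^[:blank:]]/)   # first char is neither space nor tab
        new_def = ($0 ~ /^[^\t]/)          # first char is anything but a tab
        printf "%d", (which == "old" ? old_def : new_def)
    }
    END { print "" }'
}

old_result=$(classify old)   # one digit per input line
new_result=$(classify new)
echo "old: $old_result  new: $new_result"
```

Only the first line changes: the tag with the leading space is 0 under the
old test and 1 under the new one. The empty line stays 0 in both cases,
since each pattern needs at least one character to match.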

Janis

mss

Jan 16, 2011, 7:49:37 AM

Janis Papanagnou wrote:

> This code formatting is killing me... :-)

=) I know, sorry (I just wanted to try to
match Ed, but I cannot do it yet - you all
are still too far ahead of me). It's a silly
thing men do, like racing cars. I'll take your
advice and format the code more properly.

Yes let me study this (there is much here for me to work out). Thank you very
much for your help Janis, I always learn from your posts.

mss

Jan 16, 2011, 8:15:59 AM

Janis Papanagnou wrote:

> The respective line in the code above thus becomes even simpler:
>
> { a_tag = ($0 ~ /^[^\t]/) }
>
> Everything else remains unchanged.

Okay, I'll change your example to include this
so as to handle leading spaces in the tags.

Ed Morton

Jan 16, 2011, 9:47:53 AM

On 1/16/2011 6:49 AM, mss wrote:
<snip>

> Yes let me study this (there is much here for me to work out). Thank you very
> much for your help Janis, I always learn from your posts.
>

Here's mine and Janis's posted scripts next to each other using the same
formatting and same or similar variable names so you can easily see the [small]
differences clearly.

Ed:
--------------

BEGIN {
    TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)

    block=++i; tag=++i;
}

{ type = (/^(\t|$)/ ? block : tag) }

type == tag {
    matching_blk = ( tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)" )
}

(type == block) && matching_blk
--------------

Janis:
--------------

BEGIN {
    TAG = tolower(TAG); gsub(/[[:space:]]/,"",TAG)
    tag_i = (TAG == "-i")
}

{ typeIsTag = ($0 ~ /^[^\t]/) }

typeIsTag {
    matching_blk = ( tolower($0) ~ "(^|,)[[:space:]]*" TAG "[[:space:]]*(,|$)" )
}

(tag_i && typeIsTag) || matching_blk
--------------

Mine doesn't do anything with TAG "-i", and when Janis' finds a matching tag
line it will print that tag line plus the subsequent block of text whereas mine
would just print the subsequent block of text. I don't know which output is what
you want.

Note that it's probably not an issue in this case but in other situations
solutions like this have a problem if TAG can contain RE characters (e.g. "*" or
"+") due to the variable being part of a dynamic RE evaluation (matching_blk = ...).
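
One way around that, sketched here under the assumption that tags are
compared as whole, case-insensitive, comma-separated items: split the tag
line and compare with == instead of building a regexp. The helper name
has_tag and the sample lines are made up for illustration; this is not
either of the posted scripts.

```shell
# has_tag TAG LINE: prints 1 if TAG is one of the comma-separated items in LINE, else 0.
# Caveat: -v processes backslash escapes in the value, so a TAG containing a
# backslash would have to be passed some other way (e.g. via ENVIRON).
has_tag() {
    printf '%s\n' "$2" |
    awk -v TAG="$1" '
    BEGIN { TAG = tolower(TAG); gsub(/[[:space:]]/, "", TAG) }
    {
        found = 0
        n = split(tolower($0), item, ",")
        for (k = 1; k <= n; k++) {
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", item[k])   # trim each item
            if ((item[k] "") == (TAG "")) found = 1            # forced string comparison
        }
        print found
    }'
}

plain=$(has_tag 'awk' 'sed, AWK, perl')   # ordinary tag, case-insensitive
meta=$(has_tag 'c++' 'c, c++')            # "+" is taken literally
nomatch=$(has_tag 'a.c' 'abc, xyz')       # "." does not act as a wildcard
echo "$plain $meta $nomatch"
```

Because no regexp is built from TAG, characters such as "+" and "." have
no special meaning here. The trade-off is a loop over the items on every
tag line instead of a single match.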

Ed.

mss

Jan 16, 2011, 11:16:02 AM

Ed Morton wrote:

> Here's mine and Janis's posted scripts next to each other using the same
> formatting and same or similar variable names so you can easily see the [small]
> differences clearly.

[...]

Thank you Ed, I'll study this very closely in fact. I've got a lot of
thinking to do, truth be known.

> Mine doesn't do anything with TAG "-i", and when Janis' finds a matching tag
> line it will print that tag line plus the subsequent block of text whereas mine
> would just print the subsequent block of text. I don't know which output is what
> you want.
>
> Note that it's probably not an issue in this case but in other situations
> solutions like this have a problem if TAG can contain RE characters (e.g. "*" or
> "+") due to the variable being part of a dynamic RE evaluation (matching_blk = ...).

Yes, I'm displaying the matching tags to provide context.

Here's my initial specs for the format:

TAG

A TAG is always located above the BLOCK it describes on a single line.
TAG lines never contain a tab character. Multiple TAGS within a given
line of TAGS are comma delimited.

TAGS must not contain metacharacters such as *, ., or ? (full list still
to be defined).

BLOCK

A BLOCK is defined as a group of lines that begin with a horizontal
tab character (ASCII: 9, hex: 0x9).

Empty lines within a block are acceptable to preserve formatting.

mss

Jan 16, 2011, 11:32:46 AM

mss wrote:

> BLOCK
>
> A BLOCK is defined as a group of lines that begin with a horizontal
> tab character (ASCII: 9, hex: 0x9).
>
> Empty lines within a block are acceptable to preserve formatting.

Or rather:

A BLOCK is defined as a group of lines that 'each' begin with a
horizontal tab character (ASCII 9, hex: 0x9).

Empty lines within a BLOCK are acceptable to preserve formatting.
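
Read literally, the spec above gives three kinds of lines. A minimal sketch
of a classifier follows. The helper name classify_lines, the INVALID label
for a non-leading tab, and the sample input are all assumptions made for
illustration; they are not part of the spec.

```shell
# classify_lines: reads lines on stdin, prints one label per line, space-separated.
classify_lines() {
    awk '
    function emit(kind) { printf "%s%s", sep, kind; sep = " " }
    /^\t/ { emit("BLOCK");   next }   # begins with a horizontal tab
    /^$/  { emit("EMPTY");   next }   # acceptable inside a BLOCK
    /\t/  { emit("INVALID"); next }   # a tab elsewhere breaks the TAG rule
          { emit("TAG") }             # everything else is a TAG line
    END   { print "" }
    '
}

kinds=$(printf 'awk, sed\n\tline one\n\n\tline two\nbad\ttag\n' | classify_lines)
echo "$kinds"
```

Empty lines are reported separately because whether one belongs to a BLOCK
depends on the lines around it, which a per-line test cannot decide.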