Back-reference with gsub

Patrick Naef

Aug 6, 1999, 3:00:00 AM

I'd like to do a back-reference in awk. If I write something like this:

echo "abba" | awk '{ gsub( /(bb)/, "\1", $0 ); print $0 }'

I get

aa

but I wanted

bb

Anybody there who will enjoy telling me my glaring mistake? Please,
because otherwise I'll suffer a nervous breakdown - for the past hour or
two I read all books, man-pages, faqs etc. that I could get my hands on,
but I only found out that it ***should*** work! perl does it, but I hate
perl...

If you think it's a bug: I'm doing this on SCO Openserver 5.05 with the
awk supplied by SCO. But: I also tried it out - with the same result -
with gawk 3.0.4 (on the same system) and with the SCO awk on an older
SCO Unix 3.2v4.2.

-Patrick

Harlan Grove

Aug 6, 1999, 3:00:00 AM

In article <37AB54A0...@datacomm.ch>,
Patrick Naef <herz...@datacomm.ch> wrote:

>I'd like to do a back-reference in awk. If I write something like this:
>
>echo "abba" | awk '{ gsub( /(bb)/, "\1", $0 ); print $0 }'
>
>I get
>
>aa
>
>but I wanted
>
>bb

<snip>

Two problems. First, most awks don't remember parenthesized subexpressions,
so the Perl-like "\1" in the replacement string is just an octal escape for
the ASCII 0x01 character. If your terminal doesn't print a graphic character
for ASCII 0x01, that would explain why gsub appears to be eating the bb.
Second, even if SCO awk did support such back-references, all this would
have done is replace bb with bb; it wouldn't have deleted the a's.

If you want to delete the a's, try gsub(/a/,""). If you want to delete
everything other than b's, try gsub(/[^b]/,"").
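
A quick sketch of both suggestions, assuming any POSIX awk is on the PATH. The shell function names strip_a and keep_b are made up for illustration and are not part of awk:

```shell
# Hypothetical helper names; the awk programs are the two gsub() calls above.
strip_a() { awk '{ gsub(/a/, ""); print }'; }      # delete every "a"
keep_b()  { awk '{ gsub(/[^b]/, ""); print }'; }   # delete everything except "b"

echo abba | strip_a        # prints: bb
echo a1b2b3a | keep_b      # prints: bb
```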


Jim Monty

Aug 7, 1999, 3:00:00 AM

Patrick Naef <herz...@datacomm.ch> wrote:
> I'd like to do a back-reference in awk. If I write something like this:
>
> echo "abba" | awk '{ gsub( /(bb)/, "\1", $0 ); print $0 }'
>
> I get
>
> aa
>
> but I wanted
>
> bb
>

> Anybody there who will enjoy telling me my glaring mistake? Please,
> because otherwise I'll suffer a nervous breakdown - for the past hour or
> two I read all books, man-pages, faqs etc. that I could get my hands on,
> but I only found out that it ***should*** work!

You did? Hmm... To the best of my knowledge, only MKS awk supports
remembered parenthesized subexpressions in sub() and gsub(). These
are different from backreferences within regular expression patterns.
Perhaps _that_ is what you read about in the documentation.

> perl does it, but I hate perl...

Don't hate Perl.

$ echo abba | perl -pe 's/.*(bb).*/$1/'
bb
$

Do you hate sed?

$ echo abba | sed -e 's/.*\(bb\).*/\1/'
bb
$

First, let's see what the script you posted actually prints:

$ echo abba | gawk '{ gsub(/(bb)/, "\1"); print }'
aa
$ echo abba | gawk '{ gsub(/(bb)/, "\1"); print }' | od -c
0000000 a 001 a \n
0000004
$

It successfully replaces the substring "bb" with a single character:
\x01. The parentheses in the regular expression only serve to group
the subexpression within the pattern, and since you've placed them
around the entire pattern (and you're not using alternation anyway),
they serve no useful purpose here.

This works:

$ echo abba | gawk '{ gsub(/bb/, "&"); print }'
abba
$

But, as you stated, that's not the result you're after. You want
to "substitute away" the text surrounding the substring you're
matching with the pattern /bb/. In other words, you want to extract
from some string the substring matching some pattern. Right? I
often use the following general-purpose, user-defined function
named "extract":

function extract(string, regex) {
    # match() sets RSTART and RLENGTH; substr() is the awk built-in
    return match(string, regex) ? substr(string, RSTART, RLENGTH) : ""
}

You could invoke it like this, for example:

{ $0 = extract($0, "bb"); print }
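
Put together as something you can paste into a shell, this might look like the sketch below. It relies only on the standard match() and substr() built-ins; the wrapper name extract_match and the second sample input are assumptions added for illustration:

```shell
# Hypothetical wrapper: prints the first substring of each input line
# that matches the regular expression given as $1 (empty line if none).
extract_match() {
    awk -v re="$1" '
        function extract(string, regex) {
            return match(string, regex) ? substr(string, RSTART, RLENGTH) : ""
        }
        { print extract($0, re) }
    '
}

echo abba | extract_match 'bb'              # prints: bb
echo "id=4711;" | extract_match '[0-9]+'    # prints: 4711
```

Note that a pattern passed with -v goes through awk's string escape processing, so patterns containing backslashes would need extra care.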

Gawk has a function named "gensub" that _does_ support remembered
parenthesized subexpressions:

$ echo abba | gawk '{ $0 = gensub(/.*(bb).*/, "\\1", 1); print }'
bb
$

But, of course, "gensub" is an extension to the language and is
not portable. The function "extract" IS portable.

Good luck.

--
Jim Monty
mo...@primenet.com
Tempe, Arizona USA

Patrick Naef

Aug 7, 1999, 3:00:00 AM

Jim Monty wrote:
>
> Patrick Naef <herz...@datacomm.ch> wrote:
> > but I wanted
> >
> > bb

Well, after I got some sleep I realize now that my example was plain
wrong. Of course I wanted "abba" (as the other previous poster wrote),
otherwise I should have written the match as /.*(bb).*/

> > but I only found out that it ***should*** work!
>
> You did? Hmm... To the best of my knowledge, only MKS awk supports

After my beauty sleep (*cough*) I have also to admit that I probably
interpreted a passage of the regexp(M) man page a little bit too
freely :-)

> Don't hate Perl.

I know I shouldn't. But I can't help it (even after, you know,
sleeping :-)

> But, as you stated, that's not the result you're after. You want
> to "substitute away" the text surrounding the substring you're
> matching with the pattern /bb/. In other words, you want to extract
> from some string the substring matching some pattern. Right? I

Correct. What I'm ***really*** after is to "substitute away" whitespace
characters around the first "=" character on a line but without touching
any other characters on the line.

foo = bar

should look like

foo=bar

And I want to do it inside awk because it is part of a script for
parsing a configuration file. I have solved the problem now with
brute force (split and loops, yuck!) but it's so ugly that I'll
probably have to rewrite it (in perl?!) or it will give me, well,
sleepless nights...

Thanks for your help
Patrick

Jim Monty

Aug 7, 1999, 3:00:00 AM

Patrick Naef <herz...@datacomm.ch> wrote:
> Correct. What I'm ***really*** after is to "substitute away" whitespace
> characters around the first "=" character on a line but without touching
> any other characters on the line.
>
> foo = bar
>
> should look like
>
> foo=bar

$ echo "foo = bar" | awk '{ sub(/ *= */, "="); print }'
foo=bar
$

Perhaps you need more sleep... ;-)

Have you considered setting FS to the regular expression "[ \t]*=[ \t]*"
and parsing the configuration file like this?

BEGIN {
    FS = "[ \t]*=[ \t]*"
}

{
    config[$1] = $2
}
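
As a rough runnable sketch of the FS idea (the function name lookup_fs, the key variable, and the sample input are invented here, and the NF check is an added guard against lines without an equal sign):

```shell
# Hypothetical helper: read "key = value" lines on stdin and print the
# value stored for the key given as $1.
lookup_fs() {
    awk -v key="$1" '
        BEGIN   { FS = "[ \t]*=[ \t]*" }   # whitespace around "=" is part of the separator
        NF >= 2 { config[$1] = $2 }        # remember each key/value pair
        END     { print config[key] }
    '
}

printf 'foo = bar\nname\t=\tvalue\n' | lookup_fs name    # prints: value
```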

You may want to trim leading and trailing whitespace from the keys
and their associated values, respectively:

{
    config[ltrim($1)] = rtrim($2)
}

function ltrim(str) {
    sub(/^[ \t]+/, "", str)
    return str
}

function rtrim(str) {
    sub(/[ \t]+$/, "", str)
    return str
}
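
To see the trimming at work, here is a small sketch that prints each pair in brackets so stray whitespace would be visible. The name trim_pairs and the bracketed output format are assumptions for the demo only:

```shell
# Hypothetical demo: split on "=" as above, then trim the outer whitespace
# that the field separator cannot reach.
trim_pairs() {
    awk '
        BEGIN { FS = "[ \t]*=[ \t]*" }
        function ltrim(str) { sub(/^[ \t]+/, "", str); return str }
        function rtrim(str) { sub(/[ \t]+$/, "", str); return str }
        NF >= 2 { print "[" ltrim($1) "=" rtrim($2) "]" }
    '
}

printf '  foo = bar  \n' | trim_pairs    # prints: [foo=bar]
```

Since awk passes scalars by value, sub() inside the functions changes only the local copy and leaves $1 and $2 alone.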

This is overly simplistic for the general case and will not function
correctly for a line such as this one:

EQUATION = "1 + 1 = 2"

If your configuration files may contain lines with multiple equal
signs on them, you may have to get a bit more sophisticated:

BEGIN {
    FS = SUBSEP
}

{
    if (sub(/[ \t]*=[ \t]*/, SUBSEP)) {
        config[ltrim($1)] = dequote(rtrim($2))
    }
}

function dequote(str) {
    sub(/^"/, "", str)
    sub(/"$/, "", str)
    return str
}
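
A self-contained sketch of this SUBSEP approach, for anyone who wants to try it on the EQUATION line. The wrapper name parse_first_equals and the "key|value" output format are made up for the demo:

```shell
# Hypothetical demo: replace only the first "=" (and the whitespace around it)
# with SUBSEP, let awk re-split the record on SUBSEP, then print key|value.
parse_first_equals() {
    awk '
        BEGIN { FS = SUBSEP }
        function ltrim(str)   { sub(/^[ \t]+/, "", str); return str }
        function rtrim(str)   { sub(/[ \t]+$/, "", str); return str }
        function dequote(str) { sub(/^"/, "", str); sub(/"$/, "", str); return str }
        {
            # sub() returns 1 when it replaced something; changing $0 re-splits the fields
            if (sub(/[ \t]*=[ \t]*/, SUBSEP)) {
                print ltrim($1) "|" dequote(rtrim($2))
            }
        }
    '
}

printf 'EQUATION = "1 + 1 = 2"\n' | parse_first_equals    # prints: EQUATION|1 + 1 = 2
```

This assumes the configuration file never contains the SUBSEP character itself (by default "\034").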

As the Perl folks are fond of acronyming, TMTOWTDI. But you get
the idea.

[N.B. None of the code above was tested.]

Patrick Naef

Aug 8, 1999, 3:00:00 AM

Jim Monty wrote:
> If your configuration files may contain lines with multiple equal
> signs on them, you may have to get a bit more sophisticated:

That's exactly the point: as I wrote I just want to trim around the
*first* equal sign. If this back-reference business had worked it
would have been one single gsub, something like this:

BEGIN {
    whiteSpace = "[ \t]+"
}
{
    gsub( "^([^=]*)" whiteSpace "=" whiteSpace, "\1=", $0 )
}

Well, at least I know now what one of my three wishes will be...
"Genie, back-references in awk, please!"

> [N.B. None of the code above was tested.]

Of course, but it looks interesting and I'll close in on it on Monday.
For now I go get some sleep. Thanks for all your effort!

Patrick

Jim Monty

Aug 8, 1999, 3:00:00 AM

Patrick Naef <herz...@datacomm.ch> wrote:
> Jim Monty wrote:
> > If your configuration files may contain lines with multiple equal
> > signs on them, you may have to get a bit more sophisticated:
>
> That's exactly the point: as I wrote I just want to trim around the
> *first* equal sign. If this back-reference business had worked it
> would have been one single gsub, something like this:
>
> BEGIN {
>     whiteSpace = "[ \t]+"
> }
> {
>     gsub( "^([^=]*)" whiteSpace "=" whiteSpace, "\1=", $0 )
> }

Am I missing something here? If you only want to replace the first
occurrence of some substring that matches a pattern, then use sub(),
not gsub(). I demonstrated this in my last post:

$ echo "foo = bar" | awk '{ sub(/ *= */, "="); print }'
foo=bar

$ echo "foo = bar = baz" | awk '{ sub(/ *= */, "="); print }'
foo=bar = baz
$

Look, Ma! No "backreferences" (remembered parenthesized subexpressions)!

From the man page gawk(1):

    sub(r, s [, t])    just like gsub(), but only the
                       first matching substring is
                       replaced.
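
The same sub() call also copes with tabs if the pattern uses [ \t] instead of a plain space. A minimal sketch; the function name squeeze_first_equals is invented for illustration:

```shell
# Hypothetical helper: remove blanks and tabs around the first "=" only.
squeeze_first_equals() {
    awk '{ sub(/[ \t]*=[ \t]*/, "="); print }'
}

printf 'foo \t= \tbar = baz\n' | squeeze_first_equals    # prints: foo=bar = baz
```

Because sub() stops after the leftmost match, the second equal sign and the spaces around it are left untouched.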

> Well, at least I know now what one of my three wishes will be...
> "Genie, back-references in awk, please!"

There is a version of awk that has many nice enhanced regular
expression features. It's called Perl. (And don't call him "Genie"!
The name's Larry. ;-)