I'd like to do a back-reference in awk. If I write something like this:
echo "abba" | awk '{ gsub( /(bb)/, "\1", $0 ); print $0 }'
I get
aa
but I wanted
bb
Anybody there who will enjoy telling me my glaring mistake? Please,
because otherwise I'll suffer a nervous breakdown - for the past hour or
two I read all books, man-pages, faqs etc. that I could get my hands on,
but I only found out that it ***should*** work! perl does it, but I hate
perl...
If you think it's a bug: I'm doing this on SCO Openserver 5.05 with the
awk supplied by SCO. But: I also tried it out - with the same result -
with gawk 3.0.4 (on the same system) and with the SCO awk on an older
SCO Unix 3.2v4.2.
-Patrick
>I'd like to do a back-reference in awk. If I write something like this:
>
>echo "abba" | awk '{ gsub( /(bb)/, "\1", $0 ); print $0 }'
>
>I get
>
>aa
>
>but I wanted
>
>bb
<snip>
Two problems. First, most awks don't remember parenthesized patterns,
so the Perl-like positional reference "\1" is interpreted as the single
character ASCII 0x01. If SCO doesn't print a graphic character for ASCII
0x01, that would explain why the replacement appears to eat the bb.
Second, even if SCO awk did support positional references, all this would
have done was reproduce bb; it wouldn't have deleted the a's.
If you want to delete the a's, try gsub(/a/,""). If you want to delete
everything other than b's, try gsub(/[^b]/,"").
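For what it's worth, here is a minimal sketch of those two suggestions run against the "abba" input from the original post. The shell variable names are only for illustration.

```shell
# Delete every "a" from the record.
no_a=$(echo "abba" | awk '{ gsub(/a/, ""); print }')

# Delete every character that is not a "b".
only_b=$(echo "abba" | awk '{ gsub(/[^b]/, ""); print }')

echo "$no_a $only_b"
```

For this particular input both commands leave just the two b's.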
> but I only found out that it ***should*** work!
You did? Hmm... To the best of my knowledge, only MKS awk supports
remembered parenthesized subexpressions in sub() and gsub(). These
are different from backreferences within regular expression patterns.
Perhaps _that_ is what you read about in the documentation.
> perl does it, but I hate perl...
Don't hate Perl.
$ echo abba | perl -pe 's/.*(bb).*/$1/'
bb
$
Do you hate sed?
$ echo abba | sed -e 's/.*\(bb\).*/\1/'
bb
$
First, let's see what the script you posted actually prints:
$ echo abba | gawk '{ gsub(/(bb)/, "\1"); print }'
aa
$ echo abba | gawk '{ gsub(/(bb)/, "\1"); print }' | od -c
0000000 a 001 a \n
0000004
$
It successfully replaces the substring "bb" with a single character:
\x01. The parentheses in the regular expression only serve to group
the subexpression within the pattern, and since you've placed them
around the entire pattern (and you're not using alternation anyway),
they serve no useful purpose here.
This works:
$ echo abba | gawk '{ gsub(/bb/, "&"); print }'
abba
$
But, as you stated, that's not the result you're after. You want
to "substitute away" the text surrounding the substring you're
matching with the pattern /bb/. In other words, you want to extract
from some string the substring matching some pattern. Right? I
often use the following general-purpose, user-defined function
named "extract":
function extract(string, regex) {
    # match() sets RSTART and RLENGTH; the awk built-in is substr()
    return match(string, regex) ? substr(string, RSTART, RLENGTH) : ""
}
You could invoke it like this, for example:
{ $0 = extract($0, "bb"); print }
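As a quick check, the function can be exercised from the shell roughly like this. It is a sketch that uses the built-in substr() with the RSTART and RLENGTH values set by match(), and the "abba" input from the original post; the shell variable name is made up for the example.

```shell
extracted=$(echo "abba" | awk '
function extract(string, regex) {
    # match() returns 0 when nothing matches, otherwise sets RSTART/RLENGTH
    return match(string, regex) ? substr(string, RSTART, RLENGTH) : ""
}
{ $0 = extract($0, "bb"); print }')

echo "$extracted"
```

This should print only the matched substring, with the surrounding a's gone.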
Gawk has a function named "gensub" that _does_ support remembered
parenthesized subexpressions:
$ echo abba | gawk '{ $0 = gensub(/.*(bb).*/, "\\1", 1); print }'
bb
$
But, of course, "gensub" is an extension to the language and is
not portable. The function "extract" IS portable.
Good luck.
--
Jim Monty
mo...@primenet.com
Tempe, Arizona USA
Well, after I got some sleep I realize now that my example was plain
wrong. Of course I wanted "abba" (as the other previous poster wrote),
otherwise I should have written the match as /.*(bb).*/
> > but I only found out that it ***should*** work!
>
> You did? Hmm... To the best of my knowledge, only MKS awk supports
After my beauty sleep (*cough*) I have also to admit that I probably
interpreted a passage of the regexp(M) man page a little bit too
freely :-)
> Don't hate Perl.
I know I shouldn't. But I can't help it (even after, you know,
sleeping :-)
> But, as you stated, that's not the result you're after. You want
> to "substitute away" the text surrounding the substring you're
> matching with the pattern /bb/. In other words, you want to extract
> from some string the substring matching some pattern. Right? I
Correct. What I'm ***really*** after is to "substitute away" whitespace
characters around the first "=" character on a line but without touching
any other characters on the line.
foo = bar
should look like
foo=bar
And I want to do it inside awk because it is part of a script for
parsing a configuration file. I have solved the problem now with
brute force (split and loops, yuck!) but it's so ugly that I'll
probably have to rewrite it (in perl?!) or it will give me, well,
sleepless nights...
Thanks for your help
Patrick
$ echo "foo = bar" | awk '{ sub(/ *= */, "="); print }'
foo=bar
$
Perhaps you need more sleep... ;-)
Have you considered setting FS to the regular expression "[ \t]*=[ \t]*"
and parsing the configuration file like this?
BEGIN {
    FS = "[ \t]*=[ \t]*"
}
{
    config[$1] = $2
}
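A minimal sketch of that FS approach, fed two made-up configuration lines (one with spaces around the "=", one with tabs). The sample keys and values are assumptions for illustration only, and the print statement is added just to make the result visible.

```shell
parsed=$(printf 'foo = bar\nname\t=\tvalue\n' | awk '
BEGIN { FS = "[ \t]*=[ \t]*" }
{
    config[$1] = $2           # key is field 1, value is field 2
    print $1 "=" config[$1]   # show what was stored
}')

echo "$parsed"
```

Because the whitespace is part of the field separator, it never ends up in $1 or $2.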
You may want to trim leading and trailing whitespace from the keys
and their associated values, respectively:
{
    config[ltrim($1)] = rtrim($2)
}

function ltrim(str) {
    sub(/^[ \t]+/, "", str)
    return str
}

function rtrim(str) {
    sub(/[ \t]+$/, "", str)
    return str
}
This is overly simplistic for the general case and will not function
correctly for a line such as this one:
EQUATION = "1 + 1 = 2"
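To see why, here is a small sketch that splits that very line with the same FS and prints the field count and the second field. The variable name is only for illustration.

```shell
fields=$(echo 'EQUATION = "1 + 1 = 2"' | awk '
BEGIN { FS = "[ \t]*=[ \t]*" }
{
    print NF    # the second "=" also acts as a separator
    print $2    # so the value is cut short
}')

echo "$fields"
```

The record is split into three fields instead of two, so $2 holds only the part of the value before the second equal sign.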
If your configuration files may contain lines with multiple equal
signs on them, you may have to get a bit more sophisticated:
BEGIN {
    FS = SUBSEP
}
{
    if (sub(/[ \t]*=[ \t]*/, SUBSEP)) {
        config[ltrim($1)] = dequote(rtrim($2))
    }
}

function dequote(str) {
    sub(/^"/, "", str)
    sub(/"$/, "", str)
    return str
}
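Put together with the trimming helpers and run against the EQUATION line, the idea can be sketched like this. The print statement and the " -> " separator are additions for the demonstration, not part of the parsing itself.

```shell
value=$(echo 'EQUATION = "1 + 1 = 2"' | awk '
BEGIN { FS = SUBSEP }
{
    # Replace only the first "=" (and the blanks around it) with SUBSEP;
    # changing $0 makes awk re-split the record using the new FS.
    if (sub(/[ \t]*=[ \t]*/, SUBSEP)) {
        key = ltrim($1)
        config[key] = dequote(rtrim($2))
        print key " -> " config[key]
    }
}
function ltrim(str)   { sub(/^[ \t]+/, "", str); return str }
function rtrim(str)   { sub(/[ \t]+$/, "", str); return str }
function dequote(str) { sub(/^"/, "", str); sub(/"$/, "", str); return str }')

echo "$value"
```

Since sub() touches only the first match, the second equal sign stays inside the value.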
As the Perl folks are fond of acronyming, TMTOWTDI. But you get
the idea.
[N.B. None of the code above was tested.]
That's exactly the point: as I wrote I just want to trim around the
*first* equal sign. If this back-reference business had worked it
would have been one single gsub, something like this:
BEGIN {
    whiteSpace = "[ \t]+"
}
{
    gsub( "^([^=]*)"whiteSpace"="whiteSpace", "\1=", $0 )
}
Well, at least I know now what one of my three wishes will be...
"Genie, back-references in awk, please!"
> [N.B. None of the code above was tested.]
Of course, but it looks interesting and I'll close in on it on Monday.
For now I go get some sleep. Thanks for all your effort!
Patrick
Am I missing something here? If you only want to replace the first
occurrence of some substring that matches a pattern, then use sub(),
not gsub(). I demonstrated this in my last post:
$ echo "foo = bar" | awk '{ sub(/ *= */, "="); print }'
foo=bar
$ echo "foo = bar = baz" | awk '{ sub(/ *= */, "="); print }'
foo=bar = baz
$
Look, Ma! No "backreferences" (remembered parenthesized subexpressions)!
From the man page gawk(1):
    sub(r, s [, t])    just like gsub(), but only the first
                       matching substring is replaced.
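A side-by-side sketch of the difference, using a made-up line with mixed spaces and tabs around the first "=" and the [ \t]* character class so that tabs are trimmed too:

```shell
# sub() replaces only the first (leftmost) match.
first_only=$(printf 'foo \t=  bar = baz\n' |
    awk '{ sub(/[ \t]*=[ \t]*/, "="); print }')

# gsub() replaces every match on the line.
every=$(printf 'foo \t=  bar = baz\n' |
    awk '{ gsub(/[ \t]*=[ \t]*/, "="); print }')

echo "$first_only"
echo "$every"
```

With sub() the second " = " is left alone, which is the behaviour asked for; gsub() would squeeze that one as well.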
> Well, at least I know now what one of my three wishes will be...
> "Genie, back-references in awk, please!"
There is a version of awk that has many nice enhanced regular
expression features. It's called Perl. (And don't call him "Genie"!
The name's Larry. ;-)