writing a good gsub regexp for matching between two specific characters

Bryan

Mar 11, 2023, 7:06:11 PM

I'm using gawk 5.1.0, bash 5.1.16, Ubuntu 22.04.2. I will write and provide a lot of material in case it is useful or there is conflict in the script, but I am trying not to ramble.

I prepared a test script below - which should be easy to copy/paste into a shell, e.g. bash. I am focused on the gsub regexps, which are obviously contrived to replace all these different strings which - as they vary from output from another program - take the general form (attempting a "plain English" version):

[open apostrophe][the word "path"][maybe an underscore][various digits][end apostrophe]

I want to take all of that ^^^ and delete it - or equivalently replace it with nothing (ideally), to prepare input to gnuplot as "x,y" or "x y" data - two columns.

I tried using this type of command :

gsub("^[a-z]{4}$","TEST") ;

... and more, e.g. trying sub and gensub - but did not get far - I am aware of a curly brace escape that is important or not depending on the awk version, so I also tried with \{ and \}.

I put "TEST" in the present case for testing a few different cases. I wrote this script based on extensive reading of a certain popular online resource and The AWK Programming Language (1988 - maybe time for a newer edition?). This is a useful script because as I find new types of output from the upstream program (a whole other story), I might add new gsub commands to take care of it.

copy/paste example script:

echo "\
{\"path_1234567\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_123456\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_1234\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path1234\"\
:[`seq -s',' -f '%f' 1 20 `]}" | \
gawk -F, '
{
gsub("\{","") ;
gsub("\}","") ;
gsub("\]","") ;
gsub("^[a-z]{4}$","TEST") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSEVEN") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSIX") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9]\":\\\[","TESTFOURB") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9]\":\\\[","TESTFOURA") ;
for (i=1;i<=NF;i++)
{
printf("%s%s",$i,i%2?",":"\n")
}
}'

... the last printf thing is perhaps for another post, but (IIUC) matches every 2nd comma and replaces it with a newline. So that's the "x,y" data idea. I hope that is clear - I imagine the regexps in the [a-z][0-9] parts ought to be able to go all into one gsub if I knew the syntax or what to read about.

Janis Papanagnou

Mar 11, 2023, 9:52:31 PM

First, I cannot really decipher what you actually want to do and
where your problems are. The usual procedure is to post sample data:
at least the input data and the corresponding output data (not shell
code that creates the input data). Anyway, below you will find some
hints and suggestions...

On 12.03.2023 01:06, Bryan wrote:
> I'm using gawk 5.1.0, bash 5.1.16, Ubuntu 22.04.2. I will write and
> provide a lot of material in case it is useful or there is conflict
> in the script, but I am trying not to ramble.
>
> I prepared a test script below - which should be easy to copy/paste
> into a shell, e.g. bash. I am focused on the gsub regexps, which are
> obviously contrived to replace all these different strings which - as
> they vary from output from another program - take the general form
> (attempting a "plain English" version):
>
> [open apostrophe][the word "path"][maybe an underscore][various
> digits][end apostrophe]
>
> I want to take all of that ^^^ and delete it - or equivalently
> replace it with nothing (ideally), to prepare input to gnuplot as
> "x,y" or "x y" data - two columns.
>
> I tried using this type of command :
>
> gsub("^[a-z]{4}$","TEST") ;

This is fine for substituting lines that contain _only_ a sequence of
four lowercase letters with "TEST". gsub() _without_ the ^ and $
anchors will substitute every occurrence of that pattern on a line.
You can provide a third argument to gsub() to operate on variables
or specific fields; in that case the anchors ^ and $ will denote
the beginning and end of that variable or field respectively.
It is also advantageous to use the /.../ syntax for constant patterns
instead of the string form "...".
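
A minimal sketch of the anchor behaviour described above. The sample strings and the function name demo_anchors are made up for illustration; a literal pattern is used so that the example does not depend on interval-expression support.

```shell
# Hypothetical demo: anchored vs. unanchored gsub(), and the third argument.
demo_anchors() {
  printf '%s\n' 'path' 'path,path_1,xpath' |
  awk -F, '
  {
    whole = $0
    n_anchored = gsub(/^path$/, "TEST", whole)  # matches only if the entire line is "path"
    any = $0
    n_any = gsub(/path/, "TEST", any)           # matches every occurrence on the line
    first = $1
    gsub(/^path$/, "TEST", first)               # here ^ and $ refer to the copy of field 1
    print n_anchored, n_any, first
  }'
}
demo_anchors
```

The first two printed numbers are the substitution counts returned by gsub(); they show that the anchored form only fires when the whole target is the pattern.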

>
> ... and more, e.g. trying sub and gensub - but did not get far - I am
> aware of a curly brace escape that is important or not depending on
> the awk version, so I also tried with \{ and \}.

There's no need to escape these braces.

Instead of passing echo arguments with quotes and newline escapes, I
suggest using shell here-documents with this syntax:

awk '
# ... your awk program ...
...
' <<EOT
your data line 1
your data line 2
...
EOT

and with the more contemporary $(...) a line might be

{"path_1234567":[$(seq -s',' -f '%f' 1 20)], ...

but I would call seq only once, assign its output to a variable, and
use that variable repeatedly

s=$(seq -s',' -f '%f' 1 20)
awk '
...
' <<EOT
{"path_1234567":[${s}], ...
...
EOT

If you pipe in or redirect other input just omit the code from <<EOT
onward.
data_from_some_process | awk '...'
awk '...' < data_from_some_file

(But for testing the here-documents have advantages.)

>
> ... the last printf thing is perhaps for another post, but (IIUC)
> matches every 2nd comma and replaces it with a newline.

printf doesn't replace anything. It prints a newline instead of a comma
after every second field.
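
As a small illustration of that point (the input line and the function name pair_up are assumptions, not from the original script), the separator printed after each field is simply chosen by the field index:

```shell
# Hypothetical demo: turn comma-separated values into one "x,y" pair per line.
pair_up() {
  awk -F, '
  {
    for (i = 1; i <= NF; i++)
      # odd-numbered fields are followed by a comma, even-numbered ones by a newline
      printf("%s%s", $i, (i % 2 ? "," : "\n"))
  }'
}
echo '1,2,3,4,5,6' | pair_up
```

With an odd number of fields the last line would end in a comma and lack a newline, so the input is assumed to hold complete pairs.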

> So that's the
> "x,y" data idea. I hope that is clear - I imagine the regexps in the
> [a-z][0-9] parts ought to be able to go all into one gsub if I knew
> the syntax or what to read about.

To match more than one regexp for the _same_ replacement you can
combine them with the | (or) operator. For an example from your
code above use, e.g., gsub(/{|}|]/, "") to remove those three
braces/brackets in one expression.
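
For single characters, a bracket expression is an alternative to the | form. The sketch below assumes a made-up input line and a hypothetical function name strip_brackets; note that "]" has to come first inside [...] to be taken literally.

```shell
# Hypothetical demo: remove "{", "}" and "]" with one bracket expression.
strip_brackets() {
  awk '{ gsub(/[]{}]/, ""); print }'
}
echo '{"path_1234":[1,2]}' | strip_brackets
```

The opening "[" is left in place here, just as in the three separate gsub() calls of the original script.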

But with your samples above you can also use other regexp syntaxes,
like ? (for optional parts) and use grouping with parenthesis (...)
for longer subexpressions, e.g.
[a-z][4}_?[0-9]{4}([0-9]{2})?
for an optional underscore and two optional digits.

Janis

Bryan

Mar 12, 2023, 12:25:51 PM

Apologies for the `seq` synthetic data, I'll prepare it the better way next time.

> But with your samples above you can also use other regexp syntaxes,
> like ? (for optional parts) and use grouping with parenthesis (...)
> for longer subexpressions, e.g.
> [a-z][4}_?[0-9]{4}([0-9]{2})?
> for an optional underscore and two optional digits.

This is exactly what I was looking for and it works (I think a typo is in there but let's leave it for now).

I tried {1-4} to get a range, but it didn't work - is that the idea? so

[a-z]{4}_?[0-9]{4}([0-9]{1-4})?

to match any number of digits from 1 to 4?

Kenny McCormack

Mar 12, 2023, 12:49:44 PM

In article <13052000-b214-4a5f...@googlegroups.com>,

It is: {1,4}

--
"If our country is going broke, let it be from feeding the poor and caring for
the elderly. And not from pampering the rich and fighting wars for them."

--Living Blue in a Red State--

Bryan

Mar 12, 2023, 4:11:10 PM

This is great. My old awk book (Aho, Kernighan, and Weinberger) has a table on p.32 saying :

"expression [c1-c2] matches any character in the range beginning with c1 and ending with c2."

... p.30 has more discussion, and I never saw anything about the comma "," to indicate a range - perhaps this is a strong indication I need to get a better book.

And, I apologize, but I must say - this discussion reached a good answer in less than 24 hours - even though discussion doesn't "scale", and I can't cast a vote on it.

IOW Thank you!

Bryan

Mar 12, 2023, 4:43:40 PM

addendum : in writing a separate question about the printf statement, I found a better way to print a newline instead of every 2nd comma from a long string of signed floating points, so I simply share the method here :

digits=$(seq -s',' -f '%f' -10 10)
gawk -F, '
{

for (i=1;i<=NF;i++)
{

printf("%3.6f%s",$i,i%2?",":"\n")
}
}' <<EOT
${digits}
EOT

Janis Papanagnou

Mar 12, 2023, 5:42:13 PM

On 12.03.2023 21:11, Bryan wrote:
> This is great. My old awk book (Aho, Kernighan, and Weinberger) has a
> table on p.32 saying :
>
> "expression [c1-c2] matches any character in the range beginning
> with c1 and ending with c2."

You are referring to something different here. Put slightly simplified,
[a-z] is a regexp matching any single lowercase letter
[0-9] any single digit
[0-9a-fA-F] any hexadecimal digit

The multiplicity syntax {N}, {N,}, {,M}, {N,M} is not supported by the
classic awk ("nawk") that the book by Aho et al. describes. More recent
and commonly used Awks like GNU awk support it, though. That's why there's
no mention of it in that book.
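
To illustrate that a bracket range matches exactly one character at a time, here is a small sketch (the input string and the name count_hex are made up). It relies only on gsub() returning the number of substitutions, so it needs no interval support.

```shell
# Hypothetical demo: count the characters matched by a bracket expression.
# Replacing each match with itself ("&") leaves the line unchanged, and
# gsub() returns how many single characters matched.
count_hex() {
  awk '{ print gsub(/[0-9a-fA-F]/, "&") }'
}
echo 'path_12ab' | count_hex
```

In 'path_12ab' the hexadecimal digits are the "a" of "path" plus "1", "2", "a" and "b", so five separate matches are counted.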

>
> ... p.30 has more discussion, and I never saw anything about the
> comma "," to indicate a range - perhaps this is a strong indication I
> need to get a better book.

The old book is excellently written and contains everything that makes up
the power of the awk language. (Don't ignore it or throw it away!)

But I suggest, especially if you use GNU awk, which supports yet more
features, getting a copy of Arnold Robbins's "Effective Awk Programming",
which is based on GNU Awk. (It's also available online in a searchable
digital form.)

Janis

Janis Papanagnou

Mar 13, 2023, 5:03:30 PM

On 12.03.2023 22:42, Janis Papanagnou wrote:
> On 12.03.2023 21:11, Bryan wrote:

>> This is great. My old awk book (Aho, Kernighan, and Weinberger) [...]

>
> The multiplicity syntax {N}, {N,}, {,M}, {N,M} is not supported by the
> classic awk ("nawk") that the book by Aho et al. describes. More recent
> and commonly used Awks like GNU awk support it, though. That's why there's
> no mention of it in that book.

While that is true for classic awk ("nawk"), Arnold Robbins informed me
that more recent versions of "nawk" also support this syntax, and have
done so for years. (Just in case my post was misinterpreted.)

To my knowledge, though, there are no newer/updated releases of the book
you mentioned; it is based on the old version of (n)awk, and thus it
does not describe that (newer) feature. (Which was my point.)

Janis

Bryan

Mar 14, 2023, 9:55:59 AM

I noticed in the "Computerphile" video with Brian Kernighan - shared on this user group - that a new version of The Awk Book might be in the works as of August 2022.

Meanwhile, the overnight delivery is in-hand now, and, from page 45:

"[begin quote]
{n}
{n,}
{n,m}
One or two numbers inside braces denote an *interval expression*. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If [p. 46] there is one number followed by a comma, then the preceding regexp is repeated at least n times:[end quote]"

... examples shown are :
wh{3}y matches 'whhhy', but not 'why' or 'whhhhy'.
wh{3,5}y matches 'whhhy', 'whhhhy', or 'whhhhhy' only.
wh{2,}y matches 'whhy', 'whhhy', and so on.

There is more.

Lastly, from the back cover :

"You have the freedom to copy and modify this GNU manual."

Glad to support the FSF in this way!

Janis Papanagnou

Mar 14, 2023, 7:14:33 PM

On 14.03.2023 14:55, Bryan wrote:
> I noticed in the "Computerphile" video with Brian Kernighan - shared
> on this user group - that a new version of The Awk Book might be in
> the works as of August 2022.

I cannot find a new version of the original Awk book with Google
(or other commercial providers). Could you provide a link, please?

Or are you speaking about Arnold Robbins's book? (Especially since
below you mention GNU and the FSF.)

I'm certainly confused by your mention of Brian Kernighan, one of
the authors of the original book.

>
> Meanwhile, the overnight delivery is in-hand now, [...] There is
> more.
>
> Lastly, from the back cover :
> "You have the freedom to copy and modify this GNU manual."
>
> Glad to support the FSF in this way!

Janis

Ben Bacarisse

Mar 14, 2023, 7:46:28 PM

Janis Papanagnou <janis_pap...@hotmail.com> writes:

> On 14.03.2023 14:55, Bryan wrote:
>> I noticed in the "Computerphile" video with Brian Kernighan - shared
>> on this user group - that a new version of The Awk Book might be in
>> the works as of August 2022.
>
> I cannot find a new version of the original Awk book with Google
> (or other commercial providers). Could you provide a link, please?
>
> Or are you speaking about Arnold Robbins's book? (Especially since
> below you mention GNU and the FSF.)
>
> I'm certainly confused by your mention of Brian Kernighan, one of
> the authors of the original book.

The phrase "might be in the works" means only that there is a possibility
that a new edition might be in preparation. Is that what's confusing?

Bryan is clearly talking about a new version of the original book, but
he is referring to the most vague suggestion that there might, soon, be
a new edition. As far as I can tell there isn't one, but there could be
one "in the works" (i.e. in preparation).

--
Ben.

Keith Thompson

Mar 14, 2023, 7:49:07 PM

That's the GNU Awk manual. I don't have a printed version, but it
appears to have the same content as the online manual available by
typing "info gawk" (if you have the right things installed)
or at <https://www.gnu.org/software/gawk/manual/gawk.html>.

"The Awk Book" presumably refers to the original "The AWK Programming
Language" by Aho, Kernighan, and Weinberger, published in 1988.

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for XCOM Labs
void Void(void) { Void(); } /* The recursive call of the void */

Janis Papanagnou

Mar 14, 2023, 8:22:25 PM

On 15.03.2023 00:46, Ben Bacarisse wrote:
> Janis Papanagnou <janis_pap...@hotmail.com> writes:
>
>> On 14.03.2023 14:55, Bryan wrote:
>>> I noticed in the "Computerphile" video with Brian Kernighan - shared
>>> on this user group - that a new version of The Awk Book might be in
>>> the works as of August 2022.
>>
>> I cannot find a new version of the original Awk book with Google
>> (or other commercial providers). Could you provide a link, please?
>>
>> Or are you speaking about Arnold Robbins's book? (Especially since
>> below you mention GNU and the FSF.)
>>
>> I'm certainly confused by your mention of Brian Kernighan, one of
>> the authors of the original book.
>
> The phrase "might be in the works" means only that there is a possibility
> that a new edition might be in preparation. Is that what's confusing?

It was various things that confused me (but not the "in the works" per se):
- "might be in the works" vs. "the overnight delivery is in-hand now"
- "GNU" and "FSF" vs. "The [original][commercial] Awk Book"
- and the date "August 2022" I couldn't assign to both books mentioned

>
> Bryan is clearly talking about a new version of the original book, but
> he is referring to the most vague suggestion that there might, soon, be
> a new edition. As far as I can tell there isn't one, but there could be
> one "in the works" (i.e. in preparation).

I am certainly interested in any new version. I read his post as if he
had already got it. But I didn't find anything online.

Janis

Bryan

Mar 15, 2023, 11:31:04 AM

I apologize for the confusion!

I will make a note on the Brian Kernighan video thread - the video I listened to/watched when stuck (not a bad idea, IMHO).

Ed Morton

Mar 15, 2023, 1:12:11 PM

On 3/15/2023 10:31 AM, Bryan wrote:
> I apologize for the confusion!
>
> I will make a note on the Brian Kernighan video thread - the video I listened to/watched when stuck (not a bad idea, IMHO).

You're posting on usenet, not a forum, so please make sure every post
has enough context included to make sense stand-alone. Right now you're
truncating/removing all context on all of your posts.

Thanks.

Kpop 2GM

Aug 1, 2023, 12:11:22 AM

> "The Awk Book" presumably refers to the original "The AWK Programming
> Language" by Aho, Kernighan, and Weinberger, published in 1988.

I've seen the entirety of the original 1988 book scanned and viewable in PDF format online

( I'll refrain from linking it here since I'm uncertain about copyrights of the PDFs, but shouldn't be too hard to locate via google search or somewhere on github )

That said, even the original authors didn't do a particularly good job at selling awk's real strengths. If I began my awk journey with that book, I would've jumped ship to perl long, long ago.

Thank goodness I didn't step into that sarlacc pit that is perl5, or worse, raku.

The 4Chan Teller

#####################

Janis Papanagnou

Aug 1, 2023, 11:19:45 AM

On 01.08.2023 06:11, Kpop 2GM wrote:
>> "The Awk Book" presumably refers to the original "The AWK
>> Programming Language" by Aho, Kernighan, and Weinberger, published
>> in 1988.
>

> That said, even the original authors didn't do a particularly good job
> at selling awk's real strengths.

When I first read about the awk command I was curious and looking
for more detailed information than just "it's a language to process
text patterns", so I was quite glad to find that book. It came out
very quickly, only a year after the official release (three years
after the stable version had been developed). The book is very well
written and provides everything you need to understand the concepts
of Awk which are, IMO, the "real strengths" of the Awk language. Of
course it's not a long-developed "hacker book" with tips and tricks.
Nor does it have all that fancy stuff that we have been publishing or
discussing here in this newsgroup during the past decades. I agree
with you, though, that there wasn't - maybe still isn't - anything
worthwhile on that "hacker level". But I wouldn't blame that old book
or its authors for this deficiency. After all, folks who came up with
advanced ideas likely read that book (and maybe other later sources)
to develop application ideas that the original authors did not have
in mind.

And I also think that the more advanced methods that contribute
further to Awk's strengths would likely have repelled potential users;
many are cryptic and not too easy to understand for newbies. - The
book was, IMHO, exactly what was necessary at that time! - I would
still recommend it to Awk-beginners, even today.[*]

> If I began my awk journey with that
> book, I would've jumped ship to perl long, long ago.

I started with that book (and a brain that came for free),
and nothing else. (And at times I still look into that book to
look things up.)

With which sources did you "[begin your] awk journey", since you
seem to avoid Perl and enjoy Awk on an advanced level?

Janis

[*] With the cutback of the unpleasantly high price of the booklet.

jeorge

Aug 1, 2023, 4:32:36 PM

On 8/1/23 9:19 AM, Janis Papanagnou wrote:
> On 01.08.2023 06:11, Kpop 2GM wrote:
>>> "The Awk Book" presumably refers to the original "The AWK
>>> Programming Language" by Aho, Kernighan, and Weinberger, published
>>> in 1988.

<snip>
> .. I would still recommend it to Awk-beginners, even today.[*]
<snip>
> [*] With the cutback of the unpleasantly high price of the booklet.

Speaking of, I came across an announcement of a new edition:

The AWK Programming Language, Second Edition
https://awk.dev/
"The book will be available by the end of September."

Janis Papanagnou

Aug 1, 2023, 5:02:53 PM

It would be interesting to know whether it's just a reprint or
a reworked (updated/enhanced/extended) edition.

Janis

Keith Thompson

Aug 1, 2023, 5:14:35 PM

Kpop 2GM <jason....@gmail.com> writes:
>> "The Awk Book" presumably refers to the original "The AWK Programming
>> Language" by Aho, Kernighan, and Weinberger, published in 1988.
>
> I've seen the entirety of the original 1988 book scanned and viewable in PDF format online
>
> ( I'll refrain from linking it here since I'm uncertain about
> copyrights of the PDFs, but shouldn't be too hard to locate via google
> search or somewhere on github )

I'm far more certain. The 1988 book is still under copyright, and any
PDF copy that's not explicitly authorized by the publisher is in
violation of that copyright.

(The 1988 AWK book doesn't appear to be available in electronic form.
Amazon has it in paperback for $114.71. The second edition is supposed
to be available 2023-09-22, at a much more reasonable price.)

[...]

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com

Will write code for food.

Keith Thompson

Aug 1, 2023, 5:20:50 PM

A mere reprint would not be called the "Second Edition".

From the cited web page:

The first edition was written by Al Aho, Brian Kernighan and Peter
Weinberger in 1988. Awk has evolved since then, there are multiple
implementations, and of course the computing world has changed
enormously. The new edition of the Awk book reflects some of those
changes.

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com

Will write code for food.

Janis Papanagnou

Aug 1, 2023, 6:33:56 PM

On 01.08.2023 23:20, Keith Thompson wrote:
> Janis Papanagnou <janis_pap...@hotmail.com> writes:
>>

>> It would be interesting to know whether it's just a reprint or
>> a reworked (updated/enhanced/extended) edition.
>
> A mere reprint would not be called the "Second Edition".

Ah, okay, thanks for the hint.[*]

I can only speak for publishers hereabouts; "zweite Auflage" (en.
"second edition") just means a new edition after the first one, and
if there isn't anything mentioned like "überarbeitete" (en. revised),
"verbesserte" (en. improved), "durchgesehene" (en. revised version),
"korrigierte" (en. corrected), "erweiterte" (en. extended), or many
other possible adjectives declaring the type of the edition, then
it's usually (or even generally?) just a reprint because of new or
significantly more customer demand than originally expected.

> From the cited web page:

> [...]

And thanks for the quote. (I could have looked it up myself but was
too lazy.)

Janis

[*] Though I have also seen in the English domain books that have a
note "rev. ed." adjective (e.g. Bolsky, Korn), so I guess it varies?