Trimming leading and trailing blanks (reg exp question)

Kenny McCormack

Sep 11, 2015, 4:31:27 PM
In GAWK, is the following function correct (for the indicated purpose):

function strTrim(s) {
return gensub(/^ *| *$/,"","g",s)
}

My testing indicates that it works right, but I'm never all that sure about
the precedence rules in reg exps. They seem kinda arbitrary to me.
And I don't like to crud things up with parens if I don't have to. I.e.,
I'd rather have an official word that the above is OK, rather than to just
"play safe" and put in a bunch of parens.

--
Mike Huckabee has yet to consciously uncouple from Josh Duggar.

Janis Papanagnou

Sep 11, 2015, 5:56:33 PM
On 11.09.2015 at 23:31, Kenny McCormack wrote:
> In GAWK, is the following function correct (for the indicated purpose):
>
> function strTrim(s) {
> return gensub(/^ *| *$/,"","g",s)
> }
>
> My testing indicates that it works right, but I'm never all that sure about
> the precedence rules in reg exps. They seem kinda arbitrary to me.
> And I don't like to crud things up with parens if I don't have to. I.e.,
> I'd rather have an official word that the above is OK, rather than to just
> "play safe" and put in a bunch of parens.

Again an "official word" required? (Who could give that with such
questions?) - Looks right to me. - Don't be too paranoid! :-)

Janis

Andrew Schorr

Sep 11, 2015, 7:55:47 PM
On Friday, September 11, 2015 at 4:31:27 PM UTC-4, Kenny McCormack wrote:
> In GAWK, is the following function correct (for the indicated purpose):
>
> function strTrim(s) {
> return gensub(/^ *| *$/,"","g",s)
> }
>

An interesting question is whether it's faster to use " +" instead of " *". I have no idea, but would be curious to learn if somebody feels like benchmarking it.

Regards,
Andy

Kenny McCormack

Sep 12, 2015, 5:56:25 AM
In article <msvimd$tqo$1...@speranza.aioe.org>,
Heh heh - good advice.

But the reason I posted was because I've solved this problem so many times
over the years, and each time, I have to ask myself about the precedence
rules. I seem to remember there being some case(s) where it does matter
and you have to parenthesize in order to get what you want.

Oh yeah, now I remember! The weird case is when you do something like:

/^something|somethingElse$/

Now, that matches either:
/^something/
or:
/somethingElse$/
And not:

/^(something|somethingElse)$/

As was probably intended.

But, in the instant case, where we are looking for leading or trailing
spaces, the non-parenthesized form is correct. So, looks like problem
solved.
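
For anyone who wants to see the precedence rule in action, here is a minimal sketch. Alternation binds more loosely than anchors and concatenation, so the unparenthesized pattern behaves as (^foo)|(bar$). The function name classify and the sample words are made up for illustration.

```shell
# Compare /^foo|bar$/ with /^(foo|bar)$/ on a few sample words.
# classify and the sample words are hypothetical, chosen for this demo.
classify() {
    awk '{
        # Unparenthesized: matches lines starting with foo OR ending with bar.
        loose = ($0 ~ /^foo|bar$/)   ? "yes" : "no"
        # Parenthesized: matches only lines that are exactly foo or bar.
        tight = ($0 ~ /^(foo|bar)$/) ? "yes" : "no"
        print $0 ": unparenthesized=" loose ", parenthesized=" tight
    }'
}

printf '%s\n' foo bar foobaz bazbar | classify
```

The two patterns agree on "foo" and "bar" but differ on "foobaz" and "bazbar", which only the unparenthesized form accepts. For trimming, each alternative is supposed to carry its own anchor, so the unparenthesized form is the right one.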

--
"If our country is going broke, let it be from
feeding the poor and caring for the elderly.
And not from pampering the rich and fighting
wars for them."

--Living Blue in a Red State--

Ed Morton

Sep 12, 2015, 9:58:21 AM
On 9/11/2015 3:31 PM, Kenny McCormack wrote:
> In GAWK, is the following function correct (for the indicated purpose):
>
> function strTrim(s) {
> return gensub(/^ *| *$/,"","g",s)
> }
>
> My testing indicates that it works right, but I'm never all that sure about
> the precedence rules in reg exps. They seem kinda arbitrary to me.
> And I don't like to crud things up with parens if I don't have to. I.e.,
> I'd rather have an official word that the above is OK, rather than to just
> "play safe" and put in a bunch of parens.
>

The precedence is correct for what you want, but you should be using "+" instead
of "*" as you only want to operate on 1 or more spaces, not zero spaces, even
though functionally it won't matter in this case.

Ed.

Kenny McCormack

Sep 12, 2015, 10:22:48 AM
In article <mt1aua$pnn$1...@dont-email.me>,
Agreed.

And usually when I'm doing this for real, it'll be [ \t] not just space.

And, of course if you're the sort who goes for such things, you might also
use something like [:space:] or whatever that is...

--
The motto of the GOP "base": You can't *be* a billionaire, but at least you
can vote like one.

Ed Morton

Sep 12, 2015, 10:33:28 AM
Yes, [[:space:]] would be best so it trims leading/trailing newlines too (when
the RS is set to something other than a newline).
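
A small sketch of that difference, assuming an awk that supports POSIX character classes (gawk does). RS is set to ";" so that records can contain newlines; the function names and sample input are invented for the demo. It prints the length of each record after trimming with [[:space:]] and with [[:blank:]].

```shell
# With RS=";" a record may begin or end with a newline.
# [[:space:]] includes newline; [[:blank:]] is only space and tab.
# trim_lengths, trimSpace and trimBlank are hypothetical names.
trim_lengths() {
    awk -v RS=';' '
    function trimSpace(s) { gsub(/^[[:space:]]+|[[:space:]]+$/, "", s); return s }
    function trimBlank(s) { gsub(/^[[:blank:]]+|[[:blank:]]+$/, "", s); return s }
    NF {
        printf "record %d: [[:space:]] leaves %d chars, [[:blank:]] leaves %d chars\n",
               NR, length(trimSpace($0)), length(trimBlank($0))
    }'
}

# Record 1 is " one ", record 2 is "\n two \n".
printf ' one ;\n two \n;' | trim_lengths
```

For the second record the [[:blank:]] version should leave all 7 characters in place, because the string starts and ends with a newline and neither anchored alternative can match, while the [[:space:]] version reduces it to "two".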

Ed.

Kees Nuyt

Sep 12, 2015, 10:37:01 AM
I don't feel like benchmarking today, but I think this is faster:

function strTrim(s){
sub(/^ +/,"",s)
sub(/ +$/,"",s)
return s
}

I'd usually go for character classes though:
function strTrim(s){
sub(/^[[:blank:]]+/,"",s)
sub(/[[:blank:]]+$/,"",s)
return s
}

--
Regards,
Kees Nuyt

Hermann Peifer

Sep 12, 2015, 10:47:29 AM
On 2015-09-12 16:36, Kees Nuyt wrote:
>
> I don't feel like benchmarking today, but I think this is faster:
>
> function strTrim(s){
> sub(/^ +/,"",s)
> sub(/ +$/,"",s)
> return s
> }
>
> I'd usually go for character classes though:
> function strTrim(s){
> sub(/^[[:blank:]]+/,"",s)
> sub(/[[:blank:]]+$/,"",s)
> return s
> }
>

gsub() is even faster, as far as I remember some earlier benchmarking:

gsub(/^[[:space:]]+|[[:space:]]+$/,"",s)

Hermann

Luuk

Sep 12, 2015, 11:12:53 AM
OK, I did some benchmarking ;)

for (( i=1; i<1000000; i++ )) do echo "$i $i $i "; done >testfile

~/tmp> time awk -f gensub.awk testfile >gensub.tst

real 0m5.470s
user 0m5.188s
sys 0m0.252s
~/tmp> time awk -f sub.awk testfile >sub.tst

real 0m3.382s
user 0m3.200s
sys 0m0.100s
~/tmp> time awk -f gsub.awk testfile >gsub.tst

real 0m2.482s
user 0m2.448s
sys 0m0.032s


~/tmp> cat gensub.awk
function strTrim(s)
{
return gensub(/^ *| *$/,"","g",s)
}

{
$1 = strTrim($1);
$2 = strTrim($2);
$3 = strTrim($3);
print $1, $2, $3;
}
~/tmp> cat sub.awk
function strTrim(s)
{
sub(/^ +/,"",s);
sub(/ +$/,"",s);
return s
}

{
$1 = strTrim($1);
$2 = strTrim($2);
$3 = strTrim($3);
print $1, $2, $3;
}
~/tmp> cat gsub.awk
function strTrim(s)
{
return gsub(/^ +| +$/,"",s)
}

{
$1 = strTrim($1);
$2 = strTrim($2);
$3 = strTrim($3);
print $1, $2, $3;
}


Kees Nuyt

Sep 14, 2015, 4:59:06 AM
Great! Thanks!

>
>~/tmp> cat gensub.awk

[...]

--
Regards,
Kees Nuyt

Dave Sines

Sep 14, 2015, 12:51:38 PM
Luuk <lu...@invalid.lan> wrote:

> ~/tmp> cat gsub.awk
> function strTrim(s)
> {
> return gsub(/^ +| +$/,"",s)

gsub(/^ +| +$/,"", s)
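
The point of the correction, as a hedged sketch: gsub() modifies its third argument in place and returns the number of substitutions made, so "return gsub(...)" hands back a count rather than the trimmed string. The function names trimCount and trimString and the sample input are made up for illustration.

```shell
# Contrast returning the result of gsub() with returning the modified string.
# demo, trimCount and trimString are hypothetical names.
demo() {
    awk '
    # Returns the number of substitutions, not the string.
    function trimCount(s)  { return gsub(/^ +| +$/, "", s) }
    # Runs the same substitution, then returns the modified copy.
    function trimString(s) { gsub(/^ +| +$/, "", s); return s }
    { printf "count=%d string=[%s]\n", trimCount($0), trimString($0) }
    '
}

printf '  a b  \n' | demo
```

Here the count is 2 (one leading run, one trailing run of spaces) and the string version yields "a b".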

Luuk

Sep 15, 2015, 12:38:07 PM
That would not make much difference, at least not in the timings...

~/tmp> time awk -f gensub.awk testfile >gensub.txt

real 0m5.413s
user 0m5.256s
sys 0m0.144s
~/tmp> time awk -f sub.awk testfile >sub.txt

real 0m3.204s
user 0m3.128s
sys 0m0.064s
~/tmp> time awk -f gsub.awk testfile >gsub.txt

real 0m2.293s
user 0m2.236s
sys 0m0.052s

Dave Sines

Sep 16, 2015, 8:56:44 AM
The use of $1, $2 and $3 as arguments to strTrim ensures that leading and
trailing spaces are already removed so no substitutions occur.

With the main blocks in the scripts replaced with:

{
print strTrim($0)
}

The double sub variant is marginally quicker (around 0.1 seconds) than
gsub.
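
To illustrate the first point, a minimal sketch: with the default FS, awk splits on runs of blanks and ignores leading and trailing ones, so the fields never contain the spaces that strTrim is meant to remove, whereas $0 still does. The function name show_fields and the input line are invented for the demo.

```shell
# Show that default field splitting already strips surrounding blanks
# from $1 and $2, while $0 keeps the record as read.
# show_fields is a hypothetical name.
show_fields() {
    awk '{ printf "$1=[%s] $2=[%s] $0=[%s]\n", $1, $2, $0 }'
}

printf '  1   2  \n' | show_fields
```

Both fields come out without any surrounding spaces, so a benchmark that passes $1, $2 and $3 mostly measures the cost of a failed match rather than the cost of an actual substitution.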

Janis Papanagnou

Sep 16, 2015, 11:56:35 AM
My tests show that with _small strings_ there's about a 50% degradation
of performance when using the * instead of the + quantifier.
With _large strings_ that gets even worse; with a 350+ character string
I get factors of 3-4 times slower processing with * if compared to + .

Note: Tested only with the specific regexps given in this thread.

Janis

>
> Regards, Andy
>

Janis Papanagnou

Sep 16, 2015, 11:58:24 AM
On 16.09.2015 14:57, Dave Sines wrote:
> [...]
> The double sub variant is marginally quicker (around 0.1 seconds) than
> gsub.

In my environment gensub() has good characteristics, in typical cases
slightly better than gsub() or two sub(), but it degrades (similar with
gsub()) if used on large strings (350+ character) by roughly 15-20% if
compared to the two sub()'s.

Note: Tested only with the specific regexps given in this thread. (The
characteristics of the anchors may have some influence here.)

Janis

Harlan Grove

Sep 25, 2015, 2:21:07 PM
Hermann Peifer wrote:
...
>gsub() is even faster, as far as I remember some earlier benchmarking:
>
> gsub(/^[[:space:]]+|[[:space:]]+$/,"",s)
...

On my system, this is slightly slower than the gensub approach, and both the gsub and gensub approaches take roughly twice as long as two sub calls.

Hermann Peifer

Sep 26, 2015, 1:04:54 AM
My main use case is (actually: was) to simulate normalize-space() when
processing XML data files with gawk's XML extension. It could well be
that things went faster because I collapsed the 2 earlier functions into
one, rather than by replacing the 2 sub() calls by a single gsub() call.

I vaguely remember having tried various options where [2] was fastest.
It might well be that my memory is wrong. I am getting older every day,
as I noted recently.

;-) Hermann

[1]

function normalize_space(str)
{
str = trim(str)
gsub(/[[:space:]]+/, " ", str)
return str
}

function trim(str)
{
sub(/^[[:space:]]+/, "", str)
sub(/[[:space:]]+$/, "", str)
return str
}

[2]

function normalize_space(str)
{
gsub(/^[[:space:]]+|[[:space:]]+$/, "", str)
gsub(/[[:space:]]+/, " ", str)
return str
}
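
A quick usage sketch for variant [2], assuming an awk with POSIX character class support (gawk has it). The shell wrapper normalize and the sample input are made up; the awk function body is the one from [2].

```shell
# Wrap variant [2] of normalize_space() so it can be tried on a line of input.
# normalize is a hypothetical wrapper name.
normalize() {
    awk '
    function normalize_space(str) {
        # Strip leading and trailing whitespace in one pass.
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", str)
        # Collapse every remaining run of whitespace into a single space.
        gsub(/[[:space:]]+/, " ", str)
        return str
    }
    { print "[" normalize_space($0) "]" }
    '
}

# The \t in the printf format becomes a real tab character.
printf '  foo \t  bar   baz \n' | normalize
```

The brackets in the output make it easy to see that no whitespace is left at either end.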
