
Gawk IGNORECASE=0 vs =1


J Naman

Feb 23, 2022, 4:34:42 PM
Benchmarks have come up in some recent posts here, so I benchmarked IGNORECASE=0 vs. =1 under gawk 5.1.1.
The result is no surprise: for this one benchmark, the IGNORECASE=1 score was about 128% of the IGNORECASE=0 score.
I am just reporting the quantitative difference. The wall-clock difference
can be non-trivial when processing large files with hundreds of
thousands of lines of text.
The benchmark was for a set of six statements:
four of the form:
if(str~/^text/) {return;}
plus two statements:
if(str~/^[A-Z]+$/) {return;}
if(str~/^[a-z]+$/) {return;}
The results over an aggregate 3 million loops:
Score  Test
817    Avg IGN=1
640    Avg IGN=0
IGN=1 is about 128% of IGN=0 (817/640)

Also, I reran the same test with "? at the beginning
and end of all six regexps. The scores were
not significantly different from the above.

* These Scores are scaled. I was previously warned
not to report actual CPU or clock times for one particular system.

Ed Morton

Feb 23, 2022, 4:55:36 PM
Were there 128% more matches or some other difference in the matched
strings? Without knowing what the input contained it's hard to know what
those results mean. What were you hoping to test by adding `"?` to the
regexps? Without knowing how IGN=1 compares to the alternative of
`tolower(str) ~ /^[a-z]+$/` I'm not sure what we could actually do with
this information.

Ed.

J Naman

Feb 23, 2022, 11:45:10 PM
Here are six results, scaled (not surprising to me):
       IC=0   IC=1   IC=1 vs IC=0
low    108    226    109% longer
Mix    100    228    128% longer
UP     111    225    102% longer

low = "include variable function namespace x"
Mix = "Include Variable Function NameSpace x"
UP = "INCLUDE VARIABLE FUNCTION NAMESPACE X"

function testmatch(str, x) { # all 7 regexps are tested on every call
    if (str ~ /^include variable function namespace x/) {x++} # lower
    if (str ~ /^INCLUDE VARIABLE FUNCTION NAMESPACE X/) {x++} # upper
    if (str ~ /^Include Variable Function NameSpace x/) {x++} # mixed
    if (str ~ /^iNCLUDE vARIABLE fUNCTION nAMEsPACE X/) {x++} # inverted case
    if (str ~ /^INclUde varIABLE FUnCtIon NamEspACe X/) {x++} # random case
    if (str ~ /^[A-Z ]+$/) {x++}
    if (str ~ /^[a-z ]+$/) {x++}
} # end of testmatch(str)

So, worst case, IGNORECASE=1 takes about twice as long. No surprise.
I forced all 7 regexps to be tested on every call because my real data doesn't match very often.
All of my regexps are mixed case, and the file data are supposed to be too.
tolower() on both the input and the regexp looks to be no better than
matching mixed-case input against a mixed-case regexp.
btw: 'random case' is a quirky feature of my editor that I never had any use for before.

Ed Morton

Feb 24, 2022, 7:51:12 AM
Am I right in thinking that by the above you mean your test script is
basically a script that calls that function some large number of times
in a loop with 1 of the stated strings, e.g.

BEGIN {
    low = "include variable function namespace x"
    for (i = 1; i <= 1000000; i++) testmatch(low)
}

>
> So, worst case, IGNORECASE=1 takes about twice as long. No surprise.
> I forced testing all 7 regexp are every call because my real data doesn't match very often.
> All of my regexp are mixed case and the file data are supposed to be.
> tolower() on both input and regexp looks to be no better than
> mixed case input to mixed case regexp
> btw: 'random case' is a quirky feature of my editor I never had any use for before.

I'm still struggling to understand what we're supposed to **do** with
the above information. I mean if we need to match a regexp against
mixed-case input we have 2 choices:

1) IGNORECASE=1; .. $0 ~ /foo/
2) tolower($0) ~ /foo/

and what we cannot do is just:

3) $0 ~ /foo/

so what can we do with the information that "1" would be slower than "3"
since we can't use "3" for this anyway? If you told us that "1" was
slower than "2" then we could use that information to write scripts
using "2" instead of "1" but I just don't see how the speed of "1" vs
the speed of "3" is something we can act on.
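
For anyone who wants to time options "1" and "2" against each other, a minimal sketch follows. The function names `match_tolower` and `match_ignorecase` are made up for illustration, `/foo/` is just the placeholder pattern from the list above, and option "1" assumes gawk is installed, since IGNORECASE is a gawk extension.

```shell
# Option 2: fold the input to lower case, then match a lower-case regexp.
# This form should work in any POSIX awk.
match_tolower() {
    awk 'tolower($0) ~ /foo/'
}

# Option 1: let gawk ignore case during the match (gawk only).
match_ignorecase() {
    gawk 'BEGIN { IGNORECASE = 1 } $0 ~ /foo/'
}

# Small smoke test of option 2: prints the lines "Foo" and "xFOOx".
printf 'Foo\nbar\nxFOOx\n' | match_tolower
```

Prefixing each call with `time` and feeding both the same large file would give the "1" vs "2" comparison on a particular system.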

Ed.

J Naman

Feb 24, 2022, 6:55:44 PM
Ed, you're quite right, and I apologize for not telling you what motivated all this. I have files of mixed-case text and regexps that are mixed case, e.g. /New York, NY/. I did an @include "foo" that pulled in a function that set IGNORECASE=1, and everything s-l-o-w-e-d down. Once I figured out that IGNORECASE was probably responsible, I benchmarked to see what the times were. So, for my data, the takeaway is to match text against regexps in their exact case with IGNORECASE=0 whenever possible. As I said, no surprise. Sorry if I have wasted people's time. John
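
One way an included library function could avoid leaving IGNORECASE=1 set for the rest of the program is to save and restore it around the match. The sketch below is only an illustration of that idea, not the code from the @include in question: `imatch` and `scoped_ignorecase_demo` are hypothetical names, and gawk is assumed to be installed.

```shell
# Sketch only: imatch() is a hypothetical helper name; requires gawk.
scoped_ignorecase_demo() {
    gawk '
    function imatch(str, re,    saved, hit) {
        saved = IGNORECASE      # remember the current setting
        IGNORECASE = 1          # ignore case for this one match only
        hit = (str ~ re)
        IGNORECASE = saved      # restore, so later matches are unaffected
        return hit
    }
    {
        # First column: case-insensitive helper.
        # Second column: ordinary exact-case match, still running with IGNORECASE=0.
        print imatch($0, "^new york"), ($0 ~ /^New York/)
    }'
}

# Usage: printf "NEW YORK, NY\nNew York, NY\n" | scoped_ignorecase_demo
```

With this pattern the rest of the script keeps the faster exact-case matching, and only the calls that really need case folding pay for it.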

Ed Morton

Feb 24, 2022, 7:16:42 PM
Ah, now I understand what this was about. Thanks for the information.

Ed.

Kpop 2GM

Mar 28, 2022, 3:59:05 PM
% ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 'tolower($0)~"^include variable function namespace x$"' FS='^$' ) | pvE9)| wc5

in0: 3.40GiB 0:00:04 [ 818MiB/s] [ 818MiB/s] [=============================>] 100%
out9: 64.3MiB 0:00:04 [15.1MiB/s] [15.1MiB/s] [ <=> ]
( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 4.07s user 0.79s system 113% cpu 4.275 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.


% ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 'toupper($0)~"^INCLUDE VARIABLE FUNCTION NAMESPACE X$"' FS='^$' ) | pvE9)| wc5
in0: 80.2MiB 0:00:00 [ 801MiB/s] [ 801MiB/s] [> ] 2% ETA 0:00:00
out9: 64.3MiB 0:00:04 [15.1MiB/s] [15.1MiB/s] [ <=> ]
in0: 3.40GiB 0:00:04 [ 820MiB/s] [ 820MiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 4.06s user 0.79s system 113% cpu 4.268 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.


% ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 '/^[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Vv][Aa][Rr][Ii][Aa][Bb][Ll][Ee] [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] [Nn][Aa][Mm][Ee][Ss][Pp][Aa][Cc][Ee] [Xx]$/' FS='^$' )| pvE9) | wc5

out9: 64.3MiB 0:00:02 [32.0MiB/s] [32.0MiB/s] [ <=> ]
in0: 3.40GiB 0:00:02 [1.70GiB/s] [1.70GiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 1.58s user 0.82s system 118% cpu 2.026 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.


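What I'm seeing is that if you simply turn EVERY letter into a bracket expression covering both its upper and lower case, and also prevent awk from splitting fields (FS='^$'), the run takes less than half the time (about 2.0 s vs. 4.3 s total). And that's only for mawk-2. For gawk, the savings are unearthly (about 6 s vs. 43-45 s total in the runs below):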



out9: 64.3MiB 0:00:43 [1.48MiB/s] [1.48MiB/s] [ <=> ]
in0: 3.40GiB 0:00:43 [80.5MiB/s] [80.5MiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 43.01s user 1.12s system 101% cpu 43.317 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.



in0: 3.40GiB 0:00:44 [78.0MiB/s] [78.0MiB/s] [=============================>] 100%
out9: 64.3MiB 0:00:44 [1.44MiB/s] [1.44MiB/s] [ <=> ]
( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 44.35s user 1.14s system 101% cpu 44.671 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.



out9: 64.3MiB 0:00:05 [10.7MiB/s] [10.7MiB/s] [ <=> ]
in0: 3.40GiB 0:00:05 [ 582MiB/s] [ 582MiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 5.83s user 0.81s system 110% cpu 6.006 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

=====================
echo
( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se 'toupper($0)~"^INCLUDE VARIABLE FUNCTION NAMESPACE X$"' FS='^$' ) | pvE9)| wc5
sleep 1
( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se 'tolower($0)~"^include variable function namespace x$"' FS='^$' ) | pvE9)| wc5
sleep 1
( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se '/^[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Vv][Aa][Rr][Ii][Aa][Bb][Ll][Ee] [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] [Nn][Aa][Mm][Ee][Ss][Pp][Aa][Cc][Ee] [Xx]$/' FS='^$' )| pvE9) | wc5
===============================================================
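
Typing the bracket-expression form by hand is error-prone, so here is a rough sketch of generating it from a plain string. `ci_regex` is a hypothetical helper name, and it assumes the input contains only letters, digits, and spaces: regexp metacharacters and backslashes are not escaped.

```shell
# Hypothetical helper: turn "include x" into "[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Xx]".
# Assumes the argument has no regexp metacharacters or backslashes.
ci_regex() {
    awk -v s="$1" 'BEGIN {
        out = ""
        n = length(s)
        for (i = 1; i <= n; i++) {
            c  = substr(s, i, 1)
            lo = tolower(c)
            up = toupper(c)
            if (lo != up)
                out = out "[" up lo "]"   # a letter: match either case
            else
                out = out c               # digit, space, etc.: copy unchanged
        }
        print out
    }'
}

ci_regex 'include x'    # prints [Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Xx]

# Use the generated pattern as an anchored dynamic regexp.
printf 'INCLUDE X\nother\n' | awk -v re="^$(ci_regex 'include x')\$" '$0 ~ re'
```

The second command prints only `INCLUDE X`. Whether a dynamic regexp built this way is as fast as the literal /.../ form in the timings above has not been measured here.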
