mawk versus gawk performance

Andrew Savige

Feb 11, 2000, 3:00:00 AM

I ran the following test on Linux on a 12 MB and a 24 MB file:
    cp fred fred.tmp
    mawk '{print;}' fred >fred.tmp
    gawk '{print;}' fred >fred.tmp
    perl -ne 'print' fred >fred.tmp

                 12 MB           24 MB
    Timings   user    sys     user    sys
    cp        0.02   1.70     0.03   2.97
    mawk      3.30   0.59     4.90   1.10
    gawk      6.19   0.57    10.20   1.12
    perl      5.47   0.71     9.59   1.11

Does anyone know why mawk is so much faster?

Andrew Savige

Peter S. Tillier

Feb 11, 2000, 3:00:00 AM

Andrew Savige wrote in message <87v7ua$678$1...@merki.connect.com.au>...

<snip>
Possibly because mawk's internal compiled form is optimized differently
from gawk's. What's really interesting is that it's a lot faster than perl!

Peter
--
Peter S Tillier Peter....@BTInternet.com
pet...@eq1152.demon.co.uk
Opinions expressed are my own and not necessarily those
of my employer

Eiso AB

Feb 11, 2000, 3:00:00 AM

Andrew Savige wrote:
>
> I ran the following test on Linux on a 12 MB and a 24 MB file:
>     cp fred fred.tmp
>     mawk '{print;}' fred >fred.tmp
>     gawk '{print;}' fred >fred.tmp
>     perl -ne 'print' fred >fred.tmp
>
>                  12 MB           24 MB
>     Timings   user    sys     user    sys
>     cp        0.02   1.70     0.03   2.97
>     mawk      3.30   0.59     4.90   1.10
>     gawk      6.19   0.57    10.20   1.12
>     perl      5.47   0.71     9.59   1.11
>
> Does anyone know why mawk is so much faster?
>

> Andrew Savige

from mawk-1.3.3/man/mawk.doc:
"
mawk, on the other hand, allows RS to be a regular expression.
When "\n" appears in records, it is treated as space,
and FS always determines fields.

Removing the line at a time paradigm can make some programs
simpler and can often improve performance. For example,
redoing example 3 from above,

    BEGIN { RS = "[^A-Za-z]+" }

    { word[ $0 ] = "" }

    END { delete word[ "" ]
          for( i in word ) cnt++
          print cnt
    }

counts the number of unique words by making each word a
record. On moderate size files, mawk executes twice as
fast, because of the simplified inner loop.
"
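
The word-count program quoted above can be tried on a small input. This is only a minimal sketch: it assumes that "awk" is mawk, gawk, or another awk that accepts a regular expression as RS (POSIX leaves a multi-character RS unspecified), and the sample text is made up for illustration.

```shell
# Count unique words by making each run of letters its own record.
# Assumption: "awk" here is mawk, gawk, or another awk with regex RS.
count=$(printf 'the cat and the dog\nsaw the cat\n' |
awk 'BEGIN { RS = "[^A-Za-z]+" }
     { word[$0] = "" }
     END { delete word[""]      # drop the empty record, if one was seen
           for (i in word) cnt++
           print cnt + 0 }')    # + 0 prints 0 rather than an empty line
echo "$count"
```

The sample holds eight words, five of them distinct, so the count printed should be 5. Note that the match is case-sensitive: "The" and "the" would count as two words.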

There are some more comparisons of [mng]awk, perl and awka at
http://www.linuxstart.com/~awka/compare.html

Eiso
__________________________________________________________________

 o Eiso AB
 o Dept. of Biochemistry
   University of Groningen
   The Netherlands
__________________________________________________________________

Tim Menzies

Feb 12, 2000, 3:00:00 AM

On Fri, 11 Feb 2000, Peter S. Tillier wrote:

> Andrew Savige wrote in message <87v7ua$678$1...@merki.connect.com.au>...

> >I ran the following test on Linux on a 12 MB and a 24 MB file:
> >    cp fred fred.tmp
> >    mawk '{print;}' fred >fred.tmp
> >    gawk '{print;}' fred >fred.tmp
> >    perl -ne 'print' fred >fred.tmp
> >
> >                 12 MB           24 MB
> >    Timings   user    sys     user    sys
> >    cp        0.02   1.70     0.03   2.97
> >    mawk      3.30   0.59     4.90   1.10
> >    gawk      6.19   0.57    10.20   1.12
> >    perl      5.47   0.71     9.59   1.11
> >
> >Does anyone know why mawk is so much faster?
> >

> <snip>
> Possibly because mawk's internal compiled form is optimized differently
> from gawk's. What's really interesting is that it's a lot faster than perl!
>

Other results suggest that perl is not necessarily faster than awk:

@MISC{kern98,
  author = "B.W. Kernighan and C.J. Van Wyk",
  title  = "Timing Trials, or, the Trials of Timing:
            Experiments with Scripting and User-Interface
            Languages",
  year   = "1998",
  note   = "From Lucent Technologies Inc. Available from
            \url{http://netlib.bell-labs.com/cm/cs/who/bwk/interps/pap.html}",
}

--
T...@Menzies.com; http://www.tim.menzies.com | PH +1-304-367-8447
NASA, 100 University Dr,Fairmont WV, 26554,USA | FAX +1-304-367-8211

The older I grow, the less important the comma becomes. Let the reader
catch his own breath. -- Elizabeth Clarkson Zwart

Brian Inglis

Feb 24, 2000, 3:00:00 AM

On Fri, 11 Feb 2000 07:43:13 +1100, "Andrew Savige"
<andrew...@ir.com> wrote:

>I ran the following test on Linux on a 12 MB and a 24 MB file:
>    cp fred fred.tmp
>    mawk '{print;}' fred >fred.tmp
>    gawk '{print;}' fred >fred.tmp
>    perl -ne 'print' fred >fred.tmp
>
>                 12 MB           24 MB
>    Timings   user    sys     user    sys
>    cp        0.02   1.70     0.03   2.97
>    mawk      3.30   0.59     4.90   1.10
>    gawk      6.19   0.57    10.20   1.12
>    perl      5.47   0.71     9.59   1.11
>
>Does anyone know why mawk is so much faster?
>

>Andrew Savige

IMHO, not knowing mawk, the difference is probably in the built-in
limits: program and data line lengths, program size, variable
and array numbers and sizes, etc. This applies to most versions
of awk. Simplifying assumptions and code may then be used to
speed up the code, on that platform, at the cost of not being
able to process all possible data which might be thrown at it.

GNU software in general, including gawk, is designed to be
portable across many platforms and to have no built-in limits:
it accepts an awk program of any size, reads lines of any length,
and stores any amount of data in any number of variables or
arrays of any size, until your page/swap space or patience is
exhausted, and the program crashes, is crashed, or completes.
So gawk uses very flexible and portable algorithms everywhere
you might otherwise hit a stop, and this extra code exacts a
price in performance, particularly in simple code like the test
here. In a more elaborate benchmark, running more complex code,
you may see a smaller difference. With larger programs and
larger chunks of data input, particularly when storing the data
internally in large arrays, gawk will keep running and finish
the job, but other awks will squawk and splat.

Try making your test file input lines over 2K long, with some
long fields > 256 characters, store a few long fields in arrays
in your program, and write them out in reverse order at the end.
Most awks will not read lines over a certain fixed size, and may
complain and exit, overwrite something and crash, read only part
of each line and not work properly, run out of memory storing
data, etc.
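
A test along those lines might be sketched as follows. The file name longlines.txt and the sizes are hypothetical, chosen only to exceed the 2K-line and 256-character-field thresholds mentioned above; substitute mawk, gawk, or another awk for "awk" to compare them.

```shell
# Generate 5 lines of about 3000 characters, each holding ten 300-character fields.
awk 'BEGIN {
    for (i = 1; i <= 300; i++) pad = pad "x"
    for (n = 1; n <= 5; n++) {
        line = n
        for (j = 1; j <= 10; j++) line = line " " pad
        print line
    }
}' > longlines.txt

# Store one long field per line, then write them out in reverse order.
# Only the line number and field length are printed, to keep the output short.
result=$(awk '{ id[NR] = $1; fld[NR] = $2 }
END { for (i = NR; i >= 1; i--) print id[i], length(fld[i]) }' longlines.txt)
echo "$result"
```

An awk that copes with the input should print the line numbers from 5 down to 1, each followed by 300. An implementation with a smaller fixed limit may instead report an error or a shorter length.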

But gawk should just suck it in, process it and spit it out
again, regardless of what you do. ISTR Arnold Robbins posting the
gawk limits fairly recently, and I remember they were either
unspecified or ridiculously large -- you could try looking on
deja for the article.

Thanks. Take care, Brian Inglis Calgary, Alberta, Canada
--
Brian_...@CSi.com (Brian dot Inglis at SystematicSw dot ab dot ca)
use address above to reply

brian hiles

Feb 25, 2000, 3:00:00 AM

Peter S. Tillier <Peter....@BTinternet.com> wrote:
> Andrew Savige wrote in message <87v7ua$678$1...@merki.connect.com.au>...

> > ...

> Possibly because mawk's internal compiled form is optimized differently
> from gawk's. What's really interesting is that it's a lot faster than perl!

Well, since we are on this subject, I think everyone would very
much enjoy a treatise on the benchmarking of interpreters, including
scheme, limbo (AT&T's alternative to Java), tcl, perl, awk, and C,
by none other than Brian Kernighan (anyone heard of him in this
newsgroup?)

http://cm.bell-labs.com/cm/cs/who/bwk/interps/pap.html

To summarize, awk benchmarks quite favorably against perl.

-Brian

Andrew Savige

Feb 25, 2000, 3:00:00 AM

brian hiles <b...@rainey.blueneptune.com> wrote in message
news:sbbpqr...@corp.supernews.com...

I enjoyed BrianK's article. However, I could not find any mention in
his article of WHICH version of awk he used. This can be crucial.
For example, on Solaris, running the following trivial awk program:

    awk '/^ *SERVER *#DBCOL *$/,/^ *END_SERVER *$/ { print "!" $0; next; }
         { print; }' fred > fred.tmp

here are the results:

                  user  system   total
    oawk         55.92    0.65   56.57
    nawk         10.78    0.67   11.45
    mawk          4.73    0.61    5.34
    onetrueawk   49.96    0.77   50.73
    perl         11.22    0.62   11.84

On this test, BrianK's onetrueawk performed miserably
(to be fair, onetrueawk 'out of the box' did not use the compiler's -O
switch).
The implementation of a given language is sometimes more important
than the choice of language.
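
For readers unfamiliar with range patterns, here is what the test program above does on a tiny input. The five sample lines are invented for illustration and are not taken from the original fred file.

```shell
# Lines from the SERVER #DBCOL marker through END_SERVER get a "!" prefix;
# every other line is passed through unchanged.
out=$(printf '%s\n' 'before' 'SERVER #DBCOL' 'host = a' 'END_SERVER' 'after' |
awk '/^ *SERVER *#DBCOL *$/,/^ *END_SERVER *$/ { print "!" $0; next }
     { print }')
echo "$out"
```

The three lines inside the range, markers included, come out prefixed with "!", while "before" and "after" are unchanged. Every input line is tested against at least one anchored regular expression, so the benchmark mostly exercises each implementation's regex matcher.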

AndrewS.

Patrick TJ McPhee

Feb 25, 2000, 3:00:00 AM

In article <irr9bssul0jsfjvf9...@4ax.com>,
Brian Inglis <Brian.do...@SystematicSw.ab.ca> wrote:

% >Does anyone know why mawk is so much faster?

% IMHO, not knowing mawk, the difference is probably in the built
% in limits: program and data line lengths, program size, variable

I'm not aware of any such limits in mawk. I'm not aware of any
advantage of using any awk implementation over mawk. I don't
believe there are any platforms which gawk compiles on and mawk
doesn't, and mawk is consistently faster.
--

Patrick TJ McPhee
East York Canada
pt...@interlog.com

Michael Mauch

Feb 25, 2000, 3:00:00 AM

Patrick TJ McPhee <pt...@interlog.com> wrote:

> I'm not aware of any such limits in mawk. I'm not aware of any
> advantage of using any awk implementation over mawk.

Is there a newer mawk than 1.3.3? Here's what mawk 1.3.3 says about its
compiled limits:

    compiled limits:
    max NF             32767
    sprintf buffer      1020

So if I have an awk program that sprintf's more than 1020 characters, I
have to recompile mawk?

$ mawk 'BEGIN{ for(i=1;i<=2000;i++) s=s"x";print sprintf("%s",s)}'
mawk: program limit exceeded: sprintf buffer size=1020
FILENAME="" FNR=0 NR=0

Oh. Or am I supposed to "just not use sprintf() with such long strings"?
Or maybe I put a comment in my awk program that says "only use a mawk
with a sprintf buffer size of at least 100000 characters"?

How do I increase that size, btw?
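
One way to find out whether a given awk has this problem is to probe it rather than guess. The sketch below assumes nothing about which implementation "awk" resolves to; probe_sprintf is a hypothetical helper name, and the run-time option mentioned in the closing comment is taken from the mawk man page and has not been verified against every build.

```shell
# probe_sprintf AWK N: build an N-character string and pass it through sprintf.
# Prints the resulting length, or "limit hit" if the awk exits with an error.
probe_sprintf() {
    "$1" -v n="$2" 'BEGIN {
        for (i = 1; i <= n; i++) s = s "x"
        print length(sprintf("%s", s))
    }' 2>/dev/null || echo "limit hit"
}

probe=$(probe_sprintf awk 2000)
echo "awk, 2000 characters: $probe"

# Assumption: the mawk man page describes "-W sprintf=num" for resizing the
# buffer at run time; check your build before relying on it, e.g.
#   mawk -W sprintf=100000 -f prog.awk     (prog.awk is a placeholder name)
```

An awk without a fixed buffer should report 2000, and one that stops at a compiled limit should report "limit hit". Passing mawk or gawk as the first argument probes a specific implementation.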

> I don't
> believe there are any platforms which gawk compiles on and mawk
> doesn't, and mawk is consistently faster.

The msdos/NOTES file that comes with mawk 1.3.3 suggests using mawk 1.2.2
on DOS, because:

| Version 1.3:
| The new array design will fail under msdos if you put more than
| 16K items into an array and then walk it with for(i in A).
| Unfortunately things will probably fail ungracefully. The new
| array design runs into 64K limits at 16K elements in an array and
| there are no checks in the code. This is fixable, but tedious and
| 1.2.2 works well on DOS.

Has a version of mawk been compiled with DJGPP or any other
32-bit environment for DOS yet? If not, there will be much fun with 64K
limits.

Regards...
Michael

Paul Eggert

Feb 25, 2000, 3:00:00 AM

pt...@interlog.com (Patrick TJ McPhee) writes:

>I'm not aware of any
>advantage of using any awk implementation over mawk.

Aside from mawk's hardwired limits, which are occasionally a problem,
I find that gawk's diagnostics are much better than mawk's.
It's not unreasonable to debug with gawk, and then use mawk if
extra speed is needed and if mawk's limits are not a problem.
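
A rough sketch of that workflow: run the same program under each installed implementation and check that the outputs agree before switching to the faster one. The program and the file name sample.txt are placeholders, and gawk and mawk are simply skipped if they are not installed.

```shell
# Run one awk program under every installed implementation and flag disagreements.
prog='{ n += NF } END { print n }'
printf 'a b c\nd e\n' > sample.txt

# Reference result from whatever "awk" is on this system.
ref=$(awk "$prog" sample.txt)
echo "awk: $ref"

for impl in gawk mawk; do
    command -v "$impl" >/dev/null 2>&1 || continue
    out=$("$impl" "$prog" sample.txt)
    if [ "$out" = "$ref" ]; then
        echo "$impl: agrees"
    else
        echo "$impl: differs ($out)"
    fi
done
```

The sample file has five fields in total, so the reference result is 5.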