Does anyone know why mawk is so much faster?
Andrew Savige
Peter
--
Peter S Tillier Peter....@BTInternet.com
pet...@eq1152.demon.co.uk
Opinions expressed are my own and not necessarily those
of my employer
from mawk-1.3.3/man/mawk.doc:
"
mawk, on the other hand, allows RS to be a regular expression.
When "\n" appears in records, it is treated as space,
and FS always determines fields.
Removing the line at a time paradigm can make some programs
simpler and can often improve performance. For example,
redoing example 3 from above,
    BEGIN { RS = "[^A-Za-z]+" }
    { word[ $0 ] = "" }
    END { delete word[ "" ]
          for( i in word ) cnt++
          print cnt
    }
counts the number of unique words by making each word a
record. On moderate size files, mawk executes twice as
fast, because of the simplified inner loop.
"
There are some more comparisons of [mng]awk, perl and awka at
http://www.linuxstart.com/~awka/compare.html
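For comparison with the RS-as-regex program quoted from mawk.doc, here is a rough sketch of a line-at-a-time version of the same count. It is only my guess at what such a version could look like, not the doc's actual example 3, and the sample text and the AWK variable are made up for illustration.

```shell
AWK=${AWK:-awk}   # assumption: set AWK=mawk or AWK=gawk to try a specific implementation

# Line-at-a-time variant: split each line on runs of non-letters
# and record every non-empty piece as an array index.
count=$(printf 'the cat and the hat\nThe end, and the start.\n' | "$AWK" '
{
    n = split($0, w, /[^A-Za-z]+/)
    for (i = 1; i <= n; i++)
        if (w[i] != "") word[w[i]] = ""
}
END {
    for (k in word) cnt++
    print cnt + 0
}')

echo "$count"   # prints 7 for this sample ("the" and "The" count separately)
```

The inner loop over split() pieces is the per-line work that the RS version avoids by making each word its own record.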
Eiso
__________________________________________________________________
o Eiso AB
o Dept. of Biochemistry
University of Groningen
The Netherlands
> Andrew Savige wrote in message <87v7ua$678$1...@merki.connect.com.au>...
> >I ran the following test on Linux on a 12 MB and a 24 MB file:
> > cp fred fred.tmp
> > mawk '{print;}' fred >fred.tmp
> > gawk '{print;}' fred >fred.tmp
> > perl -ne 'print' fred >fred.tmp
> >              12 MB            24 MB
> >Timings    user     sys     user     sys
> >cp         0.02    1.70     0.03    2.97
> >mawk       3.30    0.59     4.90    1.10
> >gawk       6.19    0.57    10.20    1.12
> >perl       5.47    0.71     9.59    1.11
> >
> >Does anyone know why mawk is so much faster?
> >
> <snip>
> Possibly because the internal compiled form may be optimized in a different
> way to gawk. What's really interesting is that it's a lot faster than perl!
>
Other results suggest that perl is not necessarily faster than awk:
@MISC{kern98,
author = "B.W. Kernighan and C.J. Van Wyk",
title = "Timing Trials, or, the Trials of Timing:
Experiments with Scripting and User-Interface
Languages",
year = "1998",
note = " From Lucent Technologies Inc. Available from
\url{http://netlib.bell-labs.com/cm/cs/who/bwk/interps/pap.html}",
}
--
T...@Menzies.com; http://www.tim.menzies.com | PH +1-304-367-8447
NASA, 100 University Dr,Fairmont WV, 26554,USA | FAX +1-304-367-8211
The older I grow, the less important the comma becomes. Let the reader
catch his own breath. -- Elizabeth Clarkson Zwart
>I ran the following test on Linux on a 12 MB and a 24 MB file:
> cp fred fred.tmp
> mawk '{print;}' fred >fred.tmp
> gawk '{print;}' fred >fred.tmp
> perl -ne 'print' fred >fred.tmp
>              12 MB            24 MB
>Timings    user     sys     user     sys
>cp         0.02    1.70     0.03    2.97
>mawk       3.30    0.59     4.90    1.10
>gawk       6.19    0.57    10.20    1.12
>perl       5.47    0.71     9.59    1.11
>
>Does anyone know why mawk is so much faster?
>
>Andrew Savige
IMHO, not knowing mawk, the difference is probably in the built-in
limits: program and data line lengths, program size, variable
and array numbers and sizes, etc. This applies to most versions
of awk. Simplifying assumptions and code may then be used to
speed up the code, on that platform, at the cost of not being
able to process all possible data which might be thrown at it.
GNU software in general, including gawk, is designed to be
portable across many platforms and to have no built-in limits:
it should accept an awk program of any size, read lines of any
length, and store any amount of data in any number of variables
or arrays of any size, until your page/swap space or patience is
exhausted and the program crashes, is crashed, or completes. So
gawk uses very flexible and
portable algorithms everywhere you might otherwise hit a stop,
and this extra code exacts a price in performance, particularly
in the simple code you see here. In a more elaborate benchmark,
running more complex code, you may see a smaller difference. With
larger programs and larger chunks of data input, particularly
storing the data internally in large arrays, gawk will keep
running and finish the job, but other awks will squawk and splat.
Try making your test file input lines over 2K long, with some
long fields > 256 characters, store a few long fields in arrays
in your program, and write them out in reverse order at the end.
Most awks will not read lines over a certain fixed size, and may
complain and exit, overwrite something and crash, read only part
of each line and not work properly, run out of memory storing
data, etc.
But gawk should just suck it in, process it and spit it out
again, regardless of what you do. ISTR Arnold Robbins posting the
gawk limits fairly recently, and I remember they were either
unspecified or ridiculously large -- you could try looking on
deja for the article.
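The test suggested above could be sketched roughly like this. The field sizes (300 and 2400 characters), the 50-line count and the AWK variable are my own choices for illustration; nothing here is taken from a real benchmark.

```shell
AWK=${AWK:-awk}   # assumption: set AWK=mawk, AWK=gawk, etc. to compare implementations

# Generate 50 lines, each over 2K long, with one 300-character field and
# one 2400-character field; then store the long fields in arrays and
# write a summary of them out in reverse order at the end.
reversed=$("$AWK" 'BEGIN {
    for (i = 1; i <= 300; i++)  a = a "a"
    for (i = 1; i <= 2400; i++) b = b "b"
    for (n = 1; n <= 50; n++) print n, a, b
}' | "$AWK" '
{ id[NR] = $1; f2[NR] = $2; f3[NR] = $3 }
END {
    for (n = NR; n >= 1; n--)
        print id[n], length(f2[n]), length(f3[n])
}')

printf '%s\n' "$reversed" | head -n 3
```

An awk that copes with the long lines should report 300 and 2400 for every record, starting from record 50; one with fixed line or field limits would be expected to complain, truncate, or crash instead.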
Thanks. Take care, Brian Inglis Calgary, Alberta, Canada
--
Brian_...@CSi.com (Brian dot Inglis at SystematicSw dot ab dot ca)
use address above to reply
Well, since we are on this subject, I think everyone would very
much enjoy a treatise on the benchmarking of interpreters, including
scheme, limbo (AT&T's alternative to Java), tcl, perl, awk, and C,
by none other than Brian Kernighan (anyone heard of him in this
newsgroup?)
http://cm.bell-labs.com/cm/cs/who/bwk/interps/pap.html
In short, awk benchmarks quite favorably against perl.
-Brian
I enjoyed BrianK's article. However, I could not find any mention in
his article of WHICH version of awk he used. This can be crucial.
For example, on Solaris, running the following trivial awk program:
awk '/^ *SERVER *#DBCOL *$/,/^ *END_SERVER *$/ { print "!" $0; next; }
{ print; }' fred > fred.tmp
here are the results:
              user   system    total
oawk         55.92     0.65    56.57
nawk         10.78     0.67    11.45
mawk          4.73     0.61     5.34
onetrueawk   49.96     0.77    50.73
perl         11.22     0.62    11.84
On this test, BrianK's onetrueawk performed miserably
(to be fair, onetrueawk 'out of the box' did not use the compiler's -O
switch).
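For anyone who wants to check what the range-pattern program does before timing it, here is a small sketch. The five-line sample input is invented (the real "fred" file is not shown in this thread), and the AWK variable is just a convenience for swapping implementations.

```shell
AWK=${AWK:-awk}   # assumption: point this at oawk, nawk, mawk, etc. to compare

# Lines from the SERVER #DBCOL marker through END_SERVER get a "!" prefix;
# everything else is printed unchanged.
out=$(printf '%s\n' 'alpha' 'SERVER #DBCOL' 'row1' 'END_SERVER' 'omega' |
"$AWK" '/^ *SERVER *#DBCOL *$/,/^ *END_SERVER *$/ { print "!" $0; next; }
{ print; }')

printf '%s\n' "$out"
```

On this sample the three lines from "SERVER #DBCOL" to "END_SERVER" come out prefixed with "!", while "alpha" and "omega" pass through untouched.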
The implementation of a given language is sometimes more important
than the choice of language.
AndrewS.
% >Does anyone know why mawk is so much faster?
% IMHO, not knowing mawk, the difference is probably in the built
% in limits: program and data line lengths, program size, variable
I'm not aware of any such limits in mawk. I'm not aware of any
advantage of using any awk implementation over mawk. I don't
believe there are any platforms which gawk compiles on and mawk
doesn't, and mawk is consistently faster.
--
Patrick TJ McPhee
East York Canada
pt...@interlog.com
> I'm not aware of any such limits in mawk. I'm not aware of any
> advantage of using any awk implementation over mawk.
Is there a newer mawk than 1.3.3? Here's what mawk 1.3.3 says about its
compiled limits:
$ mawk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
compiled limits:
max NF 32767
sprintf buffer 1020
So if I have an awk program that sprintf's more than 1020 characters, I
have to recompile mawk?
$ mawk 'BEGIN{ for(i=1;i<=2000;i++) s=s"x";print sprintf("%s",s)}'
mawk: program limit exceeded: sprintf buffer size=1020
FILENAME="" FNR=0 NR=0
Oh. Or am I supposed to "just not use sprintf() with such long strings"?
Or maybe I put a comment in my awk program that says "only use a mawk
with a sprintf buffer size of at least 100000 characters"?
How do I increase that size, btw?
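One possible workaround, sketched here on the assumption that the limit applies only to sprintf() (as the error message suggests), is to build long strings by concatenation and print them directly. The AWK variable and the length check are only there for illustration.

```shell
AWK=${AWK:-awk}   # assumption: set AWK=mawk to try this against mawk itself

# Build the 2000-character string by concatenation and print it directly,
# so sprintf() is never asked to format the long string.
len=$("$AWK" 'BEGIN { for (i = 1; i <= 2000; i++) s = s "x"; print s }' |
      "$AWK" '{ print length($0) }')
echo "$len"

# The mawk man page also documents a run-time option for enlarging the buffer.
# It is left as a comment because other awks do not accept it:
#   mawk -W sprintf=4000 'BEGIN{ for(i=1;i<=2000;i++) s=s"x";print sprintf("%s",s)}'
```

This only sidesteps the problem for simple cases; a program that really needs long formatted strings would still depend on the buffer size.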
> I don't
> believe there are any platforms which gawk compiles on and mawk
> doesn't, and mawk is consistently faster.
The msdos/NOTES file that comes with mawk 1.3.3 suggests using mawk 1.2.2
on DOS, because:
| Version 1.3:
| The new array design will fail under msdos if you put more than
| 16K items into an array and then walk it with for(i in A).
| Unfortunately things will probably fail ungracefully. The new
| array design runs into 64K limits at 16K elements in an array and
| there are no checks in the code. This is fixable, but tedious and
| 1.2.2 works well on DOS.
Has a version of mawk been compiled with DJGPP or any other
32-bit environment for DOS yet? If not, there will be much fun with 64K
limits.
Regards...
Michael
>I'm not aware of any
>advantage of using any awk implementation over mawk.
Aside from mawk's hardwired limits, which are occasionally a problem,
I find that gawk's diagnostics are much better than mawk's.
It's not unreasonable to debug with gawk, and then use mawk if
extra speed is needed and if mawk's limits are not a problem.