Mawk hashing inefficiency

jmellander

unread,

Sep 9, 2009, 1:23:26 PM9/9/09

to

Never posted to this group before, but have been an awk user for many
years.

Recently, for a project, I had the occasion to use mawk - I have a
list of ~12,000,000 Unix timestamps to nanosecond precision that I
needed to match the first field of every record in a number of huge
files. Gawk couldn't handle the number of records, and so I used
mawk, as being more memory thrifty. The program was a one-liner like
this:

mawk 'FNR==NR {x[$1]++;next} $1 in x}' timestamp_file log_file

which works perfectly, but the run time seemed excessive - many hours
per log file - which made me think that the hashing function was
causing many collisions, and thus hash chaining.....

When stuck in a slow meeting, I started looking at the mawk source
code, specifically the hashing functions, of which there are 2: hash()
in hash.c & ahash() in array.c

I was surprised to find that the hashing functions in both cases
essentially just sum the bytes of the key to create the hash - this
means that 123, 321, 213, etc. would all hash to the same location and
cause collisions, and hash chaining.

Modifying the hashing to a more efficient hash caused an enormous gain
in efficiency, as in this test:

$ wc -l j
2999999 j

$ time mawk-1.3.3/mawk '{x[$1]++}' j >/dev/null

real 2m24.362s
user 2m20.174s
sys 0m0.663s

$ time mawk-1.3.3a/mawk '{x[$1]++}' j >/dev/null

real 0m6.607s
user 0m6.146s
sys 0m0.241s

mawk-1.3.3a has the below modifications:

In hash.c I replaced the 'hash' function with:

/*
FNV-1 hash function, per en.wikipedia.org/wiki/Fowler-Noll-
Vo_hash_function
*/
unsigned hash(s)
register char *s ;
{
register unsigned h = 2166136261 ;

while (*s) h = (h * 16777619) ^ *s++ ;
return h ;
}

and in array.c replaced 'ahash' with:
/*
FNV-1 hash function, per en.wikipedia.org/wiki/Fowler-Noll-
Vo_hash_function
*/
static unsigned ahash(sval)
STRING* sval ;
{
register unsigned h = 2166136261 ;
register char *s = sval->str;

while (*s) h = (h * 16777619) ^ *s++ ;
return h ;
}

pk

unread,

Sep 9, 2009, 2:51:30 PM9/9/09

to

jmellander wrote:

> Never posted to this group before, but have been an awk user for many
> years.
>
> Recently, for a project, I had the occasion to use mawk - I have a
> list of ~12,000,000 Unix timestamps to nanosecond precision that I
> needed to match the first field of every record in a number of huge
> files. Gawk couldn't handle the number of records,

This looks strange to me. I've used gawk to process files up to some hundred
millions records without a problem. However, I must say that I never had to
use all those values as hashes. A quick test on a low-powered box seems to
show that for more than 4 millions values, gawk hash performance degrades
substantially (but maybe having more RAM could help here).

FWIW, this is what I get:

$ wc -l file
3000000

$ time gawk '{a[$1]++}' file

real 0m13.659s
user 0m9.772s
sys 0m0.985s

$ time mawk '{a[$1]++}' file

real 4m36.687s
user 3m41.719s
sys 0m1.731s

$ time mawk-new '{a[$1]++}' file

real 0m11.714s
user 0m7.491s
sys 0m0.726s

So your change is indeed quite effective.

(btw, you don't need the "++" in these examples, unless at some point you
need to know how many times a given value occurred).

> mawk-1.3.3a has the below modifications:
>
>
> In hash.c I replaced the 'hash' function with:

>[snip]

You should probably report your findings and perhaps submit your patch to
the new mawk maintainer, see http://invisible-island.net/mawk/mawk.html.

jmellander

unread,

Sep 9, 2009, 4:50:45 PM9/9/09

to

Thanks,

1.2e6 18 character hashes overflows memory on my 32 bit linux system
under gawk, but mawk takes ~ 730Megs of ram, and so works fine.

I'm doing a few more tests before submitting a patch, but I think it
will work fine. Of course, looping thru an array with a different
hashing scheme will return items in a different order, but the order
is undefined anyway.

FWIW: I looked at busybox awk, and found an issue with its hashing
scheme, as well:

The maximum size that a hash table can expand to is ~64K, so adding
more than ~64K indices to an array guarantees that hash collisions and
chaining will occur, slowing down all accesses, although as its
intended for embedded use with limited memory, maybe that's OK...

When I have time, I'll submit a patch for that, too. Per
http://www.cse.yorku.ca/~oz/hash.html and an examination of the
source, gawk uses the sdbm hash, busybox awk uses something similar to
the djb2 hash, and mawk used the lose lose hash :-(