sorting of awk arrays (hashes) function

johannes.mainusch

Oct 23, 2009, 9:49:25 AM

It took me some time to understand why awk's arrays cannot easily be
sorted. Anyway, somewhere I got the hint to use the Unix shell's sort by
dumping the hash, sorting it in the Unix shell and reading it in
again. So I wrote a little function to encapsulate that. Here it is...

#
# print an array in sorted order by value
# 20091022, Johannes Mainusch
# don't blame me if it doesn't work on you
# :-)
#
function print_sorted_by_value (prefix, array, scale, norm,
        significance, i, sum, tmpfile, cmd) {
    printf ("\nprinting in sorted order by value\n");
    for (i in array) n++; # get length of the array

    tmpfile=sprintf("del.me.%d",1000000*rand());
    #print "filename = ",tmpfile;

    sum = 0;
    for (key in array) {
        value = "nan";
        sum += array[key];
        if (array[key] > significance) value = 100*array[key]/norm;
        printf ("%s%-30s %8.1f %f\n", prefix, key, scale*array[key], value) >>tmpfile;
    }
    close (tmpfile);
    delete array;

    cmd = sprintf ("sort -n -r -k3 %s", tmpfile);
    while (cmd | getline myline) {
        print myline;
        # split (myline, tmp);
        # array[tmp[2]]=tmp[1];
    }
    close (tmpfile);
    system ("rm "tmpfile);

    printf ("%s%-30s %8.1f\n", prefix, "sum:", sum);
    printf ("-----------------------------------------\n");
}

Ed Morton

Oct 23, 2009, 10:06:29 AM

A couple of things pop out:

1) You need to add "n" to the pseudo-argument list.

2) The use of getline isn't a safe syntax (see
   http://awk.info/?tip/getline).

3) You'd need to run it from a dir where you have write permission so,
   since you're assuming UNIX, put your tmp file in /usr/tmp or
   similar.

4) Deleting a whole array is a gawk-ism, but gawk already has built-in
   array sorting (asort() and asorti()).

5) There's no need to use sprintf() to create the "tmpfile" and "cmd"
   strings.

6) You could use a co-process instead of a tmp file if you're assuming
   gawk.

7) You could use length() instead of a loop to get the array size if
   you're assuming gawk.

8) Instead of repeating the same format string in two printfs you
   should define a format variable and use that.

9) All the trailing semicolons are redundant.
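
For illustration, here is a minimal sketch of how several of these points
might be applied in plain POSIX awk: the locals are declared in the
parameter list, the rows are printed straight into a sort pipe (so there
is no tmp file and no getline), and the format string is defined once.
The wrapper name print_sorted, the sample data and the choice to print 0
for entries at or below the threshold are assumptions made for this
example, not part of the original function.

```sh
# print_sorted is a hypothetical wrapper name; it reads "key count" lines.
print_sorted() {
    awk '
    # Everything after "significance" is a local variable.
    function print_sorted_by_value(prefix, array, scale, norm, significance,
                                   key, value, sum, fmt, cmd) {
        fmt = "%s%-30s %8.1f"
        cmd = "sort -k3,3nr"    # numeric, descending, on the percentage column
        sum = 0
        for (key in array) {
            sum += array[key]
            # entries at or below the threshold get 0 instead of a percentage
            value = (array[key] > significance) ? 100 * array[key] / norm : 0
            printf(fmt " %f\n", prefix, key, scale * array[key], value) | cmd
        }
        close(cmd)              # let sort finish before the total is printed
        printf(fmt "\n", prefix, "sum:", sum)
    }
    { count[$1] += $2; total += $2 }
    END { print_sorted_by_value("", count, 1, total, 0) }
    '
}

printf '%s\n' 'apple 1' 'cherry 5' 'banana 4' | print_sorted
```

The close(cmd) call is what keeps the output in order: it waits for sort
to exit, so the sorted rows appear before the "sum:" line.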

Could you show some sample input, a small script that uses that function
plus the output it produces so we can see how to use it?

Regards,

Ed.

Kenny McCormack

Oct 23, 2009, 10:09:38 AM

In article <c33c3055-7f83-4f24...@o36g2000vbl.googlegroups.com>,

johannes.mainusch <johannes...@gmx.de> wrote:
>It took me some time to understand why awk's arrays cannot easily be
>sorted. Anyway, somewhere I got the hint to use the Unix shell's sort by
>dumping the hash, sorting it in the Unix shell and reading it in
>again. So I wrote a little function to encapsulate that. Here it is...

A couple of notes (objections):

1) The need for this is pretty much obsolete today, given the built-in
   sorting capabilities of GAWK and TAWK (and if you're not using one
   or the other of these, then you really should be).

2) IME, you rarely need to sort the *values*. My applications have
   always needed the keys sorted. TAWK does this automatically, of
   course, as does GAWK if you're a sufficiently whiny user (hint,
   hint).

3) Incidentally, I've never used GAWK's asort() or asorti() functions.
   They look somewhat interesting, but I've never seen the need...

Ed Morton

Oct 23, 2009, 10:24:21 AM

Kenny McCormack wrote:
<snip>

> 3) Incidentally, I've never used GAWK's asort() or asorti() functions.
> They look somewhat interesting, but I've never seen the need...

I don't use them much but once in a while they're useful. In fact, I
used one of them in a script just yesterday. I had multiple files of
measurements for various types of processor, e.g. this kind of format in
file "FILE1":

type=foo id=3
count1 = 7
count2 = 5

type=bar id=54
count1 = 3
count3 = 6

type=foo id=12
count4 = 5
count2 = 9

and I had to produce tabular output that was sorted by processor type+id
and with a blank line between each type:

FILE1:
bar_54 3 0 6 0

foo_03 7 5 0 0
foo_12 0 9 0 5

FILE2:
....

so it was convenient to initially store the data indexed by processor
type+id, then sort the list using asorti() before printing. I could've
piped an interim result per file to UNIX sort, but then I'd have had to
add yet another pipe to a second awk to introduce the blank lines
between processor types, and I'd have had to introduce a shell loop to
feed awk one file at a time (instead of just handling all the files on
the awk command line) or otherwise jump through hoops. Just using
asorti() in a single script was quite a bit simpler.
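
A rough sketch of that approach, assuming gawk and the input layout shown
above. The wrapper name tabulate, the fixed four counter columns and the
two-digit zero padding of the id are guesses made for illustration; this
is not the original script.

```sh
# tabulate is a hypothetical wrapper name; asorti() makes this gawk-only.
tabulate() {
    gawk '
    BEGIN { ncols = 4 }                  # assumed: counters count1..count4
    FNR == 1 {                           # first line of each input file
        if (NR > 1) { dump(); print "" }
        print FILENAME ":"
    }
    /^type=/ {                           # e.g. "type=foo id=3"
        split($1, t, "="); split($2, d, "=")
        key = sprintf("%s_%02d", t[2], d[2])   # zero-padded: foo_03 < foo_12
        kind[key] = t[2]                 # remembered for the blank-line logic
        next
    }
    /^count[0-9]+ *=/ {                  # e.g. "count2 = 5"
        val[key, substr($1, 6) + 0] = $3
    }
    END { dump() }

    # Print the rows collected for one file in key order, then forget them.
    function dump(    n, i, j, sorted, row, prev) {
        n = asorti(kind, sorted)         # sorted[1..n] = keys in string order
        for (i = 1; i <= n; i++) {
            if (i > 1 && kind[sorted[i]] != prev) print ""
            prev = kind[sorted[i]]
            row = sorted[i]
            for (j = 1; j <= ncols; j++) row = row " " (val[sorted[i], j] + 0)
            print row
        }
        delete kind; delete val
    }
    ' "$@"
}
```

Called as "tabulate FILE1 FILE2", it handles every file in one gawk run,
with no external sort and no shell loop.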

Ed.

pk

Oct 23, 2009, 10:48:08 AM

johannes.mainusch wrote:

> It took me some time to understand why awk's arrays cannot easily be
> sorted.

Why? AFAICT, an array sort function can be written in awk just as easily as
in any other language.

> Anyway, somewhere I got the hint to use the Unix shell's sort by
> dumping the hash, sorting it in the Unix shell and reading it in
> again. So I wrote a little function to encapsulate that. Here it is...

If you're using GNU awk you don't need that because it has built-in
functions to sort arrays by value and by index (hash).

You should check that getline returns a positive value, and probably you
should also close(cmd) "just in case".
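
As a small sketch of both points, the read loop could look like the one
below. The wrapper name read_sorted and the "sort -n" command are
placeholders chosen for this example, and the file name is assumed not to
contain double quotes or other shell-special characters.

```sh
# read_sorted is a hypothetical helper: print the lines of file $1 in
# numeric order, reading the output of sort through a command pipe.
read_sorted() {
    awk -v file="$1" '
    BEGIN {
        cmd = "sort -n \"" file "\""
        # getline returns 1 for a line, 0 at end of input and -1 on error,
        # so a bare "while (cmd | getline line)" never ends if it sees -1
        while ((cmd | getline line) > 0)
            print line
        close(cmd)    # close the command pipe, not the file it reads
    }'
}
```

The string passed to close() has to be the same command string that was
used with getline, which is why it is kept in a variable.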

Grant

Oct 23, 2009, 1:32:25 PM

I use gawk's asort and asorti heaps:

grant@deltree:/usr/local/bin$ grep asort *|grep -v "Binary file"
cc2ip-logview: asort(tsdiff, tssort)
cc2ip-logview: asort(qslen, qssort)
cc2ip-logview: asort(rtime, rsort)
cc2ip-quota-lockout-view: n = asorti(query, sort)
get-web-blocks: numip = asorti(ip, ipnum_sort)
get-web-blocks: n = asorti(list_name, list_name_sorted)
ipblockmerge:# requires recent gawk with 'asorti' (tested with gawk-3.1.5)
ipblockmerge: n = asorti(list_input, list_sorted) # sort by start addr, blocksize
ipblockmerge: x = asorti(list_out, list_out_sorted)
junkview: pf = asort(xp)
junkview: j = asort(kk)
junkview: j = asort(kk)
junkview: sort_addr_port_len = asort(sort_addr_port)
junkview: sort_hits_port_len = asort(sort_hits_port)
junkview: addr_hits_port_len = asort(addr_hits_port)
junkview: hits_addr_port_len = asort(hits_addr_port)
junkview: hits_netw_addr_len = asort(hits_netw_addr)
junkview: m = asort(hl)
junkview: asort(nl)
junkview: sort_src_hit_dst_len = asort(sort_src_hit_dst)
logfilter: tcpsize = asorti(tcp, tcpsort) # sort by IP address
logfilter: nettcpsize = asorti(nettcp, nettcpsort) # sort by net address
pak-web-scan: n = asorti(list, sorted)
show-browsers: n = asorti(sort, sorted)
spam-net-finder: count = asort(output, sorted)
spam-net-finder-db: count = asort(output, sorted)

Grant.
--
http://bugsplatter.id.au

Da_Gut

Oct 30, 2009, 2:06:42 PM

> Ed.

Ah, both thanks and drat as well for that link. I was using getline
in my first awk program, but I'm pretty sure that by following the
advice at the above link I can eliminate it. Good information.

johannes.mainusch

Nov 5, 2009, 4:35:03 PM

Thanks for all the good discussion. I'll try to digest that link and
understand getline better, and then I'll clean up my code (I have in
fact done that). The reason I don't use gawk is simply that I develop
on a Mac and deploy on Debian, and I am just a part/part-time
developer; it is in fact a hobby alongside line management. So *awk is
not really an option. As for the one remark about the possibility of
sorting hashes in awk: I did not understand it, and I do not believe
it's possible, as sorting always involves swapping elements, and that
requires some kind of reference to elements, which I do not have in an
awk hash. Anyway, I might be mistaken, so please do prove me wrong with
a code sample :-)
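
One possible answer, as a sketch in plain POSIX awk: the keys themselves
can serve as the references. They are copied into a second array indexed
1..n, and the swapping happens in that array while the hash is left
alone. The names by_value and sort_keys_by_value are made up for this
example, and insertion sort was picked only because it is short; it is
not the only way to do this.

```sh
# by_value is a hypothetical wrapper: reads "key count" lines and prints
# them ordered by count, largest first.
by_value() {
    awk '
    # Fill order[1..n] with the keys of hash, sorted by descending value.
    # The hash itself is never reordered; only the key list is shuffled.
    function sort_keys_by_value(hash, order,    n, i, j, k) {
        split("", order)                      # start with an empty list
        n = 0
        for (k in hash) order[++n] = k
        for (i = 2; i <= n; i++) {            # plain insertion sort
            k = order[i]
            for (j = i - 1; j >= 1 && hash[order[j]] < hash[k]; j--)
                order[j + 1] = order[j]
            order[j + 1] = k
        }
        return n
    }
    { count[$1] += $2 }
    END {
        n = sort_keys_by_value(count, order)
        for (i = 1; i <= n; i++) print order[i], count[order[i]]
    }'
}

printf '%s\n' 'apple 1' 'cherry 5' 'banana 4' | by_value
```

Walking order[1], order[2], ... then visits the hash in sorted order,
which is roughly what gawk's asorti() provides for keys.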

Btw., I use awk to analyze custom webserver logs, histogram data
and get sorted cross references... It's fast and nasty, and yes, I know
about the existence of Perl, Ruby and open source log analyzers. But as
someone recently put it: "awk is a nice chainsaw..."
Cheers
Johannes