sorting

Ed Morton

unread,

Nov 14, 2009, 12:37:28 PM11/14/09

to

Since I promised Tim Menzies I'd try to get involved in awk.info, I'm
considering posting the code below as an example of using functions,
arrays, string vs numeric comparison, etc. The sorting algorithm is
O(n^2). If anyone feels like doing a more efficient equivalent, that'd
make for an interesting comparison too. Any other feedback (especially
spotting bugs or GNU-isms!) is welcome.

Ed.

$ cat sort.awk
function strFldSort(inArr,outArr,fldNum,fldSep \
,thisArr,thisIdx,thisElt,thisKey,cmpIdx,cmpElt,cmpKey,outIdx,numElts)
{
for (thisIdx in inArr) {
thisElt = inArr[thisIdx]
split(thisElt,thisArr,fldSep)

thisKey = thisArr[fldNum]""

outIdx=++cnt[thisKey] # start at 1 plus num occurrences

for (cmpIdx in inArr) {
cmpElt = inArr[cmpIdx]
split(cmpElt,cmpArr,fldSep)

cmpKey = cmpArr[fldNum]""

if (thisKey > cmpKey) {
outIdx++
}
}

outArr[outIdx] = thisIdx
numElts++
}

return numElts
}
function strRecSort(inArr,outArr,fldNum,fldSep \
,thisArr,thisIdx,thisElt,thisKey,cmpIdx,cmpElt,cmpKey,outIdx,numElts)
{
for (thisIdx in inArr) {
thisElt = inArr[thisIdx]

thisKey = thisElt""

outIdx=++cnt[thisKey] # start at 1 plus num occurrences

for (cmpIdx in inArr) {
cmpElt = inArr[cmpIdx]
cmpKey = cmpElt""

if (thisKey > cmpKey) {
outIdx++
}
}

outArr[outIdx] = thisIdx
numElts++
}

return numElts
}

function numFldSort(inArr,outArr,fldNum,fldSep \
,thisArr,thisIdx,thisElt,thisKey,cmpIdx,cmpElt,cmpKey,outIdx,numElts)
{
for (thisIdx in inArr) {
thisElt = inArr[thisIdx]
split(thisElt,thisArr,fldSep)

thisKey = thisArr[fldNum]+0

outIdx=++cnt[thisKey] # start at 1 plus num occurrences

for (cmpIdx in inArr) {
cmpElt = inArr[cmpIdx]
split(cmpElt,cmpArr,fldSep)

cmpKey = cmpArr[fldNum]+0

if (thisKey > cmpKey) {
outIdx++
}
}

outArr[outIdx] = thisIdx
numElts++
}

return numElts
}

function numRecSort(inArr,outArr,fldNum,fldSep \
,thisArr,thisIdx,thisElt,thisKey,cmpIdx,cmpElt,cmpKey,outIdx,numElts)
{
for (thisIdx in inArr) {
thisElt = inArr[thisIdx]

thisKey = thisElt+0

outIdx=++cnt[thisKey] # start at 1 plus num occurrences

for (cmpIdx in inArr) {
cmpElt = inArr[cmpIdx]
cmpKey = cmpElt+0

if (thisKey > cmpKey) {
outIdx++
}
}

outArr[outIdx] = thisIdx
numElts++
}

return numElts
}

function genSort(sortType,inArr,outArr,fldNum,fldSep, numElts)
{
# Traverses the input array, storing it's indices in the output
# array in sorted order of the input array elements. e.g.
#
# in: inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
# outArr[] is empty
# out: inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
# outArr[1]="bar"; outArr[2]="foo"; outArr[3]="xyz"
#
# sortType of "n" means sort by numerical comparison, sort by
# string comparison otherwise.
#
# Can also sort on specific fields given a field number and
# field separator. Note that the separator is required so
# that "" specifically means sort on character position.

if (sortType == "n") {
if (fldNum) {
numElts = numFldSort(inArr,outArr,fldNum,fldSep)
} else {
numElts = numRecSort(inArr,outArr,0,fldSep)
}
} else {
if (fldNum) {
numElts = strFldSort(inArr,outArr,fldNum,fldSep)
} else {
numElts = strRecSort(inArr,outArr,0,fldSep)
}
}

return numElts
}

{ inArr[NR]=$0 }

END {
numElts = genSort(sortType,inArr,outArr,fldNum,FS)

for (outIdx=1;outIdx<=numElts;outIdx++) {
print inArr[outArr[outIdx]]
}
}
------------------------------------------------------------------
$ cat file
b 3
a 2
b 10
c 1
b 2
$ awk -f sort.awk file
a 2
b 10
b 2
b 3
c 1
$ awk -v fldNum=2 -f sort.awk file
c 1
b 10
b 2
a 2
b 3
$ awk -v sortType=n -v fldNum=2 -f sort.awk file
c 1
b 2
a 2
b 3
b 10

pk

unread,

Nov 15, 2009, 12:35:25 PM11/15/09

to

Ed Morton wrote:

> Since I promised Tim Menzies I'd try to get involved in awk.info, I'm
> considering posting the code below as an example of using functions,
> arrays, string vs numeric comparison, etc. The sorting algorithm is
> O(n^2). If anyone feels like doing a more efficient equivalent, that'd
> make for an interesting comparison too. Any other feedback (especially
> spotting bugs or GNU-isms!) is welcome.

Good job. ISTR a project called awke, which unfortunately seems to be
inactive now, that had implemented a set of library functions for awk, among
which was a sort routine using quick sort - O(n log n).
The source code is still available at https://launchpad.net/awke, so you may
want to have a look at that for comparison.

Janis Papanagnou

unread,

Nov 15, 2009, 1:11:44 PM11/15/09

to

pk wrote:
> Ed Morton wrote:
>
>> Since I promised Tim Menzies I'd try to get involved in awk.info, I'm
>> considering posting the code below as an example of using functions,
>> arrays, string vs numeric comparison, etc. The sorting algorithm is
>> O(n^2). If anyone feels like doing a more efficient equivalent, that'd
>> make for an interesting comparison too. Any other feedback (especially
>> spotting bugs or GNU-isms!) is welcome.
>
> Good job. ISTR a project called awke, which unfortunately seems to be
> inactive now, that had implemented a set of library functions for awk, among
> which was a sort routine using quick sort - O(n log n).

Careful! Quicksort's *worst* time complexity is O(n^2). There are some
tactics in choosing the pivot element to make that worst case complexity
yet more unlikely. Even though it's in most cases (i.e. on average) a
very fast O(n log n) algorithm you cannot rely on that; for a guaranteed
O(n log n) complexity you can use (for example) heapsort, which is on
average by some significant constant typically slower than quicksort.

Janis

pk

unread,

Nov 15, 2009, 1:44:42 PM11/15/09

to

Janis Papanagnou wrote:

>> Good job. ISTR a project called awke, which unfortunately seems to be
>> inactive now, that had implemented a set of library functions for awk,
>> among which was a sort routine using quick sort - O(n log n).
>
> Careful! Quicksort's *worst* time complexity is O(n^2). There are some
> tactics in choosing the pivot element to make that worst case complexity
> yet more unlikely. Even though it's in most cases (i.e. on average) a
> very fast O(n log n) algorithm you cannot rely on that; for a guaranteed
> O(n log n) complexity you can use (for example) heapsort, which is on
> average by some significant constant typically slower than quicksort.

Hm yes you're right. I seem to remember there were ways of selecting the
pivot such that the worst case complexity was guaranteed to be O(n log n),
but which hovever in practice performed worse than "normal" methods. I can't
find any reference about that right now, so I might indeed be
misremembering.

But you're right, O(n log n) is NOT guaranteed with quicksort in general.
Thanks for making things clear and giving me the chance of brushing up again
those now-quite-ancient (for me) notions :)

pk

unread,

Nov 15, 2009, 1:56:32 PM11/15/09

to

pk wrote:

> I seem to remember there were ways of selecting the
> pivot such that the worst case complexity was guaranteed to be O(n log n),
> but which hovever in practice performed worse than "normal" methods. I
> can't find any reference about that right now, so I might indeed be
> misremembering.

Ah, I found something here:

http://stackoverflow.com/questions/164163/choosing-a-pivot-for-quicksort

"If you absolutely want to guarantee O(n lg n) runtime for the algorithm,
the columns-of-5 method for finding the median of an array runs in O(n)
time, which means that the recurrence equation for quicksort in the worst
case will be T(n) = O(n) (find the median) + O(n) (partition) + 2T(n/2)
(recurse left and right.) By the Master Theorem, this is O(n lg n). However,
the constant factor will be huge, and if worst case performance is your
primary concern, use a merge sort instead, which is only a little bit slower
than quicksort on average, and guarantees O(nlgn) time (and will be much
faster than this lame median quicksort)."

mjc

unread,

Nov 15, 2009, 10:14:56 PM11/15/09

to

Here are two versions of a heapsort I wrote a number of years ago.
Heapsort is guaranteed to take no more than n*log(n) time.

I don't have time to adequately document this, but heapsort2(ar, n)
sorts an array ar(1..n) with the built-in gawk comparison via ">".
heapsortarray(ar, n) uses a function "sortfunc" to compare two items.

I tend to write my gawk like C. All comments will be considered, and
anyone can use this in any way they like.

Martin Cohen

function heapsort2(sortar, n, l, n1, t1)
{
; # from Combinatorial algorithms by Nijenhuis and Wilf, 2nd ed., p.
138;
; # toheap;
for ( l = int(n/2); l >= 1; l-- )
heapf2(l, n, sortar);
; # sortheap;
n1 = n;
while ( n1 > 1 )
{
; # swap [1] and [n1];
t1 = sortar[1]; sortar[1] = sortar[n1]; sortar[n1] = t1;

; # do the operation;
n1--;
if ( n1 > 1 ) heapf2(1, n1, sortar);
}
}
function heapf2(l, n, b, l1, bs, m)
{
; # print "heapf(" l ", " n ")";
l1 = l;
bs = b[l];
while ( (m = 2*l1) <= n )
{
if ( (m < n) && (b[m+1] > b[m]) ) m++;
if ( bs >= b[m] )
{
b[l1] = bs;
return;
}
b[l1] = b[m];
l1 = m;
}
b[l1] = bs;
}
function heapsortarray(sortar, n, l, n1, t1)
{
; # from Combinatorial algorithms by Nijenhuis and Wilf, 2nd ed., p.
138;
; # toheap;
for ( l = int(n/2); l >= 1; l-- )
heapf(l, n, sortar);
; # sortheap;
n1 = n;
while ( n1 > 1 )
{
; # swap [1] and [n1];
t1 = sortar[1];
sortar[1] = sortar[n1];
sortar[n1] = t1;
; # do the operation;
n1--;
if ( n1 > 1 ) heapf(1, n1, sortar);
}
}
function heapf(l, n, b, l1, bs, m, doit)
{
l1 = l;
bs = b[l];
while ( (m = 2*l1) <= n )
{
if ( m < n )
{
; # if ( b[m+1] > b[m] ) m++;
; # see how to compare items;
doit = sortfunc(b[m+1], b[m]);
if ( doit ) m++;
}
; # if ( bs >= b[m] ) ...;
; # same comparison stuff;
doit = sortfunc(bs, b[m]);
if ( doit )
{
b[l1] = bs;
return;
}
b[l1] = b[m];
l1 = m;
}
b[l1] = bs;
}
# sortfunc - decide if exchanging;
function sortfunc(b1x, b2x, \
s1, s2, doit, b1, b2, ib1, ib2)
{
if ( sortnums )
{
b1 = b1x;
b2 = b2x;
}
else
{
ib1 = nonnumloc(b1x);
ib2 = nonnumloc(b2x);
b1 = substr(b1x, ib1);
b2 = substr(b2x, ib2);
}

if ( sorthow == "raw" )
{
; # as is;
doit = (b1 > b2);
}
else if ( sorthow == "Aa" )
{
; # first letter as is, remainder as lower case;
s1 = substr(b1, 1, 1);
s2 = substr(b2, 1, 1);
doit = ( (s1 > s2) \
|| ((s1 == s2) \
&& (tolower(substr(b1, 2)) \
> tolower(substr(b2, 2)) \
) \
) \
);
}
else
{
; # all as lower case;
doit = ( tolower(b1) > tolower(b2) );
}
return doit;
}
# nonnumloc - location of non-numeric part of a string;
function nonnumloc(str)
{
; # if short, use it;
if ( length(str) <= 1 ) return 1;
; # look for spaces, then dot, slash, digits;
match(str, "^[ ]*[./0-9]*");
; # return past that;
return RLENGTH+1;
}

Ed Morton

unread,

Nov 16, 2009, 9:28:12 AM11/16/09

to

Ed Morton wrote:
> Since I promised Tim Menzies I'd try to get involved in awk.info, I'm
> considering posting the code below as an example of using functions,
> arrays, string vs numeric comparison, etc. The sorting algorithm is
> O(n^2). If anyone feels like doing a more efficient equivalent, that'd
> make for an interesting comparison too. Any other feedback (especially
> spotting bugs or GNU-isms!) is welcome.

Here's a better version with less duplicate code, putting the key
numerical vs string code side-by-side so the differences are more
visible, more efficiently only splitting fields once per array element
instead of N times, and isolating the sorting algorithm to one function
(which I called keySort() since I couldn't decide exactly which sorting
algorithm it was implementing since it's not exactly inserting or
selecting or... - suggestions welcome from those more familiar with the
domain!) so we could plug in any other sorting algorithm for visual or
performance comparisons.

My code assumes that to force a numerical comparison all I need to do
is, before entering the common sorting function, add zero to each array
element (arr[idxN] = arr[idxN]+0) and then "arr[idx1] > arr[idx2]"
within the sorting function WILL do a numerical comparison, and
similarly if I instead add a null string to each array element
(arr[idxN] = arr[idxN]"") then exactly the same comparison code,
"arr[idx1] > arr[idx2]", WILL do a string comparison. It works in gawk
but is that a valid assumption for all awks?

Martin - thanks for posting the heapsort code for comparison. It looks
like it's sorting and rewriting to the original array. Don't suppose
you'd feel like modifying it to create the same kind of array of indices
that keySort() does and use the same variable names so if it was posted
at awk.info newcomers could more easily understand the differences?

Ed.

function keySort(keyArr,outArr, thisIdx,thisKey,cmpIdx,outIdx,numElts)
{
for (thisIdx in keyArr) {
thisKey = keyArr[thisIdx]

outIdx=++cnt[thisKey] # start at 1 plus num occurrences

for (cmpIdx in keyArr) {
if (thisKey > keyArr[cmpIdx]) {
outIdx++
}
}

outArr[outIdx] = thisIdx
numElts++
}

return numElts
}

function genSort(sortType,inArr,outArr,fldNum,fldSep, \
keyArr,thisIdx,thisArr)

{
# Traverses the input array, storing it's indices in the output
# array in sorted order of the input array elements. e.g.
#
# in: inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
# outArr[] is empty
#
# out: inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
# outArr[1]="bar"; outArr[2]="foo"; outArr[3]="xyz"
#

# Can sort on specific fields given a field number and
# field separator.

#
# sortType of "n" means sort by numerical comparison, sort by
# string comparison otherwise.

if (fldNum) {

if (sortType == "n") {

for (thisIdx in inArr) {
split(inArr[thisIdx],thisArr,fldSep)
keyArr[thisIdx] = thisArr[fldNum]+0
}
} else {
for (thisIdx in inArr) {
split(inArr[thisIdx],thisArr,fldSep)
keyArr[thisIdx] = thisArr[fldNum]""
}
}
} else {

if (sortType == "n") {

for (thisIdx in inArr) {
keyArr[thisIdx] = inArr[thisIdx]+0
}
} else {
for (thisIdx in inArr) {
keyArr[thisIdx] = inArr[thisIdx]""
}
}
}

return keySort(keyArr,outArr)

Anton Treuenfels

unread,

Nov 16, 2009, 7:34:13 PM11/16/09

to

"Ed Morton" <morto...@gmail.com> wrote in message
news:hdrnhu$3t8$1...@news.eternal-september.org...

>
> My code assumes that to force a numerical comparison all I need to do is,
> before entering the common sorting function, add zero to each array
> element (arr[idxN] = arr[idxN]+0) and then "arr[idx1] > arr[idx2]" within
> the sorting function WILL do a numerical comparison, and similarly if I
> instead add a null string to each array element (arr[idxN] = arr[idxN]"")
> then exactly the same comparison code, "arr[idx1] > arr[idx2]", WILL do a
> string comparison. It works in gawk but is that a valid assumption for all
> awks?

It's valid for TAWK, although IIRC adding zero to a string yields zero,
while appending the null string to a number yields a decimal string
representation of that number. The point being simply that if that's not
what is desired the sort result may come out a bit off.

- Anton Treuenfels

Ed Morton

unread,

Nov 18, 2009, 8:52:59 PM11/18/09

to

It seems to work on nawk and /usr/xpg4/bin/awk too so that's good enough for me.
So, here's the final version which I modified quite a bit, mostly to preserve input
ordering after I noticed differences in output between various awks as a result
of the "in" operator. I also let it use the default awk comparison if no specific
comarison is chosen and included an implementation of selection sort so if anyone
wants to contribute alternate algorithms once it's up on awk.info they get an idea
how to do it.

Ed.

#---------------------------------------------------------------------
function selSort(keyArr,idxArr,numElts, swap,thisIdx,minIdx,cmpIdx)
{
# Selection Sort, O(n^2): http://en.wikipedia.org/wiki/Selection_sort

for (thisIdx=1; thisIdx<=numElts; thisIdx++) {

minIdx = thisIdx

for (cmpIdx=thisIdx + 1; cmpIdx <= numElts; cmpIdx++) {
if (keyArr[idxArr[minIdx]] > keyArr[idxArr[cmpIdx]]) {
minIdx = cmpIdx
}
}

if (thisIdx != minIdx) {
swap = idxArr[thisIdx]
idxArr[thisIdx] = idxArr[minIdx]
idxArr[minIdx] = swap
}
}

return
}

#---------------------------------------------------------------------
function keySort(keyArr,idxArr,numElts, occArr,thisIdx,thisKey,cmpIdx,outIdx)
{
# Key Sort, O(n^2): made up by Ed Morton for simplicity.

for (thisIdx=1; thisIdx<=numElts; thisIdx++) {
thisKey = keyArr[thisIdx]

outIdx=++occArr[thisKey] # start at 1 plus num occurrences

for (cmpIdx=1; cmpIdx<=numElts; cmpIdx++) {

if (thisKey > keyArr[cmpIdx]) {
outIdx++
}
}

idxArr[outIdx] = thisIdx
}

return
}

######################################################################
function genSort(sortAlg,sortType,eltArr,idxArr,numElts,fldNum,fldSep,
prtInfo, keyArr,thisIdx,thisArr,prtAlg,prtType,prtKey)
{
# This code was written by Ed Morton to demonstrate the use
# of arrays, functions, and string vs numeric comparisons in awk.
# It also provides a framework for people to implement various
# sorting algorithms in awk such as those listed at:
# http://en.wikipedia.org/wiki/Sorting_algorithm
#
# Sorts idxArr in the order of the elements in eltArr, e.g.
#
# in: eltArr["foo"]="b"; eltArr["bar"]="a"; eltArr["xyz"]="b"
# idxArr[1]="foo"; idxArr[2]="bar"; idxArr[3]="xyz"
#
# out: eltArr["foo"]="b"; eltArr["bar"]="a"; eltArr["xyz"]="b"
# idxArr[1]="bar"; idxArr[2]="foo"; idxArr[3]="xyz"
#
# Specifying fldNum and fldSep sorts on a specific field. fldSep
# is required to be explicitly set in this case as a NULL value
# means "sort by character position" rather than "sort by FS". To
# set fldSep to the current value of FS, set it to the string "FS".
# To set fldSep to the regexp "FS", set it to the string "[F][S]".
#
# sortType starting with "s" means sort by string comparison,
# sortType starting with "n" means sort by numeric comparison,
# sort by awks default string/numeric comparison otherwise.
#
# sortAlg starting with "sel" means use selection sort algorithm,
# use key sort otherwise.
#
# Input order will be preserved for records that match on their keys.

if (prtInfo) {
for (thisIdx=1;thisIdx<=numElts;thisIdx++) {
printf "idxArr[%d] = %s,\teltArr[%s] = %s\n", \
thisIdx,idxArr[thisIdx], \
idxArr[thisIdx],eltArr[idxArr[thisIdx]] | "cat>&2"
}
}

if (fldNum) {
prtKey="\"" fldSep "\"-separated field " fldNum
fldSep = (fldSep == "FS" ? FS : fldSep)
for (thisIdx=1; thisIdx<=numElts; thisIdx++) {
split(eltArr[idxArr[thisIdx]],thisArr,fldSep)

keyArr[thisIdx] = thisArr[fldNum]
}
} else {

prtKey="full records"
for (thisIdx=1; thisIdx<=numElts; thisIdx++) {
keyArr[thisIdx] = eltArr[idxArr[thisIdx]]
}
}

if (sortType ~ /^s/) {
prtType="string"
for (thisIdx=1; thisIdx<=numElts; thisIdx++) {
keyArr[thisIdx] = keyArr[thisIdx]""
}
} else if (sortType ~ /^n/) {
prtType="numeric"
for (thisIdx=1; thisIdx<=numElts; thisIdx++) {
keyArr[thisIdx] = keyArr[thisIdx]+0
}
} else {
prtType="default"
}

if (sortAlg ~ /^sel/) {
prtAlg="Selection"
selSort(keyArr,idxArr,numElts)
} else {
prtAlg="Key"
keySort(keyArr,idxArr,numElts)
}

if (prtInfo) {
printf "%s Sorted %s comparison on %s of array size %d:\n", \
prtAlg,prtType,prtKey,numElts | "cat>&2"

for (thisIdx=1;thisIdx<=numElts;thisIdx++) {
printf "idxArr[%d] = %s,\teltArr[%s] = %s,\tkeyArr[%s] = %s\n", \
thisIdx,idxArr[thisIdx], \
idxArr[thisIdx],eltArr[idxArr[thisIdx]], \
idxArr[thisIdx],keyArr[idxArr[thisIdx]] | "cat>&2"
}
}

return
}

{ idxArr[++numElts]=NR; eltArr[idxArr[numElts]]=$0 }

END {
genSort(sortAlg,sortType,eltArr,idxArr,numElts,fldNum,fldSep,prtInfo)

for (idx=1;idx<=numElts;idx++) {
print eltArr[idxArr[idx]]
}
}

Aleksey Cheusov

unread,

Nov 24, 2009, 12:58:42 AM11/24/09

to

> Good job. ISTR a project called awke, which unfortunately seems to be
> inactive now, that had implemented a set of library functions for awk,
> among which was a sort routine using quick sort - O(n log n).

Have a look at http://sf.net/projects/runawk
There you'll find lots of awk modules including quicksort.awk and
heapsort.awk as well as runawk utility, tool for handling modules.

> The source code is still available at https://launchpad.net/awke, so
> you may want to have a look at that for comparison.

--
Best regards, Aleksey Cheusov.