Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Quicksort implementation + insertion sort + median of three pivot

88 views
Skip to first unread message

Joel Parker Henderson

unread,
Nov 7, 2015, 12:29:20 AM11/7/15
to
I'm writing an awk quicksort implementation to take the place of gawn asort() when I run with POSIX awk such as mawk and nawk. I'm posting the code in case it can help other people who have a similar need. The code runs well, as far as I can tell. Suggestions welcome . Best, Joel


# num-quicksort.awk
# Author: Joel Parker Henderson (jo...@joelparkerhenderson.com)
# License: Apache, BSD, CC, GPL, MIT, any open source license.
# Website: http://numcommand.com

# Intialize.
#
# Experts suggest switching from quicksort to insertion sort
# for smaller lists, and using a threshold in the range of 8 to 20.
# We choose 12 for our kinds of data. Change this as you like.

function num_quicksort_init() {
NUM_QUICKSORT_INSERTION_SORT_THRESHOLD = 12
}

# Quicksort.
#
# Quicksort selects a pivot and divides the data into values above and
# below the pivot. Sorting then recurses on the left and right sub-lists.
# Thanks to code posted on http://awk.info/?doc/quicksort.html
# TODO: optimizations such as Tukeys ninther, Shell sort, dual pivot.

function num_quicksort(A) {
num_quicksort_slice(A, 1, num_arr_length(A))
}

function num_quicksort_slice(A, lo, hi, pivot_index) {
if (lo >= hi) return
if (hi - lo < NUM_QUICKSORT_INSERTION_SORT_THRESHOLD) {
num_insertion_sort_slice(A, lo, hi)
return
}
pivot_index = num_quicksort_pivot_index_via_median_of_three_sort_slice(A, lo, hi)
pivot_index = num_partition_slice(A, lo, hi, pivot_index)
num_quicksort_slice(A, lo, pivot_index - 1)
num_quicksort_slice(A, pivot_index + 1, hi)
}

# Partition an array slice using a given pivot index.
# Return the new pivot index.

function num_partition_slice(A, lo, hi, pivot_index, left, right, pivot_value, t) {
if (lo >= hi) return lo
left = lo
right = hi
pivot_value = A[pivot_index]
num_arr_swap(A, left, pivot_index)
while (left < right) {
while (left <= hi && A[left] <= pivot_value) left++
while (right >= lo && A[right] > pivot_value) right--
if (left < right) num_arr_swap(A, left, right)
}
pivot_index = right
num_arr_swap(A, lo, pivot_index)
return pivot_index
}

# Pivot.
#
# Choose a quicksort pivot index by using the "median of three" heuristic
# with a swap sort of the three items for efficiency on the next pivot.
#
# Compared to picking the pivot randomly, the median of three heuristic:
#
# * Ensures that a common case of fully sorted data remains optimal.
# * Is more difficult for an attacker to manipulate into the worst case.
# * Is often faster than a PRNG, which is often relatively slow.
#
# The median of three looks at the first, middle and last elements of
# the array, and choose the median of those as the pivot.
#
# To get the "full effect" of the median of three, it is also important
# to sort those three items, not just use the median as the pivot --
# this does not affect what is chosen as the pivot in the current
# iteration, but can/will affect what is used as the pivot in the next
# recursive call, which helps to limit the bad behavior for a few
# initial orderings (one that turns out to be particularly bad in many
# cases is an array that is sorted, except for having the smallest item
# at the high end, or having the largest item at the low end).
#
# Thanks to http://stackoverflow.com/questions/7559608/median-of-three-values-strategy
#
# To calculate the midpoint, we prefer (lo+(hi-lo)/2) instead of the naive
# simpler ((hi+lo)/2) because the former does not risk integer overflow.

function num_quicksort_pivot_index_via_median_of_three_sort(A) {
num_quicksort_pivot_index_via_median_of_three_sort_slice(A, 1, num_arr_length(A))
}

function num_quicksort_pivot_index_via_median_of_three_sort_slice(A, lo, hi, mid) {
if (lo == hi) return lo
mid = lo + int((hi - lo) / 2)
if (A[hi] < A[lo]) num_arr_swap(A, lo, hi)
if (A[mid] < A[lo]) num_arr_swap(A, mid, lo)
if (A[hi] < A[mid]) num_arr_swap(A, hi, mid)
return mid
}

# Insertion sort.
#
# This implementation is a slightly faster version that moves A[i]
# to its position in one go and only performs one assignment in the
# inner loop body. See https://en.wikipedia.org/wiki/Insertion_sort

function num_insertion_sort(A) {
num_insertion_sort_slice(A, 1, num_arr_length(A))
}

function num_insertion_sort_slice(A, lo, hi, i, j, x) {
if (lo >= hi) return
for (i = lo + 1; i <= hi; i++) {
x = A[i]
j = i
while (j > lo && A[j-1] > x) {
A[j] = A[j-1]
j--
}
A[j] = x
}
}

# Swap array items.
function num_arr_swap(A, i, j, t) {
t = A[i]; A[i] = A[j]; A[j] = t
}

# Length of an array.
function num_arr_length(arr, i, len) {
for (i in arr) len++
return len
}

Janis Papanagnou

unread,
Nov 7, 2015, 3:03:13 AM11/7/15
to
On 07.11.2015 06:29, Joel Parker Henderson wrote:
> I'm writing an awk quicksort implementation to take the place of gawn
> asort() when I run with POSIX awk such as mawk and nawk. I'm posting the
> code in case it can help other people who have a similar need. The code
> runs well, as far as I can tell. Suggestions welcome . Best, Joel

After a brief view just one suggestion; keep the code documentation terse
and avoid unnecessary "vindications". The "sorted middle pivot of three"
was state of the art almost since the beginning of quicksort invention,
many decades ago. I'd strip down the comment to just say what method you
used, and strip most/all of the explanation and specifically the SO-link
and the part that you didn't use a PRNG-function (why would one do that
in the first place?). If you really feel any necessity giving references
to theoretical explanations you should refer to primary literature, but
certainly not to a web-forum; though, as said, that's unnecessary for a
state-of-the-art implementation.

Janis

>
[...]
> [...]


Ed Morton

unread,
Nov 7, 2015, 8:48:36 AM11/7/15
to
On 11/6/2015 11:29 PM, Joel Parker Henderson wrote:
> I'm writing an awk quicksort implementation

Already done, see http://awk.info/?doc/quicksort.html, and you might also want
to take a look at http://awk.info/?doc/sorting.html.

Ed.

Janis Papanagnou

unread,
Nov 7, 2015, 10:01:42 AM11/7/15
to
You seem to have missed that the OP has that link even in the comment of his
code.

As far as I've seen, Joel's implementation has some (standard-) improvements,
like linear sort for small data partitions, and the mentioned pivot-selection.

It makes sense to avoid using the non-optimized version on awk.info if there's
now a version that has optimizations that are state-of-the-art for 3+ decades.

Janis

>
> Ed.
>

Kaz Kylheku

unread,
Nov 7, 2015, 10:37:01 AM11/7/15
to
On 2015-11-07, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 07.11.2015 06:29, Joel Parker Henderson wrote:
>> I'm writing an awk quicksort implementation to take the place of gawn
>> asort() when I run with POSIX awk such as mawk and nawk. I'm posting the
>> code in case it can help other people who have a similar need. The code
>> runs well, as far as I can tell. Suggestions welcome . Best, Joel
>
> After a brief view just one suggestion; keep the code documentation terse
> and avoid unnecessary "vindications".

I would completely strip all the comments, which have absolutely no
value in my eyes. Everyone knows what is "quicksort" and "median of three",
or else can Google these terms.

> many decades ago. I'd strip down the comment to just say what method you
> used

already_obvious_from_the_moronically_long_identifiers_embedding_median_of_three

Janis Papanagnou

unread,
Nov 7, 2015, 11:17:20 AM11/7/15
to
It is good practise, IMO, to tell what's implemented and what optimizations
are applied. I don't think it's good practise to ask the user to go through
all the code to get that information; but as said, this can be kept terse.
My main concern is that some of the comments are misleading, many more are
unnecessary.

Janis

Kenny McCormack

unread,
Nov 7, 2015, 12:38:21 PM11/7/15
to
In article <n1l3ok$l24$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 07.11.2015 14:48, Ed Morton wrote:
>> On 11/6/2015 11:29 PM, Joel Parker Henderson wrote:
>>> I'm writing an awk quicksort implementation
>>
>> Already done, see http://awk.info/?doc/quicksort.html, and you might also want
>> to take a look at http://awk.info/?doc/sorting.html.
>
>You seem to have missed that the OP has that link even in the comment of his
>code.

Ed rarely reads posts - he's admitted to this on at least one occasion.

He's like people I see on other forums, who are so enamored of giving
advice, that they openly admit that they just scan for keywords before
responding. The funny part is how often this strategy works [*], but, of
course, it is embarrassing when it falls flat. Of course, you could also
say that it is hard to tell when it has actually worked, because if you see
someone making a sensible response, you don't really know if they read the
post before responding - or if they just got lucky...

[*] For example: One of my standard truisms is "All cron problems are
environment problems". This means that if anyone posts a message in any
kind of unix/shell help group about having a problem with cron, you can
just ignore whatever they wrote and just say "cron doesn't run with the
usual environment that you get when you login in normally", followed
usually by "Don't use cron until you really understand this" (followed, in
fact, in modern times, with "Don't use cron (period)".

--
> No, I haven't, that's why I'm asking questions. If you won't help me,
> why don't you just go find your lost manhood elsewhere.

CLC in a nutshell.

Joel Parker Henderson

unread,
Nov 7, 2015, 9:30:11 PM11/7/15
to
Thanks Janis for the great feedback; I appreciate it.

> ... keep the code documentation terse
> ... refer to primary literature

Can do.


> you didn't use a PRNG-function (why would one do that in the first place?).

Curiously the majority of quicksort-related pages that I read this week
chose a pivot randomly. Some used med. of 3, 5, 9, and Tukey's ninther.

I want to try dual pivots: primary literature http://arxiv.org/pdf/1412.0193v1.pdf

Janis Papanagnou

unread,
Nov 8, 2015, 2:24:54 AM11/8/15
to
On 08.11.2015 03:30, Joel Parker Henderson wrote:
> Thanks Janis for the great feedback; I appreciate it.
>
>> ... keep the code documentation terse
>> ... refer to primary literature
>
> Can do.
>
>
>> you didn't use a PRNG-function (why would one do that in the first place?).
>
> Curiously the majority of quicksort-related pages that I read this week
> chose a pivot randomly. [...]

It was a rhetorical question. (I don't know what pages you read, but that
comment I'd consider misleading and would certainly suggest to avoid it.)
Doing the Right Thing in your implementation suffices, there's no need for
vindications why you didn't do it in suboptimal or "wrong" ways.

Janis

Ed Morton

unread,
Nov 8, 2015, 8:20:46 PM11/8/15
to
On 11/7/2015 11:38 AM, Kenny McCormack wrote:
> In article <n1l3ok$l24$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 07.11.2015 14:48, Ed Morton wrote:
>>> On 11/6/2015 11:29 PM, Joel Parker Henderson wrote:
>>>> I'm writing an awk quicksort implementation
>>>
>>> Already done, see http://awk.info/?doc/quicksort.html, and you might also want
>>> to take a look at http://awk.info/?doc/sorting.html.
>>
>> You seem to have missed that the OP has that link even in the comment of his
>> code.
>
> Ed rarely reads posts - he's admitted to this on at least one occasion.

No, once again you're mis-quoting me. That is extremely rude, please stop. What
I SAID is that I rarely read posts from YOU.

I suspect you find yourself very much at home visiting the monkey cage at the
zoo and seeing the chimps gathering their feces to throw at passers-by.

Ed.

0 new messages