Overhead of interfaces with sort.Sort()


Donovan Hide

unread,
Jan 13, 2013, 7:43:22 AM1/13/13
to golang-nuts
Hi,

I'm doing some big sort operations to build an inverted index, and am finding that the generated pointer-receiver code used when interfaces meet sort.Sort() is itself a big performance drain. Maybe I'm making some obvious mistakes, but this test case seems to demonstrate it:


The key lines in the profile output are:

121  34.1%  34.1%      121  34.1% main.InvertedSlice.Less
109  30.7%  64.8%      230  64.8% main.(*InvertedSlice).Less

30.7% of the time is spent just jumping from the generated pointer-receiver wrapper to the actual defined slice Less method.

Is there a way to make sure sort.Sort() doesn't invoke this wrapper and instead uses the non-pointer function directly?

Cheers,
Donovan.

Donovan Hide

unread,
Jan 13, 2013, 9:09:21 AM1/13/13
to golang-nuts
Well, after a bit more mailing-list research, it seems like this is a classic example of the need for generics and compile-time specialisation :-)

In an effort to find a short-term performance improvement, I did a quick test of copying and pasting a shell sort implementation from here:


I suppose the challenge is to find the shortest possible sort routine, in terms of lines of code, that performs best and can be pasted in to replace sort.Sort(). Shell sort implemented directly on the slice is nearly twice as fast... Bit sad about this, but hey ho!
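(The linked Shell sort is also gone. A direct implementation on a plain slice, using []int here for brevity, looks roughly like this, with the compare and swap hard-coded rather than behind an interface:)

```go
package main

import "fmt"

// shellSort sorts the slice in place using Shell's original gap
// sequence (n/2, n/4, ...). No interface calls: the comparison and
// swap compile down to direct slice indexing.
func shellSort(a []int) {
	for gap := len(a) / 2; gap > 0; gap /= 2 {
		for i := gap; i < len(a); i++ {
			// Insertion sort within the gap-strided subsequence.
			for j := i; j >= gap && a[j] < a[j-gap]; j -= gap {
				a[j], a[j-gap] = a[j-gap], a[j]
			}
		}
	}
}

func main() {
	a := []int{170, 45, 75, -90, -802, 24, 2, 66}
	shellSort(a)
	fmt.Println(a) // [-802 -90 2 24 45 66 75 170]
}
```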


Cheers,
Donovan.

Ethan Burns

unread,
Jan 13, 2013, 10:15:03 AM1/13/13
to golan...@googlegroups.com
Hi,

I have seen this behavior before too.  My solution was to make the interface methods operate directly on a pointer to my type, getting rid of the need for the generated method.  However, the thing that I am unclear on is why there is the need for main.(*InvertedSlice).Less in the first place.  I had assumed that, because the interface is implemented by the type itself, the sort package would not call main.(InvertedSlice).Less (the non-pointer version) directly.  My guess is that it relates to the fact that the interface value really stores a pointer to the InvertedSlice type, but it is still quite surprising to me that there would be a significant overhead here.

Your original version takes ~8s on my laptop and this new version takes ~6s.  A pretty significant improvement, percentage-wise.


Best,
Ethan

Ethan Burns

unread,
Jan 13, 2013, 10:17:30 AM1/13/13
to golan...@googlegroups.com
On Sunday, January 13, 2013 10:15:03 AM UTC-5, Ethan Burns wrote:
Hi,

I have seen this behavior before too.  My solution was to make the interface methods operate directly on a pointer to my type, getting rid of the need for the generated method.  However, the thing that I am unclear on is why there is the need for main.(*InvertedSlice).Less in the first place.  I had assumed that, because the interface is implemented by the type itself, the sort package would not call main.(InvertedSlice).Less (the non-pointer version) directly.

(Sorry, typo.  I meant to say "the sort package would call main... directly"---the "not" shouldn't be there.)

Donovan Hide

unread,
Jan 13, 2013, 10:32:50 AM1/13/13
to Ethan Burns, golang-nuts
Hi Ethan,

thanks for investigating! Having pointer receivers on the interface methods does indeed seem to make a significant difference. Have updated the benchmark here:


Still looks like directly implementing a sort function per slice type is faster though. Guess the inlining of the interface methods by gc doesn't quite match hard-coding the swap method, which is a shame... Either that or the 1959 Shell algorithm is a timeless classic :-)

Cheers,
Donovan.


Ethan Burns

unread,
Jan 13, 2013, 10:45:13 AM1/13/13
to golan...@googlegroups.com, Ethan Burns
Donovan,

I don't think that Go inlines interface methods at all.  The sort package will need to call the sort.Interface methods each and every time they are needed.  For this reason, I think it is very likely that implementing your own sort function, specific to each slice type, will always be faster than the sort package (assuming you use a reasonable sorting algorithm).  By the way, the qsort function in C has the same "problem," because it cannot inline the comparison function; C++ has templates, so  it does not have this "problem."

Anyway, I am not too worried about the performance of the sort package, as it is sufficiently fast for most cases, and that is its intention: being fast enough and convenient enough for general use.  The thing that I still wonder about, however, is why such an apparently expensive (*InvertedSlice).Less method needs to be generated when sort.Interface is implemented by the non-pointer InvertedSlice type.


Best,
Ethan

minux

unread,
Jan 13, 2013, 10:58:19 AM1/13/13
to Ethan Burns, golan...@googlegroups.com
On Sun, Jan 13, 2013 at 11:45 PM, Ethan Burns <burns...@gmail.com> wrote:
I don't think that Go inlines interface methods at all.  The sort package will need to call the sort.Interface methods each and every time they are needed.  For this reason, I think it is very likely that implementing your own sort function, specific to each slice type, will always be faster than the sort package (assuming you use a reasonable sorting algorithm).  By the way, the qsort function in C has the same "problem," because it cannot inline the comparison function; C++ has templates, so  it does not have this "problem."

Anyway, I am not too worried about the performance of the sort package, as it is sufficiently fast for most cases, and that is its intention: being fast enough and convenient enough for general use.  The thing that I still wonder about, however, is why such an apparently expensive (*InvertedSlice).Less method needs to be generated when sort.Interface is implemented by the non-pointer InvertedSlice type.
This has something to do with the way values are stored in an interface:
when the value is no larger than a uintptr, it is stored inline in the interface, but if it's larger than that,
it is stored as a pointer to the actual data.
In your case, a slice is stored in the interface, so what actually goes into the interface is a pointer,
and thus Sort can only call the methods with pointer receivers.

Btw, because non-pointer-receiver methods are always in the method set of the pointer type, Go
must generate the pointer-receiver wrapper methods.
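(The generated wrapper can be observed directly via method expressions; a tiny illustration, not from the thread:)

```go
package main

import "fmt"

type S []int

// Len has a value receiver; the compiler additionally generates a
// (*S).Len wrapper, roughly: func (p *S) Len() int { return (*p).Len() }
func (s S) Len() int { return len(s) }

func main() {
	f := S.Len    // the method as written
	g := (*S).Len // the compiler-generated pointer-receiver wrapper
	s := S{1, 2, 3}
	fmt.Println(f(s), g(&s)) // both call paths reach the same code
}
```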

David DENG

unread,
Jan 13, 2013, 10:09:06 PM1/13/13
to golan...@googlegroups.com, Ethan Burns
The implementation of Less should be further optimized to make the comparison fair:

func (s *InvertedSlice) Less(i, j int) bool {
	return (*s)[i].Hash < (*s)[j].Hash ||
		((*s)[i].Hash == (*s)[j].Hash && (*s)[i].Position < (*s)[j].Position)
}

Actually, I manually expanded the sort source code with []Inverted replacing the Interface, ran the sort on the same data, and the difference is smaller (n := 20000000):

Expanded Sort() 9.6615526s
sort.Sort() 12.9767423s

The whole source is here: http://play.golang.org/p/sv0cx71Lf2

David

Niklas Schnelle

unread,
Jan 14, 2013, 5:21:43 AM1/14/13
to golan...@googlegroups.com, Ethan Burns
Are you sure you want to use a sorting algorithm with pretty much unknown time complexity?
I'd recommend a simple quicksort with median-of-three pivot selection: it runs in O(n*log n) in practice, and its O(n^2) worst case will
only happen for specifically crafted sequences that exploit the pivot-choice algorithm.
You could also use a randomized quicksort, which has O(n*log n) expected running time regardless of input.
Or use heapsort, which has a guaranteed O(n*log n) running time but a less tight inner loop and is thus a bit slower in practice. Note
that for comparison-based sorts, O(n*log n) is provably the best achievable complexity.
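(A minimal sketch of the suggested median-of-three quicksort, specialised to []int; the pivot is the median of the first, middle, and last elements, so sorted and reverse-sorted inputs avoid the quadratic worst case, though adversarial inputs can still trigger it:)

```go
package main

import "fmt"

func quicksort(a []int) {
	if len(a) < 2 {
		return
	}
	m, hi := len(a)/2, len(a)-1
	// Order a[0] <= a[m] <= a[hi], then move the median to a[0].
	if a[m] < a[0] {
		a[0], a[m] = a[m], a[0]
	}
	if a[hi] < a[0] {
		a[0], a[hi] = a[hi], a[0]
	}
	if a[hi] < a[m] {
		a[m], a[hi] = a[hi], a[m]
	}
	a[0], a[m] = a[m], a[0]
	pivot := a[0]
	// Hoare partition around the pivot value.
	i, j := -1, len(a)
	for {
		for i++; a[i] < pivot; i++ {
		}
		for j--; a[j] > pivot; j-- {
		}
		if i >= j {
			break
		}
		a[i], a[j] = a[j], a[i]
	}
	quicksort(a[:j+1])
	quicksort(a[j+1:])
}

func main() {
	a := []int{170, -90, 45, 75, -802, 24, 2, 66}
	quicksort(a)
	fmt.Println(a) // [-802 -90 2 24 45 66 75 170]
}
```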

Donovan Hide

unread,
Jan 14, 2013, 5:40:19 AM1/14/13
to Niklas Schnelle, Ethan Burns, david...@gmail.com, golang-nuts
Hi again,

Thanks, David and Niklas, for the replies. Interesting to see the effect of inlining the default sort.Sort() code. I thought about doing that, but the code was too long, hence choosing Shell sort :-)

I know Shell sort is unstable, but in my test case (not the test code) it is significantly faster for my usage than using sort.Sort() with pointer-receiver interface methods. I'm basically merging two streams of inverted indices, which need to be sorted before they can be merged. Using Shell sort has trimmed seconds off this time.

My requirement is to have the fastest possible sort with the smallest amount of struct-specific code. The lines-of-code overhead for Shell sort compares favourably to implementing the sort interface, as seen in this example:


versus


If you can come up with a shorter, faster, more stable sort, I'd be very interested!

Cheers,
Donovan.

Donovan Hide

unread,
Jan 14, 2013, 6:06:48 AM1/14/13
to Niklas Schnelle, Ethan Burns, david...@gmail.com, golang-nuts
When I say:

Donovan Hide

unread,
Jan 14, 2013, 6:07:58 AM1/14/13
to Niklas Schnelle, Ethan Burns, David DENG, golang-nuts
Oops, premature send!

When I said:

On 14 January 2013 10:40, Donovan Hide <donov...@gmail.com> wrote:
Interesting to see the effect of inlining the default sort.Sort() code

I meant specialising of the default sort.Sort() code, not inlining!

David DENG

unread,
Jan 14, 2013, 6:12:07 AM1/14/13
to golan...@googlegroups.com, Niklas Schnelle, Ethan Burns, david...@gmail.com
Have you tried optimizing the Less function as I mentioned?

David

Peter

unread,
Jan 14, 2013, 6:12:51 AM1/14/13
to golan...@googlegroups.com, Niklas Schnelle, Ethan Burns, david...@gmail.com
Requiring the sort to be stable just slows it down. After all, you're expecting it to do extra work. It seems unfair to compare these two algorithms.

Specialising the sort.Sort() code does make for an interesting comparison though.

Donovan Hide

unread,
Jan 14, 2013, 6:20:39 AM1/14/13
to David DENG, golang-nuts, Niklas Schnelle, Ethan Burns
Updated the benchmark:


The Less optimisation does improve the result. It's intriguing that doing so many dereferences improves performance. Makes for less readable code, but more speed :-)

Donovan Hide

unread,
Jan 14, 2013, 6:24:07 AM1/14/13
to Peter, golang-nuts, Niklas Schnelle, Ethan Burns, David DENG
On 14 January 2013 11:12, Peter <peter.a...@gmail.com> wrote:
It seems unfair to compare these two algorithms.

It's not really a question of fairness, more a question of how to make the standard library exhibit its best performance, and of why the most obvious usage is not necessarily the fastest. A performance-tips page on the wiki would be a great addition! I know the argument is don't optimise, just wait for the compiler to get better, but in the here and now, these things do help!

Peter

unread,
Jan 14, 2013, 6:55:33 AM1/14/13
to golan...@googlegroups.com, Peter, Niklas Schnelle, Ethan Burns, David DENG
I think I expressed myself poorly. The standard sort function is a stable sort. If you don't require stability then rolling your own algorithm could give significant improvements, even if it was implemented in the same interface manner that package sort is. But to then compare the non-stable sort to the stable one is unfair, as package sort is doing extra work that you don't really need. The "most obvious usage" gives you stability, something that people may take for granted.

Comparing different optimizations of the same algorithm is certainly interesting though, please keep us informed.

Job van der Zwan

unread,
Jan 14, 2013, 7:51:19 AM1/14/13
to golan...@googlegroups.com, Peter, Niklas Schnelle, Ethan Burns, David DENG
On Monday, 14 January 2013 12:55:33 UTC+1, Peter wrote:
I think I expressed myself poorly. The standard sort function is a stable sort. If you don't require stability then rolling your own algorithm could give significant improvements, even if it was implemented in the same interface manner that package sort is. But to then compare the non-stable sort to the stable one is unfair, as package sort is doing extra work that you don't really need. The "most obvious usage" gives you stability, something that people may take for granted.

It's still worth mentioning so people can judge if the effort is worth the trouble. 

Niklas Schnelle

unread,
Jan 14, 2013, 8:00:37 AM1/14/13
to golan...@googlegroups.com, Peter, Niklas Schnelle, Ethan Burns, David DENG
You also might want to try https://github.com/psilva261/timsort
timsort is generally considered the fastest available comparison-based sort; it's quite complex, using a run-based mergesort.
From a time-complexity point of view it's of course still O(n*log n), but with a lot of engineering to make it super fast on partially
ordered data (in both directions), and a lot of fine tuning.
Unless your data is something like ints or strings, where you can use an O(n) non-comparison sort, that's probably the fastest you will get.
Also make sure not to use tiny data sets for testing: even extremely poor algorithms can be fast on small data sets because of tighter loops.
For example, with <= 7 elements, bubble sort is one of the fastest sorts, but it's O(n^2), and sorting about a million elements will
take in excess of an hour.

Kevin Gillette

unread,
Jan 14, 2013, 1:54:52 PM1/14/13
to golan...@googlegroups.com
That explicit pointer receivers for slice types (which serve no semantic purpose if the slice length/cap is not to be touched) perform better than non-pointer receivers seems to me to be either a case of the pointer methods being heavily optimized, or of the non-pointer methods generating far less performant code than I would have thought.

I will test the performance of both the obvious (semantically correct and simplest) way and the messy pointerized way with optimizations disabled.

Donovan Hide

unread,
Jan 14, 2013, 2:02:51 PM1/14/13
to Kevin Gillette, golang-nuts
Hi Kevin,

compare the not particularly useful wrapper code (*InvertedSlice).Less:


with InvertedSlice.Less :


I assume that this is to do with decoding the interface values, as minux hinted earlier, but it feels like it could/should be optimised out somehow...

Cheers,
Donovan.


Donovan Hide

unread,
Jan 14, 2013, 2:25:37 PM1/14/13
to Kevin Gillette, golang-nuts
Looks like the generated code is built here:

Kevin Gillette

unread,
Jan 14, 2013, 2:41:07 PM1/14/13
to golan...@googlegroups.com, Ethan Burns
Couldn't pointer wrappers for value-receiver methods use a unified loader, based simply on the function address and the receiver size, consisting of some asm that is jmp'd to? It seems a shame to generate wrapper methods for this (presumably incurring double the call overhead).

minux

unread,
Jan 14, 2013, 2:48:23 PM1/14/13
to Donovan Hide, Niklas Schnelle, Ethan Burns, David DENG, golang-nuts
This is a form of generics that slows down the compiler and bloats the generated binary.

Kevin Gillette

unread,
Jan 14, 2013, 3:15:03 PM1/14/13
to golan...@googlegroups.com
https://gist.github.com/4532825

Interesting note: the faster dereference-heavy implementation ended up being slower with bounds checking disabled, on my AMD X2, than it was with a default compile.

minux

unread,
Jan 14, 2013, 3:33:40 PM1/14/13
to Peter, golan...@googlegroups.com, Niklas Schnelle, Ethan Burns
On Mon, Jan 14, 2013 at 7:55 PM, Peter <peter.a...@gmail.com> wrote:
I think I expressed myself poorly. The standard sort function is a stable sort. If you don't require stability then rolling your own algorithm
sort.Sort is *not* a stable sort; you have to maintain the elements' original indices yourself
to make it stable.
See also issue 3671.

Peter

unread,
Jan 14, 2013, 3:41:54 PM1/14/13
to golan...@googlegroups.com, Peter, Niklas Schnelle, Ethan Burns
Oh. My mistake. Sorry about that.

In that case, I wonder how fast the generic version of shell sort is. I read the paper the sort function is based on a while ago (perhaps I should re-read it, given that I remembered it as being stable); I assume the authors pretty thoroughly investigated other algorithms back then.

Kyle Lemons

unread,
Jan 14, 2013, 3:52:11 PM1/14/13
to minux, Donovan Hide, Niklas Schnelle, Ethan Burns, David DENG, golang-nuts
Naively, it seems like all value methods could be generated with a few assembly instructions before them that load the value from a pointer and then run into the main function.  Conceptually:

func (p *Type) Method:
  v := *p
func (v Type) Method:
  some instructions here

Then, assuming this header is of constant or calculable size for a given type, generating an interface for a pointer value requires grabbing either the method address for pointer methods, or (method_addr - header_size) for value methods.




Kevin Gillette

unread,
Jan 14, 2013, 3:52:53 PM1/14/13
to golan...@googlegroups.com, Peter, Niklas Schnelle, Ethan Burns, David DENG
The timsort implementation for Go that you linked to has some interesting properties. First, it's benchmarked at about 2x faster than the stdlib sort, but it makes half the function calls (timsort 'owns' the data and thus does the swapping itself, while the stdlib sort asks the data to swap itself). Both sorts are effectively equivalent in asking the data to do the comparison itself. Since, like the stdlib sort, this timsort appears to take care to do its work in-place (rather than with copies), it could in theory have been designed as a drop-in replacement for the stdlib sort, accepting sort.Interface instead of the data and a less function -- had that been done, the performance gap would likely have been smaller (thus I argue the 2x relative speed of timsort is not solely due to the merits of timsort). The stdlib sort is also a hybrid sort, switching between insertion sort, heap sort, and quicksort depending on (sub)task size, though all of these are still done in-place.

That design decision also comes with downsides: while it's argued by many that Len and Less are 'baggage' in the stdlib sort when all one wants to do is make a reverse sort, the stdlib does allow data to "own" itself, which is a very important property. For one thing, timsort is only immediately convenient if the data is already stored in an []interface{}, which it very often is not (thus offsetting much of the justification for the comparatively more concise function-based API in timsort, since you often must do a copy loop in, and another copy loop back out). Because of these copy loops, this timsort artificially still counts, in many situations, as requiring n additional space, and additional benchmarks should be added to account for the common case of input data not already being stored in []interface{}, since the speed gap would be further reduced in these cases.

What is most important to me is that the stdlib sort can have fully compiler-verified semantics for homogeneous data (like []int), without any need for type assertions, since the sort never interacts with the elements directly in any way -- the same cannot be said of the timsort implementation. This has ramifications for long-running processes that only rarely need to sort: a type assertion to the wrong type in a timsort comparison function will lead to a runtime panic that may crop up weeks or months after deployment, whereas it would be unnecessary in the stdlib sort in the equivalent case (again, with homogeneous data, which is at least as common as heterogeneous data).

Niklas Schnelle

unread,
Jan 14, 2013, 4:14:18 PM1/14/13
to golan...@googlegroups.com, Peter, Niklas Schnelle, Ethan Burns, David DENG
I had not looked into this timsort implementation; I just knew it from Python, where it's the standard sort. I think the sort interface is
perfect, because making the data swap itself allows much more flexibility. In another project (in Java) I used
a very similar heap interface, and this allows things like dragging indices along with the data without much hassle, even if they are in a different data structure, and so on.

Note that timsort is mergesort-based and therefore is not in-place, as the author also notes on the github page.
That's probably why it needs to own the data.

Donovan Hide

unread,
Jan 14, 2013, 4:22:39 PM1/14/13
to Kevin Gillette, golang-nuts, Peter, Niklas Schnelle, Ethan Burns, David DENG
the timsort implementation theoretically could have been designed to be a drop-in replacement for the stdlib sort, accepting sort.Interface instead of the data and a less function

I had a good long read of timsort today as well. I think the design decision with the standard library's sort.Interface is to not let the sorter implementation know the type of the elements being sorted, and to give it no means of allocating extra memory during the sort operation. These are great ideals, and work well, but as timsort proves, a bit of extra allocation can go a long way!

I tried to think how you could extend the sort.Interface with one or two extra methods to allow timsort to be implemented against an interface and not a slice. The only solution I could think of is one where you provide a Stash(i, j int) and a Retrieve(i, j int) method:
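(The linked snippet is missing; a guess at the shape being described, with the interface name and the example type invented here, where the data itself manages a scratch area so an allocating sort never sees the element type:)

```go
package main

import (
	"fmt"
	"sort"
)

// StashInterface is a hypothetical extension of sort.Interface: the
// Stash and Retrieve method names come from the message above, but
// their exact semantics are guessed.
type StashInterface interface {
	sort.Interface
	Stash(i, j int)    // copy element j into scratch slot i
	Retrieve(i, j int) // copy scratch slot j back over element i
}

type IntSlice struct {
	data    []int
	scratch []int // temporary area for merging
}

func (s *IntSlice) Len() int           { return len(s.data) }
func (s *IntSlice) Less(i, j int) bool { return s.data[i] < s.data[j] }
func (s *IntSlice) Swap(i, j int)      { s.data[i], s.data[j] = s.data[j], s.data[i] }
func (s *IntSlice) Stash(i, j int)     { s.scratch[i] = s.data[j] }
func (s *IntSlice) Retrieve(i, j int)  { s.data[i] = s.scratch[j] }

func main() {
	s := &IntSlice{data: []int{3, 1, 2}, scratch: make([]int, 3)}
	var _ StashInterface = s // s satisfies the extended interface
	s.Stash(0, 2)            // stash element 2
	s.Retrieve(1, 0)         // write it back over element 1
	fmt.Println(s.data)
}
```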


Where the stash area is used for the temporary merging. Perhaps a range is required instead, to make a faster copy operation possible rather than indexed single-value copying...

I believe this interface would allow for a generic timsort, or any other allocating sort. Just need to write it now :-)

Cheers,
Donovan. 




Kevin Gillette

unread,
Jan 14, 2013, 4:38:29 PM1/14/13
to golan...@googlegroups.com, Peter, Niklas Schnelle, Ethan Burns, David DENG
@Niklas: I had missed the "not in-place" notice on the documentation, and glossed over lines 120-126 in timsort.go. Merge sorts can indeed be done in-place, however.

@Donovan: I'm not aware of any comparison sort where reallocation of the actual data is needed. Besides which, a traditional recursively-copying merge sort could still be done without copying the actual data (and thus could still use sort.Interface alone): you initialize an []int (I'll call it an 'index list') to the length of the input, such that each element is initially its own index, e.g. []int{0,1,2,3} when Len() returns 4. Then you do your series of Less calls, but without any calls to Swap, until you've sorted your internal []int. Finally, you scan over the index list, calling Swap for each index out of place. For example, if sorting your index list resulted in []int{3,1,0,2}, the following calls would occur: Swap(3, 0); Swap(0, 2). It's essentially sorting a known, type-safe surrogate for the initial data, involving as many copies as needed, followed by asking the actual data to update itself based on the surrogate.

Donovan Hide

unread,
Jan 14, 2013, 4:58:05 PM1/14/13
to Kevin Gillette, golang-nuts, Peter, Niklas Schnelle, Ethan Burns, David DENG
Hi Kevin,

thanks very much for that detailed explanation :-) That idea of a surrogate index list does indeed solve the allocation issue. It's interesting reading the current Java port of timsort that we're all referring to:


You can definitely see the Java influences, not very idiomatic Go.

The index list seems to be present in this implementation of a stable sort based on sort.Sort():


Although the sorted struct itself deals with the updating of the index list. This sounds like a fun project so I might have a go at porting timsort to sort.Interface...

Cheers,
Donovan.






Kevin Gillette

unread,
Jan 14, 2013, 5:31:02 PM1/14/13
to golan...@googlegroups.com
Regarding sort stability: I've found that a stable sort is not only usually unimportant in Go, but is also often semantically indistinguishable from an unstable sort, since many of the structures being sorted are slices of non-pointer types where the entirety of the element is being sorted (in which case the result of sorting out of insertion order is identical to sorting with respect to insertion order).

There are certainly many exceptions to this, but the legitimate uses of a stable sort (which requires either more memory or more time) are far less common in Go than in, in particular, languages with reference semantics on all objects (most of the popular dynamically typed VM languages) or with the potential for 'invisible' object data or an opaque memory layout (this includes most languages with traditional OOP classes). In those other languages it's harder, if not impossible, to deal with data without dealing with references, and it's often idiomatic to attach a lot of volatile data (such as caches) to objects which may end up being sorted. In such cases, it's 'safer' to make stable sorts the default.

Donovan Hide

unread,
Jan 15, 2013, 2:18:39 PM1/15/13
to Kevin Gillette, golang-nuts
Hi Kevin,

once again, many thanks for this idea. I explored it a little today, but got slightly stumped. I think the order of processing of the index list can get a little complex, or maybe I'm just a bit slow :-)

Given a nearly sorted slice:

[-90 45 75 170 -802 2 24 66]

and an index list:

[1 4 6 7 0 2 3 5]

this should let you reach a fully sorted list:

[-802 -90 2 24 45 66 75 170]

The problem I have is choosing the order in which to process the index list, using only an Interface.Swap(i, j int) method. Without the possibility of a temp variable it can get a bit gnarly. The obvious left-to-right order does not work. I've tried marking processed items as -1, but the logic of which ones to mark gets equally complex. Thanks for any ideas!

Cheers,
Donovan.

Kevin Gillette

unread,
Jan 15, 2013, 5:52:45 PM1/15/13
to golan...@googlegroups.com, Kevin Gillette
You don't happen to have any code examples of this that we can hack on, do you (in order to get past this processing-order issue)? My initial session of thinking it through, for the inputs I considered, made it seem like left-to-right sequential processing could work, though after a swap you may need to consider what was just swapped in (in case it needs to swap again). Suffice it to say, that approach (keep swapping until everything looks right) is fairly naive, because it may result in redundant swaps. There are a few other approaches that may also be viable (such as keeping a shrinking slice of unresolved indices and scanning over it in as many passes as needed -- that should be n log n instead of n^2), but in any case, I'm reasonably confident that the 'index list' approach can work for all of the copying comparison sorts I'm aware of, without increasing the time complexity of the given sort you're implementing.

Donovan Hide

unread,
Jan 15, 2013, 6:18:49 PM1/15/13
to Kevin Gillette, golang-nuts
I do, but it's a very messy, debug message laden hack of a merge sort :-)

http://play.golang.org/p/Eu_DeiPVa0

The swap problem appears in the final merge:

After:  [-802 -90 2 66 45 24 75 170] [-1 -1 -1 7 0 2 -1 -1] 0 8 4 8

with 24 45 66 being out of order. I've been using a -1 scheme to mark completed swaps, but this is obviously wrong in its current form. The main goal, as discussed is to get a simple merge sort working with sort.Interface and then move towards timsort. A lofty goal :-)

Cheers,
Donovan.

David DENG

unread,
Jan 16, 2013, 1:34:12 AM1/16/13
to golan...@googlegroups.com, Kevin Gillette
Is this the algorithm you are looking for:


David

Kevin Gillette

unread,
Jan 16, 2013, 3:14:51 AM1/16/13
to golan...@googlegroups.com, Kevin Gillette
David is correct. Before seeing his answer, as a sanity check, I reimplemented it and got a working solution...

As it turned out, while Donovan's and David's code deals with something like a 'push to' list of indices, my merge sort generates a 'pull from' list, and I ended up needing to post-process that and add a corresponding 'push to' list as well (so mine takes 2n explicit storage, plus space for the call stack, though it avoids a dead-leaf recursive call).

Oddly enough, if you test the already sorted input, the stdlib makes two swaps (presumably the second swap is to undo the first).

Verbose version is at <http://play.golang.org/p/I2E0L9oJJl>. To ease debugging, I used a byte-slice consisting of the letters a-j instead of numbers (don't try the test version as is with a longer input).

Trimmed version is at <http://play.golang.org/p/1pSgMU0Cx8>

Despite this merge sort generally requiring fewer swaps and comparisons than the stdlib sort (the benefit of using a lot more memory), it's still considerably slower (and not tuned, of course). Results I get on an AMD X2 3800+ with more than enough ram:

BenchmarkMergeSort       1000000              2788 ns/op
BenchmarkStdlibSort      1000000              1143 ns/op

Kevin Gillette

unread,
Jan 16, 2013, 4:04:03 AM1/16/13
to golan...@googlegroups.com, Kevin Gillette
@Donovan: After 10 minutes of fiddling around, I wasn't able to apply David's fix to your sort (I wasn't able to apply it to my own successfully either, despite it being a correct algorithm). It should have been easier in mine, since the swaps were a separate, final stage in my case, after arriving at a correct push-to list, while you interleave your swaps (which is fine -- just a different approach). You already arrive at the correct push-to list at the first "Before" of the final recursive step, meaning that your swap algorithm did work up until that point, but not after it. Odd.

Assuming that your current implementation is no slower than whatever the final working fix for it will be, I benchmarked it and arrived at 1316 ns/op, which is not much slower than the stdlib sort on my hardware (though the effect of garbage is not fully considered).

Regarding timsort: if you want to interleave swaps (rather than do the lazy thing like me and save them for last), I expect the tricky parts will be anywhere you see a use of copy, which would generally be after runs are detected, or, in the classic merge sort case, when the left or right list has been exhausted. To preserve locality of reference as much as possible (if that is even possible), ideally you'd want to order your swaps so that a single value "slides" to its destination, rather than doing something like a shrinking end-to-end swap. The slide would look like:

51234
15234
12534
12354
12345

The worse (and probably harder to program anyway) shrinking end-to-end swap would look like:

51234
41235
31245
21345
12345

Donovan Hide

unread,
Jan 16, 2013, 7:53:33 AM1/16/13
to Kevin Gillette, David DENG, golang-nuts
Hi Kevin and David,

David, thank you very much for that algorithm. Certainly does the job, I particularly like the triple swap :-)

Kevin, thanks for all the analysis. I got a version of merge sort working. It's nearly twice as fast as the standard library sort in the benchmarks in the standard library test suite, although it's twice as slow for the Bentley McIlroy adversary test...

Its current obvious disadvantage is the need for scratch space equivalent to size of data * size of int, plus a slice header. Reducing this is the key advantage of timsort, so I'll explore further now that I have a working merge sort.


Cheers,
Donovan.

David DENG

unread,
Jan 17, 2013, 7:02:40 AM1/17/13
to golan...@googlegroups.com, Kevin Gillette, David DENG
Is the data pattern in the benchmark the pattern you are using? I implemented an in-place merge sort, which performs much better on your test data, with less memory usage (but is slower than the built-in sort for totally random integers).

Here is the benchmark data:

 
Built-in sort
PASS
BenchmarkSortString1K       5000            803246 ns/op
BenchmarkSortInt1K          5000            314817 ns/op
BenchmarkSortInt64K           50          31921820 ns/op
BenchmarkSortInt1M             5         615235200 ns/op
BenchmarkSortRandInt1M         5         720041180 ns/op


mergeSort
PASS
BenchmarkSortString1K       5000            420223 ns/op
BenchmarkSortInt1K         10000            205411 ns/op
BenchmarkSortInt64K          100          20051150 ns/op
BenchmarkSortInt1M             5         397422720 ns/op
BenchmarkSortRandInt1M         1        1285073500 ns/op


inplaceMergesort
PASS
BenchmarkSortString1K       5000            522830 ns/op
BenchmarkSortInt1K         20000             91005 ns/op
BenchmarkSortInt64K          500           7144407 ns/op
BenchmarkSortInt1M            10         134307680 ns/op
BenchmarkSortRandInt1M         1        2711155100 ns/op


The code is shared here for your information:


David

Donovan Hide

unread,
Jan 17, 2013, 7:54:28 AM1/17/13
to David DENG, golang-nuts, Kevin Gillette
Hi David,

the test suite is just ripped from the standard library. Not sure if you are benchmarking my attempt at a merge sort or Kevin's. The data is not representative of my workload, but when is test data ever that :-) 

I think just benchmarking time in isolation is not a useful metric. Your in-place merge sort is slower for strings, which suggests either more comparisons or more swaps. It's probably worth using instrumented slices, something like:


My goal is to learn about all the different sort strategies and make the text search engine I'm working on respond as quickly as possible with the least amount of memory usage. Sorting a section of an inverted index is part of the performance I'm trying to improve. Timsort might be part of the answer, since it involves an intelligent merge sort. I'm just learning as I go along :-)

Thanks for your code though, I've learned a lot from your algorithms already!

Cheers,
Donovan. 

Niklas Schnelle

unread,
Jan 17, 2013, 8:04:32 AM1/17/13
to golan...@googlegroups.com, David DENG, Kevin Gillette
If you're searching for strings only, you might want to have a look at radix sort or even tries (https://en.wikipedia.org/wiki/Trie), which are specifically designed for string searches and things like autocomplete.

Niklas Schnelle

unread,
Jan 17, 2013, 8:10:35 AM1/17/13
to golan...@googlegroups.com, David DENG, Kevin Gillette
On another note, in the algorithm research community, comparison-based sorts are measured only by the number of swaps, which is independent of the architecture.
Also, if you want to try different sorting algorithms, quicksort is really nice to implement and should be quite a bit faster than mergesort for most data.
It has a worst case of O(n^2) comparisons, but with a median-of-three pivot-picking strategy the only way you're going to hit it is with an adversary supplying exactly the right data to trigger it. Or you can use a randomized pivot-picking strategy; then the probability of hitting O(n^2) run time for most data sets will be much smaller than the probability of your computer getting hit by a meteorite.

Donovan Hide

unread,
Jan 17, 2013, 8:16:46 AM1/17/13
to Niklas Schnelle, golang-nuts, David DENG, Kevin Gillette

If you're searching for strings only, you might want to have a look at radix sort or even tries (https://en.wikipedia.org/wiki/Trie), which are specifically designed for string searches and things like autocomplete.

Hi Niklas,

thanks for the suggestion. I'm specifically sorting a data structure returned from a massive (up to 64GB so far!) in-memory inverted index. No strings involved, just a series of hashes of a rolling window of text (Rabin-Karp style) and a position in a document. The specific bottleneck is reordering the matching positions back into document order for numerous very long documents. The code is here:


Currently an inlined/specialised Shell sort gives the best performance by dropping sort.Interface, which is a shame. I just wanted to try to improve on this by looking at other sort.Interface-compatible sort operations.
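A specialised Shell sort of the kind described operates directly on the concrete slice, with no interface method calls per comparison or swap. The struct and field names below are illustrative, not Donovan's actual types:

```go
package main

import "fmt"

// Posting stands in for the inverted-index entries being sorted.
type Posting struct {
	Hash uint64 // rolling-window hash
	Pos  int    // position in the document
}

// shellSort orders postings by Pos directly on the slice, avoiding
// the sort.Interface call overhead discussed in this thread.
func shellSort(s []Posting) {
	// Simple halving gap sequence; fancier sequences perform better.
	for gap := len(s) / 2; gap > 0; gap /= 2 {
		for i := gap; i < len(s); i++ {
			v := s[i]
			j := i
			for ; j >= gap && s[j-gap].Pos > v.Pos; j -= gap {
				s[j] = s[j-gap]
			}
			s[j] = v
		}
	}
}

func main() {
	s := []Posting{{1, 30}, {2, 10}, {3, 20}}
	shellSort(s)
	fmt.Println(s) // [{2 10} {3 20} {1 30}]
}
```

Because the comparison `s[j-gap].Pos > v.Pos` is inlined, there is no dynamic dispatch at all, which is the specialisation generics would otherwise provide.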

Cheers,
Donovan.



Donovan Hide

unread,
Jan 17, 2013, 8:21:10 AM1/17/13
to Niklas Schnelle, golang-nuts, David DENG, Kevin Gillette

On 17 January 2013 13:10, Niklas Schnelle <niklas....@gmail.com> wrote:
Also, if you want to try different sorting algorithms, quicksort is really nice to implement and should be quite a bit faster than mergesort for most data.


I think this has already been done :-)

Donovan Hide

unread,
Jan 17, 2013, 8:32:46 AM1/17/13
to Niklas Schnelle, golang-nuts, David DENG, Kevin Gillette

On another note, in the algorithm research community, comparison-based sorts are measured only by the number of swaps, which is independent of the architecture.

Bit confused why this is the case. It's perfectly possible that a comparison of a large struct could be a heavier operation than the move itself, especially if the swap only involves a change of two pointers, as would be the case in a linked list...

Philip

unread,
Jan 17, 2013, 8:34:47 AM1/17/13
to golan...@googlegroups.com, Kevin Gillette, Peter, Niklas Schnelle, Ethan Burns, David DENG
Actually I started working on making this code more idiomatic Go 1, but this is really some work -- at least if you don't want to sacrifice the current performance. In fact, some idioms in the code are from Go versions that predate append.


You can definitely see the Java influences, not very idiomatic Go.

Donovan Hide

unread,
Jan 17, 2013, 8:44:03 AM1/17/13
to Philip, golang-nuts, Kevin Gillette, Peter, Niklas Schnelle, Ethan Burns, David DENG

On 17 January 2013 13:34, Philip <philip...@tinet.net> wrote:
Actually I started working on making this code more idiomatic Go 1, but this is really some work -- at least if you don't want to sacrifice the current performance.  In fact some idioms in the code used are from Go versions that didn't know append.

Hi Philip,

no offence intended with the idiomatic comment :-) I think timsort can be remodelled to be a lot more like the standard library sort, removing a lot of duplication in the process (i.e. mergeLo, mergeHi, gallopLeft and gallopRight). I also believe the memory allocation required for the tmp slice can be factored out. It might be possible to generate the run list recursively as well...

Lots of ideas! Currently implementing gallopSearch as a drop-in replacement for sort.Search. Expect a pull request soon(ish!).
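A galloping (exponential) search of the sort being described doubles the probe distance and then binary-searches the final range; the sketch below follows the sort.Search contract (f monotonic, returns the first true index) but is not Donovan's implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// gallopSearch finds the smallest i in [0, n) with f(i) == true,
// returning n if there is none. It probes at exponentially growing
// distances, then binary-searches the bracketed range via sort.Search.
func gallopSearch(n int, f func(int) bool) int {
	if n == 0 || f(0) {
		return 0
	}
	// Gallop: find hi such that f(hi) is true (or hi reaches n).
	lo, hi := 0, 1
	for hi < n && !f(hi) {
		lo = hi
		hi *= 2
	}
	if hi > n {
		hi = n
	}
	// f(lo) is known false, so binary search only (lo, hi].
	return lo + 1 + sort.Search(hi-lo-1, func(i int) bool { return f(lo + 1 + i) })
}

func main() {
	s := []int{1, 2, 4, 8, 16, 32, 64}
	fmt.Println(gallopSearch(len(s), func(i int) bool { return s[i] >= 16 })) // 4
}
```

When the target lies near the front of the range (the common case for timsort's merges of partially ordered runs), this touches far fewer elements than a full binary search over [0, n).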

Cheers,
Donovan.  

Niklas Schnelle

unread,
Jan 17, 2013, 9:10:25 AM1/17/13
to golan...@googlegroups.com, Philip, Kevin Gillette, Peter, Niklas Schnelle, Ethan Burns, David DENG
Sorry, got a mix-up in my head there -- of course one only counts the comparisons, not the swaps. I guess I should only post stuff like that after coffee :-)

Donovan Hide

unread,
Jan 17, 2013, 9:17:08 AM1/17/13
to Niklas Schnelle, golang-nuts, Philip, Kevin Gillette, Peter, Ethan Burns, David DENG

Sorry, got a mix-up in my head there -- of course one only counts the comparisons, not the swaps. I guess I should only post stuff like that after coffee :-)

I'd say count both! My idea of a perfect way to compare sort efficiencies and choose the right one for the job:

* Number of swaps
* Number of comparisons
* Time spent by the sort neither comparing nor swapping
* Amount of memory allocated on the heap
* Amount of memory consumed through recursion on the stacks, i.e. maxDepth

Philip

unread,
Jan 17, 2013, 9:25:27 AM1/17/13
to golan...@googlegroups.com
Nice, looking forward to that :-)

Cheers, Philip

Kevin Gillette

unread,
Jan 17, 2013, 2:43:48 PM1/17/13
to golan...@googlegroups.com, David DENG, Kevin Gillette
I agree entirely with the idea of 3rd party sorts, in general, using the sort.Interface approach, since it separates the sort algorithm from the data. sort.Interface is clearly oriented toward comparison sorts, though, so any non-comparison sort cannot be expected to utilize it effectively. A radix sort (and other bit-access sorts) could possibly be efficient yet generic by assuming a slice or array pointer and accepting an interface containing Len, a method returning a pointer to the slice/array, and a pointer to the key field within the first element. These sorts could then use reflect to determine the size of the elements and the nature of the key field (whether it's a slice/string or int type), and then use unsafe (the algorithm would have to be very well tested) to compare bits and perform swaps generically based on element/key information. As I understand it, many of these sorts lose most of their value if they don't have direct access, and since they may be quite tricky to implement on an as-needed basis, and extremely verbose if implemented with type asserts and specialization (which doesn't help slices of structs), unsafe may be the only way for these sorts to be reusable.

Also, the relative speeds, in practice, of any of these algorithms may not be applicable when access is mediated behind an interface: quicksort is generally considered more cache-friendly than a copying mergesort, just as accesses to a contiguous block of memory are much more cache-friendly than a linked list of independently allocated elements. Therefore, performance of all comparison sorts using this interface will be subject to the interface implementation (and the data model behind it): cache-friendly sorts may lose this benefit due to thrashing, and sorts that already cache-thrash on their own will look comparatively quite good when paired with data models that also thrash.

@Donovan: keeping track of the variance of time taken by Less and Swap calls in isolation (no actual sort would be run), both in predictable strides and with random indices, may provide a way to measure if the sortable data is staying within cache lines.

While working on the merge sort stuff a couple days ago, I started on a sortutil package, which among other things provides sort.Interface wrappers that aid in developing and debugging sort.Interface-compatible sorts, and various element swappers that operate on sort.Interface, including Reverse, Rotate, Shuffle, and Skew. Skew, which does block-shifting using Swap, may be particularly useful for timsort, since it can handle left-or-right shifts with an emphasis on locality (an element never moves more than block-width indices in a single Swap), and should use the minimum number of Swaps -- the downside is that it currently uses recursion for planning purposes to considerably simplify the algorithm (though it doesn't use any additional storage).

Also, when implemented against sort.Interface, some things in timsort, such as the use of binary search, will still retain their benefit (or may even gain more relative value, due to the higher frequency of function calls), yet galloping mode, for example, may prove detrimental since sort.Interface does not provide a way to shift an entire block of elements at once (though a superset of sort.Interface with a block-shift method could achieve this).

If a block-shifting method is desired, and we can agree on something clean/reasonable, Skew would be extended to use that, when available. Such a method, however, would only really benefit data modelled in slices/arrays (which is arguably most of the cases), and is just tricky enough that you can't expect everybody who implements sort.Interface correctly on the first try to be able to implement this additional method correctly. It'd also only be useful on sorts that do sorted-run-detection (or take measures to construct sorted runs to be block-copied).

Kevin Gillette

unread,
Jan 24, 2013, 6:26:11 PM1/24/13
to golan...@googlegroups.com
I've completed the aforementioned sortutil package, available at <github.com/extemporalgenome/sortutil>, which among other things has logging and stats-gathering wrappers. In terms of stats, it provides call counts and per-element stats: the min, max, mean, and standard deviation of Less and Swap calls across all elements. The Analyze function provides a zero-setup means for observing the behavior of a sort.Sort compatible algorithm through a variety of hardcoded inputs. However, the data sets it uses for analysis mostly contain 26 elements (there is a 23-element set as well), so out of the box it won't help in analyzing behaviors/bugs which manifest themselves at larger or particular input sizes, or which display pathological behavior for very specific input sequences. I may later add a verification function that runs all possible permutations up to a given size through a given sort algorithm.

I do have some data now on the stdlib's behavior regarding pre-sorted inputs: <https://gist.github.com/4629151>. The gist contains partial output from passing sort.Sort through sortutil.Analyze -- it demonstrates that the stdlib Sort performs spurious swaps, at least for a presorted input of length 26 (though I've tried it with various smaller lengths, and the same behavior occurs).

Donovan Hide

unread,
Jan 24, 2013, 7:30:35 PM1/24/13
to Kevin Gillette, golang-nuts
This looks really interesting. I've been doing some work on minimising in-place swaps (or inversions) for galloping runs in a merge sort based on Timsort, so this will help analyse that work. Will give some feedback on github in the next week.

Might be worth trialling this change to sort.Sort through your analysis suite:


Cheers,
Donovan.

Kevin Gillette

unread,
Jan 24, 2013, 8:05:56 PM1/24/13
to golan...@googlegroups.com, Kevin Gillette
Is there, by chance, a patch associated with that issue?