I don't have direct feedback on this, but I do have an observation based on my own faster-sort code: timing seems about the same, and scaling seems different from what this report describes.
The sorty_test.go file starts with "const N = 1 << 28", so we're talking about sorting a 268,435,456-element array. In that code the elements are uint32s and float32s; in mine, 64-bit ints. I should be slower, but adjusting my benchmarks for this array size, I see:
go standard library sort.Ints(): 54669865378 ns/op = 54.6 sec
my quicksort: 21398838106 ns/op = 21.3 sec (2.55x stdlib)
my parallel quicksort: 5428725888 ns/op = 5.4 sec (10.0x stdlib, 3.94x serial version on my 4 cpu macbook pro)
These are 64-bit values; 32-bit is just slightly faster:
5.3 sec
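For anyone curious what "parallel quicksort" means in this context, here is a minimal sketch of the general idea. It is not the code I benchmarked above; the names (parallelQuicksort, psort, partition) and the serialCutoff value are made up for illustration, and a tuned version would do more (better pivot selection, limiting goroutine count, and so on).

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"sync"
)

// serialCutoff is an assumed, untuned threshold: slices at or below this
// size are handed to the standard library instead of being split further.
const serialCutoff = 1 << 12

// partition rearranges a around the value of its middle element (Hoare-style)
// and returns (i, j) with j < i such that every element of a[:j+1] is <= the
// pivot and every element of a[i:] is >= the pivot.
func partition(a []int) (int, int) {
	p := a[len(a)/2]
	i, j := 0, len(a)-1
	for i <= j {
		for a[i] < p {
			i++
		}
		for a[j] > p {
			j--
		}
		if i <= j {
			a[i], a[j] = a[j], a[i]
			i++
			j--
		}
	}
	return i, j
}

// psort sorts a, handing the left part of each partition to a new goroutine
// and continuing with the right part in the current one.
func psort(a []int, wg *sync.WaitGroup) {
	defer wg.Done()
	for len(a) > serialCutoff {
		i, j := partition(a)
		wg.Add(1)
		go psort(a[:j+1], wg)
		a = a[i:]
	}
	sort.Ints(a)
}

// parallelQuicksort sorts a in place and returns once all goroutines finish.
func parallelQuicksort(a []int) {
	var wg sync.WaitGroup
	wg.Add(1)
	psort(a, &wg)
	wg.Wait()
}

func main() {
	xs := make([]int, 1<<20) // much smaller than the 1<<28 used in the benchmarks
	for i := range xs {
		xs[i] = rand.Int()
	}
	parallelQuicksort(xs)
	fmt.Println("sorted:", sort.IntsAreSorted(xs))
}
```

The two halves of a partition never overlap, so the goroutines can work on the same backing array without locking; the only shared state is the WaitGroup.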
I don't see a slowdown here from one Go version to the next. Of course, the standard library being 10x slower than a tuned parallel sort is unfortunate.
Michael