Advice for slices.SortFunc by field via pointer path optimization


Jorropo

Jul 25, 2025, 1:16:50 PM
to golang-dev
Hello, I would like to know whether the approach I am taking is reasonable before I spend a lot of time on it:
would you consider merging the following optimization?

I am trying to improve the performance of:

slices.SortFunc(data, func(a, b T) int { return cmp.Compare(a.d, b.d) })

In English: sorting by a field of a struct.

To do so, the compiler would inspect the function literal and rewrite the call to slices.sortWithPointerPath when it matches a known pattern (returning the result of cmp(a.ppableExpression, b.ppableExpression), or the same with a and b swapped).
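To illustrate the intended transformation (a sketch only; sortWithPointerPath and pointerPath are the unexported helpers this proposal adds, and their exact signatures here are my assumption):

```
// What the user writes:
slices.SortFunc(data, func(a, b T) int { return cmp.Compare(a.d, b.d) })

// What the compiler would emit instead (hypothetical shape; only the
// compiler can reach the unexported helper):
slices.sortWithPointerPath(data, pointerPath[T, uint32]{off: unsafe.Offsetof(T{}.d)})
```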


I have currently implemented enough to write a benchmark (everything but the compiler rewrite; the benchmark calls sortWithPointerPath manually).
This yields a >2X performance win:
goos: linux
goarch: amd64
pkg: slices
cpu: AMD Ryzen 5 3600 6-Core Processor    
                                          │      old      │                 new                  │
                                          │    sec/op     │    sec/op     vs base                │
SortUniqueUint32InAStruct/Size16-12          659.7n ±  2%   265.3n ±  4%  -59.78% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size32-12         1839.3n ±  5%   662.8n ±  8%  -63.97% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size64-12          4.393µ ±  2%   1.497µ ±  5%  -65.92% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size128-12        10.642µ ±  4%   3.354µ ±  9%  -68.48% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size256-12        24.771µ ±  1%   7.357µ ±  8%  -70.30% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size512-12         54.74µ ±  2%   16.05µ ±  5%  -70.67% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size1024-12       121.92µ ±  3%   45.98µ ±  2%  -62.28% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size2048-12        280.5µ ±  9%   132.1µ ± 10%  -52.89% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size4096-12        605.0µ ±  4%   292.1µ ±  4%  -51.72% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size8192-12       1307.6µ ±  5%   634.7µ ±  4%  -51.46% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size16384-12       2.806m ±  2%   1.344m ±  2%  -52.10% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size32768-12       6.163m ±  3%   2.816m ±  2%  -54.31% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size65536-12      12.970m ±  5%   6.075m ±  3%  -53.16% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size131072-12      27.04m ±  3%   13.00m ±  9%  -51.92% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size262144-12      57.65m ±  5%   28.86m ±  7%  -49.94% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size524288-12     122.76m ±  5%   61.13m ±  6%  -50.20% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size1048576-12     251.6m ±  7%   119.6m ± 11%  -52.45% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size2097152-12     523.8m ± 11%   257.2m ±  5%  -50.90% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size4194304-12    1080.5m ±  5%   512.3m ±  4%  -52.59% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size8388608-12      2.299 ±  7%    1.061 ±  3%  -53.84% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size16777216-12     4.770 ±  4%    2.287 ±  6%  -52.06% (p=0.000 n=10)
geomean

Note this benchmark shows the difference between virtual compares and pointer-path compares for a 5×uint32 struct, sorting by the 4th field.
The data is shuffled using identical PCG seeds and rand.Shuffle; the cost of shuffling has been removed by benchmarking the shuffle alone and subtracting its ns/op from each sort's ns/op.
Size indicates the length of the buffer in structs.
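For context, a minimal reconstruction of the benchmarked shape (the field names are invented; only the 5×uint32 layout keyed on the 4th field comes from the description above):

```
import (
	"cmp"
	"slices"
)

// elem mirrors the benchmark's 5×uint32 struct (20 bytes per element).
type elem struct {
	A, B, C, D, E uint32
}

func sortByD(data []elem) {
	// Baseline path: every comparison copies two 20-byte elems into the
	// closure before reading the single field it needs.
	slices.SortFunc(data, func(a, b elem) int { return cmp.Compare(a.D, b.D) })
}
```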

Keith Randall

Jul 25, 2025, 1:43:52 PM
to Jorropo, golang-dev
I feel like it would be better to have an indirect sorting function, like
```
slices.SortFuncIndir(data []T, cmp func(a, b *T) int)
```
(not happy with the name)

Then callers can do all the unsafe shenanigans they want inside the comparison function. Or just do `cmp.Compare(a.x.y, b.x.y)`, no unsafe needed.

This is presuming that the gains you are seeing are from passing large values to the comparison function. By passing a pointer instead, the caller can access just the data they need.
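For example, a call site under this proposal might look like the following (hypothetical: SortFuncIndir does not exist today, and the element type is made up):

```
type record struct {
	key     string
	payload [128]byte // bulk that a by-value comparator would copy on every compare
}

func sortRecords(rs []record) {
	// Hypothetical API from this message: the comparator receives pointers,
	// so only the key field is ever loaded.
	slices.SortFuncIndir(rs, func(a, b *record) int {
		return cmp.Compare(a.key, b.key)
	})
}
```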



Jorropo

Jul 25, 2025, 8:55:32 PM
to kei...@alum.mit.edu, golang-dev
I originally wanted to address the virtual call overhead.
Using the same methodology, but this time benchmarking plain `uint32`s, we see performance gains between 16% and 40%.
goos: linux
goarch: amd64
pkg: slices
cpu: AMD Ryzen 5 3600 6-Core Processor
                                          │     old      │                 new                 │
                                          │    sec/op    │   sec/op     vs base                │
SortUniqueUint32InAStruct/Size16-12          172.2n ± 2%   103.9n ± 4%  -39.67% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size32-12          507.3n ± 4%   304.2n ± 4%  -40.04% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size64-12         1294.7n ± 2%   797.7n ± 6%  -38.39% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size128-12         3.204µ ± 4%   1.969µ ± 2%  -38.54% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size256-12         7.523µ ± 4%   4.478µ ± 3%  -40.47% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size512-12        19.863µ ± 1%   9.957µ ± 7%  -49.87% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size1024-12        53.77µ ± 2%   36.79µ ± 4%  -31.58% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size2048-12        132.7µ ± 1%   109.2µ ± 3%  -17.67% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size4096-12        291.1µ ± 3%   243.4µ ± 8%  -16.40% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size8192-12        639.0µ ± 4%   534.7µ ± 3%  -16.32% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size16384-12       1.359m ± 1%   1.141m ± 4%  -16.03% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size32768-12       2.897m ± 1%   2.390m ± 3%  -17.48% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size65536-12       6.238m ± 2%   5.149m ± 2%  -17.45% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size131072-12      13.22m ± 1%   10.98m ± 4%  -16.95% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size262144-12      28.00m ± 3%   22.81m ± 9%  -18.55% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size524288-12      58.73m ± 2%   48.59m ± 3%  -17.26% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size1048576-12     123.7m ± 3%   102.5m ± 7%  -17.18% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size2097152-12     259.3m ± 7%   211.6m ± 4%  -18.41% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size4194304-12     560.5m ± 6%   459.9m ± 7%  -17.95% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size8388608-12    1131.5m ± 6%   906.8m ± 8%  -19.86% (p=0.000 n=10)
SortUniqueUint32InAStruct/Size16777216-12     2.371 ± 4%    1.928 ± 5%  -18.69% (p=0.000 n=10)
geomean                                      1.030m        763.4µ       -25.87%

Adding a new function like slices.SortFuncIndir does not address the virtual call overhead and is unlikely to be worth complicating the slices package API for:
it would deliver roughly half the performance gains while requiring work from end users,
whereas the solution I propose is entirely compiler driven and would automatically apply in compatible cases.

Just to be sure: no end user would ever write unsafe code here; the compiler would generate the unsafe code for you (I haven't written that part yet).
In the slices package, pointerPath safely encapsulates all unsafe operations and is very reviewable.
The compiler's pointer path generation will be significantly more complex, though.
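A minimal sketch of the idea (not the actual implementation; assume a single-offset path to an ordered field, so treat every detail here as an assumption):

```
import (
	"cmp"
	"unsafe"
)

// pointerPath records how to reach an ordered field of type F inside an
// element of type T. Sketch only: real paths would also need to handle
// chained selectors and indirections.
type pointerPath[T any, F cmp.Ordered] struct {
	off uintptr // byte offset of the field within T
}

// compare reads the field through raw pointers, avoiding a copy of T.
func (p pointerPath[T, F]) compare(a, b *T) int {
	af := *(*F)(unsafe.Add(unsafe.Pointer(a), p.off))
	bf := *(*F)(unsafe.Add(unsafe.Pointer(b), p.off))
	return cmp.Compare(af, bf)
}
```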

Keith Randall

Jul 28, 2025, 5:56:19 PM
to Jorropo, golang-dev
I'm not super excited about it. It seems like it really ties the compiler to the sort package in a way it wasn't before. We've only hard-coded the semantics of packages for very simple packages so far, like math/bits.
I'm also worried that any compiler detection of this is going to be fragile. Will all of these work?

slices.SortFunc(data, func(a, b T) int {return cmp.Compare(a.d, b.d) })
slices.SortFunc(data, func(a, b T) int {return a.d - b.d })
slices.SortFunc(data, func(a, b T) int {ad := a.d; bd := b.d; return cmp.Compare(ad, bd)})

Jorropo

Jul 28, 2025, 10:39:36 PM
to kei...@alum.mit.edu, golang-dev
> I'm not super excited about it. It seems like it really ties the compiler to the sort package in a way it wasn't before. We've only hard-coded the semantics of packages for very simple packages so far, like math/bits.

Yes, exactly. IMO it is worth it for a 2X improvement on something as common as sorting; if it breaks, I'll fix it.


> I'm also worried that any compiler detection of this is going to be fragile. Will all of these work?

Here is data on usages of slices.SortFunc in the wild (sourced from https://grep.app/search?case=true&q=slices.SortFunc%28).
I read the files in order for 30 minutes; they come from many different popular repos. It's not a fair random sample, but it's easy to search and I didn't think it would make a difference.
Applicable:
- 38: func(a, b T) int { return cmp.Compare(a.d, b.d) }
- 27: func(a, b T) int { if a.d < b.d { return constX } if a.d > b.d { return constY } return 0 }
- 10: func(a, b T) int { return strings.Compare(a.d, b.d) }
- 7: func(a, b T) int { return cmp.Compare(a.InlinableGetter(), b.InlinableGetter()) }
- 4: func(a, b T) int { if a.InlinableGetter() < b.InlinableGetter() { return constX } if a.InlinableGetter() > b.InlinableGetter() { return constY } return 0 }
- 3: cmp.Compare
- 1: func (a, b T) int { return int(a.d) - int(b.d) } // no-op conv; type T int

We could cheat:
- 13: func(a, b T) int { return bytes.Compare(a.d, b.d) }

Maybe but I don't care right now:
- 3: func(a, b T) int { if reversed { return cmp.Compare(a.d, b.d) } return cmp.Compare(b.d, a.d) }
- 2: func(a, b T) int { if a.InlinableGetterWhichSprintfAConstantPrefix() < b.InlinableGetterWhichSprintfAConstantPrefix() { return constX } if a.InlinableGetterWhichSprintfAConstantPrefix() > b.InlinableGetterWhichSprintfAConstantPrefix() { return constY } return 0 }

Non-Applicable:
- 46: Sort by a secondary key on equality
- 6: func(a, b []T) int { if a[const] < b[const] { return constX } if a[const] > b[const] { return constY } return 0 }
- 6: Complex Logic
- 5: func(a, b T) int { return cmp.Compare(a.aMap[k], b.aMap[k]) }
- 3: runtimeClosure
- 3: slices.Compare
- 3: func(a, b T) int { if reversed { return cmp.Compare(a.aMap[k], b.aMap[k]) } return cmp.Compare(b.aMap[k], a.aMap[k]) }
- 2: func(a, b T) int { if a.d.Less(b.d) { return constX } if b.d.Less(a.d) { return constY } return 0 }
- 2: does io (fetch from DB)
- 1: func(a, b T) int { return strings.Compare(a.avail - a.free, b.avail - b.free) }
- 1: func(a, b T) int { return cmp.Compare(a.StringConcatMultipleFields(), b.StringConcatMultipleFields()) }

Data not conveyed in the results:
- The compare function is not always a literal; sometimes it's a method expression used as a function: slices.SortFunc(s, T.CmpMethod).
- The selectors are varied, although it is not much harder to write a general pointer path extractor.
- I have never seen a pointer path with 3 indirections.

Notes:
- func(a, b T) int { if a.d < b.d { return constX } if a.d > b.d { return constY } return 0 } // Good candidate for a modernizer.
- cmp.Compare // this is not a mistake; is NaN handling different? I don't know why someone wouldn't use slices.Sort.
- I have seen negation (-) used to invert the result of cmp.Compare a single time! Swapping a and b is more common. It ended up not listed in Applicable because they also used cmp.Or, so it falls under « Sort by a secondary key on equality ».
- About « we could cheat »: we could lie when generating the pointer path, emitting `pointerPath[T, string]{...}` rather than `pointerPath[T, []byte]{...}`, and optimize these too.
- Surprisingly, the 2-ifs-3-returns form was never written differently (I write it -1 0 1 myself).

Conclusion
If we only match {cmp,bytes,strings}.Compare(a.pointerPath, b.pointerPath), we will optimize 64% of the optimizable slices.SortFunc calls and 37% of all slices.SortFunc calls.
If we also match the 2-ifs-3-returns form, the numbers go to 94% and 54%.
This detection should run after inlining, since most getters end up returning a single field; see the sketch below.
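For example (hypothetical types), after inlining a trivial getter the comparator reduces to exactly the field-selector shape the detector matches:

```
type user struct{ id uint32 }

// ID is trivially inlinable.
func (u user) ID() uint32 { return u.id }

func sortUsers(users []user) {
	// After inlining, this is equivalent to:
	//   func(a, b user) int { return cmp.Compare(a.id, b.id) }
	slices.SortFunc(users, func(a, b user) int {
		return cmp.Compare(a.ID(), b.ID())
	})
}
```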

There is a tail of creative variations, but I don't think it's a problem if we don't catch them all: at worst nothing happens, at best sorting is 2X faster.