I suspect that this will not help (but have not tried and would love
to be wrong). Here's why:
Running the CPU profiler, 40% of the time is spent on the line
mcentral.go:217. This is setting up the freelist for newly allocated
spans. Perhaps this is an artifact of CPU profiling (?). If not, it
suggests that compiler performance would improve significantly if
setting up new spans were cheaper, perhaps by enabling an
alternative representation of spans that can cheaply represent n steps
of a fixed size in a row. Something vaguely like:
    type gclink struct {
        n    int       // number of remaining sequential pointers; 0 means use next instead
        step uintptr   // offset of next sequential pointer
        next gclinkptr // next set of sequential pointers
    }
I don't know enough about the memory allocator to know whether this
would help, how difficult it might be to implement, or what other
havoc it might wreak.
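To make the idea a bit more concrete, here is a rough standalone sketch of a run-based freelist. It is not runtime code: the names run, freelist, and alloc are made up for illustration, addresses are plain uintptr values rather than real memory, and the 8 KB span with 64-byte objects is an arbitrary choice. The point is only that a fresh span becomes a single descriptor, so setup is O(1) instead of one store per object.

```go
package main

import "fmt"

// run is a hypothetical stand-in for the gclink idea: it describes n
// equally spaced free objects without writing into the objects themselves.
type run struct {
	base uintptr // address of the next free object in this run
	step uintptr // distance between consecutive objects (the object size)
	n    int     // number of objects remaining in this run
	next *run    // following run, consulted once n reaches 0
}

// freelist hands out addresses from a chain of runs.
type freelist struct {
	head *run
}

// alloc returns the next free address, or 0 if every run is exhausted.
func (f *freelist) alloc() uintptr {
	// Skip over runs that have been used up.
	for f.head != nil && f.head.n == 0 {
		f.head = f.head.next
	}
	if f.head == nil {
		return 0
	}
	r := f.head
	p := r.base
	r.base += r.step
	r.n--
	return p
}

func main() {
	// A fresh 8 KB span of 64-byte objects is one run, set up in O(1).
	f := &freelist{head: &run{base: 0x1000, step: 64, n: 8192 / 64}}
	fmt.Printf("%#x %#x\n", f.alloc(), f.alloc())
}
```

The cost moves from span setup to each allocation, which now has a branch and some arithmetic instead of a single pointer load. Whether that trade is a win is exactly the part I can't judge.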
I did try the obvious thing of unrolling the loop 8x, and it
(probably) helped a tiny bit. Compiling the rotate0.go test, I saw a
borderline-significant speed-up of 1.37% (n=100, p=0.06). However, the
loop is memory-bound, not CPU-bound, so it is unsurprising that
unrolling didn't have much impact.
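For illustration, here is a simplified model of that kind of loop and an unrolled variant. This is not the code in mcentral.go: the span is simulated as a []int of words, links are slice indices rather than pointers, the function names are invented, and the unroll factor is 4 rather than 8 to keep it short. Both versions perform exactly one store per object, each to a different object, which is why unrolling only trims loop overhead and leaves the memory traffic unchanged.

```go
package main

import "fmt"

// linkSimple threads a free list through span, a simulated block of memory
// measured in words. Each object is objWords long; its first word holds the
// index of the next free object, or -1 for the last object.
func linkSimple(span []int, objWords int) {
	n := len(span) / objWords
	for i := 0; i < n; i++ {
		span[i*objWords] = (i + 1) * objWords
	}
	if n > 0 {
		span[(n-1)*objWords] = -1
	}
}

// linkUnrolled4 builds the same list with the loop body unrolled 4x.
// It issues the same stores as linkSimple, just with fewer branches.
func linkUnrolled4(span []int, objWords int) {
	n := len(span) / objWords
	if n == 0 {
		return
	}
	i := 0
	for ; i+4 <= n; i += 4 {
		b := i * objWords
		span[b] = b + objWords
		span[b+objWords] = b + 2*objWords
		span[b+2*objWords] = b + 3*objWords
		span[b+3*objWords] = b + 4*objWords
	}
	// Handle the remaining 0-3 objects.
	for ; i < n; i++ {
		span[i*objWords] = (i + 1) * objWords
	}
	span[(n-1)*objWords] = -1
}

func main() {
	a := make([]int, 16)
	b := make([]int, 16)
	linkSimple(a, 2)
	linkUnrolled4(b, 2)

	// Walk the list built by the unrolled version and count the objects.
	count := 0
	for p := 0; p != -1; p = b[p] {
		count++
	}
	fmt.Println(count, fmt.Sprint(a) == fmt.Sprint(b))
}
```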
If this effect is not an artifact of CPU profiling, addressing it
might help all short-lived programs that allocate heavily and free
little.
-josh