Duff devices in the runtime


Keith Randall

Jun 2, 2025, 6:27:20 PM
to golang-dev
TL;DR: try my CL (CL 678175) on your favorite amd64 or arm64 machine and let me know whether it makes things faster or slower.

I'm thinking about getting rid of the Duff device optimization from the compiler/runtime. It doesn't seem to help performance any more, at least in most situations. But it is hard to be sure because I only have access to a handful of machines.

Background: the Duff devices help memory zeroing and memory copying by keeping one large unrolled zeroing/copying loop in the runtime. The compiler then synthesizes a jump into the middle of those unrolled instructions, so that it executes exactly the fraction of the unrolled loop that it needs.

In Go, it would look like:
```
s[0] = 0; s = s[1:]
s[0] = 0; s = s[1:]
s[0] = 0; s = s[1:]
...lots more repeats...
s[0] = 0; s = s[1:]
s[0] = 0; s = s[1:]
s[0] = 0; s = s[1:]
return
```
Then the compiler, if it needs to zero 7 words, computes the address of the 7th line from the end and jumps to it.
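
As a rough illustration of "entering an unrolled sequence part-way through", here is a minimal Go sketch that uses `switch` with `fallthrough`. It is only an analogy: the real routines are assembly in the runtime and the compiler emits a computed jump, whereas the hypothetical helper `zeroUpTo4` below handles at most 4 words and zeroes them from the highest index down.

```go
package main

import "fmt"

// zeroUpTo4 is a hypothetical helper, not runtime code. It zeroes the
// first n words of s (for 1 <= n <= 4) by picking an entry point into
// an unrolled run of stores and falling through to the end.
// Any other n leaves s untouched.
func zeroUpTo4(s []uint64, n int) {
	switch n {
	case 4:
		s[3] = 0
		fallthrough
	case 3:
		s[2] = 0
		fallthrough
	case 2:
		s[1] = 0
		fallthrough
	case 1:
		s[0] = 0
	}
}

func main() {
	buf := []uint64{1, 2, 3, 4, 5}
	zeroUpTo4(buf, 3) // enter at "case 3": three stores execute
	fmt.Println(buf)  // [0 0 0 4 5]
}
```

The entry point alone decides how many stores run, so there is no loop counter or backward branch inside the unrolled body.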

It sounds great, but each call carries some overhead, and branch predictors are so good nowadays that getting rid of the loop overhead at the callsite just isn't that important.

If you have an amd64 or arm64 machine on which you care about performance, please try my CL and report back. (The CL contains a few reports already. Check there if you're wondering whether your machine was already tried.)

Benchmark command: go test runtime -test.run=none -test.bench='Mem(clr|move)KnownSize(112|128|192|248|256|512|1024)$' -test.count=10
Run the benchmark at tip and tip+CL and use benchstat to report the differences. Replies to this thread are fine, or email me directly.
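
For readers who want a feel for what a "known size" benchmark measures, here is a standalone sketch. It is not the benchmark code from the CL: the names `buf128`, `clearKnown128`, and `benchClear128` are made up for illustration, and whether a given toolchain lowers this clear to a Duff routine, a loop, or straight-line stores depends on the compiler version and architecture.

```go
package main

import (
	"fmt"
	"testing"
)

// buf128 has a size fixed at compile time, so the compiler knows exactly
// how many bytes each clear touches. (Hypothetical type for this sketch.)
type buf128 [128]byte

// sink is a package-level variable so the stores are not optimized away.
var sink buf128

// clearKnown128 zeroes a buffer whose size is a compile-time constant.
func clearKnown128(p *buf128) {
	*p = buf128{}
}

// benchClear128 reports time per clear and, through SetBytes, throughput.
func benchClear128(b *testing.B) {
	b.SetBytes(int64(len(sink)))
	for i := 0; i < b.N; i++ {
		clearKnown128(&sink)
	}
}

func main() {
	// testing.Benchmark runs the function outside of "go test".
	r := testing.Benchmark(benchClear128)
	fmt.Println(r)
}
```

A single run like this is too noisy to compare toolchains. The repeated runs plus benchstat described above are what make the differences trustworthy.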

Thanks!

Daniel Martí

Jun 2, 2025, 7:12:42 PM
to golan...@googlegroups.com

Hi Keith,

Perhaps split the added benchmarks into a parent CL in a chain. Otherwise one has to apply them by hand to properly benchmark master without the rest of your patch.

In any case, below are the numbers for my amd64 zen5 laptop in performance mode, to avoid throttling. It appears to be a net win throughout.

goos: linux
goarch: amd64
pkg: runtime
cpu: AMD Ryzen AI 9 HX 370 w/ Radeon 890M           
                        │     tip      │               cl678175               │
                        │    sec/op    │    sec/op     vs base                │
MemclrKnownSize112-24     1.0675n ± 1%   0.7465n ± 0%  -30.07% (p=0.000 n=10)
MemclrKnownSize128-24     1.0665n ± 2%   0.8535n ± 0%  -19.97% (p=0.000 n=10)
MemclrKnownSize192-24      1.492n ± 0%    1.280n ± 1%  -14.20% (p=0.000 n=10)
MemclrKnownSize248-24      2.237n ± 0%    2.072n ± 2%   -7.33% (p=0.000 n=10)
MemclrKnownSize256-24      1.917n ± 1%    1.707n ± 1%  -10.95% (p=0.000 n=10)
MemclrKnownSize512-24      3.623n ± 0%    3.412n ± 0%   -5.84% (p=0.000 n=10)
MemclrKnownSize1024-24     7.042n ± 1%    6.819n ± 0%   -3.17% (p=0.000 n=10)
MemmoveKnownSize112-24    1.2880n ± 0%   0.9974n ± 1%  -22.56% (p=0.000 n=10)
MemmoveKnownSize128-24     1.918n ± 1%    1.193n ± 2%  -37.83% (p=0.000 n=10)
MemmoveKnownSize192-24     1.918n ± 0%    1.862n ± 1%   -2.89% (p=0.000 n=10)
MemmoveKnownSize248-24     2.572n ± 2%    2.253n ± 1%  -12.38% (p=0.000 n=10)
MemmoveKnownSize256-24     2.557n ± 1%    2.306n ± 0%   -9.84% (p=0.000 n=10)
MemmoveKnownSize512-24     5.115n ± 0%    4.337n ± 1%  -15.21% (p=0.000 n=10)
MemmoveKnownSize1024-24   11.600n ± 1%    8.368n ± 4%  -27.87% (p=0.000 n=10)
geomean                    2.486n         2.078n       -16.39%

                        │     tip      │               cl678175                │
                        │     B/s      │      B/s       vs base                │
MemclrKnownSize112-24     97.73Gi ± 0%   139.73Gi ± 0%  +42.98% (p=0.000 n=10)
MemclrKnownSize128-24     111.8Gi ± 1%    139.7Gi ± 0%  +24.94% (p=0.000 n=10)
MemclrKnownSize192-24     119.8Gi ± 0%    139.6Gi ± 1%  +16.54% (p=0.000 n=10)
MemclrKnownSize248-24     103.3Gi ± 0%    111.4Gi ± 2%   +7.92% (p=0.000 n=10)
MemclrKnownSize256-24     124.4Gi ± 1%    139.7Gi ± 1%  +12.30% (p=0.000 n=10)
MemclrKnownSize512-24     131.6Gi ± 0%    139.8Gi ± 0%   +6.19% (p=0.000 n=10)
MemclrKnownSize1024-24    135.4Gi ± 1%    139.9Gi ± 0%   +3.28% (p=0.000 n=10)
MemmoveKnownSize112-24    80.98Gi ± 0%   104.58Gi ± 1%  +29.14% (p=0.000 n=10)
MemmoveKnownSize128-24    62.15Gi ± 1%    99.97Gi ± 2%  +60.85% (p=0.000 n=10)
MemmoveKnownSize192-24    93.26Gi ± 0%    96.04Gi ± 1%   +2.98% (p=0.000 n=10)
MemmoveKnownSize248-24    89.80Gi ± 2%   102.49Gi ± 1%  +14.13% (p=0.000 n=10)
MemmoveKnownSize256-24    93.24Gi ± 1%   103.42Gi ± 0%  +10.92% (p=0.000 n=10)
MemmoveKnownSize512-24    93.23Gi ± 0%   109.97Gi ± 1%  +17.95% (p=0.000 n=10)
MemmoveKnownSize1024-24   82.22Gi ± 1%   113.98Gi ± 4%  +38.63% (p=0.000 n=10)
geomean                   99.27Gi         118.7Gi       +19.60%
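
The two tables above carry the same information: throughput is the buffer size divided by the time per operation, with benchstat's "Gi" meaning 2^30. A small sketch of that arithmetic follows. The helper name `gibPerSec` is made up for illustration, and it assumes the number in each benchmark name is the size in bytes.

```go
package main

import "fmt"

// gibPerSec converts a buffer size in bytes and a time per operation in
// nanoseconds into throughput in GiB/s (1 GiB = 1<<30 bytes).
func gibPerSec(sizeBytes, nsPerOp float64) float64 {
	bytesPerSec := sizeBytes / (nsPerOp * 1e-9)
	return bytesPerSec / (1 << 30)
}

func main() {
	// MemclrKnownSize112 at tip and with the CL, using the sec/op medians
	// from the first table.
	fmt.Printf("tip: %.2f GiB/s\n", gibPerSec(112, 1.0675))
	fmt.Printf("cl:  %.2f GiB/s\n", gibPerSec(112, 0.7465))
}
```

This gives roughly 97.7 and 139.7 GiB/s, close to the reported 97.73Gi and 139.73Gi. Small differences are expected because benchstat summarizes each unit separately rather than converting one summary into the other.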


Caleb Spare

Jun 2, 2025, 7:20:14 PM
to kei...@alum.mit.edu, golang-dev
Hey Keith, results look great for the memclr benchmarks on a couple of chips I care about:

goos: linux
goarch: amd64
pkg: runtime
cpu: AMD Ryzen 9 3900X 12-Core Processor
                       │   old.txt   │               new.txt               │
                       │   sec/op    │   sec/op     vs base                │
MemclrKnownSize112-24    2.133n ± 2%   1.627n ± 2%  -23.72% (p=0.000 n=10)
MemclrKnownSize128-24    2.366n ± 2%   1.850n ± 3%  -21.81% (p=0.000 n=10)
MemclrKnownSize192-24    3.235n ± 3%   2.766n ± 1%  -14.50% (p=0.000 n=10)
MemclrKnownSize248-24    5.889n ± 2%   5.345n ± 2%   -9.25% (p=0.000 n=10)
MemclrKnownSize256-24    4.187n ± 2%   3.751n ± 1%  -10.39% (p=0.000 n=10)
MemclrKnownSize512-24    7.895n ± 2%   7.500n ± 2%   -5.01% (p=0.000 n=10)
MemclrKnownSize1024-24   15.53n ± 2%   14.86n ± 4%   -4.35% (p=0.001 n=10)
geomean                  4.683n        4.073n       -13.01%

                       │   old.txt    │               new.txt                │
                       │     B/s      │     B/s       vs base                │
MemclrKnownSize112-24    48.90Gi ± 3%   64.11Gi ± 2%  +31.09% (p=0.000 n=10)
MemclrKnownSize128-24    50.39Gi ± 2%   64.43Gi ± 3%  +27.87% (p=0.000 n=10)
MemclrKnownSize192-24    55.28Gi ± 3%   64.64Gi ± 1%  +16.92% (p=0.000 n=10)
MemclrKnownSize248-24    39.22Gi ± 2%   43.22Gi ± 2%  +10.19% (p=0.000 n=10)
MemclrKnownSize256-24    56.95Gi ± 2%   63.55Gi ± 1%  +11.59% (p=0.000 n=10)
MemclrKnownSize512-24    60.39Gi ± 2%   63.58Gi ± 2%   +5.28% (p=0.000 n=10)
MemclrKnownSize1024-24   61.39Gi ± 2%   64.18Gi ± 4%   +4.54% (p=0.001 n=10)
geomean                  52.69Gi        60.57Gi       +14.95%


This one is an AWS Graviton 4 (arm64):

goos: linux
goarch: arm64
pkg: runtime
                       │   old.txt   │               new.txt               │
                       │   sec/op    │   sec/op     vs base                │
MemclrKnownSize112-48    2.875n ± 3%   2.334n ± 0%  -18.82% (p=0.000 n=10)
MemclrKnownSize128-48    2.355n ± 0%   2.488n ± 2%   +5.65% (p=0.000 n=10)
MemclrKnownSize192-48    2.509n ± 0%   2.363n ± 2%   -5.84% (p=0.000 n=10)
MemclrKnownSize248-48    3.226n ± 0%   2.870n ± 0%  -11.02% (p=0.000 n=10)
MemclrKnownSize256-48    3.225n ± 0%   2.893n ± 0%  -10.29% (p=0.000 n=10)
MemclrKnownSize512-48    6.772n ± 0%   6.612n ± 0%   -2.37% (p=0.000 n=10)
MemclrKnownSize1024-48   16.49n ± 0%   15.88n ± 0%   -3.70% (p=0.000 n=10)
geomean                  4.108n        3.824n        -6.90%

                       │   old.txt    │               new.txt                │
                       │     B/s      │     B/s       vs base                │
MemclrKnownSize112-48    36.29Gi ± 3%   44.69Gi ± 0%  +23.15% (p=0.000 n=10)
MemclrKnownSize128-48    50.61Gi ± 0%   47.91Gi ± 3%   -5.34% (p=0.000 n=10)
MemclrKnownSize192-48    71.27Gi ± 0%   75.69Gi ± 2%   +6.21% (p=0.000 n=10)
MemclrKnownSize248-48    71.61Gi ± 0%   80.47Gi ± 0%  +12.37% (p=0.000 n=10)
MemclrKnownSize256-48    73.92Gi ± 0%   82.41Gi ± 0%  +11.48% (p=0.000 n=10)
MemclrKnownSize512-48    70.41Gi ± 0%   72.12Gi ± 0%   +2.43% (p=0.000 n=10)
MemclrKnownSize1024-48   57.84Gi ± 0%   60.05Gi ± 0%   +3.82% (p=0.000 n=10)
geomean                  60.07Gi        64.52Gi        +7.41%

If you set up a CL chain as Daniel mentioned I can compare the other benchmarks too; don't have time to set that up at the moment.

-Caleb

Keith Randall

Jun 2, 2025, 7:29:27 PM
to Caleb Spare, golang-dev
> Perhaps split the added benchmarks into a parent CL in a chain. Otherwise one has to apply them to properly benchmark master without the rest of your patch.

Done. CL 678215 is just the benchmark changes. Maybe it will get submitted at some point soon (test changes are OK during the freeze) and we won't need to worry about it any more.


Rob Pike

Jun 2, 2025, 10:03:02 PM
to kei...@alum.mit.edu, golang-dev
Happy to hear that, as it was a misnomer to call it a Duff device. Yes, it behaves the same, but Tom Duff's genius was to see a way that C (and only C) could recreate a jump into an unrolled loop, with no assembler.

If it's not in C, it's just jumping into a loop. Effective but not genius. Although it was nice to honor him by calling it "duffcopy".

Thanks for doing this change, in any case; it looks promising. CPUs have changed since the days of Evans and Sutherland frame buffers.

-rob

Caleb Spare

Jun 3, 2025, 1:29:45 AM
to kei...@alum.mit.edu, golang-dev
Hey Keith,

Here are the full benchmarks. I added another chip (a newish Intel one) to boot.

Overall, seems like the memclr benchmarks are very good across the board and the memmove ones are mixed to negative.

goos: linux
goarch: amd64
pkg: runtime
cpu: AMD Ryzen 9 3900X 12-Core Processor
                        │   old.txt   │               new.txt               │
                        │   sec/op    │   sec/op     vs base                │
MemclrKnownSize112-24     2.087n ± 3%   1.619n ± 2%  -22.44% (p=0.000 n=10)
MemclrKnownSize128-24     2.270n ± 2%   1.870n ± 2%  -17.64% (p=0.000 n=10)
MemclrKnownSize192-24     3.197n ± 4%   2.790n ± 2%  -12.74% (p=0.000 n=10)
MemclrKnownSize248-24     5.778n ± 2%   5.306n ± 3%   -8.17% (p=0.000 n=10)
MemclrKnownSize256-24     4.205n ± 2%   3.777n ± 2%  -10.18% (p=0.000 n=10)
MemclrKnownSize512-24     7.892n ± 2%   7.573n ± 3%   -4.03% (p=0.000 n=10)
MemclrKnownSize1024-24    15.60n ± 6%   14.89n ± 1%   -4.55% (p=0.029 n=10)
MemmoveKnownSize112-24    2.101n ± 3%   1.663n ± 2%  -20.80% (p=0.000 n=10)
MemmoveKnownSize128-24    2.333n ± 3%   2.146n ± 2%   -8.02% (p=0.000 n=10)
MemmoveKnownSize192-24    3.263n ± 4%   3.162n ± 3%   -3.08% (p=0.011 n=10)
MemmoveKnownSize248-24    5.873n ± 5%   5.349n ± 3%   -8.92% (p=0.000 n=10)
MemmoveKnownSize256-24    4.257n ± 1%   5.721n ± 2%  +34.38% (p=0.000 n=10)
MemmoveKnownSize512-24    8.448n ± 2%   8.099n ± 6%        ~ (p=0.052 n=10)
MemmoveKnownSize1024-24   18.60n ± 4%   22.62n ± 2%  +21.64% (p=0.000 n=10)
geomean                   4.735n        4.454n        -5.94%


                        │   old.txt    │               new.txt                │
                        │     B/s      │     B/s       vs base                │
MemclrKnownSize112-24     49.96Gi ± 2%   64.43Gi ± 2%  +28.97% (p=0.000 n=10)
MemclrKnownSize128-24     52.51Gi ± 2%   63.77Gi ± 2%  +21.43% (p=0.000 n=10)
MemclrKnownSize192-24     55.94Gi ± 4%   64.09Gi ± 2%  +14.57% (p=0.000 n=10)
MemclrKnownSize248-24     39.97Gi ± 2%   43.53Gi ± 2%   +8.90% (p=0.000 n=10)
MemclrKnownSize256-24     56.71Gi ± 2%   63.13Gi ± 2%  +11.32% (p=0.000 n=10)
MemclrKnownSize512-24     60.42Gi ± 2%   62.96Gi ± 3%   +4.20% (p=0.000 n=10)
MemclrKnownSize1024-24    61.12Gi ± 6%   64.04Gi ± 1%   +4.77% (p=0.029 n=10)
MemmoveKnownSize112-24    49.66Gi ± 3%   62.70Gi ± 2%  +26.25% (p=0.000 n=10)
MemmoveKnownSize128-24    51.09Gi ± 3%   55.57Gi ± 2%   +8.75% (p=0.000 n=10)
MemmoveKnownSize192-24    54.81Gi ± 4%   56.55Gi ± 3%   +3.18% (p=0.011 n=10)
MemmoveKnownSize248-24    39.33Gi ± 5%   43.17Gi ± 3%   +9.78% (p=0.000 n=10)
MemmoveKnownSize256-24    56.01Gi ± 1%   41.68Gi ± 2%  -25.58% (p=0.000 n=10)
MemmoveKnownSize512-24    56.44Gi ± 2%   58.88Gi ± 6%        ~ (p=0.052 n=10)
MemmoveKnownSize1024-24   51.28Gi ± 4%   42.15Gi ± 2%  -17.80% (p=0.000 n=10)
geomean                   52.11Gi        55.40Gi        +6.31%


goos: linux
goarch: arm64
pkg: runtime
(AWS Graviton 4)

                        │   old.txt   │               new.txt               │
                        │   sec/op    │   sec/op     vs base                │
MemclrKnownSize112-48     2.954n ± 3%   2.334n ± 0%  -20.98% (p=0.000 n=10)
MemclrKnownSize128-48     2.355n ± 0%   2.500n ± 1%   +6.14% (p=0.000 n=10)
MemclrKnownSize192-48     2.509n ± 0%   2.355n ± 2%   -6.16% (p=0.000 n=10)
MemclrKnownSize248-48     3.225n ± 0%   2.870n ± 0%  -11.02% (p=0.000 n=10)
MemclrKnownSize256-48     3.225n ± 0%   2.894n ± 0%  -10.26% (p=0.000 n=10)
MemclrKnownSize512-48     6.772n ± 0%   6.611n ± 0%   -2.38% (p=0.000 n=10)
MemclrKnownSize1024-48    16.48n ± 0%   15.88n ± 0%   -3.64% (p=0.000 n=10)
MemmoveKnownSize112-48    2.862n ± 2%   2.867n ± 0%        ~ (p=0.465 n=10)
MemmoveKnownSize128-48    2.813n ± 0%   3.229n ± 0%  +14.77% (p=0.000 n=10)
MemmoveKnownSize192-48    3.918n ± 0%   5.016n ± 0%  +28.02% (p=0.000 n=10)
MemmoveKnownSize248-48    4.615n ± 0%   5.375n ± 0%  +16.47% (p=0.000 n=10)
MemmoveKnownSize256-48    5.143n ± 0%   6.092n ± 0%  +18.45% (p=0.000 n=10)
MemmoveKnownSize512-48    10.33n ± 0%   11.83n ± 0%  +14.52% (p=0.000 n=10)
MemmoveKnownSize1024-48   21.40n ± 0%   23.37n ± 1%   +9.16% (p=0.000 n=10)
geomean                   4.791n        4.931n        +2.94%


                        │   old.txt    │               new.txt                │
                        │     B/s      │     B/s       vs base                │
MemclrKnownSize112-48     35.31Gi ± 3%   44.69Gi ± 0%  +26.56% (p=0.000 n=10)
MemclrKnownSize128-48     50.62Gi ± 0%   47.69Gi ± 1%   -5.78% (p=0.000 n=10)
MemclrKnownSize192-48     71.27Gi ± 0%   75.96Gi ± 2%   +6.58% (p=0.000 n=10)
MemclrKnownSize248-48     71.61Gi ± 0%   80.48Gi ± 0%  +12.38% (p=0.000 n=10)
MemclrKnownSize256-48     73.92Gi ± 0%   82.38Gi ± 0%  +11.44% (p=0.000 n=10)
MemclrKnownSize512-48     70.41Gi ± 0%   72.13Gi ± 0%   +2.44% (p=0.000 n=10)
MemclrKnownSize1024-48    57.85Gi ± 0%   60.05Gi ± 0%   +3.80% (p=0.000 n=10)
MemmoveKnownSize112-48    36.45Gi ± 2%   36.39Gi ± 0%        ~ (p=0.481 n=10)
MemmoveKnownSize128-48    42.37Gi ± 0%   36.92Gi ± 0%  -12.87% (p=0.000 n=10)
MemmoveKnownSize192-48    45.64Gi ± 0%   35.65Gi ± 0%  -21.90% (p=0.000 n=10)
MemmoveKnownSize248-48    50.05Gi ± 0%   42.97Gi ± 0%  -14.13% (p=0.000 n=10)
MemmoveKnownSize256-48    46.35Gi ± 0%   39.14Gi ± 0%  -15.57% (p=0.000 n=10)
MemmoveKnownSize512-48    46.18Gi ± 0%   40.31Gi ± 0%  -12.71% (p=0.000 n=10)
MemmoveKnownSize1024-48   44.56Gi ± 0%   40.82Gi ± 1%   -8.39% (p=0.000 n=10)
geomean                   51.50Gi        50.03Gi        -2.85%



goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) Platinum 8488C

                        │   old.txt   │               new.txt               │
                        │   sec/op    │   sec/op     vs base                │
MemclrKnownSize112-32     1.461n ± 1%   1.091n ± 0%  -25.33% (p=0.000 n=10)
MemclrKnownSize128-32     1.511n ± 1%   1.092n ± 0%  -27.76% (p=0.000 n=10)
MemclrKnownSize192-32     1.932n ± 0%   1.638n ± 0%  -15.22% (p=0.000 n=10)
MemclrKnownSize248-32     4.362n ± 1%   3.817n ± 1%  -12.50% (p=0.000 n=10)
MemclrKnownSize256-32     2.465n ± 1%   2.185n ± 0%  -11.34% (p=0.000 n=10)
MemclrKnownSize512-32     4.638n ± 0%   4.359n ± 0%   -6.03% (p=0.000 n=10)
MemclrKnownSize1024-32    8.986n ± 1%   8.757n ± 1%   -2.54% (p=0.000 n=10)
MemmoveKnownSize112-32    1.828n ± 1%   1.385n ± 1%  -24.21% (p=0.000 n=10)
MemmoveKnownSize128-32    2.004n ± 0%   2.772n ± 4%  +38.31% (p=0.000 n=10)
MemmoveKnownSize192-32    2.738n ± 0%   2.849n ± 2%   +4.05% (p=0.000 n=10)
MemmoveKnownSize248-32    4.362n ± 0%   3.828n ± 1%  -12.25% (p=0.000 n=10)
MemmoveKnownSize256-32    3.461n ± 0%   4.903n ± 1%  +41.66% (p=0.000 n=10)
MemmoveKnownSize512-32    8.440n ± 1%   7.365n ± 0%  -12.73% (p=0.000 n=10)
MemmoveKnownSize1024-32   16.14n ± 0%   17.93n ± 0%  +11.12% (p=0.000 n=10)
geomean                   3.488n        3.282n        -5.89%


                        │   old.txt    │                new.txt                │
                        │     B/s      │      B/s       vs base                │
MemclrKnownSize112-32     71.41Gi ± 1%    95.61Gi ± 0%  +33.89% (p=0.000 n=10)
MemclrKnownSize128-32     78.88Gi ± 1%   109.19Gi ± 0%  +38.43% (p=0.000 n=10)
MemclrKnownSize192-32     92.55Gi ± 0%   109.18Gi ± 0%  +17.97% (p=0.000 n=10)
MemclrKnownSize248-32     52.94Gi ± 1%    60.51Gi ± 1%  +14.29% (p=0.000 n=10)
MemclrKnownSize256-32     96.71Gi ± 1%   109.09Gi ± 0%  +12.81% (p=0.000 n=10)
MemclrKnownSize512-32     102.8Gi ± 0%    109.4Gi ± 0%   +6.41% (p=0.000 n=10)
MemclrKnownSize1024-32    106.1Gi ± 1%    108.9Gi ± 1%   +2.61% (p=0.000 n=10)
MemmoveKnownSize112-32    57.08Gi ± 1%    75.32Gi ± 1%  +31.96% (p=0.000 n=10)
MemmoveKnownSize128-32    59.47Gi ± 0%    43.01Gi ± 4%  -27.68% (p=0.000 n=10)
MemmoveKnownSize192-32    65.31Gi ± 0%    62.77Gi ± 2%   -3.89% (p=0.000 n=10)
MemmoveKnownSize248-32    52.95Gi ± 0%    60.35Gi ± 1%  +13.97% (p=0.000 n=10)
MemmoveKnownSize256-32    68.89Gi ± 0%    48.63Gi ± 1%  -29.41% (p=0.000 n=10)
MemmoveKnownSize512-32    56.50Gi ± 1%    64.74Gi ± 0%  +14.58% (p=0.000 n=10)
MemmoveKnownSize1024-32   59.09Gi ± 0%    53.17Gi ± 0%  -10.01% (p=0.000 n=10)
geomean                   70.74Gi         75.18Gi        +6.27%

Caleb

peterGo

Jun 3, 2025, 2:19:14 PM
to golang-dev
Keith,

Here are a few more data points: i5-14400, i5-1235U, i3-3220T, i5-2390T, and Arm Cortex-A76. The shapes of the benchstat results are similar. MemclrKnownSize CPU time (sec/op) improves significantly, though the improvement shrinks as the size grows. MemmoveKnownSize CPU time starts with a significant decrease at the smallest sizes that turns into a significant increase at larger sizes.

---

tip:
$ go version
go version go1.25-devel_49f6304724 Mon Jun 2 17:09:50 2025 -0700 linux/amd64
> go version
go version go1.25-devel_49f6304724 Mon Jun 2 17:09:50 2025 -0700 windows/amd64
cl:
tip + cl678175

---


goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Core(TM) i5-14400
mem: DDR4 @ 3200 MHz
perflock -governor 50%
                        │ duff.tip.txt │             duff.cl.txt              │
                        │    sec/op    │    sec/op     vs base                │
MemclrKnownSize112-16      1.866n ± 1%   1.492n ±  0%  -20.02% (p=0.000 n=10)
MemclrKnownSize128-16      1.875n ± 0%   1.495n ±  1%  -20.25% (p=0.000 n=10)
MemclrKnownSize192-16      2.625n ± 1%   2.250n ±  1%  -14.30% (p=0.000 n=10)
MemclrKnownSize248-16      5.971n ± 1%   5.247n ±  0%  -12.12% (p=0.000 n=10)
MemclrKnownSize256-16      3.371n ± 0%   2.994n ±  0%  -11.20% (p=0.000 n=10)
MemclrKnownSize512-16      6.346n ± 1%   6.005n ±  1%   -5.37% (p=0.000 n=10)
MemclrKnownSize1024-16     12.32n ± 1%   11.95n ±  1%   -3.04% (p=0.000 n=10)
MemmoveKnownSize112-16     2.613n ± 1%   1.882n ±  1%  -27.94% (p=0.000 n=10)
MemmoveKnownSize128-16     2.801n ± 0%   2.365n ±  2%  -15.57% (p=0.000 n=10)
MemmoveKnownSize192-16     3.733n ± 0%   3.444n ±  5%   -7.76% (p=0.000 n=10)
MemmoveKnownSize248-16     5.975n ± 1%   5.234n ±  1%  -12.41% (p=0.000 n=10)
MemmoveKnownSize256-16     4.865n ± 0%   5.805n ± 14%  +19.32% (p=0.000 n=10)
MemmoveKnownSize512-16     9.370n ± 7%   9.947n ±  1%   +6.16% (p=0.018 n=10)
MemmoveKnownSize1024-16    19.89n ± 5%   24.88n ±  2%  +25.11% (p=0.000 n=10)
geomean                    4.639n        4.262n         -8.13%

                        │ duff.tip.txt │              duff.cl.txt              │
                        │     B/s      │      B/s       vs base                │
MemclrKnownSize112-16     55.89Gi ± 1%   69.89Gi ±  0%  +25.04% (p=0.000 n=10)
MemclrKnownSize128-16     63.60Gi ± 0%   79.72Gi ±  1%  +25.35% (p=0.000 n=10)
MemclrKnownSize192-16     68.11Gi ± 1%   79.48Gi ±  1%  +16.69% (p=0.000 n=10)
MemclrKnownSize248-16     38.68Gi ± 1%   44.02Gi ±  0%  +13.79% (p=0.000 n=10)
MemclrKnownSize256-16     70.73Gi ± 0%   79.65Gi ±  0%  +12.62% (p=0.000 n=10)
MemclrKnownSize512-16     75.14Gi ± 1%   79.40Gi ±  1%   +5.67% (p=0.000 n=10)
MemclrKnownSize1024-16    77.38Gi ± 1%   79.81Gi ±  1%   +3.13% (p=0.000 n=10)
MemmoveKnownSize112-16    39.92Gi ± 1%   55.41Gi ±  1%  +38.79% (p=0.000 n=10)
MemmoveKnownSize128-16    42.56Gi ± 0%   50.40Gi ±  2%  +18.43% (p=0.000 n=10)
MemmoveKnownSize192-16    47.90Gi ± 0%   51.93Gi ±  6%   +8.40% (p=0.000 n=10)
MemmoveKnownSize248-16    38.65Gi ± 1%   44.13Gi ±  1%  +14.17% (p=0.000 n=10)
MemmoveKnownSize256-16    49.01Gi ± 0%   41.07Gi ± 16%  -16.19% (p=0.000 n=10)
MemmoveKnownSize512-16    50.89Gi ± 6%   47.94Gi ±  1%   -5.80% (p=0.015 n=10)
MemmoveKnownSize1024-16   47.94Gi ± 6%   38.32Gi ±  2%  -20.07% (p=0.000 n=10)
geomean                   53.18Gi        57.89Gi         +8.85%

---


goos: linux
goarch: amd64
pkg: runtime
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
mem: DDR4 @ 3200 MHz
perflock -governor 50%
                        │ duff.tip.txt │             duff.cl.txt             │
                        │    sec/op    │   sec/op     vs base                │
MemclrKnownSize112-12      2.098n ± 0%   1.676n ± 0%  -20.11% (p=0.000 n=10)
MemclrKnownSize128-12      2.114n ± 0%   1.678n ± 0%  -20.60% (p=0.000 n=10)
MemclrKnownSize192-12      2.962n ± 0%   2.516n ± 0%  -15.07% (p=0.000 n=10)
MemclrKnownSize248-12      6.712n ± 0%   5.874n ± 0%  -12.49% (p=0.000 n=10)
MemclrKnownSize256-12      3.798n ± 0%   3.354n ± 0%  -11.69% (p=0.000 n=10)
MemclrKnownSize512-12      7.132n ± 0%   6.709n ± 0%   -5.93% (p=0.000 n=10)
MemclrKnownSize1024-12     13.85n ± 0%   13.42n ± 0%   -3.10% (p=0.000 n=10)
MemmoveKnownSize112-12     2.937n ± 0%   2.097n ± 1%  -28.60% (p=0.000 n=10)
MemmoveKnownSize128-12     3.148n ± 0%   2.646n ± 1%  -15.95% (p=0.000 n=10)
MemmoveKnownSize192-12     4.198n ± 0%   3.699n ± 2%  -11.89% (p=0.000 n=10)
MemmoveKnownSize248-12     6.714n ± 0%   5.875n ± 0%  -12.50% (p=0.000 n=10)
MemmoveKnownSize256-12     5.474n ± 0%   6.801n ± 2%  +24.25% (p=0.000 n=10)
MemmoveKnownSize512-12     10.62n ± 4%   10.90n ± 3%        ~ (p=0.128 n=10)
MemmoveKnownSize1024-12    22.18n ± 2%   28.21n ± 2%  +27.21% (p=0.000 n=10)
geomean                    5.219n        4.771n        -8.58%

                        │ duff.tip.txt │             duff.cl.txt              │
                        │     B/s      │     B/s       vs base                │
MemclrKnownSize112-12     49.72Gi ± 0%   62.22Gi ± 0%  +25.13% (p=0.000 n=10)
MemclrKnownSize128-12     56.40Gi ± 0%   71.02Gi ± 0%  +25.93% (p=0.000 n=10)
MemclrKnownSize192-12     60.36Gi ± 0%   71.08Gi ± 0%  +17.76% (p=0.000 n=10)
MemclrKnownSize248-12     34.41Gi ± 0%   39.32Gi ± 0%  +14.26% (p=0.000 n=10)
MemclrKnownSize256-12     62.77Gi ± 0%   71.09Gi ± 0%  +13.24% (p=0.000 n=10)
MemclrKnownSize512-12     66.86Gi ± 0%   71.07Gi ± 0%   +6.30% (p=0.000 n=10)
MemclrKnownSize1024-12    68.87Gi ± 0%   71.05Gi ± 0%   +3.17% (p=0.000 n=10)
MemmoveKnownSize112-12    35.51Gi ± 0%   49.73Gi ± 0%  +40.04% (p=0.000 n=10)
MemmoveKnownSize128-12    37.87Gi ± 0%   45.06Gi ± 1%  +18.98% (p=0.000 n=10)
MemmoveKnownSize192-12    42.59Gi ± 0%   48.34Gi ± 2%  +13.49% (p=0.000 n=10)
MemmoveKnownSize248-12    34.40Gi ± 0%   39.32Gi ± 0%  +14.29% (p=0.000 n=10)
MemmoveKnownSize256-12    43.55Gi ± 0%   35.05Gi ± 2%  -19.52% (p=0.000 n=10)
MemmoveKnownSize512-12    44.92Gi ± 4%   43.72Gi ± 3%        ~ (p=0.123 n=10)
MemmoveKnownSize1024-12   43.00Gi ± 2%   33.80Gi ± 2%  -21.39% (p=0.000 n=10)
geomean                   47.28Gi        51.71Gi        +9.37%

---


goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Core(TM) i3-3220T CPU @ 2.80GHz
mem: DDR3 @ 1600 MHz
perflock -governor 50%
                       │ duff.tip.txt │             duff.cl.txt              │
                       │    sec/op    │    sec/op     vs base                │
MemclrKnownSize112-4      4.114n ± 0%    3.199n ± 0%  -22.25% (p=0.000 n=10)
MemclrKnownSize128-4      4.568n ± 0%    3.651n ± 0%  -20.06% (p=0.000 n=10)
MemclrKnownSize192-4      6.392n ± 0%    5.477n ± 0%  -14.32% (p=0.000 n=10)
MemclrKnownSize248-4      11.51n ± 5%    10.04n ± 0%  -12.73% (p=0.000 n=10)
MemclrKnownSize256-4      8.214n ± 0%    7.300n ± 0%  -11.12% (p=0.000 n=10)
MemclrKnownSize512-4      15.52n ± 0%    14.61n ± 0%   -5.86% (p=0.000 n=10)
MemclrKnownSize1024-4     30.15n ± 0%    29.22n ± 0%   -3.08% (p=0.000 n=10)
MemmoveKnownSize112-4     5.022n ± 0%    3.775n ± 0%  -24.82% (p=0.000 n=10)
MemmoveKnownSize128-4     5.480n ± 0%    4.198n ± 0%  -23.38% (p=0.000 n=10)
MemmoveKnownSize192-4     7.381n ± 0%    8.678n ± 0%  +17.56% (p=0.000 n=10)
MemmoveKnownSize248-4     11.58n ± 5%    10.50n ± 0%   -9.29% (p=0.000 n=10)
MemmoveKnownSize256-4     9.480n ± 1%   11.575n ± 0%  +22.09% (p=0.000 n=10)
MemmoveKnownSize512-4     17.38n ± 0%    22.49n ± 0%  +29.40% (p=0.000 n=10)
MemmoveKnownSize1024-4    33.51n ± 0%    51.61n ± 0%  +54.01% (p=0.000 n=10)
geomean                   9.738n         9.347n        -4.02%

                       │ duff.tip.txt │             duff.cl.txt              │
                       │     B/s      │     B/s       vs base                │
MemclrKnownSize112-4     25.35Gi ± 0%   32.61Gi ± 0%  +28.63% (p=0.000 n=10)
MemclrKnownSize128-4     26.10Gi ± 0%   32.65Gi ± 0%  +25.10% (p=0.000 n=10)
MemclrKnownSize192-4     27.97Gi ± 0%   32.65Gi ± 0%  +16.72% (p=0.000 n=10)
MemclrKnownSize248-4     20.12Gi ± 5%   23.01Gi ± 0%  +14.40% (p=0.000 n=10)
MemclrKnownSize256-4     29.03Gi ± 0%   32.66Gi ± 0%  +12.51% (p=0.000 n=10)
MemclrKnownSize512-4     30.73Gi ± 0%   32.63Gi ± 0%   +6.19% (p=0.000 n=10)
MemclrKnownSize1024-4    31.63Gi ± 0%   32.63Gi ± 0%   +3.17% (p=0.000 n=10)
MemmoveKnownSize112-4    20.77Gi ± 0%   27.63Gi ± 0%  +33.01% (p=0.000 n=10)
MemmoveKnownSize128-4    21.76Gi ± 0%   28.39Gi ± 0%  +30.51% (p=0.000 n=10)
MemmoveKnownSize192-4    24.22Gi ± 0%   20.61Gi ± 0%  -14.94% (p=0.000 n=10)
MemmoveKnownSize248-4    20.00Gi ± 5%   22.01Gi ± 0%  +10.00% (p=0.000 n=10)
MemmoveKnownSize256-4    25.15Gi ± 1%   20.60Gi ± 0%  -18.10% (p=0.000 n=10)
MemmoveKnownSize512-4    27.44Gi ± 0%   21.20Gi ± 0%  -22.72% (p=0.000 n=10)
MemmoveKnownSize1024-4   28.46Gi ± 0%   18.48Gi ± 0%  -35.08% (p=0.000 n=10)
geomean                  25.35Gi        26.40Gi        +4.15%

---

goos: windows
goarch: amd64
pkg: runtime
cpu: Intel(R) Core(TM) i5-2390T CPU @ 2.70GHz
mem: DDR3 @ 1333MHz
                       │ duff.tip.txt  │             duff.cl.txt             │
                       │    sec/op     │   sec/op     vs base                │
MemclrKnownSize112-4      2.758n ±  1%   2.131n ± 2%  -22.73% (p=0.000 n=10)
MemclrKnownSize128-4      3.067n ±  1%   2.454n ± 1%  -19.97% (p=0.000 n=10)
MemclrKnownSize192-4      4.271n ±  2%   3.696n ± 3%  -13.46% (p=0.000 n=10)
MemclrKnownSize248-4      7.385n ±  2%   6.724n ± 1%   -8.96% (p=0.000 n=10)
MemclrKnownSize256-4      5.547n ±  1%   4.895n ± 1%  -11.75% (p=0.000 n=10)
MemclrKnownSize512-4     10.500n ±  2%   9.923n ± 5%   -5.50% (p=0.003 n=10)
MemclrKnownSize1024-4     20.14n ±  2%   19.57n ± 2%   -2.83% (p=0.000 n=10)
MemmoveKnownSize112-4     3.689n ±  2%   3.037n ± 1%  -17.69% (p=0.000 n=10)
MemmoveKnownSize128-4     3.669n ±  3%   3.962n ± 1%   +7.96% (p=0.001 n=10)
MemmoveKnownSize192-4     5.205n ±  1%   5.824n ± 2%  +11.90% (p=0.000 n=10)
MemmoveKnownSize248-4     7.343n ±  2%   7.125n ± 2%   -2.98% (p=0.001 n=10)
MemmoveKnownSize256-4     6.520n ± 17%   7.731n ± 1%  +18.57% (p=0.002 n=10)
MemmoveKnownSize512-4     12.14n ± 20%   15.36n ± 1%  +26.48% (p=0.000 n=10)
MemmoveKnownSize1024-4    22.91n ±  3%   30.77n ± 1%  +34.32% (p=0.000 n=10)
geomean                   6.594n         6.470n        -1.88%

                       │ duff.tip.txt  │             duff.cl.txt              │
                       │      B/s      │     B/s       vs base                │
MemclrKnownSize112-4     37.82Gi ±  1%   48.95Gi ± 2%  +29.41% (p=0.000 n=10)
MemclrKnownSize128-4     38.87Gi ±  1%   48.58Gi ± 1%  +24.98% (p=0.000 n=10)
MemclrKnownSize192-4     41.87Gi ±  2%   48.38Gi ± 3%  +15.55% (p=0.000 n=10)
MemclrKnownSize248-4     31.27Gi ±  2%   34.35Gi ± 1%   +9.84% (p=0.000 n=10)
MemclrKnownSize256-4     42.98Gi ±  1%   48.71Gi ± 1%  +13.32% (p=0.000 n=10)
MemclrKnownSize512-4     45.40Gi ±  2%   48.06Gi ± 5%   +5.87% (p=0.003 n=10)
MemclrKnownSize1024-4    47.34Gi ±  2%   48.71Gi ± 2%   +2.91% (p=0.000 n=10)
MemmoveKnownSize112-4    28.27Gi ±  2%   34.35Gi ± 1%  +21.50% (p=0.000 n=10)
MemmoveKnownSize128-4    32.49Gi ±  3%   30.09Gi ± 1%   -7.37% (p=0.002 n=10)
MemmoveKnownSize192-4    34.35Gi ±  1%   30.70Gi ± 2%  -10.63% (p=0.000 n=10)
MemmoveKnownSize248-4    31.45Gi ±  2%   32.42Gi ± 2%   +3.06% (p=0.001 n=10)
MemmoveKnownSize256-4    36.57Gi ± 14%   30.84Gi ± 1%  -15.66% (p=0.002 n=10)
MemmoveKnownSize512-4    39.28Gi ± 17%   31.05Gi ± 1%  -20.95% (p=0.000 n=10)
MemmoveKnownSize1024-4   41.63Gi ±  3%   31.00Gi ± 1%  -25.53% (p=0.000 n=10)
geomean                  37.42Gi         38.14Gi        +1.92%

---


goos: linux
goarch: arm64
pkg: runtime
cpu: Arm Cortex-A76 (Raspberry Pi 5 Model B)
mem: LPDDR4X @ 2133 MHz (4267 MT/s)
                       │ duff.tip.txt │             duff.cl.txt             │
                       │    sec/op    │   sec/op     vs base                │
MemclrKnownSize112-4      3.337n ± 0%   2.921n ± 0%  -12.47% (p=0.000 n=10)
MemclrKnownSize128-4      3.768n ± 0%   3.391n ± 0%  -10.01% (p=0.000 n=10)
MemclrKnownSize192-4      5.424n ± 2%   5.018n ± 0%   -7.49% (p=0.000 n=10)
MemclrKnownSize248-4      7.090n ± 1%   6.673n ± 1%   -5.89% (p=0.000 n=10)
MemclrKnownSize256-4      7.093n ± 0%   6.697n ± 2%   -5.58% (p=0.000 n=10)
MemclrKnownSize512-4      13.77n ± 0%   13.43n ± 0%   -2.47% (p=0.000 n=10)
MemclrKnownSize1024-4     27.83n ± 2%   27.71n ± 0%   -0.45% (p=0.000 n=10)
MemmoveKnownSize112-4     4.849n ± 0%   4.778n ± 0%   -1.46% (p=0.000 n=10)
MemmoveKnownSize128-4     5.399n ± 4%   5.502n ± 2%   +1.91% (p=0.016 n=10)
MemmoveKnownSize192-4     7.700n ± 0%   7.979n ± 0%   +3.63% (p=0.000 n=10)
MemmoveKnownSize248-4     9.383n ± 0%   9.700n ± 2%   +3.38% (p=0.000 n=10)
MemmoveKnownSize256-4     10.13n ± 0%   10.50n ± 0%   +3.70% (p=0.000 n=10)
MemmoveKnownSize512-4     19.80n ± 0%   20.63n ± 0%   +4.17% (p=0.000 n=10)
MemmoveKnownSize1024-4    39.29n ± 4%   41.05n ± 4%   +4.49% (p=0.000 n=10)
geomean                   8.901n        8.731n        -1.91%

                       │ duff.tip.txt │             duff.cl.txt              │
                       │     B/s      │     B/s       vs base                │
MemclrKnownSize112-4     31.25Gi ± 0%   35.71Gi ± 0%  +14.27% (p=0.000 n=10)
MemclrKnownSize128-4     31.64Gi ± 0%   35.15Gi ± 0%  +11.12% (p=0.000 n=10)
MemclrKnownSize192-4     32.97Gi ± 2%   35.63Gi ± 0%   +8.09% (p=0.000 n=10)
MemclrKnownSize248-4     32.58Gi ± 1%   34.62Gi ± 1%   +6.26% (p=0.000 n=10)
MemclrKnownSize256-4     33.62Gi ± 0%   35.60Gi ± 2%   +5.90% (p=0.000 n=10)
MemclrKnownSize512-4     34.63Gi ± 0%   35.50Gi ± 0%   +2.52% (p=0.000 n=10)
MemclrKnownSize1024-4    34.26Gi ± 2%   34.42Gi ± 0%   +0.47% (p=0.000 n=10)
MemmoveKnownSize112-4    21.51Gi ± 0%   21.83Gi ± 0%   +1.50% (p=0.000 n=10)
MemmoveKnownSize128-4    22.08Gi ± 3%   21.67Gi ± 2%   -1.87% (p=0.015 n=10)
MemmoveKnownSize192-4    23.22Gi ± 0%   22.41Gi ± 0%   -3.50% (p=0.000 n=10)
MemmoveKnownSize248-4    24.61Gi ± 0%   23.81Gi ± 2%   -3.27% (p=0.000 n=10)
MemmoveKnownSize256-4    23.54Gi ± 0%   22.70Gi ± 0%   -3.54% (p=0.000 n=10)
MemmoveKnownSize512-4    24.08Gi ± 0%   23.11Gi ± 0%   -4.01% (p=0.000 n=10)
MemmoveKnownSize1024-4   24.27Gi ± 3%   23.23Gi ± 3%   -4.29% (p=0.000 n=10)
geomean                  27.72Gi        28.26Gi        +1.95%

---

Peter

Alex Markin

Jun 5, 2025, 7:53:51 PMJun 5
to golang-dev
Hello.

I tested this patch on my Kunpeng920 and Kunpeng930 systems (arm64) and in all cases got a degradation. The main reason is branch misprediction. I suggest making this patch optional on ARM64 platforms, for example by enabling it through a GOARM64 value. The results of my evaluation are below:

```
goos: linux
goarch: arm64 (Kunpeng920)
pkg: runtime
                     │ base_1.log  │              new_1.log               │
                     │   sec/op    │    sec/op     vs base                │
MemmoveKnownSize112    4.576n ± 0%    4.712n ± 0%   +2.96% (p=0.000 n=20)
MemmoveKnownSize128    5.012n ± 0%    5.378n ± 0%   +7.30% (p=0.000 n=20)
MemmoveKnownSize192    6.932n ± 0%    7.718n ± 0%  +11.34% (p=0.000 n=20)
MemmoveKnownSize248    8.575n ± 0%    9.434n ± 0%  +10.01% (p=0.000 n=20)
MemmoveKnownSize256    9.056n ± 1%   10.090n ± 0%  +11.42% (p=0.000 n=20)
MemmoveKnownSize512    17.37n ± 0%    19.58n ± 0%  +12.75% (p=0.000 n=20)
MemmoveKnownSize1024   35.00n ± 1%    38.34n ± 0%   +9.57% (p=0.000 n=20)
geomean                9.598n         10.49n        +9.29%

                     │  base_1.log  │              new_1.log               │
                     │     B/s      │     B/s       vs base                │
MemmoveKnownSize112    22.79Gi ± 0%   22.14Gi ± 0%   -2.88% (p=0.000 n=20)
MemmoveKnownSize128    23.79Gi ± 0%   22.17Gi ± 0%   -6.81% (p=0.000 n=20)
MemmoveKnownSize192    25.80Gi ± 0%   23.17Gi ± 0%  -10.18% (p=0.000 n=20)
MemmoveKnownSize248    26.93Gi ± 0%   24.48Gi ± 0%   -9.10% (p=0.000 n=20)
MemmoveKnownSize256    26.33Gi ± 1%   23.63Gi ± 0%  -10.23% (p=0.000 n=20)
MemmoveKnownSize512    27.45Gi ± 0%   24.35Gi ± 0%  -11.29% (p=0.000 n=20)
MemmoveKnownSize1024   27.25Gi ± 1%   24.87Gi ± 0%   -8.74% (p=0.000 n=20)
geomean                25.71Gi        23.52Gi        -8.50%


goos: linux
goarch: arm64 (Kunpeng920)
pkg: runtime
                       │ base_4.log  │              new_4.log               │
                       │   sec/op    │    sec/op     vs base                │
MemmoveKnownSize112-4    4.574n ± 1%    4.705n ± 0%   +2.85% (p=0.000 n=20)
MemmoveKnownSize128-4    5.004n ± 0%    5.374n ± 0%   +7.39% (p=0.000 n=20)
MemmoveKnownSize192-4    6.928n ± 0%    7.674n ± 0%  +10.77% (p=0.000 n=20)
MemmoveKnownSize248-4    8.575n ± 0%    9.430n ± 0%   +9.96% (p=0.000 n=20)
MemmoveKnownSize256-4    9.066n ± 2%   10.080n ± 0%  +11.18% (p=0.000 n=20)
MemmoveKnownSize512-4    17.47n ± 1%    19.58n ± 0%  +12.08% (p=0.000 n=20)
MemmoveKnownSize1024-4   35.15n ± 1%    38.58n ± 0%   +9.76% (p=0.000 n=20)
geomean                  9.610n         10.49n        +9.10%

                       │  base_4.log  │              new_4.log               │
                       │     B/s      │     B/s       vs base                │
MemmoveKnownSize112-4    22.80Gi ± 1%   22.17Gi ± 0%   -2.78% (p=0.000 n=20)
MemmoveKnownSize128-4    23.82Gi ± 0%   22.18Gi ± 0%   -6.89% (p=0.000 n=20)
MemmoveKnownSize192-4    25.81Gi ± 0%   23.30Gi ± 0%   -9.71% (p=0.000 n=20)
MemmoveKnownSize248-4    26.93Gi ± 0%   24.49Gi ± 0%   -9.06% (p=0.000 n=20)
MemmoveKnownSize256-4    26.30Gi ± 2%   23.66Gi ± 0%  -10.04% (p=0.000 n=20)
MemmoveKnownSize512-4    27.30Gi ± 1%   24.35Gi ± 0%  -10.79% (p=0.000 n=20)
MemmoveKnownSize1024-4   27.13Gi ± 1%   24.72Gi ± 0%   -8.89% (p=0.000 n=20)
geomean                  25.67Gi        23.53Gi        -8.34%


goos: linux
goarch: arm64 (Kunpeng920)
pkg: runtime
                        │ base_48.log │              new_48.log              │
                        │   sec/op    │    sec/op     vs base                │
MemmoveKnownSize112-48    4.620n ± 2%    4.714n ± 0%   +2.03% (p=0.000 n=20)
MemmoveKnownSize128-48    5.006n ± 0%    5.373n ± 0%   +7.32% (p=0.000 n=20)
MemmoveKnownSize192-48    6.930n ± 0%    7.693n ± 0%  +11.00% (p=0.000 n=20)
MemmoveKnownSize248-48    8.571n ± 0%    9.433n ± 0%  +10.06% (p=0.000 n=20)
MemmoveKnownSize256-48    9.117n ± 2%   10.080n ± 0%  +10.57% (p=0.000 n=20)
MemmoveKnownSize512-48    17.51n ± 1%    19.59n ± 0%  +11.88% (p=0.000 n=20)
MemmoveKnownSize1024-48   35.17n ± 1%    38.56n ± 0%   +9.65% (p=0.000 n=20)
geomean                   9.636n         10.49n        +8.89%

                        │ base_48.log  │              new_48.log              │
                        │     B/s      │     B/s       vs base                │
MemmoveKnownSize112-48    22.58Gi ± 2%   22.13Gi ± 0%   -1.99% (p=0.000 n=20)
MemmoveKnownSize128-48    23.81Gi ± 0%   22.19Gi ± 0%   -6.83% (p=0.000 n=20)
MemmoveKnownSize192-48    25.80Gi ± 0%   23.25Gi ± 0%   -9.90% (p=0.000 n=20)
MemmoveKnownSize248-48    26.95Gi ± 0%   24.49Gi ± 0%   -9.14% (p=0.000 n=20)
MemmoveKnownSize256-48    26.15Gi ± 2%   23.65Gi ± 0%   -9.56% (p=0.000 n=20)
MemmoveKnownSize512-48    27.23Gi ± 1%   24.34Gi ± 0%  -10.62% (p=0.000 n=20)
MemmoveKnownSize1024-48   27.12Gi ± 1%   24.73Gi ± 0%   -8.80% (p=0.000 n=20)
geomean                   25.61Gi        23.52Gi        -8.16%



--- perf stat, see branch-misses


goos: linux
goarch: arm64
pkg: runtime
BenchmarkMemmoveKnownSize256    134495558                8.973 ns/op    28529.84 MB/s
PASS

 Performance counter stats for '../base.out -test.run=- -test.count=1 -test.bench=MemmoveKnownSize256':

          2,116.53 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               627      page-faults:u             #    0.296 K/sec
     5,485,575,603      cycles:u                  #    2.592 GHz
    10,374,188,703      instructions:u            #    1.89  insn per cycle
   <not supported>      branches:u
           112,219      branch-misses:u

       2.116984830 seconds time elapsed

       2.115458000 seconds user
       0.000000000 seconds sys



goos: linux
goarch: arm64
pkg: runtime
BenchmarkMemmoveKnownSize256    100000000               10.09 ns/op     25373.84 MB/s
PASS

 Performance counter stats for '../new.out -test.run=- -test.count=1 -test.bench=MemmoveKnownSize256':

          1,026.72 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               620      page-faults:u             #    0.604 K/sec
     2,656,241,428      cycles:u                  #    2.587 GHz
     5,566,161,483      instructions:u            #    2.10  insn per cycle
   <not supported>      branches:u
         2,057,437      branch-misses:u

       1.027170660 seconds time elapsed

       1.026353000 seconds user
       0.000000000 seconds sys



goos: linux
goarch: arm64 (Kunpeng930)
pkg: runtime
                     │ base_1.log  │               new_1.log               │
                     │   sec/op    │    sec/op     vs base                 │
MemmoveKnownSize112    2.884n ± 0%   10.010n ± 0%  +247.09% (p=0.000 n=10)
MemmoveKnownSize128    3.188n ± 0%    3.418n ± 0%    +7.21% (p=0.000 n=10)
MemmoveKnownSize192    4.626n ± 0%    5.685n ± 1%   +22.91% (p=0.000 n=10)
MemmoveKnownSize248    5.713n ± 0%    6.261n ± 0%    +9.61% (p=0.000 n=10)
MemmoveKnownSize256    6.008n ± 0%    6.540n ± 0%    +8.85% (p=0.000 n=10)
MemmoveKnownSize512    12.96n ± 0%    13.61n ± 0%    +4.98% (p=0.000 n=10)
MemmoveKnownSize1024   28.16n ± 0%    28.96n ± 0%    +2.84% (p=0.000 n=10)
geomean                6.578n         8.474n        +28.83%

                     │  base_1.log  │              new_1.log               │
                     │     B/s      │     B/s       vs base                │
MemmoveKnownSize112    36.17Gi ± 0%   10.42Gi ± 0%  -71.19% (p=0.000 n=10)
MemmoveKnownSize128    37.40Gi ± 0%   34.88Gi ± 0%   -6.74% (p=0.000 n=10)
MemmoveKnownSize192    38.66Gi ± 0%   31.45Gi ± 1%  -18.64% (p=0.000 n=10)
MemmoveKnownSize248    40.43Gi ± 0%   36.88Gi ± 0%   -8.78% (p=0.000 n=10)
MemmoveKnownSize256    39.68Gi ± 0%   36.46Gi ± 0%   -8.14% (p=0.000 n=10)
MemmoveKnownSize512    36.79Gi ± 0%   35.05Gi ± 0%   -4.73% (p=0.000 n=10)
MemmoveKnownSize1024   33.87Gi ± 0%   32.93Gi ± 0%   -2.75% (p=0.000 n=10)
geomean                37.51Gi        29.12Gi       -22.38%
```

On Tuesday, June 3, 2025 at 01:27:20 UTC+3, Keith Randall wrote:

Keith Randall

Jun 5, 2025, 8:10:31 PMJun 5
to Alex Markin, golang-dev
Those numbers seem strange to me.
It's an extra 1,945,218 mispredicts, and an extra 1.08981 seconds. That's 560ns per mispredict, or 1,449 cycles. That can't be right, can it? Mispredicts are bad, but not *that* bad.

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.

Keith Randall

Jun 5, 2025, 8:16:10 PMJun 5
to Alex Markin, golang-dev
Maybe the number of iterations is different? For the performance counter runs, run with something like -benchtime=1000000000x so the same work is done in both runs.
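
A rough way to check whether the two perf stat runs did the same amount of work is to estimate the iteration count of each run from its elapsed time and reported ns/op. The sketch below is only an approximation: the figures are copied from the perf stat output earlier in the thread, the `perfRun` type and `approxIters` method are hypothetical names, and it assumes essentially all elapsed time is spent inside the benchmark loop.

```go
package main

import "fmt"

// perfRun holds figures copied from one perf stat run in the thread.
type perfRun struct {
	name       string
	elapsedSec float64 // "seconds time elapsed"
	nsPerOp    float64 // ns/op printed by the benchmark
	branchMiss float64 // branch-misses:u
}

// approxIters estimates how many benchmark iterations the whole process ran,
// assuming (hypothetically) that all elapsed time went into the benchmark loop.
func (r perfRun) approxIters() float64 {
	return r.elapsedSec * 1e9 / r.nsPerOp
}

func main() {
	runs := []perfRun{
		{"base", 2.116984830, 8.973, 112219},
		{"new", 1.027170660, 10.09, 2057437},
	}
	for _, r := range runs {
		it := r.approxIters()
		fmt.Printf("%-4s ~%.0fM iterations, %.5f branch misses per iteration\n",
			r.name, it/1e6, r.branchMiss/it)
	}
}
```

If the estimated iteration counts differ a lot, the raw counter totals are not directly comparable, which is the reason for pinning the count with a fixed -benchtime=Nx value.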

Keith Randall

Jun 5, 2025, 8:20:01 PMJun 5
to Alex Markin, golang-dev
Just looking at a single run, a 10% slowdown on a ~1sec run means ~100ms / ~2M mispredicts ~= 50 ns / mispredict ~= 128 cycles / mispredict. Still seems high, but maybe just plausible?
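
The back-of-the-envelope estimate above can be reproduced with a few lines of Go. This is only a sketch: the branch-miss counts are copied from the perf stat output earlier in the thread, the 10% slowdown and the ~2.59 GHz clock are the rounded figures from that output, and the helper name `costPerMispredict` is made up for illustration.

```go
package main

import "fmt"

// costPerMispredict is a hypothetical helper: it spreads extraSeconds of run
// time evenly over extraMisses branch mispredictions and returns the cost of
// one mispredict in nanoseconds and in cycles at clockGHz.
func costPerMispredict(extraSeconds, extraMisses, clockGHz float64) (ns, cycles float64) {
	ns = extraSeconds * 1e9 / extraMisses
	cycles = ns * clockGHz // 1 GHz is one cycle per nanosecond
	return ns, cycles
}

func main() {
	// branch-misses:u from the two perf stat runs (new minus base).
	extraMisses := 2057437.0 - 112219.0

	// Assumption: a ~10% slowdown on the ~1.027 s run of the new binary.
	ns, cycles := costPerMispredict(0.10*1.027, extraMisses, 2.59)
	fmt.Printf("%.0f ns, %.0f cycles per mispredict\n", ns, cycles)
}
```

With these inputs the result lands in the same ballpark as the ~50 ns and ~128 cycles quoted above; the exact value depends on which run time and clock frequency are plugged in.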

peterGo

Jun 5, 2025, 9:00:31 PMJun 5
to golang-dev
On Thursday, June 5, 2025 at 7:53:51 PM UTC-4 Alex Markin wrote:
Hello.

I tested this patch on my Kunpeng920 and Kunpeng930 systems (arm64) and in all cases got a degradation. The main reason is branch misprediction.

 Alex and Keith,

I am not familiar with the Kunpeng920 so I Googled it.

https://www.hisilicon.com/en/products/kunpeng/huawei-kunpeng/huawei-kunpeng-920    

Kunpeng 920 is the industry's leading-edge Arm-based server CPU. Utilizing cutting-edge 7 nm processes, the CPU was independently designed by HUAWEI based on the Arm architecture license. **Processor performance is significantly improved by optimizing branch prediction algorithms**, increasing the number of execution units, and improving the memory subsystem architecture.

Interesting!

Peter

Konstantin Kulikov

Jun 6, 2025, 12:01:08 PMJun 6
to kei...@alum.mit.edu, golang-dev
Banana bpi r3 (mt7986)
~/tmp $ benchstat bpir3-old.txt bpir3-new.txt

goos: linux
goarch: arm64
pkg: runtime
                       │ bpir3-old.txt │            bpir3-new.txt             │
                       │    sec/op     │    sec/op     vs base                │
MemclrKnownSize112-4       13.52n ± 0%    12.02n ± 0%  -11.09% (p=0.000 n=10)
MemclrKnownSize192-4       21.03n ± 0%    20.03n ± 0%   -4.76% (p=0.000 n=10)
MemclrKnownSize248-4       25.54n ± 0%    24.54n ± 0%   -3.94% (p=0.000 n=10)
MemclrKnownSize256-4       27.04n ± 0%    26.04n ± 0%   -3.70% (p=0.000 n=10)
MemclrKnownSize512-4       51.09n ± 0%    50.29n ± 1%   -1.56% (p=0.001 n=10)
MemclrKnownSize1024-4      99.15n ± 0%   103.20n ± 0%   +4.08% (p=0.000 n=10)
MemmoveKnownSize112-4      15.02n ± 0%    14.52n ± 0%   -3.33% (p=0.000 n=10)
MemmoveKnownSize192-4      22.54n ± 0%    23.63n ± 1%   +4.88% (p=0.000 n=10)
MemmoveKnownSize248-4      27.55n ± 0%    29.68n ± 1%   +7.73% (p=0.000 n=10)
MemmoveKnownSize256-4      28.56n ± 0%    31.07n ± 2%   +8.81% (p=0.000 n=10)
MemmoveKnownSize512-4      52.58n ± 0%    63.65n ± 0%  +21.06% (p=0.000 n=10)
MemmoveKnownSize1024-4     100.6n ± 0%    119.7n ± 0%  +18.98% (p=0.000 n=10)
geomean                    32.53n         33.40n        +2.68%

                       │ bpir3-old.txt │            bpir3-new.txt             │
                       │      B/s      │     B/s       vs base                │
MemclrKnownSize112-4      7.714Gi ± 0%   8.677Gi ± 0%  +12.49% (p=0.000 n=10)
MemclrKnownSize192-4      8.502Gi ± 0%   8.927Gi ± 0%   +5.00% (p=0.000 n=10)
MemclrKnownSize248-4      9.044Gi ± 0%   9.414Gi ± 0%   +4.08% (p=0.000 n=10)
MemclrKnownSize256-4      8.817Gi ± 0%   9.156Gi ± 0%   +3.85% (p=0.000 n=10)
MemclrKnownSize512-4      9.334Gi ± 0%   9.482Gi ± 1%   +1.58% (p=0.001 n=10)
MemclrKnownSize1024-4     9.618Gi ± 0%   9.243Gi ± 0%   -3.90% (p=0.000 n=10)
MemmoveKnownSize112-4     6.944Gi ± 0%   7.185Gi ± 0%   +3.47% (p=0.000 n=10)
MemmoveKnownSize192-4     7.935Gi ± 0%   7.566Gi ± 1%   -4.64% (p=0.000 n=10)
MemmoveKnownSize248-4     8.385Gi ± 0%   7.782Gi ± 1%   -7.19% (p=0.000 n=10)
MemmoveKnownSize256-4     8.348Gi ± 0%   7.672Gi ± 2%   -8.10% (p=0.000 n=10)
MemmoveKnownSize512-4     9.070Gi ± 0%   7.492Gi ± 0%  -17.39% (p=0.000 n=10)
MemmoveKnownSize1024-4    9.475Gi ± 0%   7.964Gi ± 0%  -15.95% (p=0.000 n=10)
geomean                   8.564Gi        8.341Gi        -2.61%



Banana bpi r4 (mt7988)
~/tmp $ benchstat bpir4-old.txt bpir4-new.txt

goos: linux
goarch: arm64
pkg: runtime
                       │ bpir4-old.txt │            bpir4-new.txt            │
                       │    sec/op     │   sec/op     vs base                │
MemclrKnownSize112-4      10.685n ± 0%   9.348n ± 0%  -12.51% (p=0.000 n=10)
MemclrKnownSize128-4       12.01n ± 0%   10.69n ± 0%  -10.99% (p=0.000 n=10)
MemclrKnownSize192-4       17.37n ± 0%   16.02n ± 0%   -7.75% (p=0.000 n=10)
MemclrKnownSize248-4       22.03n ± 0%   21.04n ± 0%   -4.52% (p=0.000 n=10)
MemclrKnownSize256-4       22.70n ± 0%   21.36n ± 0%   -5.88% (p=0.000 n=10)
MemclrKnownSize512-4       44.06n ± 0%   42.72n ± 0%   -3.04% (p=0.000 n=10)
MemclrKnownSize1024-4      86.86n ± 0%   85.45n ± 0%   -1.62% (p=0.000 n=10)
MemmoveKnownSize112-4      12.75n ± 1%   11.48n ± 1%   -9.96% (p=0.000 n=10)
MemmoveKnownSize128-4      14.43n ± 1%   13.18n ± 0%   -8.66% (p=0.000 n=10)
MemmoveKnownSize192-4      21.86n ± 1%   20.94n ± 0%   -4.21% (p=0.000 n=10)
MemmoveKnownSize248-4      27.60n ± 0%   26.46n ± 1%   -4.09% (p=0.000 n=10)
MemmoveKnownSize256-4      28.44n ± 2%   26.95n ± 1%   -5.22% (p=0.000 n=10)
MemmoveKnownSize512-4      54.75n ± 1%   54.16n ± 1%   -1.08% (p=0.000 n=10)
MemmoveKnownSize1024-4     106.7n ± 0%   108.1n ± 0%   +1.36% (p=0.000 n=10)
geomean                    26.37n        24.88n        -5.66%

                       │ bpir4-old.txt │             bpir4-new.txt             │
                       │      B/s      │      B/s       vs base                │
MemclrKnownSize112-4      9.760Gi ± 0%   11.158Gi ± 0%  +14.33% (p=0.000 n=10)
MemclrKnownSize128-4      9.922Gi ± 0%   11.150Gi ± 0%  +12.38% (p=0.000 n=10)
MemclrKnownSize192-4      10.30Gi ± 0%    11.16Gi ± 0%   +8.37% (p=0.000 n=10)
MemclrKnownSize248-4      10.49Gi ± 0%    10.98Gi ± 0%   +4.72% (p=0.000 n=10)
MemclrKnownSize256-4      10.51Gi ± 0%    11.16Gi ± 0%   +6.24% (p=0.000 n=10)
MemclrKnownSize512-4      10.82Gi ± 0%    11.16Gi ± 0%   +3.12% (p=0.000 n=10)
MemclrKnownSize1024-4     10.98Gi ± 0%    11.16Gi ± 0%   +1.66% (p=0.000 n=10)
MemmoveKnownSize112-4     8.179Gi ± 1%    9.085Gi ± 1%  +11.07% (p=0.000 n=10)
MemmoveKnownSize128-4     8.261Gi ± 1%    9.046Gi ± 0%   +9.51% (p=0.000 n=10)
MemmoveKnownSize192-4     8.181Gi ± 1%    8.539Gi ± 0%   +4.37% (p=0.000 n=10)
MemmoveKnownSize248-4     8.370Gi ± 1%    8.728Gi ± 1%   +4.27% (p=0.000 n=10)
MemmoveKnownSize256-4     8.385Gi ± 2%    8.847Gi ± 1%   +5.51% (p=0.000 n=10)
MemmoveKnownSize512-4     8.710Gi ± 1%    8.805Gi ± 1%   +1.09% (p=0.000 n=10)
MemmoveKnownSize1024-4    8.942Gi ± 0%    8.822Gi ± 0%   -1.34% (p=0.000 n=10)
geomean                   9.357Gi         9.919Gi        +6.00%



Filippo Valsorda

Jun 7, 2025, 7:28:49 AMJun 7
to kei...@alum.mit.edu, golang-dev
Ampere Altra Q64-22 (Neoverse N1)

goos: linux
goarch: arm64
pkg: runtime
                        │ 08146aa6ae  │             fbdf6b5a98             │
                        │   sec/op    │   sec/op     vs base               │
MemclrKnownSize112-64     3.644n ± 2%   3.189n ± 2%  -12.50% (p=0.002 n=6)
MemclrKnownSize128-64     4.117n ± 0%   3.706n ± 3%   -9.98% (p=0.002 n=6)
MemclrKnownSize192-64     5.925n ± 2%   5.471n ± 3%   -7.66% (p=0.002 n=6)
MemclrKnownSize248-64     7.746n ± 0%   7.288n ± 0%   -5.92% (p=0.002 n=6)
MemclrKnownSize256-64     7.749n ± 1%   7.322n ± 0%   -5.50% (p=0.002 n=6)
MemclrKnownSize512-64     15.03n ± 0%   14.72n ± 0%   -2.10% (p=0.002 n=6)
MemclrKnownSize1024-64    30.39n ± 1%   30.25n ± 0%   -0.46% (p=0.002 n=6)
MemmoveKnownSize112-64    5.303n ± 0%   5.222n ± 0%   -1.52% (p=0.002 n=6)
MemmoveKnownSize128-64    5.920n ± 0%   6.008n ± 3%   +1.48% (p=0.002 n=6)
MemmoveKnownSize192-64    8.438n ± 2%   8.790n ± 5%   +4.17% (p=0.002 n=6)
MemmoveKnownSize248-64    10.24n ± 0%   10.59n ± 6%   +3.47% (p=0.002 n=6)
MemmoveKnownSize256-64    11.06n ± 2%   11.48n ± 5%   +3.84% (p=0.002 n=6)
MemmoveKnownSize512-64    21.63n ± 0%   22.53n ± 1%   +4.16% (p=0.002 n=6)
MemmoveKnownSize1024-64   43.06n ± 0%   44.82n ± 0%   +4.10% (p=0.002 n=6)
geomean                   9.729n        9.544n        -1.90%

                        │  08146aa6ae  │             fbdf6b5a98              │
                        │     B/s      │     B/s       vs base               │
MemclrKnownSize112-64     28.63Gi ± 2%   32.71Gi ± 2%  +14.28% (p=0.002 n=6)
MemclrKnownSize128-64     28.96Gi ± 0%   32.17Gi ± 3%  +11.09% (p=0.002 n=6)
MemclrKnownSize192-64     30.18Gi ± 2%   32.68Gi ± 3%   +8.30% (p=0.002 n=6)
MemclrKnownSize248-64     29.82Gi ± 0%   31.69Gi ± 0%   +6.29% (p=0.002 n=6)
MemclrKnownSize256-64     30.77Gi ± 1%   32.56Gi ± 0%   +5.82% (p=0.002 n=6)
MemclrKnownSize512-64     31.72Gi ± 0%   32.41Gi ± 0%   +2.18% (p=0.002 n=6)
MemclrKnownSize1024-64    31.38Gi ± 1%   31.53Gi ± 0%   +0.47% (p=0.002 n=6)
MemmoveKnownSize112-64    19.67Gi ± 0%   19.97Gi ± 0%   +1.54% (p=0.002 n=6)
MemmoveKnownSize128-64    20.14Gi ± 0%   19.84Gi ± 3%   -1.45% (p=0.002 n=6)
MemmoveKnownSize192-64    21.19Gi ± 2%   20.34Gi ± 5%   -3.99% (p=0.002 n=6)
MemmoveKnownSize248-64    22.56Gi ± 0%   21.81Gi ± 5%   -3.32% (p=0.002 n=6)
MemmoveKnownSize256-64    21.57Gi ± 2%   20.76Gi ± 5%   -3.74% (p=0.002 n=6)
MemmoveKnownSize512-64    22.05Gi ± 0%   21.17Gi ± 1%   -4.01% (p=0.002 n=6)
MemmoveKnownSize1024-64   22.15Gi ± 0%   21.28Gi ± 0%   -3.94% (p=0.002 n=6)
geomean                   25.36Gi        25.85Gi        +1.94%

Alex Markin

Jun 10, 2025, 9:58:26 AMJun 10
to golang-dev
Hi.

Thanks for the feedback. I ran a few experiments and finally got the following results:

```
goos: linux
goarch: arm64 (Kunpeng920)
pkg: runtime
                     │ base_1.log  │            new_u4_1.log            │
                     │   sec/op    │   sec/op     vs base               │
MemmoveKnownSize112    4.576n ± 0%   4.294n ± 6%  -6.17% (p=0.006 n=20)
MemmoveKnownSize128    5.012n ± 0%   4.911n ± 0%  -2.02% (p=0.000 n=20)
MemmoveKnownSize192    6.932n ± 0%   7.068n ± 0%  +1.96% (p=0.000 n=20)
MemmoveKnownSize248    8.575n ± 0%   8.664n ± 0%  +1.03% (p=0.000 n=20)
MemmoveKnownSize256    9.056n ± 1%   9.202n ± 0%  +1.61% (p=0.000 n=20)
MemmoveKnownSize512    17.37n ± 0%   17.75n ± 0%  +2.19% (p=0.000 n=20)
MemmoveKnownSize1024   35.00n ± 1%   35.25n ± 0%  +0.73% (p=0.000 n=20)
geomean                9.598n        9.585n       -0.14%

                     │  base_1.log  │            new_u4_1.log             │
                     │     B/s      │     B/s       vs base               │
MemmoveKnownSize112    22.79Gi ± 0%   24.29Gi ± 6%  +6.57% (p=0.006 n=20)
MemmoveKnownSize128    23.79Gi ± 0%   24.27Gi ± 0%  +2.04% (p=0.000 n=20)
MemmoveKnownSize192    25.80Gi ± 0%   25.30Gi ± 0%  -1.93% (p=0.000 n=20)
MemmoveKnownSize248    26.93Gi ± 0%   26.66Gi ± 0%  -1.02% (p=0.000 n=20)
MemmoveKnownSize256    26.33Gi ± 1%   25.91Gi ± 0%  -1.58% (p=0.000 n=20)
MemmoveKnownSize512    27.45Gi ± 0%   26.86Gi ± 0%  -2.14% (p=0.000 n=20)
MemmoveKnownSize1024   27.25Gi ± 1%   27.05Gi ± 0%  -0.74% (p=0.000 n=20)
geomean                25.71Gi        25.74Gi       +0.13%


goos: linux
goarch: arm64 (Kunpeng920C)
pkg: runtime
                     │ base_1.log  │            new_u4_1.log             │
                     │   sec/op    │   sec/op     vs base                │
MemmoveKnownSize112    2.885n ± 0%   2.430n ± 0%  -15.77% (p=0.000 n=20)
MemmoveKnownSize128    3.188n ± 0%   3.187n ± 0%   -0.03% (p=0.000 n=20)
MemmoveKnownSize192    4.625n ± 0%   4.656n ± 0%   +0.67% (p=0.000 n=20)
MemmoveKnownSize248    5.712n ± 0%   5.663n ± 0%   -0.85% (p=0.000 n=20)
MemmoveKnownSize256    6.006n ± 0%   5.985n ± 0%   -0.35% (p=0.000 n=20)
MemmoveKnownSize512    12.97n ± 0%   12.96n ± 0%   -0.08% (p=0.000 n=20)
MemmoveKnownSize1024   28.16n ± 0%   28.07n ± 0%   -0.28% (p=0.000 n=20)
geomean                6.578n        6.410n        -2.55%

                     │  base_1.log  │             new_u4_1.log             │
                     │     B/s      │     B/s       vs base                │
MemmoveKnownSize112    36.16Gi ± 0%   42.92Gi ± 0%  +18.70% (p=0.000 n=20)
MemmoveKnownSize128    37.40Gi ± 0%   37.40Gi ± 0%   +0.01% (p=0.000 n=20)
MemmoveKnownSize192    38.66Gi ± 0%   38.41Gi ± 0%   -0.66% (p=0.000 n=20)
MemmoveKnownSize248    40.44Gi ± 0%   40.79Gi ± 0%   +0.85% (p=0.000 n=20)
MemmoveKnownSize256    39.69Gi ± 0%   39.83Gi ± 0%   +0.35% (p=0.000 n=20)
MemmoveKnownSize512    36.77Gi ± 0%   36.81Gi ± 0%   +0.08% (p=0.001 n=20)
MemmoveKnownSize1024   33.87Gi ± 0%   33.97Gi ± 0%   +0.29% (p=0.000 n=20)
geomean                37.51Gi        38.49Gi        +2.61%
```

The only change is unrolling not by 2, but by 4 for DUFFCOPY.
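
To illustrate what changing the unroll factor means, here is a sketch in Go rather than arm64 assembly. It is not the code from the CL: `copyUnroll2` and `copyUnroll4` are hypothetical names, and the real loop is emitted by the compiler as machine instructions. The point is only that an unroll factor of 4 takes half as many loop branches as a factor of 2 for the same amount of data.

```go
package main

import "fmt"

// copyUnroll2 copies src into dst two words per loop iteration and returns
// the number of unrolled iterations executed. It assumes len(dst) >= len(src).
func copyUnroll2(dst, src []uint64) (iters int) {
	i := 0
	for ; i+2 <= len(src); i += 2 {
		dst[i] = src[i]
		dst[i+1] = src[i+1]
		iters++
	}
	for ; i < len(src); i++ { // tail: at most one leftover word
		dst[i] = src[i]
	}
	return iters
}

// copyUnroll4 does the same work four words per loop iteration.
func copyUnroll4(dst, src []uint64) (iters int) {
	i := 0
	for ; i+4 <= len(src); i += 4 {
		dst[i] = src[i]
		dst[i+1] = src[i+1]
		dst[i+2] = src[i+2]
		dst[i+3] = src[i+3]
		iters++
	}
	for ; i < len(src); i++ { // tail: at most three leftover words
		dst[i] = src[i]
	}
	return iters
}

func main() {
	src := make([]uint64, 32) // 32 words = 256 bytes, as in MemmoveKnownSize256
	dst := make([]uint64, len(src))
	fmt.Println("unroll by 2:", copyUnroll2(dst, src), "iterations")
	fmt.Println("unroll by 4:", copyUnroll4(dst, src), "iterations")
}
```

For a 256-byte copy this runs 16 iterations when unrolled by 2 and 8 when unrolled by 4, so there are fewer branches for the predictor to get wrong.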

On Friday, June 6, 2025 at 03:20:01 UTC+3, Keith Randall wrote:

Keith Randall

Jun 16, 2025, 7:14:04 PMJun 16
to Alex Markin, golang-dev
Ok, things look mostly positive, thanks.

I've made a more-real stack of CLs. It should lower the overhead somewhat.
Based on the results you all have provided, I've also unrolled the memmove loop more for the larger sizes. Hopefully that will mitigate some of the slowdowns we saw.



peterGo

Jun 16, 2025, 9:19:59 PMJun 16
to golang-dev
On Monday, June 16, 2025 at 7:14:04 PM UTC-4 Keith Randall wrote:
Ok, things look mostly positive, thanks.

I've made a more-real stack of CLs. It should lower the overhead somewhat.
Based on the results you all have provided, I've also unrolled the memmove loop more for the larger sizes. Hopefully that will mitigate some of the slowdowns we saw.


 
Keith.

~/go/src $ GOROOT_BOOTSTRAP=~/go.boot ./make.bash
~/go/src $ go version
go version go1.25-devel_d058254689 Mon Jun 16 15:48:19 2025 -0700 linux/arm64

~/go $ git apply --stat arm64.removeduff.patchset4.d267f82.diff
 src/cmd/internal/obj/arm64/asm7.go |    8 -
 src/cmd/internal/obj/arm64/obj7.go |  146 --------------------
 src/runtime/duff_arm64.s           |  267 ------------------------------------
 src/runtime/mkduff.go              |    1
 4 files changed, 4 insertions(+), 418 deletions(-)
~/go $ git apply --apply --verbose  arm64.removeduff.patchset4.d267f82.diff
Checking patch src/cmd/internal/obj/arm64/asm7.go...
Checking patch src/cmd/internal/obj/arm64/obj7.go...
Checking patch src/runtime/duff_arm64.s...
Checking patch src/runtime/mkduff.go...
Applied patch src/cmd/internal/obj/arm64/asm7.go cleanly.
Applied patch src/cmd/internal/obj/arm64/obj7.go cleanly.
Applied patch src/runtime/duff_arm64.s cleanly.
Applied patch src/runtime/mkduff.go cleanly.
$

~/go/src $ GOROOT_BOOTSTRAP=~/go.boot ./make.bash
Building Go cmd/dist using /home/peter/go.boot. (go1.23.3 linux/arm64)
Building Go toolchain1 using /home/peter/go.boot.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
<autogenerated>:1: illegal combination: 00072 (/home/peter/go/src/internal/trace/tracev2/spec.go:97) DUFFCOPY runtime.duffcopy+464(SB) NONE NONE NONE ADDR NONE, 0 3

go tool dist: FAILED: /home/peter/go/pkg/tool/linux_arm64/compile -std -pack -o /tmp/go-tool-dist-3673796576/internal/trace/tracev2/_go_.a -p internal/trace/tracev2 -importcfg /tmp/go-tool-dist-3673796576/internal/trace/tracev2/importcfg /home/peter/go/src/internal/trace/tracev2/doc.go /home/peter/go/src/internal/trace/tracev2/events.go /home/peter/go/src/internal/trace/tracev2/spec.go: exit status 2
$


Peter

 

Keith Randall

Jun 16, 2025, 9:22:30 PMJun 16
to peterGo, golang-dev
I think you just applied the last of the 7 CLs. You will need all 7 CLs (or a prefix of them) to have a working Go build.



peterGo

Jun 16, 2025, 9:45:32 PMJun 16
to golang-dev
Keith,

I suspected something like that! What are the other 7 CLs?

Peter

Keith Randall

Jun 17, 2025, 12:02:45 AMJun 17
to peterGo, golang-dev
If you click on that CL link you'll see the list of them in the "Relation Chain" column.

If you want to apply them in order, the order is:

678619
678936
680335
678937
678620
678995
679015
679456

If you want all of them, I think on that CL page if you click on Download and select the Checkout command and paste that into a clean client, you should get all of the CLs.
That would be

git fetch https://go.googlesource.com/go refs/changes/56/679456/4 && git checkout FETCH_HEAD
