Value copy costs are not very predictable.

tapi...@gmail.com

May 29, 2021, 11:49:47 PM
to golang-nuts

The result:

$ go test -bench=.
goos: linux
goarch: amd64
pkg: example.com/valuecopy
cpu: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
Benchmark_CopyBool-4                    1000000000             0.8885 ns/op
Benchmark_CopyByte-4                    1000000000             0.8872 ns/op
Benchmark_CopyInt16-4                   1000000000             0.8785 ns/op
Benchmark_CopyInt32-4                   1000000000             0.8854 ns/op
Benchmark_CopyInt64-4                   1000000000             0.8831 ns/op
Benchmark_CopyPointer-4                 911733464             1.330 ns/op
Benchmark_CopyString-4                  901249356             1.325 ns/op
Benchmark_CopySlice-4                   664187247             1.765 ns/op
Benchmark_CopyArray_2_elements-4        1000000000             0.8874 ns/op
Benchmark_CopyArray_3_elements-4        1000000000             1.096 ns/op
Benchmark_CopyArray_4_elements-4        1000000000             1.105 ns/op
Benchmark_CopyArray_5_elements-4        534542524             2.202 ns/op
Benchmark_CopyArray_6_elements-4        727849554             1.606 ns/op
Benchmark_CopyArray_7_elements-4        444494692             2.649 ns/op
Benchmark_CopyArray_8_elements-4        584854867             1.993 ns/op
Benchmark_CopyArray_9_elements-4        389639859             3.083 ns/op
Benchmark_CopyArray_10_elements-4       267380602             4.418 ns/op
Benchmark_CopyArray_11_elements-4       242644033             4.867 ns/op
Benchmark_CopyArray_12_elements-4       268304104             4.498 ns/op
Benchmark_CopyArray_13_elements-4       82165272            14.46 ns/op
Benchmark_CopyStruct_2_fields-4         1000000000             0.5029 ns/op
Benchmark_CopyStruct_3_fields-4         671136589             1.769 ns/op
Benchmark_CopyStruct_4_fields-4         1000000000             0.8785 ns/op
Benchmark_CopyStruct_5_fields-4         530876049             2.202 ns/op
Benchmark_CopyStruct_6_fields-4         723380257             1.581 ns/op
Benchmark_CopyStruct_7_fields-4         444619906             2.636 ns/op
Benchmark_CopyStruct_8_fields-4         588605260             1.968 ns/op
Benchmark_CopyStruct_9_fields-4         387253551             3.073 ns/op
Benchmark_CopyStruct_10_fields-4        267450452             4.396 ns/op
Benchmark_CopyStruct_11_fields-4        246289522             4.855 ns/op
Benchmark_CopyStruct_12_fields-4        266212528             4.426 ns/op
Benchmark_CopyStruct_13_fields-4        207298701             5.739 ns/op

From the benchmark results, it looks like:
* the cost of copying a [13]int value is much larger than that of copying a [12]int value.
* the cost of copying a struct{a, b, c int} value is about double that of copying a struct{a, b, c, d int} value.


Kurtis Rader

May 30, 2021, 12:28:55 AM
to tapi...@gmail.com, golang-nuts
On Sat, May 29, 2021 at 8:50 PM tapi...@gmail.com <tapi...@gmail.com> wrote:
> ...
> From the benchmark results, it looks like:
> * the cost of copying a [13]int value is much larger than that of copying a [12]int value.
> * the cost of copying a struct{a, b, c int} value is about double that of copying a struct{a, b, c, d int} value.

The size and internal layout of structs have a huge effect on these types of benchmarks, due to the interaction with the L1 and L2 caches and the CPU architecture's policies for managing those caches. Thus this type of benchmark needs to document the CPU architecture being used, and also the results on other architectures. It would not be at all surprising if changing the implementation to improve the results on your system resulted in slower behavior on other systems.

--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

tapi...@gmail.com

May 30, 2021, 1:09:55 AM
to golang-nuts
I agree.

Could someone post benchmark results from architectures other than the Intel(R) Core(TM) i5-3210M, for comparison?

Axel Wagner

May 30, 2021, 2:46:23 AM
to golang-nuts
I believe it is safe to say, even without other benchmarks, that a) you are correct that the cost is "unpredictable" [1] and b) that will always be the case. The compiler will always choose different strategies for differently sized values and, if not that, the architecture of computers will have different timing characteristics (based on cache sizes, number of registers…) for differently sized values. This does not constitute a problem. So it seems like a waste of everybody's time (yours included, but you're of course free to spend it however you like) to try and determine where these thresholds lie using blackbox benchmarks.

[1] What you are really saying is that it doesn't depend linearly on size. But if it actually were unpredictable, it would vary significantly for the same value. All you've shown is that the prediction is more complicated than "size in bytes times some time interval".
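
The footnote's "size in bytes times some time interval" model can be checked against the figures posted at the top of this thread. The sketch below is only arithmetic on those numbers; nsPerByte is a hypothetical helper, and it assumes 8-byte ints as on amd64.

```go
package main

import "fmt"

// nsPerByte divides a measured ns/op figure by the size of an
// n-element int array, assuming 8-byte ints (true on amd64).
func nsPerByte(nsPerOp float64, elems int) float64 {
	return nsPerOp / float64(elems*8)
}

func main() {
	// Figures copied from the first benchmark run in this thread.
	samples := []struct {
		elems int
		ns    float64
	}{
		{4, 1.105},
		{8, 1.993},
		{12, 4.498},
		{13, 14.46},
	}
	for _, s := range samples {
		fmt.Printf("[%d]int: %.4f ns/byte\n", s.elems, nsPerByte(s.ns, s.elems))
	}
}
```

If cost were linear in size, the per-byte figure would be roughly constant; with these numbers the [13]int case costs more than twice as much per byte as the [12]int case.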

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/6d32d796-513f-40b6-b5d2-a1e54f4d1d25n%40googlegroups.com.

Jan Mercl

May 30, 2021, 7:04:15 AM
to tapi...@gmail.com, golang-nuts
Within the benchmark loops of the linked code a sufficiently smart compiler can optimize the source values away completely and/or collapse all writes to the destination values to a single write.

Have you looked at the actual code the CPU executes?
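
A minimal sketch of one common way to make such elimination less likely, assuming the gc compiler: keep both the source and the destination in package-level variables so the loop has to load and store real memory. The names src3, dst3 and benchCopyStruct3 are hypothetical, and this is not a guarantee; a sufficiently smart compiler could still collapse the repeated stores.

```go
package main

import (
	"fmt"
	"testing"
)

type struct3 struct{ a, b, c int }

// Package-level variables: the compiler does not know their values at
// the point of the copy, so it cannot replace the copy with constants.
var (
	src3 = struct3{1, 2, 3}
	dst3 struct3
)

// benchCopyStruct3 copies src3 into dst3 b.N times.
func benchCopyStruct3(b *testing.B) {
	for i := 0; i < b.N; i++ {
		dst3 = src3
	}
}

func main() {
	// testing.Benchmark runs the function outside of "go test"
	// and returns the collected result.
	r := testing.Benchmark(benchCopyStruct3)
	fmt.Println("CopyStruct3:", r)
}
```

Whatever the source looks like, checking the generated instructions remains the only way to confirm what is actually being timed.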

tapi...@gmail.com

May 30, 2021, 11:09:52 AM
to golang-nuts
-gcflags=-S shows the code for copying the 3-field and 4-field structs:

    // struct{a, b, c int}
    0x0034 00052 (valuecopy.go:223)    MOVQ    $0, "".struct3_0(SB)
    0x003f 00063 (valuecopy.go:223)    XORPS    X0, X0
    0x0042 00066 (valuecopy.go:223)    MOVUPS    X0, "".struct3_0+8(SB)

    // struct{a, b, c, d int}
    0x0034 00052 (valuecopy.go:233)    XORPS    X0, X0
    0x0037 00055 (valuecopy.go:233)    MOVUPS    X0, "".struct4_0(SB)
    0x003e 00062 (valuecopy.go:233)    MOVUPS    X0, "".struct4_0+16(SB)

I don't understand the instructions.
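
One hedged reading of the listing, for readers unfamiliar with amd64 assembly: XORPS X0, X0 clears the 16-byte register X0, MOVUPS stores those 16 bytes to memory, and MOVQ $0 stores an 8-byte zero. So the 3-field copy is one 8-byte store plus one 16-byte store, the 4-field copy is two 16-byte stores, and neither sequence loads a source value. The sketch below assumes the source values were zero-valued locals, which would explain that; the function names are hypothetical.

```go
package main

import "fmt"

type struct3 struct{ a, b, c int }
type struct4 struct{ a, b, c, d int }

// Destination names follow the symbols in the listing.
var (
	struct3_0 struct3
	struct4_0 struct4
)

func copyStruct3() {
	var src struct3 // assumption: a local left at its zero value
	// Listing: MOVQ $0 writes field a (8 bytes),
	// MOVUPS X0 writes fields b and c (16 bytes of zeros).
	struct3_0 = src
}

func copyStruct4() {
	var src struct4 // assumption: a local left at its zero value
	// Listing: two MOVUPS X0 stores, 16 bytes of zeros each.
	struct4_0 = src
}

func main() {
	copyStruct3()
	copyStruct4()
	fmt.Println(struct3_0, struct4_0)
}
```

Under that reading the benchmark measures how fast zeros can be written to the destination, not how fast a value is read and copied.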

tapi...@gmail.com

May 30, 2021, 12:06:46 PM
to golang-nuts
It is somewhat strange that if any of the bool/byte/int16/int64 benchmarks is removed from this test file (https://play.golang.org/p/w29J9VhtzYH),
then the benchmark results look like this:

Benchmark_CopyStruct_3_fields-4       1000000000             0.7780 ns/op
Benchmark_CopyStruct_4_fields-4       1000000000             0.7513 ns/op

Otherwise, they look like this:

Benchmark_CopyStruct_3_fields-4       1000000000             1.501 ns/op
Benchmark_CopyStruct_4_fields-4       1000000000             0.7755 ns/op

In other words, it looks like there is mutual interference between benchmarks.

Axel Wagner

May 30, 2021, 12:54:02 PM
to tapi...@gmail.com, golang-nuts
That is very normal for micro-benchmarks on a ns scale.


Ian Lance Taylor

May 30, 2021, 2:09:31 PM
to Jan Mercl, tapi...@gmail.com, golang-nuts
On Sun, May 30, 2021 at 4:04 AM Jan Mercl <0xj...@gmail.com> wrote:
>
> Within the benchmark loops of the linked code a sufficiently smart compiler can optimize the source values away completely and/or collapse all writes to the destination values to a single write.

For example, here are the results when using gccgo on my laptop
(Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz):

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Benchmark_CopyBool-8 1000000000 0.0000015 ns/op
Benchmark_CopyByte-8 1000000000 0.0000002 ns/op
Benchmark_CopyInt16-8 1000000000 0.0000002 ns/op
Benchmark_CopyInt32-8 1000000000 0.0000003 ns/op
Benchmark_CopyInt64-8 1000000000 0.0000007 ns/op
Benchmark_CopyPointer-8 1000000000 0.6162 ns/op
Benchmark_CopyString-8 1000000000 0.5529 ns/op
Benchmark_CopySlice-8 1000000000 0.8605 ns/op
Benchmark_CopyArray_2_elements-8 1000000000 0.0000002 ns/op
Benchmark_CopyArray_3_elements-8 1000000000 0.0000002 ns/op
Benchmark_CopyArray_4_elements-8 1000000000 0.0000003 ns/op
Benchmark_CopyArray_5_elements-8 1000000000 0.0000002 ns/op
Benchmark_CopyArray_6_elements-8 1000000000 0.0000002 ns/op
Benchmark_CopyArray_7_elements-8 1000000000 0.0000003 ns/op
Benchmark_CopyArray_8_elements-8 1000000000 0.0000002 ns/op
Benchmark_CopyArray_9_elements-8 1000000000 0.0000001 ns/op
Benchmark_CopyArray_10_elements-8 1000000000 0.0000002 ns/op
Benchmark_CopyArray_11_elements-8 1000000000 0.0000001 ns/op
Benchmark_CopyArray_12_elements-8 1000000000 0.0000002 ns/op
Benchmark_CopyArray_13_elements-8 1000000000 0.0000003 ns/op
Benchmark_CopyStruct_2_fields-8 1000000000 0.0000002 ns/op
Benchmark_CopyStruct_3_fields-8 1000000000 0.0000003 ns/op
Benchmark_CopyStruct_4_fields-8 1000000000 0.0000002 ns/op
Benchmark_CopyStruct_5_fields-8 1000000000 0.0000002 ns/op
Benchmark_CopyStruct_6_fields-8 1000000000 0.0000002 ns/op
Benchmark_CopyStruct_7_fields-8 1000000000 0.0000001 ns/op
Benchmark_CopyStruct_8_fields-8 1000000000 0.0000002 ns/op
Benchmark_CopyStruct_9_fields-8 1000000000 0.0000002 ns/op
Benchmark_CopyStruct_10_fields-8 1000000000 0.0000002 ns/op
Benchmark_CopyStruct_11_fields-8 1000000000 0.0000002 ns/op
Benchmark_CopyStruct_12_fields-8 1000000000 0.0000001 ns/op
Benchmark_CopyStruct_13_fields-8 1000000000 0.0000002 ns/op
PASS
ok command-line-arguments 2.514s

Ian

tapi...@gmail.com

May 31, 2021, 9:29:54 AM
to golang-nuts
On Sunday, May 30, 2021 at 12:54:02 PM UTC-4 axel.wa...@googlemail.com wrote:
> That is very normal for micro-benchmarks on a ns scale.

The results are so consistent that I think it is more related to the CPU cache and the specified directives.

Axel Wagner

May 31, 2021, 10:07:35 AM
to golang-nuts
On Mon, May 31, 2021 at 3:30 PM tapi...@gmail.com <tapi...@gmail.com> wrote:
> On Sunday, May 30, 2021 at 12:54:02 PM UTC-4 axel.wa...@googlemail.com wrote:
> > That is very normal for micro-benchmarks on a ns scale.
> The results are so consistent that I think it is more related to the CPU cache and the specified directives.

Yes that's what I meant. Given that the signal is so small (on a ns scale), noise introduced by alignment and code caches and other micro-architectural details becomes so large (relatively), that it's normal to notice an effect in reordering or removing unrelated code.

That's why they are not a good basis to base decisions on. You can't know if what you're seeing is a real effect or just noise.
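
One way to get a feel for how much of a difference is noise is to repeat the same measurement and look at the spread. The sketch below is a minimal illustration under assumptions: the names src, dst, benchCopy and spread are hypothetical, and three runs is an arbitrary choice.

```go
package main

import (
	"fmt"
	"testing"
)

var (
	src = [4]int{1, 2, 3, 4}
	dst [4]int
)

// benchCopy copies src into dst b.N times.
func benchCopy(b *testing.B) {
	for i := 0; i < b.N; i++ {
		dst = src
	}
}

// spread returns the smallest and largest value in xs.
// It expects at least one element.
func spread(xs []float64) (lo, hi float64) {
	lo, hi = xs[0], xs[0]
	for _, x := range xs[1:] {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	return lo, hi
}

func main() {
	const runs = 3 // arbitrary; more runs give a better picture
	var samples []float64
	for i := 0; i < runs; i++ {
		r := testing.Benchmark(benchCopy)
		// ns/op computed from the total time and iteration count.
		samples = append(samples, float64(r.T.Nanoseconds())/float64(r.N))
	}
	lo, hi := spread(samples)
	fmt.Printf("min %.4f ns/op, max %.4f ns/op\n", lo, hi)
}
```

With go test, the -count flag repeats each benchmark in the same way, and a tool such as benchstat can summarize the repeated runs. Repetition only captures run-to-run variation, though; layout effects caused by adding or removing unrelated code stay constant across runs of the same binary.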
 

tapi...@gmail.com

May 31, 2021, 9:19:09 PM
to golang-nuts
On Monday, May 31, 2021 at 10:07:35 AM UTC-4 axel.wa...@googlemail.com wrote:
> On Mon, May 31, 2021 at 3:30 PM tapi...@gmail.com <tapi...@gmail.com> wrote:
> > On Sunday, May 30, 2021 at 12:54:02 PM UTC-4 axel.wa...@googlemail.com wrote:
> > > That is very normal for micro-benchmarks on a ns scale.
> > The results are so consistent that I think it is more related to the CPU cache and the specified directives.
>
> Yes that's what I meant. Given that the signal is so small (on a ns scale), noise introduced by alignment and code caches and other micro-architectural details becomes so large (relatively), that it's normal to notice an effect in reordering or removing unrelated code.
>
> That's why they are not a good basis to base decisions on. You can't know if what you're seeing is a real effect or just noise.

I tend to think the results are reliable, though I agree they are not a good basis for decisions,
because performance may differ between CPU architectures.
Still, it is interesting to see the concrete CPU instructions generated in different situations.
For example, if the source values are moved out of the benchmark functions and made package-level variables,
the generated instructions become

    0x0034 00052 (/home/d630/Desktop/aa/t/x_test.go:11)    MOVQ    "".struct3_1(SB), DX
    0x003b 00059 (/home/d630/Desktop/aa/t/x_test.go:11)    MOVQ    "".struct3_1+8(SB), BX
    0x0042 00066 (/home/d630/Desktop/aa/t/x_test.go:11)    MOVQ    "".struct3_1+16(SB), SI
    0x0049 00073 (/home/d630/Desktop/aa/t/x_test.go:11)    MOVQ    DX, "".struct3_0(SB)
    0x0050 00080 (/home/d630/Desktop/aa/t/x_test.go:11)    MOVQ    BX, "".struct3_0+8(SB)
    0x0057 00087 (/home/d630/Desktop/aa/t/x_test.go:11)    MOVQ    SI, "".struct3_0+16(SB)

and

    0x0034 00052 (/home/d630/Desktop/aa/t/x_test.go:21)    MOVQ    "".struct4_1(SB), DX
    0x003b 00059 (/home/d630/Desktop/aa/t/x_test.go:21)    MOVQ    "".struct4_1+8(SB), BX
    0x0042 00066 (/home/d630/Desktop/aa/t/x_test.go:21)    MOVQ    "".struct4_1+16(SB), SI
    0x0049 00073 (/home/d630/Desktop/aa/t/x_test.go:21)    MOVQ    "".struct4_1+24(SB), DI
    0x0050 00080 (/home/d630/Desktop/aa/t/x_test.go:21)    MOVQ    DX, "".struct4_0(SB)
    0x0057 00087 (/home/d630/Desktop/aa/t/x_test.go:21)    MOVQ    BX, "".struct4_0+8(SB)
    0x005e 00094 (/home/d630/Desktop/aa/t/x_test.go:21)    MOVQ    SI, "".struct4_0+16(SB)
    0x0065 00101 (/home/d630/Desktop/aa/t/x_test.go:21)    MOVQ    DI, "".struct4_0+24(SB)

And the benchmark results are more predictable:

Benchmark_CopyStruct_3_fields-4       988590592             1.441 ns/op
Benchmark_CopyStruct_4_fields-4       597097029             2.590 ns/op