Small number change will affect benchmark results

328 views
Skip to first unread message

tapi...@gmail.com

unread,
Jul 28, 2021, 10:43:38 AM7/28/21
to golang-nuts


When N == 16384, the benchmark result:

Benchmark_Insert-4          134622          8032 ns/op       32768 B/op           1 allocs/op
Benchmark_Insert2-4         132049          8201 ns/op       32768 B/op           1 allocs/op

When N == 16385, the benchmark result:

Benchmark_Insert-4          118677          9374 ns/op       40960 B/op           1 allocs/op
Benchmark_Insert2-4         136845          7744 ns/op       40960 B/op           1 allocs/op

Brian Candler

unread,
Jul 28, 2021, 11:44:13 AM7/28/21
to golang-nuts
I think you have created rather a lot of threads recently on exactly the same topic:

I'm not convinced that another one is needed.  There have been good answers in the previous threads.

Go has a fairly complex runtime (as you'll see from the size of compiling "Hello world"), and such boundary conditions are to be expected, especially when looking at memory allocation.  But these rarely matter in real-world programs.

If they do matter in your application, then you may be happier with a language like C, where the machine-code generated maps more directly to the code you write.  Even then, you will come across oddities in the microarchitectures of the underlying hardware, such as what happens when caches are under eviction pressure.

When I was programming 6800's and 6502's, I was able to work out exactly how long a piece of code would take to run, by generating a cycle-by-cycle count.  That's not possible any more :-)

tapi...@gmail.com

unread,
Jul 28, 2021, 1:30:09 PM7/28/21
to golang-nuts
I think this one is different from previous. I don't criticize Go, I just seek reasons.

Brian Candler

unread,
Jul 28, 2021, 1:45:27 PM7/28/21
to golang-nuts
I didn't see it as criticism of go.  However you're looking at micro-behaviour which has little relevance to the real world, and hence not of interest to most go programmers.

Compilers and runtimes contain heuristics, such as "if I grow a slice/map/whatever beyond size X, then increase it to size Y".  These are necessary because the runtime cannot see the future: it doesn't know how many times you're going to keep growing the same object, and it would be inefficient to keep growing the same object by the minimum amount possible.  Memory management may say "if an object is larger than A, then round it up to a multiple of B" to avoid memory fragmentation.

Therefore, you should expect the implementation to have behaviour like this, which is designed to work well in the majority of use cases, most of the time.  In other words, what you see ("small number change will affect benchmark results") is normal and unremarkable.

Axel Wagner

unread,
Jul 28, 2021, 1:47:44 PM7/28/21
to tapi...@gmail.com, golang-nuts
On Wed, Jul 28, 2021 at 7:30 PM tapi...@gmail.com <tapi...@gmail.com> wrote:
I think this one is different from previous. I don't criticize Go, I just seek reasons.

Implying that previously you've been criticizing Go?

As for your search for reasons: 16384 is a power of two. So I assume that what changes is that an allocation enters a new size-class, which uses a different algorithm for allocation. Or something along those lines.
 
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/2c217862-84b7-4bff-a48a-06810848bcf4n%40googlegroups.com.

tapi...@gmail.com

unread,
Jul 28, 2021, 2:34:20 PM7/28/21
to golang-nuts
On Wednesday, July 28, 2021 at 1:47:44 PM UTC-4 axel.wa...@googlemail.com wrote:
On Wed, Jul 28, 2021 at 7:30 PM tapi...@gmail.com <tapi...@gmail.com> wrote:
I think this one is different from previous. I don't criticize Go, I just seek reasons.

Implying that previously you've been criticizing Go?

Sorry, my English is always not decent.
 

As for your search for reasons: 16384 is a power of two. So I assume that what changes is that an allocation enters a new size-class, which uses a different algorithm for allocation. Or something along those lines.

Yes, 16384+16384= 32768 and 16385+16384= 32769.
32768 bytes will be allocated on a memory block in the 32768 size class,
which is the largest size class.
32769 bytes will be allocated on 5 pages (each page is 8192 bytes).

But the implementations of Insert and Insert2 are very similar,
the allocation algorithms for them should be the same (for a specified number).
So it is weird the number change causes positive impact for one implementation,effect
but negative impact for the other.

Maybe, it is still a CPU related mystery.
 
 

On Wednesday, July 28, 2021 at 11:44:13 AM UTC-4 Brian Candler wrote:
I think you have created rather a lot of threads recently on exactly the same topic:

I'm not convinced that another one is needed.  There have been good answers in the previous threads.

Go has a fairly complex runtime (as you'll see from the size of compiling "Hello world"), and such boundary conditions are to be expected, especially when looking at memory allocation.  But these rarely matter in real-world programs.

If they do matter in your application, then you may be happier with a language like C, where the machine-code generated maps more directly to the code you write.  Even then, you will come across oddities in the microarchitectures of the underlying hardware, such as what happens when caches are under eviction pressure.

When I was programming 6800's and 6502's, I was able to work out exactly how long a piece of code would take to run, by generating a cycle-by-cycle count.  That's not possible any more :-)

On Wednesday, 28 July 2021 at 15:43:38 UTC+1 tapi...@gmail.com wrote:


When N == 16384, the benchmark result:

Benchmark_Insert-4          134622          8032 ns/op       32768 B/op           1 allocs/op
Benchmark_Insert2-4         132049          8201 ns/op       32768 B/op           1 allocs/op

When N == 16385, the benchmark result:

Benchmark_Insert-4          118677          9374 ns/op       40960effect B/op           1 allocs/op

Benchmark_Insert2-4         136845          7744 ns/op       40960 B/op           1 allocs/op

Axel Wagner

unread,
Jul 28, 2021, 4:15:38 PM7/28/21
to golang-nuts
You might not be able to get a cycle-by-cycle accounting, but with unlimited effort, you *can* get pretty darn close. Of course, that effort is usually not worth it. However, what you can always do, which is really the very first step to understand a benchmark result and attempt a micro-optimization like this, is look at the generated code and/or an actual CPU profile to see what the difference in generated code is (it might be none, in which case it's "spooky architecture action at a distance") and what the difference is where the generated code spends its time.

If you've done that, i would've expected you to share your insights. If you haven't done it, it seems rude to skip that step and expect people on golang-nuts to do it for you, if I'm being honest. That is why I'm a bit frustrated with these threads, personally. It's not that they are boring questions - in fact, I'd probably enjoy listening to a talk or reading an extensive blog post answering them very much. It's that you seem to generate arbitrary benchmarks and then ask other people to do the work of interpreting them for you.

In any case, all I was trying to say by pointing out the power of two is that that's the least surprising boundary for something to change. Whether it's allocation-details (apparently unlikely) or just some other optimization heuristic rolling over or whether it is some architectural detail like cache-sizes mattering. Which of these it is, I don't know either - but I'm not surprised enough to find out.

tapi...@gmail.com

unread,
Jul 28, 2021, 10:23:52 PM7/28/21
to golang-nuts
On Wednesday, July 28, 2021 at 4:15:38 PM UTC-4 axel.wa...@googlemail.com wrote:
You might not be able to get a cycle-by-cycle accounting, but with unlimited effort, you *can* get pretty darn close. Of course, that effort is usually not worth it. However, what you can always do, which is really the very first step to understand a benchmark result and attempt a micro-optimization like this, is look at the generated code and/or an actual CPU profile to see what the difference in generated code is (it might be none, in which case it's "spooky architecture action at a distance") and what the difference is where the generated code spends its time.

If you've done that, i would've expected you to share your insights. If you haven't done it, it seems rude to skip that step and expect people on golang-nuts to do it for you, if I'm being honest. That is why I'm a bit frustrated with these threads, personally.

I will when I confirm that no one could give an answer without much effort. If you feel frustrated, you can ignored it. ;D

Kurtis Rader

unread,
Jul 28, 2021, 10:42:58 PM7/28/21
to tapi...@gmail.com, golang-nuts
On Wed, Jul 28, 2021 at 7:24 PM tapi...@gmail.com <tapi...@gmail.com> wrote:
I will when I confirm that no one could give an answer without much effort. If you feel frustrated, you can ignored it. ;D

Like Axel, I too am mildly annoyed by your questions. Primarily because you don't seem to understand that the Go community isn't your private unpaid research team and every time you start one of these threads you omit important facts. Also, you don't seem to understand why things like the CPU architecture, cache line size, L1 & L2 cache sizes, and similar factors are frequently significant for the types of questions you keep asking. Frankly, as Axel pointed out, you should be figuring out the answer and sharing your findings with the community.

--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

tapi...@gmail.com

unread,
Jul 28, 2021, 11:23:31 PM7/28/21
to golang-nuts
The go-nuts community is for communicating, right?
I think it is not hard to ignore the uninterested topics.

I will figure out the answer with as less effort as possible.
If someone happens knows the answer, I'm very appreciated.
If no one could give a confirmed answer, then I will seek other routes.

Axel Wagner

unread,
Jul 29, 2021, 2:09:14 AM7/29/21
to golang-nuts
It would have taken you less time to look at the generated code and/or a CPU profile than it would have to post this - let alone the rest of the discussion. I also believe it likely would take more time to type out the answer to your question, than it would for you to look at the generated code and/or a CPU profile. So no, I reject your reasoning that this is more convenient. Even *if* it saves time in absolute terms (i.e. even *if* someone can send an answer faster than you could do these most basic checks), it's still a tradeoff between taking up *your* time and taking up *their* time. It's still rude.

I also reject your reasoning that "golang-nuts is for communication". There are clear and written limitations on that (in the form of the CoC) and also unwritten limitations (otherwise we'd get constant recruiter-spam). If multiple people ask you to do a bit more research yourself before asking these questions, it's an indicator that you are stepping closer to these unwritten limitations. You should at least take them seriously.

Lastly, it's, again, not like your questions are not *interesting*. I'd likely want to read about them, based on the topic. I'm generally interested in optimizations and compiler internals. But I would prefer them to be pre-filtered to actual conundrums, instances where I can reasonably help or learn something new. So, IMO, it would also be in your best interest to do this. Not only do you learn more about investigating these things, you also will likely increase the usefulness of the answers, if people *don't* just discard your message as "oh, it's one of these again".

Anyways. To repeat this one last time: I'd respectfully ask you to conduct a little bit more reasearch on your own, before sending these mails. Namely look at the generated code (the Compiler Explorer makes this really simple) and look at the relevant profiling information (this blog post, while a little dated, still gives a good intro). I genuinely think this is more helpful advice than trying to answer your question.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

Axel Wagner

unread,
Jul 29, 2021, 3:22:35 AM7/29/21
to golang-nuts
FWIW I did take a look at the output (I ended up curious enough): https://go.godbolt.org/z/WK8xYd1E3

Insert and Insert2 generate pretty different code. In particular, Insert2 uses makeslicecopy, to fold the `make` and the `copy(s2, a)` (avoiding having to zero the relevant memory). `Insert` uses makeslice and a memmove for the `copy(s2, s[:k])`.

I assume that's because the compiler can easily prove that `len(s2) == len(a)+len(b)+len(c) > len(a)`, therefore it sees that the `copy` actually copies all of `a`. For `Insert`, it needs to do extra bounds checks.

So, looking at this output might've answered your question before posting it. It certainly would've provided helpful context for anyone trying to explain the behavior who isn't a domain expert in how the compiler optimizes things like this. i.e. it certainly would've increased the likelihood that any questions remaining *will* get an answer.

Axel Wagner

unread,
Jul 29, 2021, 3:26:40 AM7/29/21
to golang-nuts

tapi...@gmail.com

unread,
Jul 29, 2021, 7:54:52 AM7/29/21
to golang-nuts
On Thursday, July 29, 2021 at 3:22:35 AM UTC-4 axel.wa...@googlemail.com wrote:
FWIW I did take a look at the output (I ended up curious enough): https://go.godbolt.org/z/WK8xYd1E3

Insert and Insert2 generate pretty different code. In particular, Insert2 uses makeslicecopy, to fold the `make` and the `copy(s2, a)` (avoiding having to zero the relevant memory). `Insert` uses makeslice and a memmove for the `copy(s2, s[:k])`.

I assume that's because the compiler can easily prove that `len(s2) == len(a)+len(b)+len(c) > len(a)`, therefore it sees that the `copy` actually copies all of `a`. For `Insert`, it needs to do extra bounds checks.

So, looking at this output might've answered your question before posting it. It certainly would've provided helpful context for anyone trying to explain the behavior who isn't a domain expert in how the compiler optimizes things like this. i.e. it certainly would've increased the likelihood that any questions remaining *will* get an answer.

I also found elements of s2[:len(a)] are not reset.
This could explain why Insert2 is faster than Insert.
However, this doesn't answer why Insert2 is less performant than Insert when N == 16384.

Wojciech S. Czarnecki

unread,
Jul 29, 2021, 8:16:45 PM7/29/21
to golan...@googlegroups.com
Dnia 2021-07-28, o godz. 19:42:05
Kurtis Rader <kra...@skepticism.us> napisał(a):

> [...] I too am mildly annoyed by [Tapir Liu] questions.

As I was a few years ago. But let's do Tapir Liu justice: he posted a good manual
online [https://go101.org] and he updates it as his knowledge grows.
Also his stubborn picking at the tiny bits of the language (and community answers
to his drilling) not once gave me that a-ha moment revealing that things I considered
POGos really aren't. As annoying his adventures are, he does a favor to the community.
Both to novices (with his manual) and to the ol' Gophers too, taming their (my!) self-assurance.

> the Go community isn't your private unpaid

He has paid us to the bit with his go101.org manual.
I often recommend it as the "second view" docs. (Second to the official site).

PS. @TapirLiu:
> every time you start one of these threads you omit important facts.

This stands! You often post too soon. Please remember that this is MAILING
list that just happens to have web interface. You do often "thinkk in loud" and I see an
inflow of messages to my mailbox.
Just let your question grow for a while before you post, will you?:)

TC,

--
Wojciech S. Czarnecki
<< ^oo^ >> OHIR-RIPE

tapi...@gmail.com

unread,
Aug 18, 2021, 11:06:53 PM8/18/21
to golang-nuts
Spent some time on investigating this problem. It is confirmed that, when N == 16384, the memclrNoHeapPointers function is called twice in the Insert2 function. That means the "make+copy` optimization introduced in Go 1.15 will perform worse than the normal route under such situation.

tapi...@gmail.com

unread,
Aug 18, 2021, 11:35:05 PM8/18/21
to golang-nuts
More specifically, the last parameter (needzero) of mallocgc is not passed down to c.nextFree, which causes the memclrNoHeapPointers function is called twice in makeslicecopy if the slice size <= 32768.

tapi...@gmail.com

unread,
Aug 19, 2021, 1:17:43 AM8/19/21
to golang-nuts

tapi...@gmail.com

unread,
Oct 30, 2021, 11:11:22 AM10/30/21
to golang-nuts
The benchmarks become much less weird. https://play.golang.org/p/leXp-8MB6gi

On Wednesday, August 18, 2021 at 11:35:05 PM UTC-4 tapi...@gmail.com wrote:
Reply all
Reply to author
Forward
0 new messages