I think this one is different from the previous ones. I'm not criticizing Go, I'm just looking for reasons.
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/2c217862-84b7-4bff-a48a-06810848bcf4n%40googlegroups.com.
> I think this one is different from the previous ones. I'm not criticizing Go, I'm just looking for reasons.
Implying that previously you've been criticizing Go?
As for your search for reasons: 16384 is a power of two. So I assume that what changes is that an allocation enters a new size-class, which uses a different algorithm for allocation. Or something along those lines.
On Wednesday, July 28, 2021 at 11:44:13 AM UTC-4 Brian Candler wrote:
I think you have created rather a lot of threads recently on exactly the same topic. I'm not convinced that another one is needed. There have been good answers in the previous threads.
Go has a fairly complex runtime (as you'll see from the size of a compiled "Hello world" binary), and such boundary conditions are to be expected, especially when looking at memory allocation. But these rarely matter in real-world programs.
If they do matter in your application, then you may be happier with a language like C, where the generated machine code maps more directly to the code you write. Even then, you will come across oddities in the microarchitectures of the underlying hardware, such as what happens when caches are under eviction pressure.
When I was programming 6800s and 6502s, I was able to work out exactly how long a piece of code would take to run, by generating a cycle-by-cycle count. That's not possible any more :-)
On Wednesday, 28 July 2021 at 15:43:38 UTC+1 tapi...@gmail.com wrote:
The benchmark code: https://play.golang.org/p/IqVnVa5x9qp
When N == 16384, the benchmark result:
Benchmark_Insert-4     134622    8032 ns/op    32768 B/op    1 allocs/op
Benchmark_Insert2-4    132049    8201 ns/op    32768 B/op    1 allocs/op
When N == 16385, the benchmark result:
Benchmark_Insert-4     118677    9374 ns/op    40960 B/op    1 allocs/op
Benchmark_Insert2-4    136845    7744 ns/op    40960 B/op    1 allocs/op
You might not be able to get a cycle-by-cycle accounting, but with unlimited effort, you *can* get pretty darn close. Of course, that effort is usually not worth it. However, what you can always do, and what is really the very first step in understanding a benchmark result and attempting a micro-optimization like this, is look at the generated code and/or an actual CPU profile. That shows you what the difference in generated code is (it might be none, in which case it's "spooky architecture action at a distance") and how the two versions differ in where they spend their time.
If you've done that, I would've expected you to share your insights. If you haven't done it, it seems rude to skip that step and expect people on golang-nuts to do it for you, if I'm being honest. That is why I'm a bit frustrated with these threads, personally.
I will, once I've confirmed that no one can give an answer without much effort. If you feel frustrated, you can ignore it. ;D
FWIW I did take a look at the output (I ended up curious enough): https://go.godbolt.org/z/WK8xYd1E3
Insert and Insert2 generate pretty different code. In particular, Insert2 uses makeslicecopy, to fold the `make` and the `copy(s2, a)` (avoiding having to zero the relevant memory). `Insert` uses makeslice and a memmove for the `copy(s2, s[:k])`.
I assume that's because the compiler can easily prove that `len(s2) == len(a)+len(b)+len(c) > len(a)`, therefore it sees that the `copy` actually copies all of `a`. For `Insert`, it needs to do extra bounds checks.
So, looking at this output might've answered your question before posting it. It certainly would've provided helpful context for anyone trying to explain the behavior who isn't a domain expert in how the compiler optimizes things like this. i.e. it certainly would've increased the likelihood that any questions remaining *will* get an answer.