Brian's words are wise. My point was that it's almost impossible to generalize.
But to provide a specific workflow for analyzing your actual production code in place, in practice:
a) write a Benchmark that focuses on your area of interest in your actual code.
b) compile the test binary with "go test -c". You do this so pprof can show you the disassembly in step e) below.
c) run the benchmark with -benchtime=10s and -cpuprofile cpu.prof
For example, from my run two minutes ago, the full benchmark run line was:
go test -v -tags memfs -run=xxx -bench Benchmark_Iter_YogaDB_Ascend -benchtime=10s -cpuprofile cpu.prof
d) open pprof in your browser:
go tool pprof -http :7777 yogadb.test cpu.prof
It will show you the flame graph. Look for the longest horizontal bar, click it, then click it again
once it expands; then select View -> Source from the upper-left menu. It will show you the source
code with the time spent on each hot line, out of the 10 seconds that you ran.
e) if need be, choose View -> Disassembly instead of View -> Source
f) highlight the entire section that contains the hottest line (line 1252, in my run),
g) paste the text into your favorite LLM and ask it for ideas to help you optimize that code.
It will give you amazingly good ideas 90% of the time. Try them one by one, running
your benchmark after each one and observing whether the timing improved. Sometimes you
need to redirect it to just try to eliminate function calls, or to manually inline hot parts
of functions to avoid function calls. My ability to read assembly -- horrible. Does
not matter, because the LLM speaks it fluently. Leverage that.
There is an example of what this process can do here, in this next link. It took a 340 nsec hot iteration path
down to, currently, less than 10 nsec, which is about an L3 cache load on my 2020-vintage Intel CPU.
Finally, move over to Linux and run "perf" to analyze and optimize your L1 hit rate. Something like:
sudo perf stat -e L1-dcache-loads,L1-dcache-load-misses ./drwmutex-bench -strat 1
Enjoy the process.
- Jason
p.s. In case some of the steps were unclear, there is a nice video from Prashant V that illustrates
optimization in Go: "Profiling and Optimizing Go"