Sure, that makes sense. How does this compare with storing a function pointer and calling through the function pointer? On the one hand I need to be fast; on the other, my application does demand a certain flexibility, e.g. I have different transport layers underneath me, and those are abstracted through interfaces.
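(For concreteness, this is roughly the shape of the two approaches I'm weighing; just a toy sketch, with made-up names like Transport and conn. As far as I understand it, both end up as an indirect call at the call site.)

package main

import "fmt"

// Interface-based abstraction over a transport (hypothetical names,
// just to illustrate the comparison).
type Transport interface {
	Send(b []byte) error
}

type tcpTransport struct{}

func (tcpTransport) Send(b []byte) error { return nil }

// "Function pointer" style: store the call target as a func value.
type conn struct {
	send func(b []byte) error
}

func main() {
	var t Transport = tcpTransport{}
	_ = t.Send([]byte("hi")) // indirect call through the interface's method table

	c := conn{send: tcpTransport{}.Send} // method value bound to a receiver
	_ = c.send([]byte("hi"))             // indirect call through the func value

	fmt.Println("both are indirect calls; neither is normally inlined")
}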
I’ve done that in some of the lower layers in our stack; in the case of my Go stack that’s kind of an application problem, *except* insofar as I’ve tried to optimize certain bits of the code. I could probably still do some further enhancement by optimizing for per-processor resource pools, but admittedly I’m not too sure how Go maps goroutines onto the CPUs they actually get scheduled on (I’d know how to do this in C, in the kernel). It would be nice if the resource pools Go introduced in 1.3 (sync.Pool) were efficient enough to hide this particular little detail from me; sadly, my experience in 1.3 was that they performed less well than my hand-coded equivalent using channels. I’ve not researched why that is, and it may well be better in 1.4; certainly it was a somewhat surprising, and disappointing, result.
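(Roughly, the comparison was between something like the following; a simplified sketch, with the buffer size and names invented for illustration.)

package main

import "sync"

const bufSize = 2048

// Hand-coded pool: a buffered channel used as a free list.
type chanPool struct {
	free chan []byte
}

func newChanPool(n int) *chanPool {
	return &chanPool{free: make(chan []byte, n)}
}

func (p *chanPool) Get() []byte {
	select {
	case b := <-p.free:
		return b
	default:
		return make([]byte, bufSize) // pool empty, allocate
	}
}

func (p *chanPool) Put(b []byte) {
	select {
	case p.free <- b:
	default: // pool full, let the GC reclaim it
	}
}

// The sync.Pool version (sync.Pool arrived in Go 1.3).
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, bufSize) },
}

func main() {
	p := newChanPool(128)
	b := p.Get()
	p.Put(b)

	b2 := bufPool.Get().([]byte)
	bufPool.Put(b2)
}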
Processor stalls and their impact vary from CPU to CPU. In my experience implementing network stacks, they can be significant, particularly when you are working with software that processes on the order of 1 Mpps (million packets per second) or more. In that case, every single extra instruction costs you, and branch misprediction can be tragic. Admittedly, this is not the “usual” case, but for applications like high-frequency trading and core network stacks, the benefit of getting this right can be substantial. (I do think application developers worry about branch prediction more often than they should; it’s frequently a sign of premature optimization. But in *some* cases there is real value here.)
Imagine that you have 1 M packets per second, and each packet has a dozen or so branches that get executed as it flows along the code path. Most of those branches have a 99.9% chance of going a certain way. But when they are predicted wrong, up to 12 M processor pipeline stalls per second can be evil indeed. For example, the Pentium 4 has a 20-stage pipeline. If you have to flush that, you’re throwing away up to 19 stages’ worth of work. You really don’t want to do that very often. (I’ve seen one estimate that mispredicting just 10% of the branches in branch-heavy code can slow a Pentium 4 down by 20-40%. I’m not sure I really believe it is that dramatic, but even a 1% impact can be noticeable, especially with network protocols where backpressure and the like can create rather dramatic cascade effects on latency.)
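(It’s easy to get a feel for this on a particular machine with a toy micro-benchmark; something along these lines, dropped into a _test.go file and run with “go test -bench .”. Purely illustrative, and the compiler may well turn the inner branch into a conditional move, which would hide the effect.)

package branch_test

import (
	"math/rand"
	"testing"
)

const n = 1 << 16

var (
	predictable [n]byte // branch almost always goes the same way
	random      [n]byte // coin flip: near-worst case for the predictor
	sink        int
)

func init() {
	for i := range random {
		predictable[i] = 1
		random[i] = byte(rand.Intn(2))
	}
}

func countOnes(data *[n]byte) (s int) {
	for i := 0; i < n; i++ {
		if data[i] == 1 { // the branch whose predictability we're varying
			s++
		}
	}
	return
}

func BenchmarkPredictableBranch(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = countOnes(&predictable)
	}
}

func BenchmarkRandomBranch(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = countOnes(&random)
	}
}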
I’ve not attempted to measure this with Go, partly because I don’t know how the compiler lays out instructions, so it’s a bit of a black box. But the details of this can be important to some of my consumers from a performance perspective. I’m well aware of the dangers of premature optimization, but in this case I know exactly what my hot code paths are, and I really want to make sure that they are indeed treated preferentially. (And yes, I’ve spent some effort trying to minimize branches, as in the sketch below, but it just isn’t possible to remove them all, and as people want more “features”, that means adding more branches…)
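(To give a flavor of what I mean by minimizing branches, one of the simpler tricks is trading a chain of compares on the hot path for a table lookup. A contrived sketch; the packet type and handler names are made up.)

package main

type packet struct {
	typ     uint8
	payload []byte
}

// Branchy version: several data-dependent compares per packet.
func dispatchBranchy(p *packet) {
	if p.typ == 0 {
		handleData(p)
	} else if p.typ == 1 {
		handleAck(p)
	} else if p.typ == 2 {
		handleNak(p)
	} else {
		handleUnknown(p)
	}
}

// Table version: one indexed load plus one indirect call. (The indirect
// call still has to be predicted, but the compare chain is gone.)
var handlers [256]func(*packet)

func init() {
	for i := range handlers {
		handlers[i] = handleUnknown
	}
	handlers[0] = handleData
	handlers[1] = handleAck
	handlers[2] = handleNak
}

func dispatchTable(p *packet) {
	handlers[p.typ](p)
}

func handleData(*packet)    {}
func handleAck(*packet)     {}
func handleNak(*packet)     {}
func handleUnknown(*packet) {}

func main() {
	p := &packet{typ: 1}
	dispatchBranchy(p)
	dispatchTable(p)
}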
We’re not on low-end ARM cores, but we are doing high-throughput (messages per second), low-latency processing on modern CPUs with deep pipelines. Saving even a couple hundred nanoseconds can have real economic impact in applications like HFT. (Basically HFT is economic warfare, where the winner is the guy with the fastest program/stack. Being even just 100 nsec behind the winner still makes you the loser.)
- Garrett