branch prediction revisited


Garrett D'Amore

Mar 2, 2015, 2:28:11 PM
to golang-nuts
So, I’m revisiting a topic first asked about here: https://groups.google.com/forum/#!topic/golang-nuts/lOMvhVkGPXU

I have an application (mangos) where I would like to be able to tune to very low latencies (sub-usec in some cases).  Therefore pipeline stalls caused by mis-predicted branches can be fairly tragic.   I’m not normally a fan of hand-coding branch prediction hints (it feels too much like premature optimization to me), but in a few cases I can tell which path is likely and which is not, perhaps in ways that the compiler cannot.

Therefore, it would be nice to have one, or both of, the following:

a) ways to hint the compiler about likeliness (which path is hot)

b) a list of the rules that the compiler uses when deciding which path is likely (in the absence of other information)

This information would be useful not only for ordinary ‘if’ branches, but also for the other branching statements (switch, select, for).

Thanks!

- Garrett

Dave Cheney

Mar 2, 2015, 3:38:26 PM
to golan...@googlegroups.com, gar...@damore.org
I'm not aware of any hints that can be given for switch statements, and select is far too complex to make any branch predictions about. For ifs, I believe the conditional branch is predicted false. That is to say, Go style uses guard clauses to check for unlikely preconditions, so the conditional bodies are considered unlikely.
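
A minimal sketch of that guard-clause style (all names here are made up for illustration): the unlikely checks sit in the if bodies and the common case falls straight through.

package main

import (
    "errors"
    "fmt"
)

var (
    errEmpty     = errors.New("empty message")
    errBadHeader = errors.New("bad header")
)

const magicByte = 0x7f

// processMsg uses guard clauses: the unlikely error checks sit in the
// if bodies, and the common (hot) case falls straight through.
func processMsg(b []byte) (int, error) {
    if len(b) == 0 {
        return 0, errEmpty // unlikely
    }
    if b[0] != magicByte {
        return 0, errBadHeader // unlikely
    }
    // hot path: the common case continues here
    return len(b) - 1, nil
}

func main() {
    n, err := processMsg([]byte{magicByte, 1, 2, 3})
    fmt.Println(n, err) // 3 <nil>
}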

minux

Mar 2, 2015, 8:09:38 PM
to Garrett D'Amore, golang-nuts
On Mon, Mar 2, 2015 at 1:03 PM, Garrett D'Amore <gar...@damore.org> wrote:
So, I’m revisiting a topic first asked about here: https://groups.google.com/forum/#!topic/golang-nuts/lOMvhVkGPXU

I have an application (mangos) where I would like to be able to tune to very low latencies (sub-usec in some cases).  Therefore pipeline stalls caused by mis-predicted branches can be fairly tragic.
Are you sure your application is so optimized that even the tens of cycles of branch mispredict penalty will dominate the latency variation? Have you measured? How did you measure? What CPU are you using?

If you really want to optimize for branch prediction, optimize away indirect calls (e.g. method calls through interfaces), as those incur a larger overhead than mispredicting a static branch.

Have you optimized for cache misses yet? I expect that cache misses will normally play a larger part in a Go application's latency variation than branch mispredicts. Unless you're using low-end ARM cores, I really don't think branch prediction is something to worry about.

 I’m not normally a fan of hand-coding branch prediction hints (it feels too much like premature optimization to me), but in a few cases I can tell which path is likely and which is not, perhaps in ways that the compiler cannot.

Therefore, it would be nice to have one, or both of, the following:

a) ways to hint the compiler about likeliness (which path is hot)

b) a list of the rules that the compiler uses when deciding which path is likely (in the absence of other information)

This information would be useful not only for ordinary ‘if’ branches, but also for the other branching statements (switch, select, for).

The compiler will emit hints for the branches generated for nil checks, slice index bounds checks, etc., but not for any branch instructions generated by user conditional constructs. And these hints only work on 386/amd64.

Garrett D'Amore

Mar 2, 2015, 11:04:48 PM
to minux, golang-nuts
On Mar 2, 2015, at 5:08 PM, minux <mi...@golang.org> wrote:


On Mon, Mar 2, 2015 at 1:03 PM, Garrett D'Amore <gar...@damore.org> wrote:
So, I’m revisiting a topic first asked about here: https://groups.google.com/forum/#!topic/golang-nuts/lOMvhVkGPXU

I have an application (mangos) where I would like to be able to tune to very low latencies (sub-usec in some cases).  Therefore pipeline stalls caused by mis-predicted branches can be fairly tragic.
Are you sure your application is so optimized that even the tens of cycles of branch mispredict penalty will dominate the latency variation? Have you measured? How did you measure? What CPU are you using?

If you really want to optimize for branch prediction, optimize away indirect calls (e.g. method calls through interfaces), as those incur a larger overhead than mispredicting a static branch.

Sure, that makes sense.  How does this compare with storing a function pointer and calling through the function pointer? On the one hand I need to be fast, but on the other my application does demand a certain flexibility — e.g. I have different transport layers underneath me, and that is abstracted through interfaces.
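
For a rough feel of the relative costs, a benchmark sketch along these lines (hypothetical types and names; results vary a lot with inlining and CPU) compares a direct call, a call through a function value, and a call through an interface:

package calls

import "testing"

type adder interface{ add(int) int }

type impl struct{ n int }

func (im impl) add(x int) int { return im.n + x }

func addDirect(n, x int) int { return n + x }

var sink int // keeps results live so the compiler cannot discard the work

func BenchmarkDirectCall(b *testing.B) {
    s := 0
    for i := 0; i < b.N; i++ {
        s = addDirect(s, 1) // direct call, likely inlined
    }
    sink = s
}

func BenchmarkFuncValueCall(b *testing.B) {
    f := addDirect // call through a function value
    s := 0
    for i := 0; i < b.N; i++ {
        s = f(s, 1)
    }
    sink = s
}

func BenchmarkInterfaceCall(b *testing.B) {
    var a adder = impl{n: 1} // indirect call through an interface
    s := 0
    for i := 0; i < b.N; i++ {
        s = a.add(s)
    }
    sink = s
}

Run with something like "go test -bench=Call" and compare the ns/op.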


Have you optimized for cache misses yet? I expect that cache misses will normally play a larger part in a Go application's latency variation than branch mispredicts.

I’ve done that in some of the lower layers in our stack; in the case of my Go stack that’s kind of an application problem, *except* insofar as I’ve tried to optimize certain bits of the code.  I could probably still do some further enhancement by optimizing for per-processor resource pools, but admittedly I’m not too sure how Go maps goroutines to actual scheduled CPUs (I’d know how to do this in C, in the kernel).   It would be nice if the resource pools (sync.Pool) introduced in Go 1.3 were efficient enough to hide this particular little detail from me — sadly my experience was that they performed less well than my hand-coded equivalent using channels.  I’ve not researched why that is, and it may well be better in 1.4 — certainly it was a somewhat surprising, and disappointing, result.
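
For reference, the two pooling approaches being compared look roughly like this (a sketch; buffer size and names are made up):

package pools

import "sync"

const bufSize = 4096

// syncPool is the sync.Pool approach added in Go 1.3.
var syncPool = sync.Pool{
    New: func() interface{} { return make([]byte, bufSize) },
}

// chanPool is a hand-rolled pool built on a buffered channel: Get falls
// back to allocating when the pool is empty, Put drops the buffer when
// the pool is full and lets the GC reclaim it.
type chanPool chan []byte

func (p chanPool) Get() []byte {
    select {
    case b := <-p:
        return b
    default:
        return make([]byte, bufSize)
    }
}

func (p chanPool) Put(b []byte) {
    select {
    case p <- b:
    default: // pool full
    }
}

Usage has the same shape on both sides: b := syncPool.Get().([]byte) ... syncPool.Put(b), versus p := make(chanPool, 128); b := p.Get() ... p.Put(b).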

Unless you're using low-end ARM cores, I really don't think branch
prediction is something to worry about.

Processor stalls and their impact vary from CPU to CPU.  In my experience implementing network stacks, they can be significant — particularly when you are working with software that processes on the order of 1 Mpps (million packets per second) or more.  In that case, every single extra instruction costs, and branch misprediction can be tragic.  Admittedly, this is not the “usual” case, but for applications like high-frequency trading and core network stacks, the payoff from getting this right can be substantial.  (I do think application developers misuse branch prediction too often — it’s frequently a sign of premature optimization.  But in *some* cases there is value here.)

Imagine that you have 1 M packets per second, and each packet has a dozen or so branches that get executed as it flows along the code path.  Most of those branches have a 99.9% chance of going a certain way.  But if you get it wrong — 12 M processor pipeline stalls per second can be evil indeed.  For example, the Pentium 4 has a 20-stage pipeline.  If you have to flush that, it’s dropping 19 stages of the pipeline.  You would really not like to do that often.  (I’ve seen one estimate that just mispredicting 10% of the branches in branch-heavy code can slow a Pentium 4 by 20-40%.  I’m not sure I really believe it is that dramatic — but even a 1% impact can be noticeable.  Especially with network protocols, where backpressure and such can create rather dramatic cascade effects on latency.)

I’ve not attempted to measure this with Go — partly because I don’t know how the compiler organizes instructions, so it’s a bit of a black box.  But the details of this can be important to some of my consumers from a performance perspective.  I’m well aware of the dangers of premature optimization, but in this case I know exactly what my hot code paths are, and I really want to make sure that they are indeed treated preferentially.  (And yes, I’ve spent some effort trying to minimize branches, but it just isn’t possible to remove them all, and as people want more “features”, that means adding more branches…)

We’re not on low-end ARM cores.  But we are doing high-throughput (messages per second), low-latency processing on modern CPUs with deep pipelines.  Saving even a couple hundred nanos can actually have economic impacts in applications like HFT.  (Basically HFT is economic warfare, where the winner is the guy with the fastest program/stack.  Being even just 100 nsec behind the winner still makes you the loser.)

- Garrett

Dmitry Vyukov

Mar 3, 2015, 2:52:52 AM
to Garrett D'Amore, golang-nuts
Will it actually help your program?
Please manually fix branch ordering by swapping if/else blocks on the hot paths, and check how much that improves performance.
I would expect that there are still far more important performance issues, so branch prediction will have no visible effect.
What does the profile of your app look like?
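
Concretely, the experiment being suggested looks something like this (hypothetical helpers), comparing the two layouts in a benchmark:

package hotpath

type message struct{ ok bool }

func fastPathOK(m *message) bool { return m.ok }
func deliver(m *message)         {}
func slowPath(m *message)        {}

// handleA puts the hot work inside the if body.
func handleA(m *message) {
    if fastPathOK(m) {
        deliver(m) // hot
    } else {
        slowPath(m) // rare
    }
}

// handleB inverts the condition so the rare case is the early-out and
// the hot path falls straight through.
func handleB(m *message) {
    if !fastPathOK(m) {
        slowPath(m) // rare
        return
    }
    deliver(m) // hot
}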

Dmitry Vyukov

Mar 3, 2015, 2:56:00 AM
to Garrett D'Amore, minux, golang-nuts
Not that it is particularly important, but the Pentium 4 had a 30-stage pipeline. Modern processors have considerably shorter pipelines.

Klaus Post

Mar 3, 2015, 5:29:36 PM
to golan...@googlegroups.com, mi...@golang.org, gar...@damore.org
On Tuesday, March 3, 2015 at 5:04:48 AM UTC+1, Garrett D'Amore wrote:

Imagine that you have 1 M packets per second, and each packet has a dozen or so branches that get executed as it flows along the code path.  Most of those branches have a 99.9% chance of going a certain way.  But if you get it wrong — 12 M processor pipeline stalls per second can be evil indeed.  For example, the Pentium 4 has a 20-stage pipeline.  If you have to flush that, it’s dropping 19 stages of the pipeline.  You would really not like to do that often.  (I’ve seen one estimate that just mispredicting 10% of the branches in branch-heavy code can slow a Pentium 4 by 20-40%.  I’m not sure I really believe it is that dramatic — but even a 1% impact can be noticeable.  Especially with network protocols, where backpressure and such can create rather dramatic cascade effects on latency.)

You are right that branch mispredictions are costly. However, I think you are not taking the dynamic branch predictor into account. If your branches are predictable, they will be predicted correctly after a few visits, and they will become "invisible". If you are afraid that the time spent learning them is a factor, do a few warmup runs when your app starts. If a branch is truly 99.9% predictable you will not have any issues. I have not come across any code where the compiler optimizing for the first encounter of a branch makes any measurable difference, even on the long-pipelined P4 (which improved a lot with the Core architecture).

The biggest killer is branching on random data - it matters much more than any optimization you can do compiler-side, because you cannot fix that with code generation. Sure, you can use cmov* in some special cases, but cmov has its own downsides and is *only* faster if you have random branching.
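
One way to see this effect in a Go benchmark is the classic sorted-versus-random experiment (a sketch; exact numbers depend on compiler and CPU, and a clever compiler may remove the branch entirely):

package branchy

import (
    "math/rand"
    "sort"
    "testing"
)

const n = 1 << 16

// makeData returns n small random ints. Sorting them makes the branch in
// sumAbove almost perfectly predictable; leaving them shuffled makes it
// effectively random.
func makeData(sorted bool) []int {
    d := make([]int, n)
    for i := range d {
        d[i] = rand.Intn(256)
    }
    if sorted {
        sort.Ints(d)
    }
    return d
}

func sumAbove(d []int) (s int) {
    for _, v := range d {
        if v >= 128 { // the branch under test
            s += v
        }
    }
    return s
}

var sink int

func BenchmarkPredictableBranch(b *testing.B) {
    d := makeData(true)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sink = sumAbove(d)
    }
}

func BenchmarkRandomBranch(b *testing.B) {
    d := makeData(false)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sink = sumAbove(d)
    }
}

The gap between the two benchmarks is roughly the misprediction cost; the per-branch work is identical.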

I think you are focusing your effort in the wrong area - there are a million things that are more important than initial branch prediction, except in image processing and similar math-heavy operations, where you want to do vector assembly/intrinsics anyway. You don't describe what you are doing, so to outsiders it seems like you are focusing on something with very little gain.

/Klaus

Garrett D'Amore

Mar 4, 2015, 12:26:14 PM
to Klaus Post, golan...@googlegroups.com, mi...@golang.org
Your comments above were very helpful. I didn’t realize how much work modern CPUs do recording branch history and predicting branch direction.  As a consequence of this message, I wound up doing a little research, and was pleasantly surprised to learn that modern CPUs actually do dynamic prediction reasonably well.  (I was mistaken in the belief that prediction rules were static, as in “forward branches predict not-taken, backward branches predict taken”, which was a rule I’d seen in a few places.)

This article gives a nice summary: http://en.wikipedia.org/wiki/Branch_predictor

Modern CPU architecture has changed a lot since I went to school.  (Back then only static predictors were common.)

I’m going to basically punt on worrying about this for now, as a result.  I suppose it may be worth coming back to if we ever run this code on older CPUs, but for now the use of Go pretty much selects against that — I’m not sure Go is very useful on the very low-end ARM cores, but on those systems there probably isn’t that much pipelining going on in the first place, or at least the pipelines are probably relatively shallow. :-)

- Garrett

Ignacio Grande

Mar 5, 2015, 2:29:04 AM
to golan...@googlegroups.com, gar...@damore.org
If you are using Linux, you should be able to measure the branch mispredictions using the "perf" command:

$ perf stat ./knapsack
Solved:  4243395 5.921798452s

 Performance counter stats for './knapsack':

       5959,619054 task-clock (msec)         #    1,002 CPUs utilized
             1.732 context-switches          #    0,291 K/sec
                 7 cpu-migrations            #    0,001 K/sec
             1.382 page-faults               #    0,232 K/sec
    18.261.611.943 cycles                    #    3,064 GHz                     [83,10%]
     2.845.131.798 stalled-cycles-frontend   #   15,58% frontend cycles idle    [83,36%]
     1.216.743.984 stalled-cycles-backend    #    6,66% backend  cycles idle    [66,78%]
    70.393.748.885 instructions              #    3,85  insns per cycle
                                             #    0,04  stalled cycles per insn [83,41%]
    15.326.880.154 branches                  # 2571,789 M/sec                   [83,43%]
           430.822 branch-misses             #    0,00% of all branches         [83,33%]

       5,947633957 seconds time elapsed


There is plenty of "perf" information on the internet, for example, here: http://www.brendangregg.com/perf.html
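
The same works for a Go benchmark: build the test binary and run it under perf (the binary and benchmark names below are hypothetical), e.g.:

$ go test -c -o pkg.test
$ perf stat -e branches,branch-misses ./pkg.test -test.bench=BenchmarkHotPath -test.run=NONE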


Garrett D'Amore

Mar 9, 2015, 12:59:32 AM
to Dave Cheney, golan...@googlegroups.com
On Mar 2, 2015, at 12:38 PM, Dave Cheney <da...@cheney.net> wrote:

I'm not aware of any hints that can be given for switch statements, and select is far too complex to make any branch predictions about. For ifs, I believe the conditional branch is predicted false. That is to say, Go style uses guard clauses to check for unlikely preconditions, so the conditional bodies are considered unlikely.

That’s helpful.  Does this also apply when an else {} clause is present?  (I.e., the else branch is considered likely?)

It sure would be nice to be able to give hints about select as well.  I’m pretty sure I’ve seen that using select has a noticeable impact on performance, to the point that it is usually better to avoid it when you can architect your program to do so.  (Assuming cases where performance trumps readability — which admittedly is probably more often *not* the case.)
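
One way to put a number on that for a given workload is a benchmark sketch like this, comparing a plain channel receive against the same receive wrapped in a two-case select (this measures general select overhead rather than branch prediction as such):

package selbench

import "testing"

func BenchmarkPlainRecv(b *testing.B) {
    c := make(chan int, 1)
    for i := 0; i < b.N; i++ {
        c <- i
        <-c // plain receive
    }
}

func BenchmarkSelectRecv(b *testing.B) {
    c := make(chan int, 1)
    stop := make(chan struct{}) // never ready, so the select always takes <-c
    for i := 0; i < b.N; i++ {
        c <- i
        select {
        case <-c:
        case <-stop:
        }
    }
}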

- Garrett 

Dave Cheney

Mar 9, 2015, 1:02:41 AM
to Garrett D'Amore, golang-nuts
Select is much more complicated; it's not comparable to a difficult-to-predict branch, so branch prediction hints are not relevant.