On Intel x86 processors, is there branch prediction on fixed loops?


ymo

Jan 25, 2016, 1:40:06 PM
to mechanical-sympathy
Hi All

If I have a fixed loop over an array, say from 1 to 1024, is the processor going to do a branch prediction on each iteration? I would think that the processor *should* not bother to predict this, since the trip count is known ahead of time (at the very least by the compiler). The question applies to native as well as managed (Java) code, if there are any differences. This also assumes that the loop is not unrolled by the compiler.
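Concretely, something like this (a minimal C sketch; the array name and body are just placeholders):

/* sketch: a fixed-trip-count loop; 'data' stands in for some 1024-element array */
static int sum_fixed(const int *data) {
    int sum = 0;
    for (int i = 0; i < 1024; i++) {
        /* without unrolling, every iteration ends in a compare-and-branch
           that the CPU has to predict, even though the trip count is fixed */
        sum += data[i];
    }
    return sum;
}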

Regards

Todd Lipcon

Jan 25, 2016, 1:51:16 PM
to mechanica...@googlegroups.com
http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/2 has some high-level description of loop detection and branch prediction in Sandy Bridge. Maybe that would help get you started for more in-depth research?

-Todd


Vitaly Davidovich

Jan 25, 2016, 1:53:21 PM
to mechanical-sympathy
Yes, it will execute the loop exit condition check each time.  Initially, static prediction will be used; after a few iterations it'll use the history and predict correctly for all iterations except the last one (loop exit).

The above assumes no optimization done by the compiler, as you stipulated.  An optimizing compiler will (typically) unroll and reduce the number of checks by the unroll factor (as well as gaining other benefits, e.g. breaking dep chains and allowing more ILP).
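As a sketch of that transformation at the source level (the unroll factor of 4 and the names are illustrative; a compiler does this on the compiled form and picks its own factor):

static int sum_unrolled(const int *data, int n) {
    /* assumes n is a multiple of 4 to keep the sketch short */
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        /* four independent accumulators break the single add dependency
           chain, and the exit check now runs once per four elements */
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return s0 + s1 + s2 + s3;
}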


Avi Kivity

Jan 26, 2016, 8:01:14 AM
to mechanica...@googlegroups.com
The benefits of unrolling are quite small on modern OOOE processors.  The processor will happily run multiple iterations of the loop in parallel even if you don't unroll.  So the only benefit is reducing the number of times you update the loop variable and check it.

Dan Eloff

Jan 26, 2016, 8:13:19 AM
to mechanica...@googlegroups.com

And there's a big disadvantage since Ivy Bridge. The uop cache is a very small cache for decoded instructions. If your loop doesn't fit in the uop cache it can take a considerable performance penalty. For this reason Agner Fog now recommends not unrolling loops.

Martin Thompson

Jan 26, 2016, 8:24:13 AM
to mechanica...@googlegroups.com
The L0 (uop) cache has been around since Sandy Bridge; it can hold just over 1500 uops. It was introduced for power saving and not performance. Decoding instructions can be power hungry.

I find loop unrolling often does not pay off these days. Keeping it small and simple often works best in real world apps. Unrolling sometimes wins in micro benchmarks but seldom within a larger application.

Avi Kivity

Jan 26, 2016, 8:27:49 AM
to mechanica...@googlegroups.com
Good point.  There's also a loop buffer, if the loop is small enough it can run without touching the front-end at all.

Vitaly Davidovich

Jan 26, 2016, 8:30:22 AM
to mechanica...@googlegroups.com
They're small if compiler doesn't further optimize the loop but usually it opens up other optimizations.  I mentioned breaking dependency chains (e.g. instead of using one accumulator it can break up associative operations to use more independent ISA registers), but there are others; auto vectorization pretty much requires it.  Other compilers, such as Hotspot JVM, can unroll some loops fully and constant fold the entire loop when possible (e.g. looping over a constant array).  So it's more a compiler optimization gateway.
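Roughly the kind of loop that can disappear entirely, in C terms (whether a given compiler actually folds it depends on the compiler and flags):

static const int kTable[8] = {1, 2, 3, 4, 5, 6, 7, 8};

static int table_sum(void) {
    int sum = 0;
    /* trip count and data are both compile-time constants, so an optimizing
       compiler may fully unroll this and reduce the whole function to 'return 36;' */
    for (int i = 0; i < 8; i++) {
        sum += kTable[i];
    }
    return sum;
}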
Sent from my phone

Vitaly Davidovich

Jan 26, 2016, 8:43:02 AM
to mechanica...@googlegroups.com


On Tuesday, January 26, 2016, Martin Thompson <mjp...@gmail.com> wrote:
The L0 (uop) cache has been around since Sandy Bridge; it can hold just over 1500 uops. It was introduced for power saving and not performance. Decoding instructions can be power hungry.

It's both performance and power.  With CISC decode is not trivial and pipeline can stall on tight compute kernels due to decode not keeping up - think loops.  The frontend stalls are usually in the instruction fetch stage though due to i-cache misses but decode is still important for performance in some cases. 


I find loop unrolling often does not pay off these days. Keeping it small and simple often works best in real world apps. Unrolling sometimes wins in micro benchmarks but seldom within a larger application.

As mentioned, unrolling is an optimization gateway for loops like inlining is for calls. 
Sent from my phone

Martin Thompson

Jan 26, 2016, 10:08:09 AM
to mechanica...@googlegroups.com
On 26 January 2016 at 13:42, Vitaly Davidovich <vit...@gmail.com> wrote:

On Tuesday, January 26, 2016, Martin Thompson <mjp...@gmail.com> wrote:
The L0 (uop) cache has been around since Sandy Bridge; it can hold just over 1500 uops. It was introduced for power saving and not performance. Decoding instructions can be power hungry.

It's both performance and power.  With CISC decode is not trivial and pipeline can stall on tight compute kernels due to decode not keeping up - think loops.  The frontend stalls are usually in the instruction fetch stage though due to i-cache misses but decode is still important for performance in some cases. 

The performance side is more complicated, whereas the power saving is pretty clear. A uop cache hit after a branch mispredict improves things over Nehalem, whereas a uop cache miss after a misprediction costs a couple of extra cycles over Nehalem. One of the biggest wins is for AVX, as the bandwidth is increased for feeding all the execution units; the decoders could not do this on their own.

In theory (and we all know practice can be different), the decoding is pipelined, so for predictable code performance would not be hit if you did not have a uop cache; you do, however, save on the power for decoding.

One other thing to consider is the 28-uop queue after the decoders and uop cache that is great for tight loops. These loops are restricted to fewer than 8 taken branches and no RETs or CALLs, which encourages clean, simple code. Branch misprediction will end these small loops.
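For scale, a loop of roughly this shape compiles to a handful of uops with one taken branch and no calls, so it can be dispatched from that queue without re-fetching or re-decoding (exact uop counts depend on the compiler output; this is only an illustration):

static long sum_squares(const int *v, int n) {
    long total = 0;
    for (int i = 0; i < n; i++) {
        /* load, multiply, add, plus the increment/compare/branch:
           well under 28 uops and only one taken branch per iteration */
        total += (long)v[i] * v[i];
    }
    return total;
}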
 
I find loop unrolling often does not pay off these days. Keeping it small and simple often works best in real world apps. Unrolling sometimes wins in micro benchmarks but seldom within a larger application.

As mentioned, unrolling is an optimization gateway for loops like inlining is for calls. 

Agreed. A sufficiently smart compiler would be architecture aware, if such a beast exists. The main point is the developers should be very careful of unrolling their own loops. Much better to focus on reducing data dependencies for increased ILP.

Vitaly Davidovich

Jan 26, 2016, 10:13:50 AM
to mechanica...@googlegroups.com


On Tuesday, January 26, 2016, Dan Eloff <dan....@gmail.com> wrote:

And there's a big disadvantage since Ivy Bridge. The uop cache is a very small cache for decoded instructions. If your loop doesn't fit in the uop cache it can take a considerable performance penalty. For this reason Agner Fog now recommends not unrolling loops.

I think his recommendation is a bit more nuanced in that he recommends avoiding *unnecessary* unrolling.  His asm optimization guide itemizes pros and cons of loop unrolling quite well.

The decision to unroll (or not) is very similar to call inlining (or not); the set of conditions to make this determination is getting more complex over time and is a moving target.
Sent from my phone

Matt Godbolt

Jan 26, 2016, 10:15:28 AM
to mechanica...@googlegroups.com

On Mon, Jan 25, 2016 at 12:53 PM Vitaly Davidovich <vit...@gmail.com> wrote:
Yes, it will execute the loop exit condition check each time.  Initially, static prediction will be used...

Forgive my pedantry, but as best as I understand it, from Core2 onwards there's no static prediction at all. Instead, the hashed branch history and PC index into the branch prediction table, and whatever that predicts is what's used.

ymo

Jan 26, 2016, 10:24:04 AM
to mechanical-sympathy
I am just baffled that in 2016 a CISC processor like Intel's cannot recognize a fixed loop (yet). I am just wondering what the reason is for the processor not having an instruction for loops (fixed or even variable)? Is it a case where they found that just relying on branch prediction, rather than a specific loop instruction, was better for performance?

Matt Godbolt

Jan 26, 2016, 10:33:42 AM
to mechanical-sympathy
Pre-Sandy Bridge, the processor did have a dedicated loop predictor (which recognised loops of <64 iterations). Post that, the general branch prediction system is good enough to not need one. (Or the meta-predictor needed to work out whether a branch has loop behaviour or not is too complex/power hungry).

All this information (and a lot more) is available in the very readable microarchitecture PDF by Agner Fog: http://www.agner.org/optimize/microarchitecture.pdf 


Vitaly Davidovich

Jan 26, 2016, 10:36:29 AM
to mechanica...@googlegroups.com
Isn't this the other way around? Core 2 does what you describe, whereas newer ones have static prediction when there's no history? I may be misremembering, but I recall Intel guides talking about static prediction (forward jmps not taken, backwards taken) when there's no history.

But the technicality aside, my main point was it doesn't have dynamic history to use yet. 

Matt Godbolt

Jan 26, 2016, 10:39:17 AM
to mechanica...@googlegroups.com
I'm going from experience (of initial mispredicts) and from Agner page 36, "Static prediction in PM and Core2": "These processors do not use static prediction..."

It seems that, given how the BTB is organized, there's no actual storage of the PC that the branch occurred on; this saves on comparators etc. for "was this hashed history/branch address actually for this specific branch?", but means one loses the ability to discriminate between hash collisions.

Vitaly Davidovich

Jan 26, 2016, 10:42:54 AM
to mechanica...@googlegroups.com
Right but where does he (or something else) say newer than Core2 uses the same technique?

Matt Godbolt

Jan 26, 2016, 10:52:38 AM
to mechanica...@googlegroups.com
Fair point; he doesn't mention static prediction specifically for later processors. I'm still fairly sure there's no static prediction, but I'll try and get some concrete evidence to share before being so vocal again! :)

Vitaly Davidovich

Jan 26, 2016, 10:58:13 AM
to mechanical-sympathy
Just to be clear, I'm not disputing this fact -- I just hadn't heard that and would be interested in getting more info.

Vitaly Davidovich

Jan 26, 2016, 11:19:10 AM
to mechanical-sympathy
So Intel's opto guide (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf), section 3.4.1.3 talks about static prediction.  It says that branches without BTB history are predicted using static prediction: unconditional branches are taken, indirect branches are not taken.  It then recommends code generation to take advantage of this (when likeliness is known/hinted).  That section closes with:

The Intel Core microarchitecture does not use static prediction heuristic.  However, to maintain consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default.

Perhaps it's outdated/inaccurate though, although the revision date is Sept 2015  ... 
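To make the code-generation recommendation from that section concrete, the layout looks something like this in C (using GCC/Clang's __builtin_expect; the function and values here are made up for illustration):

static int process(int x) {
    /* the rare path sits behind a forward, normally-not-taken branch, which
       matches the classic static heuristic (forward not taken, backward taken);
       __builtin_expect tells the compiler which way to lay out the code */
    if (__builtin_expect(x < 0, 0)) {
        return -1;        /* rare error path, placed out of the fall-through */
    }
    return x * 2;         /* common path falls straight through */
}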


Matt Godbolt

Jan 26, 2016, 11:50:29 AM
to mechanica...@googlegroups.com
I dug a little and wrote a little asm test of 1000 back-to-back branches, each iteration looped over 100 times. Run under Agner's framework, a bunch of times (back to back), on a Haswell:

Taken branch to following instruction.
Processor 0
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    235824     238663     100305     100099       1006 
    217640     217637     100305     100099          2 
    217624     217624     100305     100099          1 
    217620     217623     100305     100099          1 
    217628     217627     100305     100099          1 
    217620     217621     100305     100099          1 
    217624     217623     100305     100099          1 
    217620     217621     100305     100099          1 

Non-taken branch to following instruction:
Processor 0
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    128448     128518     100305         99       1488 
    127264     119101     100306         99        958 
    101136     101135     100305         99         55 
    100516     100515     100305         99         22 
    100424     100425     100305         99         17 
    100404     100406     100305         99         16 
    100368     100368     100305         99         14 
    100368     100368     100305         99         14 

Non-taken branch to a prior instruction:
Processor 0
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    103272     103270     100305         99        152 
    123092     123095     100305         99         39 
    100184     100185     100305         99         22 
    113700     113718     100305         99          8 
    109020     109061     100305         99         23 
    100280     100281     100305         99         15 
    100144     100146     100305         99          3 
    100124     100125     100305         99          2 

Thoughts:
  • Taken branch to following instruction consistently mispredicts ~1000 times initially. This seems to suggest there *is* static prediction of "not taken" for forward jumps, which is as expected.
  • Non-taken branch to following instruction: I can't really explain these results. They seem much more mispredicted: one would imagine no mispredictions. My best guess is that as they are untaken, they don't get to "vote" in any BTB and so the 99 taken branches (the outer loop of the test) dominate...but this is sketchy...
  • Non-taken to a prior; if there's a static predictor it ought to get this wrong every time - backwards branches should be statically predicted taken, I'd think...
I welcome interpretation (and I'm happy to share more info)...

thanks all!

--matt

Vitaly Davidovich

Jan 26, 2016, 11:57:47 AM
to mechanica...@googlegroups.com


On Tuesday, January 26, 2016, Martin Thompson <mjp...@gmail.com> wrote:
On 26 January 2016 at 13:42, Vitaly Davidovich <vit...@gmail.com> wrote:

On Tuesday, January 26, 2016, Martin Thompson <mjp...@gmail.com> wrote:
The L0 (uop) cache has been around since Sandy Bridge; it can hold just over 1500 uops. It was introduced for power saving and not performance. Decoding instructions can be power hungry.

It's both performance and power.  With CISC decode is not trivial and pipeline can stall on tight compute kernels due to decode not keeping up - think loops.  The frontend stalls are usually in the instruction fetch stage though due to i-cache misses but decode is still important for performance in some cases. 

The performance side is more complicated, whereas the power saving is pretty clear. A uop cache hit after a branch mispredict improves things over Nehalem, whereas a uop cache miss after a misprediction costs a couple of extra cycles over Nehalem. One of the biggest wins is for AVX, as the bandwidth is increased for feeding all the execution units; the decoders could not do this on their own.

I actually think the other way: power savings are questionable but speed improvement for small loops can be big.  If the loop stays entirely in uop cache and has a small kernel with cheap instructions, it will not get stalled by decode since it doesn't even touch the decoders.  As you say, this becomes an even bigger issue when using wider and wider vector registers.  64bit instructions are known for having various prefixes, allowing 64bit immediates, and so on that make the instructions encode larger - skipping that step is a big boost.

In theory (and we all know practice can be different), the decoding is pipelined, so for predictable code performance would not be hit if you did not have a uop cache; you do, however, save on the power for decoding.

One other thing to consider is the 28-uop queue after the decoders and uop cache that is great for tight loops. These loops are restricted to fewer than 8 taken branches and no RETs or CALLs, which encourages clean, simple code. Branch misprediction will end these small loops.
 
I find loop unrolling often does not pay off these days. Keeping it small and simple often works best in real world apps. Unrolling sometimes wins in micro benchmarks but seldom within a larger application.

As mentioned, unrolling is an optimization gateway for loops like inlining is for calls. 

Agreed. A sufficiently smart compiler would be architecture aware, if such a beast exists. The main point is the developers should be very careful of unrolling their own loops. Much better to focus on reducing data dependencies for increased ILP.
Compilers do try to model the processors but it's a hard problem, particularly for JITs (even tiered ones).

I agree that manual unrolling should not be done without reason, just like manual inlining.  There are compiler inadequacies that sometimes necessitate this though, again just like inlining (and for inlining, manual *outlining* is sometimes necessary too - current Hotspot is notorious for this).

Avi Kivity

Jan 26, 2016, 11:58:40 AM
to mechanica...@googlegroups.com


On 01/26/2016 06:19 PM, Vitaly Davidovich wrote:
So Intel's opto guide (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf), section 3.4.1.3 talks about static prediction.  It says that branches without BTB history are predicted using static prediction: unconditional branches are taken, indirect branches are not taken.  It then recommends code generation to take advantage of this (when likeliness is known/hinted).  That section closes with:

The Intel Core microarchitecture does not use static prediction heuristic.  However, to maintain consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default.

Perhaps it's outdated/inaccurate though, although the revision date is Sept 2015  ...

I believe "static prediction heuristic" here refers to the 2e/3e prefix bytes you can insert in front of a Jcc instruction to statically hint whether it is taken or not taken; the recommendation says that while current processors ignore the hint, past and possibly future may not.

Vitaly Davidovich

Jan 26, 2016, 12:14:02 PM
to mechanical-sympathy
That section doesn't talk about hint prefixes -- what makes you think that's what it's referring to?

AFAIK, no compiler actually uses those prefixes precisely because they don't do anything; encoding them just enlarges code size for no benefit.  Once dynamic history is available, those prefixes would be entirely useless unless the CPU skips dynamic history lookup in that case entirely, but that would defeat the purpose of having history (which is very effective in practice).

Martin Thompson

Jan 26, 2016, 12:30:18 PM
to mechanica...@googlegroups.com
On 26 January 2016 at 16:57, Vitaly Davidovich <vit...@gmail.com> wrote:

The L0 (uop) cache has been around since Sandy Bridge; it can hold just over 1500 uops. It was introduced for power saving and not performance. Decoding instructions can be power hungry.

It's both performance and power.  With CISC decode is not trivial and pipeline can stall on tight compute kernels due to decode not keeping up - think loops.  The frontend stalls are usually in the instruction fetch stage though due to i-cache misses but decode is still important for performance in some cases. 

The performance side is more complicated, whereas the power saving is pretty clear. A uop cache hit after a branch mispredict improves things over Nehalem, whereas a uop cache miss after a misprediction costs a couple of extra cycles over Nehalem. One of the biggest wins is for AVX, as the bandwidth is increased for feeding all the execution units; the decoders could not do this on their own.

I actually think the other way: power savings are questionable but speed improvement for small loops can be big.  If the loop stays entirely in uop cache and has a small kernel with cheap instructions, it will not get stalled by decode since it doesn't even touch the decoders. 

I'm curious to know why you seem to hold this view. "Power savings are questionable", really?

Vitaly Davidovich

Jan 26, 2016, 12:36:48 PM
to mechanical-sympathy
I'm curious to know why you seem to hold this view. "Power savings are questionable", really?

I thought I somewhat touched upon that in the same response.  What I'm saying is that it doesn't seem like power savings are the main driver here.  While it's true that if a loop is executing fully out of uop cache and not touching the decoders, they can presumably power down if there's nothing else to decode behind the loop instructions, I don't know how often this is the case, especially since good uop cache use requires a bunch more things to align correctly.  But when you *do* encounter such loops, you want them to not stall due to repeated decode, even if it's pipelined, particularly if they're feeding wide registers.

So maybe the right way to phrase my sentiment is that I think power savings are secondary to perf boost for loops.  Happy to have someone demonstrate/explain otherwise ...

Matt Godbolt

Jan 26, 2016, 12:38:29 PM
to mechanica...@googlegroups.com
As a follow-up, if I interpose a bunch of NOPs between the branches, things look more sensible. I suspect there's a limit on the number of branch mispredicts that can be handled at a time; and/or the BTB doesn't handle many many branches in the same region. Or the µop cache is muddying the waters somehow (the 1000 branches plus loop overhead may just fit in it all).

With 64 single-byte NOPs between each of the 1000 branches:

Taken to following:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
   1779900    1758898    6500320     100099       1142 
   1714820    1714826    6500305     100099          1 
   1736848    1727919    6500306     100099         13 
   1714864    1714867    6500305     100099          2 
   1714848    1714847    6500305     100099          1 
   1714844    1714846    6500305     100099          1 
   1714848    1714846    6500305     100099          1 
   1714848    1714846    6500305     100099          1 

Non-taken to following:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
   1688968    1667894    6500321         99         38 
   1654784    1654782    6500305         99          9 
   1652408    1652409    6500305         99          1 
   1658500    1653179    6500306         99          9 
   1652476    1652477    6500305         99          1 
   1652492    1652493    6500305         99          2 
   1652348    1652348    6500305         99          1 
   1652348    1652348    6500305         99          1 

Non-taken to behind:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
   1705324    1683914    6500321         99        271 
   1653132    1653129    6500305         99          4 
   1653116    1653114    6500305         99          2 
   1653084    1653084    6500305         99          2 
   1653084    1653085    6500305         99          2 
   1653084    1653083    6500305         99          2 
   1653128    1653126    6500305         99          2 
   1671468    1654311    6500306         99         31 

Much clearer results.

My take: there really is a static guess and it "does the right thing" after all.

Sorry for the confusion earlier! Glad we could prove something out!

--matt

Vitaly Davidovich

Jan 26, 2016, 12:42:51 PM
to mechanical-sympathy
Matt, can you elaborate a bit about the test setup? AFAIUI, you have a loop with trip count = 100 and body of the loop has 1000 jumps.  You then run this setup 8 separate times.  Is that right? Are the 8 separate runs launching the driver each time or is there another outer loop driving this? You're pinning this driver to same core, yes?


Vitaly Davidovich

Jan 26, 2016, 12:49:13 PM
to mechanical-sympathy
Interesting.  Yes, back-to-back jumps like this does sound stressful on the CPU :).

Agner notes the following for Haswell:

The high throughput for taken branches of one per clock was observed for up to 128 branches with no more than one branch per 16 bytes of code. If there is more than one branch per 16 bytes of code then the throughput is reduced to one jump per two clock cycles. If there are more than 128 branches in the critical part of the code, and if they are spaced by at least 16 bytes, then apparently the first 128 branches have the high throughput and the remaining have the low throughput. These observations may indicate that there are two branch prediction methods: a fast method tied to the µop cache and the instruction cache, and a slower method using a branch target buffer.

Now, he's talking about throughput here but it's possible that spacing the jumps has prediction implications as well.  I'm still surprised that nops, which aren't truly processed by the core in the strict sense, have such a drastic effect.  It does change the PC of the branches, so maybe there's some BTB contention with so many branches packed.

Do fewer than 64 single-byte nops also improve things or is 64 the minimum you found?
 

Avi Kivity

Jan 26, 2016, 12:54:09 PM
to mechanica...@googlegroups.com
It was just a guess.  Looking at the context, it indeed does not refer to the prefixes.

Martin Thompson

Jan 26, 2016, 1:01:45 PM
to mechanica...@googlegroups.com
On 26 January 2016 at 17:49, Vitaly Davidovich <vit...@gmail.com> wrote:
Interesting.  Yes, back-to-back jumps like this does sound stressful on the CPU :).

Agner notes the following for Haswell:

The high throughput for taken branches of one per clock was observed for up to 128 branches with no more than one branch per 16 bytes of code. If there is more than one branch per 16 bytes of code then the throughput is reduced to one jump per two clock cycles. If there are more than 128 branches in the critical part of the code, and if they are spaced by at least 16 bytes, then apparently the first 128 branches have the high throughput and the remaining have the low throughput. These observations may indicate that there are two branch prediction methods: a fast method tied to the µop cache and the instruction cache, and a slower method using a branch target buffer.

Now, he's talking about throughput here but it's possible that spacing the jumps has prediction implications as well.  I'm still surprised that nops, which aren't truly processed by the core in the strict sense, have such a drastic effect.  It does change the PC of the branches, so maybe there's some BTB contention with so many branches packed.

The Intel manual has some interesting comments on how branches can be stored in the micro op cache.


See section 2.3.2.2. If the list of conditions is not met then it flips back to using the legacy decode pipeline until the next branch.

Matt Godbolt

Jan 26, 2016, 1:08:46 PM
to mechanica...@googlegroups.com
wrt the test setup, it's basically Agner's tool (from his site) which does do CPU pinning, priority setting and all the good stuff like serialising and whatnot.

The bits I changed were the counters to check, and then the inner loop, which I changed to:

mov ebp, 100
align 16
LL:

cmp ebp, 101
%REP 1000        ; 1000 copies of the branch under test

;jnz $+2 ; nz = taken
;jz $+2 ; z = not-taken
;jnz LL ; nz = taken
jz LL ; z = not-taken
%REP 64
nop
%ENDREP

%ENDREP

dec ebp
jnz LL

The inner REP count can change etc., and I comment in one of the jumps for each experiment.

wrt the 64 single-byte NOPs: here are some stats for 'non-taken to behind' for differing NOP gaps, just the first 3 runs:

1 NOP:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    106904     106907     200305         99        248 
    100492     100495     200305         99         26 
    100032     100036     200305         99          2 

4 NOPs:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    143184     137630     500306         99        521 
    125580     125579     500305         99         10 
    126904     126904     500305         99         93 

8 NOPs
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    251100     246074     900307         99        280 
    237984     237984     900305         99         22 
    238844     238846     900305         99         73 

16:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    462396     456374    1700309         99         73 
    450144     450145    1700305         99          2 
    450140     450142    1700305         99          2 

32:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    901200     882469    3300313         99        138 
    851724     851721    3300305         99          1 
    851720     851721    3300305         99          1 


NB I did notice a lot of variation running this over and over, so I suspect there's a system dependence still on whatever the CPU was last running.  Ideally I'd generate a routine that has a tonne of taken and non-taken branches - enough to flood any branch prediction system - and run that first.

Maybe I'll look at doing that some other time!

--matt :)

Matt Godbolt

Jan 26, 2016, 11:07:43 PM
to mechanica...@googlegroups.com
Hi all,

Hope nobody minds, but I forked this conversation to report my latest findings.

I added a straight-line list of 65536 sequences of "jump taken ahead; 16 nops; jump not taken ahead; 16 nops" during the initialisation code. This seems to have had the desired effect: the branch prediction numbers are now stable across runs.

Interestingly, fewer than 65536 sequences, or fewer NOPs between them, causes even more instability, so it seems like there's some spooky action at a distance involved in the prediction. To me this suggests the static predictor can be "wrong" about some new branches, misidentifying them as previously seen in some cases.  However, with sufficient "coldness" of the branches, the results are as follows:

Taken branch ahead (should be mispredicted if statically predicted) ; zero NOPs in between:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    234880     234881     100303     100099        867 
    217072     217076     100303     100099          1 
    216880     216883     100303     100099          1 
32 NOPs in between:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    891976     945828    3300311     100099        895 
    919650     921657    3300304     100099         13 
    909670     921000    3300303     100099          6 

Not-taken branch ahead (should be predicted OK, if static); zero NOPs:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    121788     121790     100303         99       1158 
    102012     102014     100303         99        102 
    103016     103021     100303         99        155 
32 NOPs in between:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    899120     877910    3300311         99        523 
    851524     851529    3300303         99          3 
    851512     851519    3300303         99          2 

Not-taken branch backward (should be mispredicted); zero NOPs:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    115092     115093     100303         99        810 
    100200     100202     100303         99          5 
    100132     100135     100303         99          2 
32 NOPs in between:
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    879340     873077    3300311         99        505 
    851740     851742    3300303         99          2 
    851720     851726    3300303         99          1 

I'm rather perplexed with these results: it seems that - assuming my code to try and wipe out the branch prediction history works - static prediction is pretty much getting it wrong half the time...which is to say now I'm starting to suspect there is no static prediction after all! With no static prediction one would expect ~500 mispredicts...and that's what we see in these cases (once there's sufficient NOP spacing to let the magic happen).

This has all been on a Haswell; I'll try and find some older machines known to have static prediction to see if the same test gives different results.

Cheers, Matt

Vitaly Davidovich

Jan 27, 2016, 7:47:26 AM
to mechanica...@googlegroups.com


On Tuesday, January 26, 2016, Matt Godbolt <ma...@godbolt.org> wrote:
Hi all,

Hope nobody minds, but I forked this conversation to report my latest findings.

I added a straight-line list of 65536 sequences of "jump taken ahead; 16 nops; jump not taken ahead; 16 nops" during the initialisation code. This seems to have had the desired effect: the branch prediction numbers are now stable across runs.

Interestingly, fewer than 65536 sequences, or fewer NOPs between them, causes even more instability, so it seems like there's some spooky action at a distance involved in the prediction. To me this suggests the static predictor can be "wrong" about some new branches, misidentifying them as previously seen in some cases.  However, with sufficient "coldness" of the branches, the results are as follows:

I wonder if this suggests that it's actually not using static prediction but rather some branch history trace.  In other words, a BPU can incorporate branch history from the path leading to this never-before-seen branch.  Maybe the nop padding makes that go away as the trace/path is too long.
If a uop inside the uop cache has branch history associated with it, that could muddy the waters too? Some of your instructions in the test are the same, right? 32 bytes of padding may be placing the uops on a different way inside the cache, improving hit rate or decreasing contention.  Just hand waving here ...

I'd try this experiment on Ivy or Sandy Bridge (Haswell apparently had significant undisclosed BPU changes) and also something pre Sandy Bridge (which won't have the uop cache), like Nehalem or Westmere.

This could also be some microarchitectural issue with Haswell BPU - would someone at Intel run a branch heavy performance test with this many branches? :)

Finally, maybe running this test under Intel Amplifier or Linux perf with more detailed BPU events can shed some light (I'd need to review the list of events supported so not sure if there's enough granularity there).

I think this is the most attention given to static prediction that I recall seeing :).


 

Cheers, Matt


Martin Thompson

Jan 27, 2016, 8:04:07 AM
to mechanical-sympathy
On Tuesday, 26 January 2016 17:36:48 UTC, Vitaly Davidovich wrote:
I'm curious to know why you seem to hold this view. "Power savings are questionable", really?

I thought I somewhat touched upon that in the same response.  What I'm saying is that it doesn't seem like power savings are the main driver here.  While it's true that if a loop is executing fully out of uop cache and not touching the decoders, they can presumably power down if there's nothing else to decode behind the loop instructions, I don't know how often this is the case, especially since good uop cache use requires a bunch more things to align correctly.  But when you *do* encounter such loops, you want them to not stall due to repeated decode, even if it's pipelined, particularly if they're feeding wide registers.

So maybe the right way to phrase my sentiment is that I think power savings are secondary to perf boost for loops.  Happy to have someone demonstrate/explain otherwise ...

The power saving benefits of the micro op instruction cache and the loop cache are detailed in this paper.


Performance and efficiency are often closely related. Intel have stated that their goals for power savings are more significant than performance. I've heard quotes like Tock to Tock having the goal of 50% reduction in power consumption for same compute power and being happy with a 5-10% IPC performance increase per core. The original goals for the micro op cache can be found in this paper.


The nice part is we get performance improvements and efficiency gains from the same feature. If you have links that show this feature was performance led then I'd love to read those. Probably work on branch prediction?



Vitaly Davidovich

Jan 27, 2016, 8:54:51 AM
to mechanical-sympathy
Thanks for the links.

This was just my personal view based on what I've read about the uop cache.  In your first link, the kernels they're testing for power are straight loops with hand tuning to make them run better without touching the frontend.  I have no doubts that such workloads will benefit from powering down the frontend as it's unused for the duration of the execution (and that paper provides evidence to that).  However, I don't know how much power savings this feature alone provides for bigger applications rather than microbenchmarks.  C-states and being able to reduce voltage/power down the core when idle seems like a more generally applicable feature given modern cores wait (idle) very frequently for tens of hundreds of microseconds and more, in common/typical workloads.  

If you look at the requirements to make full use of the uop cache, the loop is so simple that it'll be running at the speed of the core, possibly with multiple instructions issued per cycle.  How long is the frontend going to be powered down in your typical application's loops? Probably not very long, unless you're running a very specific type of workload (which by the way exist, so I'm not downplaying the notion just questioning its applicability scope).

Just my $.02

Martin Thompson

Jan 27, 2016, 9:28:55 AM
to mechanica...@googlegroups.com
There are multiple decoders, 4-6 depending on microarchitecture. Intel keep working on shutting down parts of a core when not in use. So it would not be shutting down the full front end, though that is possible.

You are right that C-States and other features can give more benefits. Speed Shift that comes with Skylake will be more useful than OS controlled C-States for general purpose use.


I just wanted to make the point that power saving is driving more design decisions than performance in Intel CPUs these days. The argument you made for performance I could see no evidence of other than as a side effect :-) 

Vitaly Davidovich

Jan 27, 2016, 9:50:52 AM
to mechanical-sympathy
I just wanted to make the point that power saving is driving more design decisions than performance in Intel CPUs these days.

Agreed, it's a big consideration on the whole.
 
The argument you made for performance I could see no evidence of other than as a side effect :-) 

Fair enough.  My opinion on *this particular* feature's power impact is based on my own inference.  As for "side effect", Intel's own arch manual briefly mentions reduced power from this feature, and then talks a lot more about throughput/latency improvements -- doesn't really sound like a side effect :).

Martin Thompson

Jan 27, 2016, 9:57:59 AM
to mechanica...@googlegroups.com
On 27 January 2016 at 14:50, Vitaly Davidovich <vit...@gmail.com> wrote:
The argument you made for performance I could see no evidence of other than as a side effect :-) 

Fair enough.  My opinion on *this particular* feature's power impact is based on my own inference.  As for "side effect", Intel's own arch manual briefly mentions reduced power from this feature, and then talks a lot more about throughput/latency improvements -- doesn't really sound like a side effect :).

Yes. I've seen many side effects end up as features after delivery. The architecture guide is written after delivery. The paper written in 2001 by Intel to describe the feature had the design goal of power saving. They thought back then it would be performance neutral :-)  Let's quote them directly, "the eliminated work may save about 10% of the full-chip power consumption with no performance degradation".

I'm only having fun teasing :-)

Vitaly Davidovich

Jan 27, 2016, 10:07:57 AM
to mechanical-sympathy
Yes. I've seen many side effects end up as features after delivery. The architecture guide is written after delivery. The paper written in 2001 by Intel to describe the feature had the design goal of power saving. They thought back then it would be performance neutral :-)  Let's quote them directly, "the eliminated work may save about 10% of the full-chip power consumption with no performance degradation".

To be fair, in 2001 we didn't have AVX, AVX2 or AVX-512, nor were the OoO engines as crazy as today either :)

Power has only relatively recently (it seems to me -- last 6-8 years?) become a hot topic with the rise of clouds, "data center compute", "big data", and mobile.  Then there are other CPU vendors (with weaker single core perf than Intel) who jumped on this, and reformulated performance as XXX/watt :).


Martin Thompson

Jan 27, 2016, 10:22:06 AM
to mechanica...@googlegroups.com
On 27 January 2016 at 15:07, Vitaly Davidovich <vit...@gmail.com> wrote:

Power has only relatively recently (it seems to me -- last 6-8 years?) become a hot topic with the rise of clouds, "data center compute", "big data", and mobile.  Then there are other CPU vendors (with weaker single core perf than Intel) who jumped on this, and reformulated performance as XXX/watt :).

My understanding is the concern for power came with the design of "Core" in Haifa. Based on this, Apple did the deal in 2005 that would bring Intel processors to their line-up in 2006. The work clearly started before that, when Netburst was failing and Intel needed a solution for notebooks. Coppermine (1999) was the base design that evolved into Core. That team was behind Sandy Bridge; note where the 2001 paper on the micro op cache came from. Hardware has pretty long cycles.

Matt Godbolt

Jan 27, 2016, 11:50:08 AM
to mechanica...@googlegroups.com
On Wed, Jan 27, 2016 at 6:47 AM Vitaly Davidovich <vit...@gmail.com> wrote:


On Tuesday, January 26, 2016, Matt Godbolt <ma...@godbolt.org> wrote:
....
This has all been on a Haswell; I'll try and find some older machines known to have static prediction to see if the same test gives different results.

If a uop inside the uop cache has branch history associated with it, that could muddy the waters too? Some of your instructions in the test are the same, right? 32 bytes of padding may be placing the uops on a different way inside the cache, improving hit rate or decreasing contention.  Just hand waving here ...

Seems pretty possible!
 

I'd try this experiment on Ivy or Sandy Bridge (Haswell apparently had significant undisclosed BPU changes) and also something pre Sandy Bridge (which won't have the uop cache), like Nehalem or Westmere.

Right. I just trivially tried this on a tame Westmere box but Agner's code isn't working at all for that. I spent all morning debugging it and have a fix, but the code is so fragile I think it worth busting out a whole new copy with fixes etc. I'll hack on this over the next few days and put it on github. Yay GPL :)
 

This could also be some microarchitectural issue with Haswell BPU - would someone at Intel run a branch heavy performance test with this many branches? :) 

Finally, maybe running this test under Intel Amplifier or Linux perf with more detailed BPU events can shed some light (I'd need to review the list of events supported so not sure if there's enough granularity there).

Agreed. In fact, the list of events (section 3B of intel's manuals) is giant and comprehensive and has all manner of things... Agner interprets 0xe6:0x00 as both "BTB Miss" and "number of static branch predictions". This doesn't seem to fit perfectly with Intel's docs: "Counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end."

I'll have to dig more (and also get stats on what this counter is reading for my cases)..
 

I think this is the most attention given to static prediction that I recall seeing :).

Cool: I've always been a little suspect of my interpretation. It's good to be in a position to be able to experiment a little and maybe even work out what's going on!

Thanks all, Matt 

ymo

Jan 27, 2016, 1:25:49 PM
to mechanical-sympathy
Just curious about your testing: how do you get the branch misprediction counts, from perf or maybe Intel PCM (performance-counter-monitor) code? I know that most people run perf in collection mode, but I was wondering if it is worth interfacing between perf and the code that you are testing. Meaning that you write a simple loop, but before you start your loop you tell perf to start the collection now. Is the extra level of control worth it for these micro benchmarks?

I was hoping to write something that auto-generates these test cases as a weekend project... one day :)

Matt Godbolt

Jan 27, 2016, 2:17:06 PM
to mechanical-sympathy
Sorry if I've confused, or am confused: I'm not running perf or intel PCM here; I'm manipulating the perf counters directly.  Specifically for the Haswell tests I'm using:

0xc4:0x20 - branches taken (aka "Number of near taken branches retired." p304 of Intel manual 3B)
0xc5:0x00 - branches mispredicted (aka "Mispredicted branch instructions at retirement" same page).

Having these be very accurately measured does help; the branch counter was pretty much exact in most cases and could be tallied up with the actual instruction stream (e.g. the 100099 branches in the taken cases and the 99 branches in the not-taken cases are exact.)

The reason not to use perf is to make it as near to cycle-perfect as possible. FWIW Agner's code is here: http://www.agner.org/optimize/testp.zip if that helps frame the discussion. I'm definitely going to put something together using his as a baseline and put it on github over the next few days.
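For anyone who would rather drive the collection from inside the test program than program the counters by hand, here is a rough sketch using the Linux perf_event_open syscall to bracket just the region of interest (the event choices and the measured loop are only illustrative, and error handling is omitted):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* open one hardware counter for the calling thread, on any CPU */
static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int branches = open_counter(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
    int misses   = open_counter(PERF_COUNT_HW_BRANCH_MISSES);
    volatile int sink = 0;

    ioctl(branches, PERF_EVENT_IOC_RESET, 0);
    ioctl(misses, PERF_EVENT_IOC_RESET, 0);
    ioctl(branches, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(misses, PERF_EVENT_IOC_ENABLE, 0);

    for (int i = 0; i < 100000; i++) {   /* region under test */
        if (i & 1) sink++;
    }

    ioctl(branches, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(misses, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t b = 0, m = 0;
    read(branches, &b, sizeof(b));
    read(misses, &m, sizeof(m));
    printf("branches=%llu mispredicts=%llu\n",
           (unsigned long long)b, (unsigned long long)m);
    return 0;
}

This is per-thread and nowhere near cycle-exact, so Agner's driver is still the right tool for these experiments; it just answers the "start the collection now" question without leaving the program.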

--matt



Rajiv Kurian

Jan 28, 2016, 2:03:20 AM
to mechanical-sympathy
Haven't looked at your code yet but are you sure you are not running into:

"Avoid putting two conditional branch instructions in a loop so that both have the same branch target address and, at the same time belong to (i..e. have their last byte's address within) the same 16-byte aligned code block." (From the 64-ia-32-architectures-optimization-manual.pdf)

Matt Godbolt

Jan 28, 2016, 8:46:53 AM
to mechanical-sympathy
Hi Rajiv,

Thanks for the thought: I haven't put the code out yet so you won't have seen it! I'm putting 32 NOPs between branches in an effort to prevent this effect (which means the branch end bytes are 34 bytes apart in the "branch forward" case and 38 bytes apart in the "branch back" case).

--matt


Travis Downs

Feb 18, 2016, 11:06:48 PM
to mechanical-sympathy
Keep in mind that unlike earlier CPUs recent Intel CPUs use a hashing mechanism to map, for each branch, from the current "branch state" into multiple history tables, and will then pick the best match from there. The branch state involves the IP of the branch, combined with zero or more bits from the prior branch history. Most people believe that Haswell and beyond are using something comparable to TAGE (see here and then here for background reading on TAGE). The complexity of these predictors, and the use of hash functions with only partial confirmation (i.e., you can get spurious hits because the tag is only a dozen or so bits) makes interpreting and testing this stuff quite difficult.

If we take the papers at face value, then at the bottom we have the "base predictor" which is simply a hash from branch IP (no history) into a slot with no key confirmation. That is, a hash table that doesn't store the key, since spurious hits are not fatal (just results in a bogus prediction). The presence of the base predictor pretty much explains why you have the notes about the lack of static prediction. As soon as the CPU has been running for a second or more, this base predictor table is generally going to be full of predictions. Even if all the higher order tables that take history into account miss, you're going to get a prediction from this base predictor, even though it will simply sometimes be a bogus prediction. The only case where you use the "default predictor" (which is just hard-coded and corresponds to static prediction) would be when you hit an empty slot in the base predictor table, but that's not really going to happen unless there is some kind of predictor flushing mechanism (which is a good point - is a security risk to leave the predictors populated across context switches?). So you likely just fill up the base predictor with whatever default value since that short run behavior has negligible impact outside of a fraction of a second at boot.
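A toy model of that tagless base predictor, just to make the "spurious hit" idea concrete (the table size, hash function and counter width are invented for illustration and are not Haswell's real parameters):

#include <stdint.h>
#include <stdbool.h>

#define BASE_ENTRIES 4096                 /* illustrative size, power of two */

static uint8_t base_table[BASE_ENTRIES];  /* 2-bit saturating counters, 0..3 */

static unsigned base_index(uint64_t branch_ip) {
    /* no tag is stored, so two different branches hashing to the same slot
       silently share a counter - the "spurious hit" case described above */
    return (unsigned)((branch_ip ^ (branch_ip >> 12)) & (BASE_ENTRIES - 1));
}

static bool base_predict(uint64_t branch_ip) {
    return base_table[base_index(branch_ip)] >= 2;    /* 2..3 means "taken" */
}

static void base_update(uint64_t branch_ip, bool taken) {
    uint8_t *c = &base_table[base_index(branch_ip)];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

After a long warm-up with a roughly even mix of taken and not-taken branches, base_predict() for a brand-new branch is effectively a coin flip, which is where the estimate below of ~500 mispredicts per 1000 cold branches comes from.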

So after your warm-up code, in the very first batch of 1000 branches we can expect that you'll just be getting random hits in the base predictor, as the higher level predictors won't have tag matches (outside of a few spurious matches). So, you'd expect about 500 mispredicts since the base predictor should be sprinkled about 50/50 with taken and not-taken predictions from the warm up. On the second iteration, it's not exactly clear what's going to happen since you now have one iteration of data populated in the base predictor and also in the higher level predictors. Whether the base predictor or the higher level predictor is used depends in a pretty subtle way on things like the replacement/update policy for the higher level predictors, and as per the papers this is complex and may even involve a random component. If the result is that a higher level predictor is used, there should be almost no more mispredictions, for a total of 500. That seems to mesh with the numbers for the not-taken variants (looking only at the ones with the NOPS interleaved only).

If the base predictor is used again, you'd expect about 250 more mispredictions, as the one round of updates will have flipped perhaps half of the 2 or 3 bit counters that were wrong the first time. So you end up with perhaps 750 mispredicts. By the third iteration, everything should be using the history. That seems to more or less mesh with the taken variant, although it's not clear why it would be different from the not-taken variants except that the updates are "subtle" as pointed out above.

Perhaps you could do a run which output the number of mispredicts per iteration of the inner loop, so we can see if it is indeed something like:

500, 0, 0, ...
and
500, 250, 1, ...

for the two cases, or something else.

A few other things to consider/try:

Do the two variants encode to the same number of bytes? I.e., do the jump back vs jump forward variants have the same number of bytes in the loop? I know the way jumps are encoded depends on the offset and direction, etc. All this hashing stuff is highly sensitive to the exact number of bytes between jumps. For example, if they are using power-of-two truncation at any point, you'll get a big difference depending on whether the offsets between your jumps are relatively prime with 2 or not. That can also explain a bit about why using fewer NOPs in the warmup makes a huge difference (because for any NOP count that is even, some predictor tables may only be partly filled).

Vary the ratio of taken/not-taken branches in the warmup code and see how that affects the prediction behavior.

Try outputting the number of mispredictions for your straight-line warmup code! That, together with varying the number of jumps, would actually give a lot of info about static prediction behavior, perhaps in a cleaner way than the loops. In particular, if you have static prediction, you expect a long sequence with a mix of T% taken-forward branches and N% not-taken forward branches to have a prediction accuracy of N, since only the not-taken branches are correctly predicted. If something like the base-predictor in TAGE is always getting hit though, you expect an accuracy of max(T,N), since effectively you are always getting a random prediction with the same probability as the prior T/N probability. You can set up separate tests for the take cases also, although it's a bit trickier.



These prediction buffers fill up in an essentially random way based on the hash function and observed branches. If you just had straight-line code that executed randomly taken and untaken branches, the prediction tables will still fill up and you'll still be getting "hits" on pretty much every branch - i.e., the predictors will find some info at wherever the current branch state hashes to, despite the fact that each branch is being seen for the first time!



This explains at least some of your results well.

First, your initialization code of 65k branches doesn't make the prediction cache "cold" - in the sense of a normal CPU cache which can be filled with junk so it always misses, rather it keeps it "hot", but at least fills it in a consistent way - in particular, your code has 50% taken and 50% untaken branches, so the random mess of what's left in the predictors will be about 50%/50% as well (again, subject to some hash-function imposed randomness). A lot of the spooky action at a distance probably comes from branch alignment interacting with the hash function.

So the ~500 odd mispredicts

Matt Godbolt

Feb 19, 2016, 9:46:09 AM
to mechanical-sympathy
Thanks Tavis; a ton of useful information. I'm going to take a while to digest that all, and read the papers. Then I'll reply and maybe sort out some further tests (though the code's available if you have test machines spare).

FWIW I've also started doing some research into how the first-level BTB works for unconditional branches, at least on Arrandale (I know this is mostly micro-archaeology research... but it's my easy test rig). More info at http://xania.org/Microarchitecture-archive and I have a huge backlog of information to wade through still...




Matt Godbolt

Feb 19, 2016, 9:49:49 AM
to mechanical-sympathy
(Travis, sorry! Typo'd! Also, seems your message was truncated now I reached the bottom of it :))

Travis Downs

Feb 19, 2016, 1:04:56 PM
to mechanica...@googlegroups.com
You can ignore that last partial sentence (So the ~500 odd...) - I was editing my post and that was a dangling sentence from an earlier version of the reply, so nothing was lost :)

I read your two blog entries on this topic, which included more information and really confirmed that stuff started to change with Ivy Bridge. It re-emphasized what I think is the key shift in mental model for these newer predictors: in the past you had the predictors working like a cache: you could have hits or misses. That is, the lookups were mostly "fully confirmed" - when you got a hit you are pretty sure it corresponds to this branch. For the misses you fall back to static prediction.

The newer predictors mostly just have "hits". You always hit somewhere in the hierarchy of increasingly accurate predictors. You don't really have misses. What would be compulsory misses before (i.e., first time you execute a particular branch) are simply "spurious hits" in one of the shortest history predictors (usually the base predictor with history length zero). So to interpret the numbers you have to try to work out what patterns will generally appear in the base predictor.

It's worth noting that this branch prediction is really, really good. In particular, it makes redundant things like the loop count predictors of earlier generations. The general predictor "just works" for longish-count loops now (that's why your later iterations get only a single mispredict (likely on the final branch of the iteration) and not 101 (one for each inner iteration)).

Taking a look at your new entry now! If you like this kind of micro-archaeology, you should check out: http://blog.stuffedcow.net/tag/cpu/ if you haven't already. Not much recent, but the cache replacement and store-to-load forwarding articles are very good.

Matt Godbolt

Feb 19, 2016, 3:18:09 PM
to mechanical-sympathy
Thanks Travis: I'll be sure to check the papers. This may be orthogonal (as I've been in the BTB mindset recently as you can see!) :  I had pretty much convinced myself from reading the patents and other literature that there's still a "BTB" part that holds the target address cache: but I could be wrong. It does seem the Haswell et al still have "BACLEARS.ANY" which I take to be a BTB miss. This still makes sense to me given the number of pipeline stages before even decode finds the branches in the first place. For conditional branches there's then the multistage prediction, which is likely as not just random noise in the cases that I gave.

I'll reply properly to your earlier message, but it's definitely the case that without the "BTB scramble" (which, as you say, is more a pollute-the-shared-history-tables-with-junk step) the BPU very quickly (usually by the second iteration) got zero mispredicts.

So glad I'm getting some good links out of this! Had been finding it very difficult to find any real literature.

--matt :)

On Fri, Feb 19, 2016 at 12:05 PM Travis Downs <travis...@gmail.com> wrote:
You can ignore that last partial sentence (So the ~500 odd...) - I was editing my post and that was a dangling sentence from an earlier reply, but nothing was lost :)

Travis Downs

Feb 19, 2016, 3:27:48 PM
to mechanica...@googlegroups.com
Yeah, I'm not sure exactly how the BTB changed when the taken-not-taken predictor changed. I suspect it hasn't changed as much, because the function is quite different - for the non-indirect case it really just needs to do as good a job as possible matching branch IPs to addresses. Unlike branch decisions, that has a single fixed behavior, and so isn't history dependent, and so an associative cache makes sense.

Then you get to indirect branches, which have variable targets. Then you pretty much get back to the same kind of prediction that's needed for taken/not-taken and as far as I know that uses separate hardware: the iBTB and that's also covered briefly in the second paper above, namely with the iTAGE algorithm.
