Re: Static branch prediction (was: On intel x86 processors do you have branch predictions on fixed loops ?)

Matt Godbolt

unread,

Jan 27, 2016, 10:46:18 PM1/27/16

to mechanical-sympathy

Some quick results for a few other processor types, all with 64 NOPs between the branches.

The expectation is that CPUs with static prediction will predict backwards jumps as taken, and forwards jumps as not taken. In this instance one would expect to see 1000 mispredicts for "ahead taken"; 0 mispredicts for "ahead not taken", and 1000 mispredicts for "behind not taken". I need to work on a test for "behind taken".

Penryn (Xeon X3323)

Branch ahead taken

Processor 0

Clock Core cyc Instruct BrTaken BrMispred

2255069 2050350 3445524 118016 2846

1482772 1482773 3300303 100099 1

1521614 1483802 3300304 100100 8

Branch ahead not taken

Processor 0

Clock Core cyc Instruct BrTaken BrMispred

1148888 1138457 3300310 106 1001

1112393 1112395 3300303 99 5

1112296 1112298 3300303 99 1

Branch behind not taken

Processor 0

Clock Core cyc Instruct BrTaken BrMispred

1268423 1256292 3300311 107 1003

2426363 1881681 3477121 23035 2883

1168328 1168324 3300303 99 2

Arrendale (i5 M520) NB branch taken on this CPU seems to include mispredicts too; investigating why...

Branch ahead taken

Processor 0

Clock Core cyc Instruct BrTaken BrMispred

1532412 1737235 3300311 100108 1605

1357365 1658999 3300303 100100 1

1357500 1659166 3300303 100100 1

Branch ahead not taken

Processor 0

Clock Core cyc Instruct BrTaken BrMispred

973497 1114387 3300310 100107 3

900687 1100855 3300303 100100 2

900681 1100857 3300303 100100 2

Branch behind not taken

Processor 0

Clock Core cyc Instruct BrTaken BrMispred

974793 1141283 3300312 100109 10

900447 1100551 3300303 100100 2

900450 1100551 3300303 100100 2

These early results (on these older machines I have around the house...) seem to suggest that:

Penryn doesn't do static prediction at all (or I can't expose it with this test). It seems to do badly at predicting "predictable" untaken branches, but not as badly as taken ones. That said; it does seem to run slower when running the "branch behind not taken" than the "ahead not taken"; so I wonder if I'm getting the right counters...
Arrendale doesn't seem to do static prediction either: it does equally well with any kind of non-taken branch, and appropriately badly for the taken branch.

I'm beginning to suspect I'm not measuring what I think I'm measuring. Once the code's tidied up and made vaguely runnable elsewhere; I'll share here for comment, analysis, and hopefully some more CPUs to run it on!

--matt

On Wed, Jan 27, 2016 at 1:16 PM Matt Godbolt <ma...@godbolt.org> wrote:

Sorry if I've confused, or am confused: I'm not running perf or intel PCM here; I'm manipulating the perf counters directly. Specifically for the Haswell tests I'm using:

0xc4:0x20 - branches taken (aka "Number of near taken branches retired." p304 of Intel manual 3B)
0xc5:0x00 - branches mispredicted (aka "Mispredicted branch instructions at retirement" same page).

Having these be very accurately measured does help; the branch counter was pretty much exact in most cases and could be tallied up with the actual instruction stream (e.g. the 100099 branches in the taken cases and the 99 branches in the not-taken cases are exact.)

The reason not to use perf is to make it (as near to) cycle perfect as possible. FWIW Agner's code is here: http://www.agner.org/optimize/testp.zip if that helps frame the discussion. I'm definitely going to put something together using his as a baseline and put it on github over the next few days.

--matt

On Wed, Jan 27, 2016 at 12:25 PM ymo <ymol...@gmail.com> wrote:
Just curious about your testing and how do you get the missed branch misdirection from perf or maybe intel pcm (-performance-counter-monitor) code ? I know that most people run perf in collection mode but i was wondering if it is worth interfacing between perf and the code that you are testing. Meaning that you write a simple loop but before you start your loop you can tell perf to start the collection now. Is the extra level of control worth doing for these micro benchmark ?

I was hoping to write something that auto generates these test cases in a week end project .. one day )))

On Wednesday, January 27, 2016 at 11:50:08 AM UTC-5, Matt Godbolt wrote:

On Wed, Jan 27, 2016 at 6:47 AM Vitaly Davidovich <vit...@gmail.com> wrote:

On Tuesday, January 26, 2016, Matt Godbolt <ma...@godbolt.org> wrote:
....
This has all been on a Haswell; I'll try and find some older machines known to have static prediction to see if the same test gives different results.

If uop inside the uop cache has branch history associated with it, that could muddy the waters too? Some of your instructions in the test are the same right? 32 bytes of padding may be placing the uops on a different way inside the cache, improving hit rate or decreasing contention. Just hand waving here ...

Seems pretty possible!

I'd try this experiment on Ivy or Sandy Bridge (Haswell apparently had significant undisclosed BPU changes) and also something pre Sandy Bridge (which won't have the uop cache), like Nehalem or Westmere.

Right. I just trivially tried this on a tame Westmere box but Agner's code isn't working all for that. I spent all morning debugging it and have a fix, but the code is so fragile I think it worth busting out a new whole copy with fixes etc. I'll hack on this over the next few days and put it on github. Yay GPL :)

This could also be some microarchitectural issue with Haswell BPU - would someone at Intel run a branch heavy performance test with this many branches? :)

Finally, maybe running this test under Intel Amplifier or Linux perf with more detailed BPU events can shed some light (I'd need to review the list of events supported so not sure if there's enough granularity there).

Agreed. In fact, the list of events (section 3B of intel's manuals) is giant and comprehensive and has all manner of things... Agner interprets 0xe6:0x00 as both "BTB Miss" and "number of static branch predictions". This doesn't seem to fit perfectly with Intel's docs: "Counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end."

I'll have to dig more (and also get stats on what this counter is reading for my cases)..

I think this is the most attention given to static prediction that I recall seeing :).

Cool: I've always been a little suspect of my interpretation. It's good to be in a position to be able to experiment a little and maybe even work out what's going on!

Thanks all, Matt

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Matt Godbolt

unread,

Jan 31, 2016, 6:38:58 PM1/31/16

to mechanical-sympathy

Ok; I've spent the weekend hacking about and trying to make sense of the results, and I'm stumped. I hope we can work something out here. The code's at https://github.com/mattgodbolt/agner and so far I've gotten stuff working on my laptop only (the M520 Arrendale). Here's what I'm seeing:

******************************************************************************

Ahead not taken

******************************************************************************

Processor 0

Clock Core cyc Instruct BrMispred BTBMiss

428331 509602 1501707 3 0

409572 500606 1501704 2 0

409569 500601 1501704 2 0

Processor 0

Clock Core cyc Instruct BaClrFIq BaClrClr

444174 512653 1501707 7 7

409599 500613 1501704 0 0

409584 500595 1501704 0 0

Processor 0

Clock Core cyc Instruct BaClrBad BaClrEly

421830 511217 1501707 0 98

409590 500607 1501704 0 99

472941 525462 1501705 0 102

Processor 0

Clock Core cyc Instruct BaClrL8

423177 511326 1501707 0

409590 500609 1501704 0

409584 500605 1501704 0

******************************************************************************

Behind not taken

******************************************************************************

Processor 0

Clock Core cyc Instruct BrMispred BTBMiss

431751 522975 1501707 1010 0

409629 500657 1501704 5 0

409587 500604 1501704 2 0

Processor 0

Clock Core cyc Instruct BaClrFIq BaClrClr

473721 520939 1501707 1003 1003

450621 500692 1501704 0 0

450543 500604 1501704 0 0

Processor 0

Clock Core cyc Instruct BaClrBad BaClrEly

431457 521737 1501707 0 132

434079 502694 1501704 0 107

450183 502514 1501704 0 104

Processor 0

Clock Core cyc Instruct BaClrL8

474192 521855 1501707 0

461493 512776 1501704 0

455712 506349 1501704 0

******************************************************************************

Ahead taken

******************************************************************************

Processor 0

Clock Core cyc Instruct BrMispred BTBMiss

487568 455047 1301707 1004 0

459464 433940 1301704 2 0

459448 433920 1301704 1 0

Processor 0

Clock Core cyc Instruct BaClrFIq BaClrClr

413907 455855 1301707 15 13

390546 433940 1301704 0 0

390528 433920 1301704 0 0

Processor 0

Clock Core cyc Instruct BaClrBad BaClrEly

462378 456857 1301708 0 99699

398835 443136 1301704 0 100425

372933 435816 1301704 0 100138

Processor 0

Clock Core cyc Instruct BaClrL8

402693 457062 1301707 0

355026 433941 1301704 0

355011 433924 1301704 0

******************************************************************************

Behind taken

******************************************************************************

Processor 0

Clock Core cyc Instruct BrMispred BTBMiss

275124 332695 900507 3 0

245727 300350 900504 2 0

245706 300321 900504 1 0

Processor 0

Clock Core cyc Instruct BaClrFIq BaClrClr

347804 324822 900507 1003 1005

318016 300350 900504 0 0

317988 300322 900504 0 0

Processor 0

Clock Core cyc Instruct BaClrBad BaClrEly

269334 324680 900507 0 103458

245739 300351 900504 0 100332

245718 300323 900504 0 100330

Processor 0

Clock Core cyc Instruct BaClrL8

285048 332992 900507 0

354741 302930 900505 0

245706 300329 900504 0

All the funny-named counters are contractions of the counter names as mentioned in Section 3B of the Intel manual (pp370ish). The manual is deliberately vague on what they all mean. Also, on this CPU nothing I can do seems to get correct "Branches Taken": it seems any counters relating to branches are whether they're taken or not. I single-stepped the code to ensure things are going the right way; and the CPU cycle counts seem to bear out the branches being taken or not, so that's a mystery.

Similarly mysterious is the fact that any of my attempts to "reset" the state of the BPU with a ton of (somewhat) unpredictable branches failed. My general technique was to have a giant list of conditional jumps based off a random rotating counter: this I thought would flood the BTB and the predictors...but nothing seems to faze them. Ideas on how to achieve this welcomed. Luckily the first run (of the 3 on each row above) still seems to give consistent results.

The "branch taken backwards" is understandably a little different to the other test but is similar enough I think to give representative results.

So; to interpretation. Despite what I've seen before it seems that there *is* static prediction going on: the counters for "mispredict" read 1000 for "ahead taken" and "behind not taken"; and there are zero mispredicts for the "ahead not taken" and "backwards taken" which is totally consistent with static prediction for backwards-is-taken.

I'll play around with some other architectures and see if I can find any evidence for static prediction elsewhere. I've been linked http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf and that suggests the Pentium M at least doesn't statically predict (section 8.10), though it's not clear to me quite how that conclusion was drawn.

Further thoughts welcomed: however if this is considered off-topic for this list (or too verbose) I'll happily take it off-list with anyone interested.

Thanks, Matt

Matt D.

unread,

Feb 3, 2016, 8:10:47 AM2/3/16

to mechanical-sympathy

On Monday, February 1, 2016 at 12:38:58 AM UTC+1, Matt Godbolt wrote:

Ok; I've spent the weekend hacking about and trying to make sense of the results, and I'm stumped. I hope we can work something out here. The code's at https://github.com/mattgodbolt/agner and so far I've gotten stuff working on my laptop only (the M520 Arrendale). Here's what I'm seeing:

******************************************************************************
[snipped]
******************************************************************************

Similarly mysterious is the fact that any of my attempts to "reset" the state of the BPU with a ton of (somewhat) unpredictable branches failed. My general technique was to have a giant list of conditional jumps based off a random rotating counter: this I thought would flood the BTB and the predictors...but nothing seems to faze them. Ideas on how to achieve this welcomed. Luckily the first run (of the 3 on each row above) still seems to give consistent results.

The "branch taken backwards" is understandably a little different to the other test but is similar enough I think to give representative results.

So; to interpretation. Despite what I've seen before it seems that there *is* static prediction going on: the counters for "mispredict" read 1000 for "ahead taken" and "behind not taken"; and there are zero mispredicts for the "ahead not taken" and "backwards taken" which is totally consistent with static prediction for backwards-is-taken.

I'll play around with some other architectures and see if I can find any evidence for static prediction elsewhere. I've been linked http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf and that suggests the Pentium M at least doesn't statically predict (section 8.10), though it's not clear to me quite how that conclusion was drawn.

Further thoughts welcomed: however if this is considered off-topic for this list (or too verbose) I'll happily take it off-list with anyone interested.

Hi Matt (G.),

To add some data points -- perhaps useful:

1. Historical / documented behaviors.
2. BP research.
3. Architectural simulators sources.
4. Patents (focus on Intel; perhaps AMD would also make sense due to their cross-licensing agreement).

1. Historical known behaviors (from Shen & Lipasti "Modern Processor Design: Fundamentals of Superscalar Processors"):
- Single-Direction Prediction: "Older pipelined processors, such as the Intel i486, used the always-not-taken prediction algorithm."
- Backwards Taken/Forwards Not-Taken (BTFNT): "the Intel Pentium 4 processor uses the BTFNT approach as a backup strategy when its dynamic predictor is unable to provide a prediction"

This has been officially documented:
"If a branch is not found in the BTB, the branch prediction hardware statically predicts the outcome of the branch based on the direction of the branch displacement (forward or backward). Backward branches are assumed to be taken and forward branches are assumed to not be taken."
Source: G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, Issue 1, 2001, article 2.

Unfortunately, the later article on Pentium M contains significantly less detail (while stating it's based on that of P4--and still introducing the loop detector and indirect branch predictor): S. Gochman, R. Ronen, I. Anati, A. Berkovits, T. Kurts, A. Naveh, A. Saeed, Z. Sperber, and R. C. Valentine. The Intel Pentium M processor: Microarchitecture and performance. Intel Technology Journal, 7(2), May 2003.

2. BP research -- starting with the articles related to the following:

> Similarly mysterious is the fact that any of my attempts to "reset" the state of the BPU with a ton of (somewhat) unpredictable branches failed. My general technique was to have a giant list of conditional jumps based off a random rotating counter: this I thought would flood the BTB and the predictors...but nothing seems to faze them. Ideas on how to achieve this welcomed. Luckily the first run (of the 3 on each row above) still seems to give consistent results.

Forcing BP state eviction may not necessarily be trivial -- there's a line of microachitectural attacks, Branch Prediction Attacks (BPA) and Simple Branch Prediction Attacks (SBPA), which rely on the forced eviction (FWIW, the papers seem to focus on Pentium 4):
- "Predicting Secret Keys via Branch Prediction": https://eprint.iacr.org/2006/288.pdf
- "On the Power of Simple Branch Prediction Analysis": https://eprint.iacr.org/2006/351.pdf
- "Microarchitectural Attacks and Countermeasures": http://192.35.222.224/newweb/~koc/cren/docs/w05/ch18.pdf

The authors have found this to be somewhat non-trivial (from "On the Power of Simple Branch Prediction Analysis"):
"The most interesting fact that we can draw from this doubled BTB is the following. Executing a certain sequence of branches in the spy process which evicts just the Front-End BTB might not necessarily suffice to completely enforce the CPU not to find the target address of the target branch in some of the BTB’s. A certain hidden interaction between Front-End BTB and Trace-Cache BTB might allow for some “short-term” victim address evictions, but still store the target branch in one of the BTB’s."

Curiously, the aforementioned papers are also subject of later Intel BP patents:
- "Mitigating branch prediction and other timing based side channel attacks": http://www.google.com/patents/US8869294
- "Protecting a Branch Instruction from Side Channel Vulnerabilities": https://www.google.com/patents/US20090089564

On a side note, the author suggest the NT default on BTB miss; from Slides for "Predicting Secret Keys via Branch Prediction",
http://cs.ucsb.edu/~koc/cren/docs/w05/maa.pdf
- Adversary can clear the BTB via the operations of the dummy process and causes a BTB miss during the execution of the target branch.
- BPU automatically predicts the branch not to be taken if it misses the target address in the BTB.
- There will be a misprediction whenever the actual outcome of the target branch is `taken‘.
- [...] The adversary starts the spy process before the cipher, so when the cipher starts encryption/signing, the CPU cannot find the target address of the target branch in BTB and the prediction must be not-taken

Interestingly, "Branch Prediction and the Performance of Interpreters - Don't Trust Folklore" (an interesting read on its own -- given the remarks on the changing meaning of branch-friendly optimization techniques) seems to tie Haswell's predictor performance to that of ITTAGE:
https://hal.inria.fr/hal-01100647

"This accuracy on Haswell, the most recent Intel processor generation, has reached a level where it can not be considered as an obstacle for performance anymore. We have also shown that this accuracy is on par with the one of the literature state-of-the-art ITTAGE.
While the structure of the Haswell indirect jump predictor is undisclosed, we were able to confirm with simulations of ITTAGE that the few cases where the prediction accuracy is relatively poor are due to footprint issues on the predictor and not inherent to the prediction scheme."

The conditional branch predictor TAGE (TAgged GEometric history length) and ITTAGE (Indirect Target TAgged GEometric length) predictor, were introduced in http://www.jilp.org/vol8/v8paper1.pdf
It may be worth noting that the above is an improvement over Optimized GEometric History Length (O-GEHL), research on which has been supported by Intel: https://www.irisa.fr/caps/people/seznec/ISCA05.pdf

Perhaps it's also worth mentioning that another article co-authored by André Seznec, "A Comprehensive Study of Dynamic Global History Branch Prediction", presents a probabilistic argument that the initialization "strength" (of (non)-takenness) should not affect the prediction quality -- Section "Initial behavior of the two bit counter":
https://hal.inria.fr/inria-00072400

Last but not least, for completeness, more articles like Vladimir Uzelac's thesis, using microbenchmarks to understand BP -- seem to focus on up to Pentium M:
http://lacasa.uah.edu/portal/index.php/research/22-architectural-analysis

3. Simulators.

For what it's worth, the Sniper multi-core simulator seems to agree with Uzelac on BiModal being that last fallback; sources obtainable from http://snipersim.org/w/The_Sniper_Multi-Core_Simulator

In particular, in "sniper-6.1/common/performance_model/branch_predictors/pentium_m_branch_predictor.cc", we have:

bool PentiumMBranchPredictor::predict(IntPtr ip, IntPtr target)
{
   // ... snipped ...
   // Outcome prediction logic
   bool result;
   if (global_pred_out.hit)
   {
      result = global_pred_out.prediction;
   }
   else if (lpb_out.hit & btb_out.hit)
   {
      result = lpb_out.prediction;
   }
   else
   {
      result = bimodal_out;
   }
   // ... snipped ...
}

For reference:
- ZSim: https://github.com/s5z/zsim/blob/342bacac724719bd4f9251d0e5148bf0a486b6b8/src/ooo_core.h#L49
- MARSS-x86: https://github.com/avadhpatel/marss/blob/5b144b46cec2c01939f3f2fbf13ecd4d719bc11a/ptlsim/core/branchpred.cpp

4. Patents.

Focus on: https://patents.google.com/?q=%22branch+predictor%22&assignee=Intel+Corporation

One general caveat:
"The patents at the core of our descriptions above were all issued between 1996 and 1999, which raises the concern of obsolescence. As Intel would not be able to file new patents for the same specifications, we cannot present newer patents with the information above. Fortunately, we were able to find newer patents that mention the techniques described above, proving their relevance to newer CPU models."
From "Intel SGX Explained" https://eprint.iacr.org/2016/086 (an interesting article on Intel's Software Guard Extensions)

Admittedly got the idea from http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html#4.6 (a pretty interesting analysis of AMD's Opteron; it's unfortunate that the website doesn't seem maintained anymore) which references AMD's patent "Dynamic classification of conditional branches in global history branch prediction", http://www.google.com/patents/US6502188
"When a conditional branch is initially detected, it is classified as local and predicted not taken. If the branch is then actually taken, its prediction is changed to taken. If the branch is then actually not taken, its classification is changed to global, uses global branch prediction and participates in global history counter training. Advantageously, branches which exhibit static behavior may not participate in global history counter training. Instead, branches which are not taken may remain classified as local and not taken. Branches Which are taken may remain classified as local and taken."

Another article on chip-architect.com points to an Intel patent "Branch predictor using multiple prediction heuristics and a heuristic identifier in the branch instruction":
https://www.google.com/patents/US5687360
Source: http://www.chip-architect.com/news/2000_06_01_Branch_Prediction.html
"Otherwise, by default, Not Taken is predicted for forward branches, and Taken is predicted for backward branches."

The search may be further narrowed down (but will miss some of the later mentioned results):
https://www.google.com/search?hl=en&tbm=pts&q=inassignee%3A%22Intel+Corporation%22+%E2%80%9CA+Study+of+Branch+Prediction+Strategies%22
Relying on the fact that “A Study of Branch Prediction Strategies" by James E. Smith (1981) is relevant to BTFN, as "Strategy 3" from that paper being exactly Backward-Taken Forward-Not-taken (BTFN).

Yields the aforementioned "Branch predictor using multiple prediction heuristics and a heuristic identifier in the branch instruction" https://www.google.com/patents/US5687360
- 1995 patent, "branch bias" bit seems like a (deprecated) static BP method.

However, there are also other potentially interesting patents (several of Intel) in the "Referenced by" section.
For instance, "Branch prediction architecture", https://www.google.com/patents/US6332189
Which, in turn, also references the following:
- "Apparatus and method for branch prediction utilizing a predictor combination in parallel with a global predictor": http://www.google.com/patents/US7219217
- "Method and system for branch target prediction using path information": https://www.google.com/patents/US6601161

On a side note, some patents seems interesting in terms of illuminating the advanced BP capabilities:
- "Branch Predictor with Jump Ahead Logic to Jump Over Portions of Program Code Lacking Branches": http://www.google.com/patents/US20120311308
- "TLB correlated branch predictor and method for use thereof": https://www.google.com/patents/US20060015706

Let me know if there's any experiment I can help with: I can try running the test you mention on Core i7-3720QM (Ivy Bridge; although I've noticed you've already tested on E5-2667 v2) -- provided I can compile the test(s) on either Arch or Win :-)

Best,

Matt D.

Matt Godbolt

unread,

Feb 3, 2016, 9:59:05 AM2/3/16

to mechanica...@googlegroups.com

Thanks Matt D!

There's a lot to digest here, and thanks to your tweets I've already started looking at some of the references. I'm lucky enough to already have the Shen and Lipasti book, so I'm aware of most of the stuff in there.

At the moment I'm pretty sure (c.f. https://docs.google.com/spreadsheets/d/1GuyiHF3NCxjzI3Ag53U-Z6Ucau4hjrEb6Bp0hZb_kBU/edit?usp=sharing ) that the Arrendale has "traditional" static prediction, and the Ivy/Haswell don't (Though they may do something else).

I've actually become totally distracted in working out the BTB layout: I am in the process of documenting the Arrendale one as a starting point. The fascination came from my attempts to make each run (see the code) as reproducible as possible. As of this morning I've cracked enough of the Arrendale BTB to reliable be able to "clear" the predictor so that every run is the same. (code is in this change: https://github.com/mattgodbolt/agner/commit/e8259f68c5121225fec84527bae49376712f1e6e ) There are still a number of interesting things to work out: I'll put together a more comprehensive post on this tonight, time permitting.

Also having arrived at work it's clear the same technique is ineffective for Haswell: lots more work is needed here too.

Thanks for the post - I look forward to gradually cracking some of the mysteries of this elusive subject with your (and others') help!

-matt

Matt D.

unread,

Feb 3, 2016, 12:44:08 PM2/3/16

to mechanical-sympathy

On Wednesday, February 3, 2016 at 3:59:05 PM UTC+1, Matt Godbolt wrote:

As of this morning I've cracked enough of the Arrendale BTB to reliable be able to "clear" the predictor so that every run is the same. (code is in this change: https://github.com/mattgodbolt/agner/commit/e8259f68c5121225fec84527bae49376712f1e6e )

Wow, this sounds great!

There are still a number of interesting things to work out: I'll put together a more comprehensive post on this tonight, time permitting.

Fantastic, looking forward to it!

Also having arrived at work it's clear the same technique is ineffective for Haswell: lots more work is needed here too.

Out of curiosity, I wonder, could the subtleties similar to the ones mentioned in the BPA/SBPA papers play a role here?
Admittedly, it's unclear to me whether this is applicable here: The papers mentioned a "hidden interaction between Front-End BTB and Trace-Cache BTB" [when attempting to force address eviction from the BTB], with trace cache being specific to P4. What's unclear to me specifically is whether the new micro-op cache in the following microarchitectures -- which comes closest in my mind to constituting a "replacement" -- can exhibit anything like this, too?

Best,

Matt D.

ymo

unread,

Feb 4, 2016, 12:54:18 PM2/4/16

to mechanical-sympathy

Matt, this is brilliant code ... thank you for sharing this !

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

ymo

unread,

Feb 4, 2016, 2:05:15 PM2/4/16

to mechanical-sympathy

Hi Matt.

Intel PMC only requires the perf monitoring code to run as a root user. Any ideas why the code from agner/yours wants to go through a kernel loaded driver instead ? It becomes a drag when i want to interface from java for example (via ioctl).

Regards

On Wednesday, January 27, 2016 at 10:46:18 PM UTC-5, Matt Godbolt wrote:

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Matt D.

unread,

Feb 4, 2016, 4:28:02 PM2/4/16

to mechanical-sympathy

Just one note regarding flushing the branch histories and Haswell:

On Wednesday, February 3, 2016 at 3:59:05 PM UTC+1, Matt Godbolt wrote:

I've actually become totally distracted in working out the BTB layout: I am in the process of documenting the Arrendale one as a starting point. The fascination came from my attempts to make each run (see the code) as reproducible as possible. As of this morning I've cracked enough of the Arrendale BTB to reliable be able to "clear" the predictor so that every run is the same. [...] Also having arrived at work it's clear the same technique is ineffective for Haswell: lots more work is needed here too.

I've just ran across the following: "Understanding and Mitigating Covert Channels Through Branch Predictors"
http://www.cs.ucr.edu/~nael/pubs/taco16_branches.pdf
// to be published in ACM Transactions on Architecture and Code Optimization

The reason it seems potentially interesting is that as a mitigation of the channel under study, they "propose a software-only solution, which flushes the branch predictor (or randomizes its state) on context switches." They also test it on an Intel Core i7-4800MQ (Haswell) CPU.

They seem to be using a similar strategy to the one in your code (BP flushing using repeatedly taken branches), but use a rather large number of branch instructions (going into the hundreds of thousands). Perhaps you've already tried this as well and it may be just barking up the wrong tree, but thought I'd share just in case this can be useful :-)

They also do what they describe "priming the predictor" beforehand (putting the BP into a strong state), but perhaps this may be only necessary for their overall measuring experiment, which is to find the number of necessary branches needed.

Sec. 8.2, "Optimal Number of Branches in Randomization Code" goes into more detail.

The goal: "a large block of branch-intensive code is executed to randomize the branch predictor state. As a result, the newly scheduled process starts execution with a clean predictor state."

Here's the overall strategy:
"To determine the optimal number of branch instructions required to randomize the branch predictor in a secure manner, we conducted the following experiment. First, we executed several times the code consisting of one million branch instructions, thus placing the branch predictor into one of the strong states. We call this phase the priming of the predictor. Then, the flushing code block was executed once. We varied the number of instructions in the flushing block. Finally, we executed one thousand branch instructions with the same outcome as branches in the priming phase and measured the number of mispredictions in this block. We refer to this phase as probing. The number of branch mispredictions encountered in executing the probing code indicates how well the flushing phase resets the predictor."

Findings:
- "As expected, the small blocks of flushing code are not sufficient to reset the predictor and the misprediction rate is very low. As the flush code block increases in size, the misprediction rate of the probing phase also increases. The growth stops at about 50%, indicating that the branch predictor tables are reset. [...] we conservatively selected 300,000 branch instructions in the flush block for further experiments and analysis."
- "The results presented above show that executing an even mix of taken and not-taken branches with random pattern is the most stable way to reset the predictor."

On a side note, this is also pretty interesting: "Surprisingly, not-taken branches have a much smaller impact on the branch predictor state." Indeed, looking at Fig. 15 (Branch Predictor Flushing Using Taken Branches) and Fig. 16 (Branch Predictor Flushing Using Not-taken Branches), flushing using not-taken branches would seem to be completely ineffective.

Best,

Matt D.

Matt Godbolt

unread,

Feb 5, 2016, 9:58:27 AM2/5/16

to mechanical-sympathy

Hi Matt,

A quick reply: thanks again for the link! My reading list is significantly backed up now :)

For now I'm going to concentrate on honing my technique and scripts on the Arrendale - mainly as it's a computer I have available even when I'm commuting (which is where I get most of the time for this!). Once I'm comfortable explaining branch prediction there, I'll move on to the Sandy, Ivy and Haswell: each of which has significant BP improvements.

I'm also first starting on interpreting the BTB size - under the assumption that there's no taken/not-taken information stored there; and that a miss in the BTB always causes one of the "BP failure" counters to tick up. At least on Arrendale/Westmere, there's a bunch of counters for this; early ones where the decoder spots a branch the BTB failed to notice, for example. I have a ton of notes and thoughts; I'll probably write up something over the next few days and post it here (or a link to a blog post or similar).

Thanks, Matt (G) :)

Matt D.

unread,

Feb 7, 2016, 3:11:18 PM2/7/16

to mechanical-sympathy

Hi again!

Fair enough on the reading list -- I know the feeling (as well as that of the hardware availability constraints)! ;-)

Nehalem/Westmere has an extra twist, with BTB being two-level (with the decision later reversed* in Sandy Bridge), fun! :-)

* -- at least according to David Kanter (in Fog's microarch. manual BTB levels count in Sandy Bridge stated "unknown"):
http://www.realworldtech.com/nehalem/4/
http://www.realworldtech.com/sandy-bridge/3/

Let's say the following is mostly of a note-to-myself kind of thing (as well as to anyone interested) -- one more BP RE study (they do build on Uzelac et al. -- with an extra simulation step, so mostly including just for completeness -- perhaps it may be convenient to have references gathered in one place):

Shah Mohammad Faizur Rahman, Zhe Wang, and Daniel A. Jiménez, "Studying Microarchitectural Structures with Object Code Reordering", Proceedings of the 2009 Workshop on Binary Instrumentation and Applications (WBIA), December, 2009.

The authors "use the reordered object code to validate a reverse-engineered model for the Intel Core 2 branch predictor."

Finding: "The branch predictor of the Intel Xeon E5440 is not documented, but through reverse-engineering experiments we have determined that it is likely to contain a hybrid of a GAs-style branch predictor and a bimodal branch predictor."

// To be precise, Xeon E5440 is Harpertown (Penryn microarchitecture).

For completeness (I'm aware that Shen & Lipasti cover this), GAs refers to Global Adaptive Branch Prediction using per-set pattern history tables, using terminology introduced by T. Yeh and Y. Patt, "A Comparison of Dynamic Branch Predictors That Use Two Levels of Branch History," Proc. 20th Annual Intl. Symp. On Computer Architecture, May 1993, pp. 257-266.

Method: "Uzelac et al. introduced reverse engineering techniques using some microbenchmarks to expose the organization of Pentium M Branch Predictor. We try to use similar techniques to get an idea about the organization of Intel Core 2 branch predictor. Those microbenchmarks provide a rough outline about the organization of the branch predictor but there is no certain way to tell whether our assumption is correct. We use the reordered object code to validate our hypothetical branch predictor organization. We first reverse engineered the Intel Core 2 branch predictor to find out its organization. Then we use the reordered object code to validate the predictor and reveal different attributes and characteristics of it. We simulate several branch predictors using Pin and try to find out the hypothetical branch predictor which has the highest co-relation with the original one."

Best,

Matt

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Matt D.

unread,

Feb 8, 2016, 6:39:28 AM2/8/16

to mechanical-sympathy

Attaching the links to the PDFs (also found a 2011 version):

On Sunday, February 7, 2016 at 9:11:18 PM UTC+1, Matt D. wrote:

Shah Mohammad Faizur Rahman, Zhe Wang, and Daniel A. Jiménez, "Studying Microarchitectural Structures with Object Code Reordering", Proceedings of the 2009 Workshop on Binary Instrumentation and Applications (WBIA), December, 2009.

WBIA 2009:

"Studying Microarchitectural Structures with Object Code Reordering"

http://www.cs.virginia.edu/kim/publicity/pin/wbia09.pdf
pp. 7-16

IEEE International Symposium on Workload Characterization (IISWC) 2011:
"Program Interferometry" by Zhe Wang and Daniel A. Jiménez
http://taco.cse.tamu.edu/pdfs/interferometry_dist.pdf

Best,

Matt (D.) :-)

Reply all

Reply to author

Forward