Re: Static branch prediction (was: On intel x86 processors do you have branch predictions on fixed loops ?)

856 views
Skip to first unread message

Matt Godbolt

unread,
Jan 27, 2016, 10:46:18 PM1/27/16
to mechanical-sympathy

Some quick results for a few other processor types, all with 64 NOPs between the branches.

The expectation is that CPUs with static prediction will predict backwards jumps as taken, and forwards jumps as not taken. In this instance one would expect to see 1000 mispredicts for "ahead taken"; 0 mispredicts for "ahead not taken", and 1000 mispredicts for "behind not taken". I need to work on a test for "behind taken".

Penryn (Xeon X3323)
Branch ahead taken

Processor 0
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
   2255069    2050350    3445524     118016       2846 
   1482772    1482773    3300303     100099          1 
   1521614    1483802    3300304     100100          8 

Branch ahead not taken

Processor 0
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
   1148888    1138457    3300310        106       1001 
   1112393    1112395    3300303         99          5 
   1112296    1112298    3300303         99          1 

Branch behind not taken

Processor 0
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
   1268423    1256292    3300311        107       1003 
   2426363    1881681    3477121      23035       2883 
   1168328    1168324    3300303         99          2 

Arrendale (i5 M520) NB branch taken on this CPU seems to include mispredicts too; investigating why...
Branch ahead taken

Processor 0
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
   1532412    1737235    3300311     100108       1605 
   1357365    1658999    3300303     100100          1 
   1357500    1659166    3300303     100100          1 

Branch ahead not taken

Processor 0
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    973497    1114387    3300310     100107          3 
    900687    1100855    3300303     100100          2 
    900681    1100857    3300303     100100          2 

Branch behind not taken

Processor 0
     Clock   Core cyc   Instruct    BrTaken  BrMispred 
    974793    1141283    3300312     100109         10 
    900447    1100551    3300303     100100          2 
    900450    1100551    3300303     100100          2 

These early results (on these older machines I have around the house...) seem to suggest that:
  • Penryn doesn't do static prediction at all (or I can't expose it with this test). It seems to do badly at predicting "predictable" untaken branches, but not as badly as taken ones. That said; it does seem to run slower when running the "branch behind not taken" than the "ahead not taken"; so I wonder if I'm getting the right counters...
  • Arrendale doesn't seem to do static prediction either: it does equally well with any kind of non-taken branch, and appropriately badly for the taken branch.
I'm beginning to suspect I'm not measuring what I think I'm measuring. Once the code's tidied up and made vaguely runnable elsewhere; I'll share here for comment, analysis, and hopefully some more CPUs to run it on!

--matt

On Wed, Jan 27, 2016 at 1:16 PM Matt Godbolt <ma...@godbolt.org> wrote:
Sorry if I've confused, or am confused: I'm not running perf or intel PCM here; I'm manipulating the perf counters directly.  Specifically for the Haswell tests I'm using:

0xc4:0x20 - branches taken (aka "Number of near taken branches retired." p304 of Intel manual 3B)
0xc5:0x00 - branches mispredicted (aka "Mispredicted branch instructions at retirement" same page).

Having these be very accurately measured does help; the branch counter was pretty much exact in most cases and could be tallied up with the actual instruction stream (e.g. the 100099 branches in the taken cases and the 99 branches in the not-taken cases are exact.)

The reason not to use perf is to make it (as near to) cycle perfect as possible. FWIW Agner's code is here: http://www.agner.org/optimize/testp.zip if that helps frame the discussion. I'm definitely going to put something together using his as a baseline and put it on github over the next few days.

--matt


On Wed, Jan 27, 2016 at 12:25 PM ymo <ymol...@gmail.com> wrote:
Just curious about your testing and how do you get the missed branch misdirection from perf or maybe intel pcm (-performance-counter-monitor) code ? I know that most people run perf in collection mode but i was wondering if it is worth interfacing between perf and the code that you are testing. Meaning that you write a simple loop but before you start your loop you can tell perf to start the collection now. Is the extra level of control worth doing for these micro benchmark ?

I was hoping to write something that auto generates these test cases in a week end project .. one day )))

On Wednesday, January 27, 2016 at 11:50:08 AM UTC-5, Matt Godbolt wrote:


On Wed, Jan 27, 2016 at 6:47 AM Vitaly Davidovich <vit...@gmail.com> wrote:


On Tuesday, January 26, 2016, Matt Godbolt <ma...@godbolt.org> wrote:
....
This has all been on a Haswell; I'll try and find some older machines known to have static prediction to see if the same test gives different results.

If uop inside the uop cache has branch history associated with it, that could muddy the waters too? Some of your instructions in the test are the same right? 32 bytes of padding may be placing the uops on a different way inside the cache, improving hit rate or decreasing contention.  Just hand waving here ...

Seems pretty possible!
 

I'd try this experiment on Ivy or Sandy Bridge (Haswell apparently had significant undisclosed BPU changes) and also something pre Sandy Bridge (which won't have the uop cache), like Nehalem or Westmere.

Right. I just trivially tried this on a tame Westmere box but Agner's code isn't working all for that. I spent all morning debugging it and have a fix, but the code is so fragile I think it worth busting out a new whole copy with fixes etc. I'll hack on this over the next few days and put it on github. Yay GPL :)
 

This could also be some microarchitectural issue with Haswell BPU - would someone at Intel run a branch heavy performance test with this many branches? :) 

Finally, maybe running this test under Intel Amplifier or Linux perf with more detailed BPU events can shed some light (I'd need to review the list of events supported so not sure if there's enough granularity there).

Agreed. In fact, the list of events (section 3B of intel's manuals) is giant and comprehensive and has all manner of things... Agner interprets 0xe6:0x00 as both "BTB Miss" and "number of static branch predictions". This doesn't seem to fit perfectly with Intel's docs: "Counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end."

I'll have to dig more (and also get stats on what this counter is reading for my cases)..
 

I think this is the most attention given to static prediction that I recall seeing :).

Cool: I've always been a little suspect of my interpretation. It's good to be in a position to be able to experiment a little and maybe even work out what's going on!

Thanks all, Matt 

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Matt Godbolt

unread,
Jan 31, 2016, 6:38:58 PM1/31/16
to mechanical-sympathy
Ok; I've spent the weekend hacking about and trying to make sense of the results, and I'm stumped. I hope we can work something out here. The code's at https://github.com/mattgodbolt/agner and so far I've gotten stuff working on my laptop only (the M520 Arrendale). Here's what I'm seeing:

******************************************************************************
Ahead not taken
******************************************************************************

Processor 0
     Clock   Core cyc   Instruct  BrMispred    BTBMiss 
    428331     509602    1501707          3          0 
    409572     500606    1501704          2          0 
    409569     500601    1501704          2          0 

Processor 0
     Clock   Core cyc   Instruct   BaClrFIq   BaClrClr 
    444174     512653    1501707          7          7 
    409599     500613    1501704          0          0 
    409584     500595    1501704          0          0 

Processor 0
     Clock   Core cyc   Instruct   BaClrBad   BaClrEly 
    421830     511217    1501707          0         98 
    409590     500607    1501704          0         99 
    472941     525462    1501705          0        102 

Processor 0
     Clock   Core cyc   Instruct    BaClrL8 
    423177     511326    1501707          0 
    409590     500609    1501704          0 
    409584     500605    1501704          0 

******************************************************************************
Behind not taken
******************************************************************************

Processor 0
     Clock   Core cyc   Instruct  BrMispred    BTBMiss 
    431751     522975    1501707       1010          0 
    409629     500657    1501704          5          0 
    409587     500604    1501704          2          0 

Processor 0
     Clock   Core cyc   Instruct   BaClrFIq   BaClrClr 
    473721     520939    1501707       1003       1003 
    450621     500692    1501704          0          0 
    450543     500604    1501704          0          0 

Processor 0
     Clock   Core cyc   Instruct   BaClrBad   BaClrEly 
    431457     521737    1501707          0        132 
    434079     502694    1501704          0        107 
    450183     502514    1501704          0        104 

Processor 0
     Clock   Core cyc   Instruct    BaClrL8 
    474192     521855    1501707          0 
    461493     512776    1501704          0 
    455712     506349    1501704          0 

******************************************************************************
Ahead taken
******************************************************************************

Processor 0
     Clock   Core cyc   Instruct  BrMispred    BTBMiss 
    487568     455047    1301707       1004          0 
    459464     433940    1301704          2          0 
    459448     433920    1301704          1          0 

Processor 0
     Clock   Core cyc   Instruct   BaClrFIq   BaClrClr 
    413907     455855    1301707         15         13 
    390546     433940    1301704          0          0 
    390528     433920    1301704          0          0 

Processor 0
     Clock   Core cyc   Instruct   BaClrBad   BaClrEly 
    462378     456857    1301708          0      99699 
    398835     443136    1301704          0     100425 
    372933     435816    1301704          0     100138 

Processor 0
     Clock   Core cyc   Instruct    BaClrL8 
    402693     457062    1301707          0 
    355026     433941    1301704          0 
    355011     433924    1301704          0 

******************************************************************************
Behind taken
******************************************************************************

Processor 0
     Clock   Core cyc   Instruct  BrMispred    BTBMiss 
    275124     332695     900507          3          0 
    245727     300350     900504          2          0 
    245706     300321     900504          1          0 

Processor 0
     Clock   Core cyc   Instruct   BaClrFIq   BaClrClr 
    347804     324822     900507       1003       1005 
    318016     300350     900504          0          0 
    317988     300322     900504          0          0 

Processor 0
     Clock   Core cyc   Instruct   BaClrBad   BaClrEly 
    269334     324680     900507          0     103458 
    245739     300351     900504          0     100332 
    245718     300323     900504          0     100330 

Processor 0
     Clock   Core cyc   Instruct    BaClrL8 
    285048     332992     900507          0 
    354741     302930     900505          0 
    245706     300329     900504          0 

All the funny-named counters are contractions of the counter names as mentioned in Section 3B of the Intel manual (pp370ish). The manual is deliberately vague on what they all mean. Also, on this CPU nothing I can do seems to get correct "Branches Taken": it seems any counters relating to branches are whether they're taken or not. I single-stepped the code to ensure things are going the right way; and the CPU cycle counts seem to bear out the branches being taken or not, so that's a mystery.

Similarly mysterious is the fact that any of my attempts to "reset" the state of the BPU with a ton of (somewhat) unpredictable branches failed. My general technique was to have a giant list of conditional jumps based off a random rotating counter: this I thought would flood the BTB and the predictors...but nothing seems to faze them. Ideas on how to achieve this welcomed. Luckily the first run (of the 3 on each row above) still seems to give consistent results.

The "branch taken backwards" is understandably a little different to the other test but is similar enough I think to give representative results.

So; to interpretation. Despite what I've seen before it seems that there *is* static prediction going on: the counters for "mispredict" read 1000 for "ahead taken" and "behind not taken"; and there are zero mispredicts for the "ahead not taken" and "backwards taken" which is totally consistent with static prediction for backwards-is-taken.

I'll play around with some other architectures and see if I can find any evidence for static prediction elsewhere. I've been linked http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf and that suggests the Pentium M at least doesn't statically predict (section 8.10), though it's not clear to me quite how that conclusion was drawn.

Further thoughts welcomed: however if this is considered off-topic for this list (or too verbose) I'll happily take it off-list with anyone interested.

Thanks, Matt


Matt D.

unread,
Feb 3, 2016, 8:10:47 AM2/3/16
to mechanical-sympathy
On Monday, February 1, 2016 at 12:38:58 AM UTC+1, Matt Godbolt wrote:
Ok; I've spent the weekend hacking about and trying to make sense of the results, and I'm stumped. I hope we can work something out here. The code's at https://github.com/mattgodbolt/agner and so far I've gotten stuff working on my laptop only (the M520 Arrendale). Here's what I'm seeing:

******************************************************************************
[snipped]
******************************************************************************

Similarly mysterious is the fact that any of my attempts to "reset" the state of the BPU with a ton of (somewhat) unpredictable branches failed. My general technique was to have a giant list of conditional jumps based off a random rotating counter: this I thought would flood the BTB and the predictors...but nothing seems to faze them. Ideas on how to achieve this welcomed. Luckily the first run (of the 3 on each row above) still seems to give consistent results.

The "branch taken backwards" is understandably a little different to the other test but is similar enough I think to give representative results.

So; to interpretation. Despite what I've seen before it seems that there *is* static prediction going on: the counters for "mispredict" read 1000 for "ahead taken" and "behind not taken"; and there are zero mispredicts for the "ahead not taken" and "backwards taken" which is totally consistent with static prediction for backwards-is-taken.

I'll play around with some other architectures and see if I can find any evidence for static prediction elsewhere. I've been linked http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf and that suggests the Pentium M at least doesn't statically predict (section 8.10), though it's not clear to me quite how that conclusion was drawn.

Further thoughts welcomed: however if this is considered off-topic for this list (or too verbose) I'll happily take it off-list with anyone interested.

Hi Matt (G.),

To add some data points -- perhaps useful:

1. Historical / documented behaviors.
2. BP research.
3. Architectural simulators sources.
4. Patents (focus on Intel; perhaps AMD would also make sense due to their cross-licensing agreement).

1. Historical known behaviors (from Shen & Lipasti "Modern Processor Design: Fundamentals of Superscalar Processors"):
- Single-Direction Prediction: "Older pipelined processors, such as the Intel i486, used the always-not-taken prediction algorithm."
- Backwards Taken/Forwards Not-Taken (BTFNT): "the Intel Pentium 4 processor uses the BTFNT approach as a backup strategy when its dynamic predictor is unable to provide a prediction"

This has been officially documented:
"If a branch is not found in the BTB, the branch prediction hardware statically predicts the outcome of the branch based on the direction of the branch displacement (forward or backward). Backward branches are assumed to be taken and forward branches are assumed to not be taken."
Source: G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, Issue 1, 2001, article 2.

Unfortunately, the later article on Pentium M contains significantly less detail (while stating it's based on that of P4--and still introducing the loop detector and indirect branch predictor): S. Gochman, R. Ronen, I. Anati, A. Berkovits, T. Kurts, A. Naveh, A. Saeed, Z. Sperber, and R. C. Valentine. The Intel Pentium M processor: Microarchitecture and performance. Intel Technology Journal, 7(2), May 2003.

2. BP research -- starting with the articles related to the following:


> Similarly mysterious is the fact that any of my attempts to "reset" the state of the BPU with a ton of (somewhat) unpredictable branches failed. My general technique was to have a giant list of conditional jumps based off a random rotating counter: this I thought would flood the BTB and the predictors...but nothing seems to faze them. Ideas on how to achieve this welcomed. Luckily the first run (of the 3 on each row above) still seems to give consistent results.

Forcing BP state eviction may not necessarily be trivial -- there's a line of microachitectural attacks, Branch Prediction Attacks (BPA) and Simple Branch Prediction Attacks (SBPA), which rely on the forced eviction (FWIW, the papers seem to focus on Pentium 4):
- "Predicting Secret Keys via Branch Prediction": https://eprint.iacr.org/2006/288.pdf
- "On the Power of Simple Branch Prediction Analysis": https://eprint.iacr.org/2006/351.pdf
- "Microarchitectural Attacks and Countermeasures": http://192.35.222.224/newweb/~koc/cren/docs/w05/ch18.pdf

The authors have found this to be somewhat non-trivial (from "On the Power of Simple Branch Prediction Analysis"):
"The most interesting fact that we can draw from this doubled BTB is the following. Executing a certain sequence of branches in the spy process which evicts just the Front-End BTB might not necessarily suffice to completely enforce the CPU not to find the target address of the target branch in some of the BTB’s. A certain hidden interaction between Front-End BTB and Trace-Cache BTB might allow for some “short-term” victim address evictions, but still store the target branch in one of the BTB’s."

Curiously, the aforementioned papers are also subject of later Intel BP patents:
- "Mitigating branch prediction and other timing based side channel attacks": http://www.google.com/patents/US8869294
- "Protecting a Branch Instruction from Side Channel Vulnerabilities": https://www.google.com/patents/US20090089564

On a side note, the author suggest the NT default on BTB miss; from Slides for "Predicting Secret Keys via Branch Prediction",
http://cs.ucsb.edu/~koc/cren/docs/w05/maa.pdf
- Adversary can clear the BTB via the operations of the dummy process and causes a BTB miss during the execution of the target branch.
- BPU automatically predicts the branch not to be taken if it misses the target address in the BTB.
- There will be a misprediction whenever the actual outcome of the target branch is `taken‘.
- [...] The adversary starts the spy process before the cipher, so when the cipher starts encryption/signing, the CPU cannot find the target address of the target branch in BTB and the prediction must be not-taken

Interestingly, "Branch Prediction and the Performance of Interpreters - Don't Trust Folklore" (an interesting read on its own -- given the remarks on the changing meaning of branch-friendly optimization techniques) seems to tie Haswell's predictor performance to that of ITTAGE:
https://hal.inria.fr/hal-01100647

"This accuracy on Haswell, the most recent Intel processor generation, has reached a level where it can not be considered as an obstacle for performance anymore. We have also shown that this accuracy is on par with the one of the literature state-of-the-art ITTAGE.
While the structure of the Haswell indirect jump predictor is undisclosed, we were able to confirm with simulations of ITTAGE that the few cases where the prediction accuracy is relatively poor are due to footprint issues on the predictor and not inherent to the prediction scheme."

The conditional branch predictor TAGE (TAgged GEometric history length) and ITTAGE (Indirect Target TAgged GEometric length) predictor, were introduced in http://www.jilp.org/vol8/v8paper1.pdf
It may be worth noting that the above is an improvement over Optimized GEometric History Length (O-GEHL), research on which has been supported by Intel: https://www.irisa.fr/caps/people/seznec/ISCA05.pdf

Perhaps it's also worth mentioning that another article co-authored by André Seznec, "A Comprehensive Study of Dynamic Global History Branch Prediction", presents a probabilistic argument that the initialization "strength" (of (non)-takenness) should not affect the prediction quality -- Section "Initial behavior of the two bit counter":
https://hal.inria.fr/inria-00072400

Last but not least, for completeness, more articles like Vladimir Uzelac's thesis, using microbenchmarks to understand BP -- seem to focus on up to Pentium M:
http://lacasa.uah.edu/portal/index.php/research/22-architectural-analysis

3. Simulators.

For what it's worth, the Sniper multi-core simulator seems to agree with Uzelac on BiModal being that last fallback; sources obtainable from http://snipersim.org/w/The_Sniper_Multi-Core_Simulator

In particular, in "sniper-6.1/common/performance_model/branch_predictors/pentium_m_branch_predictor.cc", we have:

bool PentiumMBranchPredictor::predict(IntPtr ip, IntPtr target)
{
   // ... snipped ...
   // Outcome prediction logic
   bool result;
   if (global_pred_out.hit)
   {
      result = global_pred_out.prediction;
   }
   else if (lpb_out.hit & btb_out.hit)
   {
      result = lpb_out.prediction;
   }
   else
   {
      result = bimodal_out;
   }
   // ... snipped ...
}
 
For reference:
- ZSim: https://github.com/s5z/zsim/blob/342bacac724719bd4f9251d0e5148bf0a486b6b8/src/ooo_core.h#L49
- MARSS-x86: https://github.com/avadhpatel/marss/blob/5b144b46cec2c01939f3f2fbf13ecd4d719bc11a/ptlsim/core/branchpred.cpp

4. Patents.

Focus on: https://patents.google.com/?q=%22branch+predictor%22&assignee=Intel+Corporation

One general caveat:
"The patents at the core of our descriptions above were all issued between 1996 and 1999, which raises the concern of obsolescence. As Intel would not be able to file new patents for the same specifications, we cannot present newer patents with the information above. Fortunately, we were able to find newer patents that mention the techniques described above, proving their relevance to newer CPU models."
From "Intel SGX Explained" https://eprint.iacr.org/2016/086 (an interesting article on Intel's Software Guard Extensions)

Admittedly got the idea from http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html#4.6 (a pretty interesting analysis of AMD's Opteron; it's unfortunate that the website doesn't seem maintained anymore) which references AMD's patent "Dynamic classification of conditional branches in global history branch prediction", http://www.google.com/patents/US6502188
"When  a conditional  branch  is  initially  detected,  it  is  classified  as local  and  predicted  not  taken.  If  the  branch  is  then  actually taken,  its  prediction  is  changed  to  taken.  If  the  branch  is  then actually  not  taken,  its  classification  is  changed  to  global, uses  global  branch  prediction  and  participates  in  global history  counter  training.  Advantageously,  branches  which exhibit static behavior may not participate in  global history counter  training.  Instead,  branches  which  are  not  taken  may remain  classified  as  local  and  not  taken.  Branches  Which  are taken  may  remain  classified  as  local  and  taken."

Another article on chip-architect.com points to an Intel patent "Branch predictor using multiple prediction heuristics and a heuristic identifier in the branch instruction":
https://www.google.com/patents/US5687360
Source: http://www.chip-architect.com/news/2000_06_01_Branch_Prediction.html
"Otherwise,  by  default,  Not  Taken  is predicted  for  forward  branches,  and  Taken  is  predicted  for backward  branches."

The search may be further narrowed down (but will miss some of the later mentioned results):
https://www.google.com/search?hl=en&tbm=pts&q=inassignee%3A%22Intel+Corporation%22+%E2%80%9CA+Study+of+Branch+Prediction+Strategies%22
Relying on the fact that “A Study of Branch Prediction Strategies" by James E. Smith (1981) is relevant to BTFN, as "Strategy 3" from that paper being exactly Backward-Taken Forward-Not-taken (BTFN).

Yields the aforementioned "Branch predictor using multiple prediction heuristics and a heuristic identifier in the branch instruction" https://www.google.com/patents/US5687360
- 1995 patent, "branch bias" bit seems like a (deprecated) static BP method.

However, there are also other potentially interesting patents (several of Intel) in the "Referenced by" section.
For instance, "Branch prediction architecture", https://www.google.com/patents/US6332189
Which, in turn, also references the following:
- "Apparatus and method for branch prediction utilizing a predictor combination in parallel with a global predictor": http://www.google.com/patents/US7219217
- "Method and system for branch target prediction using path information": https://www.google.com/patents/US6601161

On a side note, some patents seems interesting in terms of illuminating the advanced BP capabilities:
- "Branch Predictor with Jump Ahead Logic to Jump Over Portions of Program Code Lacking Branches": http://www.google.com/patents/US20120311308
- "TLB correlated branch predictor and method for use thereof": https://www.google.com/patents/US20060015706

Let me know if there's any experiment I can help with: I can try running the test you mention on Core i7-3720QM (Ivy Bridge; although I've noticed you've already tested on E5-2667 v2) -- provided I can compile the test(s) on either Arch or Win :-)

Best,

Matt D.

Matt Godbolt

unread,
Feb 3, 2016, 9:59:05 AM2/3/16
to mechanica...@googlegroups.com
Thanks Matt D!

There's a lot to digest here, and thanks to your tweets I've already started looking at some of the references. I'm lucky enough to already have the Shen and Lipasti book, so I'm aware of most of the stuff in there.

At the moment I'm pretty sure (c.f. https://docs.google.com/spreadsheets/d/1GuyiHF3NCxjzI3Ag53U-Z6Ucau4hjrEb6Bp0hZb_kBU/edit?usp=sharing ) that the Arrendale has "traditional" static prediction, and the Ivy/Haswell don't (Though they may do something else).

I've actually become totally distracted in working out the BTB layout: I am in the process of documenting the Arrendale one as a starting point. The fascination came from my attempts to make each run (see the code) as reproducible as possible. As of this morning I've cracked enough of the Arrendale BTB to reliable be able to "clear" the predictor so that every run is the same. (code is in this change: https://github.com/mattgodbolt/agner/commit/e8259f68c5121225fec84527bae49376712f1e6e ) There are still a number of interesting things to work out: I'll put together a more comprehensive post on this tonight, time permitting.

Also having arrived at work it's clear the same technique is ineffective for Haswell: lots more work is needed here too.

Thanks for the post - I look forward to gradually cracking some of the mysteries of this elusive subject with your (and others') help!

-matt

Matt D.

unread,
Feb 3, 2016, 12:44:08 PM2/3/16
to mechanical-sympathy
On Wednesday, February 3, 2016 at 3:59:05 PM UTC+1, Matt Godbolt wrote:
As of this morning I've cracked enough of the Arrendale BTB to reliable be able to "clear" the predictor so that every run is the same. (code is in this change: https://github.com/mattgodbolt/agner/commit/e8259f68c5121225fec84527bae49376712f1e6e ) 
Wow, this sounds great!
 
There are still a number of interesting things to work out: I'll put together a more comprehensive post on this tonight, time permitting.
Fantastic, looking forward to it!
 
Also having arrived at work it's clear the same technique is ineffective for Haswell: lots more work is needed here too.
Out of curiosity, I wonder, could the subtleties similar to the ones mentioned in the BPA/SBPA papers play a role here?
Admittedly, it's unclear to me whether this is applicable here: The papers mentioned a "hidden interaction between Front-End BTB and Trace-Cache BTB" [when attempting to force address eviction from the BTB], with trace cache being specific to P4. What's unclear to me specifically is whether the new micro-op cache in the following microarchitectures -- which comes closest in my mind to constituting a "replacement" -- can exhibit anything like this, too?

Best,

Matt D.

ymo

unread,
Feb 4, 2016, 12:54:18 PM2/4/16
to mechanical-sympathy
Matt, this is brilliant code ... thank you for sharing this !

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

ymo

unread,
Feb 4, 2016, 2:05:15 PM2/4/16
to mechanical-sympathy
Hi Matt.

Intel PMC only requires the perf monitoring code to run as a root user. Any ideas why the code from agner/yours wants to go through a kernel loaded driver instead ? It becomes a drag when i want to interface from java for example (via ioctl).

Regards


On Wednesday, January 27, 2016 at 10:46:18 PM UTC-5, Matt Godbolt wrote:
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Matt D.

unread,
Feb 4, 2016, 4:28:02 PM2/4/16
to mechanical-sympathy
Just one note regarding flushing the branch histories and Haswell:


On Wednesday, February 3, 2016 at 3:59:05 PM UTC+1, Matt Godbolt wrote:
I've actually become totally distracted in working out the BTB layout: I am in the process of documenting the Arrendale one as a starting point. The fascination came from my attempts to make each run (see the code) as reproducible as possible. As of this morning I've cracked enough of the Arrendale BTB to reliable be able to "clear" the predictor so that every run is the same. [...] Also having arrived at work it's clear the same technique is ineffective for Haswell: lots more work is needed here too.

I've just ran across the following: "Understanding and Mitigating Covert Channels Through Branch Predictors"
http://www.cs.ucr.edu/~nael/pubs/taco16_branches.pdf
// to be published in ACM Transactions on Architecture and Code Optimization

The reason it seems potentially interesting is that as a mitigation of the channel under study, they "propose a software-only solution, which flushes the branch predictor (or randomizes its state) on context switches." They also test it on an Intel Core i7-4800MQ (Haswell) CPU.

They seem to be using a similar strategy to the one in your code (BP flushing using repeatedly taken branches), but use a rather large number of branch instructions (going into the hundreds of thousands). Perhaps you've already tried this as well and it may be just barking up the wrong tree, but thought I'd share just in case this can be useful :-)

They also do what they describe "priming the predictor" beforehand (putting the BP into a strong state), but perhaps this may be only necessary for their overall measuring experiment, which is to find the number of necessary branches needed.

Sec. 8.2, "Optimal Number of Branches in Randomization Code" goes into more detail.

The goal: "a large block of branch-intensive code is executed to randomize the branch predictor state. As a result, the newly scheduled process starts execution with a clean predictor state."

Here's the overall strategy:
"To determine the optimal number of branch instructions required to randomize the branch predictor in a secure manner, we conducted the following experiment. First, we executed several times the code consisting of one million branch instructions, thus placing the branch predictor into one of the strong states. We call this phase the priming of the predictor. Then, the flushing code block was executed once. We varied the number of instructions in the flushing block. Finally, we executed one thousand branch instructions with the same outcome as branches in the priming phase and measured the number of mispredictions in this block. We refer to this phase as probing. The number of branch mispredictions encountered in executing the probing code indicates how well the flushing phase resets the predictor."

Findings:
- "As expected, the small blocks of flushing code are not sufficient to reset the predictor and the misprediction rate is very low. As the flush code block increases in size, the misprediction rate of the probing phase also increases. The growth stops at about 50%, indicating that the branch predictor tables are reset. [...] we conservatively selected 300,000 branch instructions in the flush block for further experiments and analysis."
- "The results presented above show that executing an even mix of taken and not-taken branches with random pattern is the most stable way to reset the predictor."

On a side note, this is also pretty interesting: "Surprisingly, not-taken branches have a much smaller impact on the branch predictor state." Indeed, looking at Fig. 15 (Branch Predictor Flushing Using Taken Branches) and Fig. 16 (Branch Predictor Flushing Using Not-taken Branches), flushing using not-taken branches would seem to be completely ineffective.

Best,

Matt D.

Matt Godbolt

unread,
Feb 5, 2016, 9:58:27 AM2/5/16
to mechanical-sympathy
Hi Matt,

A quick reply: thanks again for the link! My reading list is significantly backed up now :)

For now I'm going to concentrate on honing my technique and scripts on the Arrendale - mainly as it's a computer I have available even when I'm commuting (which is where I get most of the time for this!).  Once I'm comfortable explaining branch prediction there, I'll move on to the Sandy, Ivy and Haswell: each of which has significant BP improvements.

I'm also first starting on interpreting the BTB size - under the assumption that there's no taken/not-taken information stored there; and that a miss in the BTB always causes one of the "BP failure" counters to tick up. At least on Arrendale/Westmere, there's a bunch of counters for this; early ones where the decoder spots a branch the BTB failed to notice, for example. I have a ton of notes and thoughts; I'll probably write up something over the next few days and post it here (or a link to a blog post or similar).

Thanks, Matt (G) :)

Matt D.

unread,
Feb 7, 2016, 3:11:18 PM2/7/16
to mechanical-sympathy
Hi again!

Fair enough on the reading list -- I know the feeling (as well as that of the hardware availability constraints)! ;-)

Nehalem/Westmere has an extra twist, with BTB being two-level (with the decision later reversed* in Sandy Bridge), fun! :-)

* -- at least according to David Kanter (in Fog's microarch. manual BTB levels count in Sandy Bridge stated "unknown"):
http://www.realworldtech.com/nehalem/4/
http://www.realworldtech.com/sandy-bridge/3/

Let's say the following is mostly of a note-to-myself kind of thing (as well as to anyone interested) -- one more BP RE study (they do build on Uzelac et al. -- with an extra simulation step, so mostly including just for completeness -- perhaps it may be convenient to have references gathered in one place):

Shah Mohammad Faizur Rahman, Zhe Wang, and Daniel A. Jiménez, "Studying Microarchitectural Structures with Object Code Reordering", Proceedings of the 2009 Workshop on Binary Instrumentation and Applications (WBIA), December, 2009.

The authors "use the reordered object code to validate a reverse-engineered model for the Intel Core 2 branch predictor."

Finding: "The branch predictor of the Intel Xeon E5440 is not documented, but through reverse-engineering experiments we have determined that it is likely to contain a hybrid of a GAs-style branch predictor and a bimodal branch predictor."

// To be precise, Xeon E5440 is Harpertown (Penryn microarchitecture).

For completeness (I'm aware that Shen & Lipasti cover this), GAs refers to Global Adaptive Branch Prediction using per-set pattern history tables, using terminology introduced by T. Yeh and Y. Patt, "A Comparison of Dynamic Branch Predictors That Use Two Levels of Branch History," Proc. 20th Annual Intl. Symp. On Computer Architecture, May 1993, pp. 257-266.

Method: "Uzelac et al. introduced reverse engineering techniques using some microbenchmarks to expose the organization of Pentium M Branch Predictor. We try to use similar techniques to get an idea about the organization of Intel Core 2 branch predictor. Those microbenchmarks provide a rough outline about the organization of the branch predictor but there is no certain way to tell whether our assumption is correct. We use the reordered object code to validate our hypothetical branch predictor organization. We first reverse engineered the Intel Core 2 branch predictor to find out its organization. Then we use the reordered object code to validate the predictor and reveal different attributes and characteristics of it. We simulate several branch predictors using Pin and try to find out the hypothetical branch predictor which has the highest co-relation with the original one."

Best,

Matt
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Matt D.

unread,
Feb 8, 2016, 6:39:28 AM2/8/16
to mechanical-sympathy
Attaching the links to the PDFs (also found a 2011 version):

On Sunday, February 7, 2016 at 9:11:18 PM UTC+1, Matt D. wrote:
Shah Mohammad Faizur Rahman, Zhe Wang, and Daniel A. Jiménez, "Studying Microarchitectural Structures with Object Code Reordering", Proceedings of the 2009 Workshop on Binary Instrumentation and Applications (WBIA), December, 2009.


WBIA 2009:

"Studying Microarchitectural Structures with Object Code Reordering"
http://www.cs.virginia.edu/kim/publicity/pin/wbia09.pdf
pp. 7-16

IEEE International Symposium on Workload Characterization (IISWC) 2011:
"Program Interferometry" by Zhe Wang and Daniel A. Jiménez
http://taco.cse.tamu.edu/pdfs/interferometry_dist.pdf

Best,

Matt (D.) :-)

Reply all
Reply to author
Forward
0 new messages