Hi Matt (G.),
To add some data points -- perhaps useful:
1. Historical / documented behaviors.
2. BP research.
3. Architectural simulators sources.
4. Patents (focus on Intel; perhaps AMD would also make sense due to their cross-licensing agreement).
1. Historical known behaviors (from Shen & Lipasti "Modern Processor Design: Fundamentals of Superscalar Processors"):
- Single-Direction Prediction: "Older pipelined processors, such as the Intel i486, used the always-not-taken prediction algorithm."
- Backwards Taken/Forwards Not-Taken (BTFNT): "the Intel Pentium 4 processor uses the BTFNT approach as a backup strategy when its dynamic predictor is unable to provide a prediction"
This has been officially documented:
"If a branch is not found in the BTB, the branch prediction hardware statically predicts the outcome of the branch based on the direction of the branch displacement (forward or backward). Backward branches are assumed to be taken and forward branches are assumed to not be taken."
Source: G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, Issue 1, 2001, article 2.
Unfortunately, the later article on Pentium M contains significantly less detail (while stating it's based on that of P4--and still introducing the loop detector and indirect branch predictor): S. Gochman, R. Ronen, I. Anati, A. Berkovits, T. Kurts, A. Naveh, A. Saeed, Z. Sperber, and R. C. Valentine. The Intel Pentium M processor: Microarchitecture and performance. Intel Technology Journal, 7(2), May 2003.
2. BP research -- starting with the articles related to the following:
Forcing BP state eviction may not necessarily be trivial -- there's a line of microachitectural attacks, Branch Prediction Attacks (BPA) and Simple Branch Prediction Attacks (SBPA), which rely on the forced eviction (FWIW, the papers seem to focus on Pentium 4):
- "Predicting Secret Keys via Branch Prediction":
https://eprint.iacr.org/2006/288.pdf- "On the Power of Simple Branch Prediction Analysis":
https://eprint.iacr.org/2006/351.pdf- "Microarchitectural Attacks and Countermeasures":
http://192.35.222.224/newweb/~koc/cren/docs/w05/ch18.pdfThe authors have found this to be somewhat non-trivial (from "On the Power of Simple Branch Prediction Analysis"):
"The most interesting fact that we can draw from this doubled BTB is the following. Executing a certain sequence of branches in the spy process which evicts just the Front-End BTB might not necessarily suffice to completely enforce the CPU not to find the target address of the target branch in some of the BTB’s. A certain hidden interaction between Front-End BTB and Trace-Cache BTB might allow for some “short-term” victim address evictions, but still store the target branch in one of the BTB’s."
Curiously, the aforementioned papers are also subject of later Intel BP patents:
- "Mitigating branch prediction and other timing based side channel attacks":
http://www.google.com/patents/US8869294- "Protecting a Branch Instruction from Side Channel Vulnerabilities":
https://www.google.com/patents/US20090089564On a side note, the author suggest the NT default on BTB miss; from Slides for "Predicting Secret Keys via Branch Prediction",
http://cs.ucsb.edu/~koc/cren/docs/w05/maa.pdf- Adversary can clear the BTB via the operations of the dummy process and causes a BTB miss during the execution of the target branch.
- BPU automatically predicts the branch not to be taken if it misses the target address in the BTB.
- There will be a misprediction whenever the actual outcome of the target branch is `taken‘.
- [...] The adversary starts the spy process before the cipher, so when the cipher starts encryption/signing, the CPU cannot find the target address of the target branch in BTB and the prediction must be not-taken
Interestingly, "Branch Prediction and the Performance of Interpreters - Don't Trust Folklore" (an interesting read on its own -- given the remarks on the changing meaning of branch-friendly optimization techniques) seems to tie Haswell's predictor performance to that of ITTAGE:
https://hal.inria.fr/hal-01100647"This accuracy on Haswell, the most recent Intel processor generation, has reached a level where it can not be considered as an obstacle for performance anymore. We have also shown that this accuracy is on par with the one of the literature state-of-the-art ITTAGE.
While the structure of the Haswell indirect jump predictor is undisclosed, we were able to confirm with simulations of ITTAGE that the few cases where the prediction accuracy is relatively poor are due to footprint issues on the predictor and not inherent to the prediction scheme."
The conditional branch predictor TAGE (TAgged GEometric history length) and ITTAGE (Indirect Target TAgged GEometric length) predictor, were introduced in
http://www.jilp.org/vol8/v8paper1.pdfIt may be worth noting that the above is an improvement over Optimized GEometric History Length (O-GEHL), research on which has been supported by Intel:
https://www.irisa.fr/caps/people/seznec/ISCA05.pdfPerhaps it's also worth mentioning that another article co-authored by André Seznec, "A Comprehensive Study of Dynamic Global History Branch Prediction", presents a probabilistic argument that the initialization "strength" (of (non)-takenness) should not affect the prediction quality -- Section "Initial behavior of the two bit counter":
https://hal.inria.fr/inria-00072400Last but not least, for completeness, more articles like Vladimir Uzelac's thesis, using microbenchmarks to understand BP -- seem to focus on up to Pentium M:
http://lacasa.uah.edu/portal/index.php/research/22-architectural-analysis3. Simulators.
For what it's worth, the Sniper multi-core simulator seems to agree with Uzelac on BiModal being that last fallback; sources obtainable from
http://snipersim.org/w/The_Sniper_Multi-Core_SimulatorIn particular, in "sniper-6.1/common/performance_model/branch_predictors/pentium_m_branch_predictor.cc", we have:
bool PentiumMBranchPredictor::predict(IntPtr ip, IntPtr target)
{
// ... snipped ...
// Outcome prediction logic
bool result;
if (global_pred_out.hit)
{
result = global_pred_out.prediction;
}
else if (lpb_out.hit & btb_out.hit)
{
result = lpb_out.prediction;
}
else
{
result = bimodal_out;
}
// ... snipped ...
}
For reference:
- ZSim:
https://github.com/s5z/zsim/blob/342bacac724719bd4f9251d0e5148bf0a486b6b8/src/ooo_core.h#L49- MARSS-x86:
https://github.com/avadhpatel/marss/blob/5b144b46cec2c01939f3f2fbf13ecd4d719bc11a/ptlsim/core/branchpred.cpp4. Patents.
Focus on:
https://patents.google.com/?q=%22branch+predictor%22&assignee=Intel+CorporationOne general caveat:
"The patents at the core of our descriptions above were all issued between 1996 and 1999, which raises the concern of obsolescence. As Intel would not be able to file new patents for the same specifications, we cannot present newer patents with the information above. Fortunately, we were able to find newer patents that mention the techniques described above, proving their relevance to newer CPU models."
From "Intel SGX Explained"
https://eprint.iacr.org/2016/086 (an interesting article on Intel's Software Guard Extensions)
Admittedly got the idea from
http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html#4.6 (a pretty interesting analysis of AMD's Opteron; it's unfortunate that the website doesn't seem maintained anymore) which references AMD's patent "
Dynamic classification of conditional branches in global history branch prediction", http://www.google.com/patents/US6502188"When a conditional branch is initially detected, it is classified as local and predicted not taken. If the branch is then actually taken, its prediction is changed to taken. If the branch is then actually not taken, its classification is changed to global, uses global branch prediction and participates in global history counter training. Advantageously, branches which exhibit static behavior may not participate in global history counter training. Instead, branches which are not taken may remain classified as local and not taken. Branches Which are taken may remain classified as local and taken."
Another article on
chip-architect.com points to an Intel patent "Branch predictor using multiple prediction heuristics and a heuristic identifier in the branch instruction":
https://www.google.com/patents/US5687360Source:
http://www.chip-architect.com/news/2000_06_01_Branch_Prediction.html"Otherwise, by default, Not Taken is predicted for forward branches, and Taken is predicted for backward branches."
The search may be further narrowed down (but will miss some of the later mentioned results):
https://www.google.com/search?hl=en&tbm=pts&q=inassignee%3A%22Intel+Corporation%22+%E2%80%9CA+Study+of+Branch+Prediction+Strategies%22Relying on the fact that “A Study of Branch Prediction Strategies" by James E. Smith (1981) is relevant to BTFN, as "Strategy 3" from that paper being exactly Backward-Taken Forward-Not-taken (BTFN).
Yields the aforementioned "Branch predictor using multiple prediction heuristics and a heuristic identifier in the branch instruction"
https://www.google.com/patents/US5687360- 1995 patent, "branch bias" bit seems like a (deprecated) static BP method.
However, there are also other potentially interesting patents (several of Intel) in the "Referenced by" section.
For instance, "Branch prediction architecture",
https://www.google.com/patents/US6332189Which, in turn, also references the following:
- "Apparatus and method for branch prediction utilizing a predictor combination in parallel with a global predictor":
http://www.google.com/patents/US7219217- "Method and system for branch target prediction using path information":
https://www.google.com/patents/US6601161On a side note, some patents seems interesting in terms of illuminating the advanced BP capabilities:
- "Branch Predictor with Jump Ahead Logic to Jump Over Portions of Program Code Lacking Branches":
http://www.google.com/patents/US20120311308- "TLB correlated branch predictor and method for use thereof":
https://www.google.com/patents/US20060015706Let me know if there's any experiment I can help with: I can try running the test you mention on Core i7-3720QM (Ivy Bridge; although I've noticed you've already tested on E5-2667 v2) -- provided I can compile the test(s) on either Arch or Win :-)
Best,
Matt D.