And there's a big disadvantage since Ivy Bridge. The uop cache is a very small cache for decoded instructions. If your loop doesn't fit in the uop cache, it can take a considerable performance penalty. For this reason Agner Fog now recommends not unrolling loops.
The L0 (uop) cache has been around since Sandy Bridge; it can hold just over 1500 uops. It was introduced for power saving and not performance. Decoding instructions can be power hungry.
I find loop unrolling often does not pay off these days. Keeping it small and simple often works best in real world apps. Unrolling sometimes wins in micro benchmarks but seldom within a larger application.
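For concreteness, a minimal sketch (plain C; the function names and the 4x factor are illustrative, not from the thread) of the kind of manual unrolling being discussed. The unrolled body is several times larger, which is exactly what can push a hot loop out of the uop cache or loop buffer:

    #include <stddef.h>

    /* Straightforward loop: a small body that fits comfortably in the
     * decoded-uop structures (uop cache, loop buffer). */
    long sum_simple(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Manually unrolled 4x: less loop overhead per element, but a body roughly
     * four times larger, which is what risks spilling out of the uop cache.
     * Illustrative only; n is assumed to be a multiple of 4 to keep it short. */
    long sum_unrolled4(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        return s;
    }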
On Tuesday, January 26, 2016, Martin Thompson <mjp...@gmail.com> wrote:
> The L0 (uop) cache has been around since Sandy Bridge; it can hold just over 1500 uops. It was introduced for power saving and not performance. Decoding instructions can be power hungry.

It's both performance and power. With CISC, decode is not trivial and the pipeline can stall on tight compute kernels because decode cannot keep up - think loops. Frontend stalls are usually in the instruction fetch stage, due to i-cache misses, but decode is still important for performance in some cases.

> I find loop unrolling often does not pay off these days. Keeping it small and simple often works best in real world apps. Unrolling sometimes wins in micro benchmarks but seldom within a larger application.

As mentioned, unrolling is an optimization gateway for loops, like inlining is for calls.
Yes, it will execute the loop exit condition check each time. Initially, static prediction will be used...
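To make the "exit condition check each time" point concrete, a small sketch (C; the assembly in the comment is only the typical shape, the exact code depends on the compiler):

    #include <stddef.h>

    long count_up(size_t n)
    {
        long total = 0;
        /* The i < n test is evaluated on every iteration. Compilers typically
         * emit it as a backward conditional branch at the bottom of the loop
         * (roughly: add ...; inc i; cmp i, n; jb top). Until that branch has
         * history, the predictor falls back to its default (static) behaviour. */
        for (size_t i = 0; i < n; i++)
            total += (long)i;
        return total;
    }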
On 26 January 2016 at 13:42, Vitaly Davidovich <vit...@gmail.com> wrote:
> It's both performance and power. With CISC, decode is not trivial and the pipeline can stall on tight compute kernels because decode cannot keep up - think loops. Frontend stalls are usually in the instruction fetch stage, due to i-cache misses, but decode is still important for performance in some cases.

The performance side is more complicated, whereas the power saving is pretty clear. A uop cache hit after a branch mispredict improves things over Nehalem, whereas a uop cache miss after a misprediction is a couple of extra cycles' penalty over Nehalem. One of the biggest wins is for AVX, as the bandwidth is increased for feeding all the execution units; the decoders could not do this on their own.

In theory (we all know practice can be different) the decoding is pipelined, and therefore for predictable code the performance would not be hit if you did not have a uop cache, but you do save on the power for decoding.

One other thing to consider is the 28-uop queue after the decoders and uop cache, which is great for tight loops. These loops are restricted to fewer than 8 taken branches and no RETs or CALLs, which encourages clean, simple code. A branch misprediction will end these small loops.

> As mentioned, unrolling is an optimization gateway for loops, like inlining is for calls.

Agreed. A sufficiently smart compiler would be architecture aware, if such a beast exists. The main point is that developers should be very careful about unrolling their own loops. Much better to focus on reducing data dependencies for increased ILP.
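As a concrete illustration of reducing data dependencies for ILP (a sketch, not from the thread): splitting one serial accumulator into two independent ones breaks the loop-carried dependency chain while keeping the body small. The usual caveat applies that reassociating floating-point sums can change the result slightly.

    #include <stddef.h>

    /* One accumulator: every add depends on the previous one, so throughput is
     * bounded by the latency of that dependency chain. */
    double dot_serial(const double *a, const double *b, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    /* Two independent accumulators: the adds into s0 and s1 can overlap in the
     * pipeline. (n is assumed even to keep the sketch short.) */
    double dot_two_accumulators(const double *a, const double *b, size_t n)
    {
        double s0 = 0.0, s1 = 0.0;
        for (size_t i = 0; i < n; i += 2) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
        }
        return s0 + s1;
    }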
So Intel's opto guide (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf), section 3.4.1.3 talks about static prediction. It says that branches without BTB history are predicted using static prediction: unconditional branches are predicted taken, indirect branches are predicted not taken. It then recommends code generation to take advantage of this (when likeliness is known/hinted). That section closes with:
The Intel Core microarchitecture does not use static prediction heuristic. However, to maintain consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default.
Perhaps it's outdated/inaccurate though, although the revision date is Sept 2015 ...
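A small sketch of the code-layout advice that section gives (GCC/Clang C; do_work and the function shape are hypothetical). __builtin_expect hands the compiler the likeliness hint so it can keep the likely path as fall-through, which matches the static heuristic of forward conditional branches being predicted not taken when there is no history:

    /* do_work is a hypothetical placeholder for the hot path. */
    extern int do_work(int fd, char *buf, int len);

    int process(int fd, char *buf, int len)
    {
        /* Hinted unlikely: the compiler will usually move this branch's body
         * out of line and lay out the common case as straight-line code. */
        if (__builtin_expect(len <= 0, 0))
            return -1;

        /* Likely path: falls through with no taken branch on the hot route. */
        return do_work(fd, buf, len);
    }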
> The performance side is more complicated, whereas the power saving is pretty clear. A uop cache hit after a branch mispredict improves things over Nehalem, whereas a uop cache miss after a misprediction is a couple of extra cycles' penalty over Nehalem. One of the biggest wins is for AVX, as the bandwidth is increased for feeding all the execution units; the decoders could not do this on their own.

I actually think the other way: power savings are questionable, but the speed improvement for small loops can be big. If the loop stays entirely in the uop cache and has a small kernel with cheap instructions, it will not get stalled by decode, since it doesn't even touch the decoders.
I'm curious to know why you seem to hold this view. "Power savings are questionable"? Really?
Interesting. Yes, back-to-back jumps like this do sound stressful on the CPU :).

Agner notes the following for Haswell:

> The high throughput for taken branches of one per clock was observed for up to 128 branches with no more than one branch per 16 bytes of code. If there is more than one branch per 16 bytes of code then the throughput is reduced to one jump per two clock cycles. If there are more than 128 branches in the critical part of the code, and if they are spaced by at least 16 bytes, then apparently the first 128 branches have the high throughput and the remaining have the low throughput. These observations may indicate that there are two branch prediction methods: a fast method tied to the µop cache and the instruction cache, and a slower method using a branch target buffer.

Now, he's talking about throughput here, but it's possible that spacing the jumps has prediction implications as well. I'm still surprised that nops, which aren't truly processed by the core in the strict sense, have such a drastic effect. It does change the PC of the branches, so maybe there's some BTB contention with so many branches packed together.
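To make the spacing observation concrete, a sketch (C with GCC extended asm in GAS syntax; the count and padding are illustrative, not from the thread) of a chain of taken jumps laid out so that no 16-byte block holds more than one branch:

    /* 128 unconditional taken jumps, one per 16 bytes of code: each short
     * 'jmp 1f' is 2 bytes and skips 14 bytes of NOP padding, so consecutive
     * branches land in different 16-byte blocks. */
    static void taken_jump_chain(void)
    {
        __asm__ volatile(
            ".rept 128       \n\t"
            "jmp 1f          \n\t"
            ".skip 14, 0x90  \n\t"
            "1:              \n\t"
            ".endr           \n\t");
    }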
Hi all,

Hope nobody minds, but I forked this conversation to report my latest findings.

I added a straight-line list of 65536 sequences of "jump taken ahead; 16 nops; jump not taken ahead; 16 nops" during the initialisation code. This seems to have had the desired effect: the branch prediction numbers are now stable across runs.

Interestingly, fewer than 65536 sequences, or fewer NOPs in between, causes even more instability, so it seems like there's some spooky action at a distance involved in the prediction. To me this suggests the static predictor can be "wrong" about some new branches, misidentifying them as previously seen in some cases. However, with sufficient "coldness" of the branches, the results are as follows:
Cheers, Matt
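For concreteness, the "jump taken ahead; 16 nops; jump not taken ahead; 16 nops" preamble Matt describes could be generated along these lines (a sketch in C with GCC extended asm, GAS syntax; the register used for the never-true condition and the exact encoding are assumptions, not Matt's actual code):

    static void branch_cooling_preamble(void)
    {
        __asm__ volatile(
            "xor %%eax, %%eax      \n\t"  /* eax = 0, so the jnz below is never taken */
            ".rept 65536           \n\t"
            "jmp 1f                \n\t"  /* jump taken ahead                          */
            ".skip 16, 0x90        \n\t"  /* 16 NOPs, skipped over                     */
            "1: test %%eax, %%eax  \n\t"
            "jnz 2f                \n\t"  /* jump not taken ahead                      */
            ".skip 16, 0x90        \n\t"  /* 16 NOPs, executed this time               */
            "2:                    \n\t"
            ".endr                 \n\t"
            : : : "eax", "cc");
    }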
> I'm curious to know why you seem to hold this view. "Power savings are questionable"? Really?

I thought I somewhat touched upon that in the same response. What I'm saying is that it doesn't seem like power savings are the main driver here. While it's true that if a loop is executing fully out of the uop cache and not touching the decoders, they can presumably power down if there's nothing else to decode behind the loop instructions, I don't know how often this is the case, especially since good uop cache use requires a bunch more things to align correctly. But when you *do* encounter such loops, you want them to not stall due to repeated decode, even if it's pipelined, particularly if they're feeding wide registers.

So maybe the right way to phrase my sentiment is that I think power savings are secondary to the perf boost for loops. Happy to have someone demonstrate/explain otherwise ...
I just wanted to make the point that power saving is driving more design decisions than performance in Intel CPUs these days.
I could see no evidence for the performance argument you made, other than as a side effect :-)
> I could see no evidence for the performance argument you made, other than as a side effect :-)

Fair enough. My opinion on *this particular* feature's power impact is based on my own inference. As for "side effect", Intel's own arch manual briefly mentions reduced power from this feature, and then talks a lot more about throughput/latency improvements -- that doesn't really sound like a side effect :).
Yes. I've seen many side effects end up as features after delivery. The architecture guide is written after delivery. The paper written in 2001 by Intel to describe the feature had the design goal of power saving. They thought back then it would be performance neutral :-) Let's quote them directly, "the eliminated work may save about 10% of the full-chip power consumption with no performance degradation".
Power has only relatively recently (it seems to me -- last 6-8 years?) become a hot topic with the rise of clouds, "data center compute", "big data", and mobile. Then there are other CPU vendors (with weaker single core perf than Intel) who jumped on this, and reformulated performance as XXX/watt :).
On Tuesday, January 26, 2016, Matt Godbolt <ma...@godbolt.org> wrote:
....
> This has all been on a Haswell; I'll try and find some older machines known to have static prediction to see if the same test gives different results.

If a uop inside the uop cache has branch history associated with it, that could muddy the waters too? Some of your instructions in the test are the same, right? 32 bytes of padding may be placing the uops on a different way inside the cache, improving hit rate or decreasing contention. Just hand waving here ...
I'd try this experiment on Ivy or Sandy Bridge (Haswell apparently had significant undisclosed BPU changes) and also something pre-Sandy Bridge (which won't have the uop cache), like Nehalem or Westmere.
This could also be some microarchitectural issue with the Haswell BPU - would someone at Intel run a branch-heavy performance test with this many branches? :)
Finally, maybe running this test under Intel Amplifier or Linux perf with more detailed BPU events can shed some light (I'd need to review the list of supported events, so I'm not sure if there's enough granularity there).
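On the perf suggestion, a minimal starting point (a sketch; 'branches' and 'branch-misses' are the generic events, while the finer-grained BPU events are microarchitecture-specific and would need checking against perf list on the box in question):

    /* Build normally and run under:
     *   perf stat -e branches,branch-misses ./a.out
     * 'branches' and 'branch-misses' are generic perf events; per-uarch events
     * (for example the BR_MISP_RETIRED.* family on Intel) are listed by
     * 'perf list' and vary by CPU generation. */
    #include <stdio.h>

    int main(void)
    {
        volatile long sink = 0;   /* volatile keeps the loop from being optimised away */
        for (long i = 0; i < 100000000L; i++)
            if (i & 1)            /* a perfectly predictable alternating branch */
                sink += i;
        printf("%ld\n", sink);
        return 0;
    }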
I think this is the most attention given to static prediction that I recall seeing :).
"Avoid putting two conditional branch instructions in a loop so that both have the same branch target address and, at the same time belong to (i..e. have their last byte's address within) the same 16-byte aligned code block." (From the 64-ia-32-architectures-optimization-manual.pdf)
You can ignore that last partial sentence (So the ~500 odd...) - I was editing my post and that was a dangling sentence from an earlier reply, but nothing was lost :)