It's also not clear whether it's the AVX2/AVX-512 instructions themselves that trigger the slowdown, or the 256/512-bit datapaths, or use of the 256/512-bit FPUs, or something else entirely. I find it hard to believe that, for example, the 128-bit form of VPMASKMOVD causes a slowdown, yet it is an AVX2 instruction. Similarly, AVX-512 instructions can operate on 128-bit and 256-bit lengths.
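To make that concrete, here's a tiny sketch (my own, not from any Intel doc) of two intrinsics that only ever touch 128-bit registers, yet one maps to an AVX2 instruction and the other to an AVX-512 (VL) one:

/* Both functions compile to instructions that operate on xmm (128-bit)
   registers only, yet the first requires AVX2 and the second requires
   AVX-512F + AVX-512VL. Build with e.g. -mavx2 -mavx512f -mavx512vl. */
#include <immintrin.h>

__m128i masked_load_avx2(const int *p, __m128i mask) {
    return _mm_maskload_epi32(p, mask);   /* VPMASKMOVD xmm, xmm, m128 (AVX2) */
}

__m128i masked_add_avx512vl(__mmask8 k, __m128i a, __m128i b) {
    return _mm_maskz_add_epi32(k, a, b);  /* VPADDD xmm{k}{z}, xmm, xmm (AVX-512VL) */
}

Whether either of these incurs any license/frequency penalty is exactly the open question.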
I'm also disappointed there's no documentation of AVX-3 through
AVX-511.
As many of you may know, CPU clock speed on various Intel processors varies with workload. This often makes questions like "which chip is fastest for my workload?" and "does AVX-512 help or hurt performance?" hard to answer cleanly. It can also make measuring behavior on one configuration (e.g. AWS c5 or m5 instances, or equivalent Azure or GCE instances, all of which are easy to get your hands on) and projecting from that to what other setups will see (e.g. a Xeon Gold 6146 or a Xeon Silver 4116) mind-bogglingly hard...
After digging around quite a bit to try and find data on how frequencies react to instruction mixes and active thread counts on various Skylake models, I found the following VERY useful document, which I figured others on this list may find useful:
Pages 13-14 have some very useful data for higher end models, followed by data for lots of other models on following pages.
An example interesting conclusion: while the higher-frequency parts like the Xeon Gold 6146 (165W TDP, 3.2GHz base, 4.2GHz Max Turbo Boost) may be the fastest when no AVX/AVX2/AVX-512 instructions are involved, the highest-end Platinum 8180 (205W TDP, 2.5GHz base, 3.8GHz Max Turbo Boost) is actually the same or faster across the board when AVX/AVX2/AVX-512 instructions are involved. And its larger L3 cache won't hurt to have around either.
For people asking the "which processor is the fastest I can get?" question, this can change the answer quite a bit, since at least some AVX instructions are typically interleaved into most workloads (they are built into most optimized memcpy implementations, into Java object allocation on most JVMs, etc.). Of course, the price point is a bit different too (e.g. https://ark.intel.com/compare/120481,120508,120496,124942), but if speed is what you really care about...
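To make the "AVX sneaks in everywhere" point concrete, here is a rough sketch (mine, not any particular libc's code) of the kind of 256-bit copy loop that optimized memcpy implementations fall into for medium-sized copies:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Rough sketch of a 256-bit copy inner loop; real library code also
   deals with alignment, overlapping tails, and non-temporal stores.
   The ymm loads/stores are what put the core into an AVX license. */
static void copy_avx2(uint8_t *dst, const uint8_t *src, size_t n) {
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
    for (; i < n; i++)   /* scalar tail */
        dst[i] = src[i];
}

So even a "scalar" program that copies buffers or allocates objects ends up executing some 256-bit instructions.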
As usual, YMMV and you should measure this for yourself, but the charts in this document can serve as a good starting point.
Here are some extracted charts for the higher end processors. The link above has details for many more models in more charts.
-- Gil.
P.S.: Note that these are the documented highest frequencies for the specified non-AVX, AVX2, and AVX-512 cases, but it is not clear (from this or other docs I've found so far) whether e.g. a workload that executes a mix with a small amount (say 1-2% of instructions) of AVX, AVX2, or AVX-512 instructions interleaved into otherwise non-AVX execution (as would be the case due to object allocation on most JVMs) will be clocked at the AVX2/AVX-512 clock rates, or at some effective middle ground between the AVX2/AVX-512 and the non-AVX clock rates. I've run into speculation in both directions on this subject, but have not seen actual data. If someone on the list confidently knows the answer (based on actual data or documentation), please chime in...
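If anyone wants to poke at this themselves, here's the crude, untested kind of experiment I have in mind (a sketch of my own; Linux-specific, assumes the thread is pinned to CPU 0 with taskset and that cpufreq's scaling_cur_freq is a reasonable proxy for the actual clock, which is itself only an approximation on some kernels):

/* Crude sketch: interleave one 512-bit FMA every N scalar iterations
   (N given on the command line, 0 = never) and periodically print the
   core frequency reported by cpufreq. Build: gcc -O2 -mavx512f mix.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

static long read_cur_khz(void) {
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    long khz = -1;
    if (f) { if (fscanf(f, "%ld", &khz) != 1) khz = -1; fclose(f); }
    return khz;
}

int main(int argc, char **argv) {
    long avx_every = (argc > 1) ? atol(argv[1]) : 0;
    volatile double sink = 0.0;                 /* scalar "real work" stand-in */
    __m512d acc = _mm512_set1_pd(1.0);
    const __m512d one = _mm512_set1_pd(1.000000001);

    for (long i = 1; i <= 2000000000L; i++) {
        sink += 1.0;
        if (avx_every && (i % avx_every) == 0)
            acc = _mm512_fmadd_pd(acc, one, one);   /* the sprinkled 512-bit work */
        if ((i & ((1L << 28) - 1)) == 0)
            printf("i=%ld cur_freq=%ld kHz\n", i, read_cur_khz());
    }
    double tmp[8];
    _mm512_storeu_pd(tmp, acc);                 /* keep the vector result live */
    printf("sink=%f\n", sink + tmp[0]);
    return 0;
}

Comparing the reported frequency for, say, avx_every = 0, 100, and 1 would at least show whether a ~1% sprinkle behaves like the non-AVX clock, the AVX-512 clock, or something in between.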
I helped write this article: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/ which answers some of the other questions, such as whether a single instruction causes a frequency shift.