--
To unsubscribe from this group, send email to lwc-forum+...@list.nist.gov
Visit this group at https://groups.google.com/a/list.nist.gov/d/forum/lwc-forum
Hello Anjan and everyone,
I have done some more programming of Grain utilising AVX512 and managed to improve on my own fastest previous code by 50%; it now runs at 2.3 Gbps. While this is a good exercise, we should bear in mind that Grain is a hardware-oriented cipher, so I did not have high expectations for the software side. Still, there is room for a number of interesting tricks, and the final performance looks decent.
I focused mainly on improving the keystream (preoutput) function, but I also made some minor improvements to other parts -- thanks a lot for pointing me to the _pext_u64() instruction, I did not know about it. I also removed the GF2 instructions by using vpshufbitqmb for the 64-bit reversal instead, so there is no longer any dependency on GFNI.
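Since the actual AVX512 sources are in the attached zip, here is a small portable C sketch of the two bit-level operations mentioned above. The function names are mine, not from the attachment; the bodies are plain-C reference models of what _pext_u64() and a vpshufbitqmb-based 64-bit reversal compute:

```c
#include <stdint.h>

/* Portable reference model of the BMI2 _pext_u64() intrinsic:
 * gathers the bits of `src` selected by `mask` into the low bits
 * of the result, preserving their relative order. */
static uint64_t pext64_ref(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bb = 1; mask != 0; bb <<= 1) {
        if (src & mask & -mask)   /* lowest set bit of mask selects a source bit */
            result |= bb;
        mask &= mask - 1;         /* clear the lowest set bit of the mask */
    }
    return result;
}

/* Scalar model of a full 64-bit bit reversal, the operation the post
 * implements with vpshufbitqmb (AVX512-BITALG) instead of GFNI. */
static uint64_t bitrev64_ref(uint64_t x) {
    x = ((x & 0x5555555555555555ULL) << 1)  | ((x >> 1)  & 0x5555555555555555ULL);
    x = ((x & 0x3333333333333333ULL) << 2)  | ((x >> 2)  & 0x3333333333333333ULL);
    x = ((x & 0x0F0F0F0F0F0F0F0FULL) << 4)  | ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL);
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}
```

The hardware instructions of course do this in a handful of cycles; the models are only meant to pin down the semantics.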
In the attached zip you will find the new code and 5 implementation attempts of the keystream (preoutput) generation. The current settings in the sources select the fastest code (version 5). The updated performance table is as follows:
A short description of what I have tried in 5 versions:
V1. This was my very first attempt to use full 512-bit registers with a better alignment of the 64-bit input arguments. I made heavy use of the instruction vpshrdvq, which extracts 8x64-bit shifted windows from two zmm registers holding the high and low 64-bit halves of 128-bit lanes, all in a single call. I also used parallel calls for ternary logic (vpternlogq) and k-masks -- the latter are an AVX512 feature with their own register bank. I tested my own expansion of the k-masks, and I also tried letting the compiler expand these k-registers. The speed achieved was around 1800 Mbps.
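For those who have not used these instructions, the following plain-C models (my own illustrative names, not code from the zip) show the per-element semantics that vpshrdvq and vpternlogq provide; the real code applies them to 8 qwords per zmm register at once:

```c
#include <stdint.h>

/* Per-element model of vpshrdvq: concatenate hi:lo into a 128-bit
 * value, shift right by the (mod-64) shift count, keep the low 64 bits. */
static uint64_t shrdv64_ref(uint64_t lo, uint64_t hi, unsigned shift) {
    shift &= 63;
    if (shift == 0)
        return lo;                       /* avoid the undefined hi << 64 */
    return (lo >> shift) | (hi << (64 - shift));
}

/* Per-element model of vpternlogq: for every bit position, the three
 * input bits form a 3-bit index into the truth table `imm`. */
static uint64_t ternlog64_ref(uint64_t a, uint64_t b, uint64_t c, uint8_t imm) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)((((a >> i) & 1) << 2) |
                                  (((b >> i) & 1) << 1) |
                                   ((c >> i) & 1));
        r |= (uint64_t)((imm >> idx) & 1) << i;
    }
    return r;
}
```

For example, imm = 0x96 gives a three-way XOR and imm = 0xE8 gives a majority function, each replacing two bitwise instructions with one.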
V2. In v1 I realised that the biggest problem is that most of the instructions are spent aligning the input arguments into 512-bit registers, while the logic part is short. So I tried a program where I do the alignment in x64 style, but this did not help: the speed dropped to only 638 Mbps. The lesson learnt is that moving data between RAM and registers is quite costly.
V3. This is a refactored version of v1 where reads/writes from RAM are better optimized. I also started using the instruction vpermq for quicker data alignment within registers; although it has a latency of 3, its throughput is 1, so it should still be fine. However, there was no significant speedup over v1.
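As a reminder of what vpermq does in its index-vector form, here is a scalar sketch (an illustrative model, not code from the attachment) of the 8-qword cross-lane permute it performs on a zmm register:

```c
#include <stdint.h>

/* Scalar model of the index-vector form of vpermq on a zmm register:
 * each of the 8 output qwords selects any input qword, using the low
 * 3 bits of the corresponding index element. */
static void permq_ref(const uint64_t src[8], const uint64_t idx[8],
                      uint64_t dst[8]) {
    for (int i = 0; i < 8; i++)
        dst[i] = src[idx[i] & 7];
}
```

Because it is a full cross-lane shuffle, one vpermq can replace a chain of shift/blend steps when gathering qwords into position, which is why it pays off despite the latency of 3.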
V4. This was my last attempt to generate 64 bits of the preoutput with all the knowledge gathered in the previous attempts, focusing on reducing the number of k-masks. That effectively saved some clocks, but still gave no visible improvement over v1 and v3.
V5. In all previous versions I was computing 64 bits of the preoutput at a time, and thus had to keep wide 64-bit values, so that the computation of the upper 32 bits is shorter than the computation of the lower half. In this very last attempt I wanted to see what my best result would be if we focus only on generating 32 bits of the preoutput, obtaining 64 bits by simply calling the function twice. I used instruction stitching and interleaving techniques, leveraging the instructions' throughput rather than their latency, and used only 4 k-masks. Overall this gave an impressive 2377 Mbps -- much better than when I tried to produce 64 bits in one go. I have provided more comments and details in the v5 code itself.
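To illustrate the interleaving idea in isolation (this is a toy mixing function, not Grain's preoutput), the sketch below advances two independent 32-bit chains in lockstep; since the chains have no data dependency on each other, a superscalar core can overlap them, so the cost is bounded by instruction throughput rather than one chain's latency:

```c
#include <stdint.h>

/* Toy 32-bit mixing chain (xorshift-style), standing in for a
 * hypothetical 32-bit half of the preoutput computation. */
static uint32_t mix32(uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/* Produces 64 output bits by running the 32-bit function twice,
 * with the two chains written interleaved so the compiler/CPU can
 * schedule corresponding steps in parallel. */
static uint64_t mix64_interleaved(uint32_t a, uint32_t b) {
    uint32_t x = a, y = b;
    x ^= x << 13;  y ^= y << 13;   /* step 1 of both chains */
    x ^= x >> 17;  y ^= y >> 17;   /* step 2 of both chains */
    x ^= x << 5;   y ^= y << 5;    /* step 3 of both chains */
    return ((uint64_t)x << 32) | y;
}
```

The interleaved version computes exactly the same result as two sequential calls; the win comes purely from instruction-level parallelism.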
Best regards,
/Alexander