Recent Observation on Performance of TinyJambu AEAD

Anjan Roy

Jan 4, 2023, 1:28:21 PM
to lwc-forum
Hi all,

Happy New Year. I hope you are all doing well.

During 2022, I decided to implement all NIST LWC finalists as zero-dependency, header-only C++ libraries, and a few months ago I informed the community about it. More here <https://groups.google.com/a/list.nist.gov/g/lwc-forum/c/abb6cy7jP8s/m/E6-_Kzs6AQAJ>.

Recently I was revisiting my work on the implementation of TinyJambu AEAD and came across some interesting results, which I'd like to share with you all. For the following benchmarks, I was targeting an Intel Xeon Platinum 8375C CPU @ 2.90GHz, using google-benchmark as the benchmark harness.

Because the library is implemented in C++, with the routines written as template functions, I can use std::is_constant_evaluated() for compile-time branch evaluation. When that's employed, it boosts the byte-processing bandwidth of the encrypt/decrypt routines by ~(17-25)x. More on compile-time branch evaluation here <https://en.cppreference.com/w/cpp/types/is_constant_evaluated>.
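
To make the idea concrete, the pattern looks roughly like this (a sketch, not necessarily the library's exact code; from_le_bytes is an illustrative name), assuming C++20:

#include <cstdint>
#include <cstring>
#include <type_traits>

// Interpret 4 bytes as a little-endian 32-bit word. During constant
// evaluation the shift/OR path is taken (legal in a constant
// expression); at run time std::memcpy collapses to a single load on a
// little-endian target.
constexpr uint32_t from_le_bytes(const uint8_t* bytes)
{
  if (std::is_constant_evaluated()) {
    return static_cast<uint32_t>(bytes[0]) |
           (static_cast<uint32_t>(bytes[1]) << 8) |
           (static_cast<uint32_t>(bytes[2]) << 16) |
           (static_cast<uint32_t>(bytes[3]) << 24);
  } else {
    uint32_t word;
    std::memcpy(&word, bytes, sizeof(word));
    return word;
  }
}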

Another observation I wanted to share: using std::memcpy when processing associated data / plaintext / ciphertext bytes (following the little-endian convention specified in the TinyJambu specification) on an x86_64 target CPU (little-endian), with the code compiled by GCC (with -O3 -march=native -mtune=native flags), brings another ~2x performance boost. To be more specific, when benchmarking the TinyJambu encrypt/decrypt routines with 4096 bytes of plaintext and 32 bytes of associated data on the aforementioned Intel CPU, compiling with GCC, the byte-processing bandwidth can go as high as ~4GB/sec, which is impressive. Without std::memcpy, using unrolled/auto-vectorized loops to process the little-endian bytes as 32-bit words on the same target CPU, the bandwidth is ~2GB/sec.

I also noticed that when targeting the same CPU but compiling with Clang, the generated code is not very good, which is quite clear from the results. And the std::memcpy performance boost doesn't seem to carry over to aarch64 CPU targets. You can find more about these benchmark results here <https://github.com/itzmeanjan/tinyjambu/blob/238371569855c81d47d28e99397910a84e603589/bench/README.md>.
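
For illustration, the kind of std::memcpy-based byte handling I mean looks roughly like this (a sketch; xor_word_le is an illustrative name, not the library's API). On little-endian x86_64, GCC with -O3 compiles the two std::memcpy calls down to plain unaligned load/store instructions:

#include <cstdint>
#include <cstring>

// XOR one 32-bit keystream word into four message bytes in place,
// without any alignment assumptions on the message pointer.
inline void xor_word_le(uint8_t* io, uint32_t keystream)
{
  uint32_t word;
  std::memcpy(&word, io, sizeof(word)); // load 4 little-endian bytes
  word ^= keystream;                    // encrypt or decrypt this word
  std::memcpy(io, &word, sizeof(word)); // store the result back
}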

My implementation of TinyJambu AEAD lives @ https://github.com/itzmeanjan/tinyjambu/tree/2383715. Both of the above observations are implementation details, but I wanted to draw your attention to them and get some feedback.

Thanks

Anjan Roy

h...@arnepadmos.com

Jan 5, 2023, 3:22:50 PM
to Anjan Roy, lwc-forum
Dear Anjan,

Thank you for sharing these results.

I'm not sure whether this is relevant to you (it probably depends on the
different vector instructions supported by Intel Xeon), but I shared the
following optimisation with Rhys Weatherley two years ago:

On platforms where there is no NAND instruction but only AND and INVERT
instructions (such as AVR and RISC-V), you could rewrite the round
function to use an AND instead of a NAND, and use an inverted version
of the key material. You could do this inversion at the start of the
function call, or save the inverted version and work with that. This
works because the state update function "s0^s47^(~(s70&s85))^s91^ki" can
also be written as "s0^s47^(s70&s85)^s91^(~ki)".
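
As a word-level sketch (assuming the packed 32-bit state representation
from the reference implementation; feedback32 and inv_key_word are
illustrative names, with inv_key_word = ~key_word precomputed once by
the caller):

#include <cstdint>

// One feedback computation with the complement folded into the key:
// ~(s70 & s85) ^ ki == (s70 & s85) ^ ~ki, so a plain AND suffices.
inline uint32_t feedback32(const uint32_t state[4], uint32_t inv_key_word)
{
  const uint32_t s47 = (state[1] >> 15) | (state[2] << 17);
  const uint32_t s70 = (state[2] >> 6)  | (state[3] << 26);
  const uint32_t s85 = (state[2] >> 21) | (state[3] << 11);
  const uint32_t s91 = (state[2] >> 27) | (state[3] << 5);
  return state[0] ^ s47 ^ (s70 & s85) ^ s91 ^ inv_key_word;
}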

In the ARM Cortex M3 assembly code versions of TinyJAMBU, this change
led to an implementation 'about 5% faster on average, and sometimes up
to 10% faster for some variants and packet sizes'. Might be worth
exploring for Intel Xeon as well (and/or any of the other processor
families that you have been benchmarking).

Regards,
Arne
