Hi PQC forum, this is my first post here, so please go easy on me.
I'm a freelance cryptographic engineer and open-source researcher. Recently I spent some time surveying techniques used to optimize SLH-DSA. Based on my findings, I produced what I believe to be the fastest open-source CPU implementation of SLH-DSA available anywhere. It is approximately 5x-10x faster than the SPHINCS+ team's AVX2 reference code, at least on my test machines. It can also leverage GPUs if available, in which case the speedup is even more pronounced.
I'm writing to you all here because I'd like to fact-check myself on that bold claim, and to see whether there is any parallel research ongoing which might support or refute my findings.
I wrote an article, published on my personal blog, which explains the methodology. TL;DR: I used an open graphics and compute API called Vulkan, typically used by video game developers. Vulkan can compile and execute compute shaders on CPU and/or GPU devices, maximizing parallelism and apparently making better use of available CPU hardware resources than handwritten AVX2 and multithreading can, which surprised me. Vulkan is similar to OpenCL, which I also tested, but Vulkan seems to perform much better on the same hardware, at the cost of being far more verbose and harder to learn.
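For a rough sense of how much independent work a compute shader has available, here is a back-of-the-envelope count of WOTS+ hash chains in one SLH-DSA-SHA2-128s signature. The parameter values and the len formula are taken from FIPS 205. The function and variable names are made up for illustration, and the count ignores PRF calls, tree hashing and FORS. It is only parameter arithmetic, not a description of how any particular implementation schedules its work.

```python
import math

# SLH-DSA-SHA2-128s parameters as listed in FIPS 205.
N = 16      # hash output length in bytes
H = 63      # total hypertree height
D = 7       # number of XMSS layers
LG_W = 4    # bits per WOTS+ digit, so w = 16


def wots_len(n: int, lg_w: int) -> int:
    """Number of hash chains in one WOTS+ key pair (len = len1 + len2)."""
    w = 2 ** lg_w
    len1 = math.ceil(8 * n / lg_w)
    len2 = math.floor(math.log2(len1 * (w - 1)) / lg_w) + 1
    return len1 + len2


chains_per_leaf = wots_len(N, LG_W)
leaves_per_tree = 2 ** (H // D)          # each XMSS tree has 2^(h/d) leaves
chains_per_tree = leaves_per_tree * chains_per_leaf
chains_per_signature = D * chains_per_tree

# Each chain is w - 1 sequential calls to F, but separate chains do not
# depend on each other, so they can be evaluated in parallel.
steps_per_chain = 2 ** LG_W - 1
f_calls_in_chains = chains_per_signature * steps_per_chain

print(chains_per_leaf, chains_per_tree, chains_per_signature, f_calls_in_chains)
```

With these parameters, each WOTS+ key has 35 chains, each XMSS tree has 17,920 chains, and signing touches 125,440 chains across the 7 layers. Only the 15 steps inside each chain have to run sequentially.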
There are some caveats. To save time and reduce scope, my code focuses only on the SHA2-128s parameter set, though in theory these techniques could be applied to any parameter set. Also, Vulkan shaders must be compiled at runtime on the device that executes them, which results in a noticeable startup penalty. This can be mitigated through caching.
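To illustrate the caching idea in general terms, here is a minimal sketch of a compile-once disk cache keyed by device and shader source. Everything in it is hypothetical: `cached_compile`, `fake_compile` and the device string are stand-ins, and no real Vulkan call is made. In Vulkan itself, the usual mechanism is a VkPipelineCache whose contents are read out with vkGetPipelineCacheData and passed back in through VkPipelineCacheCreateInfo on the next run.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_compile(source: bytes, device_id: str, cache_dir: Path, compile_fn):
    """Return the compiled blob for (device, source), compiling at most once."""
    # Key on both the device and the source, since a compiled blob
    # is only valid for the device it was built on.
    key = hashlib.sha256(device_id.encode() + b"\0" + source).hexdigest()
    path = cache_dir / f"{key}.bin"
    if path.exists():
        return path.read_bytes()          # cache hit: skip compilation
    blob = compile_fn(source)             # cache miss: pay the startup cost
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(blob)
    return blob


# Hypothetical stand-in for the expensive driver-side compilation step.
calls = []


def fake_compile(source: bytes) -> bytes:
    calls.append(source)
    return b"compiled:" + source


with tempfile.TemporaryDirectory() as tmp:
    first = cached_compile(b"shader-source", "gpu-0", Path(tmp), fake_compile)
    second = cached_compile(b"shader-source", "gpu-0", Path(tmp), fake_compile)
```

The second call returns the same blob without invoking the compile function again, which is the effect the cache is meant to have on startup time.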
I hope someone finds this useful!
regards,
conduition