Dear OpenSSL Dev Team,
We are a group of researchers from the University of Edinburgh and the Indian Institute of Technology Gandhinagar, who are working on developing a novel algorithm that employs SIMD instructions to significantly accelerate large-integer addition (BN_add()) and subtraction (BN_sub()), and we would like to contribute this work to the OpenSSL project.
The Challenge: Serial Carry PropagationThe large-integer addition in OpenSSL is currently implemented with serial add-with-carry instructions. This carry-propagation dependency prevents the direct use of data-parallel SIMD instructions, limiting throughput.
Our SIMD-based SolutionOur algorithm parallelizes large-integer arithmetic to leverage SIMD units.
Implementation: We've coded the algorithm in C using AVX512 intrinsics for x86-64 CPUs. Ours is a generic implementation that does not perform any microarchitectural optimizations, unlike other libraries such as the GNU Multiple Precision Library (GMP), and it gives good performance across the CPUs used for experimentation.
Scalability: The approach is scalable to any vector length (128-bit, 256-bit), but we've observed the best performance gains with AVX512.
Integration: We have successfully integrated this into OpenSSL 3.5.2 redirecting calls within BN_add() (specifically bn_add_n()) to our optimized functions.
We benchmarked our modified OpenSSL 3.5.2 against the original build on four different CPUs.
BN_add() Speedups
On average, we achieve ~2.4x speedup across operand sizes. The figure below shows the speedups across the four CPUs over operand sizes ranging from 256 to 65536 bits.

Correctness: We validated our implementation using the full OpenSSL test suite, and all tests passed. We are also in the process of formally verifying its correctness.
BN_sub(): We observed similar, significant speedups for subtraction.
Instruction Count: We also observe up to 60% instruction count reduction for both operations.
BN_mul(): Because multiplication recursively relies on addition/subtraction (e.g., for Karatsuba), our changes resulted in an average performance improvement of 13% for BN_mul() across various operand sizes.
We have submitted this work to a Systems conference and are eager to contribute the code to future OpenSSL builds.
We can provide a patch, a link to our fork, or the conference paper for your review. Please let us know the best way to proceed.
Looking forward to hearing from you.
--
Subhrajit Das
PhD Scholar
Computer Science and Engineering
Indian Institute of Technology, Gandhinagar
On average, we achieve ~2.4x speedup across operand sizes. The figure below shows the speedups across the four CPUs over operand sizes ranging from 256 to 65536 bits
Validation and Broader Impact
Correctness: We validated our implementation using the full OpenSSL test suite, and all tests passed. We are also in the process of formally verifying its correctness.
BN_sub(): We observed similar, significant speedups for subtraction.
Instruction Count: We also observe up to 60% instruction count reduction for both operations.
BN_mul(): Because multiplication recursively relies on addition/subtraction (e.g., for Karatsuba), our changes resulted in an average performance improvement of 13% for BN_mul() across various operand sizes.
We have submitted this work to a Systems conference and are eager to contribute the code to future OpenSSL builds.
We can provide a patch, a link to our fork, or the conference paper for your review. Please let us know the best way to proceed.
Looking forward to hearing from you.
--
Subhrajit Das
PhD Scholar
Computer Science and Engineering
Indian Institute of Technology, Gandhinagar
--
You received this message because you are subscribed to the Google Groups "openssl-project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openssl-proje...@openssl.org.
To view this discussion visit https://groups.google.com/a/openssl.org/d/msgid/openssl-project/0ecd214c-bb58-47c8-affa-3317bd948e80n%40openssl.org.