Hi all,
Following Ethan Heilman's "Post Quantum Signatures and Scaling Bitcoin"
post [0], which proposed using STARKs to aggregate PQ signatures per
block and raised the concern that proof generation could give large
miners an unfair advantage if too expensive, I ran some benchmarks to
put numbers on this.
Full write-up with charts:
https://remix7531.com/post/slh-dsa-stark-bench/
I built a proof-of-concept [1] that aggregates N SLH-DSA-SHA2-128s (FIPS
205) signature verifications into a single STARK proof using RISC Zero's
zkVM with its SHA-256 precompile.
Results (wall-clock proving time, succinct proofs):

  N    RTX 5090      B200          CPU (Ryzen 8640U)  Proof size
  1    4.1 s         4.2 s         14 min 17 s        218 KiB
  8    28.9 s        19.5 s        1 h 14 min         222 KiB
  64   3 min 31 s    2 min 33 s    --                 247 KiB
  512  26 min 28 s   20 min 3 s    --                 454 KiB

Key findings:
- Proving scales roughly linearly with N.
- ~3.1 s/sig on RTX 5090, ~2.3 s/sig on B200.
- Proof size grows sublinearly: 218 KiB (N=1) to 454 KiB (N=512),
vs 3.8 MiB of raw signatures at N=512.
- Verification is constant at ~12-15 ms regardless of N.
- B200 is only 1.3x faster than RTX 5090. The workload is
compute-bound; RISC Zero limits the segment size to 2^22 cycles
(PO2 = 22).
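
The per-signature and size figures above can be re-derived from the table with a few lines of Python. This is only a sanity-check sketch: the timings and proof sizes are copied from the table, the 7856-byte signature size is the SLH-DSA-SHA2-128s value from FIPS 205, and the `results` dict and `summarize` helper are names made up for this example.

```python
# Sanity check of the key findings, using numbers copied from the table.
SIG_BYTES = 7856  # SLH-DSA-SHA2-128s signature size in bytes (FIPS 205)

# N -> (RTX 5090 seconds, B200 seconds, proof size in KiB)
results = {
    1: (4.1, 4.2, 218),
    8: (28.9, 19.5, 222),
    64: (3 * 60 + 31, 2 * 60 + 33, 247),
    512: (26 * 60 + 28, 20 * 60 + 3, 454),
}


def summarize(n):
    """Derive per-signature cost and size savings for one table row."""
    rtx_s, b200_s, proof_kib = results[n]
    raw_kib = n * SIG_BYTES / 1024  # size of N raw signatures
    return {
        "rtx_s_per_sig": rtx_s / n,
        "b200_s_per_sig": b200_s / n,
        "b200_speedup": rtx_s / b200_s,
        "raw_kib": raw_kib,
        "compression": raw_kib / proof_kib,
    }


for n in results:
    s = summarize(n)
    print(
        f"N={n:>3}: {s['rtx_s_per_sig']:.2f} s/sig (RTX 5090), "
        f"{s['b200_s_per_sig']:.2f} s/sig (B200), "
        f"B200 speedup {s['b200_speedup']:.2f}x, "
        f"raw {s['raw_kib'] / 1024:.2f} MiB vs proof {results[n][2]} KiB"
    )
```

At small N the fixed proving overhead dominates, so the per-signature numbers quoted above are the N=512 values.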
At 3.1 s/sig, proving a full block on a single RTX 5090 would take over
2 hours. That is too slow as-is, but this figure is an upper bound
obtained with a general-purpose zkVM. Several things could improve this:
1. Dedicated AIR and prover: S-two's benchmarks [2] show their prover
running SHA-256 chains up to 85x faster than RISC Zero's SHA-256
precompile on CPU. SLH-DSA verification has overhead beyond SHA-256
that is not accelerated, so the real-world speedup is unclear.
What speedup could we realistically expect from a custom AIR and
prover built specifically for SLH-DSA verification? I would love
to hear from someone with more experience building STARK provers.
2. Preprocessing: if transactions are proven as they enter the
mempool and proofs are aggregated recursively, most proving work
shifts to before the block is mined. Only a final aggregation step
remains. This needs clever batching algorithms, probably grouping
by fee level.
How much of the per-block proving cost could preprocessing
realistically eliminate?
3. Multi-GPU: STARK segment proving is embarrassingly parallel. RISC
Zero has experimental multi-GPU support. A cluster divides the
workload proportionally.
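
To get a feel for the multi-GPU point, here is a rough back-of-the-envelope sketch. It assumes ideal linear scaling across GPUs and ignores the final aggregation step, so it is a lower bound on cluster size, not a measurement. The 3.1 s/sig figure is the RTX 5090 number from above; the 2400-signature block and the 600 s budget (one average block interval) are hypothetical inputs picked for illustration, and both helper functions are made up for this example.

```python
import math

S_PER_SIG_RTX5090 = 3.1  # measured per-signature proving time from above


def proving_time_s(num_sigs, num_gpus=1, s_per_sig=S_PER_SIG_RTX5090):
    """Idealized wall-clock proving time, assuming perfect linear scaling."""
    return num_sigs * s_per_sig / num_gpus


def gpus_needed(num_sigs, budget_s, s_per_sig=S_PER_SIG_RTX5090):
    """Smallest GPU count that fits the budget under the same assumption."""
    return math.ceil(num_sigs * s_per_sig / budget_s)


# Hypothetical block with 2400 signatures and a 600 s proving budget.
sigs = 2400
print(f"1 GPU: {proving_time_s(sigs) / 3600:.2f} h")
print(f"GPUs for a 600 s budget: {gpus_needed(sigs, 600)}")
```

Real scaling will be worse than this, since recursion and the final join are not free, and any speedup from a dedicated prover or from mempool preprocessing would shrink the numbers accordingly.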
Kudinov and Nick's Bitcoin-optimized SPHINCS+ [3] reduces SHA-256
compression calls by roughly 3x, which would also reduce the number
of cycles a STARK prover needs per signature. That said, I lean
toward sticking with NIST-standardized SLH-DSA for the ecosystem
benefits (vetted implementations, HSM support, hardware acceleration
path) and letting miners run a larger GPU cluster to compensate.
That trade-off is worth discussing.
Best,
remix7531
[0] https://groups.google.com/g/bitcoindev/c/wKizvPUfO7w
[1] https://github.com/remix7531/slh-dsa-stark-bench
[2] https://docs.starknet.io/learn/S-two-book/benchmarks
[3] https://eprint.iacr.org/2025/2203