Benchmarking SLH-DSA STARK Aggregation

remix7531

Apr 13, 2026, 5:27:57 AM
to bitco...@googlegroups.com
Hi all,

Following Ethan Heilman's "Post Quantum Signatures and Scaling Bitcoin"
post [0], which proposed using STARKs to aggregate PQ signatures per
block and raised the concern that proof generation could give large
miners an unfair advantage if too expensive, I ran some benchmarks to
put numbers on this.

Full write-up with charts:
https://remix7531.com/post/slh-dsa-stark-bench/

I built a proof-of-concept [1] that aggregates N SLH-DSA-SHA2-128s (FIPS
205) signature verifications into a single STARK proof using RISC Zero's
zkVM with its SHA-256 precompile.

Results (wall-clock proving time, succinct proofs):

  N      RTX 5090      B200         CPU (Ryzen 8640U)   Proof size
  1      4.1 s         4.2 s        14 min 17 s         218 KiB
  8      28.9 s        19.5 s       1 h 14 min          222 KiB
  64     3 min 31 s    2 min 33 s   --                  247 KiB
  512    26 min 28 s   20 min 3 s   --                  454 KiB

Key findings:
- Proving scales roughly linearly with N.
- Amortized cost at N=512: ~3.1 s/sig on RTX 5090, ~2.3 s/sig on B200.
- Proof size grows sublinearly: 218 KiB (N=1) to 454 KiB (N=512),
  vs 3.8 MiB of raw signatures at N=512.
- Verification is constant at ~12-15 ms regardless of N.
- B200 is only 1.3x faster than RTX 5090 at N=512. The workload is
  compute-bound; RISC Zero caps segment size at 2^22 cycles (PO2 = 22).
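
For anyone who wants to re-derive the per-signature figures, here is a small Python sketch that recomputes them from the table above. The 7,856-byte signature size is the SLH-DSA-SHA2-128s value from FIPS 205, not something measured in the benchmark; the helper names are mine and purely illustrative.

```python
# Recompute the derived figures from the results table.
# Assumption: 7,856 bytes per SLH-DSA-SHA2-128s signature (FIPS 205).
SIG_BYTES = 7856


def secs(h=0, m=0, s=0.0):
    """Convert hours/minutes/seconds to seconds."""
    return h * 3600 + m * 60 + s


# N: (RTX 5090 seconds, B200 seconds, proof size in KiB)
results = {
    1: (4.1, 4.2, 218),
    8: (28.9, 19.5, 222),
    64: (secs(m=3, s=31), secs(m=2, s=33), 247),
    512: (secs(m=26, s=28), secs(m=20, s=3), 454),
}


def per_sig(n):
    """Amortized proving seconds per signature on (RTX 5090, B200)."""
    rtx, b200, _ = results[n]
    return rtx / n, b200 / n


def speedup(n):
    """How many times faster the B200 is than the RTX 5090 at batch size n."""
    rtx, b200, _ = results[n]
    return rtx / b200


def raw_sig_kib(n):
    """Size of n raw signatures in KiB."""
    return n * SIG_BYTES / 1024


if __name__ == "__main__":
    for n, (_, _, proof_kib) in results.items():
        rtx_ps, b200_ps = per_sig(n)
        print(
            f"N={n}: {rtx_ps:.2f} s/sig (5090), {b200_ps:.2f} s/sig (B200), "
            f"B200 speedup {speedup(n):.2f}x, "
            f"raw sigs {raw_sig_kib(n):.0f} KiB vs proof {proof_kib} KiB"
        )
```

The per-signature cost is noticeably higher at small N, so the headline figures are best read as the N=512 amortized values.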

At 3.1 s/sig, proving a full block on a single RTX 5090 would take over
2 hours. That is too slow as-is, but a general-purpose zkVM only gives
an upper bound on proving cost. Several things could improve this:

1. Dedicated AIR and prover: S-two's benchmarks [2] show their prover
   running SHA-256 chains up to 85x faster than RISC Zero's SHA-256
   precompile on CPU. SLH-DSA verification has overhead beyond SHA-256
   that is not accelerated, so the real-world speedup is unclear.

   What speedup could we realistically expect from a custom AIR and
   prover built specifically for SLH-DSA verification? I would love
   to hear from someone with more experience building STARK provers.

2. Preprocessing: if transactions are proven as they enter the
   mempool and proofs are aggregated recursively, most proving work
   shifts to before the block is mined. Only a final aggregation step
   remains. This needs clever batching algorithms, probably grouping
   by fee level.

   How much of the per-block proving cost could preprocessing
   realistically eliminate?

3. Multi-GPU: STARK segment proving is embarrassingly parallel, and
   RISC Zero has experimental multi-GPU support. A cluster of k GPUs
   should cut segment-proving time by roughly a factor of k.
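
To put rough numbers on the multi-GPU option, here is a back-of-the-envelope Python sketch. It assumes the linear ~3.1 s/sig RTX 5090 cost from the results above and an idealized even split across GPUs. The signature count per block, the target time, and the `efficiency` factor are hypothetical inputs, not measurements, and the cost of the final aggregation step is ignored.

```python
import math

# Measured amortized cost on one RTX 5090 at N=512 (from the results above).
SECS_PER_SIG_RTX5090 = 3.1


def single_gpu_seconds(num_sigs, secs_per_sig=SECS_PER_SIG_RTX5090):
    """Proving time on one GPU under the linear-scaling model."""
    return num_sigs * secs_per_sig


def gpus_needed(num_sigs, target_secs,
                secs_per_sig=SECS_PER_SIG_RTX5090, efficiency=1.0):
    """GPUs needed to prove num_sigs within target_secs.

    efficiency is a hypothetical parallel-scaling factor in (0, 1];
    1.0 means the work divides perfectly across GPUs.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    total = single_gpu_seconds(num_sigs, secs_per_sig)
    return math.ceil(total / (target_secs * efficiency))


if __name__ == "__main__":
    # Hypothetical workload: 2,400 signatures, 10-minute proving target.
    sigs, target = 2400, 600
    print(f"single GPU: {single_gpu_seconds(sigs) / 3600:.2f} h")
    print(f"GPUs at 100% efficiency: {gpus_needed(sigs, target)}")
    print(f"GPUs at 80% efficiency:  {gpus_needed(sigs, target, efficiency=0.8)}")
```

Real clusters will lose some time to scheduling and to the recursive join of segment proofs, so treat the ideal GPU count as a lower bound.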

Kudinov and Nick's Bitcoin-optimized SPHINCS+ [3] reduces SHA-256
compression calls by roughly 3x, which would also reduce the number
of cycles a STARK prover needs per signature. That said, I lean
toward sticking with NIST-standardized SLH-DSA for the ecosystem
benefits (vetted implementations, HSM support, hardware acceleration
path) and letting miners run a larger GPU cluster to compensate, but
that is a trade-off worth discussing.


Best
remix7531


[0] https://groups.google.com/g/bitcoindev/c/wKizvPUfO7w
[1] https://github.com/remix7531/slh-dsa-stark-bench
[2] https://docs.starknet.io/learn/S-two-book/benchmarks
[3] https://eprint.iacr.org/2025/2203

