[TL;DR: broad consensus on base64url encoding; more discussion needed on the
hash and the CRC; new version of the swhid-rs binary supports BLAKE3]
Dear all,
Thanks to everyone who contributed so far to the discussion on the SWHID v2 exploration.
Below is a structured recap of the key points raised, followed by an update on the experimental tooling.
1) Scope recap: what we are exploring
We have been exploring two (largely independent) design choices for “SWHID v2”:
- Hash algorithm (e.g., SHA-256, possibly others for comparison)
- Digest encoding / serialization (e.g., hex, base64url, ...)
To make the discussion concrete, we shared an experimental variant of the Rust
implementation (swhid-rs) that allows selecting the hash and encoding at runtime,
and we provided dashboards to track test results.
2) Hash algorithm: positions and arguments
2.1 SHA-256 as the leading baseline
- Strong support for SHA-256 as the v2 baseline was expressed (Miguel, Stefano).
- Main reasons cited: it is a widely deployed standard, cryptographically
“battle-tested”, aligned with the rationale behind Git’s post-SHA-1
transition, and accelerated in practice on common hardware.
2.2 “Faster hashes” (BLAKE3) as a potential alternative / future version
- Alexios raised the desirability of a faster cryptographic hash for
high-volume use cases (mentioning BLAKE3), while noting that Git’s 2018
choice predates BLAKE3 (published later).
- A key point in that exchange: Git-compatibility is an argument, but its
practical impact may depend on how/when SHA-256 repositories become widely
supported by forges and ecosystems.
- Stefano replied that “newer” algorithms can be less battle-tested from a
cryptanalysis perspective, which is a separate argument in favor of SHA-256, independent of Git.
3) Digest encoding: positions, constraints, and open trade-offs
3.1 Strong preference for base64url
- Miguel recommended mandating base64url (single canonical encoding) to keep
the ecosystem simpler and avoid ambiguities that arise when trying to deduce encoding from the string itself.
- Stefano also leaned toward base64url as a good balance between compactness and practical usability.
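To make the compactness argument concrete, here is a minimal Python sketch
that prints the same SHA-256 digest in hex and in base64url. It hashes raw
bytes only and does not reproduce the actual SWHID object hashing; the helper
name encode_digest and the choice to strip the "=" padding are assumptions
made for illustration, not something decided in the thread.

```python
import base64
import hashlib


def encode_digest(data: bytes) -> dict[str, str]:
    """Return the SHA-256 digest of `data` in hex and in unpadded base64url."""
    digest = hashlib.sha256(data).digest()  # 32 raw bytes
    return {
        "hex": digest.hex(),
        # Assumption: padding is stripped, as is common for base64url in URLs.
        "base64url": base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii"),
    }


encodings = encode_digest(b"hello world\n")
for name, text in encodings.items():
    # A 32-byte digest takes 64 characters in hex and 43 in unpadded base64url.
    print(f"{name:10s} {len(text):2d} chars  {text}")
```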
3.2 Should we allow multiple encodings?
- If multiple encodings are allowed, Alexios strongly argued that the SWHID
must carry explicit information indicating which serialization/encoding is
used (and suggested a way to encode that in the version field).
- Stefano is still unsure: allowing only one encoding simplifies things,
but choosing a compact one (base64url) means losing direct string-level
comparison with Git’s usual hex-encoded SHA-256; comparison would then require
decode/encode tooling. This is not a real blocker for tools, but it may be a
usability consideration for human users.
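The decode/encode step mentioned above is small. As a rough sketch (the
function names are hypothetical, and unpadded base64url is assumed), a
converter between the two textual forms could look like this in Python:

```python
import base64
import hashlib


def base64url_to_hex(text: str) -> str:
    """Decode an unpadded base64url digest and return it as lowercase hex."""
    padding = "=" * (-len(text) % 4)  # restore the padding the decoder expects
    return base64.urlsafe_b64decode(text + padding).hex()


def hex_to_base64url(hex_digest: str) -> str:
    """Encode a hex digest as unpadded base64url."""
    raw = bytes.fromhex(hex_digest)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Example: a hex digest, as a tool like Git would print it, survives the round trip.
git_style_hex = hashlib.sha256(b"").hexdigest()
compact = hex_to_base64url(git_style_hex)
print(compact, "->", base64url_to_hex(compact))
```

Since both forms decode to the same bytes, tools can compare digests after
decoding; only the comparison by eye is lost.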
3.3 z85: broad opposition
- Nicolas (and also Alexios) advised against z85: its alphabet contains
characters that complicate parsing and URL handling (and create more
“delimitation” hazards than needed), for relatively small gains over base64url.
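The objection can be checked mechanically. The sketch below lists which
characters of each alphabet fall outside the RFC 3986 "unreserved" set, and
compares encoded lengths for a 32-byte digest. The Z85 alphabet is the one
from the ZeroMQ Z85 specification; the helper name needs_escaping is
hypothetical.

```python
import string

# Z85 alphabet as defined in the ZeroMQ Z85 specification (85 characters).
Z85_ALPHABET = (
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".-:+=^!/*?&<>()[]{}@%$#"
)
BASE64URL_ALPHABET = string.ascii_letters + string.digits + "-_"

# RFC 3986 "unreserved" characters: safe anywhere in a URL without escaping.
URL_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def needs_escaping(alphabet: str) -> str:
    """Return the characters of `alphabet` that are not URL-unreserved."""
    return "".join(c for c in alphabet if c not in URL_UNRESERVED)


z85_risky = needs_escaping(Z85_ALPHABET)
print("z85 characters outside the unreserved set:", z85_risky)
print("base64url characters outside the unreserved set:",
      repr(needs_escaping(BASE64URL_ALPHABET)))

# Encoded length of a 32-byte digest.
DIGEST_BYTES = 32
z85_len = DIGEST_BYTES // 4 * 5             # Z85 maps 4 bytes to 5 characters
base64url_len = -(-DIGEST_BYTES * 8 // 6)   # ceiling division, no padding
print("z85:", z85_len, "chars; base64url:", base64url_len, "chars")
```

Note that ":" is part of the Z85 alphabet and is also the field separator in
current SWHIDs, which is one example of the delimitation hazard. The gain is
40 characters versus 43 for a 32-byte digest.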
3.4 hex: acknowledged advantages
- Even where base64url is preferred, hex was recognized as having practical
benefits: small and unambiguous character set, and direct comparability with
“typical” hex encodings used elsewhere (e.g., Git).
4) Error detection / CRC: interesting idea, but needs a clearer case
- Miguel proposed adding an internal consistency check, e.g., appending a
CRC32 over the decoded digest bytes, to help detect typos/transport
errors (especially relevant for compact encodings).
- Nicolas found the idea interesting, particularly as a mitigation if we pick compact encodings.
- Stefano suggested that we:
  - build a stronger case (do we have evidence of real-world
    mistyping/transport corruption that needs in-band detection? will v2 make it worse?),
  - consider whether this protection must be in-band (inside the identifier)
    vs. out-of-band (e.g., protected by the container like an SBOM checksum),
  - likely avoid multiple CRC types; and
  - consider making it optional (possibly via qualifiers rather than core fields).
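For reference, here is one possible reading of the CRC proposal as a Python
sketch: a CRC32 computed over the raw digest bytes is appended to them before
base64url encoding. The layout (4 big-endian bytes after the digest) and the
function names are assumptions for illustration only; the thread has not
settled on any format.

```python
import base64
import hashlib
import zlib

CRC_LEN = 4  # CRC32 is 32 bits


def encode_with_crc(digest: bytes) -> str:
    """Append CRC32(digest) to the digest, then encode as unpadded base64url."""
    crc = zlib.crc32(digest).to_bytes(CRC_LEN, "big")
    return base64.urlsafe_b64encode(digest + crc).rstrip(b"=").decode("ascii")


def decode_and_check(text: str) -> bytes:
    """Decode, verify the trailing CRC32, and return the digest bytes."""
    padded = text + "=" * (-len(text) % 4)
    raw = base64.urlsafe_b64decode(padded)
    digest, crc = raw[:-CRC_LEN], raw[-CRC_LEN:]
    if zlib.crc32(digest).to_bytes(CRC_LEN, "big") != crc:
        raise ValueError("CRC mismatch: identifier is corrupted or mistyped")
    return digest


digest = hashlib.sha256(b"hello world\n").digest()
protected = encode_with_crc(digest)
assert decode_and_check(protected) == digest

# Simulate a typo: replace the first character with a different valid one.
typo = ("A" if protected[0] != "A" else "B") + protected[1:]
try:
    decode_and_check(typo)
except ValueError as err:
    print("typo detected:", err)
```

In this layout the in-band check costs 5 extra characters for a SHA-256
digest (48 instead of 43), which is part of what a stronger case would need
to justify.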
5) Related prior art: multiformats (multihash / multibase)
- David suggested looking at multiformats’ multihash and multibase as
potential inspiration for configurable hash/encoding, but noted the general
tension between “configurable” schemes and persistent identifiers, while still
seeing value in reusing well-known building blocks.
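For readers unfamiliar with multiformats, the following Python sketch shows
how the two building blocks compose: multihash prefixes the digest with a
hash code and a length, and multibase prefixes the text with a character
naming the encoding. The codes used (0x12 for sha2-256, "f" for lowercase
hex, "u" for unpadded base64url) come from the multiformats tables; the
function names are hypothetical, and nothing here is a proposed SWHID syntax.

```python
import base64
import hashlib

SHA2_256_CODE = 0x12     # multihash code for sha2-256
MULTIBASE_HEX = "f"      # multibase prefix: base16, lowercase
MULTIBASE_B64URL = "u"   # multibase prefix: base64url, no padding


def multihash_sha256(data: bytes) -> bytes:
    """Return <hash code><digest length><digest> for SHA-256."""
    digest = hashlib.sha256(data).digest()
    # Both values are below 128, so each unsigned varint fits in one byte.
    return bytes([SHA2_256_CODE, len(digest)]) + digest


def multibase(raw: bytes, prefix: str) -> str:
    """Encode `raw` with the encoding named by the multibase `prefix`."""
    if prefix == MULTIBASE_HEX:
        return prefix + raw.hex()
    if prefix == MULTIBASE_B64URL:
        return prefix + base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    raise ValueError(f"unsupported multibase prefix: {prefix!r}")


mh = multihash_sha256(b"hello world\n")
print(multibase(mh, MULTIBASE_HEX))     # starts with "f1220"
print(multibase(mh, MULTIBASE_B64URL))  # starts with "u"
```

The self-describing prefixes are what make the scheme configurable, and
also what creates the tension with persistent identifiers noted above: the
same object can end up with several valid textual forms.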
Conclusions from my side, and next steps:
-----------------------------------------
- Format: broad consensus for base64url, which is the choice to retain if we want to keep SWHIDs short(ish)
Important remark: the encoding only affects the textual form. Inside the
computation everything is done in hex, so after decoding a SWHID we get the
same hex code, no matter the encoding.
- CRC: needs more discussion
- Hash function: here too there is broad consensus on SHA-256, but there is a real open question about efficiency
To enable practical evaluation of the “fast cryptographic hash” line of
discussion within the same experimental framework, we have published a newer
pre-release of the v2 exploration binaries:
v2-exp-20260313 (pre-release):
https://github.com/swhid/swhid-rs/releases/tag/v2-exp-20260313
@Alexios: could you run some performance tests on real-world data, and report
on the differences you see?
Cheers
--
Roberto