SWHID v2 exploration — thread recap + updated v2-exp binaries (now including BLAKE3)

Roberto Di Cosmo

Mar 17, 2026, 3:56:06 PM

to swhid-...@googlegroups.com

[TL;DR: broad consensus on base64url encoding; more discussion needed on the
hash and the CRC; the new version of the swhid-rs binary supports BLAKE3]

Dear all,

Thanks to everyone who contributed so far to the discussion on the SWHID v2 exploration.

Below is a structured recap of the key points raised so far, followed by an update on the experimental tooling.

1) Scope recap: what we are exploring
We have been exploring two (largely independent) design choices for “SWHID v2”:
- Hash algorithm (e.g., SHA-256, possibly others for comparison)
- Digest encoding / serialization (e.g., hex, base64url, ... )

To make the discussion concrete, we shared an experimental variant of the Rust
implementation (swhid-rs) that allows selecting the hash and encoding at runtime,
and we provided dashboards to track test results.
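
As an illustration of the two independent axes (this is not the swhid-rs code), here is a minimal Python sketch that picks the hash and the encoding at call time. The helper name `digest_and_encode` is hypothetical, and hashing the content with a Git-style "blob <len>\0" header is an assumption made for the example; the actual v2 serialization is still under discussion.

```python
import base64
import hashlib


def digest_and_encode(data: bytes, algo: str = "sha256", encoding: str = "base64url") -> str:
    """Hash `data` Git-blob style, then serialize the digest (hypothetical helper)."""
    # Assumption: content is hashed like a Git blob, with a "blob <len>\0" header.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    raw = hashlib.new(algo, header + data).digest()
    if encoding == "hex":
        return raw.hex()
    if encoding == "base64url":
        # URL-safe alphabet, with the trailing "=" padding stripped.
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    raise ValueError(f"unsupported encoding: {encoding}")


if __name__ == "__main__":
    for enc in ("hex", "base64url"):
        out = digest_and_encode(b"hello\n", "sha256", enc)
        print(enc, len(out), out)
```

For a 32-byte digest the hex form is 64 characters long and the unpadded base64url form is 43.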

2) Hash algorithm: positions and arguments

2.1 SHA-256 as the leading baseline
- Strong support for SHA-256 as the v2 baseline was expressed (Miguel, Stefano).
- Main reasons cited: widely deployed standard, “battle-tested”
cryptographically, alignment with the rationale behind Git’s post-SHA-1
transition, and practical acceleration on common hardware.

2.2 “Faster hashes” (BLAKE3) as a potential alternative / future version
- Alexios raised the desirability of a faster cryptographic hash for
high-volume use cases (mentioning BLAKE3), while noting that Git’s 2018
choice predates BLAKE3 (published later).
- A key point in that exchange: Git-compatibility is an argument, but its
practical impact may depend on how/when SHA-256 repositories become widely
supported by forges and ecosystems.
- Stefano replied that “newer” algorithms can be less battle-tested from a
cryptanalysis perspective, which is a separate argument in favor of SHA-256, independent of Git.

3) Digest encoding: positions, constraints, and open trade-offs
3.1 Strong preference for base64url
- Miguel recommended mandating base64url (single canonical encoding) to keep
the ecosystem simpler and avoid ambiguities that arise when trying to deduce encoding from the string itself.
- Stefano also leaned toward base64url as a good balance between compactness and practical usability.

3.2 Should we allow multiple encodings?
- If multiple encodings are allowed, Alexios strongly argued that the SWHID
must carry explicit information indicating which serialization/encoding is
used (and suggested a way to encode that in the version field).
- Stefano is still unsure: allowing only one encoding simplifies things,
but choosing a compact one (base64url) means you lose direct string-level
comparison with Git’s usual hex-encoded SHA-256; comparison would require
decode/encode tooling. Not a real blocker for tools, maybe a human user
usability consideration.
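
The decode/encode step mentioned above is small. A minimal sketch, assuming unpadded base64url as the compact form (the function names are hypothetical):

```python
import base64


def hex_to_b64url(hex_digest: str) -> str:
    """Re-encode a hex digest as unpadded base64url."""
    raw = bytes.fromhex(hex_digest)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_to_hex(b64_digest: str) -> str:
    """Recover the hex form, e.g. to compare with Git's usual output."""
    # Restore the padding that the compact form drops before decoding.
    padded = b64_digest + "=" * (-len(b64_digest) % 4)
    return base64.urlsafe_b64decode(padded).hex()


if __name__ == "__main__":
    # SHA-256 of the empty byte string, in hex.
    h = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    compact = hex_to_b64url(h)
    print(compact, b64url_to_hex(compact) == h)
```

Both forms carry the same 32 bytes, so the round trip is lossless; only direct string comparison is lost.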

3.3 z85: broad opposition
- Nicolas (and also Alexios) advised against z85: its alphabet contains
characters that complicate parsing and URL handling (and create more
“delimitation” hazards than needed), for relatively small gains over base64url.
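
To make the objection concrete, the sketch below compares the two alphabets against the RFC 3986 "unreserved" set and against the delimiters used in SWHID syntax (":" in the core identifier, ";" and "=" in qualifiers). The Z85 alphabet is the one from ZeroMQ RFC 32; the helper name `needs_escaping` is hypothetical.

```python
import string

# Z85 alphabet as specified by ZeroMQ RFC 32 (85 symbols).
Z85 = (
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".-:+=^!/*?&<>()[]{}@%$#"
)
BASE64URL = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

# RFC 3986 "unreserved" characters never need percent-encoding in a URL.
URL_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# Characters that already act as delimiters in SWHID syntax.
SWHID_DELIMS = set(":;=")


def needs_escaping(alphabet: str) -> list:
    """Characters of `alphabet` outside the URL-unreserved set."""
    return sorted(set(alphabet) - URL_UNRESERVED)


if __name__ == "__main__":
    print("z85 risky chars:", "".join(needs_escaping(Z85)))
    print("base64url risky chars:", "".join(needs_escaping(BASE64URL)))
    print("z85 chars clashing with SWHID delimiters:", sorted(set(Z85) & SWHID_DELIMS))
    # Size of a 32-byte digest: z85 maps 4 bytes to 5 chars, base64url 3 bytes to 4.
    print("z85 length:", 32 // 4 * 5, "vs unpadded base64url length:", 43)
```

The gain is 40 versus 43 characters for a 32-byte digest, which is the "relatively small" saving referred to above.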

3.4 hex: acknowledged advantages

- Even where base64url is preferred, hex was recognized as having practical
benefits: small and unambiguous character set, and direct comparability with
“typical” hex encodings used elsewhere (e.g., Git).

4) Error detection / CRC: interesting idea, but needs a clearer case
- Miguel proposed adding an internal consistency check, e.g., appending a
CRC32 over the decoded digest bytes, to help detect typos/transport
errors (especially relevant for compact encodings).
- Nicolas found the idea interesting, particularly as a mitigation if we pick compact encodings.
- Stefano suggests that we:
  - build a stronger case (do we have evidence of real-world
    mistyping/transport corruption that needs in-band detection? will v2 make it worse?),
  - consider whether this protection must be in-band (inside the identifier)
    vs. out-of-band (e.g., protected by the container like an SBOM checksum),
  - likely avoid multiple CRC types; and
  - consider making it optional (possibly via qualifiers rather than core fields).
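
For reference, an in-band check of the kind proposed could look like the sketch below. It is only one possible layout: appending the 4 CRC bytes to the digest before encoding, the big-endian byte order, and the function names are all assumptions, not something the thread has settled.

```python
import base64
import hashlib
import zlib


def encode_with_crc(raw_digest: bytes) -> str:
    """Append a CRC32 of the digest bytes, then encode as unpadded base64url."""
    crc = zlib.crc32(raw_digest).to_bytes(4, "big")
    return base64.urlsafe_b64encode(raw_digest + crc).rstrip(b"=").decode("ascii")


def decode_and_check(text: str) -> bytes:
    """Decode, verify the trailing CRC32, and return the digest bytes."""
    padded = text + "=" * (-len(text) % 4)
    blob = base64.urlsafe_b64decode(padded)
    raw, crc = blob[:-4], blob[-4:]
    if zlib.crc32(raw).to_bytes(4, "big") != crc:
        raise ValueError("CRC mismatch: identifier is probably corrupted")
    return raw


if __name__ == "__main__":
    digest = hashlib.sha256(b"example").digest()
    ident = encode_with_crc(digest)
    print(len(ident), ident)
    print(decode_and_check(ident) == digest)
```

With this layout a 32-byte digest plus CRC encodes to 48 characters instead of 43, which is part of the cost to weigh against the out-of-band alternatives.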

5) Related prior art: multiformats (multihash / multibase)
- David suggested looking at multiformats’ multihash and multibase as
potential inspiration for configurable hash/encoding, but noted the general
tension between “configurable” schemes and persistent identifiers, while still
seeing value in reusing well-known building blocks.
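
For readers unfamiliar with these building blocks, here is a minimal sketch of how they self-describe a digest. It covers only sha2-256 (multihash code 0x12) and two multibase prefixes ("f" for lowercase base16, "u" for unpadded base64url); the function names are hypothetical and nothing here implies that SWHID v2 would adopt this layout.

```python
import base64
import hashlib

SHA2_256_CODE = 0x12  # multihash function code for sha2-256


def multihash_sha256(data: bytes) -> bytes:
    """Return <function code><digest length><digest> for sha2-256."""
    digest = hashlib.sha256(data).digest()
    # Both the code and the length (32) fit in a single varint byte here.
    return bytes([SHA2_256_CODE, len(digest)]) + digest


def multibase_encode(raw: bytes, base: str) -> str:
    """Prefix the encoded bytes with the multibase character for `base`."""
    if base == "base16":
        return "f" + raw.hex()
    if base == "base64url":
        return "u" + base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    raise ValueError(f"unsupported base: {base}")


if __name__ == "__main__":
    mh = multihash_sha256(b"")
    print(multibase_encode(mh, "base16"))
    print(multibase_encode(mh, "base64url"))
```

The hash and the encoding are both announced in-band, which is the kind of explicit signalling discussed in 3.2, at the price of a few extra characters per identifier.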

Conclusions from my side, and next steps:
-----------------------------------------

- Format: broad consensus for base64url, choice to retain if we want to keep SWHID short(ish)
Important remark: inside the computation everything is done in hex, so after
decoding a SWHID we get the same hex code, no matter the encoding.

- CRC: needs more discussion

- Hash function: here too broad consensus on SHA256, but a real question about efficiency

To enable practical evaluation of the “fast cryptographic hash” line of
discussion within the same experimental framework, we have published a newer
pre-release of the v2 exploration binaries:

v2-exp-20260313 (pre-release): https://github.com/swhid/swhid-rs/releases/tag/v2-exp-20260313

@Alexios: could you run some performance tests on real-world data, and report on
the differences you see?
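
A rough way to get a first throughput number, independent of the v2-exp binaries, is sketched below. It uses only Python's hashlib, plus the third-party `blake3` package if it happens to be installed (treated as optional here); the helper names are hypothetical, and random in-memory data is only a stand-in for real-world content.

```python
import hashlib
import os
import time


def throughput_mb_s(hash_fn, data: bytes, repeats: int = 5) -> float:
    """Best-of-N throughput of `hash_fn` over `data`, in MB/s."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hash_fn(data)
        best = min(best, time.perf_counter() - start)
    return len(data) / best / 1e6


def candidates() -> dict:
    """Hash functions to compare; BLAKE3 is included only if importable."""
    fns = {
        "sha1": lambda d: hashlib.sha1(d).digest(),
        "sha256": lambda d: hashlib.sha256(d).digest(),
    }
    try:
        import blake3  # third-party package, assumed optional

        fns["blake3"] = lambda d: blake3.blake3(d).digest()
    except ImportError:
        pass
    return fns


if __name__ == "__main__":
    payload = os.urandom(16 * 1024 * 1024)  # 16 MiB of random bytes
    for name, fn in candidates().items():
        print(f"{name:8s} {throughput_mb_s(fn, payload):10.1f} MB/s")
```

Results depend heavily on the CPU (hardware SHA extensions, SIMD width) and on file-size distribution, so numbers from one machine should not be generalized.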

Cheers

--
Roberto

Anders F Björklund

Mar 17, 2026, 5:22:26 PM

to Roberto Di Cosmo, swhid-...@googlegroups.com

Hi SWHID! Just joined.

> 1) Scope recap: what we are exploring
> We have been exploring two (largely independent) design choices for “SWHID v2”:
> - Hash algorithm (e.g., SHA-256, possibly others for comparison)
> - Digest encoding / serialization (e.g., hex, base64url, ... )

Late to the party, I have a soft spot for base32hex (without padding!)
myself, but it does get confused with base32 (with the weird 0 1 handling)
and it is not as compact as the ones with upper- and lowercase letters (i.e. base64).
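
To put numbers on that trade-off, the sketch below encodes the same 32-byte digest in hex, unpadded base32hex, and unpadded base64url. The base32hex form is obtained by translating the standard base32 alphabet (RFC 4648, section 6) into the "extended hex" one (section 7); the helper name is hypothetical.

```python
import base64
import hashlib

BASE32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"     # RFC 4648, section 6
BASE32HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # RFC 4648, section 7
_TO_HEX_ALPHABET = str.maketrans(BASE32, BASE32HEX)


def b32hex_nopad(raw: bytes) -> str:
    """Encode as base32hex without the trailing '=' padding."""
    standard = base64.b32encode(raw).decode("ascii").rstrip("=")
    return standard.translate(_TO_HEX_ALPHABET)


if __name__ == "__main__":
    digest = hashlib.sha256(b"example").digest()
    b64url = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    print("hex      ", len(digest.hex()))
    print("base32hex", len(b32hex_nopad(digest)))
    print("base64url", len(b64url))
```

For a 32-byte digest this gives 64, 52, and 43 characters respectively. The two base32 alphabets share most of their symbols, so a string alone does not always tell you which variant produced it.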

> - Format: broad consensus for base64url, choice to retain if we want to keep SWHID short(ish)
> Important remark: inside the computation everything is done in hex, so after
> decoding a SWHID we get the same hex code, no matter the encoding.

For me, multiformats are tied to their use in IPFS, so not that "generic"?

So settling on base64url for SHA2 sounds like a pragmatic decision.

> - Hash function: here too broad consensus on SHA256, but a real question about efficiency

I noticed when I got my Mac that it was much _faster_ to do sha256...

Mostly because it is handled in hardware, whereas the others were not

(i.e. blake3 was much faster on x86, but sha256 was faster on arm)

The code is here: https://github.com/afbjorklund/bsdsum

It all started when coreutils was "missing", and perl was too slow.

And then I remembered that it used to have "/sbin/md5" from FreeBSD.

But it is just calling the CommonCrypto functions, from libSystem...

/Anders
