Hello all, on the two main questions raised up-thread, here is my
take:
- We should go for SHA256, for 1) compatibility with Git, 2) how
"battle-tested" SHA256 is in comparison to alternatives.
On this point, I recommend that interested people read the *reasons*
behind the post-SHA1 hash choice made by the Git community, starting
from
https://git-scm.com/docs/hash-function-transition#_choice_of_hash
and
https://lore.kernel.org/git/20180609224...@genre.crustytoothpaste.net/
.
- In terms of encoding, I'm a big fan of base64url, because it strikes a
great balance between compactness and practicality, and avoids most
(not all) "gotchas" of characters that look alike to the naked eye.
- I'm still not sure about whether we should support only one encoding
or multiple ones. I'm leaning "only one", with a caveat. If we go for
only one, it should clearly be a compact one, hence base64url in my
view. *But* that would mean that direct string comparison with git2
hashes (i.e., SHA256, hex-encoded) will be impossible. To compare a
SWHIDv2 with a Git hash, one will need to rely on decode/encode cycles,
via some tool. This would be less convenient than also supporting a hex
encoding. Not the end of the world, especially if we provide good
tooling. But it's still something that gives me pause.
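
To make the "decode/encode cycle" concrete, here is a minimal Python sketch of what such tooling could look like. It assumes the SWHIDv2 hash part is *unpadded* base64url over the raw 32-byte SHA256 digest; that is an assumption for illustration, not a settled part of the spec. The function names are hypothetical, and the digest is an arbitrary SHA256 value rather than a real Git object ID.

```python
import base64
import hashlib


def b64url_to_bytes(text: str) -> bytes:
    """Decode unpadded base64url (assumed encoding) into raw bytes."""
    padding = "=" * (-len(text) % 4)  # restore the stripped '=' padding
    return base64.urlsafe_b64decode(text + padding)


def hex_to_b64url(hex_digest: str) -> str:
    """Re-encode a hex digest as unpadded base64url."""
    raw = bytes.fromhex(hex_digest)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_to_hex(text: str) -> str:
    """Re-encode an unpadded base64url digest as lowercase hex."""
    return b64url_to_bytes(text).hex()


def same_hash(b64url_digest: str, hex_digest: str) -> bool:
    """Compare the two encodings on the decoded bytes, not on the strings."""
    return b64url_to_bytes(b64url_digest) == bytes.fromhex(hex_digest)


# Arbitrary digest, only to exercise the round trip.
digest_hex = hashlib.sha256(b"example").hexdigest()
digest_b64 = hex_to_b64url(digest_hex)
print(len(digest_hex), len(digest_b64))  # 64 hex characters vs 43 base64url characters
print(same_hash(digest_b64, digest_hex))
```

The conversion itself is a few lines; the inconvenience is that a plain string equality test or a grep across the two forms no longer works.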
About CRC:
On Thu, Mar 05, 2026 at 11:47:45AM +0100, Miguel Colom wrote:
> For example, one could compute the CRC32 (or others) of the decoded base64
> stream, and compare it to what it's declared in the string.
> This would give something like this:
> swh:2:cnt:U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E:ccrc32:1281047013
I agree with others that this is an interesting idea. A few comments:
1) Before embracing it, though, I'd like to see a stronger case for it.
For example, do we have examples/cases where SWHIDs were "broken" by
mistyping them? Do we have reasons to believe the problem will become
worse with v2? Do we need the CRC to be *in-band*, i.e., part of the
identifier, or can it be elsewhere? For example, if a SWHID is referenced
from an SBOM, and the SBOM has a global checksum, the SWHID will be
integrity-protected by it too.
2) I don't think we will need to support *multiple* CRCs. If that is the
case, the "ccrc32" qualifier in the example above is probably not needed.
3) We should consider making the CRC optional. After all, it's an extra
guarantee, not a mandatory feature. And if we make it optional, can this
be a new qualifier, after the "?"?
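
For reference, the check Miguel describes could be sketched roughly as follows in Python. This assumes the layout from the quoted example (`<core>:ccrc32:<decimal CRC>`), unpadded base64url for the hash, and the CRC-32 variant implemented by `zlib.crc32`; none of these are decided, and the function names are hypothetical. The CRC value in the quoted example is not checked here.

```python
import base64
import hashlib
import zlib


def _decode(b64url: str) -> bytes:
    """Decode unpadded base64url (assumed encoding) into raw bytes."""
    return base64.urlsafe_b64decode(b64url + "=" * (-len(b64url) % 4))


def append_crc(core: str) -> str:
    """Append ':ccrc32:<crc>' computed over the decoded hash bytes."""
    b64url = core.rsplit(":", 1)[1]  # last field of the core identifier
    return f"{core}:ccrc32:{zlib.crc32(_decode(b64url))}"


def crc_matches(identifier: str) -> bool:
    """Recompute the CRC of the decoded hash and compare with the declared one."""
    core, tag, declared = identifier.rsplit(":", 2)
    if tag != "ccrc32":
        raise ValueError("no ccrc32 field found")
    b64url = core.rsplit(":", 1)[1]
    return zlib.crc32(_decode(b64url)) == int(declared)


# Arbitrary digest, only to build a well-formed example identifier.
raw = hashlib.sha256(b"example").digest()
b64 = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
tagged = append_crc("swh:2:cnt:" + b64)
print(crc_matches(tagged))
```

Note that the check runs on the decoded bytes, so it can only flag typos that change those bytes; a character that is not valid base64url would be caught (or silently dropped) by the decoder first, depending on how strict it is.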
Cheers
--
Stefano Zacchiroli - https://upsilon.cc/zack
Full professor of Computer Science, Polytechnic Institute of Paris
Co-founder & CSO Software Heritage