SWHID v2 exploration: hash and digest encoding choices - experimental implementation available and request for comments


Roberto Di Cosmo

Mar 3, 2026, 7:44:48 AM
to swhid-...@googlegroups.com, Roberto Di Cosmo, Morane Gruenpeter, Jean-Francois Abramatic, Alexios Zavras, Stefano Zacchiroli
Dear all,
   SWHID v1 was published as ISO/IEC 18670 in April 2025, the official site at https://swhid.org is up and running, and the ecosystem around SWHID is rapidly evolving.

To prepare for the coming years, we are starting an exploration phase for “SWHID v2”. At this stage, we are focusing on two design questions:

1) Hash algorithm: which cryptographic hash should v2 use?
2) User-facing digest encoding: which textual encoding should we use to represent the digest in the identifier?

We value compatibility with widely deployed tooling. Git has introduced support for SHA-256 repositories as part of its hash-function transition, making SHA-256 a strong candidate for the SWHID v2 hashing function.
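
As background for the Git compatibility argument: SWHID v1 content identifiers coincide with Git blob object IDs, that is, SHA-1 over a `blob <size>\0` header followed by the file bytes. The sketch below computes that v1 digest and, purely as an assumption about what a Git-compatible v2 could look like (not a statement about what swhid-rs does), the same framing hashed with SHA-256. The function name `git_blob_digest` is illustrative.

```python
import hashlib

def git_blob_digest(data: bytes, algo: str = "sha1") -> str:
    """Hex digest of a Git blob object: hash of b'blob <size>\\0' + data."""
    header = b"blob %d\x00" % len(data)
    return hashlib.new(algo, header + data).hexdigest()

# SWHID v1 content identifier of the empty file (Git's empty-blob id).
print("swh:1:cnt:" + git_blob_digest(b""))
# Assumption: a v2 identifier that hashes the same framing with SHA-256.
print("swh:2:cnt:" + git_blob_digest(b"", "sha256"))
```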

For the digest encoding, hex is the current v1 choice and remains the simplest and most widely interoperable option, but it requires 64 chars for SHA-256.
For user convenience (copy/paste, QR codes, UI display, URL length), we need to also evaluate more compact encodings, including:

- base64url (RFC 4648; compact, URL-safe; 43–44 chars depending on padding)
- base64 (compact, standard, but includes “+” and “/”)
- base32 / base32hex (RFC 4648; case-insensitive alphabets, longer than base64)
- z85 (ZeroMQ Base85; compact, ASCII-safe)
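
To make the size trade-off concrete, here is a minimal sketch that encodes one SHA-256 digest with Python's standard library and reports the resulting lengths. The input bytes are arbitrary; base32hex is derived by translating the base32 alphabet, and the Z85 length is computed from its 4-bytes-to-5-characters ratio rather than with an encoder.

```python
import base64
import hashlib

digest = hashlib.sha256(b"example input").digest()  # 32 raw bytes

# base32hex groups bits exactly like base32 but uses a different alphabet,
# so it is obtained here by translating the base32 output.
B32 = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
B32HEX = b"0123456789ABCDEFGHIJKLMNOPQRSTUV"

encodings = {
    "hex": digest.hex(),
    "base64": base64.b64encode(digest).decode(),
    "base64url (unpadded)": base64.urlsafe_b64encode(digest).rstrip(b"=").decode(),
    "base32": base64.b32encode(digest).decode(),
    "base32hex": base64.b32encode(digest)
    .translate(bytes.maketrans(B32, B32HEX))
    .decode(),
}
lengths = {name: len(text) for name, text in encodings.items()}
# Z85 maps every 4 bytes to 5 characters; no encoder is needed for the length.
lengths["z85"] = len(digest) // 4 * 5

for name, n in lengths.items():
    print(f"{name:22} {n} chars")
```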

To facilitate comparison and discussion, I have prepared an experimental variant of the swhid-rs reference implementation that lets you select the hash and encoding at runtime:

- Source code: https://github.com/swhid/swhid-rs (branch: v2-typespecialisation)
- User guide: https://github.com/swhid/swhid-rs/blob/v2-typespecialisation/docs/user-guide.md (CLI reference + available hash/format options) 
- Pre-built binaries: GitHub Releases (pre-releases tagged v2-exp-YYYYMMDD, e.g. v2-exp-20260301), see for example: https://github.com/swhid/swhid-rs/releases/tag/v2-exp-20260301

Example usage:

# V1 (SHA-1 + hex, default)
swhid content --file README.md

# V2 exploration with SHA-256 and various encodings
swhid content --hash sha256 --format hex --file README.md
swhid content --hash sha256 --format base64url --file README.md
swhid content --hash sha256 --format z85 --file README.md

Test results:
- v1 dashboard (multiple implementations across platforms): https://www.swhid.org/test-suite/
- v2 exploration dashboard (currently rust_v1 vs rust_v2 side-by-side): https://www.swhid.org/swhid-exploration-deploy/

We would particularly appreciate feedback on the following points:

- Hash: is SHA-256 the right baseline for v2? If not, what concrete constraints suggest another choice?
- Encoding: should v2 mandate a single canonical digest encoding, or allow for multiple encodings?
- If multiple encodings are acceptable: should the identifier explicitly signal the encoding, or should we just deduce it from the string size/characters used (will not work well if we go for z85)?
- Are case-insensitivity and/or strict URL/path safety requirements for the digest representation important for your use cases?
- Any strong preferences regarding padding (mandatory vs omitted) for base64/base32 variants?
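
On the question of deducing the encoding from the string itself, the following sketch lists which encodings are compatible with a digest string when only its length (for a 32-byte digest) and character set are examined. The table and the function name `candidate_encodings` are illustrative assumptions, not part of any SWHID tool.

```python
import string

# Expected text length for a 32-byte digest, and the allowed characters.
ALPHABETS = {
    "hex": (64, set("0123456789abcdef")),
    "base64": (44, set(string.ascii_letters + string.digits + "+/=")),
    "base64url": (43, set(string.ascii_letters + string.digits + "-_")),
    "base32": (56, set(string.ascii_uppercase + "234567=")),
    "base32hex": (56, set(string.digits + "ABCDEFGHIJKLMNOPQRSTUV=")),
}

def candidate_encodings(digest_text: str) -> list[str]:
    """Encodings whose length and alphabet are both compatible with the text."""
    return [
        name
        for name, (length, alphabet) in ALPHABETS.items()
        if len(digest_text) == length and set(digest_text) <= alphabet
    ]

# An unpadded base64url digest has a unique length, so it is unambiguous.
print(candidate_encodings("U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E"))
# A padded 56-character string using only letters A-V fits two alphabets.
print(candidate_encodings("A" * 52 + "===="))
```

With these two checks alone, base32 and base32hex cannot always be told apart. A decoder that also rejects non-zero trailing bits could narrow some cases, but that depends on how strict each implementation is.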

Thanks in advance for your input, and for any testing you can do with the experimental implementation.

All the best

--
Roberto

------------------------------------------------------------------
Computer Science Professor
            (on leave at INRIA from IRIF/Université Paris Cité)  

Director                                                          
Software Heritage                https://www.softwareheritage.org 
INRIA                 https://y2u.be/Ez4xKTKJO2o 
Bureau D202      E-mail : rob...@dicosmo.org         
48 Rue Barrault            Web page : https://www.dicosmo.org      
CS 61534     Twitter : https://twitter.com/rdicosmo 
75647 Paris Cedex         Tel : +33 1 80 49 44 42           
------------------------------------------------------------------
GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3 

Miguel Colom

Mar 3, 2026, 8:50:04 AM
to Roberto Di Cosmo, swhid-...@googlegroups.com, Morane Gruenpeter, Jean-Francois Abramatic, Alexios Zavras, Stefano Zacchiroli
Dear Roberto and all,

Great work! I'll add my comments below.
 
> - base64url (RFC 4648; compact, URL-safe; 43–44 chars depending on padding)
> - base64 (compact, standard, but includes “+” and “/”)
> - base32 / base32hex (RFC 4648; case-insensitive alphabets, longer than base64)
> - z85 (ZeroMQ Base85; compact, ASCII-safe)

I'd choose base64url: although we could encode anything with any of the other alternatives, base64url seems to facilitate the task without requiring extra transformations (making it URL-safe, encoding letter case, ...).

> - Hash: is SHA-256 the right baseline for v2? If not, what concrete constraints suggest another choice?

It's an excellent choice, in my opinion.
It's a standard, and its 256-bit output space makes collisions practically impossible unless a vulnerability is discovered.
Common hardware can also accelerate the computation with specialized instructions (https://en.wikipedia.org/wiki/SHA_instruction_set), probably motivated by its use in the Bitcoin blockchain.

> - Encoding: should v2 mandate a single canonical digest encoding, or allow for multiple encodings?

Since base64url seems to be adequate for all the test cases, I'd simply allow only this one, to simplify.
If any drawback is found, then this should be re-evaluated.

> - If multiple encodings are acceptable: should the identifier explicitly signal the encoding, or should we just deduce it from the string size/characters used (will not work well if we go for z85)?

I'd only use base64url and, in general, avoid inferring anything from the data itself, since we could have ambiguities.
Not now, but if options are left open and inferred from the data, ambiguities could arise in the future.

> - Are case-insensitivity and/or strict URL/path safety requirements for the digest representation important for your use cases?
> - Any strong preferences regarding padding (mandatory vs omitted) for base64/base32 variants?

Since padding is optional in base64/32, I'd make it optional too.
It could be removed, since the size can be inferred from the digest used, but in that case it would be odd for a valid base64url encoding to be rejected because it contains padding, when padding is valid in the encoding scheme itself. That could be confusing, in my opinion.

Cheers,
Miguel

Roberto Di Cosmo

Mar 5, 2026, 3:56:21 AM
to Miguel Colom, swhid-...@googlegroups.com, Morane Gruenpeter, Jean-Francois Abramatic, Alexios Zavras, Stefano Zacchiroli
Thanks Miguel for your prompt response, happy to see that you like the work done to ease progress on this matter.

To facilitate the discussion, I am adding here an example of the different outputs one can get for sha256 using the various encodings:

swhid v1        - swh:1:cnt:5fe9f915b92b32126828d87d259fbe1742f62a8e
hex             - swh:2:cnt:5384e4801b9c9636269565ebf0e2c641447672a5403b2bb8311d35c427c633d1
base64          - swh:2:cnt:U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E=
base64url       - swh:2:cnt:U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E
base32          - swh:2:cnt:KOCOJAA3TSLDMJUVMXV7BYWGIFCHM4VFIA5SXOBRDU24IJ6GGPIQ====
base32hex       - swh:2:cnt:AE2E900RJIB3C9KLCNLV1OM68527CSL580TINE1H3KQS89U66F8G====
z85             - swh:2:cnt:q?TjZ8>riLcy5fd[z.CNm0rQEkSZHaf=-c&c=N%&


--
Roberto

P.S.: here is how to play with all this yourself on a Linux box (binaries are also available for macOS and Windows, see https://github.com/swhid/swhid-rs/releases/tag/v2-exp-20260301)

wget https://github.com/swhid/swhid-rs/releases/download/v2-exp-20260301/swhid-v2-exp-x86_64-unknown-linux-gnu
chmod +x swhid-v2-exp-x86_64-unknown-linux-gnu
bin=./swhid-v2-exp-x86_64-unknown-linux-gnu
printf '%-9s\t- %s\n' "swhid v1" "$("$bin" content --file "$bin")"
for enc in hex base64 base64url base32 base32hex z85; do
 printf '%-9s\t- %s\n' "$enc" "$("$bin" --hash sha256 --format "$enc" content --file "$bin")"
done


Miguel Colom

Mar 5, 2026, 5:47:59 AM
to Roberto Di Cosmo, swhid-...@googlegroups.com, Morane Gruenpeter, Jean-Francois Abramatic, Alexios Zavras, Stefano Zacchiroli
Thanks for the code Roberto.

Related to SWHID v2 and the digests, I was wondering if some kind of consistency check within the identifier itself has been proposed.
For example, one could compute the CRC32 (or another checksum) of the decoded base64 stream, and compare it to what is declared in the string.
This would give something like this:
swh:2:cnt:U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E:ccrc32:1281047013

Since changing ASCII characters in the cnt field would still decode to a valid 256-bit value, and since the hash is not reversible, one would never know whether a SWHID actually corresponds to an existing object or contains an error.
This way we'd have two different checks:
- One on the size of the digest decoded from base64, which needs to be 256 bits;
- Another one on the consistency of the hash itself.

I assume that the SWH infrastructure already has consistency checks for the Merkle tree, but this could be incorporated in the identifier itself. That would enforce consistency in any system implementing SWHID v2, rather than relying on external metadata.

Well, just a thought :)

Cheers,
Miguel

Nicolas Dandrimont

Mar 5, 2026, 6:00:00 AM
to swhid-...@googlegroups.com, morane gruenpeter, Jean-Francois Abramatic, Alexios Zavras, Stefano Zacchiroli

Hi,

Looking at your examples, I would avoid the z85 encoding as the character set contains a lot of characters that will make parsing, URL encoding, and overall "delimitation" of the SWHID more ambiguous than needed (including ?%, :[]), seeing how little space gain we get compared to base64url. The question mark is especially problematic for SWHIDs with context.
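
To illustrate the point about the character set, this sketch lists which characters of the Z85 alphabet (as defined in the ZeroMQ Z85 specification) would be percent-encoded by Python's `urllib.parse.quote` when no extra characters are declared safe, and does the same for base64url. The helper name `needs_escaping` is illustrative.

```python
from urllib.parse import quote

Z85 = ("0123456789abcdefghijklmnopqrstuvwxyz"
       "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#")
BASE64URL = ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
             "0123456789-_")

def needs_escaping(alphabet: str) -> str:
    """Characters that quote() percent-encodes (outside A-Z a-z 0-9 - . _ ~)."""
    return "".join(ch for ch in alphabet if quote(ch, safe="") != ch)

print("z85:      ", needs_escaping(Z85))
print("base64url:", needs_escaping(BASE64URL) or "(none)")
```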

Even though it's longer, the hex encoding has the big advantage of a small, unambiguous character set (no 0/O, 1/l, lowercase/uppercase). Miguel's suggestion of introducing some form of CRC for transport is interesting, especially for compact encodings.

-- 
Nicolas Dandrimont
Operations Tech Lead, Software Heritage

Zavras, Alexios

Mar 5, 2026, 6:43:47 AM
to Roberto Di Cosmo, Miguel Colom, swhid-...@googlegroups.com, Morane Gruenpeter, Jean-Francois Abramatic, Stefano Zacchiroli
Thanks, Roberto, for the examples.
We are discussing two independent questions: the hash algorithm and the digest encoding.

I strongly believe that, if we allow more than one serialization ("encoding"), the SWHID should include information about which one it uses.
We might use the "version" field, which can include information on the hashing and the serialization. Using "/" to separate these (in the manner of mime-types), we could have:
hex             - swh:2/0:cnt:5384e4801b9c9636269565ebf0e2c641447672a5403b2bb8311d35c427c633d1
base64          - swh:2/1:cnt:U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E=
base64url       - swh:2/2:cnt:U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E
base32          - swh:2/3:cnt:KOCOJAA3TSLDMJUVMXV7BYWGIFCHM4VFIA5SXOBRDU24IJ6GGPIQ====
base32hex       - swh:2/4:cnt:AE2E900RJIB3C9KLCNLV1OM68527CSL580TINE1H3KQS89U66F8G====
(using single characters for the various serializations).
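
A minimal parsing sketch for this proposal, assuming the code table implied by the example lines above (0 = hex, 1 = base64, 2 = base64url, 3 = base32, 4 = base32hex). The table, the `parse` function, and its return shape are hypothetical; nothing here is part of SWHID v1 or of swhid-rs.

```python
import base64

B32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
B32HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

# Hypothetical serialization codes, taken from the example lines above.
DECODERS = {
    "0": bytes.fromhex,
    "1": base64.b64decode,
    "2": lambda s: base64.urlsafe_b64decode(s + "=" * (-len(s) % 4)),
    "3": base64.b32decode,
    # base32hex: map its alphabet onto base32, then decode as base32.
    "4": lambda s: base64.b32decode(s.translate(str.maketrans(B32HEX, B32))),
}

def parse(swhid: str) -> tuple[str, str, bytes]:
    """Split 'swh:<version>/<encoding>:<type>:<digest>' and decode the digest."""
    scheme, version, obj_type, digest_text = swhid.split(":", 3)
    if scheme != "swh":
        raise ValueError("not a SWHID")
    major, _, encoding = version.partition("/")
    return major, obj_type, DECODERS[encoding](digest_text)

a = parse("swh:2/0:cnt:5384e4801b9c9636269565ebf0e2c641447672a5403b2bb8311d35c427c633d1")
b = parse("swh:2/2:cnt:U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E")
print(a == b)  # both serializations carry the same 32 bytes
```

Because the serialization is read from the version field, the parser never has to guess from the digest's length or characters.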

I am also against using z85, since it includes ":" in its output alphabet and it will introduce complexity without major gains.


Regarding the hashing function, as someone who is computing thousands (if not millions) of hashes every day, I would have loved to have a very fast one. Maybe we can simultaneously introduce version 3 using BLAKE3? I assume we still want to use a cryptographic hash function to keep the security guarantees. Otherwise we could even use something like XXH3 (but then we purposefully abandon the "tamper-proof" functionality).

We should also be clear about the use case: I am only talking about software identification, and therefore also comparison with other SWHIDs of the exact same version.

The fact that SWHIDs can also be used to access the Software Heritage archive is an extra advantage, but this is not a requirement.
What I mean is that it would be totally OK (for me) to have SWHID v2 using SHA-256 and SWHID v3 using BLAKE3, and also specify that the SwHer archive does not support accessing artifacts with v3 identifiers, only v2 (and v1).
If we want to take the "access to the archive" use case into account, I'd also avoid using the simple "base64" encoding, since its output alphabet includes "/" and "+".

-- zvr --


Stefano Zacchiroli

Mar 9, 2026, 5:40:49 AM
to swhid-...@googlegroups.com
Hello all, on the two main raised questions up-thread, my take and
thoughts are:

- We should go for SHA256, for 1) compatibility with Git, 2) how
"battle-tested" SHA256 is in comparison to alternatives.

On this point, I recommend that interested people read the *reasons*
behind the post-SHA1 hash choice made by the Git community, starting
from https://git-scm.com/docs/hash-function-transition#_choice_of_hash
and
https://lore.kernel.org/git/20180609224...@genre.crustytoothpaste.net/
.

- In terms of encoding, I'm a big fan of base64url, because it strikes a
great balance between compactness, practical aspects, and avoids most
(not all) "gotchas" of characters looking like different ones to the
naked eye.

- I'm still not sure about whether we should support only one encoding
or multiple ones. I'm leaning "only one", with a caveat. If we go for
only one, it should clearly be a compact one, hence base64url in my
view. *But* that would mean that string comparison with git2 hashes
(i.e., SHA256, hex-encoded) will be impossible. To compare a SWHIDv2
with a Git hash one will need to rely on decode/encode cycles, via
some tool. This would be less convenient than the alternative. Not the
end of the world, especially if we provide good tooling. But it's
still something that gives me pause.
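
The decode/encode cycle mentioned above is short in practice. A minimal sketch, assuming an unpadded base64url digest at the end of the identifier; both function names are illustrative and not part of any existing tool.

```python
import base64

def swhid_digest_to_hex(swhid: str) -> str:
    """Hex form of the unpadded base64url digest at the end of a SWHID."""
    digest_text = swhid.rsplit(":", 1)[1]
    padded = digest_text + "=" * (-len(digest_text) % 4)
    return base64.urlsafe_b64decode(padded).hex()

def hex_to_base64url(hex_digest: str) -> str:
    """Unpadded base64url form of a hex digest, e.g. a SHA-256 Git object id."""
    raw = bytes.fromhex(hex_digest)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

# Example digests taken from earlier in this thread.
hex_id = swhid_digest_to_hex("swh:2:cnt:U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E")
print(hex_id)
print(hex_to_base64url(hex_id))
```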

About CRC:

On Thu, Mar 05, 2026 at 11:47:45AM +0100, Miguel Colom wrote:
> For example, one could compute the CRC32 (or others) of the decoded base64
> stream, and compare it to what it's declared in the string.
> This would give something like this:
> swh:2:cnt:U4TkgBucljYmlWXr8OLGQUR2cqVAOyu4MR01xCfGM9E:ccrc32:1281047013

I agree with others that this is an interesting idea. A few comments:

1) Before embracing it, though, I'd like to see a stronger case for it.
For example, do we have examples/cases where SWHIDs were "broken" by
mistyping them? Do we have reasons to believe the problem will become
worse with v2? Do we need the CRC to be *in-band*, i.e., part of the
identifier or can it be elsewhere? For example, if a SWHID is referenced
from an SBOM, and the SBOM has a global checksum, the SWHID will be
integrity-protected by it too.

2) I don't think we will need to support *multiple* CRCs. If so, the
"ccrc32" qualifier in the example above is probably not needed.

3) We should consider making the CRC optional. After all, it's an extra
guarantee, not a mandatory feature. And if we make it optional, can this
be a new qualifier, after the "?"?

Cheers
--
Stefano Zacchiroli - https://upsilon.cc/zack
Full professor of Computer Science, Polytechnic Institute of Paris
Co-founder & CSO Software Heritage

Zavras, Alexios

Mar 9, 2026, 6:44:48 AM
to Stefano Zacchiroli, swhid-...@googlegroups.com
Just an observation that the discussion about the new git hashing algorithm (and decision) happened in 2018, examining the algorithms that were available then.
BLAKE3, for example, was published in 2020, so it was not part of the discussion.

I understand the "git compatibility" argument in a theoretical sense, but I don't think it translates to a real use case. Git can use sha-256, but "At present, there is no interoperability between SHA-256 repositories and SHA-1 repositories." [copied from git-init(1)]. And while one can have a local git repository with sha-256 hashes, I don't think any of the forges support such repositories yet.
Maybe by the time they do, git might be able to use even more hashes...

I also believe that, if git compatibility is paramount, we should use the exact same encoding (hex) as well.

-- zvr --


Stefano Zacchiroli

Mar 9, 2026, 8:03:03 AM
to swhid-...@googlegroups.com
On Mon, Mar 09, 2026 at 10:44:39AM +0000, Zavras, Alexios wrote:
> Just an observation that the discussion about the new git hashing
> algorithm (and decision) happened in 2018, examining the algorithms
> that were available then. BLAKE3, for example, was published in 2020,
> so it was not part of the discussion.

Yes. But my point is precisely that, according to some arguments from
the Git discussion back then, this timing makes things *worse* for new
and shiny algorithms (like BLAKE3), because they are less battle tested,
both practically and from a cryptanalysis point of view, than "older"
and better known ones (like SHA 256).

For me, this is an argument in favor of SHA 256 that is separate from
"Git compatibility".

David Douard

Mar 11, 2026, 8:43:19 AM
to swhid-...@googlegroups.com
On 03/03/2026 at 13:44, Roberto Di Cosmo wrote:
> Dear all,
>    SWHID v1 was published as ISO/IEC 18670 in April 2025, the official site at https://swhid.org is up and running, and the ecosystem around SWHID is rapidly evolving.
>
> To prepare for the coming years, we are starting an exploration phase for “SWHID v2”. At this stage, we are focusing on two design questions:
>
> 1) Hash algorithm: which cryptographic hash should v2 use?

Would it make sense to have a look at multiformats' multihash¹ and multibase² format for this?

¹ https://github.com/multiformats/multihash 
² https://github.com/multiformats/multibase


> 2) User-facing digest encoding: which textual encoding should we use to represent the digest in the identifier?

This can be handled by multiformats as well. Obviously a "configurable" hash algorithm can be tricky to use in a persistent identifier... but configurable formatting might be interesting.
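
For readers unfamiliar with multiformats, here is a minimal sketch of what a self-describing digest looks like: a multihash prefixes the digest with a hash-function code (0x12 for sha2-256) and the digest length, and multibase prefixes the text with one character naming the encoding (`u` for unpadded base64url). The function name is illustrative, and how such a string would fit into SWHID syntax is left open.

```python
import base64

SHA2_256 = 0x12  # multihash code for sha2-256

def to_multibase_multihash(digest: bytes) -> str:
    """'u' (multibase: unpadded base64url) + multihash(sha2-256, digest)."""
    # Both the code and the length fit in one byte here, so no varint logic.
    multihash = bytes([SHA2_256, len(digest)]) + digest
    return "u" + base64.urlsafe_b64encode(multihash).rstrip(b"=").decode()

# Example SHA-256 digest taken from earlier in this thread.
digest = bytes.fromhex(
    "5384e4801b9c9636269565ebf0e2c641447672a5403b2bb8311d35c427c633d1"
)
print(to_multibase_multihash(digest))
```

The two prefix bytes and the encoding character make the result a few characters longer than the bare unpadded base64url digest.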

(sorry I just joined the list and I don't have the whole discussion)

About the idea of supporting a CRC field, should it be part of the core SWHID or could it be handled in the qualifiers section?


-- 
David Douard - Software Heritage
https://softwareheritage.org
matrix: @david:sdfa3.org
mastodon: https://pouet.chapril.org/@douardda
gpg: 7DC7 325E F1A6 226A B6C3  D7E3 2388 A3BF 6F0A 6938
