git clone $swhid


Moritz

Mar 18, 2026, 5:33:28 AM
to SWHID (Software Hash Identifiers) discussions
Dear all,

I was wondering whether anyone has implemented SWHID protocol support for git.

I propose the following idea


This would not only be nice to have but would also solve real-world problems such as


I know that (in contrast to IPFS) the SWHID is only an identifier without a retrieval protocol. Technically, however, it should not be too hard to implement a retrieval protocol that fetches from a SWH mirror.

I briefly discussed this with Maxence, but he is not aware of any progress towards implementing this idea either, so maybe someone here has already thought about it?

All the best
Moritz

Anders F Björklund

Mar 18, 2026, 7:43:46 AM
to Moritz, SWHID (Software Hash Identifiers) discussions
> I know that (in contrast to IPFS) the SWHID is only an identifier without a retrieval protocol. Technically, however, it should not be too hard to implement a retrieval protocol that fetches from a SWH mirror.
>
> I briefly discussed this with Maxence, but he is not aware of any progress towards implementing this idea either, so maybe someone here has already thought about it?

When I made my SWHID implementation, I implemented support for "--write-objects", similar to `git hash-object`.
Right now it only does blobs and trees, but it should not be hard to also add the "git" subcommand objects...

Your question is more about the opposite direction: how to translate from SWH objects to Git objects?
So far, I have created a .swh/objects directory (with zlib) and was able to use that as a $GIT_DIR.
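
As a rough illustration of the object layout described above (not the go-swhid code itself), the following Python sketch hashes content the way `git hash-object` does and writes it as a zlib-compressed loose object. The function names and the `root` argument are illustrative assumptions; the only facts relied on are Git's loose-object format and that a `swh:1:cnt:` identifier carries the same SHA-1 as the Git blob.

```python
import hashlib
import os
import zlib


def blob_id(data: bytes) -> str:
    """Return the Git blob SHA-1: hash of the header 'blob <size>' + NUL + content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def write_loose_object(root: str, data: bytes) -> str:
    """Store data as a zlib-compressed loose blob under root/objects; return its SWHID."""
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    # Git fans objects out into 256 subdirectories named after the first hex byte.
    obj_dir = os.path.join(root, "objects", digest[:2])
    os.makedirs(obj_dir, exist_ok=True)
    with open(os.path.join(obj_dir, digest[2:]), "wb") as fh:
        fh.write(zlib.compress(header + data))
    return "swh:1:cnt:" + digest
```

Calling `write_loose_object(".swh", data)` would place the object under `.swh/objects/`, which is the layout mentioned in the message.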

While it is _possible_ to just tar this .swh directory, I was thinking about using sqlite3 instead.
Then you would have everything in one file, although not as efficient as git's bundle and pack files.
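
To make the single-file idea concrete, here is a minimal Python sketch of an object store in SQLite. The table layout, column names, and helper functions are assumptions made for illustration; they are not the schema used by go-swhid or by git-remote-sqlite.

```python
import sqlite3
import zlib

# Hypothetical schema: one row per object, keyed by its hex SHA-1.
SCHEMA = """
CREATE TABLE IF NOT EXISTS objects (
    id   TEXT PRIMARY KEY,
    type TEXT NOT NULL,
    data BLOB NOT NULL
)
"""


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the single-file object store."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def put(conn: sqlite3.Connection, obj_id: str, obj_type: str, payload: bytes) -> None:
    """Insert an object; an id that is already present is left untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO objects (id, type, data) VALUES (?, ?, ?)",
        (obj_id, obj_type, zlib.compress(payload)),
    )
    conn.commit()


def get(conn: sqlite3.Connection, obj_id: str):
    """Return (type, payload) for obj_id, or None if it is not stored."""
    row = conn.execute(
        "SELECT type, data FROM objects WHERE id = ?", (obj_id,)
    ).fetchone()
    if row is None:
        return None
    return row[0], zlib.decompress(row[1])
```

Because the key is the content hash, inserting the same object twice is harmless, which is why the sketch uses `INSERT OR IGNORE`.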

There are some neat implementations of git's transfer protocol that can clone from such an archive:
<https://github.com/chrislloyd/git-remote-sqlite> (this one is written in Zig, whereas I used Go)

/Anders

https://github.com/afbjorklund/go-swhid

PS. Something similar to the CAR files for IPFS, to continue that analogy.
And very similar to ZIP, or the "sqlar" that is already built into SQLite:

* https://sqlite.org/sqlar.html

Stefano Zacchiroli

Mar 18, 2026, 2:31:12 PM
to swhid-...@googlegroups.com
Dear all, while the ideal entry point for the proposed feature is indeed
a SWHID, this discussion is really about a technical feature of the
Software Heritage archive rather than about SWHID as a standard. As
such, I suggest bringing it up on the Software Heritage development
channels, which are documented at:
https://www.softwareheritage.org/community/developers/

FWIW, this idea has been (coincidentally) discussed recently among
Software Heritage developers and external collaborators, resulting in
this blog post just today:
https://nesbitt.io/2026/03/18/git-remote-helpers.html

Cheers

On Wed, Mar 18, 2026 at 12:43:33PM +0100, Anders F Björklund wrote:
> >
> > I know that (in contrast to IPFS) the SWHID is only an identifier without
> > a retrieval protocol. Technically, however, it should not be too hard to
> > implement a retrieval protocol that fetches from a SWH mirror.
> >
> > I briefly discussed this with Maxence
> > <https://www.fiz-karlsruhe.de/de/bereiche/lebenslauf-und-publikationen-maxence-azzouz-thuderoz>,
> > but he is not aware of any progress towards implementing this idea
> > either, so maybe someone here has already thought about it?
> >
>
> When I made my SWHID implementation, I implemented support for
> "--write-objects" similar to `git hash-object`
> Right now it only does blobs and trees, but it should not be hard to add
> also the "git" subcommand objects...
>
> Your question is more about the opposite direction, how to translate from
> SWH and to Git objects?
> But so far, I created a .swh/objects directory (with zlib) and was able to
> use that as a $GIT_DIR.
>
> While it is _possible_ to just tar this .swh directory, I was thinking
> about using sqlite3 instead.
> Then you would have everything in one file, although not as efficient as
> git's bundle and pack files.
>
> There are some neat implementations of git's transfer protocol that can
> clone from such an archive:
> <https://github.com/chrislloyd/git-remote-sqlite> (although this is written
> in Zig, but I used Go)
>
> /Anders
>
> https://github.com/afbjorklund/go-swhid
>
> PS. Something similar to the CAR files for IPFS, if continuing on that
> analogy.
> And very similar to ZIP, or the "sqlar" that is already built-in to SQLite:
>
> * https://sqlite.org/sqlar.html
>

--
Stefano Zacchiroli - https://upsilon.cc/zack
Full professor of Computer Science, Polytechnic Institute of Paris
Co-founder & CSO Software Heritage

Anders F Björklund

Mar 18, 2026, 3:11:37 PM
to Stefano Zacchiroli, swhid-...@googlegroups.com
> When I made my SWHID implementation, I implemented support for
> "--write-objects" similar to `git hash-object`
> Right now it only does blobs and trees, but it should not be hard to add
> also the "git" subcommand objects...
>
> Your question is more about the opposite direction, how to translate from
> SWH and to Git objects?
> But so far, I created a .swh/objects directory (with zlib) and was able to
> use that as a $GIT_DIR.
I implemented the git objects too, but without the rest of the repository there will be a lot of dangling links (for example, a revision whose "parent" is missing).

Per-file compression (as in .zip and similar formats) also gives bigger archives than per-archive compression (as with a compressed .tar), even though the per-file variant is indexed and hashed.
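
The size difference can be checked with a small Python experiment. This is only a sketch: it uses zlib for both cases and a list of made-up, similar source snippets as a stand-in for a repository, so the exact numbers say nothing about real archives.

```python
import zlib


def per_file_size(files: list) -> int:
    """Total size when every file is compressed on its own (ZIP-like)."""
    return sum(len(zlib.compress(f)) for f in files)


def per_archive_size(files: list) -> int:
    """Size when all files are concatenated first and compressed once (tar.gz-like)."""
    return len(zlib.compress(b"".join(files)))


# Synthetic input: 100 small files that share most of their content.
files = [
    b"def handler_%d(request):\n    return render(request, 'page.html')\n" % i
    for i in range(100)
]
print("per file:   ", per_file_size(files))
print("per archive:", per_archive_size(files))
```

Compressing once over the whole stream lets the compressor reuse matches across file boundaries, which per-file compression cannot do.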

Will probably complete the implementation anyway, but there is another "official" Go library...

From a different thread I also added the version/hash/encoding flags, and collision detection.

/Anders

Anders F Björklund

Mar 21, 2026, 1:40:41 PM
to swhid-...@googlegroups.com
> Will probably complete the implementation anyway, but there is another "official" Go library...
>
> From a different thread I also added the version/hash/encoding flags, and collision detection.

The implementation of the directory and database storage is now complete, and documented under "doc":

https://github.com/afbjorklund/go-swhid/blob/main/doc/storage.md

I also added a simple (one-level) --archive flag that reads an archive without unpacking it, so that you can determine the directory SWHID of a .tar or .zip file.
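
For readers who want to see what such a computation involves, here is a Python sketch for the simplest case. It is an assumption-laden illustration, not the go-swhid `--archive` implementation: it handles only a flat tar archive of regular files (no subdirectories, no symlinks, no `./` prefixes) and builds the directory identifier the way Git builds a tree object.

```python
import hashlib
import io
import tarfile


def _git_hash(kind: bytes, body: bytes) -> bytes:
    """SHA-1 over the Git object header ('<kind> <size>' + NUL) followed by the body."""
    return hashlib.sha1(kind + b" %d\x00" % len(body) + body).digest()


def flat_tar_directory_swhid(fileobj) -> str:
    """Compute swh:1:dir:... for a flat tar archive without unpacking it to disk."""
    entries = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # sketch: regular files only
            data = tar.extractfile(member).read()
            # Git records only whether the owner-executable bit is set.
            mode = b"100755" if member.mode & 0o100 else b"100644"
            entries.append((member.name.encode(), mode, _git_hash(b"blob", data)))
    # With files only, tree entries are ordered by their raw name bytes.
    entries.sort(key=lambda entry: entry[0])
    body = b"".join(mode + b" " + name + b"\x00" + sha for name, mode, sha in entries)
    return "swh:1:dir:" + _git_hash(b"tree", body).hex()
```

Nested directories would need one tree object per directory, computed bottom-up, which this sketch deliberately leaves out.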

/Anders

Moritz

Mar 24, 2026, 5:29:47 AM
to SWHID (Software Hash Identifiers) discussions
Thank you. From an end-user perspective, it seems that this is doable in principle, but not a feature that one can use right now, correct?

Some background information (without a direct link to the question):

One concrete use case is our "Research Data Management Planning in Mathematics" guide, which instructs mathematicians on how to manage their research data (and the research software connected to it).

This guide currently lists only GitHub as a platform for storing source code:

"GitHub: GitHub is a hosting service for git repositories with source code, provided by a subsidiary of Microsoft. It is the de facto standard for the development of open-source software. In addition to hosting git repositories, GitHub offers various continuous integration tools. Authors can specify metadata by including a .bib or .cff file in the main folder of the git repository. GitHub provides long-term preservation mechanisms. In addition, workflows exist for synchronizing it with platforms, e.g., Zenodo."

We further recommend to "enable tracking by software heritage for mirroring the repository." 

For research software that is used or referenced, we recommend

"All research software sources should be deposited in the universal software archive. From there, stable citations are possible, for example, using the biblatex-software package."

In general, we also suggest

"Intrinsic identifiers. Use intrinsic identifiers like git commit hashes and software heritage IDs to increase reproducibility."

For the next revision, I would like to explain better how a SWHID can be used in practice to reproduce results. For now, you can download a tar file from the web UI (see the attached screenshot). The file takes a while to "cook", and you are notified by email when it is ready, so this workflow is less convenient. Here, a git clone $swhid would be more convenient. If that were reasonably fast, we could eventually become less dependent on GitHub. I already tried suggesting https://radicle.xyz as an alternative to GitHub, but it does not seem stable enough.

All the best
Moritz
Screenshot 2026-03-24 at 09.37.13.png

Robbie Morrison

Mar 24, 2026, 6:38:08 AM
to swhid-...@googlegroups.com

Hi Moritz, all

On 24/03/2026 10.29, Moritz wrote:
> For the next revision, I would like to better explain how the swhid can be practically used to reproduce results.

This information would be of significant interest to the Open Energy Modelling Initiative community.  Particularly now that open source models are being used to support sector planning (Ten Year Network Development Plans) and public policy (decarbonization pathways).  I would appreciate any updates.

best, Robbie

-- 
Robbie Morrison
Address: Schillerstrasse 85, 10627 Berlin, Germany
Phone: +49.30.612-87617

Anders F Björklund

Mar 30, 2026, 12:27:31 PM
to swhid-...@googlegroups.com
> Now the implementation of directory and database is complete, and in "doc":
>
> https://github.com/afbjorklund/go-swhid/blob/main/doc/storage.md

I added an alternative database schema that follows libgit2 instead of the .git object layout.

It uses integers for the object type, with GIT_OBJ__EXT2 (= 5) for the "snapshot",
and stores the data uncompressed instead of using the zlib compression from before.

And naturally, I found once again that this idea had already been discussed :-)
(with PostgreSQL rather than SQLite, and for .git rather than .swh, but anyway...)


/Anders

Moritz

Apr 1, 2026, 2:00:08 PM
to SWHID (Software Hash Identifiers) discussions
Dear Robbie,

I will certainly keep you posted. I eventually received the download link on March 28th at 19:54; the screenshot was taken on March 24th at 09:37. That is 4 days and 10 hours of "cooking" time, so this method will not help much in making git cloning more reliable.

My guess is that at least 1 million developers ran into speed bumps like this when cloning with git in 2025. Most of these "429 Too Many Requests" errors happen in automated CI/CD environments, where thousands of runners behind a single IP address hammer a central server like GitHub for the same data rather than using a local cache.

Moving toward content-based addressing (SWHID) for these automated fetches would mitigate this. Beyond long-term reproducibility, it would immediately reduce the number of CI runs that have to be re-triggered after such failures.
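
The caching argument can be sketched in a few lines of Python. Everything here is hypothetical (the `ContentCache` class, the `fetch_remote` callback, the in-memory store); the point is only that an identifier derived from the content lets any cache serve the data and lets the client verify it without trusting the source.

```python
import hashlib
from typing import Callable, Dict


def content_swhid(data: bytes) -> str:
    """swh:1:cnt identifier of data (same SHA-1 as the Git blob)."""
    return "swh:1:cnt:" + hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


class ContentCache:
    """Serve content by SWHID from a local store, contacting the remote only on a miss."""

    def __init__(self, fetch_remote: Callable[[str], bytes]):
        self._fetch_remote = fetch_remote  # hypothetical network fetcher
        self._store: Dict[str, bytes] = {}
        self.remote_calls = 0

    def get(self, swhid: str) -> bytes:
        if swhid in self._store:
            return self._store[swhid]
        self.remote_calls += 1
        data = self._fetch_remote(swhid)
        # The identifier is intrinsic, so the result can be checked locally.
        if content_swhid(data) != swhid:
            raise ValueError("content does not match " + swhid)
        self._store[swhid] = data
        return data
```

With a shared cache of this kind, repeated CI runs asking for the same identifier would reach the central server only once.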

All the best
Moritz

Moritz

Apr 2, 2026, 8:58:57 AM
to SWHID (Software Hash Identifiers) discussions
Dear all,

I continued the discussion with Maxence over lunch. It seems that, regardless of the type of SWHID we use, we will always need to specify the origin, at least if we want to be able to move forward in time.

With current technology, instead of

git clone 'swh://swh:1:rev:676fe44740a14c4f0e09ef4a6dc335864e1727ca;origin=https://github.com/wikimedia/mediawiki'

we can run

git clone --revision 676fe44740a14c4f0e09ef4a6dc335864e1727ca https://github.com/wikimedia/mediawiki
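
The translation from the qualified SWHID to the plain git command can be automated. The Python sketch below is an illustration under stated assumptions: the helper names are made up, only the `origin` qualifier is used, and the resulting command mirrors the one above, so it needs a Git version that supports `git clone --revision`.

```python
import re

# Core SWHID: scheme, version 1, object type, 40 hex digits.
CORE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")


def parse_swhid(text: str):
    """Split a qualified SWHID into (object type, hex digest, qualifiers)."""
    core, *quals = text.split(";")
    match = CORE.match(core)
    if match is None:
        raise ValueError("not a core SWHID: " + core)
    # Qualifiers are key=value pairs; split on the first '=' only.
    qualifiers = dict(q.split("=", 1) for q in quals)
    return match.group(1), match.group(2), qualifiers


def clone_command(text: str) -> list:
    """Build the argument list for cloning a revision from its declared origin."""
    kind, digest, qualifiers = parse_swhid(text)
    if kind != "rev" or "origin" not in qualifiers:
        raise ValueError("need a rev SWHID with an origin qualifier")
    return ["git", "clone", "--revision", digest, qualifiers["origin"]]
```

Returning an argument list rather than a string avoids shell quoting problems with the `;` in the identifier.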

If the origin is broken, we can instruct people to instead use


Our recommendation is to check that both work, and to develop helper scripts that let people check their manuscripts prior to submission.

With approaches like https://github.com/seeraven/gitcache, one can add logic that fetches from the best available source before contacting the specified origin.

After some discussion, we finally decided to take a closer look at radicle.xyz, as skepticism towards GitHub has risen considerably in our bubble over the last few months.

We'll keep you posted
Moritz


