The Software Heritage project (https://softwareheritage.org) has asked
me to investigate adding Git Protocol version 2 support to Dulwich.
This project would allow me to contribute to both Software Heritage and
Dulwich at the same time. I have already made contributions to
Software Heritage in the past, but I am new to Dulwich development.
I already wrote a short off-list introduction to Jelmer. I would like
to discuss the technical aspects of this project on this mailing list.
My goal for this discussion is to mediate between the requirements of
Software Heritage and the requirements of the Dulwich project.
The corresponding ticket on the Dulwich side is relatively empty:
https://github.com/jelmer/dulwich/issues/628
Based on a brief off-list discussion with Jelmer, I have already learned
that protocol code should be usable by a Git client which uses the
Python asyncio module. As far as I understand, this means APIs exposed
by the protocol implementation should not result in blocking system calls
being made. I suppose the APIs could pass data in memory buffers, or make
use of callbacks which allow the API user to decide whether to block.
Is the existing protocol code already compatible with asyncio?
If it is not, should the existing protocol code be adjusted for this,
or would it make more sense to create an additional protocol module?
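To make the "data in memory buffers" idea concrete, here is a minimal sketch of a pkt-line decoder that performs no I/O at all. The class name PktLineDecoder and its feed() method are hypothetical and not part of the existing Dulwich API; the caller, blocking or asyncio-based, decides how bytes are read from the network.

```python
class PktLineDecoder:
    """Incremental pkt-line decoder that never touches a socket.

    Hypothetical illustration only; not an existing Dulwich class.
    """

    FLUSH = object()         # special packet "0000"
    DELIM = object()         # special packet "0001" (protocol v2)
    RESPONSE_END = object()  # special packet "0002" (protocol v2)

    _SPECIAL = {0: FLUSH, 1: DELIM, 2: RESPONSE_END}

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data):
        """Append received bytes; return every complete packet seen so far."""
        self._buf.extend(data)
        packets = []
        while len(self._buf) >= 4:
            # The first four bytes hold the total packet length in hex.
            length = int(bytes(self._buf[:4]), 16)
            if length in self._SPECIAL:
                packets.append(self._SPECIAL[length])
                del self._buf[:4]
            elif length == 3:
                raise ValueError("invalid pkt-line length 0003")
            elif len(self._buf) < length:
                break  # incomplete packet: wait for more data
            else:
                packets.append(bytes(self._buf[4:length]))
                del self._buf[:length]
        return packets


decoder = PktLineDecoder()
# Data may arrive in arbitrary chunks; partial packets stay buffered.
first = decoder.feed(b"000eversion 2\n00")
second = decoder.feed(b"00")
```

A blocking client could call feed() with the result of a socket read, while an asyncio client could call it from a stream reader or a Protocol.data_received() callback; the decoder itself stays the same in both cases.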
From the Software Heritage perspective, the two most interesting features
available with Git Protocol version 2 are the "ls-refs" command, for the obvious
reason of a faster initial handshake with the server, and object filtering
in the "fetch" command. The latter is interesting because Software Heritage
is ingesting as much source code as possible, and there are some very large
Git repositories out there which they have to deal with.
Reducing Git protocol overhead and size of pack files fetched into such
repositories would help reduce the amount of work done by Software Heritage
Git-Loader machines, which are continuously scraping sites such as GitHub
for ever more source code history stored in Git repositories.
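As a rough illustration of what these two features look like on the wire, the sketch below encodes a v2 "ls-refs" request and a "fetch" request carrying a filter. The helper names pkt_line() and encode_command() are hypothetical, the object ID is a placeholder, and the "blob:none" filter spec is just one example; a real client may only send "filter" if the server advertised it for the fetch command.

```python
FLUSH_PKT = b"0000"
DELIM_PKT = b"0001"


def pkt_line(payload):
    """Prefix a payload with its 4-digit hex length (length includes the prefix)."""
    return b"%04x" % (len(payload) + 4) + payload


def encode_command(command, capabilities=(), args=()):
    """Encode a protocol v2 command request (hypothetical helper).

    Layout: command line, capability lines, delim-pkt, argument lines, flush-pkt.
    """
    out = [pkt_line(b"command=" + command + b"\n")]
    out += [pkt_line(cap + b"\n") for cap in capabilities]
    out.append(DELIM_PKT)
    out += [pkt_line(arg + b"\n") for arg in args]
    out.append(FLUSH_PKT)
    return b"".join(out)


# Ask only for branches instead of receiving every ref up front.
ls_refs_request = encode_command(
    b"ls-refs",
    args=[b"peel", b"symrefs", b"ref-prefix refs/heads/"],
)

# Placeholder object ID, for illustration only.
wanted = b"0123456789abcdef0123456789abcdef01234567"

# Fetch history without blobs; assumes the server advertised fetch=filter.
fetch_request = encode_command(
    b"fetch",
    args=[b"want " + wanted, b"filter blob:none", b"done"],
)
```

The "ref-prefix" argument is what allows the handshake to skip refs the client does not care about, and the "filter" argument is what reduces the size of the resulting pack file.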
Regarding implementation steps in Dulwich, I have come up with the plan
below. Some of the steps on my list could be re-ordered, and some of them
could be omitted or left for later work (especially the optional ones).
This depends on my available time (Software Heritage provides funding for
this effort) and on common ground between the preferences expressed by
Software Heritage and Dulwich.
A full implementation of protocol v2 features would take a lot of effort.
Fortunately, many features can be added independently of one another, and
some of them are optional. The protocol v2 feature set can be split up
very roughly into 4 independent implementation compartments, one for each
combination of side (client or server) and transport layer (git-tcp or http).
For Software Heritage, client and http are most important, but it is
reasonable to expect that an implementation which limits itself to
these aspects would be lacking from the Dulwich point of view.
Dulwich's criteria for feature-parity and quality are important because
Software Heritage would like to see protocol v2 support shipped in newer
Dulwich releases; having to maintain local patches in their dependencies
would be a huge distraction from their actual mission.
I feel I still need to learn a bit more about Dulwich's criteria in order
to finalize my plan. Could anyone expand on what is considered critically
important before such a feature can be shipped in a Dulwich release?
What are the guarantees which Dulwich users should be able to take for
granted?
Work on this project will likely progress in stages, adding more features
incrementally over time. I will work with the Software Heritage team to
evaluate new code in their infrastructure, both to get new code tested
and to determine how much impact protocol v2 features have on day-to-day
operation of the Software Heritage archive.
I will likely have to iterate on APIs a bit, since testing new code
with the SWH Git Loader will require me to expose APIs to it early, which
would perhaps undergo review at Dulwich only afterwards. Over time, I hope
to learn a lot about Dulwich API conventions in order to avoid repeated
friction on such aspects. I hope I can occasionally ask API design questions
on this list, in case I am unsure about something.
Regards,
Stefan
Git Protocol v2 implementation steps:
Each goalpost below implies support in the protocol layer and in the client
or server layer, the addition of corresponding regression tests, and
exposure of new features in the dulwich CLI where useful.
Ignored v2 features: server-options, object-format, session-id
I am not sure what these are used for, and they seem too trivial to add
later in case someone has a use case for them.
Milestone 1a: client/git-tcp-transport
Goalpost 1: announce v2 support and understand initial server response
Goalpost 2: support ls-refs command
Goalpost 3: fetch works with minimal command set: want, have, done
Goalpost 4: fetch supports thin-pack
Goalpost 5: fetch supports shallow
Goalpost 6: fetch supports deepen
Goalpost 7: fetch supports filter
optional Goalpost 8: fetch supports include-tag
optional Goalpost 9: fetch supports want-ref
optional Goalpost 10: fetch supports deepen-relative, deepen-since, deepen-not
optional Goalpost 11: fetch supports sideband-all
optional Goalpost 12: fetch supports wait-for-done
optional Goalpost 13: object-info command
Milestone 1b: client/http-transport
repeat goalposts for client/git-tcp-transport
optional Goalpost 14: fetch supports packfile-uris (this feature requires HTTP)
Milestone 2a: server/git-tcp-transport
Goalpost 1: announce v2 support to connecting clients, send initial v2 capability advertisement
Goalpost 2: support ls-refs command response
Goalpost 3: fetch-response works with minimal command set: want, have, done
Goalpost 4: fetch-response supports thin-pack
Goalpost 5: fetch-response supports shallow (in v2, this implies support for deepen, deepen-relative, deepen-since, deepen-not)
Goalpost 6: fetch-response supports filter
optional Goalpost 7: fetch-response supports include-tag
optional Goalpost 8: fetch-response supports want-ref
optional Goalpost 9: fetch-response supports sideband-all
optional Goalpost 10: fetch-response supports wait-for-done
Milestone 2b: server/http-transport
repeat goalposts for server/git-tcp-transport
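
For Goalpost 1 of Milestone 1a, "understanding the initial server response" could look roughly like the sketch below. It assumes the pkt-lines of the capability advertisement have already been decoded into a list of payloads (everything before the flush-pkt). The function names are hypothetical and the advertised values are made-up sample data.

```python
def parse_capability_advertisement(lines):
    """Turn decoded advertisement payloads into a dict (hypothetical helper).

    Capabilities without a value map to None; "key=value" maps key to value.
    """
    if not lines or lines[0].rstrip(b"\n") != b"version 2":
        raise ValueError("server did not answer with protocol version 2")
    caps = {}
    for line in lines[1:]:
        key, sep, value = line.rstrip(b"\n").partition(b"=")
        caps[key] = value if sep else None
    return caps


def fetch_features(caps):
    """Return the set of features advertised for the fetch command."""
    value = caps.get(b"fetch")
    return set(value.split()) if value else set()


# Sample advertisement, invented for illustration.
advertisement = [
    b"version 2\n",
    b"agent=git/2.40.0\n",
    b"ls-refs\n",
    b"fetch=shallow filter\n",
]

caps = parse_capability_advertisement(advertisement)
can_filter = b"filter" in fetch_features(caps)
```

A client would consult the result before each later goalpost, for example sending "filter" or "shallow" arguments only when the corresponding feature appears in the set returned by fetch_features().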