Git protocol v2 and Software Heritage

Stefan Sperling

Aug 23, 2022, 7:44:25 AM

to dulwich...@googlegroups.com

The Software Heritage project (https://softwareheritage.org) has asked
me to investigate adding Git Protocol version 2 support to Dulwich.
This project would allow me to contribute to both Software Heritage and
Dulwich at the same time. I have already made contributions to
Software Heritage in the past, but I am new to Dulwich development.

I already wrote a short off-list introduction to Jelmer. I would like
to discuss the technical aspects of this project on this mailing list.
My goal for this discussion is to mediate between the requirements of
Software Heritage and the requirements of the Dulwich project.

The corresponding ticket on the Dulwich side is relatively empty:
https://github.com/jelmer/dulwich/issues/628
Based on brief off-list discussion with Jelmer, I have already learned
that protocol code should be usable for a Git client which uses the
asyncio python module. As far as I understand, this means APIs exposed
by the protocol implementation should not result in blocking system calls
being made. I suppose the APIs could pass data in memory buffers, or make
use of callbacks which allow the API user to decide whether to block.

Is the existing protocol code already compatible with asyncio?
If it is not, should the existing protocol code be adjusted for this,
or would it make more sense to create an additional protocol module?

From the Software Heritage perspective, the two most interesting features
available with Git Protocol version 2 are "ls-refs", for the obvious
reason of a faster initial handshake with the server, and object filtering
in the "fetch" command. The latter is interesting because Software Heritage
is ingesting as much source code as possible, and there are some very large
Git repositories out there which they have to deal with.
Reducing Git protocol overhead and size of pack files fetched into such
repositories would help reduce the amount of work done by Software Heritage
Git-Loader machines, which are continuously scraping sites such as GitHub
for ever more source code history stored in Git repositories.
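
To make these two features a bit more concrete, here is a rough sketch of
what a client would put on the wire for an ls-refs request and for a fetch
with an object filter. The helper names and the all-zero object ID are made
up for illustration; only the pkt-line framing and the argument names are
taken from the protocol v2 documentation. A real client would send "filter"
only if the server advertised it for the fetch command.

```python
FLUSH_PKT = b"0000"  # terminates a request
DELIM_PKT = b"0001"  # separates capabilities from command arguments


def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its 4-digit hex length (the length counts the prefix)."""
    return b"%04x" % (len(payload) + 4) + payload


def v2_request(command, capabilities=(), args=()):
    """Serialize one protocol v2 command request (hypothetical helper)."""
    out = [pkt_line(b"command=" + command.encode() + b"\n")]
    out += [pkt_line(cap.encode() + b"\n") for cap in capabilities]
    if args:
        out.append(DELIM_PKT)
        out += [pkt_line(arg.encode() + b"\n") for arg in args]
    out.append(FLUSH_PKT)
    return b"".join(out)


# List branch heads only, with peeled tags and symref targets.
ls_refs = v2_request("ls-refs", args=["peel", "symrefs", "ref-prefix refs/heads/"])

# Fetch one commit without any blobs; the object ID is a placeholder.
oid = "0" * 40
fetch = v2_request("fetch", args=["want " + oid, "filter blob:none", "done"])
```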

Regarding implementation steps in Dulwich, I have come up with the plan
below. Some of the steps on my list could be re-ordered, and some of them
could be omitted or left for later work (especially the optional ones).
This depends on my available time (Software Heritage provides funding for
this effort) and on common ground between the preferences expressed by
Software Heritage and Dulwich.

A full implementation of protocol v2 features would take a lot of effort.
Fortunately, many features can be added independently of one another, and
some of them are optional. The protocol v2 feature set can be split up
very roughly into four independent implementation compartments, one for
each combination of side (client or server) and transport layer (git-tcp
or http).

For Software Heritage, client and http are most important, but it is
reasonable to expect that an implementation which limits itself to
these aspects would be lacking from the Dulwich point of view.
Dulwich's criteria for feature-parity and quality are important because
Software Heritage would like to see protocol v2 support shipped in newer
Dulwich releases; having to maintain local patches in their dependencies
would be a huge distraction from their actual mission.
I feel I still need to learn a bit more about Dulwich's criteria in order
to finalize my plan. Could anyone expand on what is considered critically
important when such a feature would be shipped in a Dulwich release?
What are the guarantees which Dulwich users should be able to take for
granted?

Work on this project will likely progress in stages, adding more features
incrementally over time. I will work with the Software Heritage team to
evaluate new code in their infrastructure, both to get new code tested
and to determine how much impact protocol v2 features have on day-to-day
operation of the Software Heritage archive.

I will likely have to iterate on APIs a bit, since testing new code
with the SWH Git Loader will require me to expose APIs to it early, which
would perhaps undergo review at Dulwich only afterwards. Over time, I hope
to learn a lot about Dulwich API conventions in order to avoid repeated
friction on such aspects. I hope I can occasionally ask API design questions
on this list, in case I am unsure about something.

Regards,
Stefan

Git Protocol v2 implementation steps:

Each goalpost below implies support in the protocol layer, and the client
or server layer, the addition of corresponding regression tests, and
exposure of new features at the dulwich CLI where useful.

Ignored v2 features: server-option, object-format, session-id
I am not sure what these are used for, and they seem too trivial to add
later in case someone has a use case for them.

Milestone 1a: client/git-tcp-transport
Goalpost 1: announce v2 support and understand initial server response
Goalpost 2: support ls-refs command
Goalpost 3: fetch works with minimal command set: want, have, done
Goalpost 4: fetch supports thin-pack
Goalpost 5: fetch supports shallow
Goalpost 6: fetch supports deepen
Goalpost 7: fetch supports filter
optional Goalpost 8: fetch supports include-tag
optional Goalpost 9: fetch supports want-ref
optional Goalpost 10: fetch supports deepen-relative, deepen-since, deepen-not
optional Goalpost 11: fetch supports sideband-all
optional Goalpost 12: fetch supports wait-for-done
optional Goalpost 13: object-info command

Milestone 1b: client/http-transport
repeat goalposts for client/git-tcp-transport
optional Goalpost 14: fetch supports packfile-uris (this feature requires HTTP)

Milestone 2a: server/git-tcp-transport
Goalpost 1: announce v2 support to connecting clients, send initial v2 capability advertisement
Goalpost 2: support ls-refs command response
Goalpost 3: fetch-response works with minimal command set: want, have, done
Goalpost 4: fetch-response supports thin-pack
Goalpost 5: fetch-response supports shallow (in v2, this implies support for deepen, deepen-relative, deepen-since, deepen-not)
Goalpost 6: fetch-response supports filter
optional Goalpost 7: fetch-response supports include-tag
optional Goalpost 8: fetch-response supports want-ref
optional Goalpost 9: fetch-response supports sideband-all
optional Goalpost 10: fetch-response supports wait-for-done

Milestone 2b: server/http-transport
repeat goalposts for server/git-tcp-transport
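
As an illustration of Goalpost 1 on the client side, the sketch below parses
the initial v2 capability advertisement into a dictionary. The function
names and the sample advertisement are made up for this example and are not
existing Dulwich APIs; the "version 2" first line and the key[=value] layout
follow the protocol v2 documentation.

```python
def parse_capability_advertisement(lines):
    """Turn the payloads of a v2 capability advertisement into a dict.

    `lines` are pkt-line payloads up to (not including) the flush-pkt.
    Capabilities without a value map to None.
    """
    lines = [line.rstrip(b"\n").decode() for line in lines]
    if not lines or lines[0] != "version 2":
        raise ValueError("server did not announce protocol v2")
    caps = {}
    for line in lines[1:]:
        key, sep, value = line.partition("=")
        caps[key] = value if sep else None
    return caps


def fetch_features(caps):
    """Features listed for the fetch command, e.g. {'shallow', 'filter'}."""
    return set((caps.get("fetch") or "").split())


# Illustrative sample only; a real server's advertisement will differ.
advert = [
    b"version 2\n",
    b"agent=git/2.37.2\n",
    b"ls-refs=unborn\n",
    b"fetch=shallow wait-for-done filter\n",
    b"server-option\n",
    b"object-format=sha1\n",
]
caps = parse_capability_advertisement(advert)
```

A client could then check, for example, `"filter" in fetch_features(caps)`
before deciding whether to send a filter argument.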

Jelmer Vernooij

Aug 30, 2022, 7:43:39 PM

to Stefan Sperling, dulwich...@googlegroups.com

Hi Stefan,

On Tue, 23 Aug 2022 at 12:44, Stefan Sperling <st...@stsp.name> wrote:

> The Software Heritage project (https://softwareheritage.org) has asked
> me to investigate adding Git Protocol version 2 support to Dulwich.
> This project would allow me to contribute to both Software Heritage and
> Dulwich at the same time. I have already made contributions to
> Software Heritage in the past, but I am new to Dulwich development.

Thanks again for working on protocol v2 support!

> The corresponding ticket on the Dulwich side is relatively empty:
> https://github.com/jelmer/dulwich/issues/628
> Based on brief off-list discussion with Jelmer, I have already learned
> that protocol code should be usable for a Git client which uses the
> asyncio python module. As far as I understand, this means APIs exposed
> by the protocol implementation should not result in blocking system calls
> being made. I suppose the APIs could pass data in memory buffers, or make
> use of callbacks which allow the API user to decide whether to block.

The main reason I brought up asyncio support is that I think it would be fairly easy to separate out the serialization/deserialization code from the networking logic now, where it will take a lot of time later to tear it apart when adding asyncio support.

For example, I imagine that there'd be a blocking function for wrapping a string in a pktline, but that that gets invoked by both the sync and the (future) async code. This is already how dulwich.client works to some extent.
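
A minimal sketch of that split, assuming hypothetical class and function
names (none of these are existing Dulwich APIs): the pkt-line encoding is a
pure function, and a blocking writer and an asyncio writer both call it.

```python
import asyncio
import io

MAX_PKT_PAYLOAD = 65516  # 65520-byte pkt-line limit minus the 4-byte header


def encode_pkt_line(data: bytes) -> bytes:
    """Pure function: performs no I/O, so sync and async callers can share it."""
    if len(data) > MAX_PKT_PAYLOAD:
        raise ValueError("pkt-line payload too long")
    return b"%04x" % (len(data) + 4) + data


class SyncWriter:
    """Blocking caller: wraps any file-like object that has write()."""

    def __init__(self, f):
        self._f = f

    def write_pkt_line(self, data: bytes) -> None:
        self._f.write(encode_pkt_line(data))


class AsyncWriter:
    """Async caller: wraps an object with the asyncio.StreamWriter interface."""

    def __init__(self, writer):
        self._writer = writer

    async def write_pkt_line(self, data: bytes) -> None:
        self._writer.write(encode_pkt_line(data))
        await self._writer.drain()  # the only place that may suspend
```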

> Is the existing protocol code already compatible with asyncio?
> If it is not, should the existing protocol code be adjusted for this,
> or would it make more sense to create an additional protocol module?

The existing protocol isn't asyncio compatible. I don't think it necessarily makes sense to create one big class that supports both async and sync operation, but it may make sense to factor out some of the code that could be shared between the AsyncProtocol and Protocol classes - and have them both call that shared code.

> For Software Heritage, client and http are most important, but it is
> reasonable to expect that an implementation which limits itself to
> these aspects would be lacking from the Dulwich point of view.
> Dulwich's criteria for feature-parity and quality are important because
> Software Heritage would like to see protocol v2 support shipped in newer
> Dulwich releases; having to maintain local patches in their dependencies
> would be a huge distraction from their actual mission.
> I feel I still need to learn a bit more about Dulwich's criteria in order
> to finalize my plan. Could anyone expand on what is considered critically
> important when such a feature would be shipped in a Dulwich release?
> What are the guarantees which Dulwich users should be able to take for
> granted?
>
> Work on this project will likely progress in stages, adding more features
> incrementally over time. I will work with the Software Heritage team to
> evaluate new code in their infrastructure, both to get new code tested
> and to determine how much impact protocol v2 features have on day-to-day
> operation of the Software Heritage archive.
>
> I will likely have to iterate on APIs a bit, since testing new code
> with the SWH Git Loader will require me to expose APIs to it early, which
> would perhaps undergo review at Dulwich only afterwards. Over time, I hope
> to learn a lot about Dulwich API conventions in order to avoid repeated
> friction on such aspects. I hope I can occasionally ask API design questions
> on this list, in case I am unsure about something.

That all sounds reasonable.

> Git Protocol v2 implementation steps:
>
> Each goalpost below implies support in the protocol layer, and the client
> or server layer, the addition of corresponding regression tests, and
> exposure of new features at the dulwich CLI where useful.
>
> Ignored v2 features: server-option, object-format, session-id
> I am not sure what these are used for, and they seem too trivial to add
> later in case someone has a use case for them.
>
> Milestone 1a: client/git-tcp-transport
> Goalpost 1: announce v2 support and understand initial server response

Would the v2 support be inside of the existing classes, or something separate?

If you're including it in the existing classes, we need to make sure that we don't have any regressions in terms of what features are supported (unless you're hiding v2 support behind a flag of some sort).

> Goalpost 2: support ls-refs command
> Goalpost 3: fetch works with minimal command set: want, have, done
> Goalpost 4: fetch supports thin-pack
> Goalpost 5: fetch supports shallow
> Goalpost 6: fetch supports deepen
> Goalpost 7: fetch supports filter
> optional Goalpost 8: fetch supports include-tag
> optional Goalpost 9: fetch supports want-ref
> optional Goalpost 10: fetch supports deepen-relative, deepen-since, deepen-not
> optional Goalpost 11: fetch supports sideband-all
> optional Goalpost 12: fetch supports wait-for-done
> optional Goalpost 13: object-info command
>
> Milestone 1b: client/http-transport
> repeat goalposts for client/git-tcp-transport
> optional Goalpost 14: fetch supports packfile-uris (this feature requires HTTP)
>
> Milestone 2a: server/git-tcp-transport
> Goalpost 1: announce v2 support to connecting clients, send initial v2 capability advertisement

Wouldn't you want to run these in parallel? A lot of the tests that dulwich has are run against its own server (as well as the git server).

Stefan Sperling

Sep 16, 2022, 6:50:13 AM

to Jelmer Vernooij, dulwich...@googlegroups.com

On Wed, Aug 31, 2022 at 12:43:27AM +0100, Jelmer Vernooij wrote:
> On Tue, 23 Aug 2022 at 12:44, Stefan Sperling <st...@stsp.name> wrote:
> > Based on brief off-list discussion with Jelmer, I have already learned
> > that protocol code should be usable for a Git client which uses the
> > asyncio python module. As far as I understand, this means APIs exposed
> > by the protocol implementation should not result in blocking system calls
> > being made. I suppose the APIs could pass data in memory buffers, or make
> > use of callbacks which allow the API user to decide whether to block.
> >
> The main reason I brought up asyncio support is that I think it would be
> fairly easy to separate out the serialization/deserialization code from the
> networking logic now, where it will take a lot of time later to tear it
> apart when adding asyncio support.
>
> For example, I imagine that there'd be a blocking function for wrapping a
> string in a pktline, but that that gets invoked by both the sync and the
> (future) async code. This is already how dulwich.client works to some
> extent.

An async client would need to be able to do something like this?

await proto.write_pkt_line(data)

and similarly, for reading:

data = await proto.read_pkt_line()

Would such functions (coroutines) even be useful in a non-asyncio context?
As far as I understand, an event loop is required to make use of coroutines,
which cannot be called directly. Or maybe the non-async Git client would do
something like treating the coroutine as an iterator, similarly to the
existing read_pkt_seq() function? Something like:

pkt = next(proto.read_pkt_line())
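
A quick experiment along these lines (FakeProto is a made-up stand-in for a
hypothetical async protocol class) suggests that next() does not work on a
coroutine object, because coroutine objects are not iterators, whereas a
synchronous caller can still drive the coroutine with asyncio.run():

```python
import asyncio


class FakeProto:
    """Stand-in for a hypothetical async protocol class."""

    async def read_pkt_line(self) -> bytes:
        await asyncio.sleep(0)  # pretend to wait for the network
        return b"hello\n"


proto = FakeProto()

# Attempt 1: treat the coroutine object as an iterator.
coro = proto.read_pkt_line()
try:
    next(coro)
    iterator_ok = True
except TypeError:
    iterator_ok = False  # coroutine objects do not implement __next__
finally:
    coro.close()  # avoids a "coroutine was never awaited" warning

# Attempt 2: let an event loop run the coroutine to completion.
pkt = asyncio.run(proto.read_pkt_line())
```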

> > Is the existing protocol code already compatible with asyncio?
> > If it is not, should the existing protocol code be adjusted for this,
> > or would it make more sense to create an additional protocol module?
> >
> The existing protocol isn't asyncio compatible. I don't think it
> necessarily makes sense to create one big class that supports both async
> and sync operation, but it may make sense to factor out some of the code
> that could be shared between the AsyncProtocol and Protocol classes - and
> have them both call that shared code.

Right, that makes sense. Supporting both async and sync consumers of
the same API is not trivial.

If the shared code was only operating on data stored in memory buffers
it would be trivial to share between the two classes. The maximum amount
of data in a packet line is about 64k, so at least adding and stripping
pkt-line headers could be done entirely in memory. I am not sure how to
handle cases where an I/O operation happens in shared code. But I do not
have prior experience with writing asyncio code, so maybe there is an
idiom people use that I don't know about?
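
One approach that seems to fit is to keep the shared code entirely free of
I/O: the caller reads bytes however it likes and feeds them to a parser. The
sketch below assumes hypothetical names (PktLineDecoder and the two reader
helpers are not existing Dulwich APIs) and simplifies by returning the
special packets (0000, 0001, 0002) as None.

```python
class PktLineDecoder:
    """Incremental pkt-line parser that never touches a socket or file."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Add received bytes; return the packets completed by this call."""
        self._buf += data
        packets = []
        while len(self._buf) >= 4:
            length = int(bytes(self._buf[:4]), 16)
            if length < 4:
                # Special packet (flush/delim/response-end); simplified to None.
                packets.append(None)
                del self._buf[:4]
                continue
            if len(self._buf) < length:
                break  # incomplete packet: wait for more bytes
            packets.append(bytes(self._buf[4:length]))
            del self._buf[:length]
        return packets


def read_packets_sync(sock, decoder):
    """Blocking caller: `sock` is anything with a recv() method."""
    return decoder.feed(sock.recv(65536))


async def read_packets_async(reader, decoder):
    """Async caller: `reader` is anything with an awaitable read() method."""
    return decoder.feed(await reader.read(65536))
```

Only the two small reader helpers differ between the sync and async cases;
the parsing itself is shared.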

It would help to understand how you envision the partitioning of high-level
tasks performed by a Git client, i.e. what the main event loop would do.
Is one goal to be able to start indexing the pack file before the whole
pack file has been received? This would require reading a pktline worth
of packfile-data sideband (or some fixed-size chunk of packfile data in
the non-sideband case), queuing the packfile data in chunks, and then
while we are waiting for more packfile data to arrive over the network we
could have a consumer pull packfile data from the queue and feed it to
the pack indexing process. I can see how asyncio would be useful here.
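
Roughly, I picture something like the following producer/consumer sketch.
All names are hypothetical, the "network" is just a list of byte chunks, and
the "indexer" only hashes and counts the data; the point is merely to show
the queue sitting between receiving and indexing.

```python
import asyncio
import hashlib


async def receive_pack_data(chunks, queue):
    """Producer: stands in for reading packfile chunks off the network."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real network read would
        await queue.put(chunk)
    await queue.put(None)  # sentinel: no more data


async def index_pack_data(queue):
    """Consumer: stands in for the pack indexer; here it only hashes."""
    digest = hashlib.sha1()
    total = 0
    while (chunk := await queue.get()) is not None:
        digest.update(chunk)
        total += len(chunk)
    return total, digest.hexdigest()


async def fetch_and_index(chunks):
    # Bounded queue, so the receiver cannot run arbitrarily far ahead.
    queue = asyncio.Queue(maxsize=8)
    _, result = await asyncio.gather(
        receive_pack_data(chunks, queue),
        index_pack_data(queue),
    )
    return result
```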

> > Each goalpost below implies support in the protocol layer, and the client
> > or server layer, the addition of corresponding regression tests, and
> > exposure of new features at the dulwich CLI where useful.
> >
> > Ignored v2 features: server-option, object-format, session-id
> > I am not sure what these are used for, and they seem too trivial to add
> > later in case someone has a use case for them.
> >
> > Milestone 1a: client/git-tcp-transport
> > Goalpost 1: announce v2 support and understand initial server response
>
> Would the v2 support be inside of the existing classes, or something
> separate?
>
> If you're including it in the existing classes, we need to make sure that
> we don't have any regressions in terms of what features are supported
> (unless you're hiding v2 support behind a flag of some sort).

Modifying existing code would be much easier for me, simply because it would
make it easy to conform to the style of existing code, and to become aware
of design choices that have been made over time.
But given that adding asyncio to existing code can be complicated, I am
open to adding new code in order to make use of asyncio possible in the
future. It seems like a good idea, and I can learn something new :)
I can probably pick a lot of parts from existing code in any case.

It is certain that v2 support must be hidden behind a flag in parts of
the implementation because dulwich must keep working with protocol-v1
peers. This doesn't need to be a user-visible switch as use of
protocol-v2 can be transparently negotiated with the peer. For tests,
though, it must be possible to enforce a particular protocol version.
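
As a sketch of how a test could force the version on the client side: per
Git's protocol documentation, a client asks for v2 over git:// by appending
a "version=2" extra parameter to the initial request, and over HTTP by
sending a "Git-Protocol: version=2" header. The helper names and the
`version` parameter below are hypothetical, not existing Dulwich APIs. A
server without v2 support is expected to answer with protocol v0/v1.

```python
def git_proto_request(path: str, host: str, version: int = 0) -> bytes:
    """Build the initial git:// request pkt-line for git-upload-pack.

    With a non-zero version, "version=N" is appended as an extra parameter
    after an additional NUL byte.
    """
    payload = b"git-upload-pack " + path.encode() + b"\0host=" + host.encode() + b"\0"
    if version:
        payload += b"\0version=%d\0" % version
    return b"%04x" % (len(payload) + 4) + payload


def http_protocol_headers(version: int = 0) -> dict:
    """Extra HTTP request headers announcing the desired protocol version."""
    return {"Git-Protocol": "version=%d" % version} if version else {}


v1_request = git_proto_request("/project.git", "example.com")
v2_request = git_proto_request("/project.git", "example.com", version=2)
```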

I would like to use the existing regression tests in order to check for
newly introduced breakage, and be able to run the existing test suite
against git-daemon with either protocol v0/1 enabled, or protocol v2
enabled. All tests should keep passing, perhaps with small adjustments.
Protocol v2 is a true superset of protocol v1 in terms of features.
In cases where v2 requires a collection of features of which dulwich only
implements a subset ("deepen" comes to mind?), we could of course skip
affected tests while testing with v2, or add the missing features if that
doesn't delay progress on the core goals too much.

> > Milestone 1b: client/http-transport
> > repeat goalposts for client/git-tcp-transport
> > optional Goalpost 14: fetch supports packfile-uris (this feature requires
> > HTTP)
> >
> > Milestone 2a: server/git-tcp-transport
> > Goalpost 1: announce v2 support to connecting clients, send initial v2
> > capability advertisement
> >
>
> Wouldn't you want to run these in parallel? A lot of the tests that dulwich
> has are run against its own server (as well as the git server).

Possibly yes, though adding protocol-v2 support to both http and git-tcp
in parallel might be more challenging than doing one after the other.
Since protocol-v2 is enabled during the initial capabilities exchange,
doing http and git-tcp one after the other is indeed feasible.

We have to start somewhere. Initially, for working on client-side support,
I would rely on git-daemon as the v2-protocol capable server to test
against, and use dulwich's own server only during v1-protocol test runs
in order to catch regressions. Overall, it is the client which benefits
most from protocol-v2. On the server's end, serving a protocol-v1 client
implies less cpu-bound work than serving a v2 client does; the main benefit
of v2 is that the server might end up sending less data over the network.
