Idea for future Gerrit

Gergely Kis

Nov 14, 2008, 1:22:21 PM
to repo-d...@googlegroups.com
Hi,

On Thu, Nov 13, 2008 at 8:52 PM, Shawn Pearce <s...@google.com> wrote:
> You read my mind. I have not yet had time to put together some short docs
> on what Gerrit 2 is and what we need to implement feature wise, but I
> planned to get that all posted this weekend, along with a new development
> branch.
>
> Gerrit 2 will use a GWT frontend and a Java backend. We are moving it off
> GAE as a lot of people have asked us to make it available for them to run
> locally without using the Google servers.
>
> Our decision to use GAE for Gerrit 1 wasn't a policy choice. We did it
> because we forked from Rietveld and we didn't want to rewrite everything
> from the ground up, given how little time we had to put together the entire
> repo and Gerrit chain.
>
> So yeah, it sounds like we want the same thing, and I'd love to have you
> involved. I'll post more by this weekend, including the initial code tree.
Great, thanks for the update.

In the meantime I looked at how the communication flow works currently
between the local repositories and Gerrit.
If I understand it correctly, it roughly does the following:
- repo (or git-upload.py) creates a git bundle with the commit(s)
being sent to Gerrit.
- That bundle is then examined on the server side, and the information
necessary for the review is stored in the GAE part.
- When the review process ends with a "let's merge" decision, Gerrit
tries to apply the bundle to the original repository.

I was thinking of a slightly different method:
Preconditions:
- there are the public repositories, with the merged code.
- there is also a set of repositories used by the Gerrit server,
cloned from the public repositories.

The developer works in his local repositories on a topic branch, and
when he is ready, repo pushes his branches from the different local
repositories to the Gerrit-specific git repositories. To have a
common id for these branches, a branch name (the current Gerrit change
id) is requested from the Gerrit server prior to pushing.

This way the new Gerrit database only contains the data related to the
review; the proposed changes themselves live in regular git
repositories.

If the review process decides to push the change into the official
sources, Gerrit will push the commits from its own git repositories to
the main ones.
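
Just to illustrate, the flow I have in mind would look roughly like
this (the id-request command and the URLs here are invented, of
course):

    # ask Gerrit for a change id to use as the branch name
    CHANGE=$(gerrit-request-change-id)   # e.g. returns "change-1234"

    # push the topic branch into Gerrit's own git repository
    git push http://gerrit.example.org/project.git my-topic:refs/heads/$CHANGE

    # after a "let's merge" decision, Gerrit itself pushes on:
    git push /srv/public/project.git $CHANGE:refs/heads/master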

One advantage of this method is that if a change is updated as a
result of the review process, the developer can simply push the same
topic branch again. Also, these Gerrit git repositories would contain
all proposed source code changes in one place. With an appropriate
visualization tool they could become a valuable asset.

Also, I think that this method would decrease the complexity of Gerrit
and repo, especially when we are thinking about supporting multiple
development branches in Gerrit. Since the Gerrit review source DB is
just a regular git repository, almost any workflow imaginable is easy
to implement: staging branches, branches that collect proposed
commits for continuous integration, and so on.

At least, so I think. However, since I am not a git expert, please
feel free to point out the mistakes / wrong assumptions I made.

What do you think about this idea?

Best Regards,
Gergely

Shawn Pearce

Nov 14, 2008, 1:59:53 PM
to repo-d...@googlegroups.com
So Gerrit almost works as you describe already.

The current flow is actually more like this:

- "repo upload" creates a Git bundle file of the commits to upload.
- The bundle is sent over HTTP POST to Gerrit.  Large bundles are fragmented into 1 MB chunks.
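
  Under the hood that first step amounts to something like this (the
  exact revision arguments vary; this is just the shape of it):

    git bundle create /tmp/upload.bundle korg/master..HEAD

  i.e. only the commits not already on the upstream branch end up in
  the bundle.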

- Gerrit stores the bundle into the GAE data store (the ReceivedBundle entity).

- Every 10 seconds the Gerrit mgrapp downloads new ReceivedBundle entities.
- The bundle is unpacked into a "private-to-Gerrit" Git repository.
- A Change entity is created in the GAE data store.
- The file diffs are uploaded to GAE as Patch entities within that Change.
- A "refs/changes/BB/AABB/P" ref is created pointing at the commit.

  Syntax: AABB is the change number; BB is its last two digits.
  P is the Patch Set number, typically 1, but with "repo upload --replace" it can be 2, 3, ...

  Run a "git ls-remote korg" sometime to see what I'm talking about, especially in gerrit itself where there is a lot of activity. :-)

- A sweeper process in mgrapp cleans out ReceivedBundle entities which have been successfully unpacked into Change entities.

- The "private-to-Gerrit" Git repository is rsync'd onto the public mirror servers of the git.kernel.org mirror farm.  I think there are 4 such servers right now in different parts of the world.  The rsync is a continuous sync.  I think the latency right now is about 15 minutes from when Gerrit creates the ref to when it is available world-wide through the mirror servers.  So "private-to-Gerrit" isn't really private; everything in the Gerrit repository is public within 15 minutes.  I want to reduce that latency as much as we can, but Gerrit doesn't hide anything, its all exposed.

- "repo download" is actually doing a "git fetch korg refs/changes/BB/AABB/P && git checkout FETCH_HEAD".

- Submitting a change through the web interface causes the mgrapp to update the "private-to-Gerrit" repository's branch.  That update is rsync'd over the mirror farm a short while later.


So we are already creating the topic branches you were suggesting; they are just in a namespace that "git clone" and "git remote add" wouldn't normally access.  However, you can force Git to access it by giving the name directly, or by editing .git/config and adding additional fetch lines to get other parts of the namespace.
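
For example, adding an extra fetch line like this to the remote's
section of .git/config (change number made up) makes a plain
"git fetch" pick up every patch set of that one change:

  [remote "korg"]
          url = ...
          fetch = +refs/heads/*:refs/remotes/korg/*
          fetch = +refs/changes/34/1234/*:refs/remotes/korg/changes/1234/*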

We chose to put change refs into a namespace outside of "refs/heads/" so they wouldn't clutter up the average client.  Most people don't need every single change.  But they may want one or two, in which case they can easily download them with standard Git tools.

You can also download an entire "topic branch" at once by getting the top-most change in the series.  Since the ancestry is preserved by the bundle upload and the ref creation, downloading the last change in a series will automatically fetch all prior changes.

The "/P" suffix is about making all iterations of any given change available, so you can download them and diff them locally, or run builds, etc.  So even if a patch set has been replaced with a newer one, the predecessors are still available.  Again, people won't want the 8 different drafts of most changes, they just want the final change.  But if you are looking at that change, why shouldn't you be able to get all 8 drafts at once if you want?  (hint: `git fetch korg refs/changes/BB/AABB/*:refs/heads/change-AABB/*` would get you all 8 drafts of that one patch if there were 8 drafts).


The decision to use "HTTP POST" rather than say "git://" or "git+ssh://" directly to upload new changes is related to a few items:

First, the connection everyone has is to read-only mirror servers in the git.kernel.org farm.  We cannot accept writes through those servers, as the kernel.org maintainers refuse write access to *everyone*.  So that was a non-starter.  Also, the mgrapp part of Gerrit is running inside a firewall, so we couldn't connect to it directly.  Opening the firewall was more complex than leap-frogging through the Gerrit web interface and the GAE data store.

Second, more data beyond just the Git bundle is included in the upload.  Look at proto/upload_bundle.proto in gerrit.git.  We are also embedding a mapping of new commit SHA-1 to prior change number (the new ReplacePatchSet message).  This data is used by Gerrit to connect new commits to prior changes, adding new Patch Sets to them.  There isn't technically room for this sort of extension data in a git:// or git+ssh:// network connection.  We probably could hack something in by pushing this data in a blob to some special ref which the server side doesn't actually update, but that's as messy as (if not messier than) using a bundle over HTTP POST.
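
Conceptually that extra message is just a mapping entry, something
like this sketch (the field names here are illustrative; see the
actual .proto file for the real definition):

  message ReplacePatchSet {
    required string object_id = 1;  // SHA-1 of the new commit
    required int32 change_id = 2;   // existing change it replaces
  }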

Third, many existing contributors wanted to make sure a contributor's license agreement was in place before they offered code to the open source project.  This protects the project leads from later lawsuits about using code they weren't given permission to use.  So the approach Gerrit takes is to refuse to accept (and in turn publish, on the web and via git) anything unless the contributor has submitted a valid CLA stating it is permissible to publish those changes.  This choke point is in the web code that accepts the HTTP POST.  Again, under a git:// or git+ssh:// protocol we would need to implement a similar check and refuse the push if the CLA is not yet in place for the contributor.

Gergely Kis

Nov 14, 2008, 3:09:26 PM
to repo-d...@googlegroups.com
Hi,

Thank you for the detailed description.
Actually, I was not thinking of git:// or git+ssh://, but of the
HTTP-based git interface, which could have been authenticated just as
easily.

But given the other administrative issues with kernel.org repos and
everything, I think you basically got everything you could out of the
situation.

Regarding the hacking of the git protocol: I actually thought that
repo would still talk to Gerrit directly as a kind of control
channel, and only push the actual source content to the git
repository using the HTTP access protocol. However, this transport
part is not really important to the idea, and you already did what I
was thinking of... :)

Thanks again,
Gergely
