Gerrit and git-filter-branch

Mark

May 4, 2010, 12:35:28 PM

to Repo and Gerrit Discussion

Hi all,

We're using gerrit with a git tree that was converted from svn. Our
svn repository had some largish binary objects - not massive but they
take up far more space than the actual code.

Some of our developers are working remotely without access to reliable
broadband and so doing a full clone is quite a slow process. On top of
that it does seem unnecessary as we have those files stored elsewhere
now. I've read up about how to do a git-filter-branch to remove all
versions of an object from history etc. I have a few questions about
how this will work with Gerrit if it can...

The problem, as I see it, is that all the commit ids going back a very
long way will have changed and Gerrit will not be able to reconcile
its database with the Git repository. Assuming that this does work
(and I'd like to know how...), if I make the necessary
git filter-branch changes to my local repository, will it be
sufficient to simply push the rewritten history to Gerrit? Would I
also need to force the expiry of unreachable references and run gc on
the server repository?

So then, assuming that it works and is possible, everyone else's local
repositories will be so far out of sync with the server that they'd
probably have to clone again (which should suddenly be much much
faster). Right?

Assuming that I'm on the right track to this point what would happen
to existing reviews etc in Gerrit? Would Gerrit be able to 'find' them
based on the ChangeIds or something like that? If there was an
existing change in Gerrit awaiting review I'd assume we'd need to
rebase it onto the new master before trying to merge... What about
changes that have already been merged with Gerrit - will Gerrit still
be able to attach the review history to those changes?

Or should I forget trying to maintain all this in the same repository
and just create a new repository in Gerrit and push the rewritten
history into that? (And then, later, remove the existing repository
from Gerrit as we won't be using it any more). I don't really want to
do this but it would be a possible option.

Sorry if all this comes off as confused and uninformed - I've read
what I can find but found almost nothing that was obviously related to
Gerrit....

thanks for any help,

-mark

--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

Shawn Pearce

May 4, 2010, 1:21:48 PM

to Mark, Repo and Gerrit Discussion

Mark <mark.b...@gmail.com> wrote:
> We're using gerrit with a git tree that was converted from svn. Our
> svn repository had some largish binary objects - not massive but they
> take up far more space than the actual code.
>
> Some of our developers are working remotely without access to reliable
> broadband and so doing a full clone is quite a slow process. On top of
> that it does seem unnecessary as we have those files stored elsewhere
> now. I've read up about how to do a git-filter-branch to remove all
> versions of an object from history etc. I have a few questions about
> how this will work with Gerrit if it can...

Yup. I've done this sort of rewrite exactly once... for the
JGit project. We were forced to rewrite our history through
filter-branch, but I wanted to save the Gerrit Code Review records.

> The problem, as I see it, is that all the commit ids going back a very
> long way will have changed and Gerrit will not be able to reconcile
> its database with the Git repository.

That's true.

> Assuming that this does work
> (and I'd like to know how...), if I make the necessary
> git filter-branch changes to my local repository, will it be
> sufficient to simply push the rewritten history to Gerrit?

If you push the rewritten history to Gerrit, it'll update the Git
branches to point to them, but all reviews will still be looking
at the old pre-rewritten versions. So effectively you will have
two copies of every reviewed-and-submitted change:

* one from before filter-branch, attached to the review;
* one from after filter-branch, merged into the branch(es)

> Would I also need to force the
> expiry of unreachable references and run gc on the server repository?

To completely discard these, yes, you would need to delete everything
from the refs/changes/ area on the Gerrit repository, and also delete
those change records from the SQL database. That's not very pretty,
but it can be done.

If you go this route, make sure to check not just $project.git/refs/changes
but also $project.git/packed-refs to purge out the refs/changes/ names. At
that point you also can probably just delete every change record from the
database for the affected projects.
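
A minimal sketch of that on-disk cleanup, assuming Gerrit is stopped and
the repository is a bare one that the caller passes in. The function name
purge_change_refs is made up for illustration; it only touches the loose
refs/changes directory and the packed-refs file mentioned above, not the
SQL database.

```shell
#!/bin/sh
# Sketch only: drop every refs/changes/ name from a bare repository on disk.
# Assumes Gerrit is stopped. The repository path is passed by the caller.
purge_change_refs() {
    repo=$1

    # Loose refs are plain files under refs/changes/.
    rm -rf "$repo/refs/changes"

    # Packed refs are "<sha1> <refname>" lines in packed-refs.
    if [ -f "$repo/packed-refs" ]; then
        sed '/ refs\/changes\//d' "$repo/packed-refs" >"$repo/packed-refs.tmp" &&
            mv "$repo/packed-refs.tmp" "$repo/packed-refs"
    fi
}
```

Called as purge_change_refs /path/to/project.git (a placeholder path).
Once the refs are gone, running git reflog expire --expire=now --all and
git gc --prune=now inside the repository is the usual way to actually
reclaim the space held by the now-unreachable objects.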

> So then, assuming that it works and is possible, everyone else's local
> repositories will be so far out of sync with the server that they'd
> probably have to clone again (which should suddenly be much much
> faster). Right?

Yes.

> Assuming that I'm on the right track to this point what would happen
> to existing reviews etc in Gerrit? Would Gerrit be able to 'find' them
> based on the ChangeIds or something like that? If there was an
> existing change in Gerrit awaiting review I'd assume we'd need to
> rebase it onto the new master before trying to merge... What about
> changes that have already been merged with Gerrit - will Gerrit still
> be able to attach the review history to those changes?

A rewrite is *painful*.

The "naive just run filter-branch and push it up to Gerrit" process
will disconnect the review data from the actual commits that are
in the rewritten branch now. It won't correlate the way you want
it to. But it might be OK that the commit SHA-1s don't match.
Or it might not be.

If you need them to match, you also need to rewrite the Gerrit
SQL data.

I used the following script to execute filter-branch when I rewrote
JGit's history:

http://egit.eclipse.org/w/?p=jgit.git;a=blob;f=tools/rewrite-history.sh;hb=d011a377cbf30738a1a2d9b156cf869346adb537

For your needs, the only part that matters is the commit-filter;
it saves a mapping of old commit SHA-1 to new commit SHA-1 so we
can update Gerrit later:

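  # Absolute path, exported so the filter can find the file
  # regardless of its working directory
  MAP_OF_COMMITS=$(pwd)/commit.map
  export MAP_OF_COMMITS

  # Start with an empty map file
  : >"$MAP_OF_COMMITS"

  # Record "old,new" for every rewritten commit
  git filter-branch --commit-filter '
      n=$(git commit-tree "$@")
      echo "$GIT_COMMIT,$n" >>"$MAP_OF_COMMITS"
      echo "$n"
    ' \
    $(git for-each-ref --format='%(refname)' refs/heads refs/changes)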

I ran this on a bare repository that was a full clone from
the server, so we would have both refs/heads and refs/changes
present. That is:

git clone --bare ssh://...:29418/project.git project.git
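
Since the rewrite only covers the refs that are actually present, it may
be worth confirming that refs/changes arrived in the clone before running
filter-branch. Current Git documents --bare as copying branch heads and
--mirror as copying all refs, so a plain bare clone may come back without
the change refs. The sketch below assumes the default remote name
"origin"; the two function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: make sure the Gerrit change refs are present in a bare clone.
# "origin" is the remote name that git clone sets up by default.

# Print how many refs exist under refs/changes/ in the given repository.
count_change_refs() {
    git --git-dir="$1" for-each-ref refs/changes | wc -l | tr -d ' '
}

# Fetch every change ref from the remote into the same names locally.
fetch_change_refs() {
    git --git-dir="$1" fetch origin '+refs/changes/*:refs/changes/*'
}
```

If count_change_refs project.git prints 0 on a project that has reviews,
fetch_change_refs project.git should pull them in before the rewrite.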

After I was happy with the rewrite, I created a temporary table in
my SQL database to hold the contents of commit.map:

  CREATE TEMPORARY TABLE commit_map (
    old_id VARCHAR(40) NOT NULL,
    new_id VARCHAR(40) NOT NULL);
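
How the file gets into the table depends on the database. One portable
option, sketched here under the assumption that commit.map holds one
"old,new" pair of 40-character lowercase hex SHA-1s per line, is to turn
it into INSERT statements. The function name commit_map_to_sql is made up
for illustration.

```shell
#!/bin/sh
# Sketch only: read "old,new" lines on stdin, print one INSERT per valid line.
# Lines that are not two 40-character hex ids are reported on stderr and skipped.
commit_map_to_sql() {
    awk -F, '
        NF == 2 && length($1) == 40 && $1 ~ /^[0-9a-f]+$/ &&
                   length($2) == 40 && $2 ~ /^[0-9a-f]+$/ {
            # \047 is a single quote, written this way to survive shell quoting
            printf("INSERT INTO commit_map (old_id, new_id) VALUES (\047%s\047, \047%s\047);\n", $1, $2)
            next
        }
        { printf("skipping malformed line %d\n", NR) > "/dev/stderr" }
    '
}
```

For example, commit_map_to_sql <commit.map >commit_map.sql. A temporary
table is only visible to the session that created it, so the generated
statements need to run in the same database session as the CREATE and
the later UPDATEs.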

I loaded commit.map into that table, and then updated the information
tables in Gerrit's database:

  CREATE INDEX commit_map_idx ON commit_map(old_id);

  UPDATE patch_sets SET revision =
    (SELECT new_id FROM commit_map WHERE old_id = revision)
  WHERE change_id IN (SELECT change_id
                      FROM changes
                      WHERE dest_project_name = 'project');

  UPDATE patch_set_ancestors SET ancestor_revision =
    (SELECT new_id FROM commit_map WHERE old_id = ancestor_revision)
  WHERE change_id IN (SELECT change_id
                      FROM changes
                      WHERE dest_project_name = 'project');
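
One thing to watch with UPDATEs of this shape: for any row whose old id
has no entry in commit_map, the subquery returns nothing and the column
is set to NULL. A possible pre-check, sketched below, compares a list of
revisions exported from the database against commit.map before any
UPDATE is run. The file name revisions.txt (one SHA-1 per line) and the
function name unmapped_revisions are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: print every revision that has no old_id entry in commit.map.
#   $1: commit.map, lines of "old,new"
#   $2: revisions exported from the database, one SHA-1 per line
unmapped_revisions() {
    map_file=$1
    revisions_file=$2
    awk -F, '
        FILENAME == ARGV[1] { mapped[$1] = 1; next }   # remember each old id
        !($1 in mapped)     { print $1 }               # revision without a mapping
    ' "$map_file" "$revisions_file"
}
```

If unmapped_revisions commit.map revisions.txt prints anything, those
rows would lose their revision, so it is probably safer to sort that out
before touching patch_sets or patch_set_ancestors.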

And then I updated all of the refs/heads/ and refs/changes/ behind
Gerrit's back by directly replacing the entire Git repository with
the rewritten one.

I did all of the above with the Gerrit server shut down, so nobody
could modify the repository or the SQL database while I was doing it.
This turned out to be quite necessary with MySQL, for example, where
the above queries basically need to lock the entire tables to execute.
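
The repository replacement itself could look something like the sketch
below. All three paths are placeholders supplied by the caller, the
function name swap_repository is made up for illustration, and stopping
and starting Gerrit is left to whatever the installation normally uses.

```shell
#!/bin/sh
# Sketch only: replace the live bare repository with the rewritten one,
# keeping the old repository as a backup. Run only while Gerrit is stopped.
swap_repository() {
    live=$1         # repository Gerrit serves
    rewritten=$2    # result of the filter-branch run
    backup=$3       # where the old repository is kept

    [ -d "$rewritten" ] || { echo "missing $rewritten" >&2; return 1; }
    [ ! -e "$backup" ] || { echo "$backup already exists" >&2; return 1; }

    mv "$live" "$backup" &&
        mv "$rewritten" "$live"
}
```

Keeping the old repository around makes it possible to roll back the
Git side if the SQL updates turn out to be wrong.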

> Or should I forget trying to maintain all this in the same repository
> and just create a new repository in Gerrit and push the rewritten
> history into that? (And then, later, remove the existing repository
> from Gerrit as we won't be using it any more). I don't really want to
> do this but it would be a possible option.

It's not pretty to try and preserve the existing data. It's not
exactly a case we designed Gerrit to support. I've only done it
once myself, and it was painful, but I managed to make it work.

Mark

May 4, 2010, 2:45:19 PM

to Repo and Gerrit Discussion

Shawn,

Thanks for the incredibly detailed and useful response.

Considering the pain you predict, which I definitely want to avoid,
I'm going to see if we can find an alternative. In our case that would
be to keep the existing repository as it is but create a new
repository to take its place. Say we have repositories named:

idod/core
idod/native
...

We could leave them in place and create new ones with different names
to hold the repositories with rewritten history.

I guess something that would be better (seeing as we're happy with the
existing naming scheme) would be to rename the existing repositories -
something like idod/core -> archive/idod/core etc. Then we could
create new projects with the old names and re-import the history --
obviously making sure that all our developers are aware that their
local clones are about to be invalidated!

I found this snippet:

  UPDATE projects SET name = 'git/' || SUBSTR(name, LENGTH('projectsC/'))
   WHERE name <> (SELECT wild_project_name FROM system_config);

  UPDATE ref_rights SET project_name = 'git/' ||
         SUBSTR(project_name, LENGTH('projectsC/'))
   WHERE project_name <> (SELECT wild_project_name FROM system_config);

  UPDATE changes SET dest_project_name = 'git/' ||
         SUBSTR(dest_project_name, LENGTH('projectsC/'));

  UPDATE account_project_watches SET project_name = 'git/' ||
         SUBSTR(project_name, LENGTH('projectsC/'));

(from here: http://groups.google.com/group/repo-discuss/msg/e606661847e83f16)

Which looks reasonably adaptable to rename idod to archive/idod. I
guess the way to do this would be to shut down Gerrit, move the
existing git repository, update the database and restart Gerrit.

Am I heading for a different world of pain or is this going to be
relatively safe?

thanks again,

-mark

On May 4, 10:21 am, Shawn Pearce <s...@google.com> wrote:

> http://egit.eclipse.org/w/?p=jgit.git;a=blob;f=tools/rewrite-history....

Shawn Pearce

May 4, 2010, 3:12:55 PM

to Mark, Repo and Gerrit Discussion

Mark <mark.b...@gmail.com> wrote:
> I guess something that would be better (seeing as we're happy with the
> existing naming scheme) would be to rename the existing repositories -
> something like idod/core -> archive/idod/core etc. Then we could
> create new projects with the old names and re-import the history --
> obviously making sure that all our developers are aware that their
> local clones are about to be invalidated!
>
> I found this snippet:
>
> UPDATE projects SET name = 'git/' || SUBSTR(name,
> LENGTH('projectsC/'))
> WHERE name <> (SELECT wild_project_name FROM system_config);

Well, you might want to further qualify this UPDATE statement with
a condition to select which projects you are renaming. Unless you
want to rename *all* of them. :-)

> UPDATE ref_rights SET project_name = 'git/' || SUBSTR(project_name,
> LENGTH('projectsC/'))
> WHERE project_name <> (SELECT wild_project_name FROM
> system_config);
>
> UPDATE changes SET dest_project_name = 'git/' ||
> SUBSTR(dest_project_name, LENGTH('projectsC/'));
>
> UPDATE account_project_watches SET project_name = 'git/' ||
> SUBSTR(project_name, LENGTH('projectsC/'));

Same with these. :-)
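
As a rough illustration of what "further qualified" could look like for
the idod -> archive/idod rename, the sketch below prints one UPDATE per
table with a LIKE condition, so only projects under the given prefix are
touched. The table and column names are the ones from the quoted snippet;
the function name archive_projects_sql is made up for illustration, and
the output is meant to be reviewed before it goes anywhere near the
database.

```shell
#!/bin/sh
# Sketch only: print UPDATE statements that move matching projects under a prefix.
# Table and column names are taken from the snippet quoted above.
archive_projects_sql() {
    match=$1     # projects to rename, e.g. idod/
    prefix=$2    # what to prepend, e.g. archive/
    while read -r table column; do
        printf "UPDATE %s SET %s = '%s' || %s WHERE %s LIKE '%s%%';\n" \
            "$table" "$column" "$prefix" "$column" "$column" "$match"
    done <<EOF
projects name
ref_rights project_name
changes dest_project_name
account_project_watches project_name
EOF
}
```

For example, archive_projects_sql idod/ archive/ prints four statements,
the first being the projects update restricted to names starting with
idod/. Like the original snippet this relies on || for string
concatenation, which MySQL treats as logical OR by default, so it would
need CONCAT() there.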

> Am I heading for a different world of pain or is this going to be
> relatively safe?

This is a bit less painful than doing the full history rewrite,
because it is just that handful of statements above.

And your plan to shut down Gerrit, do those updates, rename the
repository on disk, and start it back up is the right one.

Pretty simple.
