How to decrease the size of git metadata / Gerrit repo size


nitish kumar

Sep 21, 2019, 7:57:31 AM
to Repo and Gerrit Discussion
Hi,


I am working on the use case below.

I have a repo of 1.5 GB, of which .git alone is 1.4 GB. We know that it contains a lot of history;
our project is 8 years old. I tried git gc, but the size stayed the same.

How can we solve this?

1) One way is to push a single commit as the new initial commit and recreate the branches.

    But in that case we have to delete all tags and branches that point to the deleted history.


2) Say we want to keep only the last year of history: is it possible to make a certain commit my initial commit?


Please let me know if anyone has done something similar, or share your ideas.


Thanks,
Nitish

Gert van Dijk

Sep 21, 2019, 10:10:13 AM
to nitish kumar, Repo and Gerrit Discussion
Good question; I have some ideas for you, but I have to tell you that
this won't be a straightforward or one-size-fits-all solution.

First of all, gc doesn't remove git objects that are still reachable
from a reference. If only stale experimental/feature branches contain
the large objects, deleting those branches and then running gc helps.
But if most of the large objects are on the 'main' branches/tags as
well, you'll have to take other measures.

The simplest option, client only:
As a client cloning the project, use *shallow* clones/fetches [1]. The
server has to operate still on a full repository, so it doesn't help
to reduce the size of the repository there, but clients cloning it can
get more of a truncated version of the history. However,
unfortunately, while shallow clones using --depth are supported from
Gerrit, --shallow-since=<date>/--shallow-exclude=<rev> isn't yet. See
Issue 11564 [2] which I just filed as feature request.
Also, shallow clones only seem to make sense for
single-branch clones, which is the default when using --depth. I.e.,
specifying --no-single-branch with --depth=100 still shows me the full
history on other (non-HEAD) branches.
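
A minimal sketch of the shallow, single-branch clone described above. The server URL, project name, branch, and depth in the usage comment are placeholders, not values from this thread, and shallow_clone is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: shallow, single-branch clone.
# All arguments are supplied by the caller; nothing here is specific
# to the repository discussed in this thread.
shallow_clone() {
    url=$1; branch=$2; depth=$3; dest=$4
    # --depth already implies --single-branch; it is spelled out for clarity.
    git clone --quiet --depth "$depth" --single-branch \
        --branch "$branch" "$url" "$dest"
}

# Example (hypothetical server and project):
#   shallow_clone ssh://gerrit.example.com:29418/UI master 100 UI
# If the full history turns out to be needed later:
#   git -C UI fetch --unshallow
```

Note that for a repository on the local filesystem the URL has to use the file:// scheme, since git ignores --depth for plain local paths.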

Fixing this on the Gerrit server is much harder. Your server
probably has references to many of those commits if you've used Gerrit
changes over all those years. If I am correct, removing the history
will result in an inconsistency in the data (either ReviewDb or
NoteDb) and cause more problems than you want to deal with.
A nice solution for that doesn't really exist, but I have a few ideas:

Set up replication, and make it only replicate the currently active
branches and pattern of tags. Before doing so, make sure to have
created a manual shallow clone. Then have users clone from the replica
instead of directly from Gerrit. Clones should then only include
active branches, and if needed they can add the original Gerrit as
remote with their git client to fetch full history.
This is basically a more user-friendly variant of the first option, but
the major downsides are the complicated replication setup and the
Gerrit server still having the full history.

A third option could work if you don't have that many commits and
you simply don't care about most of the large files anymore (e.g.
images). Try setting up a local clone and filter the history with a
git filter-branch [3] script (deleting bloat data on each commit)
until you're satisfied. As this rewrites history and thus affects
commit hashes, you'll lose the references to the original revisions
and signatures will not validate any longer. Moreover, you'll have to
push to a new project on Gerrit and lose all the review metadata as
well. You may want to have the filter-branch run rewrite each commit
message to include the original commit hash, and then push
to refs/for/<branch> with the 'submit' push option [4] (perhaps also
with 'skip-validation') to let it create 'merged' changes and
include all data in the index, making it 'searchable' at least (but
you won't have the original review data any longer).
An alternative to git filter-branch is CopyBara [5]. It allows you to
set up transformations (in your case to reduce the size, ignore some
paths via origin_files/destination_files [6]) and it supports adding
the original commit revision in the footer message ('GitOrigin-RevId'
footer via set_rev_id workflow setting). Then when finding a reference
to a git commit sha1 hash in the history, you should be able to find
the new change using Gerrit search. It's not ideal, but it could work.
The hardest part here is handling the plethora of branches/tags...
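
A rough sketch of the filter-branch approach, to be run only in a throwaway clone. The helper name strip_path_from_history, the example path, and the "Original-Commit" footer name are made up for illustration; GIT_COMMIT is the variable that git filter-branch sets to the id of the commit being rewritten.

```shell
#!/bin/sh
# Sketch: remove one path from every commit on all refs and append the
# pre-rewrite hash to each commit message. This rewrites history, so all
# commit hashes change. Run it from inside the clone.
strip_path_from_history() {
    path_to_drop=$1   # must not contain single quotes
    # The squelch variable only silences the warning newer git versions print.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
        --index-filter "git rm -r -q --cached --ignore-unmatch -- '$path_to_drop'" \
        --msg-filter 'cat; echo; echo "Original-Commit: $GIT_COMMIT"' \
        --tag-name-filter cat \
        -- --all
}

# Example (hypothetical path):
#   strip_path_from_history assets/images
```

The old objects remain in the clone (under refs/original and the reflog) until they expire, so the size only drops after a fresh clone of the rewritten result or an explicit cleanup and repack.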

For the future, consider limiting the size of the commits to avoid
accepting bloat in your repositories ("Maximum Git object size limit"
project setting).
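
As a sketch, the same limit can be set by editing project.config on the project's refs/meta/config branch, where it is stored as receive.maxObjectSizeLimit. The 10m value and the helper name set_max_object_size are only examples, and pushing to refs/meta/config requires the corresponding permission.

```shell
#!/bin/sh
# Sketch: write the per-project object size limit into a project.config file.
set_max_object_size() {
    config_file=$1; limit=$2
    git config -f "$config_file" receive.maxObjectSizeLimit "$limit"
}

# Typical flow inside a clone of the project (example values):
#   git fetch origin refs/meta/config && git checkout FETCH_HEAD
#   set_max_object_size project.config 10m
#   git commit -am "Limit object size to 10 MiB"
#   git push origin HEAD:refs/meta/config
```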

HTH

Gert

[1]: https://git-scm.com/docs/git-clone/2.23.0#Documentation/git-clone.txt---depthltdepthgt
[2]: https://bugs.chromium.org/p/gerrit/issues/detail?id=11564
[3]: https://git-scm.com/docs/git-filter-branch/2.23.0
[4]: https://gerrit-documentation.storage.googleapis.com/Documentation/3.0.2/user-upload.html#auto_merge
[5]: https://github.com/google/copybara
[6]: https://github.com/google/copybara/blob/15f519039076ff49562ee96420c970356fea48b6/docs/reference.md#coreworkflow

Matthias Sohn

Sep 21, 2019, 4:09:02 PM
to Gert van Dijk, nitish kumar, Repo and Gerrit Discussion
Another option is to seed a new repository with the latest version of master as a new orphaned commit.
Then make the old repository read-only. On the client side you can use git replace [1] and alternates [2] to stitch together
a combined history over the old and the new repository. This technique was used e.g. by the Linux kernel [3], which is split into
multiple repositories. After a while the need to look at the older part of the history in the old humongous repository will fade.
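
One way the client-side stitching could look, as a sketch. It fetches the old repository as an extra remote instead of using alternates, and it assumes the new repository has exactly one root commit; the helper name, the remote name old-history, and the example URL are made up for illustration.

```shell
#!/bin/sh
# Sketch: make the old history appear underneath the new repository's
# root commit. Run from inside a clone of the NEW repository.
stitch_old_history() {
    old_repo=$1   # URL or path of the read-only old repository
    old_tip=$2    # commit (or ref) in the old repo the new root was seeded from
    git remote add old-history "$old_repo" &&
    git fetch --quiet old-history &&
    new_root=$(git rev-list --max-parents=0 HEAD) &&
    # Creates a replacement for the root commit that has old_tip as parent.
    git replace --graft "$new_root" "$old_tip"
}

# Example (hypothetical URL):
#   stitch_old_history ssh://gerrit.example.com:29418/UI-old old-history/master
#   git log    # now walks through the seed commit into the old history
```

The replacement lives in refs/replace/ in the local clone only, so clients that never run this keep the small history.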

Set the max object size limit [4] to limit the size of files which can be pushed and use the upload-validator plugin [5]
to keep out unwanted files like e.g. large binary files.


-Matthias 

nitish kumar

Sep 22, 2019, 10:22:12 AM
to Repo and Gerrit Discussion
Hi Gert and Matthias,

Thanks for the response.

I tried the method below [1] in practice with one branch.
I have to do this for all branches and also need to check how Gerrit behaves if I create "a project with less history".
I will post an update later.


Martin Fick

Sep 23, 2019, 2:04:36 PM
to repo-d...@googlegroups.com, nitish kumar
On Saturday, September 21, 2019 4:57:31 AM MDT nitish kumar wrote:

> I have a repo of size 1.5 GB and .git is alone 1.4 GB. we know that it
> contains lot of history.
> our project is 8 years old. I tried Git gc but still same size.

There are many reasons that git gc may not reduce the size of your repo, even
if it could, so I would not give up so easily trying to shrink it. git gc has
heuristics that may prevent it from repacking at all; therefore I would use the
git repack command directly, and verify that it creates a single new pack file
when it is done, and that there are no other pack files or loose objects left
when it is done. There are also many repacking settings that are worth
trying, and at a bare minimum it is probably worth ensuring that you do NOT
reuse old deltas, when trying to shrink your repo, since old deltas could
potentially be very inefficiently packed.
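
A sketch of such a direct repack. The --window and --depth values are example tuning numbers, not recommendations from this thread, and full_repack is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: repack everything into a single pack without reusing old deltas,
# then print the object statistics for verification.
full_repack() {
    repo=$1
    # -a: put all packed and loose reachable objects into one new pack
    # -d: delete packs made redundant (also prunes loose objects now packed)
    # -f: recompute deltas instead of reusing existing ones
    git -C "$repo" repack -a -d -f -q --window=250 --depth=50 &&
    git -C "$repo" count-objects -v
}

# Example (hypothetical path to a bare repository on the server):
#   full_repack /srv/gerrit/git/UI.git
```

Afterwards "packs: 1" and "count: 0" in the output would indicate a single pack and no loose objects left.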

Can you do some forensic work and figure out where the space is being used? Is
it pack files? Maybe use git-sizer or another git repo analysis tool to get
information about your repo?
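
As a starting point for that kind of forensic work, a sketch using only plain git plumbing that lists the largest blobs reachable from any ref. The helper name largest_blobs and the default of 20 entries are arbitrary, and the sizes shown are uncompressed object sizes, not the space used inside the pack.

```shell
#!/bin/sh
# Sketch: list the N largest blobs reachable from any ref, largest first.
# Output columns: type, size in bytes, object id, path.
largest_blobs() {
    repo=$1; n=${2:-20}
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file \
            --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob"' |
        sort -k2,2 -n -r |
        head -n "$n"
}

# Example (hypothetical path):
#   largest_blobs /srv/gerrit/git/UI.git 10
# Disk usage of packs versus everything else in a bare repository:
#   du -sh /srv/gerrit/git/UI.git/objects/pack /srv/gerrit/git/UI.git
```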

-Martin


--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

Matthias Sohn

Sep 23, 2019, 3:03:35 PM
to Martin Fick, Repo and Gerrit Discussion, nitish kumar
You may also try to use this script [1] to identify the largest files in your repository

nitish kumar

Sep 24, 2019, 6:33:40 AM
to Repo and Gerrit Discussion
I tried 'git gc --prune='31-07-2019' --aggressive', which reduced the size by about half.
It also caused no problems with the existing change requests in Gerrit.



$ git count-objects -v
count: 0
size: 0
in-pack: 924517
packs: 1
size-pack: 584949
prune-packable: 0
garbage: 0
size-garbage: 0
[gerrit2@gerrit UI.git]$ du -sh
2.1G .
[gerrit2@gerrit UI.git]$ git count-objects -v
count: 451
size: 1912
in-pack: 1618765
packs: 2
size-pack: 1567472
prune-packable: 0
garbage: 0
[gerrit2@gerrit UI.git]$ date
Tue Sep 24 05:57:11 EDT 2019
[gerrit2@gerrit UI.git]$ git gc --prune='31-07-2019' --aggressive
Counting objects: 1608244, done.
Delta compression using up to 16 threads.
Compressing objects:  89% (1246463/1400520)   
Compressing objects: 100% (1400520/1400520), done.
Writing objects: 100% (1608244/1608244), done.
Total 1608244 (delta 1075261), reused 0 (delta 0)
[gerrit2@gerrit UI.git]$ 
[gerrit2@gerrit UI.git]$ 
[gerrit2@gerrit UI.git]$ 
[gerrit2@gerrit UI.git]$ 
[gerrit2@gerrit UI.git]$ date
Tue Sep 24 06:23:38 EDT 2019
[gerrit2@gerrit UI.git]$ git count-objects -v
count: 512
size: 2180
in-pack: 1608244
packs: 1
size-pack: 679587
prune-packable: 0
garbage: 0
[gerrit2@gerrit UI.git]$ du -sh
1.2G .
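
For reading the numbers above: git count-objects -v reports size-pack in KiB. A small sketch that converts the two size-pack values from this transcript to MiB and shows their ratio (pack_change is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch: compare two size-pack values (KiB) from `git count-objects -v`.
pack_change() {
    before_kib=$1; after_kib=$2
    awk -v b="$before_kib" -v a="$after_kib" 'BEGIN {
        printf "before: %.1f MiB, after: %.1f MiB, after/before: %d%%\n",
               b / 1024, a / 1024, a * 100 / b
    }'
}

pack_change 1567472 679587
# prints: before: 1530.7 MiB, after: 663.7 MiB, after/before: 43%
```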

Martin Fick

Sep 24, 2019, 11:59:18 AM
to repo-d...@googlegroups.com, nitish kumar
On Tuesday, September 24, 2019 3:33:40 AM MDT nitish kumar wrote:
> I tried 'git gc --prune='31-07-2019' --aggressive' also not problem with
> existing change requests in gerrit.
>
> which reduced 1/2 size.
>
> $ git count-objects -v
> count: 0
> size: 0
> in-pack: 924517
> packs: 1
> size-pack: 584949
> prune-packable: 0
> garbage: 0
> size-garbage: 0
> [gerrit2@gerrit UI.git]$ du -sh
> 2.1G .
> [gerrit2@gerrit UI.git]$ git count-objects -v
> count: 451
> size: 1912
> in-pack: 1618765
> packs: 2
> size-pack: 1567472
> prune-packable: 0
> garbage: 0
> [gerrit2@gerrit UI.git]$ date
> Tue Sep 24 05:57:11 EDT 2019
> [gerrit2@gerrit UI.git]$ git gc --prune='31-07-2019' --aggressive
> Counting objects: 1608244, done.
> Delta compression using up to 16 threads.
> Compressing objects: 89% (1246463/1400520)
> Compressing objects: 100% (1400520/1400520), done.
> Writing objects: 100% (1608244/1608244), done.
> Total 1608244 (delta 1075261), reused 0 (delta 0)
> [gerrit2@gerrit UI.git]$
> [gerrit2@gerrit UI.git]$
> [gerrit2@gerrit UI.git]$
> [gerrit2@gerrit UI.git]$
> [gerrit2@gerrit UI.git]$ date
> Tue Sep 24 06:23:38 EDT 2019
> [gerrit2@gerrit UI.git]$ git count-objects -v
> count: 512

It's a bit strange that you still have loose objects after repacking; this can
be improved. Maybe your prune isn't working?

> size: 2180
> in-pack: 1608244

This is larger than before repacking (924517)? Something doesn't make sense.

> packs: 1
> size-pack: 679587
> prune-packable: 0
> garbage: 0
> [gerrit2@gerrit UI.git]$ du -sh
> 1.2G .