Hi repo-discuss, & Luca specifically,

Luca, during today's GerritMeets, I asked about Gerrit's future LFS support, and you countered with other approaches to managing the impact of large files in git, mentioning git-sparse-checkout [1] & git-refs-filter [2] as possible tools.

Could you elaborate further on the topic, perhaps giving an example sequence from a developer's perspective, for limiting the impact of one commit's large files on the workspaces of the rest of the team?

I'm also interested in the community's experience with the above tools.
In particular, I'm scared by this all-caps message in the "git-sparse-checkout" manpage:

"THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN THE FUTURE."

To explain where I'm coming from: my team's main code repo is 6.6 GB (.git only), going back about 9 years [3], and each year we add about 5 Data submodules of 2 GB to 28 GB each, for ~50 GB/year, currently totaling 220 GB. These Data submodules are each relevant for about 18 months, after which they are all but archived.

I am very reluctant to let all that "short-lived" data be stored in our main repo, forever paying the clone and storage cost for each developer. That said, submodule usage is full of sharp edges that my developers keep cutting themselves on [4], so I'm actively looking for alternatives. In particular, I value simple setups that are close to those widely used in the outside world, and I'm wary of custom scripts or configs that each developer must apply.

[3] Only 9 years because the team previously had the habit of cutting off and restarting the repo every year, specifically to limit the size impact of history!

[4] Just last week I had to fix a broken submodule reference: a junior developer updated the commit message of a submodule commit in Gerrit, but didn't know to update the main repo commit to refer to the updated submodule commit. "Oh, I didn't know changing the commit message changed the SHA!"
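For reference, here is the kind of developer flow I understand these tools would enable. This is only a sketch, assuming we folded the data into the main repo under hypothetical Data/<year>/ directories (the repo URL and paths below are made up):

```shell
# Hypothetical layout: yearly data lives under Data/<year>/ in the
# main repo instead of in submodules.
# Partial clone defers blob download until checkout actually needs them:
git clone --filter=blob:none https://gerrit.example.org/main-repo
cd main-repo

# Cone-mode sparse checkout: materialize only the directories we need.
git sparse-checkout init --cone
git sparse-checkout set src tools Data/2025

# An older data set can still be pulled in on demand later:
git sparse-checkout add Data/2024
```

If that's roughly the intended workflow, I'd still like to hear how it behaves in practice given the manpage warning above.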
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en
---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To view this discussion visit https://groups.google.com/d/msgid/repo-discuss/9b5393b3-ae51-4970-b302-590bf9011e31n%40googlegroups.com.
On Thu, Feb 20, 2025 at 12:59 AM John de Largentaye <jlarg...@gmail.com> wrote:

> Luca, during today's GerritMeets, I asked about Gerrit LFS future support, and you countered with other approaches to managing the impact of large files in git, mentioning git-sparse-checkout [1] & git-refs-filter [2] as possible tools.
>
> Could you elaborate further on the topic, perhaps giving an example sequence from a developer's perspective, for limiting the impact of one commit's large files on the workspaces of the rest of the team?
>
> I'm also interested in the community's experience with the above tools.

We developed both LFS support in JGit and the first version of the Gerrit lfs plugin, using the filesystem or AWS S3 for storing the large files on the server side. After a trial phase piloting LFS usage at SAP, we decided not to go productive with LFS, since we think it adds quite some complexity and risk to run it large-scale in production.

In addition, replacing the actual blobs with placeholder files in git makes the decision of which objects are stored where non-transparent, which is pretty intrusive, since reconsidering this decision requires rewriting history. And LFS doesn't have built-in transport support for the large files, which means it cripples a distributed version control system into a centralized one. And as you can see in the GitHub price list, storing many large binary files also doesn't come for free.

We taught our users not to store large binary files in git/gerrit, and implemented the uploadvalidator plugin to help them block large binary files.
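For completeness, enforcing a hard size cap at push time doesn't even need a plugin; Gerrit core supports it in the project's project.config (on refs/meta/config). The 10m limit below is just an example value:

```ini
# project.config on refs/meta/config
# Rejects pushes containing any single object larger than 10 MiB.
[receive]
	maxObjectSizeLimit = 10m
```

The uploadvalidator plugin complements this with content-based rules, e.g. blocking files by extension or detected binary type.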
> To explain where I'm coming from, my team's main code repo is 6.6 GB (.git only), going back about 9 years [3], and each year we add about 5 Data submodules between 2 GB - 28 GB, for ~50 GB / year, currently totaling 220 GB. These Data submodules are each relevant for about 18 months, after which they are all but archived.
>
> That said, submodule usage is full of sharp edges that my developers keep cutting themselves on [4], so I'm actively looking for alternatives.
>
> [4] just last week I had to fix a broken submodule reference, when a junior developer updated in Gerrit the commit message of a submodule commit, but didn't know to update the main repo commit to refer to the updated submodule commit! "oh I didn't know changing the commit message changed the SHA!"

In Gerrit you can configure a submodule subscription to update the superproject automatically when the corresponding submodule branch is updated. Maybe this helps. See https://gerrit-review.googlesource.com/Documentation/user-submodules.html
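Roughly, based on that documentation, the setup has two halves; the project names, URL, and branch below are hypothetical, and the exact refspec syntax for `matching` is described in the linked page:

```ini
# .gitmodules in the superproject, on the branch that should follow
# the submodule; "branch = ." means "track the same-named branch".
[submodule "data-2025"]
	path = data-2025
	url = https://gerrit.example.org/data-2025
	branch = .
```

```ini
# project.config on refs/meta/config of the *submodule* project,
# allowing the superproject to subscribe to its branch updates.
[allowSuperproject "platform/main"]
	matching = refs/heads/*:refs/heads/*
```

With this in place, Gerrit updates the superproject's gitlink commit itself whenever the submodule branch advances, so no developer has to remember to bump it by hand.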