
[gentoo-user] Portage, git and shallow cloning


Davyd McColl

Jul 6, 2018, 1:40:03 AM
I'm not sure if there's a better place to put this, so please feel free to tell me I should report it elsewhere (:

After the recent GitHub fun, I changed from using the GitHub git source to git://anongit.gentoo.org/repo/sync/gentoo.git, as suggested by some on this mailing list. I completely nuked /usr/portage/* and set off with an `emerge --sync`, which looked like it was going to take ages and clone about a gig. Reading https://wiki.gentoo.org/wiki/Project:Portage/Sync, I figured there was nothing much I could do about it, since the page speaks of the `sync-depth` config option and states that 1 is "shallow clone, only current state (default if option is absent)" (emphasis mine).
After multiple failures to clone (the other side hangs up after a few minutes -- I only have a 4 Mbps line, maxing out at around 450 KB/s), I thought I'd try explicitly setting `sync-depth` in my repo config and found:

1) `sync-depth` has been deprecated (should now use `clone-depth`)
2) with the option missing, portage was fetching the entire history -- after adding the option (and nuking /usr/portage/* again), a new clone happened in short order, bringing down only around 65 MB (according to git)

So I'd like to ask how to assist in rectifying the above:
1) the docs need to be updated to refer to `clone-depth`
2) I believe that the original intent of defaulting to a shallow clone was a good idea -- perhaps that can be investigated. If the intent has changed for some reason, the docs should be updated to reflect the change.
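For reference, the repos.conf stanza that ended up working for me looks roughly like this (a sketch from memory -- location and URI match my setup above; double-check the option names against the portage documentation):

```ini
[gentoo]
# assuming the usual /etc/portage/repos.conf layout
location = /usr/portage
sync-type = git
sync-uri = git://anongit.gentoo.org/repo/sync/gentoo.git
# shallow initial clone; replaces the deprecated sync-depth
clone-depth = 1
```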

Thanks
-d

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
If you say that getting the money is the most important thing
You will spend your life completely wasting your time
You will be doing things you don't like doing
In order to go on living
That is, to go on doing things you don't like doing

Which is stupid.

- Alan Watts
https://www.youtube.com/watch?v=-gXTZM_uPMY

Quidquid latine dictum sit, altum sonatur. (Whatever is said in Latin sounds profound.)

Martin Vaeth

Jul 6, 2018, 3:40:02 AM
Davyd McColl <dav...@gmail.com> wrote:
>
> 1) `sync-depth` has been deprecated (should now use `clone-depth`)

The reason is that sync-depth was meant to be effective on
every sync, i.e. with sync-depth=1 the clone was supposed to stay shallow.
However, it turned out that this caused occasional errors
with git syncing whenever earlier commits were needed.
So it was decided to drop that behaviour: the value is now only used
for the initial clone and ignored from then on. Due to this change of
effect, the option was renamed.

> 2) with the option missing, portage was fetching the entire history

Yes, but even with this option, your history will fill up over time.
Only the initial cloning will go faster and need less space.

> 2) I believe that the original intent of defaulting to a shallow clone was
> a good idea

Due to the point mentioned above, this is not very useful anymore.
Moreover, now that full checksumming is supported for rsync, the only
advantage of using git is that you get the history (in particular
ChangeLogs).

Mick

Jul 6, 2018, 4:10:02 AM
The lack of disk space on some of my systems, metered and slow bandwidth,
and no need to know every individual commit and the reason for it, had me
sticking with rsync after a short stint of using git.

I don't think anyone recommended git unless good reasons in one's use case
make it the optimal choice.
--
Regards,
Mick

Davyd McColl

Jul 6, 2018, 4:40:03 AM
Part of the original intent of the mail was just to bring to light the
disparity between the documentation and experience (wrt the default
value) -- I had no configured value and portage was trying to clone the
entire history of the repo instead of a shallow start. Since I really
appreciate the Gentoo documentation and have relied on it for
installation and any system maintenance, I just wanted to bring this to
light.

I understand that git history will build over time -- I'm less concerned
with (eventual) disk usage than I am with the speed of `emerge --sync`,
which (and perhaps I'm sorely mistaken) appeared to be faster using git
than rsync -- hence my choice of git over rsync (the discussion at
https://forums.gentoo.org/viewtopic-t-1009562.html shows me to not be
alone in this experience).

Having the changelogs available also comes off as a positive for me --
I'm just plain curious.

-d

Rich Freeman

Jul 6, 2018, 7:50:03 AM
On Fri, Jul 6, 2018 at 4:34 AM Davyd McColl <dav...@gmail.com> wrote:
>
> I understand that git history will build over time -- I'm less concerned
> with (eventual) disk usage than I am with the speed of `emerge --sync`,
> which (and perhaps I'm sorely mistaken) appeared to be faster using git
> than rsync -- hence my choice of git over rsync (the discussion at
> https://forums.gentoo.org/viewtopic-t-1009562.html shows me to not be
> alone in this experience).
>

From what I've generally seen/heard git is much more efficient as long
as you sync frequently.

rsync has the advantage that it only transfers the minimum necessary
to get you from the tree you have now to the tree that is current. To
do this it has to stat every file (using default settings - you can
make it even slower if you want to), which is a lot of file I/O.

git has the advantage that it can just read the current HEAD and from
that know exactly what commits are missing, so there is way less
effort spent figuring out what changed. It has the disadvantage that
it sends everything that happened since your last sync, which could
include files that were created and subsequently removed. If you sync
often there won't be much of that, but if you're syncing monthly or
even less frequently then you probably will spend a lot of time
transmitting churn.

It is possible to trim down a repository, and as long as nobody is
doing force pushes on the main repo you should still be able to sync.
However, that is not something that just involves a git one-liner.
Personally I don't mind the space tradeoff, especially in exchange for
the IO tradeoff. A sync is always a VERY fast operation.
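For anyone curious, the trimming goes roughly like this (a sketch, not gospel -- whether the old objects are actually dropped varies with git version, and you'd want a backup of the checkout first):

```shell
# Re-shallow an existing checkout and prune the now-unreachable history.
# Illustrative only; paths assume the usual /usr/portage checkout.
cd /usr/portage
git fetch --depth=1              # mark everything beyond the tip as shallow
git reflog expire --expire-unreachable=now --all
git gc --aggressive --prune=now  # actually drop the old objects
```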

I'll also note that the stable branch (which is always free of obvious
issues caused by devs not running repoman) is only available via git.
There is no reason that couldn't be replicated via rsync, but right
now we only have one set of mirrors.

I'm still syncing from github after enabling signature checking.
There is a patch that will make that more secure but in the meantime
my scripts keep an eye on exit status when I sync. IMO signature
checking is more important than where you sync from - as long as gpg
says I'm good it really doesn't matter who has the ability to play
with the data enroute. But, it certainly doesn't hurt to sync from
infra (I do have concerns for whether infra could handle everybody
doing it though - github is MS's problem to worry about).
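(In case anybody wants to turn that on: as I understand it, it's a repos.conf knob for the git sync module, something along these lines -- verify the option name against `man portage` before relying on it:)

```ini
[gentoo]
sync-type = git
sync-uri = https://github.com/gentoo/gentoo.git
# ask the git sync module to gpg-verify the top commit after syncing
sync-git-verify-commit-signature = true
```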

--
Rich

Davyd McColl

Jul 6, 2018, 8:00:03 AM
@Rich: if I understand the process correctly, the same commits are
pushed to infra and GitHub by the CI bot?

I ask because prior to the GitHub incident, I didn't have signature
verification enabled (I hadn't read about it and it didn't even occur to
me). So my plan was to (whilst GitHub was being sorted out) switch to
the gentoo git repo and enable verification and, once I'd seen that that
was working (because I'd also seen intermediate emails on this list from
people having issues getting signing keys working), perhaps switch back
to GitHub to put less strain on the Gentoo servers.

So if the same commits are just pushed to two remotes (gentoo and
GitHub), then I should (in theory) be able to change my repo.conf
settings, fiddle the remote in /usr/portage, and switch seamlessly from
gentoo to GitHub? Alternatively, I could start with a clean /usr/portage
again, once I'm happy that I have signature verification working on my
machine.

I do sync frequently (I'm a bit of an update enthusiast) -- at least
once a week, though I prefer more often as I find that the longer I
leave between syncs and world-updates, the more effort I have to
overcome issues (few though they are). So git is a better fit for me, I
think.

-d


Rich Freeman

Jul 6, 2018, 8:30:04 AM
On Fri, Jul 6, 2018 at 7:57 AM Davyd McColl <dav...@gmail.com> wrote:
>
> @Rich: if I understand the process correctly, the same commits are
> pushed to infra and GitHub by the CI bot?
>

I'm pretty sure the repos are identical (well, aside from whatever
order they're updated in).

> I ask because prior to the GitHub incident, I didn't have signature
> verification enabled (I hadn't read about it and it didn't even occur to
> me). So my plan was to (whilst GitHub was being sorted out) switch to
> the gentoo git repo and enable verification and, once I'd seen that that
> was working (because I'd also seen intermediate emails on this list from
> people having issues getting signing keys working), perhaps switch back
> to GitHub to put less strain on the Gentoo servers.

I never had issues with the signing keys, but git syncing works
differently from webrsync (which makes those threads a bit of a mess
as you have people offering advice to people using a different sync
method). It is probably best to view them as completely different
implementations, though I'm sure they have elements in common.

Biggest issue with git signature verification is that right now it
will still do a full pull/checkout before verifying, which means that
if it fails you still have a bad /usr/portage (you get an error, but
that's it, and subsequent emerge commands will act on the bad repo).
For that reason alone it might be best to stick with infra's version
until the patch makes its way into release (the patch will do a fetch
and verify before it does a checkout, so while you might have bad git
commits in the history the actual contents of /usr/portage will be
known-good unless you go manually running git commands without doing
your own verification).

Now, in the recent attack a git sync would still have been safe
because the attacker was dumb and did a force push, which will make
git complain loudly if you try to pull (unless you stick --force in
your pull, which probably isn't a great idea for scripts and portage
doesn't do this). But, that was just dumb luck because a smart
attacker would have rebased the nefarious commits so that they'd
seamlessly pull. Really the attack was more of a defacement than
anything as they made a bunch of mistakes that showed they weren't
very serious, but any wakeup call is worth acting on.

> So if the same commits are just pushed to two remotes (gentoo and
> GitHub), then I should (in theory) be able to change my repo.conf
> settings, fiddle the remote in /usr/portage, and switch seamlessly from
> gentoo to GitHub? Alternatively, I could start with a clean /usr/portage
> again, once I'm happy that I have signature verification working on my
> machine.

As far as I can tell if you edit the repo URL in repos.conf and
probably also .git/config it should just seamlessly work, but I
haven't tried it. Since it only accepts fast-forward pulls it
shouldn't do anything if the histories don't match. If you do a sync
immediately before/after the change maybe you'll find that one repo is
behind the other and you just won't get any updates until the new repo
catches up, but I don't think portage will revert anything (that is an
advantage of git - it has a concept of directionality, though it looks
like portage is looking to add support to prevent replay attacks with
rsync as well).

> I do sync frequently (I'm a bit of an update enthusiast) -- at least
> once a week, though I prefer more often as I find that the longer I
> leave between syncs and world-updates, the more effort I have to
> overcome issues (few though they are). So git is a better fit for me, I
> think.

Honestly, I think git is a good fit for a lot of Gentoo users. Yes,
it is different, but all the history/etc is the sort of thing I think
would appeal to many here. Also, git is something that is becoming
increasingly unavoidable, and mostly for reasons that have universal
appeal. Once you grok it you'll be using it everywhere.

Security is obviously getting a renewed focus across the board, so I
think we'll see improvements no matter how you use Gentoo, ideally
using defaults (for whatever reason git sig checking isn't a default
today). Besides improving verification on the end-user side there is
also a lot of interest in improving security on the developer side,
and with infra (hardware tokens, maybe E2E signature checking, etc).
As usual this involves a certain amount of debate (authentication
isn't actually all that easy of a problem).


--
Rich

Davyd McColl

Jul 6, 2018, 8:30:04 AM
@Rich thanks for taking the time to formulate that in-depth response.
Appreciated.

-d


Martin Vaeth

Jul 7, 2018, 1:40:03 AM
Rich Freeman <ri...@gentoo.org> wrote:
>
> Biggest issue with git signature verification is that right now it
> will still do a full pull/checkout before verifying

Biggest issue is that the git signature is made by the developer who
last committed, which means that in practice you need dozens/hundreds
of keys. No package is available for this, and the only tool I know of
which was originally developed to manage these (app-crypt/gkeys)
is not ready for use for verification (gkeys-gpg --verify was
apparently never run by its developer, since its Python code already
breaks at argument parsing), and its development has stalled.

Moreover, although I have written a dirty substitute for gkeys-gpg, it
is not clear how to use gkeys to update signatures and remove stale
ones: it appears that for each use you have to fetch all seeds and
keys anew. (And I am not even sure whether the seeds it fetches are
really still maintained.)

So currently, it is impossible to do *any* automatic tree verification,
unless you manually fetch/update all of the developer keys.

The safest bet, if you are a git user, is to verify manually whether the
"Verified" field of the latest commit on GitHub really belongs to a
Gentoo developer and is not a fake account. (Though that may be hard
to decide.)
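(From the command line, git itself can at least show the signature on the newest commit -- judging whether the key really belongs to a Gentoo developer remains the hard part:)

```shell
# Show the GPG signature attached to the most recent commit, if any.
cd /usr/portage
git log --show-signature -1
# Or check by exit status (fails on an unsigned or unverifiable commit):
git verify-commit HEAD
```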

> until the patch makes its way into release (the patch will do a fetch
> and verify before it does a checkout

This helps nothing to get all the correct keys (and no fake keys!)
you need to verify the signature.

> unless you stick --force in your pull

Unfortunately, it is not that simple: git pull --force only works if
the checked-out tree is old enough (in which case git pull without --force
would have worked as well, BTW).
The correct thing to do if git pull failed is:

git update-index --refresh -q --unmerged  # -q is important here!
git fetch
git reset --hard "$(git rev-parse --abbrev-ref --symbolic-full-name @{upstream})"

(The first command is needed to get rid of problems caused by filesystems
like overlayfs.)

(If you are a developer and do not want to risk syncing overwriting
your uncommitted changes, you may want to replace --hard with --merge.)

> not a great idea for scripts and portage doesn't do this).

I think it is a very good idea. In fact, portage previously did this
*always* (with --merge instead of --hard), and the only reason it was
removed is that
git update-index --refresh -q --unmerged
takes quite some time, which is unnecessary for people who do not
use a special filesystem like overlayfs for the portage tree.
The right thing to do, IMHO, would be for portage to use this as
a fallback if "git pull" fails. I usually patch portage to do this.

> that was just dumb luck

Exactly. That's why using "git pull" should not be considered
a security measure. It is only a safety measure if you are
a developer and want to avoid losing local changes at any price
if you mistakenly sync before committing (although the mentioned
--merge instead of --hard should be safe here, too).

> Honestly, I think git is a good fit for a lot of Gentoo users.

At least since the ChangeLogs have been removed.
IMHO it was the wrong decision to not keep them in the rsync tree
(The tool to regenerate them from git was/is available).

> it is different, but all the history/etc is the sort of thing I think
> would appeal to many here.

Having the ChangeLogs would certainly be sufficient for the majority
of users. It is very rare that a user really needs to access an
older version of a file, and in that case it is simple enough
to fetch it manually from e.g. GitHub.

> Also, git is something that is becoming increasingly unavoidable

If you learn something about git from using it through portage,
this only indicates a bug in portage. (Like e.g. using "git pull" is).

> Security is obviously getting a renewed focus across the board

Unfortunately, due to the mentioned keys problem, git is
currently the *least safe* method for syncing. Portage's "git pull"
bug does not make it appealing for normal usage, either.
(BTW, due to the number of committers, the portage tree has a quite
strict policy w.r.t. forced pushes. Overlays, especially those of single
users, might have different policies and thus can fail quite often
due to the "git pull" bug.)

Martin Vaeth

Jul 7, 2018, 1:50:02 AM
Rich Freeman <ri...@gentoo.org> wrote:
>
> git has the advantage that it can just read the current HEAD and from
> that know exactly what commits are missing, so there is way less
> effort spent figuring out what changed.

I don't know the exact protocol, but I would assume that git is
even more efficient: I would assume

1. git transfers only changes between similar files
(in contrast: rsync could only do this if the filename has not
changed, and even that is switched off for portage syncing).

2. git transfers compressed data.

(Both are assumptions which perhaps some git guru might confirm.)

Martin Vaeth

Jul 7, 2018, 2:00:03 AM
Davyd McColl <dav...@gmail.com> wrote:
> @Rich: if I understand the process correctly, the same commits are
> pushed to infra and GitHub by the CI bot?

Yes, the repositories are always identical (up to a few seconds delay).

> I ask because prior to the GitHub incident, I didn't have signature
> verification enabled

Currently, it is not practical to change this, see my other posting.

> then I should (in theory) be able to change my repo.conf
> settings, fiddle the remote in /usr/portage, and switch seamlessly from
> gentoo to GitHub?

If by "fiddle the remote in /usr/portage" you mean editing
the .git/config file, you are right.
Note that just changing the remote in repos.conf only has an
effect if you completely remove /usr/portage, so that portage has
to clone anew.
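(Concretely, that edit can also be done with git itself -- a sketch, assuming the checkout lives in /usr/portage and the usual GitHub mirror URI; repos.conf's sync-uri should be changed to match:)

```shell
# Point the existing checkout at the GitHub mirror instead of infra.
cd /usr/portage
git remote set-url origin https://github.com/gentoo/gentoo.git
git remote -v    # confirm the fetch/push URLs changed
```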

Rich Freeman

Jul 7, 2018, 5:30:03 AM
On Sat, Jul 7, 2018 at 1:51 AM Martin Vaeth <mar...@mvath.de> wrote:
>
> Davyd McColl <dav...@gmail.com> wrote:
>
> > I ask because prior to the GitHub incident, I didn't have signature
> > verification enabled
>
> Currently, it is not practical to change this, see my other posting.
>

You clearly don't understand what it actually checks. It is
completely practical to enable this today (though not as secure as it
could be). I'll elaborate in a reply to the other email.

--
Rich

Rich Freeman

Jul 7, 2018, 5:40:03 AM
On Sat, Jul 7, 2018 at 1:34 AM Martin Vaeth <mar...@mvath.de> wrote:
>
> Rich Freeman <ri...@gentoo.org> wrote:
> >
> > Biggest issue with git signature verification is that right now it
> > will still do a full pull/checkout before verifying
>
> Biggest issue is that the git signature is made by the developer who
> last committed, which means that in practice you need dozens/hundreds
> of keys.

This is untrue. The last git signature is made by infra or the
CI-bot, and this is the signature that portage checks.

Portage will NOT accept a developer key, or any other key in your
keychain, as being valid.

It will, of course, not work on the regular git repo used for
committing for this reason. You need to use a repo that is signed by
infra (which typically includes metadata/etc as well).

I'll trim most of the rest of your email and only reply to significant
bits, because you seem to not understand the point above which
invalidates almost everything you wrote. The concerns you raise would
be an issue if you were checking individual developer keys.

>
> So currently, it is impossible to do *any* automatic tree verification,
> unless you manually fetch/update all of the developer keys.
>

As noted, you don't need to fetch any developer keys, and if you do
fetch them, portage will ignore them.

>
> > unless you stick --force in your pull
>
> Unfortunately, it is not that simple: git pull --force only works if
> the checked out tree is old enough (in which case git pull without --force
> would have worked also, BTW).

You completely trimmed the context around my quote. I was talking
about the malicious commits in the recent attack. They were
force-pushed, so it doesn't matter how complete your repository is -
they simply would not be pulled without --force.

You seem to be providing advice for how to do a pull with a shallow
repository, which I'm not talking about.

> > Honestly, I think git is a good fit for a lot of Gentoo users.
>
> At least since the ChangeLogs have been removed.
> IMHO it was the wrong decision to not keep them in the rsync tree
> (The tool to regenerate them from git was/is available).

Changelogs are redundant with git, and they take a ton of space (which
of late everybody seems to be super-concerned about). I don't get
that on one hand people get twitchy about /usr/portage taking more
than 1GB, and on the other hand they want a bazillion text files
dumped all over the place, and as a bonus they want them prepended to
instead of appended so that rsync resends the whole thing instead of
just the tail...

But, this was endlessly debated before the decision was made. Trust
me, I read every post before voting to have them removed.

>
> > it is different, but all the history/etc is the sort of thing I think
> > would appeal to many here.
>
> Having the ChangeLogs would certainly be sufficient for the majority
> of users. It is very rare that a user really needs to access the
> older version of the file, and in that case it is simple enough
> to fetch it manually from e.g. github.

It is very rare that somebody would want to use Gentoo at all. My
point is that the sorts of people who like Gentoo would probably tend
to like git. But, to each their own...

>
> > Security is obviously getting a renewed focus across the board
>
> Unfortunately, due to the mentioned keys problem, git is
> currently the *unsafest* method for syncing.

The "keys problem" has nothing to do with the security of git
verification, because those keys are not used by git verification on
the end-user side. An infra-controlled key is used for verification
whether you sync with git or rsync. Either way you're relying on
infra checking the developer keys at time of commit.

Now, as I already mentioned git syncing is currently less safe due to
it doing the checkout before the verification, and they are in the
process of fixing this.

> (BTW, due to the number of committers the portage tree has a quite
> strict policy w.r.t. forced pushes. Overlays, especially of single
> users, might have different policies and thus can fail quite often
> due to the "git pull" bug.)

It probably should be a configurable option in repos.conf, but
honestly, forced pushes are not something that should be considered a
good practice. There are times that it is the best option, but those
are rare, IMO.

--
Rich

Martin Vaeth

Jul 7, 2018, 5:40:02 PM
Rich Freeman <ri...@gentoo.org> wrote:
> On Sat, Jul 7, 2018 at 1:34 AM Martin Vaeth <mar...@mvath.de> wrote:
>>
>> Biggest issue is that the git signature is made by the developer who
>> last committed, which means that in practice you need dozens/hundreds
>> of keys.
>
> This is untrue. [...]
> It will, of course, not work on the regular git repo [...]
> You need to use a repo that is signed by infra
> (which typically includes metadata/etc as well).

I was speaking about gentoo's git repository, of course
(the one which was attacked on github), not about a Frankensteined one
with metadata history filling megabytes of disk space unnecessarily.
Who has that much disk space to waste?

For the official git repository your assertions are simply false,
as you apparently admit: it is currently not possible to use the
official git repo (or the GitHub clone of it, which was attacked)
in a secure manner.

>> > unless you stick --force in your pull
>>
>> Unfortunately, it is not that simple: git pull --force only works if
> [...]
> You completely trimmed the context around my quote. [...]
> they simply would not be pulled without --force.

I was saying that they would not be pulled *with* --force either,
because pull --force is not as strong as you think it is (it would
have shown you conflicts to resolve manually).
You would have to use the commands that I have posted.

> You seem to be providing advice for how to do a pull with a shallow
> repository

No, what I said is not related to a shallow repository. It has to do
with pulling a forced push, in general.

>> At least since the ChangeLogs have been removed.
>> IMHO it was the wrong decision to not keep them in the rsync tree
>> (The tool to regenerate them from git was/is available).
>
> Changelogs are redundant with git, and they take a ton of space (which
> of late everybody seems to be super-concerned about)

Compared to the git history, they take very little space.
If you squash the portage tree, it is hardly measurable.
And with the ChangeLogs, rsync would still be a sane option for
most users. Without ChangeLogs many users are unnecessarily forced
to change and to sacrifice the space for git history.

> and as a bonus they want them prepended to
> instead of appended so that rsync resends the whole thing instead of
> just the tail...

Your implicit claim is untrue. rsync - as used by portage - always
transfers whole files, only.

> But, this was endlessly debated before the decision was made.

The decision was about removing the ChangeLogs from the git
repository. This was certainly the correct decision, because -
as you said - the ChangeLogs *can* be regenerated from the
git history and thus it makes no sense to modify/store them
redundantly.

But I was speaking about the distribution of ChangeLogs in rsync:
Whenever the infrastructure uses egencache to generate the metadata,
it could simply pass --update-changelogs so that rsync users
still would have ChangeLogs: They cannot get them from git history.

> My
> point is that the sorts of people who like Gentoo would probably tend
> to like git.

"Liking" git does not mean that one has to use it also for things
where it brings nothing. And for most users, none of its features
is useful for the portage tree, with one exception: ChangeLogs.
That's why I am advocating bringing them back to the rsync tree.

> The "keys problem" has nothing to do with the security of git
> verification, because those keys are not used by git verification on
> the end-user side.

Anyone who is enough of a git/developer type to prefer git despite
its higher disk usage will certainly want to use the actual
source repository and not an inferior rsync-clone repository.

> It probably should be a configurable option in repos.conf, but
> honestly, forced pushes are not something that should be considered a
> good practice.

1. portage shouldn't decide about the practices of overlays.
2. Forced pushes happen occasionally even in the official gentoo
repository. The last occurrence was e.g. when undoing the
malevolent forced push ;)
3. git pull fails not only on forced pushes but also on several
other occasions; for instance, if your filesystem changed inode
numbers (e.g. squash + overlayfs after a resquash + remount).
4. Even if the user made the mistake of editing a file, portage should
not just die on syncing.

Martin Vaeth

Jul 7, 2018, 6:40:03 PM
Rich Freeman <ri...@gentoo.org> wrote:
> On Sat, Jul 7, 2018 at 1:51 AM Martin Vaeth <mar...@mvath.de> wrote:
>> Davyd McColl <dav...@gmail.com> wrote:
>>
>> > I ask because prior to the GitHub incident, I didn't have signature
>> > verification enabled
>>
>> Currently, it is not practical to change this, see my other posting.
>
> You clearly don't understand what it actually checks.

Davyd and I were obviously speaking about the gentoo repository
(the official one and the one on github which got hacked).
For these repositories verification is practically not possible.
(That there are also *other* repositories - with huge metadata history -
which might be easier to verify is a different story).

Perversely, the official comments after the hack suggested
that you should have enabled signature verification for
the hacked repository, which was simply not practical.

Rich Freeman

Jul 7, 2018, 7:10:03 PM
On Sat, Jul 7, 2018 at 5:29 PM Martin Vaeth <mar...@mvath.de> wrote:
>
> Rich Freeman <ri...@gentoo.org> wrote:
> > On Sat, Jul 7, 2018 at 1:34 AM Martin Vaeth <mar...@mvath.de> wrote:
> >>
> >> Biggest issue is that the git signature is made by the developer who
> >> last committed, which means that in practice you need dozens/hundreds
> >> of keys.
> >
> > This is untrue. [...]
> > It will, of course, not work on the regular git repo [...]
> > You need to use a repo that is signed by infra
> > (which typically includes metadata/etc as well).
>
> I was speaking about gentoo's git repository, of course
> (the one which was attacked on github), not about a Frankensteined one
> with metadata history filling megabytes of disk space unnecessarily.
> Who has that much disk space to waste?

Doesn't portage create that metadata anyway when you run it, negating
any space savings at the cost of CPU to regenerate the cache?

>
> For the official git repository your assertions are simply false,
> as you apprently admit: It is currently not possible to use the
> official git repo (or the github clone of it which was attacked)
> in a secure manner.
>

Sure, but this also doesn't support signature verification at all (at
least not by portage - git can of course manually verify any commit),
so your points still don't apply.

> > and as a bonus they want them prepended to
> > instead of appended so that rsync resends the whole thing instead of
> > just the tail...
>
> Your implicit claim is untrue. rsync - as used by portage - always
> transfers whole files, only.

rsync is capable of transferring partial files. I can't vouch for how
portage is using it, but both the rsync command line program and
librsync can do partial file transfers. However, this is based on
offsets from the start of the file, so appending to a file will result
in the first part of the file being identical, but prepending will
break rsync's algorithm.

>
> > But, this was endlessly debated before the decision was made.
>
> The decision was about removing the ChangeLogs from the git
> repository. This was certainly the correct decision, because -
> as you said - the ChangeLogs *can* be regenerated from the
> git history and thus it makes no sense to modify/store them
> redundantly.

There were two decisions:

https://projects.gentoo.org/council/meeting-logs/20141014-summary.txt

"do we need to continue to create new ChangeLog entries once we're
operating in git?" No.

https://projects.gentoo.org/council/meeting-logs/20160410-summary.txt

"The council does not require that ChangeLogs be generated or
distributed through the rsync system. It is at the discretion of our
infrastructure team whether or not this service continues."
Accepted (4 yes, 1 no, 2 abstention)

> > It probably should be a configurable option in repos.conf, but
> > honestly, forced pushes are not something that should be considered a
> > good practice.
>
> 1. portage shouldn't decide about practices of overlays.

Hence the reason I suggested it should be a repos.conf option.

> 2. also in the official gentoo repository force pushes happen
> occasionally. Last occurrence was e.g. when undoing the
> malevolent forced push ;)

Sure, but that was a fast-forward from the last good commit, so it
wouldn't require a force pull unless a user had done a force pull on
the bad repo.

> 3. git pull fails not only for forced pushes but also on several
> other occasions; for instance, if your filesystem changed inode
> numbers (e.g. squash + overlayfs after a resquash+remount).

If you're using squashfs git pull probably isn't the right solution for you.

> 4. Even if the user made the mistake to edit a file, portage should
> not just die on syncing.

emerge --sync won't die in a situation like this in general. Maybe it
will if there is a merge conflict, but I don't think the correct default
in this case should be to just wipe out the user's changes. I'm all for
making that an option, however.

--
Rich

Martin Vaeth

unread,
Jul 8, 2018, 4:40:03 AM7/8/18
to
Rich Freeman <ri...@gentoo.org> wrote:
>> I was speaking about gentoo's git repository, of course
>> (the one which was attacked on github), not about a Frankensteined one
>> with metadata history filling megabytes of disk space unnecessarily.
>> Who has that much disk space to waste?
>
> Doesn't portage create that metadata anyway when you run it

It is better to have it created by egencache in portage-postsyncd;
moreover, you should download some other repositories as well
(news announcements, GLSA, DTD, XML schema) which are maintained
independently; see e.g.
https://github.com/vaeth/portage-postsyncd-mv

It is the Gentoo way: download only the sources and build from there.
That is also a question of mentality, and why I think most Gentoo users
who use git would prefer it that way.

> negating any space savings at the cost of CPU to regenerate the cache?

It's the *history* of the metadata which matters here:
since every changed metadata file takes only a fraction of a second
to regenerate, one can estimate rather well that several tens of
thousands of files are changed hourly/daily/weekly (the frequency
depending mainly on eclass changes: one change in some eclass requires
a change for practically every version of every package), so the
history of metadata changes produced over time is enormous. This
history, of course, is completely useless and stored completely in vain.
One of the weaknesses of git is that it is impossible, by design,
to omit such superfluous history selectively (once the files *are*
maintained by git).

>> For the official git repository your assertions are simply false,
>> as you apparently admit: It is currently not possible to use the
>> official git repo (or the github clone of it which was attacked)
>> in a secure manner.
>
> Sure, but this also doesn't support signature verification at all
> [...] so your points still don't apply.

Hu? This actually *was* my point.

BTW, portage could easily support signature verification if only the
distribution of the developers' public keys were properly maintained
(e.g. via gkeys, or more simply via some package):
after all, gentoo infra should always have an up-to-date list of
these keys anyway.
(If they don't, it would make it even more important to use the
source repo instead of trusting a signature which is given
without sufficient verification.)

>> Your implicit claim is untrue. rsync - as used by portage - always
>> transfers whole files, only.
>
> rsync is capable of transferring partial files.

Yes, and portage is explicitly disabling this. (It costs a lot of
server CPU time and does not save much transfer data if the files
are small, because a lot of hashes have to be transferred
(and calculated - CPU-time!) instead.)
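For reference, the defaults portage passes to rsync include `--whole-file` (the exact list varies by portage version; see the PORTAGE_RSYNC_OPTS entry in `man make.conf` for yours). An illustrative excerpt:

```shell
# Illustrative excerpt of portage's default rsync options (make.conf);
# --whole-file is what disables rsync's delta-transfer algorithm.
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times \
    --compress --force --whole-file --delete --stats --human-readable \
    --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
```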

> However, this is based on offsets from the start of the file

There are newer algorithms which also support detection of insertions
and deletions via rolling hashes (e.g. for deduplicating filesystems).
Rsync uses quite an advanced algorithm as well, but I would
need to recheck its features.
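The core idea of such a rolling-hash search can be sketched in a few lines (a toy illustration, not rsync's actual Adler-32-style checksum): checksum each aligned block of the old file, then slide a byte-at-a-time window over the new file, so blocks are found again even after data has been prepended.

```python
# Toy block-matching with a rolling-style weak checksum: find blocks of an
# old file inside a new file even after data has been prepended. This only
# illustrates the idea; rsync's real weak checksum is an Adler-32-like pair
# that can be updated incrementally as the window slides.
BLOCK = 8

def weak(data):
    # Simplistic additive checksum standing in for rsync's weak hash.
    return sum(data) % 65521

def match_offsets(old, new):
    # Checksum every aligned block of the old file ...
    table = {}
    for i in range(0, len(old) - BLOCK + 1, BLOCK):
        table.setdefault(weak(old[i:i + BLOCK]), []).append(i)
    # ... then slide a window over the new file one byte at a time,
    # confirming weak-hash hits with a direct (strong) comparison.
    hits = []
    for j in range(len(new) - BLOCK + 1):
        for i in table.get(weak(new[j:j + BLOCK]), []):
            if new[j:j + BLOCK] == old[i:i + BLOCK]:
                hits.append((i, j))
    return hits

old = bytes(range(32))            # four distinct 8-byte blocks
new = b"xyz" + old                # prepend three bytes
print(match_offsets(old, new))    # -> [(0, 3), (8, 11), (16, 19), (24, 27)]
```

Every block of the old file is still found, just at offsets shifted by the three prepended bytes; a scheme keyed on fixed offsets from the start of the file would find nothing.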

Anyway, it plays no role for our discussion, because for such
small files it hardly matters, and portage is disabling
said algorithm anyway.

> "The council does not require that ChangeLogs be generated or
> distributed through the rsync system. It is at the discretion of our
> infrastructure team whether or not this service continues."

The formulation already makes it clear that nobody wanted to
put pressure on infra, and at that time it was expected that
every user would switch to git anyway.
At that time the gkeys project was also very active, and git was
(besides webrsync) the only expected way to get checksums for the
full tree. In particular, rsync was inherently insecure.

The situation has meanwhile changed on both sides: gkeys has
apparently been practically abandoned, and instead gemato was
introduced and is actively supported. That the gentoo-mirror
repository is now suddenly more secure than the git repository is
also a side effect of gemato, because only for gemato are the infra
keys now distributed in a package.

> If you're using squashfs git pull probably isn't the right solution for you.

Exactly. That's why I completely disagree with portage's regression
of replacing the previously working solution with the only partially
working "git pull".

>> 4. Even if the user made the mistake to edit a file, portage should
>> not just die on syncing.
>
> emerge --sync won't die in a situation like this in general.

It does: git pull refuses to start if there are uncommitted changes.

> but I don't think the correct default in this case should be
> to just wipe out the user's changes.

I do: as with rsync, a user should not make changes to the distributed
tree (unless he makes a PR) but in an overlay; otherwise he will
permanently have outdated files which are not correctly updated.
*If* a user wants such changes, he should use git correctly and commit.

But I am not against making this an opt-in option for a developer
(or advanced user) who is afraid of losing a change in case he
forgot to commit before syncing.

Anyway, this has nothing to do with "git pull" vs. "git fetch + git reset",
but is only a question of whether the option "--hard" or "--merge" should
be used for "git reset".

One could certainly also live with "reset --merge" as the default (or
even the only option), as it was previously in portage, but the change
to "pull" was IMHO a regression.
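For concreteness, the fetch-plus-reset approach being discussed looks roughly like this (a sketch; the repository path and branch name are illustrative):

```shell
# Sync by fetching the new head and then moving the work tree onto it.
cd /usr/portage
git fetch origin

# --hard discards any local uncommitted edits;
# --merge keeps local changes that don't conflict with the update
# and aborts when they do.
git reset --hard origin/master    # or: git reset --merge origin/master
```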

Rich Freeman

unread,
Jul 8, 2018, 5:20:03 AM7/8/18
to
On Sun, Jul 8, 2018 at 4:28 AM Martin Vaeth <mar...@mvath.de> wrote:
>
> Rich Freeman <ri...@gentoo.org> wrote:
>
> It's the *history* of the metadata which matters here:

You make a reasonable point here.

> > "The council does not require that ChangeLogs be generated or
> > distributed through the rsync system. It is at the discretion of our
> > infrastructure team whether or not this service continues."
>
> The formulation already makes it clear that one did not want to
> put pressure on infra, and at that time it was expected that
> every user would switch to git anyway.

Git was expected to provide the history, and yes, in general the
Council tries not to forbid projects from providing services. The
intent was to communicate that it was simply not an expectation that
they do so.

> At that time also the gkeys project was very active, and git was
> (besides webrsync) the only expected way to get checksums for the
> full tree. In particular, rsync was inherently insecure.

Honestly, I don't think gkeys really played any part in this, but
there was definitely an intent for signature checking in the tree to
become more robust. As you point out (in a part I trimmed) it ought
to be possible to do this. Indeed, git support for signing commits
was considered a requirement for git implementation.

> >> 4. Even if the user made the mistake to edit a file, portage should
> >> not just die on syncing.
> >
> > emerge --sync won't die in a situation like in general.
>
> It does: git push refuses to start if there are uncommitted changes.
>

I did a test before I made my post. emerge --sync works just fine if
there are uncommitted changes in your repository, whether they are
indexed or otherwise. I didn't test merge conflicts but I'd hope it
would fail if these exist.

--
Rich

Martin Vaeth

unread,
Jul 8, 2018, 2:40:03 PM7/8/18
to
Rich Freeman <ri...@gentoo.org> wrote:
> emerge --sync works just fine if
> there are uncommitted changes in your repository, whether they are
> indexed or otherwise.

You are right. It seems to be somewhat "random" when git pull
refuses to work and when it doesn't. I could not detect a common
scheme. Maybe this mainly has to do with using overlayfs and git
becoming confused.

Peter Humphrey

unread,
Jul 14, 2018, 4:40:03 AM7/14/18
to
On Friday, 6 July 2018 06:34:01 BST Davyd McColl wrote:

> 1) `sync-depth` has been deprecated (should now use `clone-depth`)

But to what value should clone-depth be set? And why is the recent news item
referring to instructions to use sync-depth?

--
Regards,
Peter.

Rich Freeman

unread,
Jul 14, 2018, 6:50:03 AM7/14/18
to
On Sat, Jul 14, 2018 at 4:30 AM Peter Humphrey <pe...@prh.myzen.co.uk> wrote:
>
> On Friday, 6 July 2018 06:34:01 BST Davyd McColl wrote:
>
> > 1) `sync-depth` has been deprecated (should now use `clone-depth`)
>
> But to what value should clone-depth be set?

That comes down to personal taste. Do you want any history to be able
to browse it? More depth means more history. If all you want is the
current tree without history then you want a depth of 1, and of course
you'll need to set up a cron job or something to go cleaning up past
history (you never NEED more than the last commit). If you browse the
online git repo you can see about how many commits there are in a day
and estimate how many you want based on how many days you want.

Also, this value only matters for the first sync. After that portage
currently doesn't try to discard past commits, and it will always
fetch all commits between your current state and the new head.

If you want you could set up a script to manually purge history, and
then do an initial sync with 1 depth. Then anytime you sync you could
review the history since the last time you synced, and then run the
purge command to discard all history up to the current commit. In
doing this you'll always see all the history since the last time you
reviewed it.

--
Rich

Peter Humphrey

unread,
Jul 14, 2018, 8:10:03 AM7/14/18
to
On Saturday, 14 July 2018 11:40:03 BST Rich Freeman wrote:
> On Sat, Jul 14, 2018 at 4:30 AM Peter Humphrey <pe...@prh.myzen.co.uk>
wrote:
> > On Friday, 6 July 2018 06:34:01 BST Davyd McColl wrote:
> > > 1) `sync-depth` has been deprecated (should now use `clone-depth`)
> >
> > But to what value should clone-depth be set?
>
> That comes down to personal taste. Do you want any history to be able
> to browse it? More depth means more history. If all you want is the
> current tree without history then you want a depth of 1...

That's all I need for the portage tree, unless removing everything at lower
depths will remove the change records.

> ...and of course you'll need to set up a cron job or something to go
> cleaning up past history (you never NEED more than the last commit). If you
> browse the online git repo you can see about how many commits there are in a
> day and estimate how many you want based on how many days you want.
>
> Also, this value only matters for the first sync. After that portage
> currently doesn't try to discard past commits, and it will always
> fetch all commits between your current state and the new head.
>
> If you want you could set up a script to manually purge history, and
> then do an initial sync with 1 depth. Then anytime you sync you could
> review the history since the last time you synced, and then run the
> purge command to discard all history up to the current commit. In
> doing this you'll always see all the history since the last time you
> reviewed it.

Is there something in git to do that purging? If not, perhaps a simple monthly
script to delete /usr/portage/* - but not packages or distfiles, which are on
separate partitions here - would do the trick.

--
Regards,
Peter.

Martin Vaeth

unread,
Jul 14, 2018, 11:00:03 AM7/14/18
to
Peter Humphrey <pe...@prh.myzen.co.uk> wrote:
> On Friday, 6 July 2018 06:34:01 BST Davyd McColl wrote:
>
>> 1) `sync-depth` has been deprecated (should now use `clone-depth`)
>
> [...] And why is the recent news item referring to instructions
> to use sync-depth?

Things have changed with portage-2.3.42:
sync-depth is again supported (in addition to clone-depth).
clone-depth is for the first cloning,
sync-depth is for the running system.

The strategy which I had mentioned (with --merge) is used if sync-depth
is set (and nonzero).
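Putting the two options together, a repos.conf stanza might look like this (values illustrative; depths of 1 give a current-state-only tree):

```
[gentoo]
location = /usr/portage
sync-type = git
sync-uri = git://anongit.gentoo.org/repo/sync/gentoo.git
# depth of the initial clone
clone-depth = 1
# depth used on subsequent syncs (portage >= 2.3.42)
sync-depth = 1
```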

Only the git index for people with overlayfs is still missing, although
there are plans to introduce this also, optionally.

Rich Freeman

unread,
Jul 14, 2018, 11:10:02 AM7/14/18
to
On Sat, Jul 14, 2018 at 8:00 AM Peter Humphrey <pe...@prh.myzen.co.uk> wrote:
>
> That's all I need for the portage tree, unless removing everything at lower
> depths will remove the change records.

If you clone with a depth of one you'll see the current state of the
tree, and a commit message from the CI bot, and that is it. You'll
have zero change history for anything.

If you clone with a depth of 10 you'll see one or two CI bot messages,
and then the last 8 or so actual changes to the tree. You'll also
have access to what the tree looked like when each of those changes
was made.

Note that git uses COW and compression, so the cost of increasing your
depth isn't very high. A depth of 1 costs you about 670M, and a depth
of 236000 costs you 1.5G. I'd expect the cost to be roughly linear
between these.

>
> Is there something in git to do that purging? If not, perhaps a simple monthly
> script to delete /usr/portage/* - but not packages or distfiles, which are on
> separate partitions here - would do the trick.

That delete would certainly work, though it would cost you a full sync
(which would go back to your depth setting). I'd suggest moving
distfiles outside of the repo if you're going to do that (really, it
shouldn't be inside anyway), just to make it easier.

git has no facilities to do this automatically, probably because it
isn't something Linus does and git is very much his thing. However, I
found that this works for me:

git rev-parse HEAD >! .git/shallow   # mark current HEAD as the shallow boundary ('>!' is zsh; plain '>' elsewhere)
git reflog expire --expire=all --all # drop reflog entries keeping old commits reachable
git gc --prune=now                   # garbage-collect the now-unreachable history

(This is a combination of:
https://stackoverflow.com/a/34829535 (which doesn't work)
and
https://stackoverflow.com/a/46004595 (which is incomplete))

It runs in about 14s for me in a tmpfs.

Another option would be to do a local shallow clone and swap the repositories.

You'll find tons of guides online for throwing out history that
involve rebasing. You do NOT want to do this here. These will change
the hash of the HEAD, which means that the next git pull won't be a
fast-forward, and it will be a mess in general. You just want to
discard local history, not rewrite the repository to say that there
never was any history.

Also note that the first line in this little script depends somewhat
on git internals and may or may not work in the distant future.

In any case, I suggest trying it. If it somehow eats your repo for
breakfast just delete it and the next sync will re-fetch.
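That local-clone swap could look roughly like this (paths illustrative; the file:// URL matters because git ignores --depth for plain local paths and does a full clone instead):

```shell
# Make a shallow copy of the existing repo, then swap it into place.
git clone --depth 1 file:///usr/portage /usr/portage.new
mv /usr/portage /usr/portage.old
mv /usr/portage.new /usr/portage

# The fresh clone's origin points at the old local path, so restore
# the real sync URI before the next emerge --sync.
git -C /usr/portage remote set-url origin \
    git://anongit.gentoo.org/repo/sync/gentoo.git
rm -rf /usr/portage.old
```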

--
Rich

Rich Freeman

unread,
Jul 14, 2018, 1:00:03 PM7/14/18
to
On Sat, Jul 14, 2018 at 11:06 AM Rich Freeman <ri...@gentoo.org> wrote:
>
> git rev-parse HEAD >! .git/shallow
> git reflog expire --expire=all --all
> git gc --prune=now
>

Before anybody bangs their head against the wall too much: I did end up
having syncing issues with this. I suspect the fix is a one-liner,
but finding that one-liner has defied a fair bit of messing around.

In general the git authors aren't really big on supporting this sort
of thing, so it is just a big hack. Doing a local sync to discard
history might be an option. Just deleting the repo and re-syncing is
another option.

But, if somebody comes up with a good fix I'm all ears.

--
Rich