[gentoo-dev] rfc: why are we still distributing the portage tree via rsync?

William Hubbs

Jul 3, 2018, 11:30:02 AM

All,

Mostly because of the recent "trustless infrastructure" thread, I am
wondering why we are still distributing the portage tree primarily
via rsync instead of git?

Can someone educate me on that, and is it worth considering moving away
from rsync distribution?

Thanks,

William

Rich Freeman

Jul 3, 2018, 11:40:02 AM
Here are the pros/cons that I've seen come up in the past:

1. emerge-webrsync is probably more secure at the moment, because
emerge --sync with git leaves the tree corrupt if it doesn't verify.
That seems like something that could be fixed, and which should be
fixed regardless (presumably somebody just has to do the work - I
can't imagine the portage team would turn away patches).

2. git seems to be more efficient for frequent syncing, while rsync
seems to be more efficient for infrequent syncing. I'd guess the
crossover is somewhere around a week or a few weeks, but I don't have
data to support that.

3. we have more rsync mirrors, though with the possibility of using
mirrors like github I don't see why this matters (as long as we
actually secure distribution).

4. by default git tends to accumulate history, which can eat up disk
space. I imagine this could be automatically trimmed if users wanted,
though during syncing it would at least need to store all the commits
between the last-fetched and next-fetched ones, and that means fetching
things that might have been subsequently removed/changed.

Personally I stick with git. I want the history anyway, and since I
sync frequently it involves WAY less disk IO and seems to be very
network-efficient as well.

--
Rich

William Hubbs

Jul 3, 2018, 11:40:03 AM
On Tue, Jul 03, 2018 at 08:32:55AM -0700, Brian Dolbec wrote:
> because:
>
> 1) it is still the most bandwidth economical means of distributing the
> tree

Even more so than http or https?

Thanks,

William

Brian Dolbec

Jul 3, 2018, 11:40:03 AM
On Tue, 3 Jul 2018 10:22:35 -0500
William Hubbs <will...@gentoo.org> wrote:

because:

1) it is still the most bandwidth economical means of distributing the
tree

2) we have a large infrastructure of rsync mirrors, which we do not for
git.

3) see #1
--
Brian Dolbec <dolsen>

Rich Freeman

Jul 3, 2018, 11:50:02 AM
On Tue, Jul 3, 2018 at 11:32 AM Brian Dolbec <dol...@gentoo.org> wrote:
>
> 1) it is still the most bandwidth economical means of distributing the
> tree

Is this true? If I do two syncs 10min apart, I have to imagine that
less data will get transferred for git. Certainly there will be less
disk IO. I think the main issue is when the crossover happens,
because if I sync a year apart git is going to send every file that
was ever added and then removed from the tree in that time.

Also, do we care about bandwidth when there are mirrors that offer it for free?

> 2) we have a large infrastructure of rsync mirrors, which we do not for
> git.
>

Do we need them? I've yet to see somebody complain about poor syncing
performance from github. I imagine we could just use that and a few
other free mirroring services to distribute the tree.

While I appreciate all the donors giving us mirrors/etc, it seems like
we would be much more resilient if we didn't require them in the first
place.

--
Rich

Matt Turner

Jul 3, 2018, 12:30:02 PM
On Tue, Jul 3, 2018 at 11:38 AM Rich Freeman <ri...@gentoo.org> wrote:
> 4. by default git tends to accumulate history, which can eat up disk
> space. I imagine this could be automatically trimmed if users wanted,
> though during syncing it would at least need to store all the commits
> between the last fetched and next-fetched, and that means fetching
> things that might have been subsequently removed/changed

This is why I have not switched to git. I have /usr/portage on a
separate 1GB partition (with distfiles and packages stored elsewhere).
The ebuild tree is 600MB with rsync and cannot fit on the partition
with git.

I'd be happy to switch if the space requirements were similar.

William Hubbs

Jul 3, 2018, 12:40:02 PM
On Tue, Jul 03, 2018 at 11:40:53AM -0400, Rich Freeman wrote:
> On Tue, Jul 3, 2018 at 11:32 AM Brian Dolbec <dol...@gentoo.org> wrote:
> > 2) we have a large infrastructure of rsync mirrors, which we do not for
> > git.
> >
>
> Do we need them? I've yet to see somebody complain about poor syncing
> performance from github. I imagine we could just use that and a few
> other free mirroring services to distribute the tree.

I don't feel comfortable relying on github as a primary means of
distributing the tree due to our social contract. It is a value-added
kind of service, but we should not rely on it.

William

Rich Freeman

Jul 3, 2018, 12:40:02 PM
git clone https://github.com/gentoo-mirror/gentoo.git . --depth 1
...
du -sh .
662M .

So, with a shallow clone it seems comparable.

The issue is getting git to constantly trim, probably along the lines of:
https://stackoverflow.com/a/34829535
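
A rough sketch of what such a trim step could look like, using only stock git commands. The function name trim_shallow_sync and its arguments are made up for illustration, nothing here is what Portage actually runs, and it assumes the checkout has no local changes worth keeping:

```shell
# Hypothetical helper (name and arguments are assumptions, not Portage code):
# move an existing clone to the newest upstream commit and drop older history.
trim_shallow_sync() {
    repo_dir=$1
    remote=${2:-origin}
    branch=${3:-master}

    # Fetch only the tip commit of the branch, then move the checkout to it.
    # WARNING: reset --hard discards any local modifications in the tree.
    git -C "$repo_dir" fetch --quiet --depth 1 "$remote" "$branch" || return 1
    git -C "$repo_dir" reset --quiet --hard FETCH_HEAD || return 1

    # Expire reflog entries so the previous commits are no longer referenced,
    # then let gc delete whatever has become unreachable.
    git -C "$repo_dir" reflog expire --expire=now --all || return 1
    git -C "$repo_dir" gc --quiet --prune=now
}
```

Something like "trim_shallow_sync /usr/portage origin master" after each sync would be the intended use. Note that the old and the new objects both exist on disk until gc finishes, so this does not by itself show that peak usage stays under a given partition size.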

--
Rich

Matthias Maier

Jul 3, 2018, 12:40:03 PM

On Tue, Jul 3, 2018, at 11:22 CDT, Matt Turner <matt...@gentoo.org> wrote:

> I'd be happy to switch if the space requirements were similar.

$ git clone --depth=1 https://github.com/gentoo-mirror/gentoo

occupies 662M on my machine (just tested). With full history
(i.e. without --depth=1) I am at 1.1GB.

Best,
Matthias

Rich Freeman

Jul 3, 2018, 12:50:03 PM
Do you know that all our existing mirrors are 100% FOSS?

It is a mirror. You upload something. Somebody else downloads the same thing.

If we were distributing tarballs via http would we really care if the
mirror is running apache vs IIS? Do we even check our existing
mirrors for such things? Do we care that they're running on coreboot
too, without an IME?

Hey, I'm all for having all the mirrors we can, and it isn't like
mirroring git is particularly difficult. I just think that there is a
double-standard being applied when it comes to git. I completely get
the argument when it comes to things like issues/PRs/etc since those
aren't distributed, but for git itself you really just need something
that supports the protocol and it is trivial to replace. Certainly
for anything we host we should use FOSS because it is the cleanest
solution anyway.

--
Rich

Kristian Fiskerstrand

Jul 3, 2018, 12:50:03 PM
I would expect as much. But my primary argument would be key-management related: it is simply impossible to present a raw copy of our repo to end-users and have them verify each commit.

Rich Freeman

Jul 3, 2018, 1:00:02 PM
On Tue, Jul 3, 2018 at 12:41 PM Kristian Fiskerstrand <k...@gentoo.org> wrote:
>
> I would expect as much. But my primary argument would be key-management related: it is simply impossible to present a raw copy of our repo to end-users and have them verify each commit.
>

While related, I think that the question of distribution is still a
fair one. We can still check an infra key on the head commit with git
distribution. Granted, if we want to go further than that then the
implementation will vary between git vs rsync distribution because the
signed git metadata is only available easily in git.
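
To make the "check a key on the head commit" idea concrete, here is a minimal sketch that fetches first and only fast-forwards the checkout if the fetched tip verifies. It is not how Portage implements verification; the function name verified_sync is invented for illustration, and it assumes the signing key is already in the local GnuPG keyring:

```shell
# Hypothetical helper (an illustration, not Portage's mechanism):
# update the checkout only if the fetched tip commit carries a valid signature.
verified_sync() {
    repo_dir=$1
    remote=${2:-origin}
    branch=${3:-master}

    # Download the new commits without touching the working tree.
    git -C "$repo_dir" fetch --quiet "$remote" "$branch" || return 1

    # verify-commit exits non-zero when the commit is unsigned or when the
    # signature cannot be validated against the local keyring.
    if ! git -C "$repo_dir" verify-commit FETCH_HEAD 2>/dev/null; then
        echo "refusing to update: tip of $remote/$branch failed verification" >&2
        return 1
    fi

    # Only now move the checkout forward.
    git -C "$repo_dir" merge --quiet --ff-only FETCH_HEAD
}
```

Because the merge happens after the check, a failed verification leaves the working tree at the previously accepted commit. This only checks the tip, not every commit in between.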

--
Rich

Matt Turner

Jul 3, 2018, 10:30:03 PM
Exactly. I'm not sure git can automatically trim out history on git
pull and I'm even less sure it would be able to do it without
temporarily exceeding 1GB of storage.

Matt Turner

Jul 3, 2018, 10:30:03 PM
Wait a week and emerge --sync again; it won't fit.

gro...@gentoo.org

Jul 4, 2018, 12:20:01 AM
Same here. One cannot avoid 3 things: death, taxes and insufficient
hard-disk space.

Andrey

Martin Vaeth

Jul 5, 2018, 8:10:03 AM
Matt Turner <matt...@gentoo.org> wrote:
> The ebuild tree is 600MB with rsync and cannot fit on the partition
> with git.
>
> I'd be happy to switch if the space requirements were similar.

If one runs git repack every few syncs, one currently needs about 800 MB.

Adding squashfs (zstd) plus overlayfs on top of that, the full
archive size is currently <600 MB.

In both cases, the temporary disk space is slightly more, of course.
For a 1GB reserved partition I'd use the partition for the temporary
mounting and store the archive somewhere else, but I think chances are
good that you can also get by with only a git repack after
every sync. A difficulty might be the very first git sync.
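
For reference, the "repack after every sync" variant could be sketched as below. The function name repack_after_sync is hypothetical, and the sizes quoted above are not reproduced by this snippet; it only shows the sequence of commands for a full-history clone:

```shell
# Hypothetical helper: fast-forward a full-history clone, then consolidate
# its object store so loose objects and old packs do not pile up.
repack_after_sync() {
    repo_dir=$1

    # Fast-forward from the configured upstream of the current branch.
    git -C "$repo_dir" pull --quiet --ff-only || return 1

    # -a: put all reachable objects into a single new pack
    # -d: delete packs and loose objects made redundant by that pack
    git -C "$repo_dir" repack -a -d -q
}
```

While the new pack is being written, the old packs are still on disk, which matches the remark above that the temporary space needed is somewhat higher than the final size.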

Gerion Entrup

Jul 5, 2018, 8:00:02 PM
Would it be possible to take the bare repo (< 600 MB) and only mount the latest checkout (e.g. with FUSE)?

Gerion

Kent Fredric

Jul 5, 2018, 8:30:03 PM
On Fri, 06 Jul 2018 01:55:32 +0200
Gerion Entrup <gerion...@flump.de> wrote:

> Would it be possible to take the bare repo (< 600 MB) and only mount the latest checkout (e.g. with FUSE)?

That would incur performance problems, because packed objects are
stored as differences to other objects (similar to how later pieces in
a gzip stream are dependent on earlier pieces in the stream).

Consequently, having a real checkout substantially improves performance.

For a FUSE module to compete with this, it would need a lot of special
mechanics, including keeping lots of memory reserved for state.

So you might end up trading that additional 800 MB of disk space for
500 MB to 1.5 GB of memory utilization.