Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

archive rebuilds wrt Lucas' victory

0 views
Skip to first unread message

Adam Borowski

unread,
Apr 14, 2013, 7:20:02 PM4/14/13
to
On Sun, Apr 14, 2013 at 12:00:04AM +0000, dev...@vote.debian.org wrote:
> The winners are:
> Option 3 "Lucas Nussbaum"

So this means that if you are messing with an archive rebuild¹, you risk that
being viewed as sucking up to the elect?

Kind of reminds me of a time when, at a keysigning, I mentioned that "I
wonder why do we care about a name printed on an easily faked id, shouldn't
we instead prefer »this is a guy whom a bunch of debconf attendees recognize
as The Dear Leader«", and the last words were instead taken as a personal
compliment².


[1]. I got a fast armhf box that turned out useless for its intended purpose
due to GFX card/monitor issues. Around two years ago, Lucas mentioned you
need around 340 core-hours for a rebuild, this box compiles roughly half as
fast (no idea how I/O will scale), so with 4 cores it should be a week in
the best case. I have a hunch it'd be more like a month, but hey, a rebuild
can be paused if the box finds an use after all.

Too bad, I see what seems to be most of the time being spent in dpkg
installing dependencies -- how could this be avoided? One of ideas would
be to reformat as btrfs (+eatmydata) and find some kind of a tree of
packages with similar build-depends, snapshotting nodes of the tree to
quickly reset to a wanted state -- but I guess you guys have some kind
of a solution already.

For now, I'm doing problematic libraries like reoffice by hand, thinking
how to go full auto.


[2]. Not that zack doesn't deserve those. Like, for wearing a kilt when
facing govt officials.
--
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


--
To UNSUBSCRIBE, email to debian-curi...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: http://lists.debian.org/20130414223...@angband.pl

Goswin von Brederlow

unread,
Apr 16, 2013, 5:40:02 AM4/16/13
to
On Mon, Apr 15, 2013 at 12:30:43AM +0200, Adam Borowski wrote:
> Too bad, I see what seems to be most of the time being spent in dpkg
> installing dependencies -- how could this be avoided? One of ideas would
> be to reformat as btrfs (+eatmydata) and find some kind of a tree of
> packages with similar build-depends, snapshotting nodes of the tree to
> quickly reset to a wanted state -- but I guess you guys have some kind
> of a solution already.

I think snapshoting is a good idea there. Or rather forking the
filesystem. Say you have 2 packages:

Package: A
Build-Depends: X, Y

Package: B
Build-Depends: X, Z

You would start with the bare build chroot and install X. Then you
create snapshots SA and SB from that. In SA you install Y and in SB
you install Z. Now both packages can be built.

BUT:

- Easy with 2 packages. But how do you do that with 30000?
- Y and Z may both depend on W. So initialy we should have
installed X and W.
- Package C may Build-Conflicts: X but depend on most of the stuff X
depends on. So taking the filesystem with X installed and purging X
will be faster than starting from scratch.
- Doing multiple apt/dpkg runs is more expensive than a combined one.
A single run will save startup time and triggers.
- Could we install packages without running triggers and only trigger
them at the end of each chain? Or somewhere in the middle?
- There will be multiple ways to build the tree. We might install U
first and then V or V first and then U. Also we might have to install
V in multiple branches and V can not be installed in a commong root.
Unless we install V in a common root and then uninstall V again for a
subtree. This probably needs a heuristic for how long installing (or
uninstalling) a package takes. Package size will be a major factor
but postinst scripts can take a long time to run (update-texmf anyone?).
- Build chroots, even as snapshots, take space. You can only have so
many of them in parallel. A depth first traversal would be best there.
Building packages against locally build packages (instead of the
existing official ones) gives a better test. But that would require a
more width first ordering. Some compromise between the two would be
needed.

Note:

With multiple cores it is better run multiple builds in parallel,
given enough ram, than trying to build a single package on multiple
cores. Most packages don't support parallel building (even if they
could) and some break if you force the issue. Now if you have multiple
builds that are based around the same snapshot then common header
files and libraries will be cached only once. So cache locality (and
therefor efficiency) should increase.


So I'm looking forward to someone taking up this idea and implementing
an algorithm that will sort sources into a tree structure for
installing and snapshoting / forking the filesystem at each node.
Optimized for reducing the number of dpkg runs, snapshots needed and
installing the same package in multiple branches.

MfG
Goswin


--
To UNSUBSCRIBE, email to debian-curi...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: http://lists.debian.org/20130416092943.GA23900@frosties

Adam Borowski

unread,
Apr 16, 2013, 4:30:02 PM4/16/13
to
On Tue, Apr 16, 2013 at 11:29:43AM +0200, Goswin von Brederlow wrote:
> On Mon, Apr 15, 2013 at 12:30:43AM +0200, Adam Borowski wrote:
> > Too bad, I see what seems to be most of the time being spent in dpkg
> > installing dependencies -- how could this be avoided? One of ideas would
> > be to reformat as btrfs (+eatmydata) and find some kind of a tree of
> > packages with similar build-depends, snapshotting nodes of the tree to
> > quickly reset to a wanted state
>
> I think snapshoting is a good idea there. Or rather forking the
> filesystem. Say you have 2 packages:
>
> Package: A
> Build-Depends: X, Y
>
> Package: B
> Build-Depends: X, Z
>
> You would start with the bare build chroot and install X. Then you
> create snapshots SA and SB from that. In SA you install Y and in SB
> you install Z. Now both packages can be built.

So you would include intermediate states as nodes in the graph as well?
Interesting -- this could indeed optimize cases like that, at the cost of
making the problem a good deal harder algorithmically.

> BUT:
>
> - Easy with 2 packages. But how do you do that with 30000?

You mean, an algorithmical challenge? In our kind of crowd? That's the fun
stuff!

If we reduce the problem by two simplifications:

* can snapshot only before building a package (no intermediate states)
* the cost of purging a package is same as installing it

a solution is to find a minimal spanning tree, possibly with a
constraint on the tree's height. And with the graph not being malicious,
I have a hunch the tree would behave nicely, not requiring too many
snapshots (random graphs tend to produce short trees).

The full problem may take a bit more thinking.

> - Y and Z may both depend on W. So initialy we should have
> installed X and W.
> - Package C may Build-Conflicts: X but depend on most of the stuff X
> depends on. So taking the filesystem with X installed and purging X
> will be faster than starting from scratch.

Ie, edges that purge should have a lesser cost than edges that install.

> - Doing multiple apt/dpkg runs is more expensive than a combined one.
> A single run will save startup time and triggers.

Again, parameters to the edge cost function.

> - Could we install packages without running triggers and only trigger
> them at the end of each chain? Or somewhere in the middle?

Could be worth looking at. Not sure how many triggers can work
incrementally, and how many rebuild everything every time like man-db.
Of course, this is moot if we snapshot only before package build.

> - There will be multiple ways to build the tree. We might install U
> first and then V or V first and then U. Also we might have to install
> V in multiple branches and V can not be installed in a commong root.
> Unless we install V in a common root and then uninstall V again for a
> subtree. This probably needs a heuristic for how long installing (or
> uninstalling) a package takes. Package size will be a major factor
> but postinst scripts can take a long time to run (update-texmf anyone?).

What about something akin to: log(size) + size/X ? For smallish packages,
dpkg churn is the dominant factor, for big ones it's actual I/O and
registering individual files.

> - Build chroots, even as snapshots, take space. You can only have so
> many of them in parallel. A depth first traversal would be best there.

I guess the tree won't be tall enough for disk space to be a concern.

> Building packages against locally build packages (instead of the
> existing official ones) gives a better test. But that would require a
> more width first ordering. Some compromise between the two would be
> needed.

Bootstrapping is a whole new can of worms, and I'm afraid I haven't followed
Wookey and co closely enough.

> With multiple cores it is better run multiple builds in parallel,
> given enough ram, than trying to build a single package on multiple
> cores. Most packages don't support parallel building (even if they
> could) and some break if you force the issue.

And others spend a good deal of the time on a single core, even if they
allow concurrency in a part. Watching libreoffice, it was at 25%
utilization more time than at 100%.

"Given enough ram" is another problem: a vast majority of packages need
quite little, yet some can be pretty hungry. I don't think we have
per-package data about memory needed, do we? In my case, libreoffice
-j4 OOMed near the end.

The box has only 2GB, non-extensible. Someone had a bright idea of shipping
a kernel with CONFIG_SWAP unset; the source for all needed out-of-tree
drivers is there, but I need to figure out why that uboot thingy doesn't
work, without being able to view the screen.

> Now if you have multiple builds that are based around the same snapshot
> then common header files and libraries will be cached only once. So cache
> locality (and therefor efficiency) should increase.

Bad news: both btrfs and lvm (and I'm 99% sure zfs) fail to tell the VM
system that a page is shared on the disk and thus can be shared in the
memory as well. This means, snapshots and even cp --reflink won't utilize
the similarity. This limitation seems quite fundamental on lvm; in btrfs
thanks to rampant layering violations it's fixable but hasn't been fixed
yet. Somewhere among vserver patches there's IUNLINK -- a file attribute
that duplicates hardlinked files when you try to open them for writing,
that's the only cow I know that lets the page cache work.

One could run builds with exactly the same Build-Depends together, but as
they would need to run on the same mount, this doesn't seem to be worth the
hassle:
grep ^Build-Depends […]debian_dists_unstable_main_source_Sources|sort|uniq|wc -l
15564
grep ^Package: […]debian_dists_unstable_main_source_Sources|sort|uniq|wc -l
18555


--
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


--
To UNSUBSCRIBE, email to debian-curi...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: http://lists.debian.org/20130416202...@angband.pl

Goswin von Brederlow

unread,
Apr 18, 2013, 4:30:02 AM4/18/13
to
On Tue, Apr 16, 2013 at 10:22:20PM +0200, Adam Borowski wrote:
> On Tue, Apr 16, 2013 at 11:29:43AM +0200, Goswin von Brederlow wrote:
> > On Mon, Apr 15, 2013 at 12:30:43AM +0200, Adam Borowski wrote:
> > > Too bad, I see what seems to be most of the time being spent in dpkg
> > > installing dependencies -- how could this be avoided? One of ideas would
> > > be to reformat as btrfs (+eatmydata) and find some kind of a tree of
> > > packages with similar build-depends, snapshotting nodes of the tree to
> > > quickly reset to a wanted state
> >
> > I think snapshoting is a good idea there. Or rather forking the
> > filesystem. Say you have 2 packages:
> >
> > Package: A
> > Build-Depends: X, Y
> >
> > Package: B
> > Build-Depends: X, Z
> >
> > You would start with the bare build chroot and install X. Then you
> > create snapshots SA and SB from that. In SA you install Y and in SB
> > you install Z. Now both packages can be built.
>
> So you would include intermediate states as nodes in the graph as well?
> Interesting -- this could indeed optimize cases like that, at the cost of
> making the problem a good deal harder algorithmically.
>
> > BUT:
> >
> > - Easy with 2 packages. But how do you do that with 30000?
>
> You mean, an algorithmical challenge? In our kind of crowd? That's the fun
> stuff!

Anything exponential will not work. And for O(n^c) the c must be
rather small to still find a solution in time. Fun stuff indeed. After
all, if we search 4 weeks for an optimal solution we might as well
just build everything like now and be quicker.

> If we reduce the problem by two simplifications:
>
> * can snapshot only before building a package (no intermediate states)

I don't think that is a good simplification. The chance that for two
packges like A and B there is a third package C that only
Build-Depends on X is rather small. And then you wouldn't get a common
node for A and B.

> * the cost of purging a package is same as installing it

That somewhat fixes what the first one broke. Since now B can be a
child of A by purging Y and installing Z. Still wastefull though.

It makes the graph a lot smaller though.

> a solution is to find a minimal spanning tree, possibly with a
> constraint on the tree's height. And with the graph not being malicious,
> I have a hunch the tree would behave nicely, not requiring too many
> snapshots (random graphs tend to produce short trees).

Luckily minimal spanning tree is well researched. :)

Note: the graph would be directed and should have weights according to
the (estimated) complexity of installing/purging a package.

> The full problem may take a bit more thinking.
>
> > - Y and Z may both depend on W. So initialy we should have
> > installed X and W.
> > - Package C may Build-Conflicts: X but depend on most of the stuff X
> > depends on. So taking the filesystem with X installed and purging X
> > will be faster than starting from scratch.
>
> Ie, edges that purge should have a lesser cost than edges that install.
>
> > - Doing multiple apt/dpkg runs is more expensive than a combined one.
> > A single run will save startup time and triggers.
>
> Again, parameters to the edge cost function.
>
> > - Could we install packages without running triggers and only trigger
> > them at the end of each chain? Or somewhere in the middle?
>
> Could be worth looking at. Not sure how many triggers can work
> incrementally, and how many rebuild everything every time like man-db.
> Of course, this is moot if we snapshot only before package build.

Even if the trigger is incremental we don't loose anything if delay
running the trigger. It will do more work for that later run but not
more than the individual runs put together. On the other hand non
incremental triggers will add up to a lot more.

> > - There will be multiple ways to build the tree. We might install U
> > first and then V or V first and then U. Also we might have to install
> > V in multiple branches and V can not be installed in a commong root.
> > Unless we install V in a common root and then uninstall V again for a
> > subtree. This probably needs a heuristic for how long installing (or
> > uninstalling) a package takes. Package size will be a major factor
> > but postinst scripts can take a long time to run (update-texmf anyone?).
>
> What about something akin to: log(size) + size/X ? For smallish packages,
> dpkg churn is the dominant factor, for big ones it's actual I/O and
> registering individual files.

That would be something to tune experimentally I guess. Possible
factors I can see would be:

- fixed cost for apt/dpkg startup time (or cost relative to number of
installed packages/files, i.e. database size)
- number of dpkg runs needed for a set of packages (multiplier for the first)
- package size in bytes and number of files
- cost for triggers
- cost for preinst/postinst/prerm/postrm scripts

The last two would be harder to get. The rest all comes from the
Packages files.

> > - Build chroots, even as snapshots, take space. You can only have so
> > many of them in parallel. A depth first traversal would be best there.
>
> I guess the tree won't be tall enough for disk space to be a concern.

No, the height won't be a problem. Ignoring package purges even a
height of 100 should be trivial. Each node would only add packages and
a new dpkg DB. So the space needed would be roughly |size of leaf| +
100 * |size of dpkg DB|.

But a snapshot should only be deleted once all its children have been
built. If you build package in order of dependencies (bootstrapping)
then you would have to keep a loot of snapshots around untill you
eventually reach a leaf package and can delete the snapshot again. And
those snapshots will be hugely different.

> > Building packages against locally build packages (instead of the
> > existing official ones) gives a better test. But that would require a
> > more width first ordering. Some compromise between the two would be
> > needed.
>
> Bootstrapping is a whole new can of worms, and I'm afraid I haven't followed
> Wookey and co closely enough.

Yeah, I wouldn't enforce that. Building against existing packages is
fine for an archive wide rebuild. But if a package has already been
build it could be used to test that the package actually works. So I
would prefer a build order where more local packages can used. But
that should be balanced against the number of snapshots (and their
size) required to build everything.

> > With multiple cores it is better run multiple builds in parallel,
> > given enough ram, than trying to build a single package on multiple
> > cores. Most packages don't support parallel building (even if they
> > could) and some break if you force the issue.
>
> And others spend a good deal of the time on a single core, even if they
> allow concurrency in a part. Watching libreoffice, it was at 25%
> utilization more time than at 100%.
>
> "Given enough ram" is another problem: a vast majority of packages need
> quite little, yet some can be pretty hungry. I don't think we have
> per-package data about memory needed, do we? In my case, libreoffice
> -j4 OOMed near the end.

Some builds have an exclude list for HUGE packages because they are
known to swap to death (or take considerably longer than others). So I
guess we could come up with a list of special packages that must not
be build in parallel with others.

> The box has only 2GB, non-extensible. Someone had a bright idea of shipping
> a kernel with CONFIG_SWAP unset; the source for all needed out-of-tree
> drivers is there, but I need to figure out why that uboot thingy doesn't
> work, without being able to view the screen.

But think of a nice ARM big box with 192 cores and 192GB ram. Now that
would be something to do archive wide rebuilds on.

> > Now if you have multiple builds that are based around the same snapshot
> > then common header files and libraries will be cached only once. So cache
> > locality (and therefor efficiency) should increase.
>
> Bad news: both btrfs and lvm (and I'm 99% sure zfs) fail to tell the VM
> system that a page is shared on the disk and thus can be shared in the
> memory as well. This means, snapshots and even cp --reflink won't utilize
> the similarity. This limitation seems quite fundamental on lvm; in btrfs
> thanks to rampant layering violations it's fixable but hasn't been fixed
> yet. Somewhere among vserver patches there's IUNLINK -- a file attribute
> that duplicates hardlinked files when you try to open them for writing,
> that's the only cow I know that lets the page cache work.

Ugh? I would think that the page would be in the block cache,
addressed by device and block offset. In the case of lvm I can see how
you would get the same block twice since you have different devices.
But why btrfs?

> One could run builds with exactly the same Build-Depends together, but as
> they would need to run on the same mount, this doesn't seem to be worth the
> hassle:
> grep ^Build-Depends [???]debian_dists_unstable_main_source_Sources|sort|uniq|wc -l
> 15564
> grep ^Package: [???]debian_dists_unstable_main_source_Sources|sort|uniq|wc -l
> 18555

Looks likes 1/6th of all packages. I wouldn't call that ignorable.

MfG
Goswin


--
To UNSUBSCRIBE, email to debian-curi...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: http://lists.debian.org/20130418082108.GA24658@frosties
0 new messages