New article: "Reproducible builds for Debian: a big step forward" by Frédéric Pierret

Andrew David Wong

Oct 8, 2021, 5:50:29 PM
to qubes-devel, qubes-users
Dear Qubes Community,

A new article has just been published on the Qubes website:

"Reproducible builds for Debian: a big step forward" by Frédéric Pierret
https://www.qubes-os.org/news/2021/10/08/reproducible-builds-for-debian-a-big-step-forward/

For your convenience, the original Markdown text is reproduced below.

========================================================================

---
layout: post
title: "Reproducible builds for Debian: a big step forward"
categories: articles
author: Frédéric Pierret
---

_This is the second article in the "reproducible builds" series.
Previously: [Improvements in testing and building: GitLab CI and
reproducible
builds](https://www.qubes-os.org/news/2021/02/28/improvements-in-testing-and-building/)._

In the previous article, [Improvements in testing and building: GitLab
CI and reproducible
builds](https://www.qubes-os.org/news/2021/02/28/improvements-in-testing-and-building/#reproducible-builds),
we discussed reproducible builds and our current short-term goals for
them in Qubes OS. Notably, we aimed to start by building our Debian
templates such that packages can be installed only when configured
rebuilders confirm that they really came from the source code we
publish. Today, we go beyond this expectation.

Reproducible builds: retrieve the past
--------------------------------------

The challenge in reproducible builds lies in rebuilding a package in the
same environment in which the officially published version was built.
This means that we need to retrieve the exact version of every package
that was used as a dependency when building a given package. For Debian,
some packages in the current release were built several releases ago,
and not necessarily with the dependency versions available today. In
order to retrieve them,
there is only one solution: a Debian service called
`snapshot.debian.org`, which is an archive acting as a [Wayback
Machine](https://web.archive.org/) that allows access to old packages
based on dates and version numbers. It contains all past and present
packages that the Debian archive provides. Unfortunately, this service
is known to suffer from significant usability issues, some of them
blocking. For
example, watch the DebConf 2021 talk [Making use of snapshot.debian.org
for fun and
profit](https://debconf21.debconf.org/talks/22-making-use-of-snapshotdebianorg-for-fun-and-profit/)
and have a look at some related Debian issues like
[#977653](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=977653),
[#960304](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=960304),
[#969906](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=969906),
[#969603](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=969603),
and
[#782857](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=782857). To
summarize: there are throttling limits and availability issues, such as
connections being cut off repeatedly and responses returning only
partial content.

As announced in our previous article, we developed our own rebuilder
tool, [debrebuild](https://github.com/fepitre/debrebuild), which is able
to rebuild a single Debian package, together with a rebuilder
orchestrator,
[PackageRebuilder](https://github.com/fepitre/package-rebuilder). We
started to put it in production in order to actively rebuild Qubes OS
and Debian packages, but it quickly ceased to function, as the
`snapshot.debian.org` service was unable to sustain the load of
rebuilding even a single Debian package. The question then was: how
should we proceed in order to make it work? Clearly, those issues are
critical and make the `snapshot.debian.org` service impractical, if not
unusable, for reproducible builds.
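
Dropped connections and partial responses of the kind described above
are usually handled on the client side by resuming from the last byte
received and retrying. The following is a minimal sketch of that general
technique, not code taken from `debrebuild` or from any Debian tool. The
names `make_http_fetcher` and `download_with_resume`, the failure budget,
and the delay are assumptions made for illustration. It relies on the
standard HTTP `Range` request header, which a server may or may not
honor.

```python
import http.client
import time
import urllib.request


def make_http_fetcher(url, timeout=30):
    """Return a function that fetches `url` starting at a byte offset.

    Hypothetical helper: it requests the rest of the file with an HTTP
    Range header and rejects responses that ignore the requested range.
    """
    def fetch_from(offset):
        request = urllib.request.Request(
            url, headers={"Range": f"bytes={offset}-"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            if offset > 0 and response.status != 206:
                raise OSError("server did not honor the Range header")
            return response.read()
    return fetch_from


def download_with_resume(fetch_from, expected_size, max_failures=5, delay=1.0):
    """Collect `expected_size` bytes, resuming after every interruption.

    `fetch_from(offset)` may return fewer bytes than remain, or raise.
    In both cases the next attempt starts at the first missing byte.
    """
    data = bytearray()
    failures = 0
    while len(data) < expected_size:
        try:
            chunk = fetch_from(len(data))
        except (OSError, http.client.HTTPException):
            chunk = b""
        if chunk:
            data.extend(chunk)
            failures = 0  # any progress resets the failure counter
            continue
        failures += 1
        if failures >= max_failures:
            raise RuntimeError(f"gave up after {len(data)} bytes")
        time.sleep(delay)
    return bytes(data[:expected_size])
```

Passing the fetch function in as a parameter keeps the retry logic
independent of the transport, so it can be exercised with a fake, flaky
fetcher before being pointed at a real archive via `make_http_fetcher`.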

Is rebuilding Debian really possible?
-------------------------------------

The `snapshot.debian.org` issues have still not been addressed, even
after several years: the service has existed for more than a decade, yet
it still suffers from the aforementioned limitations. Whether this is a
design problem or a lack of resources, we had to do something.

That's why we decided to create our own
[snapshot](https://github.com/fepitre/debian-snapshot) service. This is
easier said than done. First, the original snapshot service from Debian
holds roughly 90 TB of repository data. Second, files can only be
downloaded over HTTP(S), so downloading many of them runs straight into
the availability issues described above.

In order to work around the huge volume of data, we decided to get
repositories from 2017 to today (which corresponds approximately to when
development of Debian "Buster" began) and only the relevant
architectures: `amd64`, `source`, and `all`. (`all` indicates no specific
architecture in the Debian world.)

For the download part itself, we needed to parse the metadata of each
Debian repository in order to get the list of files to download for
every timestamp for which a snapshot had been made. Then, we developed
`resume` and `retry` download functions, which unfortunately are
brute-force download functions.

For storing the data, a simple approach has been employed: each file is
stored under a name equal to its SHA-256 hash, and symlinks to it
reconstruct the repository layout. In order to get file information
(package and repository metadata), we rely on simply reading a symlink.
It took 3-4 months to get 4.2 TB of data, which covers 2017 to the
present. Most of the information about the downloaded files and their
source repository is stored in a database.

In parallel, we added an API,
[snapshot-api](https://github.com/fepitre/debian-snapshot#API), like the
original `snapshot.debian.org`, to expose information about
repositories. Unlike the original, ours provides much more of the
information that rebuilder software, e.g. `debrebuild`, needs when
requesting package information, such as the exact location of a given
package in terms of Debian archive, timestamp, suite, architecture, and
component. The service is now publicly exposed at
<https://snapshot.notset.fr> and the API endpoints at
<https://snapshot.notset.fr/mr>. The service is home-hosted by the author.
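
The storage approach described above (files named by their SHA-256 hash,
with symlinks reconstructing the repository layout) can be sketched as
follows. This is an illustration under stated assumptions, not the actual
code of the snapshot service: the function names, the directory
arguments, and the reading that "file information" is recovered from the
symlink target are all ours.

```python
import hashlib
import os
from pathlib import Path


def store_file(blob_dir, repo_dir, repo_path, content):
    """Store `content` once under its SHA-256 name and link it into a repo.

    `blob_dir` holds one file per distinct content; `repo_path` is the
    path the file has inside a given repository snapshot (hypothetical
    layout chosen by the caller).
    """
    digest = hashlib.sha256(content).hexdigest()
    blob = Path(blob_dir) / digest
    if not blob.exists():
        blob.parent.mkdir(parents=True, exist_ok=True)
        blob.write_bytes(content)

    link = Path(repo_dir) / repo_path
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()  # replace a stale link from an earlier run
    link.symlink_to(blob.resolve())
    return link


def digest_from_link(link):
    """Recover the SHA-256 of a repository file by reading its symlink.

    No file content is read or hashed: the digest is the target's name.
    """
    return Path(os.readlink(link)).name
```

With this layout, a package that is unchanged across many snapshot
timestamps occupies disk space only once, while each timestamped
repository tree still looks complete to a client following the symlinks.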

This is exactly where the dream of **rebuilding Debian packages** in the
same environment in which they were officially published became a
**reality**. Thanks to our standalone orchestrator and rebuilder
software `debrebuild`, results of the rebuilding process, links to
reproducibility attestations called [in-toto
metadata](https://in-toto.io/), and even the reasons why a package is
not reproducible can all be found at <https://rebuild.notset.fr>.

As of this writing, we have successfully rebuilt more than 80% of the
latest Debian packages for the `unstable` release while doing tests.
Since it started, several adjustments have been made, and we have
finally reached a stable rebuilding process. That is why, after a few
late improvements during this nearly complete first full rebuild, we
discarded the results and started again for the latest Debian stable
release, Bullseye. We will rebuild `unstable` again after the full
rebuild of Bullseye is complete. A couple thousand package rebuilds
remain, so the number of pending tasks will keep shrinking over time.
Please note that, in addition to the initial package build, the process
of rebuilding a package means querying the `snapshot.notset.fr` API
multiple times to get package information and location, setting up the
same environment as the one originally used, and finally, actually
building the package. All of this is possible thanks to several servers,
home-hosted by the author, that have been building packages intensively
and non-stop for more than a month.
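
Setting up the same environment starts with knowing which exact package
versions were installed during the original build. For Debian, that list
is recorded in the `Installed-Build-Depends` field of a package's
`.buildinfo` file. The sketch below extracts it as (package, version)
pairs, which are the inputs a rebuilder would then look up in a snapshot
service. The function name and the sample file content are illustrative
assumptions, and the parser handles only this one field rather than the
full control-file syntax.

```python
import re

# Illustrative sample only: a shortened .buildinfo with made-up content.
SAMPLE_BUILDINFO = """\
Format: 1.0
Source: hello
Binary: hello
Architecture: amd64
Version: 2.10-2
Build-Architecture: amd64
Installed-Build-Depends:
 autoconf (= 2.69-14),
 base-files (= 11.1),
 gcc-10 (= 10.2.1-6)
Environment:
 DEB_BUILD_OPTIONS="parallel=4"
"""

# Matches entries such as " autoconf (= 2.69-14),"
PINNED_DEPENDENCY = re.compile(r"\s*(\S+)\s+\(=\s*(\S+)\)")


def installed_build_depends(buildinfo_text):
    """Return [(package, version), ...] from Installed-Build-Depends."""
    pins = []
    in_field = False
    for line in buildinfo_text.splitlines():
        if line.startswith("Installed-Build-Depends:"):
            in_field = True
            line = line.split(":", 1)[1]  # keep anything after the name
        elif in_field and not line.startswith((" ", "\t")):
            break  # an unindented line starts the next field
        if in_field:
            match = PINNED_DEPENDENCY.match(line)
            if match:
                pins.append((match.group(1), match.group(2)))
    return pins
```

Each resulting pair identifies one package version that has to be
located at some archive timestamp before the build environment can be
recreated.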

What's next?
------------

For Qubes OS, we already track reproducibility status in our continuous
integration (CI) tests (see the [previous
article](https://www.qubes-os.org/news/2021/02/28/improvements-in-testing-and-building/)
for details), and our packages are also rebuilt independently, like
Debian packages, in the same PackageRebuilder instance. We already have
most of the reproducibility attestations for our specific Debian
packages (see <https://rebuild.notset.fr/qubesos.html>), and we will
soon have all the needed ones for Debian. As a result, we are happy to
announce that we have already started the process of integrating the
rebuild status check both at the build phase of our Debian templates and
when later installing a package in the template itself. That's the
reason we restarted the whole process of a full rebuild for Bullseye.
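
The idea behind such a check, accepting a package only when configured
rebuilders confirm it, can be sketched as a simple threshold rule. This
is not the Qubes OS implementation: the function name, the shape of the
attestation records, and the threshold parameter are hypothetical, and
the sketch skips the signature verification that real in-toto metadata
requires before any of its contents can be trusted.

```python
def confirmed_by_rebuilders(package_sha256, attestations,
                            trusted_rebuilders, threshold=1):
    """Return True if enough trusted rebuilders reproduced this package.

    `attestations` is a list of dicts with hypothetical keys:
    "rebuilder" (an identifier) and "sha256" (hash of the rebuilt file).
    A rebuilder counts once, however many records it has published.
    """
    confirming = {
        entry["rebuilder"]
        for entry in attestations
        if entry["rebuilder"] in trusted_rebuilders
        and entry["sha256"] == package_sha256
    }
    return len(confirming) >= threshold
```

Counting distinct rebuilders rather than records means that one
rebuilder publishing the same result twice cannot satisfy a threshold of
two on its own.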

There is preliminary work for integrating Fedora into the orchestrator,
but that deserves a separate effort. The rebuilder
[rpmreproduce](https://github.com/fepitre/rpmreproduce) can be used to
rebuild Fedora packages, but some discussions with RPM upstream are
still needed (see
<https://github.com/rpm-software-management/rpm/pull/1532>). Also, we
plan to support input other than a `buildinfo` file for RPM, such as a
build description from Koji (the build infrastructure used by Fedora
and CentOS) or any other description that would make it clear how an RPM
package was built. We also plan to add other distributions, which should
be fairly quick and easy. One example is Arch Linux, which we are going
to ship officially soon.

Conclusion
----------

Improved documentation for the orchestrator is in progress to make it
easier for others who want to rebuild Qubes OS or Debian in the same way
that we are currently doing it. Having more independent rebuilders
publishing reproducibility attestations would be especially good for the
community.

In all of these efforts, we are really satisfied that the [Reproducible
Builds Project](https://reproducible-builds.org/) has decided to use our
work and results as an example of what it has been advocating for years,
notably for Debian. The official website
<https://beta.tests.reproducible-builds.org> currently mirrors our
results website <https://rebuild.notset.fr>.

_The author warmly thanks Marta Marczykowska-Górecka and Marek
Marczykowski-Górecki for their moral support and technical discussions
throughout this rough and intensive journey while juggling other projects._
