Re: [RFC] initoverlayfs - a scalable initial filesystem

Demi Marie Obenour

Dec 11, 2023, 11:28:47 AM

to Lennart Poettering, Eric Curtin, Yariv Rachmani, init...@vger.kernel.org, system...@lists.freedesktop.org, Stephen Smoogen, Douglas Landgraf, Qubes OS Development Mailing List

On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> On Fr, 08.12.23 17:59, Eric Curtin (ecu...@redhat.com) wrote:
>
> > Here is the boot sequence with initoverlayfs integrated, the
> > mini-initramfs contains just enough to get storage drivers loaded and
> > storage devices initialized. storage-init is a process that is not
> > designed to replace init, it does just enough to initialize storage
> > (performs a targeted udev trigger on storage), switches to
> > initoverlayfs as root and then executes init.
> >
> > ```
> > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> >
> > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > ```
>
> I am not sure I follow what these chains are supposed to mean? Why are
> there two lines?
>
> So, I generally would agree that the current initrd scheme is not
> ideal, and we have been discussing better approaches. But I am not
> sure your approach really is useful on generic systems for two
> reasons:
>
> 1. no security model? you need to authenticate your initrd in
> 2023. There's no excuse for not doing that anymore these days. Not
> in automotive, and not anywhere else really.
>
> 2. no way to deal with complex storage? i.e. people use FDE, want to
> unlock their root disks with TPM2 and similar things. People use
> RAID, LVM, and all that mess.
>
> Actually the above are kinda the same problem in a way: you need
> complex storage, but if you need that you kinda need udev, and
> services, and then also systemd and all that other stuff, and that's
> why the system works like the system works right now.
>
> Whenever you devise a system like yours by cutting corners, and
> declaring that you don't want TPM, you don't want signed initrds, you
> don't want to support weird storage, you just solve your problem in a
> very specific way, ignoring the big picture. Which is OK, *if* you can
> actually really work without all that and are willing to maintain the
> solution for your specific problem only.
>
> As I understand you are trying to solve multiple problems at once
> here, and I think one should start with figuring out clearly what
> those are before trying to address them, maybe without compromising on
> security. So my guess is you want to address the following:
>
> 1. You don't want the whole big initrd to be read off disk on every
> boot, but only the parts of it that are actually needed.
>
> 2. You don't want the whole big initrd to be fully decompressed on every
> boot, but only the parts of it that are actually needed.
>
> 3. You want to share data between root fs and initrd
>
> 4. You want to save some boot time by not bringing up an init system
> in the initrd once, then tearing it down again, and starting it
> again from the root fs.
>
> For the items listed above I think you can find different solutions
> which do not necessarily compromise security as much.
>
> So, in the list above you could address the latter three like this:
>
> 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> loader load the erofs into contiguous memory, then use memmap=X!Y on
> the kernel cmdline to synthesize a block device from that, which
> you then mount directly (without any initrd) via
> root=/dev/pmem0. This means your boot loader will still load the
> whole image into memory, but only decompress the bits actually
> needed. (It also has some other nice benefits I like, such as an
> immutable rootfs, which tmpfs-based initrds don't have.)
>
> 3. Simply never transition to the root fs, don't mark the initrd in
> systemd's eyes as an initrd (specifically: don't add an
> /etc/initrd-release file to it). Instead, just merge resources of
> the root fs into your initrd fs via overlayfs. systemd has
> infrastructure for this: "systemd-sysext". It takes immutable,
> authenticated erofs images (with verity, we call them "DDIs",
> i.e. "discoverable disk images") that it overlays into /usr/. [You
> could also very nicely combine this approach with systemd's
> portable services, and nspawn containers, which operate on the same
> authenticated images]. At MSFT we have a major product that works
> exactly like this: the OS runs off a rootfs that is loaded as an
> initrd, and everything that runs on top of this are just these
> verity disk images, using overlayfs and portable services.
>
> 4. The proposal in 3 also addresses goal 4.
>
> Which leaves item 1, which is a bit harder to address. We have been
> discussing this off and on internally too. A generic solution to this
> is hard. My current thinking for this could be something like this,
> covering the UEFI world: support sticking a DDI for the main initrd in
> the ESP. The ESP is per definition unencrypted and unauthenticated,
> but otherwise relatively well defined, i.e. known to be vfat and
> discoverable via UUID on a GPT disk. So: build a minimal
> single-process initrd into the kernel (i.e. UKI) that has exactly the
> storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> jump into the rootfs stored in the ESP. That latter then has proper
> file system drivers, storage drivers, crypto stack, and can unlock the
> real root. This would still be a pretty specific solution to one set
> of devices though, as it could not cover network boots (i.e. where
> there is just no ESP to boot from), but I think this could be kept
> relatively close, as the logic in that case could just fall back into
> loading the DDI that normally would sit in the ESP fully into
> memory.

I don't think this is "a pretty specific solution to one set of devices"
_at all_. To the contrary, it is _exactly_ what I want to see desktop
systems moving to in the future.

It solves the problem of large firmware images. It solves the problem
of device-specific configuration, because one can use a file on the EFI
system partition that is read by userspace and either treated as
untrusted or TPM-signed. It means that one has a complete set of
recovery tools in the event of a problem, rather than being limited to
whatever one can squeeze into an initramfs. One can even include a full
GUI stack (with accessibility support!), rather than just Plymouth. For
Qubes OS, one can include enough of the Xen and Qubes toolstack to even
launch virtual machines, allowing the use of USB devices and networking
for recovery purposes. It even means that one can use a FIDO2 token to
unlock the hard drive without a USB stack on the host. And because the
initramfs _only_ needs to load the boot extension volume, it can be
very, _very_ small, which works great with using Linux as a coreboot
payload.

The only problem I can see that this does not solve is network boot, but
that is very much a niche use case when compared to the millions of
Fedora or Debian desktop installs, or even the tens of thousands of
Qubes OS installs. Furthermore, I would _much_ rather network boot be
handled by userspace and kexec, rather than the closed source UEFI network
stack.

It does require some care when upgrading, as the dm-verity image and the
UKI cannot both be updated atomically, but one can solve that by first
writing the new dm-verity image to a separate location. The UKI will
try both the old and new locations for the dm-verity image and
rename the new image over the old one on success. The wrong image will
simply fail to mount as its root hash will be wrong.
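
The upgrade scheme described above can be sketched roughly as follows. This is only an illustration under loud assumptions: a SHA-256 digest of the whole file stands in for real dm-verity root hash verification, and the names `image_digest` and `select_image` are hypothetical, not part of any existing tool.

```python
import hashlib
import os
import tempfile


def image_digest(path):
    """Stand-in for dm-verity verification: SHA-256 over the whole file (assumption)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def select_image(expected_digest, current_path, staged_path):
    """Pick the image whose digest matches the one baked into the running UKI.

    If the staged (new) image matches, it is renamed over the current one.
    Returns the path to use, or None if neither candidate matches.
    """
    # A new UKI carries the new digest, so the staged image matches and is promoted.
    if os.path.exists(staged_path) and image_digest(staged_path) == expected_digest:
        os.replace(staged_path, current_path)
        return current_path
    # An old UKI carries the old digest, so only the image already in place matches.
    if os.path.exists(current_path) and image_digest(current_path) == expected_digest:
        return current_path
    # The wrong image never matches, mirroring a failed mount on a bad root hash.
    return None


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    current = os.path.join(workdir, "boot.img")
    staged = os.path.join(workdir, "boot.img.new")
    with open(current, "wb") as f:
        f.write(b"old image")
    with open(staged, "wb") as f:
        f.write(b"new image")
    new_digest = hashlib.sha256(b"new image").hexdigest()
    print(select_image(new_digest, current, staged))
```

`os.replace` is atomic on POSIX file systems; a vfat ESP may give weaker guarantees, so a real implementation would need to handle an interrupted rename.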

This even allows Apple-esque boot policies to be implemented on
commodity hardware, provided that the system firmware is sufficiently
hardened. It won't be as good as what Apple does, but it will be a huge
win from what is possible today.

> (If you are focussing on systems lacking UEFI, then replace the word
> "ESP" in the above with a similar concept, i.e. a well discoverable,
> unauthenticated relatively simple file system, such as vfat).
>
> Anyway, I can't tell you how to solve your specific problems, but if
> there's one thing I'd suggest you to keep in mind then it's the
> security angle, i.e. keep in mind from the beginning how
> authentication of every component of your process shall work, how
> unattended disk encryption shall operate and how measurement shall
> work. Security must be built into things from the beginning, not be
> added as an afterthought.

As a Qubes OS developer and a security researcher, thank you.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

Demi Marie Obenour

Dec 11, 2023, 12:46:19 PM

to Eric Curtin, Lennart Poettering, Yariv Rachmani, init...@vger.kernel.org, system...@lists.freedesktop.org, Stephen Smoogen, Douglas Landgraf, Qubes OS Development Mailing List

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On Mon, Dec 11, 2023 at 05:03:13PM +0000, Eric Curtin wrote:

> plymouth is very interesting in that it has its own graphics stack, event loop
> implementations, etc. A lot of the initrd software is like this. plymouth is
> one of the examples I think of in my head of something that could benefit from
> being able to use more generic things. At least it's an easy example to explain
> to people.

Indeed so. There is still the concern of startup time, which
GPU-accelerated programs in particular are often not great at.

> > Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> > launch virtual machines, allowing the use of USB devices and networking
> > for recovery purposes. It even means that one can use a FIDO2 token to
> > unlock the hard drive without a USB stack on the host. And because the
> > initramfs _only_ needs to load the boot extension volume, it can be
> > very, _very_ small, which works great with using Linux as a coreboot
> > payload.
> >
> > The only problem I can see that this does not solve is network boot, but
> > that is very much a niche use case when compared to the millions of
> > Fedora or Debian desktop installs, or even the tens of thousands of
> > Qubes OS installs. Furthermore, I would _much_ rather network boot be
> > handled by userspace and kexec, rather than the closed source UEFI network
> > stack.
>

> A generic approach is hard, I think it's worth discussing which type of boots
> you should actually care about milliseconds of performance for. It would be nice
> if we had an init system that contained the binary data to do the minimum for
> standard Fedora, Debian installs and everything else was an extension whether
> that's sysexts, dlopen, a new binary to execute etc.
>
> If the network is ingrained in your boot stack like this, I'm guessing you
> probably don't care about boot performance. Should we come up with a new
> technique?
>
> Automotive has an expectation for really fast boots, like 2 seconds. In standard
> desktop installs there's some expectation as you interface directly with a
> human, but for other installs how much expectation is there?
>
> Or can we just fall back to existing techniques for installs like network boot?

I wouldn't say that people doing network boot don't care about boot
performance, mostly because I have been on the other side of similar
arguments before [1]. However, I don't think this technique needs to
support network boot.
- --

Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[1]: Qubes OS doesn't expose GPU acceleration to VMs. This is not
because the developers don't care about graphics performance, but
because GPUs and especially their driver stacks have a very large
attack surface. Work is being done to address this, but even once
Qubes OS does support GPU acceleration, it will need to be off by
default, at least initially.
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEdodNnxM2uiJZBxxxsoi1X/+cIsEFAmV3SuMACgkQsoi1X/+c
IsHH6RAAhMQl/nw2jdZ4tlwxX/zqib3Tfzdo1p9a5VOkSobrvV7qbG0DWVrqe+vH
NKU1xy6FGqPexKjLoGlxWXgPN5rQKvkFXSgRaRefcqGn190WRjqexF0euu26GYTx
AfOEWC1hywoyXUR2LMygEMpodA0ZvZffIZcovmjjr4OeXiSc5aAUrHQ2PabHZaET
BL4jfeNikjw6sA2UdpviMRzb1OVEGZDD96XDSbVz/8tOBcZZNePz+FQXnHqTpcLk
DrBtx4l5noeUYingzxmw4MQZYYPr3kC4+DQtQr7zxv8D0UE9g8lIcpektqMvgoON
88FwVOa4TgTij7vG2f4BGCrZjE7PiPPo5BRb+MtjlZMtrhwdI4IwXY8q4EANWUnw
8nM+952nffVVQjpBtKRsXPZ3glAjvUuqHT8GzfWYYu8y8Dar9c3U4aQSTCJspkz3
jBsPAatFSjdBvlE6OtmyYco92K3A9g6WXzkw5t+/yaljBOddEkxEAw8+Lo1dCqrn
zK+vSFhcGpYodsHFQY0w9kAZ2+6HBX2nZaEmD6ka3furRussm7D4Z36lx1D/pi68
BL4aAFFLaEQ0jD8jqtjVZ2JYpUQufzwrnsNPTZ97WTEKd2F/zM/S09WjFsaOfVIO
F95Eqk0YMHP+krDEcXvm34EZ3PeRGlVm1fz4ttjw8XEekwwB5QU=
=HR07
-----END PGP SIGNATURE-----

Lennart Poettering

Dec 12, 2023, 12:50:10 PM

to Demi Marie Obenour, Eric Curtin, init...@vger.kernel.org, system...@lists.freedesktop.org, Stephen Smoogen, Qubes OS Development Mailing List, Yariv Rachmani, Douglas Landgraf

On Mo, 11.12.23 11:28, Demi Marie Obenour (de...@invisiblethingslab.com) wrote:

> I don't think this is "a pretty specific solution to one set of devices"
> _at all_. To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images. It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed. It means that one has a complete set of
> recovery tools in the event of a problem, rather than being limited to
> whatever one can squeeze into an initramfs. One can even include a full
> GUI stack (with accessibility support!), rather than just Plymouth. For
> Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> launch virtual machines, allowing the use of USB devices and networking
> for recovery purposes. It even means that one can use a FIDO2 token to
> unlock the hard drive without a USB stack on the host. And because the
> initramfs _only_ needs to load the boot extension volume, it can be
> very, _very_ small, which works great with using Linux as a coreboot
> payload.

systemd's "system extension" concept ("sysexts") already allows you to
do all that. The stuff I was fantasizing about would only change one
thing: instead of sd-stub from uefi mode already putting the sysexts
you installed into memory for the initrd to consume, it would be some
proto-initrd that would do so. This does not really change what you
can do with this, but mostly is just an optimization, reducing iops
and memory use a bit, and thus boot time latency.

> The only problem I can see that this does not solve is network boot, but
> that is very much a niche use case when compared to the millions of
> Fedora or Debian desktop installs, or even the tens of thousands of
> Qubes OS installs. Furthermore, I would _much_ rather network boot be
> handled by userspace and kexec, rather than the closed source UEFI network
> stack.

Well, somebody's niche is somebody else's common case. In VM/cloud/server
scenarios network booting is not that "niche" as it might be on the desktop.

> It does require some care when upgrading, as the dm-verity image and the
> UKI cannot both be updated atomically, but one can solve that by first
> writing the new dm-verity image to a separate location. The UKI will
> try both the old and new locations for the dm-verity image and
> rename the new image over the old one on success. The wrong image will
> simply fail to mount as its root hash will be wrong.

systemd-sysext already covers this just fine: you can encode in their
"extension-release" file which base images they match up with, and
systemd-sysext will then find the right one to apply, and ignore the
others. Thus just make sure you drop in the sysexts first, and the UKI
last and things should be perfectly robust.
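
As a rough illustration of the matching described above, the sketch below compares os-release style fields from a host against an extension-release file. `ID`, `VERSION_ID` and `SYSEXT_LEVEL` are real os-release field names, but the exact precedence used here is an assumption that simplifies what systemd-sysext does, and `parse_release`, `extension_matches` and `pick_extensions` are hypothetical helpers.

```python
def parse_release(text):
    """Parse os-release style KEY=VALUE lines into a dict, stripping quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip().strip('"').strip("'")
    return fields


def extension_matches(host, ext):
    """Simplified check of whether an extension is meant for this base image."""
    # Assumption: ID=_any marks an extension that applies to any host.
    if ext.get("ID") == "_any":
        return True
    if ext.get("ID") != host.get("ID"):
        return False
    # Assumption: SYSEXT_LEVEL is compared when both sides declare it.
    if host.get("SYSEXT_LEVEL") and ext.get("SYSEXT_LEVEL"):
        return host["SYSEXT_LEVEL"] == ext["SYSEXT_LEVEL"]
    # Otherwise fall back to comparing VERSION_ID when the host has one.
    if host.get("VERSION_ID"):
        return host["VERSION_ID"] == ext.get("VERSION_ID")
    return True


def pick_extensions(host, candidates):
    """Return the names of the candidate extensions that match the host."""
    return [name for name, ext in candidates.items() if extension_matches(host, ext)]


if __name__ == "__main__":
    host = parse_release('ID=fedora\nVERSION_ID="39"\n')
    candidates = {
        "tools-old": parse_release("ID=fedora\nVERSION_ID=38\n"),
        "tools-new": parse_release("ID=fedora\nVERSION_ID=39\n"),
    }
    print(pick_extensions(host, candidates))
```

With both the old and the new extension present on disk, only the one matching the booted base image is selected, which is why the order "sysexts first, UKI last" stays robust.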

Lennart

--
Lennart Poettering, Berlin

Lennart Poettering

Dec 12, 2023, 1:00:37 PM

to Eric Curtin, Demi Marie Obenour, Yariv Rachmani, init...@vger.kernel.org, system...@lists.freedesktop.org, Stephen Smoogen, Douglas Landgraf, Qubes OS Development Mailing List

On Mo, 11.12.23 17:03, Eric Curtin (ecu...@redhat.com) wrote:

> A generic approach is hard, I think it's worth discussing which type of boots
> you should actually care about milliseconds of performance for. It would be nice
> if we had an init system that contained the binary data to do the minimum for
> standard Fedora, Debian installs and everything else was an extension whether
> that's sysexts, dlopen, a new binary to execute etc.
>
> If the network is ingrained in your boot stack like this, I'm
> guessing you probably don't care about boot performance.

Uh, I am not sure that's really true. People boot up VMs on demand,
based on network traffic. They sure care about latency and boot
times. I mean people care about firecracker and these things precisely
because it brings the off-to-IP time to a minimum.

> Automotive has an expectation for really fast boots, like 2 seconds. In standard
> desktop installs there's some expectation as you interface directly with a
> human, but for other installs how much expectation is there?

AFAIR in particular in cars there's quite some functionality you
probably want to move very early in boot. Which yells to me that you
want a service manager super early. Which again suggests to me that
the first initrd that runs should probably already cover that.

If I were you I'd probably focus on a design like this: ship a basic
systemd in an initrd. Complete enough to find the harddisk, and to run
the other services that are absolutely necessary this early. Then,
once you found the disk, look for sysext images on it, and apply them
all on top of the initrd's root fs you are already running with. Never
transition anywhere else.

Then try to optimize the initrd a bit by making it an erofs/memmap
thing and so on. And make sure the initrd only contains stuff you
always need, so that reading it all into memory is necessary anyway,
and hence any approach that tries to run even the initrd off a disk
image won't be necessary because you need to read everything anyway.
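
The erofs/memmap idea relies on the memmap=X!Y and root=/dev/pmem0 kernel command line options mentioned earlier in the thread. A small helper for composing such a command line might look like the sketch below. The 2 MiB alignment, the extra `rootfstype=erofs ro` options and the function names are assumptions made for illustration; a real boot loader would have to pick a load address that is actually free.

```python
def format_size(n):
    """Render a byte count with the largest K/M/G suffix that divides it exactly."""
    for suffix, unit in (("G", 1 << 30), ("M", 1 << 20), ("K", 1 << 10)):
        if n >= unit and n % unit == 0:
            return f"{n // unit}{suffix}"
    return str(n)


def pmem_cmdline(image_size, load_addr, align=2 * 1024 * 1024):
    """Build kernel options that expose a preloaded image as /dev/pmem0.

    image_size: size of the erofs image in bytes.
    load_addr:  physical address the boot loader placed the image at.
    align:      assumed alignment for the reserved region (hypothetical default).
    """
    if load_addr % align:
        raise ValueError("load address must be aligned")
    # Round the reserved region up to a whole number of aligned chunks.
    chunks = (image_size + align - 1) // align
    size = chunks * align
    return (
        f"memmap={format_size(size)}!{format_size(load_addr)} "
        "root=/dev/pmem0 rootfstype=erofs ro"
    )


if __name__ == "__main__":
    # A 100 MiB image loaded at the 4 GiB mark.
    print(pmem_cmdline(100 * 1024 * 1024, 4 * 1024 ** 3))
```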

Lennart Poettering

Dec 12, 2023, 4:02:40 PM

to Nils Kattenbeck, Eric Curtin, init...@vger.kernel.org, system...@lists.freedesktop.org, Stephen Smoogen, Qubes OS Development Mailing List, Yariv Rachmani, Douglas Landgraf

On Di, 12.12.23 21:34, Nils Kattenbeck (nilsk...@gmail.com) wrote:

> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem
> suggestion?

If you have 7 cpio initrds then the kernel will allocate a tmpfs and
unpack them all into it, one after the other, on top of each other,
and then jump into the result.

If you have an erofs and 7 cpio initrds, what are you going to do? You
cannot extract into an erofs, it's immutable. You'd need something
like overlayfs, but that would require (at least for now) an
additional step in userspace, which is something to avoid.
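
A toy model of the two situations above, using plain dictionaries as file systems: cpio archives are unpacked on top of each other into a writable tmpfs, while an immutable erofs cannot be unpacked into and needs an overlay instead. All names here (`unpack_cpios`, `overlay`) are hypothetical, and the model ignores everything a real kernel does beyond "later writes win".

```python
from collections import ChainMap
from types import MappingProxyType


def unpack_cpios(archives, target=None):
    """Unpack archives in order into target; later archives overwrite earlier files."""
    rootfs = {} if target is None else target
    for archive in archives:
        for path, content in archive.items():
            rootfs[path] = content
    return rootfs


def overlay(lower, upper):
    """Model an overlay: lookups hit the writable upper layer first, then the lower."""
    return ChainMap(upper, lower)


if __name__ == "__main__":
    base = {"/init": "base init", "/etc/conf": "defaults"}
    creds = {"/init": "patched init", "/run/credentials": "secret"}

    # Plain cpio case: everything lands in one tmpfs, in order.
    print(unpack_cpios([base, creds]))

    # erofs case: the image is read-only, so unpacking into it fails.
    erofs = MappingProxyType({"/usr/bin/sh": "shell"})
    try:
        unpack_cpios([creds], target=erofs)
    except TypeError as err:
        print("cannot extract into erofs:", err)

    # Overlay case: cpios go to a tmpfs that is stacked on the erofs.
    merged = overlay(erofs, unpack_cpios([creds]))
    print(merged["/usr/bin/sh"], "|", merged["/init"])
```

The overlay step is the part that, as noted above, would currently need an additional step in userspace.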

Alternatively (and preferred by me) the kernel would support a mode where it
would unpack any cpios it gets into a tmpfs, and then pass an fsopen()
fd to that to the executable it then invokes from the erofs. The
executable could then mount that somewhere if it wants. But this would
require a kernel patch.

> Even if everything is the same there are code paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.

systemd-bsod is tiny though, less than 8K compressed here. Not sure it
is a good example.

> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.

sysexts are erofs or squashfs file systems with verity backing. Only
the sectors you access are decompressed.
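
The difference from a compressed cpio can be illustrated with a toy block-compressed image: each block is compressed independently, so a read only decompresses the blocks it touches. This is a conceptual sketch only. Real erofs and squashfs use their own on-disk formats, and zlib, the 4096-byte block size and the `LazyImage` class are assumptions chosen for illustration.

```python
import zlib

BLOCK_SIZE = 4096  # illustrative block size, not taken from erofs or squashfs


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and compress each one independently."""
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


class LazyImage:
    """Read-only image that decompresses only the blocks a read touches."""

    def __init__(self, blocks, block_size=BLOCK_SIZE):
        self.blocks = blocks
        self.block_size = block_size
        self.decompressed = set()  # indices of blocks decompressed so far

    def read(self, offset, length):
        if length <= 0:
            return b""
        first = offset // self.block_size
        last = min((offset + length - 1) // self.block_size, len(self.blocks) - 1)
        out = bytearray()
        for idx in range(first, last + 1):
            out += zlib.decompress(self.blocks[idx])
            self.decompressed.add(idx)
        start = offset - first * self.block_size
        return bytes(out[start:start + length])


if __name__ == "__main__":
    data = bytes(range(256)) * 64  # 16 KiB, i.e. four blocks
    image = LazyImage(compress_blocks(data))
    image.read(5000, 10)
    print(sorted(image.decompressed))
```

A cpio initrd, by contrast, is one compressed stream that has to be decompressed front to back before any file in it can be used.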

Lennart Poettering

Dec 13, 2023, 4:03:13 AM

to Nils Kattenbeck, Eric Curtin, init...@vger.kernel.org, system...@lists.freedesktop.org, Stephen Smoogen, Qubes OS Development Mailing List, Yariv Rachmani, Douglas Landgraf

On Di, 12.12.23 23:01, Nils Kattenbeck (nilsk...@gmail.com) wrote:

> > sysexts are erofs or squashfs file systems with verity backing. Only
> > the sectors you access are decompressed.
>

> Okay I forgot that they were erofs based and mentioned cpio archives
> so I assumed they would be one.
> Do they need to be fully read from disk to generate the cpio archive?

erofs is a file system, cpio is a serialized archive. Two different
things. The discussion here is whether to pass the initrd to the
kernel as one or the other. But no one is suggesting to convert one to
the other at boot time.
