Questioning current binfmt_misc setup

Silvano Cirujano Cuesta

Mar 26, 2021, 5:15:04 AM
to kas-devel
Hi,

The current approach of KAS to ensure that binfmt_misc uses modern qemu-user binaries in containerized ISAR builds is to configure the host from within the container.

But this approach has two major drawbacks:

 1. Container has to run privileged.

 2. It leaves behind a broken binfmt_misc host configuration (see below for a deeper explanation).

And a minor one:

 1. The setup might conflict with the configuration required by other binfmt_misc consumers (host processes or other containers) on the same system.

These drawbacks might be acceptable in very homogeneous scenarios (e.g. CI host only running kas-container), but unacceptable in more heterogeneous scenarios.

I know of at least one project suffering from the second drawback on a CI machine, just in case you're thinking this is an artificially constructed scenario.

The key is the use of the "fix_binary" flag [1] in the binfmt_misc registration. With this flag the interpreter binary gets loaded immediately upon registration and stays loaded for the whole system, so it doesn't need to be available in the filesystem visible to the calling process (e.g. because of a chroot or mount namespace). But it also implies a global configuration for the whole system!
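
To illustrate, this is roughly what such a registration looks like at the kernel interface. A minimal sketch, assuming an aarch64 interpreter installed as /usr/local/bin/qemu-aarch64-static; the magic/mask values are the usual aarch64 ELF ones used by qemu's scripts, and the trailing "F" is the fix_binary flag:

    # as root, with binfmt_misc mounted under /proc/sys/fs/binfmt_misc
    printf '%s' ':qemu-aarch64:M::\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/local/bin/qemu-aarch64-static:F' \
        > /proc/sys/fs/binfmt_misc/register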

What kas-container currently does is register and load GLOBALLY its own qemu-user binaries. But the moment the container exits, that registration is fully broken, meaning that ANY further binfmt_misc qemu-user consumer (not only containerized ones!) gets a broken configuration.

If they fix it (as each new run of kas-container does, or by using multiarch/qemu-user-static [2]), everything is fine. But if they don't, they inherit a broken configuration and won't be able to run.

So we end up with a setup that does what multiarch/qemu-user-static does all over the place... which IMO is a broken system forcing all binfmt_misc qemu-user consumers to fix it EVERY SINGLE TIME.


What alternatives do I see?

Is the only problem that the qemu-user binaries can be too old? I mean, would just having qemu-user-static > 5.2 (the version currently installed from buster-backports) be enough? That's at least my assumption for the proposed solutions.

I see two possibilities:

1. Placing the right (new enough) qemu-user statically linked binaries on the host and registering them with the "fix_binary" flag on binfmt_misc.

2. Returning to the pre-buster setup: no "fix_binary" flag and qemu-user binaries have to be made available in the containers (either installing them or bind mounting them).


Detailing both approaches:


RIGHT QEMU-USER BINARIES AND FIX_BINARY

Since the qemu-user binaries provided by the Debian package qemu-user-static are statically linked (not even depending on glibc), it's easy to extract them from the Debian package and place them somewhere on the host.

Then only the right binfmt_misc registration is needed, and we're set for all binfmt_misc qemu-user consumers throughout the whole system.

If newer versions of the qemu-user binaries are needed later on, just the newer binaries have to be obtained.
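
To make this more concrete, here is a rough sketch of what such host provisioning could look like. The package source, the aarch64 target and the /usr/local/bin path are placeholders, not a fixed proposal, and the option names are taken from my reading of qemu's qemu-binfmt-conf.sh:

    # on the host, as root
    apt-get download qemu-user-static        # e.g. from buster-backports or bullseye
    dpkg-deb -x qemu-user-static_*.deb qemu-extract
    install -m 0755 qemu-extract/usr/bin/qemu-aarch64-static /usr/local/bin/
    # register with the fix_binary ("persistent") flag using qemu's helper script
    ./qemu-binfmt-conf.sh --qemu-path /usr/local/bin --qemu-suffix "-static" --persistent yes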

It has to be ensured that no package on the host overrides this binfmt_misc configuration!

Probably a package would be needed for this approach to be sustainable. For example, a Debian package conflicting with qemu-user-static and qemu-user-binfmt.

It's important to notice that it remains a global configuration, which might become problematic if no single global configuration can satisfy the requirements of all consumers. For example, one container might need new qemu-user binaries while another one needs an older version because its implementation relies on the absence of certain syscalls provided by the newer qemu-user binaries.


NO FIX_BINARY FLAG

This approach is much more flexible, since each chroot and mount namespace in the system has its own configuration.

But this flexibility doesn't come for free. The setup becomes much more complex on the host.

The statically linked qemu-user binaries have to be provisioned on the host, but registered without the "fix_binary" flag.

Each chroot/mountns has to:

- either bring its own qemu-user binaries at the same path the host registration is using

- or get the qemu-user binaries bind-mounted into its root filesystem (see the sketch below for the container case)
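
For the container case that could look like this (illustrative only; the image name is a placeholder and the path inside the container has to match the interpreter path registered on the host):

    docker run --rm -it \
        -v /usr/local/bin/qemu-aarch64-static:/usr/local/bin/qemu-aarch64-static:ro \
        <your-build-image> /bin/bash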

Getting these requirements fulfilled in desktop environments seems to me to be too cumbersome.

On systems that automatically prepare execution environments (like CI runners) the setup doesn't seem to be that complex. Only a match between the execution environment and the container requirements would be needed, which can typically be accomplished with tags or similar mechanisms (CI jobs requesting the tag "binfmt-qemu-52" only run on execution environments bind-mounting statically linked qemu-user binaries > 5.2).
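
As a hypothetical example of such a matching (the tag name, the project file and the runner-side bind mount are conventions/placeholders to be agreed on, nothing that exists today), a GitLab CI job could look like this:

    # .gitlab-ci.yml fragment; a runner tagged "binfmt-qemu-52" is expected to
    # bind-mount statically linked qemu-user binaries > 5.2 into the job container
    build:
      tags:
        - binfmt-qemu-52
      script:
        - kas build kas-project.yml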


I personally would go for the first approach on desktop systems and consider using the second approach on CI systems. In any case, neither of them requires any binfmt_misc configuration from within the containers, which is my goal with this lengthy e-mail :-)


[1] https://manpages.debian.org/buster/binfmt-support/update-binfmts.8.en.html#BINARY_FORMAT_SPECIFICATIONS

[2] https://github.com/multiarch/qemu-user-static

Hasta la vista,

    Silvano Cirujano Cuesta

--
Siemens AG, T RDA IOT SES-DE
Corporate Competence Center Embedded Linux

Jan Kiszka

Mar 26, 2021, 5:58:29 AM
to [ext] Silvano Cirujano Cuesta, kas-devel
On 26.03.21 10:14, [ext] Silvano Cirujano Cuesta wrote:
> Hi,
>
> current approach of KAS to ensure that binfmt_misc uses modern qemu-user binaries in containerized ISAR builds is configuring the host from within the container.
>
> But this approach has two major drawbacks:
>
>  1. Container has to run privileged.

Yes, though that need will not vanish for Isar very soon when resolving
the binfmt topic, as we know.
This is how things work so far, and how they are proven to work
sufficiently reliable for single-user/single-build desktop environments.
It's a must to preserve that use case.

>
> On systems automatically preparing execution environments (like CI runners) the setup doesn't seem to be that complex. Only a matching between the execution environment and the container requirements would be needed, what can typically be accomplished with tags or similar mechanisms (CI jobs requesting the tag "binfmt-qemu-52" only run on execution environments bind-mounting qemu-user statically linked binaries > 5.2).
>

Right. For CI runners with shared, concurrent builds, we likely want
to provide an alternative strategy, one that "just works", detects when
a host is outdated (though, how?) and otherwise does not stumble when it
cannot update the settings from within the container.

>
> I personally would go for the first approach on desktop systems and consider using the second approach on CI systems. In any case, none of them require any binfmt_misc configurations from within the containers! What's my goal with this lengthy e-mail :-)
>

We will need both, see above.

Thanks,
Jan

>
> [1] https://manpages.debian.org/buster/binfmt-support/update-binfmts.8.en.html#BINARY_FORMAT_SPECIFICATIONS
>
> [2] https://github.com/multiarch/qemu-user-static
>
> Hasta la vista,
>
>     Silvano Cirujano Cuesta
>

--
Siemens AG, T RDA IOT

Henning Schild

Mar 26, 2021, 7:36:11 AM
to [ext] Jan Kiszka, [ext] Silvano Cirujano Cuesta, kas-devel
On Fri, 26 Mar 2021 10:43:27 +0100, "[ext] Jan Kiszka" <jan.k...@siemens.com> wrote:
There are kernel patches for binfmt namespace support. That seems to be
the most promising solution in the long run. But they never gained
enough traction to actually get merged. One problem is that you not
only have to wait until the kernel can namespace that feature, but
also until docker and friends make use of it.

https://www.spinics.net/lists/linux-api/msg38517.html

Henning

Silvano Cirujano Cuesta

Mar 26, 2021, 12:14:44 PM
to Jan Kiszka, kas-devel

On 26/03/2021 10:43, Jan Kiszka wrote:
> On 26.03.21 10:14, [ext] Silvano Cirujano Cuesta wrote:
>> ...
>>
>>  1. Container has to run privileged.
> Yes, though that need will not vanish for Isar very soon when resolving
> the binfmt topic, as we know.

If I remember correctly, getting rid of this binfmt_misc configuration would enable us to run it granting just one or two capabilities, instead of "--privileged".

>
>> ...
>>
>> Is it the only problem that the qemu-user binaries can be too old? I mean, just having qemu-user-static > 5.2 (the version being currently installed with the buster-backport) would be enough? That's at least my assumption for the proposed solutions.
Can anybody confirm that this is the issue? I mean, would a new (how new? which is the minimal version?) qemu-user binary suffice? The answer to this question is key to understanding which problem the binfmt_misc configuration from within the container was trying to fix in the first place.
>>
>> ...
>>
>>
>> NO FIX_BINARY FLAG
>>
>> This approach is much more flexible, since each chroot and mount namespace in the system has its own configuration.
>>
>> But this flexibility doesn't come for free. The setup becomes much more complex on the host.
>>
>> The qemu-user statically linked binaries have to be provisioned on the host, but without the "fix_binary" registration.
>>
>> Each chroot/mountns has to:
>>
>> - either bring its own qemu-user binaries fitting the same path the host is using
>>
>> - or get the qemu-user binaries bind-mounted into their root filesystem
>>
>> Getting these requirements fulfilled in desktop environments seems to me to be too cumbersome.
> This is how things work so far, and how they are proven to work
> sufficiently reliable for single-user/single-build desktop environments.
> It's a must to preserve that use case.

What I meant with "these requirements" is not the current status, but the proposal of not using the "fix_binary" flag, which would require:

1. A host configuration that can be easily accomplished with some simple scripts.

2. Starting the container with the corresponding qemu-user binaries bind-mounted.

Using kas-container as it is now isn't really cumbersome; it gets problematic for all other uses of binfmt_misc with qemu-user on the same system.

The kas-container setup can only be reliable for a "single-user/single-build" desktop environment if you are only using binfmt_misc with qemu-user for ISAR builds with kas-container. As mentioned above, kas-container is leaving behind "scorched earth"... but since the first thing it does is "fertilizing", you won't notice anything as long as you only use kas-container. See below for what I mean by "scorched earth".

Give the following a try:

1. Run "kas-container shell..."

2. Reboot

3. Run "docker run --platform arm64 debian:buster-slim uname -m"

The command in step 3 will fail because kas-container registered its own qemu-user binary, but that binary disappears the moment the container is removed, and the copy loaded in memory disappears with the reboot.


Installing qemu-user-static on the host would partially fix it, although it becomes an unpredictable setup when running other binfmt_misc consumers on the system (which you typically do on a development desktop).

First the qemu-user binaries of the host are loaded, until kas-container is run for the first time and its binaries get loaded; they remain loaded until a reboot or a binfmt_misc reconfiguration.

If multiarch/qemu-user-static is run ("docker run --rm --privileged multiarch/qemu-user-static --reset -p yes"), then the multiarch/qemu-user-static binaries get loaded and remain active until a reboot or the next binfmt_misc reconfiguration (possibly done by kas-container).

Either all binfmt_misc consumers reconfigure the system before running (which you typically wouldn't do) or they won't know which binaries are being used...

But why only "partially fix" it? Because if any configuration from a container uses different paths than the host, then the binfmt_misc configuration left behind will be pointing to a path that only exists inside the container!

I know of at least one project running on the same host as the KAS container, and the first thing they have to do is run "docker run --rm --privileged multiarch/qemu-user-static --reset -p yes" because the setup left behind by the KAS container is broken for them :-/

After having learned what kas-container does to my host, I won't be running it again in its pristine form (like I was doing until now). I'll make sure to run a patched version. And I suspect I won't be the only one doing so... I wonder if you are really running kas-container on your own desktop fully aware of what it's doing to your host.

>
>> On systems automatically preparing execution environments (like CI runners) the setup doesn't seem to be that complex. Only a matching between the execution environment and the container requirements would be needed, what can typically be accomplished with tags or similar mechanisms (CI jobs requesting the tag "binfmt-qemu-52" only run on execution environments bind-mounting qemu-user statically linked binaries > 5.2).
>>
> Right. For CI runners with shared, concurrently builds, we likely want
> to provide an alternative strategy, one that "just works", detects when
> a host is outdated (though, how?) and otherwise does not stumble when it
> cannot update the settings from within the container.
Detecting an outdated setup from the host before starting the CI job is trivial with "qemu-<arch>-static -version".
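
For example, a runner-side pre-check could be as simple as this sketch (the 5.2 threshold, the aarch64 target and the use of dpkg --compare-versions are my assumptions):

    # "qemu-aarch64 version 5.2.0 (...)" -> extract "5.2.0" and compare
    ver="$(qemu-aarch64-static -version | head -n1 | awk '{print $3}')"
    if ! dpkg --compare-versions "$ver" ge 5.2; then
        echo "qemu-user-static $ver is too old" >&2
        exit 1
    fi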

I can investigate how to detect outdated qemu-user binaries from within the emulated architecture. Possibly checking the existence of new syscalls could help on this.

Updating the settings from within the container should be completely forbidden in a CI runner.

>
>> I personally would go for the first approach on desktop systems and consider using the second approach on CI systems. In any case, none of them require any binfmt_misc configurations from within the containers! What's my goal with this lengthy e-mail :-)
>>
> We will need both, see above.

I wonder if a setup with new qemu-user binaries on the host would fit all the scenarios described here. But in order to answer that, it's critical to understand the root cause for the binfmt_misc reconfiguration from within kas-container in the first place (see my questions above).

Cheers,

   Silvano

>
> Thanks,
> Jan
>
>> [1] https://manpages.debian.org/buster/binfmt-support/update-binfmts.8.en.html#BINARY_FORMAT_SPECIFICATIONS
>>
>> [2] https://github.com/multiarch/qemu-user-static
>>
>> Hasta la vista,
>>
>>     Silvano Cirujano Cuesta
>>
--
Siemens AG, T RDA IOT SES-DE

Jan Kiszka

Mar 26, 2021, 12:46:47 PM
to Silvano Cirujano Cuesta, kas-devel
On 26.03.21 17:14, Silvano Cirujano Cuesta wrote:
>
> On 26/03/2021 10:43, Jan Kiszka wrote:
>> On 26.03.21 10:14, [ext] Silvano Cirujano Cuesta wrote:
>>> ...
>>>
>>>  1. Container has to run privileged.
>> Yes, though that need will not vanish for Isar very soon when resolving
>> the binfmt topic, as we know.
>
> If I don't remember wrong, getting rid of this binfmt_misc configuration enables us to get it running granting just one or two capabilities, instead of "--privileged".
>

Yes, but it would not change the fact that the build could break/attack
the host. It would get us one step closer, true.

>>
>>> ...
>>>
>>> Is it the only problem that the qemu-user binaries can be too old? I mean, just having qemu-user-static > 5.2 (the version being currently installed with the buster-backport) would be enough? That's at least my assumption for the proposed solutions.
> Can anybody confirm that this is the issue? I mean, would a new (how new? which is the minimal version?) qemu-user binary suffice? The answer to this question is key to understand which problem the binfmt_misc configuration from the container was trying to fix in the first place.

Config from the container is first of all addressing the issue that we
have to run on any host distribution, not just Debian, and on Debian
irrespective of whether the user has installed qemu-user-static or not.
That's relevant for the "Linux beginner can build an image" story. We
only need to tell them to install docker and enable the logged in user
to access it.

>>>
>>> ...
>>>
>>>
>>> NO FIX_BINARY FLAG
>>>
>>> This approach is much more flexible, since each chroot and mount namespace in the system has its own configuration.
>>>
>>> But this flexibility doesn't come for free. The setup becomes much more complex on the host.
>>>
>>> The qemu-user statically linked binaries have to be provisioned on the host, but without the "fix_binary" registration.
>>>
>>> Each chroot/mountns has to:
>>>
>>> - either bring its own qemu-user binaries fitting the same path the host is using
>>>
>>> - or get the qemu-user binaries bind-mounted into their root filesystem
>>>
>>> Getting these requirements fulfilled in desktop environments seems to me to be too cumbersome.
>> This is how things work so far, and how they are proven to work
>> sufficiently reliable for single-user/single-build desktop environments.
>> It's a must to preserve that use case.
>
> What I meant with "these requirements" is not the current status, but the proposal of not using the "fix_binary" flag. What would require:
>
> 1. A configuration of the host that can be easily accomplished with some easy scripts.
>

- install docker on (recent) distro of your choice
- add user to group docker (or whatever grants access -> distro doc)
- run kas-container

That's how projects/products like meta-iot2050 or
{jailhouse,xenomai}-images work.

> 2. Starting the container bind mounting the corresponding qemu-user binaries.
>
> Using kas-container as it is now isn't really cumbersome, it gets problematic for all other uses of binfmt_misc with qemu-user in the same system.
>

Right, but we never heard complaints in the context of those single
desktop user scenarios.

> The kas-container setup can only be reliable for a "single-user/single-build" desktop environment if you are only using binmft_misc with qemu-user for ISAR builds with kas-container. As mentioned above, kas-container is leaving behind "scorched earth"... but since the first thing it does is "fertilizing", you won't notice anything as long as you only use kas-container. See below what I mean with "scorched earth".

kas-container is not aiming at CI, the kas-isar /container/ is.

>
> Give following a try:
>
> 1. Run "kas-container shell..."
>
> 2. Reboot
>
> 3. Run "docker run --platform arm64 debian:buster-slim uname -m"
>
> The second command will fail because kas-container configured its own qemu-user binary, but the binary disappears the moment the container is removed and the in memory loaded copy disappears with the reboot.
>
>
> Installing qemu-user-static in the host would partially fix it, although it becomes an unpredictable setup if running other binftm_misc consumers in the system (what you typically do in a development desktop).

I have no qemu-user-static that configures binfmt_misc the way Debian
does on my SUSE. And I bet that's similar, just different, on Fedora,
Arch, you-name-it.

Requiring a more specific host setup is ok for unisolated CI, it's not
for the desktop.

>
> First the qemu-user binaries of the host are loaded, until kas-container is run from the first time and its binaries get loaded and remain loaded until a reboot or a binfmt_misc reconfiguration.
>
> If multiarch/qemu-user-static is run ("docker run --rm --privileged multiarch/qemu-user-static --reset -p yes"), then multiarch/qemu-user-static binaries get loaded an remain active until a reboot or the next binftm_misc reconfiguration (possibly done by kas-container).
>
> Either all binfmt_misc consumers reconfigure the system before running (what you typically wouldn't do) or they won't know which binaries are being used...
>
> But why only "partially fix"? Because if any configuration from a container is using different paths than the host, then the binftm_misc configuration left behind will be appointing to a path that only exists inside of the container!
>
> I know at least of one project running on the same host the KAS container is being run and the first thing they have to do is running "docker run --rm --privileged multiarch/qemu-user-static --reset -p yes" because the setup left behind by the KAS container is broken for them :-/
>
> After having learned what kas-container does in my host, I won't be running it again in its pristine form (like I was doing until now). I'll make sure to run a patched version. And I suspect I'm not the only one doing so... I wonder if you are really running kas-container in your own desktop fully aware of what it's doing to your host.
>

You are free to do that. The majority of our users won't (...be able to).

>>
>>> On systems automatically preparing execution environments (like CI runners) the setup doesn't seem to be that complex. Only a matching between the execution environment and the container requirements would be needed, what can typically be accomplished with tags or similar mechanisms (CI jobs requesting the tag "binfmt-qemu-52" only run on execution environments bind-mounting qemu-user statically linked binaries > 5.2).
>>>
>> Right. For CI runners with shared, concurrently builds, we likely want
>> to provide an alternative strategy, one that "just works", detects when
>> a host is outdated (though, how?) and otherwise does not stumble when it
>> cannot update the settings from within the container.
> Detecting outdated from the host before starting the CI job is trivial with "qemu-<arch>-static -version".
>
> I can investigate how to detect outdated qemu-user binaries from within the emulated architecture. Possibly checking the existence of new syscalls could help on this.
>
> Updating the settings from within the container should be completely forbidden in a CI runner.
>

I'm fine with that - as long as the desktop use case is not broken.

If there is a way to present kas-isar with a fitting setup on a CI runner and
prevent it from changing that, kas-isar can simply validate it and only
fail (or warn) when expectations are not met. In the kas-container case,
configuration (and qemu deployment) should continue to happen via kas-isar.

>>
>>> I personally would go for the first approach on desktop systems and consider using the second approach on CI systems. In any case, none of them require any binfmt_misc configurations from within the containers! What's my goal with this lengthy e-mail :-)
>>>
>> We will need both, see above.
>
> I wonder if a setup with new qemu-user binaries in the host would fit all the hereby described scenarios. But in order to answer that it's critical understanding the root-cause for the binfmt_misc reconfiguration from within kas-container in the first place (see my questions above).

See above, try to think it through from the perspective of a non-expert
and/or non-Debian Linux user.

Jan

--
Siemens AG, T RDA IOT

Silvano Cirujano Cuesta

Mar 26, 2021, 12:48:36 PM
to Henning Schild, [ext] Jan Kiszka, kas-devel

On 26/03/2021 12:31, Henning Schild wrote:
> On Fri, 26 Mar 2021 10:43:27 +0100, "[ext] Jan Kiszka" <jan.k...@siemens.com> wrote:
>
>> On 26.03.21 10:14, [ext] Silvano Cirujano Cuesta wrote:
>> ...
>> Right. For CI runners with shared, concurrently builds, we likely want
>> to provide an alternative strategy, one that "just works", detects
>> when a host is outdated (though, how?) and otherwise does not stumble
>> when it cannot update the settings from within the container.
> There are kernel patches for binfmt namespace support. That seems to be
> the most promising solution in the long run. But they never caused
> enough traction to actually get merged. One problem was that you do not
> just have to wait until the kernel can namespace that feature, but
> also until docker and friends make use of it.
>
> https://www.spinics.net/lists/linux-api/msg38517.html
>
> Henning

I'm aware of those patches, but IMO they don't help us much.

1. It will take a long while until they get integrated and are available for us.

2. It imposes the usage of user namespaces (userns), which isn't always possible (older Docker setups either use userns everywhere or nowhere) or desired.

3. It originates from a different use-case (see [1]): rootless containers ("I'd like to configure the interpreter without being root."). Though desirable for us, ISAR won't be able to run rootless for a long while...

4. I have the impression that kernel developers assume [2] that the "fix_binary" flag fits all other use-cases.

5. Our issue is having different binaries in different root filesystems, where support for mount namespaces (mountns) would help. That's in fact v2 of the patch series [3] that you've mentioned, but apparently it's either not possible or not as easy as that patch series tried to make it.

  Silvano

[1] https://lkml.org/lkml/2018/10/2/1292

[2] https://lkml.org/lkml/2018/10/3/432

[3] https://lkml.org/lkml/2018/10/3/266

>
>>> I personally would go for the first approach on desktop systems and
>>> consider using the second approach on CI systems. In any case, none
>>> of them require any binfmt_misc configurations from within the
>>> containers! What's my goal with this lengthy e-mail :-)
>> We will need both, see above.
>>
>> Thanks,
>> Jan
>>
>>> [1] https://manpages.debian.org/buster/binfmt-support/update-binfmts.8.en.html#BINARY_FORMAT_SPECIFICATIONS
>>>
>>> [2] https://github.com/multiarch/qemu-user-static
>>>
>>> Hasta la vista,
>>>
>>>     Silvano Cirujano Cuesta
>>>

--
Siemens AG, T RDA IOT SES-DE

Silvano Cirujano Cuesta

Mar 26, 2021, 2:01:40 PM
to Jan Kiszka, kas-devel

On 26/03/2021 17:46, Jan Kiszka wrote:
> On 26.03.21 17:14, Silvano Cirujano Cuesta wrote:
>> On 26/03/2021 10:43, Jan Kiszka wrote:
>>> On 26.03.21 10:14, [ext] Silvano Cirujano Cuesta wrote:
>>>> ...
>>>>
>>>>  1. Container has to run privileged.
>>> Yes, though that need will not vanish for Isar very soon when resolving
>>> the binfmt topic, as we know.
>> If I don't remember wrong, getting rid of this binfmt_misc configuration enables us to get it running granting just one or two capabilities, instead of "--privileged".
>>
> Yes, but it would not change the fact that the build could break/attack
> the host. It would get us one step closer, true.
AFAIK only the capabilities SYS_ADMIN and MKNOD were needed. Although I assume that it's somehow possible to achieve privilege escalation with only those two, the level of expertise needed to do so isn't widely available... I've investigated the topic a bit for a project and I'm not aware of any technique capable of it.
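
Just to illustrate what I mean (not something kas supports today; the image name, the mount and the project file are placeholders):

    # instead of --privileged, grant only the two capabilities Isar seems to need
    docker run --rm -it --cap-add SYS_ADMIN --cap-add MKNOD \
        -v "$PWD":/work <kas-isar-image> kas build /work/kas.yml
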
>
>>>> ...
>>>>
>>>> Is it the only problem that the qemu-user binaries can be too old? I mean, just having qemu-user-static > 5.2 (the version being currently installed with the buster-backport) would be enough? That's at least my assumption for the proposed solutions.
>> Can anybody confirm that this is the issue? I mean, would a new (how new? which is the minimal version?) qemu-user binary suffice? The answer to this question is key to understand which problem the binfmt_misc configuration from the container was trying to fix in the first place.
> Config from the container is first of all addressing the issue that we
> have to run on any host distribution, not just Debian, and on Debian
> irrespective of the fact if the user installed qemu-user-static or not.
> That's relevant for the "Linux beginner can build an image" story. We
> only need to tell them to install docker and enable the logged in user
> to access it.

I can understand that use-case. But breaking the binfmt_misc system of a Linux beginner is not cool. And that's what the current approach potentially does.

My proposals don't assume any distribution at all. Only the statically built qemu-user binaries are required (you can build them yourself or extract them from a package distributing them) and some tool (a simple script like [1] suffices) to register them. Packages provided by the distros are simply a convenient way of getting them.

IMHO that "Linux beginner" use-case shouldn't be the default, but activated with a flag. And a clear message should make users aware of the consequences (even if they cannot understand it the moment they read it, they might keep it in mind until they stumble upon it).

[1] https://github.com/qemu/qemu/blob/master/scripts/qemu-binfmt-conf.sh

>
>>>> ...
>>>>
>>>>
>>>> NO FIX_BINARY FLAG
>>>>
>>>> This approach is much more flexible, since each chroot and mount namespace in the system has its own configuration.
>>>>
>>>> But this flexibility doesn't come for free. The setup becomes much more complex on the host.
>>>>
>>>> The qemu-user statically linked binaries have to be provisioned on the host, but without the "fix_binary" registration.
>>>>
>>>> Each chroot/mountns has to:
>>>>
>>>> - either bring its own qemu-user binaries fitting the same path the host is using
>>>>
>>>> - or get the qemu-user binaries bind-mounted into their root filesystem
>>>>
>>>> Getting these requirements fulfilled in desktop environments seems to me to be too cumbersome.
>>> This is how things work so far, and how they are proven to work
>>> sufficiently reliable for single-user/single-build desktop environments.
>>> It's a must to preserve that use case.
>> What I meant with "these requirements" is not the current status, but the proposal of not using the "fix_binary" flag. What would require:
>>
>> 1. A configuration of the host that can be easily accomplished with some easy scripts.
>>
> - install docker on (recent) distro of your choice
> - add user to group docker (or whatever grants access -> distro doc)
> - run kas-container
>
> That's how projects/products like meta-iot2050 or
> {jailhouse,xenomai}-images work.

Accomplishing the required host configuration is even easier than installing Docker on most distros :-D

I mean, even kas-container could take care of it, requesting root permissions only once, from outside of the container and globally for the whole system.

>
>> 2. Starting the container bind mounting the corresponding qemu-user binaries.
>>
>> Using kas-container as it is now isn't really cumbersome, it gets problematic for all other uses of binfmt_misc with qemu-user in the same system.
>>
> Right, but we never heard complaints in the context of those single
> desktop user scenarios.
Most people wouldn't be able to blame the kas-isar container image even when facing issues provoked by it. Without the necessary knowledge about the binfmt_misc mechanisms, most people would probably remain puzzled by the kind of issues you might face.
>
>> The kas-container setup can only be reliable for a "single-user/single-build" desktop environment if you are only using binmft_misc with qemu-user for ISAR builds with kas-container. As mentioned above, kas-container is leaving behind "scorched earth"... but since the first thing it does is "fertilizing", you won't notice anything as long as you only use kas-container. See below what I mean with "scorched earth".
> kas-container is not aiming at CI, the kas-isar /container/ is.
I've mixed up both in some places, but it's the kas-isar container image that I usually mean. In the sentence above, replace "kas-container" with "kas-isar container image".
>
>> Give following a try:
>>
>> 1. Run "kas-container shell..."
>>
>> 2. Reboot
>>
>> 3. Run "docker run --platform arm64 debian:buster-slim uname -m"
>>
>> The second command will fail because kas-container configured its own qemu-user binary, but the binary disappears the moment the container is removed and the in memory loaded copy disappears with the reboot.
>>
>>
>> Installing qemu-user-static in the host would partially fix it, although it becomes an unpredictable setup if running other binftm_misc consumers in the system (what you typically do in a development desktop).
> I have no qemu-user-static that configures binfmt_misc the way Debian
> does on my SUSE. And I bet that's similar, just different, on Fedora,
> Arch, you-name-it.
IMO qemu-user-static packages shouldn't configure binfmt_misc without a chance to modify the configuration (like Debian does). You're right that my comment is too Debian-specific. More generally, by "Installing qemu-user-static" I meant "installing statically built qemu-user binaries and registering them on binfmt_misc". If your distro provides a package that does it, fine. If not, you can do it yourself.
>
> Requiring a more specific host setup is ok for unisolated CI, it's not
> for the desktop.
If kas-container can require Docker, I don't understand why it cannot require a host binfmt_misc configuration.
>
>> First the qemu-user binaries of the host are loaded, until kas-container is run from the first time and its binaries get loaded and remain loaded until a reboot or a binfmt_misc reconfiguration.
>>
>> If multiarch/qemu-user-static is run ("docker run --rm --privileged multiarch/qemu-user-static --reset -p yes"), then multiarch/qemu-user-static binaries get loaded an remain active until a reboot or the next binftm_misc reconfiguration (possibly done by kas-container).
>>
>> Either all binfmt_misc consumers reconfigure the system before running (what you typically wouldn't do) or they won't know which binaries are being used...
>>
>> But why only "partially fix"? Because if any configuration from a container is using different paths than the host, then the binftm_misc configuration left behind will be appointing to a path that only exists inside of the container!
>>
>> I know at least of one project running on the same host the KAS container is being run and the first thing they have to do is running "docker run --rm --privileged multiarch/qemu-user-static --reset -p yes" because the setup left behind by the KAS container is broken for them :-/
>>
>> After having learned what kas-container does in my host, I won't be running it again in its pristine form (like I was doing until now). I'll make sure to run a patched version. And I suspect I'm not the only one doing so... I wonder if you are really running kas-container in your own desktop fully aware of what it's doing to your host.
>>
> You are free to do that. The majority of our users won't (...be able to).
I'd like to give those users the same alternative I'd like to have :-) One that gives them control over their system.
>
>>>> On systems automatically preparing execution environments (like CI runners) the setup doesn't seem to be that complex. Only a matching between the execution environment and the container requirements would be needed, what can typically be accomplished with tags or similar mechanisms (CI jobs requesting the tag "binfmt-qemu-52" only run on execution environments bind-mounting qemu-user statically linked binaries > 5.2).
>>>>
>>> Right. For CI runners with shared, concurrently builds, we likely want
>>> to provide an alternative strategy, one that "just works", detects when
>>> a host is outdated (though, how?) and otherwise does not stumble when it
>>> cannot update the settings from within the container.
>> Detecting outdated from the host before starting the CI job is trivial with "qemu-<arch>-static -version".
>>
>> I can investigate how to detect outdated qemu-user binaries from within the emulated architecture. Possibly checking the existence of new syscalls could help on this.
>>
>> Updating the settings from within the container should be completely forbidden in a CI runner.
>>
> I'm fine with that - as long as the desktop use case is not broken.
Of course, I'm advocating for a solution that enables both use-cases. Breaking one of them is not an option for me. I'm clear about the need for something like kas-container (that's why I've been contributing to it right from the beginning).
>
> If there is a way to present kas-isar on a CI runner a fitting setup and
> prevent it from changing that, kas-isar can simply validate it and only
> fail (or warn) when expectations are not met. In the kas-container case,
> configuration (and qemu deployment) should continue to happen via kas-isar.

I think we kind of agree on the containerized CI use-case. My only doubt is how to validate the expectations from inside of the container...

But I think we disagree on the host configuration done by kas-isar in the kas-container use-case. I'd give users the opportunity (documentation) to configure the system themselves. Or offer the configuration from kas-container (but not magically from the container), if strictly required, clearly communicating what will be done. As mentioned above, this would be activated with a kas-container option. What kas-container would basically do is either install a distro package (like the buster-backports one on a Debian Buster system) or directly obtain the binaries and register them on unsupported distros.

>
>>>> I personally would go for the first approach on desktop systems and consider using the second approach on CI systems. In any case, none of them require any binfmt_misc configurations from within the containers! What's my goal with this lengthy e-mail :-)
>>>>
>>> We will need both, see above.
>> I wonder if a setup with new qemu-user binaries in the host would fit all the hereby described scenarios. But in order to answer that it's critical understanding the root-cause for the binfmt_misc reconfiguration from within kas-container in the first place (see my questions above).
> See above, try to think it through from the perspective of a non-expert
> and/or non-Debian Linux user.

I'm thinking from the perspective of a non-expert and/or non-Debian Linux user. The main goal of this thread is identifying the use-cases, requirements, issues to fix,... before sending an RFC patch.

One important question that is still open for me is which qemu-user version is required for kas-isar. Do we have a known number? Apparently 3.1.0 (what Debian Buster provides) doesn't fulfill the requirements, but 5.2.0 (what the Bullseye backport to Buster provides) does. Whether something in between, possibly provided by other distros, would suffice remains unclear to me.

  Silvano

>
> Jan
>
--
Siemens AG, T RDA IOT SES-DE

Jan Kiszka

Mar 26, 2021, 3:20:26 PM
to Silvano Cirujano Cuesta, kas-devel
On 26.03.21 19:01, Silvano Cirujano Cuesta wrote:
>
> On 26/03/2021 17:46, Jan Kiszka wrote:
>> On 26.03.21 17:14, Silvano Cirujano Cuesta wrote:
>>> On 26/03/2021 10:43, Jan Kiszka wrote:
>>>> On 26.03.21 10:14, [ext] Silvano Cirujano Cuesta wrote:
>>>>> ...
>>>>>
>>>>>  1. Container has to run privileged.
>>>> Yes, though that need will not vanish for Isar very soon when resolving
>>>> the binfmt topic, as we know.
>>> If I don't remember wrong, getting rid of this binfmt_misc configuration enables us to get it running granting just one or two capabilities, instead of "--privileged".
>>>
>> Yes, but it would not change the fact that the build could break/attack
>> the host. It would get us one step closer, true.
> AFAIK only the capability SYS_ADMIN and MKNOD where needed. Although I assume that it's somehow possible to make a privilege escalation only with both of them, then level of expertise needed to do so it's widely available... I've investigated the topic a bit for a project and I'm not aware of any technique capable of it.

We have been running --privileged for 3.5 years now (aa3d109f0b0b). Isar
changed a lot since then, so it's hard to say if we could actually
reduce the attack surface significantly this way alone, without breaking
users. Again, worth exploring, but only after binfmt_misc has been
de-privileged upstream.

>>
>>>>> ...
>>>>>
>>>>> Is it the only problem that the qemu-user binaries can be too old? I mean, just having qemu-user-static > 5.2 (the version being currently installed with the buster-backport) would be enough? That's at least my assumption for the proposed solutions.
>>> Can anybody confirm that this is the issue? I mean, would a new (how new? which is the minimal version?) qemu-user binary suffice? The answer to this question is key to understand which problem the binfmt_misc configuration from the container was trying to fix in the first place.
>> Config from the container is first of all addressing the issue that we
>> have to run on any host distribution, not just Debian, and on Debian
>> irrespective of the fact if the user installed qemu-user-static or not.
>> That's relevant for the "Linux beginner can build an image" story. We
>> only need to tell them to install docker and enable the logged in user
>> to access it.
>
> I can understand that use-case. But breaking the binfmt_misc system of a Linux beginner is not cool. And that's what the current approach potentially does.

Beginners don't use this feature. Even most power users don't notice
what is changed this way by the container.

>
> My proposals don't assume any distribution at all. Only the statically built qemu-user binaries are required (you can build them yourself or extract them from a package distributing them) and some tools (a simple script like [1] suffice) to register them. Packages provided by the distributions provided by the distros are simply a comfortable way for getting them).

If we start pulling random qemu-user versions into the build - meaning
more random than the current Debian buster vs. bullseye issues - the
situation will not get better at all.

The binary shipped needs to be under kas-container/isar control. And
that would mean complicated extraction of binaries from the debian
packages, host-side deployment, binfmt configuration, and then container
startup - sounds like 3 additional lines of code in kas-container? Hard
to believe. ;)
As long as the current defaults stay (hard requirements with our user
base), we can consider different configuration options when running
kas-container. But I still think you are underestimating the complexity
and the new risks different deployment options bring in.

>>
>>>>> I personally would go for the first approach on desktop systems and consider using the second approach on CI systems. In any case, none of them require any binfmt_misc configurations from within the containers! What's my goal with this lengthy e-mail :-)
>>>>>
>>>> We will need both, see above.
>>> I wonder if a setup with new qemu-user binaries in the host would fit all the hereby described scenarios. But in order to answer that it's critical understanding the root-cause for the binfmt_misc reconfiguration from within kas-container in the first place (see my questions above).
>> See above, try to think it through from the perspective of a non-expert
>> and/or non-Debian Linux user.
>
> I'm thinking from a non-expert and/or non-Debian Linux user. The main goal of this thread is identifying the use-cases, requirements, issues to fix,... before sending a RFC patch.
>
> One important question that is still open for me is which QEMU-User version is required for kas-isar. Do we have a known number? Apparently 3.1.0 (what Debian Buster provides) doesn't fulfill the requirements, but 5.2.0 (what the Debian Bullseye backport to Buster provides) does. If something in between possible provided by other distros suffice remains unclear to me.

Generally, you need the one of the newest distro release you want to
build. As half of the world is now testing bullseye, that is the
reference version for us. That's basically how the situation was when
we jumped to buster a few years ago. I wouldn't be surprised if it
happened again.

Jan

--
Siemens AG, T RDA IOT

Silvano Cirujano Cuesta

Mar 29, 2021, 1:31:29 PM
to Jan Kiszka, kas-devel

On 26/03/2021 20:15, Jan Kiszka wrote:
> On 26.03.21 19:01, Silvano Cirujano Cuesta wrote:
>> On 26/03/2021 17:46, Jan Kiszka wrote:
>>> On 26.03.21 17:14, Silvano Cirujano Cuesta wrote:
>>>> On 26/03/2021 10:43, Jan Kiszka wrote:
>>>>> On 26.03.21 10:14, [ext] Silvano Cirujano Cuesta wrote:
>>>>>> ...
>>>>>>
>>>>>>  1. Container has to run privileged.
>>>>> Yes, though that need will not vanish for Isar very soon when resolving
>>>>> the binfmt topic, as we know.
>>>> If I don't remember wrong, getting rid of this binfmt_misc configuration enables us to get it running granting just one or two capabilities, instead of "--privileged".
>>>>
>>> Yes, but it would not change the fact that the build could break/attack
>>> the host. It would get us one step closer, true.
>> AFAIK only the capability SYS_ADMIN and MKNOD where needed. Although I assume that it's somehow possible to make a privilege escalation only with both of them, then level of expertise needed to do so it's widely available... I've investigated the topic a bit for a project and I'm not aware of any technique capable of it.
> We are running --privileged since for 3.5 years (aa3d109f0b0b). Isar
> changed a lot since then, so it's hard to say if we could actually
> reduce the attack surface significantly only this way, without breaking
> users. Again, worth to explore, but that only after binfmt_misc has been
> de-privileged in upstream.

If it's hard to say, then I'll have to challenge it with facts (assuming I remember it right and it hasn't changed).

I agree that not breaking users is paramount in these considerations.

I don't see the need for a de-privileged binfmt_misc in upstream (I suppose you mean the kernel). Privileged (--privileged) root containers are a much bigger risk than lower privileged root containers. Sure, less privileged containers are a much bigger risk than rootless containers, but rootless containers are for the time being IMO out of the discussion.

>
>>>>>> ...
>>>>>>
>>>>>> Is it the only problem that the qemu-user binaries can be too old? I mean, just having qemu-user-static > 5.2 (the version being currently installed with the buster-backport) would be enough? That's at least my assumption for the proposed solutions.
>>>> Can anybody confirm that this is the issue? I mean, would a new (how new? which is the minimal version?) qemu-user binary suffice? The answer to this question is key to understand which problem the binfmt_misc configuration from the container was trying to fix in the first place.
>>> Config from the container is first of all addressing the issue that we
>>> have to run on any host distribution, not just Debian, and on Debian
>>> irrespective of the fact if the user installed qemu-user-static or not.
>>> That's relevant for the "Linux beginner can build an image" story. We
>>> only need to tell them to install docker and enable the logged in user
>>> to access it.
>> I can understand that use-case. But breaking the binfmt_misc system of a Linux beginner is not cool. And that's what the current approach potentially does.
> Beginners don't use this feature. Even most power users don't feel what
> is changed this way by the container.

Beginners are typically using something like multiarch/qemu-user-static to manage it, because they don't understand it. Even many power users don't understand what's going on on their host.

But still, the kas-isar container image is making fundamental changes to the host (leaving behind a broken configuration under certain circumstances). And if you need/want to manage it yourself, you have to wonder what's wrong with the system. You start (de)installing qemu-user-static, qemu-user-binfmt, qemu-user, binfmt-support,... Then you try multiarch/qemu-user-static... until it magically works (sometimes only until the next reboot). Been there, seen that.

>
>> My proposals don't assume any distribution at all. Only the statically built qemu-user binaries are required (you can build them yourself or extract them from a package distributing them) and some tools (a simple script like [1] suffice) to register them. Packages provided by the distributions provided by the distros are simply a comfortable way for getting them).
> If we start pulling random qemu-user versions into build, means more
> random than current Debian buster vs. bullseye issues, the situation
> will not get better at all.
Agreed on that, that's why I wanted to better understand which criteria determine which versions are needed. I mean, pulling a specific bullseye package version is pretty much what is happening under the hood with the container image nowadays.
>
> The binary shipped needs to be under kas-container/isar control. And
> that would mean complicated extraction of binaries from the debian
> packages, host-side deployment, binfmt configuration, and then container
> startup - sounds like 3 additional lines of code in kas-container? Hard
> to believe. ;)

I don't think that the kas-isar container image needs to control the binary; it only needs to be able to check that the requirements (e.g. version number) are fulfilled.

Are you challenging me? 3 LOCs without length limit? Perl allowed? X-D

>
>> IMHO that "linux beginner" use-case shouldn't be the default, but activated with a flag. And a clear message should make users aware of the consequences (even if they cannot understand it the moment they read it, they might keep it in their head until they stumble upon it).
>>
>> [1] https://github.com/qemu/qemu/blob/master/scripts/qemu-binfmt-conf.sh
I've had a look at it and I couldn't find out how OpenSUSE installs qemu-user-static 8-o

I found it for Arch [1] (they extract the binaries from a Debian package) and Fedora [2] and both configure binfmt_misc the same way as Debian (with the "fix_binary" flag).

[1] https://aur.archlinux.org/packages/binfmt-qemu-static/

[2] https://fedora.pkgs.org/32/fedora-updates-aarch64/qemu-user-static-4.2.1-1.fc32.aarch64.rpm.html

But with a very important difference: they use systemd [3] instead of update-binfmts (like Debian does), which makes it possible to easily override the configuration (which the current qemu-user-static in Debian doesn't allow). I disagree with Debian not making the binfmt_misc registration configurable, that's why I've filed a bug [4].

[3] https://www.freedesktop.org/software/systemd/man/binfmt.d.html

[4] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=985889
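
For reference, with the systemd mechanism the same fix_binary registration boils down to a one-line drop-in that a later package or the admin can override. File name and interpreter path are assumptions on my side; the field format is the same one /proc/sys/fs/binfmt_misc/register takes:

    # /etc/binfmt.d/qemu-aarch64.conf, applied at boot by systemd-binfmt.service
    :qemu-aarch64:M::\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/bin/qemu-aarch64-static:F
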
We can start by keeping the current default but offering a "less intrusive" mode via options, and later evaluate whether to change the defaults.
>
>>>>>> I personally would go for the first approach on desktop systems and consider using the second approach on CI systems. In any case, none of them require any binfmt_misc configurations from within the containers! What's my goal with this lengthy e-mail :-)
>>>>>>
>>>>> We will need both, see above.
>>>> I wonder if a setup with new qemu-user binaries in the host would fit all the hereby described scenarios. But in order to answer that it's critical understanding the root-cause for the binfmt_misc reconfiguration from within kas-container in the first place (see my questions above).
>>> See above, try to think it through from the perspective of a non-expert
>>> and/or non-Debian Linux user.
>> I'm thinking from a non-expert and/or non-Debian Linux user. The main goal of this thread is identifying the use-cases, requirements, issues to fix,... before sending a RFC patch.
>>
>> One important question that is still open for me is which QEMU-User version is required for kas-isar. Do we have a known number? Apparently 3.1.0 (what Debian Buster provides) doesn't fulfill the requirements, but 5.2.0 (what the Debian Bullseye backport to Buster provides) does. If something in between possible provided by other distros suffice remains unclear to me.
> Generally, you need the one of the newest Distro release you want to
> build. As half of the world is now testing bullseye, that is the
> reference version for us. That's basically how the situation was when
> jumping to buster a few years ago. I wouldn't be surprised it will
> happen again.

OK, then my starting point will be the current bullseye qemu-user-static release, but in a way that it can be easily updated.

  Silvano

>
> Jan
>
--
Siemens AG, T RDA IOT SES-DE