[kubevirt-dev] Unified control of privileged resources access and operations with seccomp notifiers in KubeVirt

Alice Frosi

Dec 16, 2022, 11:19:58 AM
to kubevirt-dev, Stefano Brivio, David Vossel, Vladik Romanovsky, Luboslav Pivarc
Hi everyone,

we would like to introduce a new project called seitan [1] and our idea of how it could work with KubeVirt.
I opened a PR [1] with the design proposal, where you can find plenty of details. Here, I'd like to outline the main idea.

The seitan project aims to provide a straightforward and configurable way to perform system calls from a privileged component on behalf of an unprivileged process, using monitored system calls as triggering events. It also provides deep inspection of system call arguments, which can be used as "matches" in a declarative policy.


This project was designed while trying to solve a critical and intricate security problem for KubeVirt.


Non-root containers and VMs protect the system by reducing the permissions granted to the workload to a minimum. As a result, certain actions are no longer permitted and are delegated to privileged components. On the other hand, this complicates synchronization and communication between the privileged and unprivileged components.


The original issue we wanted to solve falls into this category: how to connect an unprivileged QEMU process to a privileged daemon. This isn't natively supported in Kubernetes, and all the solutions we initially considered had significant drawbacks.

We also quickly realized that this is a broader issue that applies to many more areas.


KubeVirt has solved each of these situations individually, and every new feature involving privileged operations has required new critical code to be written, audited, tested, and maintained.


While looking at these existing implementations and at the new problems, seccomp user notifiers [3] appeared to be a rather natural approach. This isn't particularly new or original: similar examples can be found in LXD [4], and gVisor [5] implements similar concepts, based on a userspace application kernel where the platform interface sits approximately at the system call level.


The novelty introduced by seitan compared to the existing solutions is the declarative approach over the imperative programming model. In our case, the expected privileged syscalls are listed in a JSON input file and associated with a privileged operation delegated and executed by the privileged component.


If new features arise, we want a mechanism that can be expanded in a simple way. With seitan, no additional code is required: a new entry is simply added to the seitan input file.
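
To make the declarative idea more concrete, here is a minimal sketch in Python. It assumes a hypothetical input schema: the field names "syscall", "match" and "action", and the action names, are invented for illustration and are not taken from the seitan design proposal. It only shows how a list of declared entries could be matched against an observed syscall and its arguments.

```python
import json

# Hypothetical input file: the field and action names below are
# illustrative only and are not taken from the seitan design proposal.
RECIPE_JSON = """
[
  {
    "syscall": "connect",
    "match": {"path": "/var/run/pr-helper.sock"},
    "action": "connect_to_privileged_daemon"
  },
  {
    "syscall": "sched_setscheduler",
    "match": {"policy": "fifo"},
    "action": "set_scheduler_on_behalf"
  }
]
"""


def find_action(recipes, syscall, args):
    """Return the action of the first entry matching the syscall and all
    of its declared arguments, or None if nothing was declared for it."""
    for entry in recipes:
        if entry["syscall"] != syscall:
            continue
        if all(args.get(key) == value for key, value in entry["match"].items()):
            return entry["action"]
    return None


recipes = json.loads(RECIPE_JSON)
```

With such a table, supporting a new case would mean adding one more entry to the list, while a syscall that was never declared simply yields no action.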


Please note that the code published under seitan [1] doesn't yet reflect the proposed design -- it's rather an early proof-of-concept based on a much more rudimentary initial idea. New code will be published gradually in the coming weeks.

I hope I piqued your attention, and we would greatly appreciate any comments or suggestions.


Many thanks,
Alice

Roman Mohr

Dec 19, 2022, 12:32:53 PM
to Alice Frosi, kubevirt-dev, Stefano Brivio, David Vossel, Vladik Romanovsky, Luboslav Pivarc
Hi Alice,

First, great to see that the bind-mount problem is being tackled!

Is it fair to say that your proposal is intended to solve disk, file and socket mounting in various scenarios?
If yes, that sounds in general like a good boundary to me (I would be concerned that some "unneeded" privileges may not get removed from components, just because we could solve it with seccomp filters).

So I guess what would then come up as a natural alternative would be transferring a file descriptor over a unix socket. Here I would love to hear your thoughts on
 * the pros/cons of both
 * a risk assessment regarding potential side-effects on vendor- or Linux-specific security mechanisms.
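
For reference, the file descriptor alternative relies on SCM_RIGHTS ancillary data on a unix socket. Below is a minimal sketch in Python (3.9 or later, on Linux). It makes two simplifying assumptions: both ends live in one process and are connected by a socketpair, and a pipe stands in for the real resource (a disk, a socket) so that the sketch runs without special permissions.

```python
import os
import socket

# A socketpair stands in for the unix socket between a privileged
# process (sender) and an unprivileged one (receiver).
privileged, unprivileged = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# The privileged side opens a resource; a pipe is used here so the
# sketch runs without special permissions.
read_end, write_end = os.pipe()

# send_fds() attaches the descriptor as SCM_RIGHTS ancillary data.
socket.send_fds(privileged, [b"fd"], [read_end])

# recv_fds() returns (data, fds, msg_flags, address).
msg, fds, _flags, _addr = socket.recv_fds(unprivileged, 1024, 1)
received_fd = fds[0]

# The received descriptor is a new number referring to the same pipe.
os.write(write_end, b"hello")
data = os.read(received_fd, 5)

for fd in (read_end, write_end, received_fd):
    os.close(fd)
privileged.close()
unprivileged.close()
```

The transfer itself is short. What a real deployment has to add on top is the part that matters for the comparison: both processes must agree on when the descriptor is sent, and the receiver must know what to do with it.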

Again, great to see this being discussed.

Best regards,
Roman 
 


 

Alice Frosi

Dec 20, 2022, 4:25:00 AM
to Roman Mohr, kubevirt-dev, Stefano Brivio, David Vossel, Vladik Romanovsky, Luboslav Pivarc
Hi Roman,

On Mon, Dec 19, 2022 at 6:32 PM Roman Mohr <rm...@google.com> wrote:
> Is it fair to say, that your proposal is intended to solve disk, file and socket-mounting in various scenarios?

Not only. It is true that connecting QEMU to privileged daemons, opening disks, and creating tap devices can all be reduced to the same problem of passing a file descriptor. We have mostly concentrated on these cases because they are well-known problems.

However, it also applies to other syscalls. We have given the example of changing the vcpu scheduler and priority [1]. In this case, it is currently virt-handler that performs the syscall. We propose to filter this syscall: when QEMU executes it, the syscall will be intercepted by the seccomp monitor, the arguments will be checked, and then the corresponding action will be executed by the privileged component. Generally, the action can be the same syscall, but not necessarily.

 
> If yes, that sounds in general like a good boundary to me (I would be concerned that some "unneeded" privileges may not get removed from components, just because we could solve it with seccomp filters).

Well, the unprivileged component doesn't have those privileges.

Seccomp notifiers are simply a way to inform the privileged component that the monitored process is trying to execute a given syscall. The syscall is then executed by the privileged component only if it is expected, that is, if it has been previously declared in the seitan input file.
 

> So, I guess what would then come up as a natural alternative, would be transferring a file-descriptor over a unix socket. Here I would love to hear your thoughts on
>  * the pros/cons of both

Sure, here are a few advantages of seccomp notifiers over passing fds via unix sockets:
  • Passing file descriptors over unix sockets is more complex. It requires a synchronization mechanism between the two user space applications (in the case of KubeVirt, virt-handler and virt-launcher), with the overhead that comes with it. With seccomp notifiers, you get this synchronization for "free". Let's take SCSI persistent reservation as an example, where QEMU wants to connect to a daemon at a certain path. The flow is:
    • QEMU performs a connect syscall to a specific path (e.g. /var/run/pr-helper.sock)
    • the seccomp monitor is notified, as connect belongs to the filtered syscalls, and on the QEMU side the connect is blocked
    • the seccomp monitor can then take the file descriptor from QEMU (the file descriptor information is already passed to the monitor by the seccomp notification) and connect it to the privileged daemon
    • finally, the seccomp monitor can simply let the connect syscall from QEMU continue successfully
  • QEMU/libvirt need to be adapted to support all the cases we want to handle with file descriptors. With seccomp we don't need to modify them, and they aren't even aware that they are running in a less privileged environment. In the case of disks, the current libvirt version doesn't support this. The same goes for SCSI persistent reservations.
  • Reconnections and failures don't need to be handled differently. In the case of the connect syscall, one of the two sides could encounter a failure and need to be restarted. In that case, another connect syscall will be executed, and we don't need to repeat the process of exchanging file descriptors. Opening disks is probably easier, as there is a single process. But with fd passing, the file descriptor exchange still needs to be repeated if QEMU fails and retries opening a file.
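
The connect flow above works because a duplicated descriptor shares the same open file description as the original: connecting the duplicate connects the original too. Here is a minimal single-process sketch in Python, with several stand-ins that are assumptions of the sketch and not part of the proposal. os.dup() replaces the step where the monitor obtains a copy of QEMU's descriptor (on Linux this could be done with, for example, pidfd_getfd(2)), a temporary path replaces /var/run/pr-helper.sock, and no seccomp filter is installed at all.

```python
import os
import socket
import tempfile

# Stand-in for the privileged daemon listening on a unix socket.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "pr-helper.sock")
daemon = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
daemon.bind(path)
daemon.listen(1)

# Stand-in for QEMU: it owns a socket that is not connected yet.
qemu_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

# Stand-in for the monitor obtaining a copy of QEMU's descriptor.
# os.dup() is used because everything runs in one process here.
monitor_fd = os.dup(qemu_sock.fileno())
monitor_sock = socket.socket(fileno=monitor_fd)

# The monitor connects the duplicate. Both descriptors share one open
# file description, so QEMU's socket is now connected as well.
monitor_sock.connect(path)
monitor_sock.close()

conn, _ = daemon.accept()
qemu_sock.sendall(b"ping")
received = conn.recv(4)

conn.close()
qemu_sock.close()
daemon.close()
os.unlink(path)
os.rmdir(tmpdir)
```

The process that owns the socket never performs the connect itself, yet it can use the connection once the other side has set it up.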

 
>  * a risk assessment regarding to potential side-effects on vendor- or linux specific security mechanisms.

Yes, we still need to tackle these aspects.

Together with @David Vossel, we already discussed offline that this needs to work together with SELinux, especially for accessing files through file descriptors, as the SELinux labels are bypassed with this mechanism. We are evaluating letting the seccomp monitor inherit the SELinux context of the monitored process. In this way, the seccomp monitor will fail to open files without the proper label. Another possibility could be to extend the seitan input file and set the SELinux labels for the path there.
BTW, passing file descriptors over a unix socket raises a similar security concern.
We should have a similar mechanism for AppArmor.

If this design proposal sounds promising to the KubeVirt community, we can plan to put more effort into solving and supporting SELinux and AppArmor.
 

> Again, great to see this being discussed.
>
> Best regards,
> Roman
 

Many thanks,
Alice 

Roman Mohr

Dec 20, 2022, 5:09:32 AM
to Alice Frosi, kubevirt-dev, Stefano Brivio, David Vossel, Vladik Romanovsky, Luboslav Pivarc
On Tue, Dec 20, 2022 at 10:24 AM Alice Frosi <afr...@redhat.com> wrote:
> > Is it fair to say, that your proposal is intended to solve disk, file and socket-mounting in various scenarios?
>
> Not, only. It is true that connecting QEMU to privileged daemons, opening disks, and creating tap devices can be reduced to the same problem of passing a file descriptor. We have mostly concentrated on these cases because are well-known problems.
>
> However, it applies also to other syscalls. We have given the example of changing vcpu scheduler and priority [1]. In this case, it is virt-handler that performs the syscall. We propose to filter this syscall, and when QEMU executes it, the syscall will be filtered by the seccomp monitor, the arguments will be checked, and then the corresponding action will be executed by the privileged component. Generally, the action can be the same syscall, but not necessarily.

Yes, here I am not sure if the right solution would be to just let this happen. What I like right now (ignoring the cons for a moment) is that we just prepare it all upfront and tightly control the env. There is no risk that our bpf filters could somehow diverge from what libvirt tries to do. In general we just want libvirt to not do things. That has eventually always been the solution.

This is actually pretty simple and we don't have to chase what exactly libvirt tries to do. While in both cases we can have regressions, in the first case (let libvirt just not do a privileged operation), it cannot diverge in potentially untested combinations. To give an extreme example: our bpf filter would have to know all possible if/else paths in libvirt for argument building, so that we are sure we allow all valid (and only the valid) arguments.
 

 
> > If yes, that sounds in general like a good boundary to me (I would be concerned that some "unneeded" privileges may not get removed from components, just because we could solve it with seccomp filters).
>
> Well, the unprivileged component doesn't have those privileges.
>
> Seccomp notifiers are simply a way to inform the privileged component that the monitored process tries to execute this syscall. Then, the syscall is executed by the privileged component if it is expected and has been previously declared inside the seitan input file.
>
> > So, I guess what would then come up as a natural alternative, would be transferring a file-descriptor over a unix socket. Here I would love to hear your thoughts on
> >  * the pros/cons of both
>
> Sure, here are a couple of advantages of seccomp notifiers over fds and unix-sockets
>   • Passing a file-descriptors over unix sockets is more complex. It requires a synchronization mechanism and overhead between the 2 user space applications. In the case of KubeVirt between virt-handler and virt-launcher. With the seccomp notifiers, you get this synchronization for "free". Let's take the SCSI persistent reservation as an example where QEMU wants to connect to a demon with a certain path. The flow is:
>     • QEMU performs a connect syscall to a specific path (e.g /var/run/pr-helper.sock)
>     • the seccomp monitor is notified as connect belongs to the filtered syscalls, and on QEMU side the connect is blocked
>     • the seccomp monitor can then take the file descriptor from QEMU (the file descriptor information is already passed to the monitor by the seccomp notification) and connect it to the privileged daemon
>     • finally, the seccomp monitor can simply let the connect syscall from QEMU continue successfully
>   • QEMU/Libvirt needs to be adapted to support all the cases we want to handle with file descriptors. With seccomp we don't need to modify them, and they aren't even aware that they are running in a less privileged environment. In the case of disks, the current libvirt version doesn't support this. The same for SCSI persistent reservations.
>   • Reconnections and failure don't need to be handled differently. In the case of the connection syscall, one of the two sides could encounter a failure and need to be restarted. In this case, another connection syscall will be executed, but we don't need to renew the process of exchanging file descriptors if something goes wrong. Probably, opening disks is easier as there is a single process. But in any case, the file descriptor exchange needs to be done if QEMU fails and retries another time to open a file.

Yes, this is an interesting point. The current mounts are super-annoying, but they continue to work without having virt-handler running, which has so far been a core property. Once a VM has managed to enter the running state, without virt-handler you can't do administrative tasks on it anymore (hotplug, console, ...), but the disks are in general safe to continue running without side-effects. Is this a concern?
 

 
> >  * a risk assessment regarding to potential side-effects on vendor- or linux specific security mechanisms.
>
> Yes, we still need to tackle these aspects.
>
> Together with @David Vossel, we already discussed offline that we should work together with SELinux. Especially for accessing files using file descriptors as the selinux labels are bypassed with this mechanism. We are evaluating letting the seccomp monitor inherit the SELinux context of the monitored process. In this way, the seccomp monitor will fail in opening files without the proper label. Another possibility could be to extend the seitan input file and set there the selinux labels for the path.
> BTW, passing file descriptors over unix socket encounters a similar security concern.
> Similarly to SELinux, we should have a similar mechanism for AppArmor.
>
> If this design proposal sounds promising for KubeVirt community then we can plan more effort into solving and supporting SELinux and AppArmor.

I would prefer to discuss this a little bit more first. I added a few more examples of where this approach may have unintended side-effects and complications.

In general I see relatively few problems with mounting containerDisks and hotplug disks. There the fd should never break unless the VM goes down, and at that point we want to stop using it anyway.
But the socket-based approach, while having an initial overhead of establishing the fd passing, is still in the game for me now. It has a very limited scope with very low risk of unintentionally running into e.g. AppArmor or SELinux by trying to solve cases where e.g. libvirt should maybe just not do something.
I am not ruling out the filter solution, but for this limited scope the socket-based approach has the advantage of keeping the usual preparation flow: first virt-handler sets up the env, then virt-launcher can operate without virt-handler.

For e.g. the scsi reservation socket, I am not yet sure that either of the two paths, bpf filter or socket fd transfer, is ultimately better than a mount.

Best regards,
Roman

 

Stefano Brivio

Dec 20, 2022, 7:55:11 AM
to Alice Frosi, Roman Mohr, kubevirt-dev, David Vossel, Vladik Romanovsky, Luboslav Pivarc
On Tue, 20 Dec 2022 10:24:43 +0100
Alice Frosi <afr...@redhat.com> wrote:

> On Mon, Dec 19, 2022 at 6:32 PM Roman Mohr <rm...@google.com> wrote:
> >
> > [...]
> >
> > Is it fair to say, that your proposal is intended to solve disk, file and
> > socket-mounting in various scenarios?
> >
>
> Not, only. It is true that connecting QEMU to privileged daemons, opening
> disks, and creating tap devices can be reduced to the same problem of
> passing a file descriptor. We have mostly concentrated on these cases
> because are well-known problems.
>
> However, it applies also to other syscalls. We have given the example of
> changing vcpu scheduler and priority [1]. In this case, it is virt-handler
> that performs the syscall. We propose to filter this syscall, and when QEMU
> executes it, the syscall will be filtered by the seccomp monitor, the
> arguments will be checked, and then the corresponding action will be
> executed by the privileged component. Generally, the action can be the same
> syscall, but not necessarily.
>
> [1] https://github.com/kubevirt/kubevirt/pull/8750

Yet another example where this approach would apply quite naturally is
vhost-user for DPDK, quoting from a superseded pull request:

  "Right, we always have that option - have virt-handler set up the
  socket correctly. That requires code that will require maintenance."

https://github.com/kubevirt/kubevirt/pull/3888#discussion_r662115623

The idea behind seitan is to avoid writing more code scattered all
over the place for these cases, and rather specify what you need.

> > [...]
> >
> > * a risk assessment regarding to potential side-effects on vendor- or
> > linux specific security mechanisms.

Note that there's also a linked document (maybe it's not very visible)
where we started doing this kind of assessment:

https://github.com/alicefr/community/blob/seitan/design-proposals/seitan/security-aspects-seitan.md

--
Stefano

Stefano Brivio

Dec 23, 2022, 2:49:18 PM
to Roman Mohr, Alice Frosi, kubevirt-dev, David Vossel, Vladik Romanovsky, Luboslav Pivarc
Hi Roman,
I had to think for a while about this: it's true that the current way
looks different in the sense that, for example, virt-handler sets up a
socket, or virtwrap creates a tap device, in advance. But conceptually
we're not really proposing to change that in the sense of a looser
control of the environment.

That is, the most notable difference in this regard (ignoring the
advantages for a moment ;)) is rather about synchronisation and timing.
Let's take the example of the tap interface creation: now virtwrap
creates it beforehand and signals it's done, then the guest can start
(implementation-wise it's much more complicated than that, which is one
thing we address).

With the approach we propose, instead, a tap device is created once the
guest signals it needs it. But it's exactly the same tap device as
configured by virt-handler (via "recipe") which is created, by an
external component (not QEMU, not libvirt) -- this doesn't change.

What changes is that it's created once QEMU asks for it (in the most
natural way, in my opinion).

> There is no risk that our bpf filters could somehow diverge from what
> libvirt tries to do. In general we just want libvirt to not do things.
> That as eventually always been the solution.

Actually, I fail to see a big difference here. Just like virt-handler
might set up things in a way that's incompatible with a given version
of libvirt or QEMU, virt-handler could configure seitan in an
incompatible way.

It's true that virt-handler doesn't monitor the system calls QEMU does,
so, should the system calls change while the outcome needs to remain
exactly the same, virt-handler doesn't have a problem with that.

On the other hand, the current solutions actually look *less* robust to
me. Let's say that libvirt 8.10 expects a non-blocking socket for
whatever purpose, and 8.11 "by mistake" expects the same thing, but
blocking. If virt-handler creates it, ignoring flags libvirtd would
otherwise use, you'll run into a somewhat hidden problem.

If virt-handler instructs seitan to look for system call arguments, it
can decide to refuse creating the socket (because flags don't match),
or it can decide to create it with the flags libvirtd wants, or even to
create it with a different set of flags because "we know better" in a
given case.

KubeVirt can make it as compatible or incompatible as desired by
writing recipes in a given way.

> This is actually pretty simple and we don't have to chase what exactly
> libvirt tries to do.

This is actually a big argument I see in favour of the approach we
propose: call us clumsy, but finding the examples and understanding all
the different implementations of privileged operations ultimately done
on behalf of unprivileged components took us a number of days.

Some of these approaches, taken one at a time, might be pretty simple,
but overall they're scattered all over the place and while conceptually
similar they are implemented in completely different ways, which makes
it much less simple (to us, at least).

Defining the attack surface in those terms looks incredibly
complicated. And it becomes even harder as more code is going to solve
those single, similar problems in different ways.

> While in both cases we can have regressions, in the
> first case (let libvirt just not do a privileged operation), it can not
> diverge in potentially untested combinations. Like to create an extreme
> example our bpf filter would have to know all possible if/else paths in
> libvirt for argument building, so that we are sure we allow all valid (and
> only the valid) arguments.

Well, but we are not proposing this (at least as far as KubeVirt is
concerned) as a security filter, rather as a way to do a few privileged
operations without adding more code, while admitting and describing
clearly that some are, in fact, needed. So we don't have to "accept"
any valid argument, just whatever is needed for a given functionality.

We intend to build syscall-specific knowledge into seitan, of course.
For example, if a recipe is written to trigger the creation of a socket
with a given address family, the user needs to be able to specify that
in a symbolic way.

Example for socket(2) (this is slightly more than pseudo-code right
now, it's not yet on the repository):

    struct expr_cmp socket {
        { ARG_INT | ARG_SELECT, "family", af },
        ...
    };

    struct expr_values socket_types {
        { "stream", SOCK_STREAM },
        { "dgram", SOCK_DGRAM },
        { "seq", SOCK_SEQPACKET },
        { "raw", SOCK_RAW },
        { "packet", SOCK_PACKET },
    };

That is, if the JSON input contains "dgram" for a given "family", we
know we have to match on the second argument (the socket type) to be
SOCK_DGRAM.
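
A small Python sketch of the same symbolic lookup, for illustration only: the names SOCKET_TYPES and type_matches are hypothetical, and SOCK_PACKET is left out of the sketch. It also shows one detail a matcher has to decide on. On Linux, SOCK_NONBLOCK and SOCK_CLOEXEC can be ORed into the type argument of socket(2), so the sketch masks them out before comparing. A recipe that cares about blocking behaviour would instead check those bits explicitly.

```python
import socket

# Symbolic names mapped to socket(2) type constants, mirroring the
# table sketched above (SOCK_PACKET is left out of this sketch).
SOCKET_TYPES = {
    "stream": socket.SOCK_STREAM,
    "dgram": socket.SOCK_DGRAM,
    "seq": socket.SOCK_SEQPACKET,
    "raw": socket.SOCK_RAW,
}

# On Linux, these flags may be ORed into the type argument.
TYPE_FLAGS = socket.SOCK_NONBLOCK | socket.SOCK_CLOEXEC


def type_matches(symbolic, type_arg):
    """True if the type argument of socket(2), with the flag bits
    masked out, equals the constant named by `symbolic`."""
    return (type_arg & ~TYPE_FLAGS) == SOCKET_TYPES[symbolic]
```

For example, type_matches("dgram", socket.SOCK_DGRAM | socket.SOCK_NONBLOCK) is true under this choice, while a stricter matcher could treat the non-blocking variant as a different request.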

Another example with a specific implementation in virt-handler:
https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-handler/setsched.go#L11

here, virt-handler needs to build in specific knowledge about a given
syscall to issue it in advance -- seitan would carry the same knowledge
about sched_setscheduler(2) and its priority and policy arguments.
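
As a rough illustration of what "knowledge about sched_setscheduler(2)" could look like, here is a minimal Python sketch for Linux. The policy spellings ("other", "fifo", "rr") and the function name are hypothetical and do not come from seitan or virt-handler. The sketch only validates a symbolic policy and a priority against the range the running kernel reports, without changing any scheduling parameters.

```python
import os

# Symbolic policy names as they might appear in a recipe (hypothetical
# spelling), mapped to the constants used by sched_setscheduler(2).
POLICIES = {
    "other": os.SCHED_OTHER,
    "fifo": os.SCHED_FIFO,
    "rr": os.SCHED_RR,
}


def check_sched_args(policy_name, priority):
    """Return (policy, priority) if the pair is valid on this system,
    raise ValueError otherwise. Nothing is applied to any process."""
    if policy_name not in POLICIES:
        raise ValueError(f"unknown policy: {policy_name}")
    policy = POLICIES[policy_name]
    lowest = os.sched_get_priority_min(policy)
    highest = os.sched_get_priority_max(policy)
    if not lowest <= priority <= highest:
        raise ValueError(f"priority {priority} outside {lowest}..{highest}")
    return policy, priority
```

A privileged component would run a check of this kind on the arguments it receives before issuing the real call on behalf of the unprivileged process.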

> >> [...]
> >>
> >> So, I guess what would then come up as a natural alternative, would be
> >> transferring a file-descriptor over a unix socket. Here I would love to
> >> hear your thoughts on
> >> * the pros/cons of both
> >>
> >
> > Sure, here are a couple of advantages of seccomp notifiers over fds and
> > unix-sockets
> >
> > - Passing a file-descriptors over unix sockets is more complex. It
> > requires a synchronization mechanism and overhead between the 2 user space
> > applications. In the case of KubeVirt between virt-handler and
> > virt-launcher. With the seccomp notifiers, you get this synchronization for
> > "free". Let's take the SCSI persistent reservation as an example where QEMU
> > wants to connect to a demon with a certain path. The flow is:
> > - QEMU performs a connect syscall to a specific path (e.g
> > /var/run/pr-helper.sock)
> > - the seccomp monitor is notified as connect belongs to the
> > filtered syscalls, and on QEMU side the connect is blocked
> > - the seccomp monitor can then take the file descriptor from QEMU
> > (the file descriptor information is already passed to the monitor by the
> > seccomp notification) and connect it to the privileged daemon
> > - finally, the seccomp monitor can simply let the connect syscall
> > from QEMU continue successfully
> >
> >
> > - QEMU/Libvirt needs to be adapted to support all the cases we want to
> > handle with file descriptors. With seccomp we don't need to modify them,
> > and they aren't even aware that they are running in a less privileged
> > environment. In the case of disks, the current libvirt version doesn't
> > support this. The same for SCSI persistent reservations.
> > - Reconnections and failure don't need to be handled differently. In
> > the case of the connection syscall, one of the two sides could encounter a
> > failure and need to be restarted. In this case, another connection syscall
> > will be executed, but we don't need to renew the process of exchanging file
> > descriptors if something goes wrong. Probably, opening disks is easier as
> > there is a single process. But in any case, the file descriptor exchange
> > needs to be done if QEMU fails and retries another time to open a file.
>
> Yes, this is an interesting point. The current mounts are super-annoying,
> but they continue to work without having virt-handler running, which has so
> far been a core property. If a VM manages to enter the running state, you
> can't do administrative tasks on it anymore (hotplug, console, ...) but the
> disks are in general safe to continue running without side-effects. Is this
> a concern?

My understanding is that they continue to work as long as no errors
occur. Should a failure occur, it's impossible to handle. I'm not sure
if this, specifically about mounts, is a concern.

But taking the case of a UNIX domain socket proxy for the
qemu-pr-helper Alice mentioned (which would need to be written as a
completely new implementation), I'm fairly sure a disconnection would
need to be handled for real.

> > [2] https://groups.google.com/g/kubevirt-dev/c/AmjvVqlD1Hs/m/JvxS9O8yAgAJ
> >
> >
> >> * a risk assessment regarding potential side-effects on vendor- or
> >> Linux-specific security mechanisms.
> >>
> >
> > Yes, we still need to tackle these aspects.
> >
> > Together with @David Vossel <dvo...@redhat.com>, we already discussed
> > offline that we should work together with SELinux. Especially for accessing
> > files using file descriptors as the selinux labels are bypassed with this
> > mechanism. We are evaluating letting the seccomp monitor inherit the
> > SELinux context of the monitored process. In this way, the seccomp monitor
> > will fail in opening files without the proper label. Another possibility
> > could be to extend the seitan input file and set there the selinux labels
> > for the path.
> > BTW, passing file descriptors over unix socket encounters a similar
> > security concern.
> > Similarly to SELinux, we should have a similar mechanism for AppArmor.
> >
> > If this design proposal sounds promising for KubeVirt community then we
> > can plan more effort into solving and supporting SELinux and AppArmor.
> >
>
> I would prefer to discuss this a little bit more first. I added a few more
> examples on where this approach may have unintended side-effects and
> complications.
>
> In general I see relatively few problems with mounting containerDisks
> and hotplug disks. There the fd should never break unless the VM goes down,
> and there we want to stop using it anyway.
> But the socket based approach, while having an initial overhead of
> establishing the fd passing, is still in the game for me now. It has a very
> limited scope with very low risk of unintentionally running into e.g.
> AppArmor or SELinux by trying to solve cases where e.g. libvirt should
> maybe just not do something.

The initial overhead isn't so much of a concern we're pointing out
here: the biggest one is that more specific code needs to be written.

From this paragraph, one thing isn't clear to me: what different or
additional risk are you referring to, with reference to the seitan idea?

> I am not ruling out the filter solution, but for this limited scope the
> socket-based approach has the advantage of keeping the usual preparation
> flow - first virt-handler sets up the env, then virt-launcher can operate
> without virt-handler.
>
> For e.g. scsi reservation socket, I am not yet sure that any of the two
> paths, bpf filter, or socket fd transfer, are eventually better than a
> mount.

In general terms, I'm not sure either -- by the way, I'm not a KubeVirt
developer, but I hope I didn't make it too obvious. :)

One thing I see very clearly though is that declaring a match on a
system call and a triggering action is... declarative.

Transferring a file descriptor (which seitan ultimately does, but with
a single implementation), or a mount, implies otherwise a bunch of code
that doesn't look exactly natural for the problem at hand.

--
Stefano

Alice Frosi

Jan 10, 2023, 11:10:17 AM
to Roman Mohr, kubevirt-dev, Stefano Brivio, David Vossel, Vladik Romanovsky, Luboslav Pivarc
Hi Roman, 

 
Is it fair to say, that your proposal is intended to solve disk, file and socket-mounting in various scenarios?

Not only. It is true that connecting QEMU to privileged daemons, opening disks, and creating tap devices can be reduced to the same problem of passing a file descriptor. We have mostly concentrated on these cases because they are well-known problems.

However, it applies also to other syscalls. We have given the example of changing vcpu scheduler and priority [1]. In this case, it is virt-handler that performs the syscall. We propose to filter this syscall, and when QEMU executes it, the syscall will be filtered by the seccomp monitor, the arguments will be checked, and then the corresponding action will be executed by the privileged component. Generally, the action can be the same syscall, but not necessarily.

Yes, here I am not sure if the right solution would be to just let this happen. What I like right now (ignoring the cons for a moment), is that we just prepare it all upfront and tightly control the env. There is no risk that our bpf filters could somehow diverge from what libvirt tries to do. In general we just want libvirt to not do things. That has eventually always been the solution.

This is actually pretty simple and we don't have to chase what exactly libvirt tries to do. While in both cases we can have regressions, in the first case (let libvirt just not do a privileged operation), it can not diverge in potentially untested combinations. Like to create an extreme example our bpf filter would have to know all possible if/else paths in libvirt for argument building, so that we are sure we allow all valid (and only the valid) arguments.

Stefano already gave a very detailed answer in the previous email. However, I still want to emphasize that we should only list the expected and privileged syscalls. This mechanism is designed to replace the code where kubevirt needs to reimplement those syscalls or access privileged resources.

I don't think we need to chase all the possible combinations and paths that libvirt tries to do, rather implementing only the operations it cannot do because it lacks privileges. Generally, they should be pretty simple and well defined.

About preparing the environment upfront, yes definitely this is a core property that we have tried to maintain! IMHO, this proposal just reinforces it. The seitan input files need to list exactly the operations that will be executed. The expected syscalls and arguments will be precomputed and generated based on the VMI definition before launching libvirt and QEMU.


 

 
If yes, that sounds in general like a good boundary to me (I would be concerned that some "unneeded" privileges may not get removed from components, just because we could solve it with seccomp filters).

Well, the unprivileged component doesn't have those privileges.

Seccomp notifiers are simply a way to inform the privileged component that the monitored process tries to execute this syscall. Then, the syscall is executed by the privileged component if it is expected and has been previously declared inside the seitan input file.
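
To make the "expected and previously declared" check concrete, here is a minimal sketch in Python. It is only a toy model of the decision step, not seitan's input format or implementation: the `Rule` structure, the `decide` function, and the returned strings are hypothetical, and the only detail taken from this thread is the pr-helper socket path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One declared operation: a syscall name plus the exact arguments expected."""
    syscall: str
    args: tuple


# Hypothetical policy, computed upfront before the monitored process starts.
POLICY = frozenset({
    Rule("connect", ("/var/run/pr-helper.sock",)),
})


def decide(syscall: str, args: tuple) -> str:
    """Return what a privileged monitor would do for a notified syscall."""
    if Rule(syscall, args) in POLICY:
        # Declared upfront: the privileged side performs the operation.
        return "perform-on-behalf"
    # Anything not declared is refused; nothing extra is granted.
    return "deny"
```

In this model, a connect to any path other than the declared one is refused, which is the property described above: the unprivileged process gains no privileges of its own.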
 

So, I guess what would then come up as a natural alternative, would be transferring a file-descriptor over a unix socket. Here I would love to hear your thoughts on
 * the pros/cons of both

Sure, here are a couple of advantages of seccomp notifiers over fds and unix-sockets
  • Passing file descriptors over unix sockets is more complex. It requires a synchronization mechanism, with its overhead, between the 2 user space applications; in the case of KubeVirt, between virt-handler and virt-launcher. With the seccomp notifiers, you get this synchronization for "free". Let's take the SCSI persistent reservation as an example where QEMU wants to connect to a daemon with a certain path. The flow is:
    • QEMU performs a connect syscall to a specific path (e.g. /var/run/pr-helper.sock)
    • the seccomp monitor is notified as connect belongs to the filtered syscalls, and on QEMU side the connect is blocked
    • the seccomp monitor can then take the file descriptor from QEMU (the file descriptor information is already passed to the monitor by the seccomp notification) and connect it to the privileged daemon
    • finally, the seccomp monitor can simply let the connect syscall from QEMU continue successfully
  • QEMU/Libvirt needs to be adapted to support all the cases we want to handle with file descriptors. With seccomp we don't need to modify them, and they aren't even aware that they are running in a less privileged environment. In the case of disks, the current libvirt version doesn't support this. The same for SCSI persistent reservations.
  • Reconnections and failure don't need to be handled differently. In the case of the connection syscall, one of the two sides could encounter a failure and need to be restarted. In this case, another connection syscall will be executed, but we don't need to renew the process of exchanging file descriptors if something goes wrong. Probably, opening disks is easier as there is a single process. But in any case, the file descriptor exchange needs to be done if QEMU fails and retries another time to open a file.

Yes, this is an interesting point. The current mounts are super-annoying, but they continue to work without having virt-handler running, which has so far been a core property. If a VM manages to enter the running state, you can't do administrative tasks on it anymore (hotplug, console, ...) but the disks are in general safe to continue running without side-effects. Is this a concern?

Well, this also applies with seccomp: once the disk has been opened and QEMU has the file descriptor of the image, it can still access it even if virt-handler is down. The real concern with the bind mount is the cleanup done by kubelet when virt-handler isn't responding.
 
Another advantage of this approach is avoiding the synchronization between virt-handler and virt-launcher.
We have mostly talked about disks and sockets, but there are already other kinds of operations in the code that fall into this category.
In the proposal, we mentioned real-time VMs and the syscall sched_setscheduler. This is another good example where seitan could significantly simplify the flow.

Currently, for real-time VMs, we need to:
  - start the guest in a paused state because we need to have the QEMU process already existing
  - let the virt-handler set the scheduling policy and the priority with the syscall sched_setscheduler on QEMU
  - unpause the guest

With seitan, we could simply list sched_setscheduler with constant arguments (priority:1) into the seitan input file, and let seitan impersonate the syscall when the QEMU process executes it.
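
For reference, the privileged call discussed here can be sketched in a few lines of Python (Linux only). This is an illustration under assumptions, not KubeVirt or seitan code: the helper name `set_fifo_priority` is hypothetical, and SCHED_FIFO is chosen only as a typical real-time policy; the thread itself only states priority 1.

```python
import os


def set_fifo_priority(pid: int, priority: int = 1) -> bool:
    """Try to move `pid` to SCHED_FIFO at `priority`; return True on success.

    A pid of 0 means the calling process.
    """
    try:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        # Most commonly EPERM: the caller lacks CAP_SYS_NICE, which is
        # exactly the situation of an unprivileged QEMU process.
        return False
    return True
```

Run from an unprivileged process, this is expected to return False, which is why the operation currently has to happen in virt-handler, or would be performed on QEMU's behalf in the seccomp notifier model.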

This would avoid the entire mechanism of pausing/unpausing the guest. In addition, it enables VM migration, a case where the pausing mechanism doesn't work (here, I need to give credit to Lubo who helped us figure this out ;)).

 

 
 * a risk assessment regarding potential side-effects on vendor- or Linux-specific security mechanisms.

Yes, we still need to tackle these aspects.

Together with @David Vossel, we already discussed offline that we should work together with SELinux, especially for accessing files using file descriptors, as the SELinux labels are bypassed with this mechanism. We are evaluating letting the seccomp monitor inherit the SELinux context of the monitored process. In this way, the seccomp monitor will fail to open files without the proper label. Another possibility could be to extend the seitan input file and set the SELinux labels for the path there.
BTW, passing file descriptors over unix socket encounters a similar security concern. 
Similarly to SELinux, we should have a similar mechanism for AppArmor.

If this design proposal sounds promising for KubeVirt community then we can plan more effort into solving and supporting SELinux and AppArmor.

I would prefer to discuss this a little bit more first. I added a few more examples on where this approach may have unintended side-effects and complications.

Sure, this is exactly the goal of the proposal.
We have listed the potential scenarios where seitan could be used. However, it is a flexible and generic tool and we can decide which feature can take advantage of it and when. 
My initial intention would be to integrate the SCSI persistent reservation using this tool as it is a new feature and won't cause any backward compatibility issues. Nevertheless, it could also be used for improving existing features.
 

In general I see relatively few problems with mounting containerDisks and hotplug disks. There the fd should never break unless the VM goes down, and there we want to stop using it anyway.
But the socket-based approach, while having an initial overhead of establishing the fd passing, is still in the game for me now. It has a very limited scope with very low risk of unintentionally running into e.g. AppArmor or SELinux by trying to solve cases where e.g. libvirt should maybe just not do something.
I am not ruling out the filter solution, but for this limited scope the socket-based approach has the advantage of keeping the usual preparation flow - first virt-handler sets up the env, then virt-launcher can operate without virt-handler.
 
For e.g. scsi reservation socket, I am not yet sure that any of the two paths, bpf filter, or socket fd transfer, are eventually better than a mount.

Best regards,
Roman

Many thanks for the precious feedback,
Alice 

Roman Mohr

Jan 30, 2023, 3:48:11 AM
to Alice Frosi, kubevirt-dev, Stefano Brivio, David Vossel, Vladik Romanovsky, Luboslav Pivarc
Thanks Stefano and Alice.

On Tue, Jan 10, 2023 at 5:10 PM Alice Frosi <afr...@redhat.com> wrote:
Hi Roman, 

 
Is it fair to say, that your proposal is intended to solve disk, file and socket-mounting in various scenarios?

Not only. It is true that connecting QEMU to privileged daemons, opening disks, and creating tap devices can be reduced to the same problem of passing a file descriptor. We have mostly concentrated on these cases because they are well-known problems.

However, it applies also to other syscalls. We have given the example of changing vcpu scheduler and priority [1]. In this case, it is virt-handler that performs the syscall. We propose to filter this syscall, and when QEMU executes it, the syscall will be filtered by the seccomp monitor, the arguments will be checked, and then the corresponding action will be executed by the privileged component. Generally, the action can be the same syscall, but not necessarily.

Yes, here I am not sure if the right solution would be to just let this happen. What I like right now (ignoring the cons for a moment), is that we just prepare it all upfront and tightly control the env. There is no risk that our bpf filters could somehow diverge from what libvirt tries to do. In general we just want libvirt to not do things. That has eventually always been the solution.

This is actually pretty simple and we don't have to chase what exactly libvirt tries to do. While in both cases we can have regressions, in the first case (let libvirt just not do a privileged operation), it can not diverge in potentially untested combinations. Like to create an extreme example our bpf filter would have to know all possible if/else paths in libvirt for argument building, so that we are sure we allow all valid (and only the valid) arguments.

Stefano already gave a very detailed answer in the previous email. However, I still want to emphasize that we should only list the expected and privileged syscalls. This mechanism is designed to replace the code where kubevirt needs to reimplement those syscalls or access privileged resources.

I don't think we need to chase all the possible combinations and paths that libvirt tries to do, rather implementing only the operations it cannot do because it lacks privileges. Generally, they should be pretty simple and well defined.

About preparing the environment upfront, yes definitely this is a core property that we have tried to maintain! IMHO, this proposal just reinforces it. The seitan input files need to list exactly the operations that will be executed. The expected syscalls and arguments will be precomputed and generated based on the VMI definition before launching libvirt and QEMU.

I think there is one critical difference though, and that is: In one case we are defining actions where we allow an unprivileged process more than it can do from a k8s point of view, while in the current case virt-handler is really doing all privileged things upfront. One key difference is explained with this sentence which we have to ask ourselves with seitan: "Am I (virt-handler) doing a privileged operation now because libvirt needs something, or because someone who just did `kubectl exec` into the pod just uses the allow-list to do potentially bad stuff?"

I think in almost all cases when we had issues so far, because libvirt did a syscall or tried to re-write files, we could just let libvirt stop doing that.

One interesting example from Stefano is the SCHED example: https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-handler/setsched.go#L11

Here we exactly did that: We don't let libvirt do it, we let virt-handler do it in advance. Here I don't have to be scared about potentially messing up a seitan filter, I personally don't see a difference in complexity, regarding writing a json file or golang code. In both cases a person needs deep knowledge. But if virt-handler does this upfront directly, there is little chance to be too generous on the filtering side. The performed action may still lead to a configuration where virt-launcher may abuse the setting, but at least there is no risk to give virt-launcher a chance to potentially do configurations itself which we did not think about.

Does that sound reasonable?


Best regards,
Roman

Roman Mohr

Jan 30, 2023, 4:29:45 AM
to Stefano Brivio, Alice Frosi, kubevirt-dev, David Vossel, Vladik Romanovsky, Luboslav Pivarc
Stefano,

Thanks for the detailed writeup.

Hm, but qemu is not asking for it as far as I know. We just want it to use a tap device which we know upfront about. I also don't see this happening on live-disk-attach or live-network-disk-attach.
It looks to me like it follows a layered pattern in libvirt/qemu as well.

Still I am not a fan of what we have right now either, often we have to actually prepare "concrete" devices, where actually passing file descriptors over sockets would be even better.


> There is no risk that our bpf filters could somehow diverge from what
> libvirt tries to do. In general we just want libvirt to not do things.
> That has eventually always been the solution.

Actually, I fail to see a big difference here. Just like virt-handler
might set up things in a way that's incompatible with a given version
of libvirt or QEMU, virt-handler could configure seitan in an
incompatible way.

It's true that virt-handler doesn't monitor the system calls QEMU does,
so, should the system calls be different, but the outcome needs to
remain exactly the same, virt-handler doesn't have a problem with that.

Definitely agree, because we are in both cases guessing what libvirt tries to do. Libvirt should almost always do nothing.
If it does nothing, it can't get out of sync with our expectations. If we now do the following:

 * Let's see what libvirt does now
 * Let's write a seitan filter

and we go on, we always have the risk of getting out of sync, but if we would solve it this way:

 * Let's see what libvirt does
 * Let's ensure that it does not do it anymore by instead passing a prepared fd or a prepared socket over an api

Then we are eliminating the actual issue.

If we stick with the first approach, we may sooner or later end up in a situation 
 * where we will also have to count how often specific syscalls are allowed to be performed.
 * where we would have to allow libvirt something which could potentially be exploited because it requires too flexible arguments

Would that be a concern to you?

I am really not entirely sure what is the most future-proof way to go. But I have a feeling that we should be really sure that we want to invest into allowing unprivileged processes to do additional potentially k8s-seccomp-policy-forbidden syscalls.
In general, what I think would be most beneficial for KubeVirt, would be if Libvirt would not try to do any syscalls. At least I would want it seriously considered if we are already investing in that area.

I have the feeling that we are here mostly working around the fact that we can't do with libvirt/qemu what libvirt does with qemu: preparing some stuff outside of the unprivileged space and then just passing file descriptors and sockets to qemu.

File-descriptor passing between virt-handler and virt-launcher, while not initially trivial, would be a one-time thing to establish. Then adding more is simple. As an example, the iscsi-proxy is using it.
It is definitely not unknown territory where we don't know if it would work.
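
For readers less familiar with the alternative, file-descriptor passing over a UNIX socket (SCM_RIGHTS) looks roughly like the following in Python (3.9 or later, Unix only). This is a self-contained sketch, not the virt-handler/virt-launcher or iscsi-proxy code: `pass_fd_demo` is a hypothetical name, and a socket pair plus a pipe stand in for the two components and the privileged resource.

```python
import os
import socket


def pass_fd_demo(payload: bytes) -> bytes:
    """Send a pipe's read end over a UNIX socket and read through the copy.

    `payload` should be small (below the pipe buffer size) so the write
    does not block.
    """
    # `sender` plays the privileged side, `receiver` the unprivileged one.
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, payload)
        os.close(write_fd)
        # The descriptor travels as SCM_RIGHTS ancillary data; at least
        # one byte of ordinary data has to accompany it.
        socket.send_fds(sender, [b"x"], [read_fd])
        _data, fds, _flags, _addr = socket.recv_fds(receiver, 1, 1)
        try:
            # The received descriptor is a new number referring to the
            # same open file description as read_fd.
            return os.read(fds[0], len(payload))
        finally:
            os.close(fds[0])
    finally:
        os.close(read_fd)
        sender.close()
        receiver.close()
```

The transfer itself is short; the part that needs design in a real deployment is what this sketch leaves out: who listens where, when the transfer happens, and what to do on reconnection.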

If we could somehow enable this, I think we get all the benefits you are mentioning but without letting the unprivileged components (which libvirt is as well),
perform actual syscalls which could architecturally be a risky path.

Best regards,
Roman

Stefano Brivio

Jan 31, 2023, 7:11:05 PM
to Roman Mohr, Alice Frosi, kubevirt-dev, David Vossel, Vladik Romanovsky, Luboslav Pivarc
[Answering two emails in one]

Hi Roman,
Right, I see the difference here, even though I'm not sure it makes a
concrete difference in terms of security perimeter.

Taking an example you're familiar with: CVE-2022-1798 doesn't rely on
a specific action initiated by an unprivileged component -- it rather
relies on a mistake in how things were done "upfront" by virt-handler.

The "upfront" part is indeed open to interpretation for the specific
case, but in any case this potentially applies to other
vulnerabilities: just because it's done upfront it doesn't mean it's
(much more) secure.

The idea behind seitan (in the context of KubeVirt) would be to make
those mistakes harder to hit.

> One key difference is explained
> with this sentence which we have to ask ourselves with seitan: "Am I
> (virt-handler) doing a privileged operation now because libvirt needs
> something, or because someone who just did `kubectl exec` into the pod just
> uses the allow-list to do potentially bad stuff?"

Well, it's substantially different from an allow-list -- unprivileged
components wouldn't be able to perform additional system calls anyway.
The system call arguments are specified rather rigidly, and again
upfront.

So yes, it's true that seitan adds an interaction at runtime, but what
the privileged component does is clearly specified, and, circling back
on that example, following a symlink there has really to be done on
purpose.

Also, setting things up upfront might make it harder to reduce the amount
of privileged operations (in the sense of system calls).

If a tap device is created beforehand, virt-handler needs to open the
device, perform the ioctl(), and pass that, with all the (namespacing,
etc.) implications of a file descriptor that was originally opened in a
privileged context. If the privileged component just does an ioctl() on
it, this kind of surface is reduced.

> I think in almost all cases when we had issues so far, because libvirt did
> a syscall or tried to re-write files, we could just let libvirt stop doing
> that.

Maybe yes, I don't have enough KubeVirt-related experience to evaluate
this. Still, what I see is many different mechanisms of performing
privileged operations, ultimately on behalf of an unprivileged
component.

That doesn't mean that seitan is the only way to fix that, of course.
For sure it makes it straightforward to "see" those cases. And, valid
or not, there must be reasons why there are so many different
approaches implemented.

> One interesting example from Stefano is the SCHED example:
> https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-handler/setsched.go#L11
>
> Here we exactly did that: We don't let libvirt do it, we let virt-handler
> do it in advance. Here I don't have to be scared about potentially messing
> up a seitan filter, I personally don't see a difference in
> complexity, regarding writing a json file or golang code. In both cases a
> person needs deep knowledge.

In that case yes, it also looks pretty much the same to me (minus the
fact you can't use UAPI headers, but they should be stable enough by
design and in practice they almost always are).

The only difference I see in complexity is that you need to share that
link, I would have otherwise no idea what you're talking about. If you
had a single place describing privileged operations, they would be
easier to audit and understand.

But it's just an arbitrary categorisation in some sense (albeit one
that appears to make sense to me).

> But if virt-handler does this upfront
> directly, there is little chance to be too generous on the filtering side.

We're not really implementing "filtering" with seitan, I don't see the
chance to be "generous" with that. We're adding an interaction at
runtime that needs to be scrutinised, definitely, but I think it's
substantially different from a mere filter.

> The performed action may still lead to a configuration where virt-launcher
> may abuse the setting, but at least there is no risk to give virt-launcher
> a chance to potentially do configurations itself which we did not think
> about.

Same as above -- I think.

> Does that sound reasonable?

I don't know. :)

The intentions behind seitan are pretty much to 1. describe privileged
operations in a declarative manner, 2. unify their implementation, and
3. reduce the operations that need to be done in a privileged context
to a minimum.

Perhaps, for KubeVirt, the "upfront" aspect (in the sense you describe,
not in Alice's sense) is more important than those. I can only evaluate
this from a generic Linux (or security) perspective, seeing virt-launcher,
virt-handler, libvirtd and qemu as processes, but I don't have enough
KubeVirt-related experience to suggest what's more or less important in
the specific case.

On Mon, 30 Jan 2023 10:29:33 +0100
Roman Mohr <rm...@google.com> wrote:

> On Fri, Dec 23, 2022 at 8:49 PM Stefano Brivio <sbr...@redhat.com> wrote:
>
> > On Tue, 20 Dec 2022 11:09:18 +0100
> > Roman Mohr <rm...@google.com> wrote:
> >
> > [...]
> >
> > > Yes, here I am not sure if the right solution would be to just let this
> > > happen. What I like right now (ignoring the cons for a moment), is that
> > > we just prepare it all upfront and tightly control the env.
> >
> > I had to think for a while about this: it's true that the current way
> > looks different in the sense that, for example, virt-handler sets up a
> > socket, or virtwrap creates a tap device, in advance. But conceptually
> > we're not really proposing to change that in the sense of a looser
> > control of the environment.
> >
> > That is, the most notable difference in this regard (ignoring the
> > advantages for a moment ;)) is rather about synchronisation and timing.
> > Let's take the example of the tap interface creation: now virtwrap
> > creates it beforehand, signals it's done, the guest can start
> > (implementation-wise it's much more complicated than that, which is one
> > thing we address).
> >
> > With the approach we propose, instead, a tap device is created once the
> > guest signals it needs it. But it's exactly the same tap device as
> > configured by virt-handler (via "recipe") which is created, by an
> > external component (not QEMU, not libvirt) -- this doesn't change.
> >
> > What changes is that it's created once QEMU asks for it (in the most
> > natural way, in my opinion).
>
> Hm, but qemu is not asking for it as far as I know.

Well, it "normally" does:

$ strace -eioctl qemu-system-x86_64 -netdev tap,id=x
ioctl(13, TUNGETFEATURES, 0x7ffea35e9d8c) = 0
ioctl(13, TUNSETVNETHDRSZ, 0x7ffea35e9d88) = -1 EBADFD (File descriptor in bad state)
ioctl(13, TUNSETIFF, 0x7ffea35e9d90) = -1 EPERM (Operation not permitted)
qemu-system-x86_64: -netdev tap,id=x: could not configure /dev/net/tun: Operation not permitted
+++ exited with 1 +++

At some point libvirt gained the support to pass a file descriptor
corresponding to a pre-created tap (or macvtap) device:

https://listman.redhat.com/archives/libvir-list/2019-August/msg01256.html

but this was specifically added for KubeVirt purposes:

https://bugzilla.redhat.com/show_bug.cgi?id=1723367

and the ad-hoc corresponding implementation in KubeVirt isn't trivial either:

https://github.com/kubevirt/kubevirt/pull/3290

...so, perhaps too dismissively, this is one instance of what we called
"workarounds" in the seitan proposal.

I see it as a workaround in the sense that qemu can't create an interface,
but something else can. Ultimately, any other meaningful operation on that
file descriptor (and then, interface) is done by qemu.

Just TUNSETIFF can't be done, for good reasons, and that led to re-doing
the same thing somewhere else.
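
To show how small the privileged part is, here is a sketch in Python of the one step that fails in the strace output above. It is an illustration under assumptions, not code from QEMU, libvirt, or KubeVirt: the helper names are hypothetical, the constants are the Linux UAPI values as found on x86-64 and arm64 (TUNSETIFF differs on some other ABIs), and the 40-byte `struct ifreq` layout assumes 64-bit Linux.

```python
import fcntl
import os
import struct

TUNSETIFF = 0x400454CA  # _IOW('T', 202, int) on x86-64 and arm64
IFF_TAP = 0x0002        # layer-2 (Ethernet) device
IFF_NO_PI = 0x1000      # no packet-information header


def build_tap_ifreq(name: str) -> bytes:
    """Pack a struct ifreq asking for a tap device called `name`."""
    raw = name.encode("ascii")
    if not 0 < len(raw) < 16:
        raise ValueError("interface name must be 1 to 15 bytes")
    # 16-byte name, 2-byte flags, then padding up to 40 bytes.
    return struct.pack("16sH22x", raw, IFF_TAP | IFF_NO_PI)


def create_tap(name: str) -> int:
    """Open /dev/net/tun and attach it to tap `name`; needs CAP_NET_ADMIN."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_tap_ifreq(name))
    except OSError:
        os.close(fd)
        raise
    return fd
```

Without CAP_NET_ADMIN, `create_tap` is expected to raise PermissionError at the ioctl, matching the EPERM shown by strace. Everything else done with the descriptor afterwards can stay in the unprivileged process.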

> We just want it to use
> a tap device which we know upfront about. I also don't see this happening
> on live-disk-attach or live-network-disk-attach.

It looks similar to me. You could do it with qemu... except that you can't,
so let's do it somewhere else. Of course, there's no choice. :) But I have
the feeling it would be better to make that very visible.

> It looks to me like it follows a layered pattern in libvirt/qemu as well.

Some of those implementations I know about were essentially driven by
KubeVirt itself. To me they look inconsistent in nature: they exist only
for cases where it's not possible to do stuff from qemu itself without
running it as root. Sure, that's also a pattern. :)

> Still I am not a fan of what we have right now either, often we have to
> actually prepare "concrete" devices, where actually passing file
> descriptors over sockets would be even better.

That would look more consistent to me. Maybe it was never done just because
"nobody got time for that". On the other hand, I'm not sure that would cover
all the examples we are presenting with seitan.

Considering the qemu-pr-helper case with the draft (pre-seitan)
implementation by Alice:
https://github.com/alicefr/example-pidfd-getpid

just passing file descriptors around is not enough, even though you could
add some bits to qemu, and then some bits to libvirt, and then do something
equivalent in KubeVirt.

> > > There is no risk that our bpf filters could somehow diverge from what
> > > libvirt tries to do. In general we just want libvirt to not do things.
> > > That has eventually always been the solution.
> >
> > Actually, I fail to see a big difference here. Just like virt-handler
> > might set up things in a way that's incompatible with a given version
> > of libvirt or QEMU, virt-handler could configure seitan in an
> > incompatible way.
> >
> > It's true that virt-handler doesn't monitor the system calls QEMU does,
> > so, should the system calls be different, but the outcome needs to
> > remain exactly the same, virt-handler doesn't have a problem with that.
>
> Definitely agree, because we are in both cases guessing what libvirt tries
> to do. Libvirt should almost always do nothing.
> If it does nothing, it can't get out of sync with our expectations. If we
> now do the following:
>
> * Let's see what libvirt does now

...with seitan that would be a "testing" topic rather than a design one. If
the guest uses a block device it needs to open(2) it -- the existing
(loosely) POSIX abstraction is already pretty good in my opinion.

> * Let's write a seitan filter
>
> and we go on, we always have the risk of getting out of sync, but if we
> would solve it this way:
>
> * Let's see what libvirt does
> * Let's ensure that it does not do it anymore by instead passing a
> prepared fd or a prepared socket over an api

Hmm... the difference I failed to see is exactly here: you don't know
if the "prepared" file descriptor matches whatever libvirt might expect in
the future. It might have different flags, or be in different states (say,
a tap device where the TUNSETIFF arguments change).

In practice, though:

> Then we are eliminating the actual issue.

probably yes, in the vast majority of cases (as long as passing file
descriptors is enough).

> If we stick with the first approach, we may sooner or later end up in a
> situation
> * where we will also have to count on how often specific syscalls are
> allowed to be performed.

The JSON snippets in the seitan proposal actually mention counters. But
you can probably find a slightly different example where seitan is less
flexible than specific imperative code, by all means.
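
Just to illustrate what I mean by a counter, a rough sketch in Python. This
is NOT the seitan recipe syntax (see the proposal for that): CountedRule,
its fields and allow() are names I'm making up here, and the "connect"
limit of 2 is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class CountedRule:
    """Hypothetical rule: let a given syscall match at most `limit` times."""
    syscall: str
    limit: int
    seen: int = 0

    def allow(self, syscall: str) -> bool:
        # A different syscall never matches this rule.
        if syscall != self.syscall:
            return False
        # Once the budget is used up, further matches are refused.
        if self.seen >= self.limit:
            return False
        self.seen += 1
        return True


# Assumed policy: the monitored process may trigger "connect" twice.
rule = CountedRule(syscall="connect", limit=2)
decisions = [rule.allow("connect") for _ in range(3)]
other = rule.allow("mknod")
```

The third attempt is refused, and so is anything the rule doesn't name.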

> * where we would have to allow libvirt something which could potentially
> be exploited because it requires too flexible arguments

...see above: we're not (just) implementing a filter.

> Would that be a concern to you?

The first point, yes, somewhat -- but I have to say that, while I
understand the category of issues you're pointing to, I don't really have
an example (other than counters, but seitan would implement those).

Do you have one? I can probably go and fish for some otherwise.

About the second point: at least for KubeVirt's own usage, the arguments
of the syscalls performed by seitan would be precomputed before runtime
(so also "upfront"), and libvirt wouldn't be allowed to do anything on top.

> I am really not entirely sure on what is the most future-proof way to go.
> But I have a feeling that we should be really sure that we want to invest
> into allowing unprivileged processes to do additional potentially
> k8s-seccomp-policy-forbidden syscalls.

We wouldn't really allow additional system calls -- we would rather adopt
a model that's conceptually closer to gVisor, or a vast simplification
of it (also because we have VMs, we don't need to do so much).

Similarly to gVisor, you could even use seitan to emulate (even though
not fully "implement") some system calls that you don't want unprivileged
processes to perform directly.

Regardless of that "detail", I understand this looks like (is?) a somewhat
disruptive proposal, and probably having side-by-side options available
for comparison, with a complete working implementation, would give a much
better feeling about the real impact. That, and perhaps usages that go
beyond KubeVirt. We're trying to explore that (too).

> > On the other hand, the current solutions actually look *less* robust to
> > me. Let's say that libvirt 8.10 expects a non-blocking socket for
> > whatever purpose, and 8.11 "by mistake" expects the same thing, but
> > blocking. If virt-handler creates it, ignoring flags libvirtd would
> > otherwise use, you'll run into a somewhat hidden problem.
> >
> > If virt-handler instructs seitan to look for system call arguments, it
> > can decide to refuse creating the socket (because flags don't match),
> > or it can decide to create it with the flags libvirtd wants, or even to
> > create it with a different set of flags because "we know better" in a
> > given case.
> >
> > KubeVirt can make it as compatible or incompatible as desired by
> > writing recipes in a given way.
> >
> > > This is actually pretty simple and we don't have to chase what exactly
> > > libvirt tries to do.
> >
> > This is actually a big argument I see in favour of the approach we
> > propose: call us clumsy, but finding the examples and understanding all
> > the different implementations of privileged operations ultimately done
> > on behalf of unprivileged components took us a number of days.
> >
> > Some of these approaches, taken one at a time, might be pretty simple,
> > but overall they're scattered all over the place and while conceptually
> > similar they are implemented in completely different ways, which makes
> > it much less simple (to us, at least).
> >
> > Defining the attack surface in those terms looks incredibly
> > complicated. And it becomes even harder as more code is going to solve
> > those single, similar problems in different ways.
>
> In general, what I think would be most beneficial for KubeVirt, would be if
> Libvirt would not try to do any syscalls. At least I would want it
> seriously considered if we are already investing in that area.
>
> I have the feeling that we are here mostly working around the fact that we
> can't do with libvirt/qemu what libvirt does with qemu.

I see your point here, and I also see it as a reason for some of (all?) the
existing implementations we mentioned in the proposal.

Regardless of that reason, we're still raising the topic of imperative vs.
declarative approaches.

If qemu doesn't do the system call (even though it already has
well-established, mature code doing that part), and libvirt doesn't do it
either (whether there was code doing it at some point, or not), you'll need
equivalent code in virt-handler.

You could have a declarative, or unified approach to privileged operations,
or both, even without seitan. Especially having privileged operations not
scattered around as they are now would look quite valuable to me.

But still, if you want to use another abstraction (other than system calls)
as a basis for it, I guess you need additional infrastructure (for
synchronisation, i.e. pausing and resuming the guest, and passing
information back and forth in a way that fits your abstraction) and a
second/third implementation ("don't let qemu create a tap device, don't let
libvirt create a tap device, create a tap device in virt-handler").

I'm not entirely sure, because we don't have a comparable alternative, but
the resource abstraction provided by system calls looks quite solid to me.
And it's anyway what the kernel (and its LSMs, essentially) use to allow or
deny anything the VM needs.

> Preparing some stuff outside of the unprivileged space and then just
> passing file-descriptors and sockets to qemu.
>
> File-descriptor passing between virt-handler and virt-launcher, while not
> initially trivial, would be a one-time thing to establish. Then adding more
> is simple. As an example, the iscsi-proxy is using it.

Sure, that's definitely viable. You might still need to build some matching
implementations in libvirt (and possibly qemu), and it only covers cases
where file descriptors are all you need (and not more than what you want),
but both caveats are not necessarily "bad".
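
For reference, the mechanism itself is small. A minimal sketch of passing a
descriptor over a UNIX domain socket with SCM_RIGHTS, in Python (3.9 or
later, for socket.send_fds() and socket.recv_fds()). The socketpair and the
pipe are stand-ins I'm assuming for illustration: this is not how
virt-handler, virt-launcher or the iscsi-proxy are actually implemented.

```python
import os
import socket

# Stand-in for a UNIX socket between a privileged and an unprivileged side.
privileged, unprivileged = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Stand-in for a resource only the privileged side could open.
read_end, write_end = os.pipe()

# Privileged side: send the descriptor as SCM_RIGHTS ancillary data.
socket.send_fds(privileged, [b"fd"], [read_end])
os.close(read_end)  # the in-flight copy keeps the open file alive

# Unprivileged side: receive a new descriptor referring to the same open file.
msg, fds, _flags, _addr = socket.recv_fds(unprivileged, 16, 1)
received = fds[0]

# Data written on the privileged side is readable through the passed fd.
os.write(write_end, b"hello")
os.close(write_end)
payload = os.read(received, 5)

os.close(received)
privileged.close()
unprivileged.close()
```

Note that the receiver gets the descriptor with whatever flags and state the
sender gave it, which is the caveat above.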

> It is definitely not unknown territory where we don't know if it would work.

If it's about "working" in the long term -- sure, the seitan approach isn't
proven at all. About working per se... we have a working example of SCSI
persistent reservation based on an early draft:

https://github.com/alicefr/example-seitan

> If we could somehow enable this, I think we get all the benefits you are
> mentioning but without letting the unprivileged components (which libvirt
> is as well),
> perform actual syscalls which could architecturally be a risky path.

Allow me to repeat this ;) ...unprivileged components wouldn't be
allowed to perform additional system calls, in any case. We could
filter some, more accurately than what seccomp alone can do:

https://github.com/alicefr/community/blob/seitan/design-proposals/seitan/security-aspects-seitan.md#usage-as-security-filter

...but that's not really the focus for the KubeVirt usage we were
thinking of, at the moment.

--
Stefano
