This project was designed while trying to solve a critical and intricate security problem for KubeVirt.
Non-root containers and VMs protect the system by reducing the permissions granted to the workload to a minimum. As a result, certain actions are no longer permitted and are delegated to privileged components. On the other hand, this complicates the synchronization and communication between the privileged and unprivileged components.
The original issue we wanted to solve falls into this category: how to connect an unprivileged QEMU process to a privileged daemon. This isn't natively supported in Kubernetes, and all the initially considered solutions had significant drawbacks.
We also quickly realized that this is a broader issue that applies to many more areas.
KubeVirt has solved each situation individually, and every new feature involving privileged operations has required new critical code to be written, audited, tested, and maintained.
While looking at these existing implementations and at the new problems, seccomp user notifiers [3] appeared to be a rather natural approach. This isn't particularly new or original. Similar examples can be found in LXD [4], and gVisor [5] implements similar concepts based on a userspace application kernel, where the platform interface sits approximately at the system call level.
The novelty introduced by seitan compared to the existing solutions is the declarative approach over the imperative programming model. In our case, the expected privileged syscalls are listed in a JSON input file, and each is associated with a privileged operation that is delegated to and executed by the privileged component.
If new features arise, we want a mechanism that can be extended in a simple way. With seitan, no additional code is required: a new entry simply has to be added to the seitan input file.
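To make the declarative idea a bit more concrete, here is a minimal sketch in Python. The field names (`syscall`, `match`, `action`), the action name, and the `find_action` helper are all hypothetical and are not taken from seitan's actual input format; the point is only that supporting a new case means adding a JSON entry rather than writing new code.

```python
import json

# Hypothetical input file: the field names are illustrative only and do not
# reflect seitan's real schema.
POLICY_JSON = """
[
  {
    "syscall": "connect",
    "match": {"path": "/var/run/pr-helper.sock"},
    "action": "connect_on_behalf"
  }
]
"""


def find_action(policy, syscall, args):
    """Return the action of the first entry whose syscall and arguments all match."""
    for entry in policy:
        if entry["syscall"] != syscall:
            continue
        if all(args.get(key) == value for key, value in entry["match"].items()):
            return entry["action"]
    return None  # not declared in the input file: nothing is done on its behalf


policy = json.loads(POLICY_JSON)
```

With this sketch, `find_action(policy, "connect", {"path": "/var/run/pr-helper.sock"})` yields the declared action, while any undeclared syscall or argument combination yields `None`.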
Hi Alice,

First, great to see that the bind-mount problem is being tackled!

On Fri, Dec 16, 2022 at 5:19 PM Alice Frosi <afr...@redhat.com> wrote:
> Hi everyone,
>
> we would like to introduce a new project called seitan [1] and our idea of working with KubeVirt.
> I opened a PR [1] with the design proposal, and there you can find plenty of details. Here, I'd like to outline the main idea.
> Seitan project aims to provide a straightforward and configurable way to perform system calls from a privileged component on behalf of an unprivileged process, using monitored system calls as triggering events. It also provides a deep inspection of system call arguments, which are usable in a declarative policy as "matches".
> Please note that the code published under seitan [1] doesn't yet reflect the proposed design; it's rather an early proof-of-concept based on a much more rudimentary initial idea. New code will be published gradually in the coming weeks.
> I hope I piqued your attention, and we would greatly appreciate any comments or suggestions.

Is it fair to say that your proposal is intended to solve disk, file, and socket mounting in various scenarios?
If yes, that sounds in general like a good boundary to me (I would be concerned that some "unneeded" privileges may not get removed from components, just because we could solve it with seccomp filters).
So I guess the natural alternative would then be transferring a file descriptor over a unix socket. Here I would love to hear your thoughts on:
* the pros/cons of both
* a risk assessment regarding potential side effects on vendor- or Linux-specific security mechanisms.
Again, great to see this being discussed.

Best regards,
Roman
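For readers unfamiliar with the alternative mentioned above, here is a minimal sketch of passing a file descriptor over a unix socket (SCM_RIGHTS ancillary data). It assumes Python 3.9+ on a Unix system. Both ends live in a single process and a pipe stands in for the real resource, purely for illustration; in KubeVirt the two ends would be separate processes.

```python
import os
import socket

# Stand-ins for the privileged and unprivileged processes (one process here
# for brevity; a socketpair replaces a real named unix socket).
privileged, unprivileged = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# The "privileged" side opens a resource; a pipe stands in for a disk or socket.
read_end, write_end = os.pipe()

# Send the descriptor as SCM_RIGHTS ancillary data.
socket.send_fds(privileged, [b"fd"], [read_end])

# The "unprivileged" side receives its own descriptor for the same open file.
msg, fds, _flags, _addr = socket.recv_fds(unprivileged, 16, 1)
received_fd = fds[0]

# Data written by the privileged side is readable through the received fd.
os.write(write_end, b"hello")
data = os.read(received_fd, 5)
```

Note that both sides have to agree on when the descriptor is sent and received, which is the synchronization cost discussed in this thread.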
Hi Roman,

On Mon, Dec 19, 2022 at 6:32 PM Roman Mohr <rm...@google.com> wrote:
> Is it fair to say, that your proposal is intended to solve disk, file and socket-mounting in various scenarios?

Not only. It is true that connecting QEMU to privileged daemons, opening disks, and creating tap devices can be reduced to the same problem of passing a file descriptor. We have mostly concentrated on these cases because they are well-known problems.

However, it also applies to other syscalls. We have given the example of changing the vCPU scheduler and priority [1]. In this case, it is virt-handler that performs the syscall. We propose to filter this syscall: when QEMU executes it, the syscall will be intercepted by the seccomp monitor, the arguments will be checked, and then the corresponding action will be executed by the privileged component. Generally, the action can be the same syscall, but not necessarily.
> If yes, that sounds in general like a good boundary to me (I would be concerned that some "unneeded" privileges may not get removed from components, just because we could solve it with seccomp filters).

Well, the unprivileged component doesn't have those privileges. Seccomp notifiers are simply a way to inform the privileged component that the monitored process is trying to execute this syscall. The syscall is then executed by the privileged component if it is expected and has been previously declared inside the seitan input file.

> So, I guess what would then come up as a natural alternative, would be transferring a file-descriptor over a unix socket. Here I would love to hear your thoughts on
> * the pros/cons of both

Sure, here are a couple of advantages of seccomp notifiers over fds and unix sockets:
- Passing file descriptors over unix sockets is more complex. It requires a synchronization mechanism, with its overhead, between the two user-space applications; in the case of KubeVirt, between virt-handler and virt-launcher. With seccomp notifiers, you get this synchronization for "free". Let's take SCSI persistent reservations as an example, where QEMU wants to connect to a daemon at a certain path. The flow is:
  - QEMU performs a connect syscall to a specific path (e.g. /var/run/pr-helper.sock)
  - the seccomp monitor is notified, as connect belongs to the filtered syscalls, and on the QEMU side the connect is blocked
  - the seccomp monitor can then take the file descriptor from QEMU (the file descriptor information is already passed to the monitor by the seccomp notification) and connect it to the privileged daemon
  - finally, the seccomp monitor can simply let the connect syscall from QEMU continue successfully
- QEMU and libvirt need to be adapted to support all the cases we want to handle with file descriptors. With seccomp we don't need to modify them, and they aren't even aware that they are running in a less privileged environment. In the case of disks, the current libvirt version doesn't support this. The same goes for SCSI persistent reservations.
- Reconnections and failures don't need to be handled differently. In the case of the connect syscall, one of the two sides could encounter a failure and need to be restarted. In this case, another connect syscall will be executed, but we don't need to repeat the process of exchanging file descriptors if something goes wrong. Opening disks is probably easier, as there is a single process. But in any case, with fd passing the file descriptor exchange needs to be done again if QEMU fails and retries opening a file.
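The connect flow in the first point relies on one property: connecting a duplicate of a socket's file descriptor also connects the original socket, because both descriptors refer to the same open file description. The sketch below demonstrates only that property, in a single Python process. `os.dup()` is a stand-in for however a monitor would obtain QEMU's descriptor from another process (on Linux, something like pidfd_getfd(2) could do it; that is an assumption, not something stated in this thread), and the temporary socket path is hypothetical. No seccomp filter is installed here.

```python
import os
import socket
import tempfile

# Hypothetical socket path standing in for the privileged daemon's socket.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "pr-helper.sock")

daemon = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
daemon.bind(path)
daemon.listen(1)

# The socket QEMU created; in the real flow its own connect() is held by the filter.
qemu_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

# Monitor side: os.dup() stands in for obtaining QEMU's fd from another process.
monitor_view = socket.socket(fileno=os.dup(qemu_sock.fileno()))
monitor_view.connect(path)
monitor_view.close()  # closes only the duplicate descriptor

# QEMU's original descriptor is now connected, although QEMU never connected itself.
qemu_sock.sendall(b"ping")
conn, _ = daemon.accept()
received = conn.recv(4)
```

If the daemon restarts and QEMU issues a new connect, the same path is simply taken again, which is the reconnection property described in the last point.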
> * a risk assessment regarding to potential side-effects on vendor- or linux specific security mechanisms.

Yes, we still need to tackle these aspects. Together with @David Vossel, we already discussed offline that we should work together with SELinux, especially for accessing files using file descriptors, as the SELinux labels are bypassed with this mechanism. We are evaluating letting the seccomp monitor inherit the SELinux context of the monitored process. In this way, the seccomp monitor will fail to open files without the proper label. Another possibility could be to extend the seitan input file and set the SELinux labels for the path there.

BTW, passing file descriptors over a unix socket raises a similar security concern.

We should have a similar mechanism for AppArmor.

If this design proposal sounds promising to the KubeVirt community, then we can plan more effort into solving and supporting SELinux and AppArmor.
> We propose to filter this syscall, and when QEMU executes it, the syscall will be filtered by the seccomp monitor, the arguments will be checked, and then the corresponding action will be executed by the privileged component. Generally, the action can be the same syscall, but not necessarily.

Yes, here I am not sure if the right solution would be to just let this happen. What I like right now (ignoring the cons for a moment) is that we just prepare it all upfront and tightly control the env. There is no risk that our bpf filters could somehow diverge from what libvirt tries to do. In general we just want libvirt to not do things. That has eventually always been the solution.

This is actually pretty simple and we don't have to chase what exactly libvirt tries to do. While in both cases we can have regressions, in the first case (let libvirt just not do a privileged operation), it cannot diverge in potentially untested combinations. To give an extreme example: our bpf filter would have to know all possible if/else paths in libvirt for argument building, so that we are sure we allow all valid (and only the valid) arguments.
> - Reconnections and failure don't need to be handled differently. In the case of the connection syscall, one of the two sides could encounter a failure and need to be restarted. In this case, another connection syscall will be executed, but we don't need to renew the process of exchanging file descriptors if something goes wrong. Probably, opening disks is easier as there is a single process. But in any case, the file descriptor exchange needs to be done if QEMU fails and retries another time to open a file.

Yes, this is an interesting point. The current mounts are super-annoying, but they continue to work without having virt-handler running, which has so far been a core property. If a VM manages to enter the running state, you can't do administrative tasks on it anymore (hotplug, console, ...) but the disks are in general safe to continue running without side effects. Is this a concern?
> If this design proposal sounds promising for KubeVirt community then we can plan more effort into solving and supporting SELinux and AppArmor.

I would prefer to discuss this a little bit more first. I added a few more examples of where this approach may have unintended side effects and complications.

In general I see relatively few problems with mounting containerDisks and hotplug disks. There the fd should never break unless the VM goes down, and there we want to stop using it anyway.

But the socket-based approach, while having an initial overhead of establishing the fd passing, is still in the game for me now. It has a very limited scope, with very low risk of unintentionally running into e.g. AppArmor or SELinux by trying to solve cases where e.g. libvirt should maybe just not do something.

I am not ruling out the filter solution, but for this limited scope the socket-based approach has the advantage of keeping the usual preparation flow: first virt-handler sets up the env, then virt-launcher can operate without virt-handler.

For e.g. the SCSI reservation socket, I am not yet sure that either of the two paths, bpf filter or socket fd transfer, is ultimately better than a mount.
Best regards,
Roman
Hi Roman,

> Yes, here I am not sure if the right solution would be to just let this happen. What I like right now (ignoring the cons for a moment), is that we just prepare it all upfront and tightly control the env.

> This is actually pretty simple and we don't have to chase what exactly libvirt tries to do.

Stefano already gave a very detailed answer in the previous email. However, I still want to emphasize that we should only list the expected and privileged syscalls. This mechanism is designed to replace the code where KubeVirt needs to reimplement those syscalls or access privileged resources.

I don't think we need to chase all the possible combinations and paths that libvirt may take; rather, we implement only the operations it cannot do because it lacks privileges. Generally, they should be pretty simple and well defined.

About preparing the environment upfront: yes, this is definitely a core property that we have tried to maintain! IMHO, this proposal just reinforces it. The seitan input files need to list exactly the operations that will be executed. The expected syscalls and arguments will be precomputed and generated based on the VMI definition before launching libvirt and QEMU.
> There is no risk that our bpf filters could somehow diverge from what
> libvirt tries to do. In general we just want libvirt to not do things.
> That has eventually always been the solution.
Actually, I fail to see a big difference here. Just like virt-handler
might set up things in a way that's incompatible with a given version
of libvirt or QEMU, virt-handler could configure seitan in an
incompatible way.
It's true that virt-handler doesn't monitor the system calls QEMU makes,
so if the system calls change while the outcome remains exactly the
same, virt-handler doesn't have a problem with that.