[kubevirt-dev] SCSI persistent reservation in KubeVirt


Alice Frosi

Aug 18, 2022, 8:05:15 AM
to kubevirt-dev, Paolo Bonzini, Vladik Romanovsky, Roman Mohr
Hi everyone,

I'd like to open a discussion about support for SCSI persistent reservations in KubeVirt. You can read more details about the feature in [1].

Persistent reservations are performed by qemu-pr-helper [2], a privileged daemon that executes the privileged SCSI operations on behalf of QEMU. In order to allow persistent reservations for unprivileged VMs, QEMU needs to be able to talk to the pr-helper through a UNIX socket. My current plan is to deploy the pr-helper daemon inside the virt-handler pod in a separate container. Then, virt-handler can make the pr-helper socket available to virt-launcher and the unprivileged guest.

Here, I'd like to discuss in more detail the various options to make the connection between QEMU and the pr-helper:

1. The straightforward solution consists in letting virt-handler bind-mount the socket inside the virt-launcher pod if the VMI requests persistent reservation. I have already implemented a PoC, available in [3]. We already perform various mounting operations in KubeVirt for container and hotplugged disks; this option adds one more mount operation for the socket.
The drawback of this solution is that the mount operations performed by virt-handler aren't transparent to Kubernetes. For example, kubelet could fail to clean up the virt-launcher pod if virt-handler hasn't performed the unmount yet.

2. Create a proxy connection between every VM and the pr-helper. This solution doesn't require any mount operations, as the proxy socket could be created by virt-handler directly in the virt-launcher path. However, we cannot blindly proxy the connection as-is: we also need to forward the SCM_RIGHTS ancillary data, which means the proxy needs a minimal understanding of the pr-helper protocol (a rough sketch of such a forwarding loop follows the list of options). We might also expect some overhead due to the proxied connection.

3. You might also think of a CSI driver solution. I'll exclude this possibility because we would need a single PV that contains the directory with the socket, but we cannot have multiple PVCs that reference the same PV. PVCs are namespace-scoped, so the same socket cannot be mounted into VMs in different namespaces.
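To make option 2 more concrete, here is a rough Go sketch of the forwarding loop such a proxy would need (not an actual implementation; the socket paths are made up). A plain byte copy is not enough because the pr-helper protocol passes a file descriptor (the SCSI device) as SCM_RIGHTS ancillary data, so the proxy has to relay the out-of-band part as well:

package main

import (
	"log"
	"net"
)

// copyWithRights forwards both the data stream and any SCM_RIGHTS
// ancillary data (passed file descriptors) from src to dst.
// Note: fds received here get installed in the proxy process and stay
// open; a real implementation should parse the control messages and
// close its copies after forwarding.
func copyWithRights(dst, src *net.UnixConn) {
	buf := make([]byte, 64*1024)
	oob := make([]byte, 1024)
	for {
		n, oobn, _, _, err := src.ReadMsgUnix(buf, oob)
		if err != nil {
			return
		}
		if _, _, err := dst.WriteMsgUnix(buf[:n], oob[:oobn], nil); err != nil {
			return
		}
	}
}

func main() {
	// Hypothetical paths: the proxy listens where QEMU expects the
	// pr-helper socket and forwards to the real qemu-pr-helper socket.
	ln, err := net.ListenUnix("unix", &net.UnixAddr{Name: "/var/run/kubevirt/pr-helper-proxy.sock", Net: "unix"})
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.AcceptUnix()
		if err != nil {
			log.Fatal(err)
		}
		helper, err := net.DialUnix("unix", nil, &net.UnixAddr{Name: "/var/run/qemu-pr-helper.sock", Net: "unix"})
		if err != nil {
			log.Print(err)
			client.Close()
			continue
		}
		go copyWithRights(helper, client) // QEMU -> pr-helper (carries the device fd)
		go copyWithRights(client, helper) // pr-helper -> QEMU (replies)
	}
}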

I'd like to have some feedback about the various solutions. I'm more familiar with the first solution as I have already prototyped it, and I might be missing other problems with the second solution.
Any help or feedback is highly appreciated :)


Many thanks,

Alice

Alice Frosi

Oct 7, 2022, 4:21:57 AM
to kubevirt-dev, Paolo Bonzini, Vladik Romanovsky, Roman Mohr, Adam Litke
Hi everyone,

I have been working on an example that illustrates how we could proxy the communication between QEMU running in an unprivileged container and the pr-helper (a slightly different version of solution 2 in the previous email). I'd like feedback on whether this could be implemented and integrated into KubeVirt. The example is available in [4] and you can find all the technical details in the README.

So far, this is my favorite solution as it avoids the mounting issue described in the previous email. The example uses the syscall pidfd_getfd [5], which allows duplicating a file descriptor from another process into the calling process. With this, we can connect two processes running in two different containers with a UNIX socket. The socket doesn't need to be present in the container of the calling process, as it is able to retrieve the fd with the syscall. The major drawback I see with this solution is that pidfd_getfd was introduced in kernel 5.6 (on the host, not the guest), so this feature could only be used on recent kernels. Is this a reasonable solution and limitation?
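For reference, here is a rough sketch (not the code in [4]; function and file names are made up) of how the fd duplication works with the wrappers in golang.org/x/sys/unix. The caller needs ptrace-level access to the target process, and how the target fd number is discovered is out of scope here:

package fdsteal

import (
	"net"
	"os"

	"golang.org/x/sys/unix"
)

// stealSocketFD duplicates file descriptor targetFD of process pid into
// the calling process using pidfd_open(2) + pidfd_getfd(2) (kernel >= 5.6).
func stealSocketFD(pid, targetFD int) (net.Conn, error) {
	pidfd, err := unix.PidfdOpen(pid, 0)
	if err != nil {
		return nil, err
	}
	defer unix.Close(pidfd)

	fd, err := unix.PidfdGetfd(pidfd, targetFD, 0)
	if err != nil {
		return nil, err
	}

	f := os.NewFile(uintptr(fd), "stolen-unix-socket")
	defer f.Close() // net.FileConn dups the fd, so this copy can be closed

	return net.FileConn(f)
}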


Many thanks,
Alice

Alice Frosi

Oct 7, 2022, 5:20:41 AM
to kubevirt-dev, Vladik Romanovsky, Roman Mohr
Apologies, one correction, otherwise the idea doesn't make sense:

The example uses the syscall pidfd_getfd [5], which allows duplicating a file descriptor from another process into the calling process. With this, we can connect two processes running in two different containers with a UNIX socket. The socket doesn't need to be present in the container of the calling unprivileged process, as it is able to retrieve the fd with the syscall.

Ilya Maximets

Oct 13, 2022, 4:25:43 AM
to kubevirt-dev
Hi Alice,

Some time back I was thinking about a similar problem for vhost-user connections and containers, i.e. how to connect DPDK applications from the inside of a container with OVS on the host or between each other.  One idea I had is to create a separate daemon called a 'socket-pair broker' that can create socket pairs and give them to other processes, so all containers only need to have a single main UNIX socket mounted and can talk with the service to negotiate creation of socket pairs for different needs.

I have a reference implementation of the protocol [1], a daemon and a client library [2].  It doesn't use any cutting-edge system calls, so it can be used pretty much anywhere.
The code was never used in a production environment, so there might be some issues and some permissions handling should be added, but it should be enough for a controlled environment.

Not sure if it is suitable for your use case, just sharing what I have, in case it might be interesting.


Best regards, Ilya Maximets.

Stefan Hajnoczi

Oct 13, 2022, 4:25:43 AM
to kubevirt-dev
Hi Alice,
I'm curious about the general problem of cross-namespace UNIX domain sockets, which also comes up with vhost-user devices in Kubernetes.

If QEMU had a listen socket for pr-helper to connect to (inverting the direction of the socket connection), then could the proxy process be eliminated? QEMU listens and the privileged pr-helper container temporarily enters the mount namespace and connects to QEMU's UNIX domain socket. Now QEMU and pr-helper are connected without a proxy.
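For illustration, a rough Go sketch of how the privileged side could reach a listening socket inside the launcher's mount namespace without a long-lived proxy. This variant goes through /proc/<pid>/root/ instead of setns(2), which is awkward to do from a multi-threaded process; the path and PID handling are made up:

package nsdial

import (
	"fmt"
	"net"
)

// dialInMountNamespace connects to a UNIX socket that lives inside the
// mount namespace of process pid, by going through /proc/<pid>/root/.
// This requires ptrace-level access to the target process (e.g. a
// privileged virt-handler reaching into virt-launcher).
func dialInMountNamespace(pid int, socketPath string) (net.Conn, error) {
	// Note: sun_path is limited to ~108 bytes, so the combined path
	// must stay short.
	path := fmt.Sprintf("/proc/%d/root%s", pid, socketPath)
	return net.Dial("unix", path)
}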

I will let Ilya Maximets know about this discussion because he previously proposed a broker service for connectivity between namespaces. That's another approach to solving the general problem.

Stefan

Alice Frosi

Oct 13, 2022, 5:17:36 AM
to Stefan Hajnoczi, kubevirt-dev, Ilya Maximets
Hi Stefan,

On Thu, Oct 13, 2022 at 10:25 AM Stefan Hajnoczi <shaj...@redhat.com> wrote:
Hi Alice,
I'm curious about the general problem of cross-namespace UNIX domain sockets, which also comes up with vhost-user devices in Kubernetes.

I still need to carefully read Ilya's previous email and his work on vhost-user.
 

If QEMU had a listen socket for pr-helper to connect to (inverting the direction of the socket connection), then could the proxy process be eliminated? QEMU listens and the privileged pr-helper container temporarily enters the mount namespace and connects to QEMU's UNIX domain socket. Now QEMU and pr-helper are connected without a proxy.

Yes, the direction is the actual issue here. Having a listening socket in the unprivileged container and connecting from the privileged container isn't an issue. It is even how virt-handler (privileged) establishes a communication channel with virt-launcher (unprivileged).
Unfortunately, in my case, the direction is the opposite: QEMU needs to connect to the pr-helper.
 

Alice Frosi

Oct 13, 2022, 6:06:46 AM
to Ilya Maximets, kubevirt-dev
Hi Ilya,

I have a couple of questions on this.

On Thu, Oct 13, 2022 at 10:25 AM Ilya Maximets <imax...@redhat.com> wrote:
Hi Alice,

Some time back I was thinking about a similar problem for vhost-user connections and containers, i.e. how to connect DPDK applications from the inside of a container with OVS on the host or between each other.  One idea I had is to create a separate daemon called 'socker-pair broker' that can create socket pairs and give them to other processes, so all containers only need to have a single main UNIX socket mounted and can talk with a service and negotiate creation of socket pairs for different needs.

I have a reference implementation of a protocol [1], daemon and a client library [2].  It doesn't use any cutting edge system calls, so can be used pretty much anywhere.
The code was never used in production environment, so there might be some issues and some permissions handling should be added, but it should be enough for a controlled environment.

Not sure if it is suitable for your use case, just sharing what I have, in case it might be interesting.

How does the listening server connect to the broker daemon? I understand the client part, but I'm missing the server in the picture. Additionally, this would require changes in the pr-helper, right?

Secondly, where would you deploy this broker daemon? Is it a privileged component? I'm not sure if an unprivileged client could just connect to a privileged component; I see this as a bit problematic. Could you maybe expand a bit on this?

Does the broker socket need to be mounted inside the container? We'd like to avoid the need to mount a socket in the virt-launcher pod for the reasons I described in the first email of this thread. This is also the main reason for the pidfd_getfd syscall.

Many thanks,
Alice 

Ilya Maximets

Oct 13, 2022, 7:41:04 AM
to Alice Frosi, kubevirt-dev
On Thu, Oct 13, 2022 at 12:07 PM Alice Frosi <afr...@redhat.com> wrote:
Hi Ilya,

I have a couple of questions on this.

On Thu, Oct 13, 2022 at 10:25 AM Ilya Maximets <imax...@redhat.com> wrote:
Hi Alice,

Some time back I was thinking about a similar problem for vhost-user connections and containers, i.e. how to connect DPDK applications from the inside of a container with OVS on the host or between each other.  One idea I had is to create a separate daemon called 'socker-pair broker' that can create socket pairs and give them to other processes, so all containers only need to have a single main UNIX socket mounted and can talk with a service and negotiate creation of socket pairs for different needs.

I have a reference implementation of a protocol [1], daemon and a client library [2].  It doesn't use any cutting edge system calls, so can be used pretty much anywhere.
The code was never used in production environment, so there might be some issues and some permissions handling should be added, but it should be enough for a controlled environment.

Not sure if it is suitable for your use case, just sharing what I have, in case it might be interesting.

How does the listening server connect to the broker daemon? I understand the client par but I'm missing the server in the picture. Additionally, this would require changes in the pr-helper right?

For the broker daemon, both the server and a client are just clients.  Two clients connect to the broker, the broker gives them a socket pair, both clients disconnect from the broker and communicate directly with each other over the socket pair they now have.  The connection is already established and there is no need to listen() or connect(), just send()/recv() right away.
This is more optimized for 1:1 communications.  For a 1:N topology, one of the broker clients (a.k.a. the server) may request a new connection from the broker right after receiving one for a previous client.  The broker holds client connections, so we will not miss them.  This case can also be optimized by not closing the server-to-broker connection, so the broker can keep supplying socket pairs to the 'server' as new clients arrive.  I don't remember if that is implemented in one-socket, but it should not be hard to add support for that case.
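To make the workflow concrete, here is a rough sketch in Go (not the actual one-socket code) of the broker's core step: create a socketpair and hand one end to each of the two already-matched clients as SCM_RIGHTS ancillary data.

package broker

import (
	"net"

	"golang.org/x/sys/unix"
)

// pairClients creates a connected socket pair and sends one end to each
// of the two (already key-matched) broker clients via SCM_RIGHTS.
func pairClients(a, b *net.UnixConn) error {
	fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM, 0)
	if err != nil {
		return err
	}
	// The broker no longer needs its own copies once the fds are sent.
	defer unix.Close(fds[0])
	defer unix.Close(fds[1])

	// Send one data byte alongside the fd: on a stream socket the
	// ancillary data needs to ride on at least one byte of payload.
	if _, _, err := a.WriteMsgUnix([]byte{0}, unix.UnixRights(fds[0]), nil); err != nil {
		return err
	}
	_, _, err = b.WriteMsgUnix([]byte{0}, unix.UnixRights(fds[1]), nil)
	return err
}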

I'm not familiar enough with KubeVirt, but I would say that the pr-helper will need to be able to talk with the broker, so it needs to understand the workflow; some changes will be needed to add support.
 

Secondly, where would you deploy this broker daemon? Is this a privileged component? I'm not sure if the unprivileged client could just connect to a privileged component. I see this as a bit problematic. Maybe can you expand a bit on this?

It's not a privileged component, all it does is listen()/socketpair()/send(), fairly basic socket operations.  Clients just need a way to connect to it.
It can run as a systemd daemon on the host (possibly socket-activated) or anywhere else from where you can get the one socket file out.
 

Does the broker socket need to be mounted inside the container? We'd like to avoid the need of mounting a socket in the virt-launcher pod for the reasons I have described in the first email of this thread. This is also the main reason for the pidfd_getfd syscall.

Yes, you need to mount the broker socket to containers, so it might not be an option for you.  The point is that it is one and only socket that needs to be mounted.  It will be the same socket file for every container and it is not really service-specific, it's a very generic service.
IMHO, the world would be a much simpler place if some one-socket-like service were available system-wide in Kubernetes and containers were able to request it as a resource.  I think this can be achieved by creating a Kubernetes device plugin.

Best regards, Ilya Maximets.
 

Many thanks,
Alice 


Stefan Hajnoczi

Oct 13, 2022, 8:06:42 AM
to kubevirt-dev
On Thursday, October 13, 2022 at 5:17:36 AM UTC-4 Alice Frosi wrote:
Hi Stefan,

On Thu, Oct 13, 2022 at 10:25 AM Stefan Hajnoczi <shaj...@redhat.com> wrote:
Hi Alice,
I'm curious about the general problem of cross-namespace UNIX domain sockets, which also comes up with vhost-user devices in Kubernetes.

I still need to carefully read previous Ilya's email and work on vhost-user.
 

If QEMU had a listen socket for pr-helper to connect to (inverting the direction of the socket connection), then could the proxy process be eliminated? QEMU listens and the privileged pr-helper container temporarily enters the mount namespace and connects to QEMU's UNIX domain socket. Now QEMU and pr-helper are connected without a proxy.

Yes, the direction is the actual issue here. Having a listening socket in the unprivileged container and connecting from the privileged container isn't an issue. It is even the way how virt-handler (privileged) establishes a communication channel with virt-launcher (unprivileged).
Unfortunately, in my case, the direction is the opposite QEMU needs to connect to the pr-helper.

Is it not possible to adjust the QEMU/pr-helper code to allow connecting the other way around too?

Stefan
 

Alice Frosi

Oct 13, 2022, 8:12:32 AM
to Stefan Hajnoczi, Paolo Bonzini, kubevirt-dev
Something that maybe Paolo can answer? 
It might even be enough for QEMU to accept a file descriptor that the management layer has already connected to the pr-helper.
In any case, it isn't something I can use immediately; I plan to get rid of the proxy in a second step.

Alice

Stefan
 


Alice Frosi

Oct 13, 2022, 8:53:48 AM
to Ilya Maximets, kubevirt-dev
On Thu, Oct 13, 2022 at 1:41 PM Ilya Maximets <i.max...@redhat.com> wrote:


On Thu, Oct 13, 2022 at 12:07 PM Alice Frosi <afr...@redhat.com> wrote:
Hi Ilya,

I have a couple of questions on this.

On Thu, Oct 13, 2022 at 10:25 AM Ilya Maximets <imax...@redhat.com> wrote:
Hi Alice,

Some time back I was thinking about a similar problem for vhost-user connections and containers, i.e. how to connect DPDK applications from the inside of a container with OVS on the host or between each other.  One idea I had is to create a separate daemon called 'socker-pair broker' that can create socket pairs and give them to other processes, so all containers only need to have a single main UNIX socket mounted and can talk with a service and negotiate creation of socket pairs for different needs.

I have a reference implementation of a protocol [1], daemon and a client library [2].  It doesn't use any cutting edge system calls, so can be used pretty much anywhere.
The code was never used in production environment, so there might be some issues and some permissions handling should be added, but it should be enough for a controlled environment.

Not sure if it is suitable for your use case, just sharing what I have, in case it might be interesting.

How does the listening server connect to the broker daemon? I understand the client par but I'm missing the server in the picture. Additionally, this would require changes in the pr-helper right?

For the broker daemon both the server and a client are just clients.  Two clients connect to a broker, the broker gives them a socket pair, both clients disconnect from the broker and communicate directly between each other over the socket pair they now have.  Connection is already established and there is no need to listen() or connect(), just send()/recv() right away.
This is more optimized for 1:1 communications.  For a 1:N topology, one of the broker clients (a.k.a. server) may request a new connection from the broker right after receiving one for a previous client.  Broker holds client connections, so we will not miss them.  This case can also be optimized by not closing server-to-broker connection, so broker can supply socket pairs to the 'server' with new and new clients.  I don't remember if that is implemented in one-socket, but it should not be hard to add support for that case.

Thanks!
 

I'm not familiar enough with kubevirt, but I would say that pr-helper will need to be able to talk with a broker, so it needs to understand the workflow, so some changes will be needed to add support.

This isn't really related to KubeVirt but to QEMU.
 
 

Secondly, where would you deploy this broker daemon? Is this a privileged component? I'm not sure if the unprivileged client could just connect to a privileged component. I see this as a bit problematic. Maybe can you expand a bit on this?

It's not a privileged component, all it does is listen()/socketpair()/send(), fairly basic socket operations.  Clients just need a way to connect to it.
It can run as a systemd daemon on the host (possibly socket-activated) or anywhere else from where you can get the one socket file out.

Yes, I understand this. But we could have privileged and unprivileged components that want to communicate. AFAIU, the broker doesn't make this distinction, and in this case it is the broker that manages the connection. For example, unwanted components might manage to connect to privileged containers.
In my example, the model is a bit different because the connection is established by the privileged component.
Or am I missing something here?
 
 

Does the broker socket need to be mounted inside the container? We'd like to avoid the need of mounting a socket in the virt-launcher pod for the reasons I have described in the first email of this thread. This is also the main reason for the pidfd_getfd syscall.

Yes, you need to mount the broker socket to containers, so it might not be an option for you.  The point is that it is one and only socket that needs to be mounted.  It will be the same socket file for every container and it is not really service-specific, it's a very generic service.
IMHO, the world would be a much simpler place if some one-socket-like service is available system-wide in kubernetes and containers were able to request it as a resource.  I think this can be achieved by creating a kubernetes device plugin.

Well, the mount of the socket is problematic because the bind mount isn't transparent to Kubernetes, as it is done by KubeVirt. If the unmount of the socket isn't done properly by KubeVirt, Kubernetes cannot clean up and unmount the container filesystem. Otherwise, I could simply mount the pr-helper socket and QEMU could directly connect to it; this is how I implemented the first PoC.

In your case, what would the resource managed by the Kubernetes device plugin be? I think you are talking about the resource exposed through vhost-user here, not the socket, right?
 

Alice

Ilya Maximets

Oct 13, 2022, 11:21:36 AM
to Alice Frosi, kubevirt-dev
On Thu, Oct 13, 2022 at 2:53 PM Alice Frosi <afr...@redhat.com> wrote:


On Thu, Oct 13, 2022 at 1:41 PM Ilya Maximets <i.max...@redhat.com> wrote:


On Thu, Oct 13, 2022 at 12:07 PM Alice Frosi <afr...@redhat.com> wrote:
Hi Ilya,

I have a couple of questions on this.

On Thu, Oct 13, 2022 at 10:25 AM Ilya Maximets <imax...@redhat.com> wrote:
Hi Alice,

Some time back I was thinking about a similar problem for vhost-user connections and containers, i.e. how to connect DPDK applications from the inside of a container with OVS on the host or between each other.  One idea I had is to create a separate daemon called 'socker-pair broker' that can create socket pairs and give them to other processes, so all containers only need to have a single main UNIX socket mounted and can talk with a service and negotiate creation of socket pairs for different needs.

I have a reference implementation of a protocol [1], daemon and a client library [2].  It doesn't use any cutting edge system calls, so can be used pretty much anywhere.
The code was never used in production environment, so there might be some issues and some permissions handling should be added, but it should be enough for a controlled environment.

Not sure if it is suitable for your use case, just sharing what I have, in case it might be interesting.

How does the listening server connect to the broker daemon? I understand the client par but I'm missing the server in the picture. Additionally, this would require changes in the pr-helper right?

For the broker daemon both the server and a client are just clients.  Two clients connect to a broker, the broker gives them a socket pair, both clients disconnect from the broker and communicate directly between each other over the socket pair they now have.  Connection is already established and there is no need to listen() or connect(), just send()/recv() right away.
This is more optimized for 1:1 communications.  For a 1:N topology, one of the broker clients (a.k.a. server) may request a new connection from the broker right after receiving one for a previous client.  Broker holds client connections, so we will not miss them.  This case can also be optimized by not closing server-to-broker connection, so broker can supply socket pairs to the 'server' with new and new clients.  I don't remember if that is implemented in one-socket, but it should not be hard to add support for that case.

Thanks!
 

I'm not familiar enough with kubevirt, but I would say that pr-helper will need to be able to talk with a broker, so it needs to understand the workflow, so some changes will be needed to add support.

This isn't really related to KubeVirt but to QEMU.
 
 

Secondly, where would you deploy this broker daemon? Is this a privileged component? I'm not sure if the unprivileged client could just connect to a privileged component. I see this as a bit problematic. Maybe can you expand a bit on this?

It's not a privileged component, all it does is listen()/socketpair()/send(), fairly basic socket operations.  Clients just need a way to connect to it.
It can run as a systemd daemon on the host (possibly socket-activated) or anywhere else from where you can get the one socket file out.

Yes, I understand this. But we could have privileged and unprivileged components that want to communicate. AFAIU, the broker doesn't make this difference and in this case, it is the broker that manages the connection. For example, unwanted components might manage to connect to privileged containers.
In my example, the model is a bit different because the connection is established by the privileged component.
Or do I miss something here?

Each client sends a connection key to the broker; the key is an arbitrary sequence of bytes. The broker connects clients to each other only if their keys match. It's a simple way to prevent unwanted connections.
We may also enforce matching of the process user/group/other things that the broker can get from the client connection, but that is the opposite of the goal here, as we try to match a privileged process with an unprivileged one.
 
 
 

Does the broker socket need to be mounted inside the container? We'd like to avoid the need of mounting a socket in the virt-launcher pod for the reasons I have described in the first email of this thread. This is also the main reason for the pidfd_getfd syscall.

Yes, you need to mount the broker socket to containers, so it might not be an option for you.  The point is that it is one and only socket that needs to be mounted.  It will be the same socket file for every container and it is not really service-specific, it's a very generic service.
IMHO, the world would be a much simpler place if some one-socket-like service is available system-wide in kubernetes and containers were able to request it as a resource.  I think this can be achieved by creating a kubernetes device plugin.

Well, the mount of the socket is problematic because the bind mount isn't transparent to k8s as it is done by kubevirt. If the unmount of the socket isn't done properly on the by kubevirt, kubernetes cannot clean up and umount the container filesystem. Otherwise, I could simply mount the pr-helper socket and QEMU could directly connect to it. This is how I implemented the first poc.

What will be in your case the resource managed by the kubernetes device plugin? I think you are talking about the resource exposed through vhost-user here, not the socket, right?

Actually, I was thinking about the broker socket itself being managed by the device plugin.  It's a resource of infinite quantity.  Device plugins are able to add mounts to pods, IIRC, so the plugin can mount the socket on pod startup and unmount it on pod termination.
Not sure if it's the best way to do that, but it should work I guess.
 
 

Alice


Stefan Hajnoczi

Oct 13, 2022, 11:29:09 AM
to Paolo Bonzini, Alice Frosi, Stefan Hajnoczi, i.max...@ovn.org, kubevirt-dev
On Thu, Oct 13, 2022 at 02:58:06PM +0200, Paolo Bonzini wrote:
> On 10/13/22 14:12, Alice Frosi wrote:
> > If QEMU had a listen socket for pr-helper to connect to
> > (inverting the direction of the socket connection), then
> > could the proxy process be eliminated? QEMU listens and the
> > privileged pr-helper container temporarily enters the mount
> > namespace and connects to QEMU's UNIX domain socket. Now
> > QEMU and pr-helper are connected without a proxy.
> >
> > Yes, the direction is the actual issue here. Having a listening
> > socket in the unprivileged container and connecting from the
> > privileged container isn't an issue. It is even the way how
> > virt-handler (privileged) establishes a communication channel
> > with virt-launcher (unprivileged).
> > Unfortunately, in my case, the direction is the opposite QEMU
> > needs to connect to the pr-helper.
> >
> > Is it not possible to adjust the QEMU/pr-helper code to allow
> > connecting the other way around too?
> >
> > Something that maybe Paolo can answer?
>
> The reason is that there can be a single qemu-pr-helper for multiple VMs
> (which is the case also in the KubeVirt case). If qemu-pr-helper connected
> to QEMU instead of the other way, you would need some kind of RPC to pass
> the path or the connected socket to qemu-pr-helper.
>
> In addition, though this may not be the case for KubeVirt, the
> qemu-pr-helper can be restarted (either because it crashes or because of a
> system update). In this case it is easier for QEMU to reconnect than for
> qemu-pr-helper to do so, since qemu-pr-helper does not track all the VMs
> it's serving.
>
> However I wonder if with Ilya's broker only QEMU would have to be adapted,
> and not qemu-pr-helper? The broker can issue a normal connect() system call
> when it receives the request from QEMU.

There is a broker protocol for connecting and fetching the socketpair
fd. The broker is not a transparent proxy, so a normal connect() syscall
isn't possible (unless the design of the broker is changed).

When discussing the broker in the past it appeared to me like the broker
could be eliminated if the AF_UNIX implementation in the kernel was
extended. Today AF_UNIX has three addressing modes:
- pathname addresses: a UNIX domain socket is looked up by path in a
mount namespace
- unnamed: anonymous, like a socketpair
- abstract: sockets are looked up by name in a net namespace and are not
visible on a filesystem

None of these modes are sufficient for cross-container UNIX domain
sockets. One class of solutions adds userspace components
(proxies/brokers/etc) on top without changing how AF_UNIX works. Another
class of solutions adds a cross-namespace addressing mode to AF_UNIX so
that bind(2)/connect(2) work without additional components in userspace.

I'm not sure if this can be done nicely, but I think it's worth
ruling out first before adding more components on top to work around
limitations of the lower level components.

My concerns with a kernel solution are:
- Do existing applications need to be modified? The only way I can think
of for avoiding that is to piggyback on top of pathname addresses with
a magic prefix like "/dev/unix/", but that has its own problems.
- What are the security implications of allowing applications to
communicate across containers (namespaces)? As long as existing
programs don't accidentally expose their sockets across namespaces, I
think the risk is small. The listening process needs a way to validate
the connecting process' credentials to distinguish them from the local
namespace - SO_PEERCRED/struct ucred isn't enough :(.
- Can a new address mode be added without breaking existing applications
or POSIX?

I don't have good answers to any of these points, so I guess a kernel
solution will be hard.
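As a side note on the SO_PEERCRED point: retrieving the peer credentials is easy (rough Go sketch below, using the wrappers in golang.org/x/sys/unix); the problem is that the returned PID and UID are interpreted in the receiver's own namespaces, so they are not enough to reliably identify a peer coming from another container.

package peercred

import (
	"fmt"
	"net"

	"golang.org/x/sys/unix"
)

// peerCreds returns the SO_PEERCRED credentials of the process on the
// other side of a connected UNIX domain socket.
func peerCreds(c *net.UnixConn) (*unix.Ucred, error) {
	raw, err := c.SyscallConn()
	if err != nil {
		return nil, err
	}
	var cred *unix.Ucred
	var credErr error
	if err := raw.Control(func(fd uintptr) {
		cred, credErr = unix.GetsockoptUcred(int(fd), unix.SOL_SOCKET, unix.SO_PEERCRED)
	}); err != nil {
		return nil, err
	}
	if credErr != nil {
		return nil, credErr
	}
	// Pid and Uid here are translated into this process' PID/user
	// namespaces, which is exactly why they are not enough to identify
	// a peer from a different container.
	fmt.Printf("peer pid=%d uid=%d gid=%d\n", cred.Pid, cred.Uid, cred.Gid)
	return cred, nil
}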

Stefan

Alice Frosi

Feb 8, 2023, 3:34:22 AM
to kubevirt-dev, Vladik Romanovsky, Luboslav Pivarc, Roman Mohr
Hi everyone,

I'd like to give an update on this topic and bring your attention to a new PR I recently opened [1].

After a lot of investigation, we have decided to use the device plugin framework in order to mount the pr-helper socket into virt-launcher. The pr-helper socket will be exposed as a resource, and a new KubeVirt device plugin will be in charge of handling the resource.
Initially, I had discarded this option because device plugins don't have access control. However, we have concluded that it is still safe, because SCSI persistent reservation requires access to both the pr-helper socket and the SCSI device, which is controlled through the PVC/CSI interface.
This is a relatively simple and clean way to mount the pr-helper socket inside virt-launcher, as the clean-up of the pod is synchronized with kubelet.
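To give an idea of the mechanism, here is a rough sketch (not the code in the PR; the paths are made up, and registration, ListAndWatch, etc. are omitted) of the device plugin's Allocate(): it simply asks kubelet to mount the directory containing the pr-helper socket into the requesting container, so kubelet owns the mount and its clean-up.

package prplugin

import (
	"context"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// prHelperPlugin is a sketch of a device plugin that exposes the
// qemu-pr-helper socket directory as an allocatable resource.
type prHelperPlugin struct {
	socketDir string // e.g. the host directory where the pr-helper socket lives
}

// Allocate tells kubelet to mount the pr-helper socket directory into
// the container; kubelet then handles the mount and its clean-up, which
// avoids the out-of-band bind mounts discussed earlier in the thread.
func (p *prHelperPlugin) Allocate(ctx context.Context, r *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for range r.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Mounts: []*pluginapi.Mount{{
				ContainerPath: "/var/run/kubevirt/pr-helper",
				HostPath:      p.socketDir,
				ReadOnly:      false,
			}},
		})
	}
	return resp, nil
}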

Any review of this approach will be very helpful.

I have currently marked the PRs as work in progress because I need some feedback on the feature gate. I gave more details in the comment [2].

Additionally, I'm struggling with the Bazel and Python integration. In order to test SCSI persistent reservation, I'd like to use targetcli to create a SCSI loopback device.
Unfortunately, targetcli fails when I add it to the vm-killer image as in [3] because it cannot find the udev library (a version of this image is available at [4]). Apparently, Python uses /etc/ld.so.cache to find the shared library. When I run ldconfig inside the vm-killer image, the cache is rebuilt and targetcli starts working properly.
If any of you have any hints on how to integrate this with Bazel, it would be much appreciated!


Many thanks,
Alice