Towards an ivshmem 2.0?


Jan Kiszka

Jan 16, 2017, 3:36:52 AM
to qemu-devel, Jailhouse
Hi,

some of you may know that we are using a shared memory device similar to
ivshmem in the partitioning hypervisor Jailhouse [1].

We started out compatible with the original ivshmem that QEMU
implements, but we quickly deviated in some details, and even more so
in recent months. Some of the deviations are related to making the
implementation simpler. The new ivshmem takes <500 LoC - Jailhouse is
aimed at safety-critical systems and, therefore, at a small code base.
Other changes address deficits in the original design, like missing
life-cycle management.

Now the question is whether there is interest in defining a common new
revision of this device and maybe also of some protocols used on top,
such as virtual network links. Ideally, this would enable us to share
Linux drivers. We will definitely go for upstreaming at least a network
driver such as [2], a UIO driver and maybe also a serial port/console.

I've attached a first draft of the specification of our new ivshmem
device. A working implementation can be found in the wip/ivshmem2 branch
of Jailhouse [3], the corresponding ivshmem-net driver in [4].

Deviations from the original design:

- Only two peers per link

This simplifies the implementation and also the interfaces (think of
life-cycle management in a multi-peer environment). Moreover, we do
not have an urgent use case for multiple peers, and thus also no
reference protocol that could be used in such setups. If someone
else happens to share such a protocol, it would be possible to discuss
potential extensions and their implications.

- Side-band registers to discover and configure shared memory regions

This was one of the first changes: We removed the memory regions from
the PCI BARs and gave them special configuration space registers. By
now, these registers are embedded in a PCI capability. The reasons are
that Jailhouse does not allow the guest to relocate the regions in its
address space (other hypervisors may allow that if they like) and that
we now have up to three of them. A sketch of what such a capability
could look like follows this list.

- Changed PCI base class code to 0xff (unspecified class)

This allows us to define our own sub classes and interfaces. That is
now exploited for specifying the shared memory protocol the two
connected peers should use. It also allows the Linux drivers to match
on that.

- INTx interrupts support is back

This is needed on target platforms without MSI controllers, i.e.
without the required guest support. Specifically, some PCI-less ARM
SoCs required this reintroduction. While doing so, we also took care
of keeping the MMIO registers free of privileged controls so that a
guest OS can map them safely into a guest userspace application.
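
To make the side-band discovery a bit more concrete, here is a minimal
sketch of how a guest driver could model such a vendor-specific PCI
capability. The field names and the exact layout are illustrative
assumptions only - the authoritative definition is in the attached
ivshmem-v2-specification.md.

  /* Illustrative sketch only -- not the layout from the attached spec. */
  #include <stdint.h>

  struct ivshmem2_region_cap {
          uint8_t  cap_id;    /* PCI_CAP_ID_VNDR (0x09), vendor-specific */
          uint8_t  cap_next;  /* pointer to the next capability */
          uint8_t  cap_len;   /* length of this capability structure */
          uint8_t  reserved;
          /*
           * One descriptor per shared memory region; size == 0 means the
           * region is absent. The base addresses are assigned by the
           * hypervisor; Jailhouse keeps them fixed, other hypervisors
           * could choose to make them guest-writable.
           */
          struct {
                  uint64_t addr;  /* guest-physical base address */
                  uint64_t size;  /* region size in bytes */
          } region[3];    /* 0: r/w, 1: local output, 2: local input */
  } __attribute__((packed));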

And then there are some extensions of the original ivshmem:

- Multiple shared memory regions, including unidirectional ones

It is now possible to expose up to three different shared memory
regions: The first one is read/writable for both sides. The second
region is read/writable for the local peer and read-only for the
remote peer (useful for output queues). And the third is read-only
locally but read/writable remotely (i.e. for input queues).
Unidirectional regions prevent the receiver of some data from
interfering with the sender while it is still building the message -
a property that is useful well beyond safety-critical communication,
we are sure.

- Life-cycle management via local and remote state

Each device can now signal its own state in the form of a value to the
remote side, which triggers an event there. Moreover, state changes
done by the hypervisor to one peer are signalled to the other side.
And we introduced a write-to-shared-memory mechanism for the
respective remote state so that guests do not have to issue an MMIO
access in order to check the state.
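
As a rough illustration of this life-cycle mechanism, a guest driver
could publish its own state through a device register and, once the
write-to-shared-memory mechanism is enabled, read the peer's state
directly from shared memory. Register and state names below are
invented for this example; the attached specification is authoritative.

  /* Illustrative sketch only -- names and offsets are assumptions. */
  #include <stdint.h>

  struct ivshmem2_regs {          /* hypothetical MMIO register block */
          uint32_t int_control;   /* enable/disable own interrupts */
          uint32_t doorbell;      /* write to interrupt the remote peer */
          uint32_t state;         /* own state, propagated to the peer */
  };

  enum { PEER_RESET = 0, PEER_INIT = 1, PEER_READY = 2 };

  static void publish_state(volatile struct ivshmem2_regs *regs,
                            uint32_t state)
  {
          /* Writing the local state triggers an event at the peer. */
          regs->state = state;
  }

  static int peer_is_ready(const volatile uint32_t *remote_state)
  {
          /*
           * With the write-to-shared-memory mechanism enabled, the
           * remote state is mirrored into shared memory, so no MMIO
           * read is needed to check it.
           */
          return *remote_state == PEER_READY;
  }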

So, this is our proposal. It would be great to hear some opinions on
whether you see value in adding support for such an "ivshmem 2.0"
device to QEMU as well and in expanding its ecosystem towards Linux
upstream, maybe also DPDK again. If you see problems in the new design
with respect to what QEMU provides so far with its ivshmem device,
let's discuss how to resolve them. Looking forward to any feedback!

Jan

[1] https://github.com/siemens/jailhouse
[2]
http://git.kiszka.org/?p=linux.git;a=blob;f=drivers/net/ivshmem-net.c;h=0e770ca293a4aca14a55ac0e66871b09c82647af;hb=refs/heads/queues/jailhouse
[3] https://github.com/siemens/jailhouse/commits/wip/ivshmem2
[4]
http://git.kiszka.org/?p=linux.git;a=shortlog;h=refs/heads/queues/jailhouse-ivshmem2

--
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux
ivshmem-v2-specification.md

Marc-André Lureau

Jan 16, 2017, 7:41:36 AM
to Jan Kiszka, qemu-devel, Jailhouse, Wei Wang, Markus Armbruster
Hi

On Mon, Jan 16, 2017 at 12:37 PM Jan Kiszka <jan.k...@siemens.com> wrote:
Hi,

some of you may know that we are using a shared memory device similar to
ivshmem in the partitioning hypervisor Jailhouse [1].

We started as being compatible to the original ivshmem that QEMU
implements, but we quickly deviated in some details, and in the recent
months even more. Some of the deviations are related to making the
implementation simpler. The new ivshmem takes <500 LoC - Jailhouse is
aiming at safety critical systems and, therefore, a small code base.
Other changes address deficits in the original design, like missing
life-cycle management.

Now the question is if there is interest in defining a common new
revision of this device and maybe also of some protocols used on top,
such as virtual network links. Ideally, this would enable us to share
Linux drivers. We will definitely go for upstreaming at least a network
driver such as [2], a UIO driver and maybe also a serial port/console.


This sounds like duplicating the efforts done with virtio and vhost-pci. Have you looked at Wei Wang's proposal?

I've attached a first draft of the specification of our new ivshmem
device. A working implementation can be found in the wip/ivshmem2 branch
of Jailhouse [3], the corresponding ivshmem-net driver in [4].

You don't have a qemu branch, right?
 

Deviations from the original design:

- Only two peers per link


Sounds sane; that's also what vhost-pci aims for, afaik.
 
  This simplifies the implementation and also the interfaces (think of
  life-cycle management in a multi-peer environment). Moreover, we do
  not have an urgent use case for multiple peers, thus also not
  reference for a protocol that could be used in such setups. If someone
  else happens to share such a protocol, it would be possible to discuss
  potential extensions and their implications.

- Side-band registers to discover and configure share memory regions

  This was one of the first changes: We removed the memory regions from
  the PCI BARs and gave them special configuration space registers. By
  now, these registers are embedded in a PCI capability. The reasons are
  that Jailhouse does not allow to relocate the regions in guest address
  space (but other hypervisors may if they like to) and that we now have
  up to three of them.

 Sorry, I can't comment on that.


- Changed PCI base class code to 0xff (unspecified class)

  This allows us to define our own sub classes and interfaces. That is
  now exploited for specifying the shared memory protocol the two
  connected peers should use. It also allows the Linux drivers to match
  on that.


Why not, but it worries me that you are going to invent protocols similar to those of the virtio devices, aren't you?
 
- INTx interrupts support is back

  This is needed on target platforms without MSI controllers, i.e.
  without the required guest support. Namely some PCI-less ARM SoCs
  required the reintroduction. While doing this, we also took care of
  keeping the MMIO registers free of privileged controls so that a
  guest OS can map them safely into a guest userspace application.


Right, it's not completely removed from the ivshmem in qemu upstream, although it should probably be allowed to set up a doorbell-ivshmem with msi=off (this may be quite trivial to add back).
 
And then there are some extensions of the original ivshmem:

- Multiple shared memory regions, including unidirectional ones

  It is now possible to expose up to three different shared memory
  regions: The first one is read/writable for both sides. The second
  region is read/writable for the local peer and read-only for the
  remote peer (useful for output queues). And the third is read-only
  locally but read/writable remotely (ie. for input queues).
  Unidirectional regions prevent that the receiver of some data can
  interfere with the sender while it is still building the message, a
  property that is not only useful for safety critical communication,
  we are sure.

Sounds like a good idea, and something we may want in virtio too

- Life-cycle management via local and remote state

  Each device can now signal its own state in form of a value to the
  remote side, which triggers an event there. Moreover, state changes
  done by the hypervisor to one peer are signalled to the other side.
  And we introduced a write-to-shared-memory mechanism for the
  respective remote state so that guests do not have to issue an MMIO
  access in order to check the state.

There is also ongoing work to better support disconnect/reconnect in virtio.
 

So, this is our proposal. Would be great to hear some opinions if you
see value in adding support for such an "ivshmem 2.0" device to QEMU as
well and expand its ecosystem towards Linux upstream, maybe also DPDK
again. If you see problems in the new design /wrt what QEMU provides so
far with its ivshmem device, let's discuss how to resolve them. Looking
forward to any feedback!


My feeling is that ivshmem is not being actively developed in qemu; the focus is rather on virtio-based solutions (vhost-pci for vm2vm).

--
Marc-André Lureau

Jan Kiszka

Jan 16, 2017, 8:10:22 AM
to Marc-André Lureau, qemu-devel, Jailhouse, Wei Wang, Markus Armbruster
Hi Marc-André,

On 2017-01-16 13:41, Marc-André Lureau wrote:
> Hi
>
> On Mon, Jan 16, 2017 at 12:37 PM Jan Kiszka <jan.k...@siemens.com
> <mailto:jan.k...@siemens.com>> wrote:
>
> Hi,
>
> some of you may know that we are using a shared memory device similar to
> ivshmem in the partitioning hypervisor Jailhouse [1].
>
> We started as being compatible to the original ivshmem that QEMU
> implements, but we quickly deviated in some details, and in the recent
> months even more. Some of the deviations are related to making the
> implementation simpler. The new ivshmem takes <500 LoC - Jailhouse is
> aiming at safety critical systems and, therefore, a small code base.
> Other changes address deficits in the original design, like missing
> life-cycle management.
>
> Now the question is if there is interest in defining a common new
> revision of this device and maybe also of some protocols used on top,
> such as virtual network links. Ideally, this would enable us to share
> Linux drivers. We will definitely go for upstreaming at least a network
> driver such as [2], a UIO driver and maybe also a serial port/console.
>
>
> This sounds like duplicating efforts done with virtio and vhost-pci.
> Have you looked at Wei Wang proposal?

I didn't follow it recently, but the original concept was about
introducing an IOMMU model to the picture, and that's complexity-wise a
no-go for us (we can do this whole thing in less than 500 lines, even
virtio itself is more complex). IIUC, the alternative to an IOMMU is
mapping the whole frontend VM memory into the backend VM - that's
security/safety-wise an absolute no-go.

>
> I've attached a first draft of the specification of our new ivshmem
> device. A working implementation can be found in the wip/ivshmem2 branch
> of Jailhouse [3], the corresponding ivshmem-net driver in [4].
>
>
> You don't have qemu branch, right?

Yes, not yet. I would look into creating a QEMU device model if there is
serious interest.

>
>
>
> Deviations from the original design:
>
> - Only two peers per link
>
>
> sound sane, that's also what vhost-pci aims to afaik
>
>
> This simplifies the implementation and also the interfaces (think of
> life-cycle management in a multi-peer environment). Moreover, we do
> not have an urgent use case for multiple peers, thus also not
> reference for a protocol that could be used in such setups. If someone
> else happens to share such a protocol, it would be possible to discuss
> potential extensions and their implications.
>
> - Side-band registers to discover and configure share memory regions
>
> This was one of the first changes: We removed the memory regions from
> the PCI BARs and gave them special configuration space registers. By
> now, these registers are embedded in a PCI capability. The reasons are
> that Jailhouse does not allow to relocate the regions in guest address
> space (but other hypervisors may if they like to) and that we now have
> up to three of them.
>
>
> Sorry, I can't comment on that.
>
>
> - Changed PCI base class code to 0xff (unspecified class)
>
> This allows us to define our own sub classes and interfaces. That is
> now exploited for specifying the shared memory protocol the two
> connected peers should use. It also allows the Linux drivers to match
> on that.
>
>
> Why not, but it worries me that you are going to invent protocols
> similar to virtio devices, aren't you?

That partly comes with the desire to simplify the transport (pure shared
memory). With ivshmem-net, we are at least reusing virtio rings and will
try to do this with the new (and faster) virtio ring format as well.
As pointed out, for us it's most important to keep the design simple -
even at the price of "reinventing" some drivers for upstream (at least,
we do not need two sets of drivers because our interface is fully
symmetric). I don't yet see how vhost-pci could achieve the same, but
I'm open to learning more!

Thanks,
Jan

Stefan Hajnoczi

Jan 16, 2017, 9:19:00 AM
to Jan Kiszka, qemu-devel, Jailhouse
On Mon, Jan 16, 2017 at 09:36:51AM +0100, Jan Kiszka wrote:
> some of you may know that we are using a shared memory device similar to
> ivshmem in the partitioning hypervisor Jailhouse [1].
>
> We started as being compatible to the original ivshmem that QEMU
> implements, but we quickly deviated in some details, and in the recent
> months even more. Some of the deviations are related to making the
> implementation simpler. The new ivshmem takes <500 LoC - Jailhouse is
> aiming at safety critical systems and, therefore, a small code base.
> Other changes address deficits in the original design, like missing
> life-cycle management.

My first thought is "what about virtio?". Can you share some background
on why ivshmem fits the use case better than virtio?

The reason I ask is because the ivshmem devices you define would have
parallels to existing virtio devices and this could lead to duplication.

Stefan

Jan Kiszka

Jan 16, 2017, 9:35:02 AM
to Stefan Hajnoczi, qemu-devel, Jailhouse
virtio was created as an interface between a host and a guest. It has no
notion of direct (or even symmetric) connection between guests. With
ivshmem, we want to establish only a minimal host-guest interface. We
want to keep the host out of the business negotiating protocol details
between two connected guests.

So, the trade-off was between reusing existing virtio drivers - which
would definitely have required some changes even in the best case, and
a complex translation of virtio into a vm-to-vm model - on the one
side, and establishing a new driver ecosystem on top of much simpler
host services (500 LoC...) on the other. We went for the latter.

Jan

Wang, Wei W

Jan 17, 2017, 4:13:50 AM
to Jan Kiszka, Marc-André Lureau, qemu-devel, Jailhouse, Markus Armbruster
Hi Jan,
Though the virtio-based solution might be complex for you, a big advantage is that we have lots of people working to improve virtio. For example, the upcoming virtio 1.1 has vring improvements; we can easily upgrade all the virtio-based solutions, such as vhost-pci, to take advantage of this improvement. From a long-term perspective, I think this kind of complexity is worthwhile.

We further have security features (e.g. vIOMMU) that can be applied to vhost-pci.
Can you please explain more about the process of transferring a packet using the three different memory regions?
In the kernel implementation, the sk_buff can be allocated anywhere.

Btw, this looks similar to the memory access protection mechanism using EPTP switching:
Slide 25 http://www.linux-kvm.org/images/8/87/02x09-Aspen-Jun_Nakajima-KVM_as_the_NFV_Hypervisor.pdf
The missing right side of the figure is an alternative EPT, which gives full access permission to the small piece of security code.

> >
> >
> > - Life-cycle management via local and remote state
> >
> > Each device can now signal its own state in form of a value to the
> > remote side, which triggers an event there. Moreover, state changes
> > done by the hypervisor to one peer are signalled to the other side.
> > And we introduced a write-to-shared-memory mechanism for the
> > respective remote state so that guests do not have to issue an MMIO
> > access in order to check the state.
> >
> >
> > There is also ongoing work to better support disconnect/reconnect in
> > virtio.
> >
> >
> >
> > So, this is our proposal. Would be great to hear some opinions if you
> > see value in adding support for such an "ivshmem 2.0" device to QEMU as
> > well and expand its ecosystem towards Linux upstream, maybe also DPDK
> > again. If you see problems in the new design /wrt what QEMU provides so
> > far with its ivshmem device, let's discuss how to resolve them. Looking
> > forward to any feedback!
> >
> >
> > My feeling is that ivshmem is not being actively developped in qemu,
> > but rather virtio-based solutions (vhost-pci for vm2vm).
>
> As pointed out, for us it's most important to keep the design simple - even at the
> price of "reinventing" some drivers for upstream (at least, we do not need two
> sets of drivers because our interface is fully symmetric). I don't see yet how
> vhost-pci could achieve the same, but I'm open to learn more!

Maybe I didn’t fully understand this - "we do not need two sets of drivers because our interface is fully symmetric"?

The vhost-pci driver is a standalone network driver from the local guest point of view - it's no different than any other network drivers in the guest. When talking about usage, it's used together with another VM's virtio device - would this be the "two sets of drivers" that you meant? I think this is pretty nature and reasonable, as it is essentially a vm-to-vm communication. Furthermore, we are able to dynamically create/destroy and hot-plug in/out a vhost-pci device based on runtime requests.

Thanks for sharing your ideas.

Best,
Wei

Jan Kiszka

Jan 17, 2017, 4:46:20 AM
to Wang, Wei W, Marc-André Lureau, qemu-devel, Jailhouse, Markus Armbruster
We will adopt virtio 1.1 ring formats. That's one reason why there is
also still a bidirectional shared memory region: to host the new
descriptors (while keeping the payload safely in the unidirectional
regions).

>
> We further have security features(e.g. vIOMMU) can be applied to vhost-pci.

As pointed out, this is way too complex for us. A complete vIOMMU model
would easily add a few thousand lines of code to a hypervisor that tries
to stay below 10k LoC. Each line costs a lot of money when going for
certification. Plus, I'm not even sure that there will always be
performance benefits, but that remains to be seen once both solutions
have matured.
With shared memory-backed communication, you obviously will have to
copy, to and sometimes also from the communication regions. But you no
longer have to flip any mappings (or even give up on secure isolation).

Why we have up to three regions: two unidirectional ones for payload,
one for shared control structures or custom protocols. See also above.
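
To illustrate that split, a guest driver could partition the regions
roughly as follows. The structure and names are made up for this
sketch; neither the spec nor the ivshmem-net driver is implied to use
exactly this layout.

  /* Hypothetical partitioning -- for illustration only. */
  #include <stddef.h>

  struct ivshmem2_net_layout {
          /*
           * Region 0 (read/write for both peers): shared control data
           * that both sides update, e.g. virtio-style descriptor and
           * index structures.
           */
          void *ctrl;
          size_t ctrl_size;

          /*
           * Region 1 (writable locally, read-only for the peer): payload
           * buffers for packets we transmit; the receiver cannot modify
           * them while we are still filling them in.
           */
          void *tx_payload;
          size_t tx_size;

          /*
           * Region 2 (read-only locally, writable by the peer): payload
           * buffers the peer fills for us to receive.
           */
          const void *rx_payload;
          size_t rx_size;
  };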

>
> Btw, this looks similar to the memory access protection mechanism using EPTP switching:
> Slide 25 http://www.linux-kvm.org/images/8/87/02x09-Aspen-Jun_Nakajima-KVM_as_the_NFV_Hypervisor.pdf
> This missed right side of the figure is an alternative EPT, which gives a full access permission to the small piece of security code.

EPTP might be some nice optimization for scenarios where you have to
switch (but are its security problems resolved by now?), but a) we can
avoid switching and b) it's Intel-only while we need a generic solution
for all archs.

>
>>>
>>>
>>> - Life-cycle management via local and remote state
>>>
>>> Each device can now signal its own state in form of a value to the
>>> remote side, which triggers an event there. Moreover, state changes
>>> done by the hypervisor to one peer are signalled to the other side.
>>> And we introduced a write-to-shared-memory mechanism for the
>>> respective remote state so that guests do not have to issue an MMIO
>>> access in order to check the state.
>>>
>>>
>>> There is also ongoing work to better support disconnect/reconnect in
>>> virtio.
>>>
>>>
>>>
>>> So, this is our proposal. Would be great to hear some opinions if you
>>> see value in adding support for such an "ivshmem 2.0" device to QEMU as
>>> well and expand its ecosystem towards Linux upstream, maybe also DPDK
>>> again. If you see problems in the new design /wrt what QEMU provides so
>>> far with its ivshmem device, let's discuss how to resolve them. Looking
>>> forward to any feedback!
>>>
>>>
>>> My feeling is that ivshmem is not being actively developped in qemu,
>>> but rather virtio-based solutions (vhost-pci for vm2vm).
>>
>> As pointed out, for us it's most important to keep the design simple - even at the
>> price of "reinventing" some drivers for upstream (at least, we do not need two
>> sets of drivers because our interface is fully symmetric). I don't see yet how
>> vhost-pci could achieve the same, but I'm open to learn more!
>
> Maybe I didn’t fully understand this - "we do not need two sets of drivers because our interface is fully symmetric"?

We have no backend/frontend drivers. While vhost-pci can reuse virtio
frontend drivers, it still requires new backend drivers. We use the same
drivers on both sides - it's just symmetric. That also simplifies
arguing over non-interference because both sides have equal capabilities.

>
> The vhost-pci driver is a standalone network driver from the local guest point of view - it's no different than any other network drivers in the guest. When talking about usage, it's used together with another VM's virtio device - would this be the "two sets of drivers" that you meant? I think this is pretty nature and reasonable, as it is essentially a vm-to-vm communication. Furthermore, we are able to dynamically create/destroy and hot-plug in/out a vhost-pci device based on runtime requests.

Hotplugging works with shared memory devices as well. We don't use it
during runtime of the hypervisor due to safety constraints, but devices
show up and disappear in the root cell (the primary Linux) as the
hypervisor starts or stops.

Jan

Stefan Hajnoczi

Jan 17, 2017, 4:59:14 AM
to Jan Kiszka, Marc-André Lureau, qemu-devel, Jailhouse, Wei Wang, Markus Armbruster
The concept of symmetry is nice but only applies to communication
channels like networking and serial.

It doesn't apply to I/O that is fundamentally asymmetric, like disk I/O.

I just wanted to point this out because the lack of symmetry has also
bothered me about virtio, but it's actually impossible to achieve for
all device types.

Stefan

Stefan Hajnoczi

Jan 17, 2017, 5:00:09 AM
to Jan Kiszka, qemu-devel, Jailhouse
Thanks. I was going in the same direction about vhost-pci as
Marc-André. Let's switch to his sub-thread.

Stefan

Jan Kiszka

Jan 17, 2017, 5:32:34 AM
to Stefan Hajnoczi, Marc-André Lureau, qemu-devel, Jailhouse, Wei Wang, Markus Armbruster
That's true. Not sure what all is planned for vhost-pci. Our scope is
limited (though mass storage proxying could be interesting at some
point), plus there is the option to do X-over-network.

Jan

Wang, Wei W

Jan 20, 2017, 6:54:22 AM
to Jan Kiszka, Marc-André Lureau, qemu-devel, Jailhouse, Markus Armbruster
The vring example I gave might be confusing, sorry about that. My point is that every part of virtio is maturing and being improved over time. Personally, I think having a new device developed and maintained in such an active and popular model is helpful. Also, as new features are gradually added in the future, a simple device could become complex.

A theoretical analysis of the performance:
The traditional shared memory mechanism, sharing an intermediate memory, requires 2 copies to get a packet transmitted. It's not just one more copy compared to the 1-copy solution; I think there are some more things we need to take into account:
1) there is extra ring operation overhead on both the sending and receiving side to access the shared memory (i.e. IVSHMEM);
2) an extra protocol is needed to use the shared memory;
3) the number of pieces of shared memory allocated from the host is C(n,2), where n is the number of VMs. For example, for 20 VMs that want to talk to each other, there will be 190 pieces of memory allocated from the host.
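
A quick check of that count, assuming one dedicated piece of shared
memory per VM pair:

  \binom{n}{2} = \frac{n(n-1)}{2}, \qquad
  \binom{20}{2} = \frac{20 \cdot 19}{2} = 190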

That being said, if people really want the 2-copy solution, we can also have vhost-pci support it that way as a new feature (not sure if you would be interested in collaborating on the project):
With the new feature added, the master VM sends only a piece of memory (equivalent to IVSHMEM, but allocated by the guest) to the slave over the vhost-user protocol, and the vhost-pci device on the slave side only hosts that piece of shared memory.

Best,
Wei

Jan Kiszka

Jan 20, 2017, 11:37:44 AM
to Wang, Wei W, Marc-André Lureau, qemu-devel, Jailhouse, Markus Armbruster
We can't afford becoming more complex; that is the whole point.
Complexity shall go into the guest, not the hypervisor, when it is
really needed.

>
> Having a theoretical analysis on the performance:
> The traditional shared memory mechanism, sharing an intermediate memory, requires 2 copies to get the packet transmitted. It's not just one more copy compared to the 1-copy solution, I think some more things we may need to take into account:

1-copy (+ potential transfers to userspace, but that's the same for
everyone) is conceptually possible, definitely under stacks like DPDK.
However, Linux skbs are currently not prepared for picking up
shmem-backed packets; we already looked into this. Likely addressable,
though.

> 1) there are extra ring operation overhead on both the sending and receiving side to access the shared memory (i.e. IVSHMEM);
> 2) extra protocol to use the shared memory;
> 3) the piece of allocated shared memory from the host = C(n,2), where n is the number of VMs. Like for 20 VMs who want to talk to each other, there will be 190 pieces of memory allocated from the host.

Well, only if all VMs need to talk to all others directly. On real
setups, you would add direct links for heavy traffic and otherwise do
software switching. Moreover, those links would only have to be backed
by physical memory all the time in static setups.

Also, we didn't completely rule out a shmem bus with multiple peers
connected. That's just looking for a strong use case - and then a robust
design, of course.

>
> That being said, if people really want the 2-copy solution, we can also have vhost-pci support it that way as a new feature (not sure if you would be interested in collaborating on the project):
> With the new feature added, the master VM sends only a piece of memory (equivalent to IVSHMEM, but allocated by the guest) to the slave over vhost-user protocol, and the vhost-pci device on the slave side only hosts that piece of shared memory.

I'm all in for something that allows stripping down vhost-pci to
something that - while staying secure - is simple and /also/ allows
static configurations. But I'm not yet seeing that this would still be
virtio or vhost-pci.

What would be the minimal viable vhost-pci device set from your POV?
What would have to be provided by the hypervisor for that?

Wang, Wei W

Jan 22, 2017, 10:49:10 PM
to Jan Kiszka, Marc-André Lureau, qemu-devel, Jailhouse, Markus Armbruster
Not sure how difficult it would be to get that change upstreamed to the kernel, but I'm looking forward to seeing your solutions.

> > 1) there are extra ring operation overhead on both the sending and
> > receiving side to access the shared memory (i.e. IVSHMEM);
> > 2) extra protocol to use the shared memory;
> > 3) the piece of allocated shared memory from the host = C(n,2), where n is the
> number of VMs. Like for 20 VMs who want to talk to each other, there will be
> 190 pieces of memory allocated from the host.
>
> Well, only if all VMs need to talk to all others directly. On real setups, you would
> add direct links for heavy traffic and otherwise do software switching. Moreover,
> those links would only have to be backed by physical memory in static setups all
> the time.
>
> Also, we didn't completely rule out a shmem bus with multiple peers connected.
> That's just looking for a strong use case - and then a robust design, of course.
>
> >
> > That being said, if people really want the 2-copy solution, we can also have
> vhost-pci support it that way as a new feature (not sure if you would be
> interested in collaborating on the project):
> > With the new feature added, the master VM sends only a piece of memory
> (equivalent to IVSHMEM, but allocated by the guest) to the slave over vhost-user
> protocol, and the vhost-pci device on the slave side only hosts that piece of
> shared memory.
>
> I'm all in for something that allows to strip down vhost-pci to something that -
> while staying secure - is simple and /also/ allows static configurations. But I'm
> not yet seeing that this would still be virtio or vhost-pci.
>
> What would be the minimal viable vhost-pci device set from your POV?

For the static configuration option, I think it mainly needs the device emulation part of the current implementation, which currently has ~500 LOC. This would also require adding a new feature to virtio_net, to let the virtio_net driver use the IVSHMEM region, and the same for the vhost-pci device.

I think most of the vhost-user protocol can be bypassed for this usage, because the device feature bits don't need to be negotiated between the two devices and the memory and vring info doesn't need to be transferred. To support interrupts, we may still need vhost-user to share the irqfd.

> What would have to be provided by the hypervisor for that?
>

We don’t need any support from KVM, for the qemu part support, please see above.

Best,
Wei

Måns Rullgård

Jan 23, 2017, 5:14:54 AM
to Wang, Wei W, Jan Kiszka, Marc-André Lureau, qemu-devel, Jailhouse, Markus Armbruster
"Wang, Wei W" <wei.w...@intel.com> writes:

> On Saturday, January 21, 2017 12:38 AM, Jan Kiszka wrote:
>> On 2017-01-20 12:54, Wang, Wei W wrote:
>>> Having a theoretical analysis on the performance: The traditional
>>> shared memory mechanism, sharing an intermediate memory, requires 2
>>> copies to get the packet transmitted. It's not just one more copy
>>> compared to the 1-copy solution, I think some more things we may
>>> need to take into account:
>>
>> 1-copy (+potential transfers to userspace, but that's the same for
>> everyone) is conceptually possible, definitely under stacks like
>> DPDK. However, Linux skbs are currently not prepared for picking up
>> shmem-backed packets, we already looked into this. Likely
>> addressable, though.
>
> Not sure how difficult it would be to get that change upstream-ed to
> the kernel, but looking forward to seeing your solutions.

The problem is that the shared memory mapping doesn't have a struct page
as required by lots of networking code.

--
Måns Rullgård

Markus Armbruster

Jan 23, 2017, 9:19:27 AM
to Jan Kiszka, qemu-devel, Jailhouse
Jan Kiszka <jan.k...@siemens.com> writes:

> Hi,
>
> some of you may know that we are using a shared memory device similar to
> ivshmem in the partitioning hypervisor Jailhouse [1].
>
> We started as being compatible to the original ivshmem that QEMU
> implements, but we quickly deviated in some details, and in the recent
> months even more. Some of the deviations are related to making the
> implementation simpler. The new ivshmem takes <500 LoC - Jailhouse is

Compare: hw/misc/ivshmem.c ~1000 SLOC, measured with sloccount.

> aiming at safety critical systems and, therefore, a small code base.
> Other changes address deficits in the original design, like missing
> life-cycle management.
>
> Now the question is if there is interest in defining a common new
> revision of this device and maybe also of some protocols used on top,
> such as virtual network links. Ideally, this would enable us to share
> Linux drivers. We will definitely go for upstreaming at least a network
> driver such as [2], a UIO driver and maybe also a serial port/console.
>
> I've attached a first draft of the specification of our new ivshmem
> device. A working implementation can be found in the wip/ivshmem2 branch
> of Jailhouse [3], the corresponding ivshmem-net driver in [4].
>
> Deviations from the original design:
>
> - Only two peers per link

Uh, define "link".

> This simplifies the implementation and also the interfaces (think of
> life-cycle management in a multi-peer environment). Moreover, we do
> not have an urgent use case for multiple peers, thus also not
> reference for a protocol that could be used in such setups. If someone
> else happens to share such a protocol, it would be possible to discuss
> potential extensions and their implications.
>
> - Side-band registers to discover and configure share memory regions
>
> This was one of the first changes: We removed the memory regions from
> the PCI BARs and gave them special configuration space registers. By
> now, these registers are embedded in a PCI capability. The reasons are
> that Jailhouse does not allow to relocate the regions in guest address
> space (but other hypervisors may if they like to) and that we now have
> up to three of them.

I'm afraid I don't quite understand the change, nor the rationale. I
guess I could figure out the former by studying the specification.

> - Changed PCI base class code to 0xff (unspecified class)

Changed from 0x5 (memory controller).

> This allows us to define our own sub classes and interfaces. That is
> now exploited for specifying the shared memory protocol the two
> connected peers should use. It also allows the Linux drivers to match
> on that.
>
> - INTx interrupts support is back
>
> This is needed on target platforms without MSI controllers, i.e.
> without the required guest support. Namely some PCI-less ARM SoCs
> required the reintroduction. While doing this, we also took care of
> keeping the MMIO registers free of privileged controls so that a
> guest OS can map them safely into a guest userspace application.

So you need interrupt capability. Current upstream ivshmem requires a
server such as the one in contrib/ivshmem-server/. What about yours?

The interrupt feature enables me to guess a definition of "link": A and
B are peers of the same link if they can interrupt each other.

Does your ivshmem2 support interrupt-less operation similar to
ivshmem-plain?

> And then there are some extensions of the original ivshmem:
>
> - Multiple shared memory regions, including unidirectional ones
>
> It is now possible to expose up to three different shared memory
> regions: The first one is read/writable for both sides. The second
> region is read/writable for the local peer and read-only for the
> remote peer (useful for output queues). And the third is read-only
> locally but read/writable remotely (ie. for input queues).
> Unidirectional regions prevent that the receiver of some data can
> interfere with the sender while it is still building the message, a
> property that is not only useful for safety critical communication,
> we are sure.
>
> - Life-cycle management via local and remote state
>
> Each device can now signal its own state in form of a value to the
> remote side, which triggers an event there.

How are "events" related to interrupts?

> Moreover, state changes
> done by the hypervisor to one peer are signalled to the other side.
> And we introduced a write-to-shared-memory mechanism for the
> respective remote state so that guests do not have to issue an MMIO
> access in order to check the state.
>
> So, this is our proposal. Would be great to hear some opinions if you
> see value in adding support for such an "ivshmem 2.0" device to QEMU as
> well and expand its ecosystem towards Linux upstream, maybe also DPDK
> again. If you see problems in the new design /wrt what QEMU provides so
> far with its ivshmem device, let's discuss how to resolve them. Looking
> forward to any feedback!

My general opinion on ivshmem is well-known, but I repeat it for the
record: merging it was a mistake, and using it is probably a mistake. I
detailed my concerns in "Why I advise against using ivshmem"[*].

My philosophical concerns remain. Perhaps you can assuage them.

Only some of my practical concerns have since been addressed. In part
by myself, because having a flawed implementation of a bad idea is
strictly worse than the same with flaws corrected as far as practical.
But even today, docs/specs/ivshmem-spec.txt is a rather depressing read.

However, there's one thing that's still worse than a more or less flawed
implementation of a bad idea: two implementations of a bad idea. Could
ivshmem2 be done in a way that permits *replacing* ivshmem?
[*] http://lists.gnu.org/archive/html/qemu-devel/2014-06/msg02968.html

Jan Kiszka

Jan 25, 2017, 4:18:50 AM
to Markus Armbruster, qemu-devel, Jailhouse
On 2017-01-23 15:19, Markus Armbruster wrote:
> Jan Kiszka <jan.k...@siemens.com> writes:
>
>> Hi,
>>
>> some of you may know that we are using a shared memory device similar to
>> ivshmem in the partitioning hypervisor Jailhouse [1].
>>
>> We started as being compatible to the original ivshmem that QEMU
>> implements, but we quickly deviated in some details, and in the recent
>> months even more. Some of the deviations are related to making the
>> implementation simpler. The new ivshmem takes <500 LoC - Jailhouse is
>
> Compare: hw/misc/ivshmem.c ~1000 SLOC, measured with sloccount.

That difference comes from remote/migration support and general QEMU
integration - likely not very telling due to the different environments.

>
>> aiming at safety critical systems and, therefore, a small code base.
>> Other changes address deficits in the original design, like missing
>> life-cycle management.
>>
>> Now the question is if there is interest in defining a common new
>> revision of this device and maybe also of some protocols used on top,
>> such as virtual network links. Ideally, this would enable us to share
>> Linux drivers. We will definitely go for upstreaming at least a network
>> driver such as [2], a UIO driver and maybe also a serial port/console.
>>
>> I've attached a first draft of the specification of our new ivshmem
>> device. A working implementation can be found in the wip/ivshmem2 branch
>> of Jailhouse [3], the corresponding ivshmem-net driver in [4].
>>
>> Deviations from the original design:
>>
>> - Only two peers per link
>
> Uh, define "link".

VMs are linked via a common shared memory. Interrupt delivery follows
that route as well.

>
>> This simplifies the implementation and also the interfaces (think of
>> life-cycle management in a multi-peer environment). Moreover, we do
>> not have an urgent use case for multiple peers, thus also not
>> reference for a protocol that could be used in such setups. If someone
>> else happens to share such a protocol, it would be possible to discuss
>> potential extensions and their implications.
>>
>> - Side-band registers to discover and configure share memory regions
>>
>> This was one of the first changes: We removed the memory regions from
>> the PCI BARs and gave them special configuration space registers. By
>> now, these registers are embedded in a PCI capability. The reasons are
>> that Jailhouse does not allow to relocate the regions in guest address
>> space (but other hypervisors may if they like to) and that we now have
>> up to three of them.
>
> I'm afraid I don't quite understand the change, nor the rationale. I
> guess I could figure out the former by studying the specification.

a) It's a Jailhouse thing (we disallow the guest to move the regions
around in its address space)
b) With 3 regions + MSI-X + MMIO registers, we run out of BARs (or
would have to downgrade them to 32 bit)

>
>> - Changed PCI base class code to 0xff (unspecified class)
>
> Changed from 0x5 (memory controller).

Right.

>
>> This allows us to define our own sub classes and interfaces. That is
>> now exploited for specifying the shared memory protocol the two
>> connected peers should use. It also allows the Linux drivers to match
>> on that.
>>
>> - INTx interrupts support is back
>>
>> This is needed on target platforms without MSI controllers, i.e.
>> without the required guest support. Namely some PCI-less ARM SoCs
>> required the reintroduction. While doing this, we also took care of
>> keeping the MMIO registers free of privileged controls so that a
>> guest OS can map them safely into a guest userspace application.
>
> So you need interrupt capability. Current upstream ivshmem requires a
> server such as the one in contrib/ivshmem-server/. What about yours?

IIRC, the need for a server with QEMU/KVM is related to live migration.
Jailhouse is simpler, all guests are managed by the same hypervisor
instance, and there is no migration. That makes interrupt delivery much
simpler as well. However, the device spec should not exclude other
architectures.

>
> The interrupt feature enables me to guess a definition of "link": A and
> B are peers of the same link if they can interrupt each other.
>
> Does your ivshmem2 support interrupt-less operation similar to
> ivshmem-plain?

Each receiver of interrupts is free to enable that - or leave it off as
it is the default after reset. But currently the spec demands that
either MSI-X or INTx is reported as available to the guests. We could
extend it to permit reporting no interrupts support if there is a good
case for it.

I will have to look into the details of the client-server structure of
QEMU's ivshmem again to answer the question of under which restrictions
we can make it both simpler and more robust. As Jailhouse has no live
migration support, requirements on ivshmem related to that may only be
addressed by chance so far.

>
>> And then there are some extensions of the original ivshmem:
>>
>> - Multiple shared memory regions, including unidirectional ones
>>
>> It is now possible to expose up to three different shared memory
>> regions: The first one is read/writable for both sides. The second
>> region is read/writable for the local peer and read-only for the
>> remote peer (useful for output queues). And the third is read-only
>> locally but read/writable remotely (ie. for input queues).
>> Unidirectional regions prevent that the receiver of some data can
>> interfere with the sender while it is still building the message, a
>> property that is not only useful for safety critical communication,
>> we are sure.
>>
>> - Life-cycle management via local and remote state
>>
>> Each device can now signal its own state in form of a value to the
>> remote side, which triggers an event there.
>
> How are "events" related to interrupts?

Confusing term chosen here: an interrupt is triggered on the remote side
(if it has interrupts enabled).

>
>> Moreover, state changes
>> done by the hypervisor to one peer are signalled to the other side.
>> And we introduced a write-to-shared-memory mechanism for the
>> respective remote state so that guests do not have to issue an MMIO
>> access in order to check the state.
>>
>> So, this is our proposal. Would be great to hear some opinions if you
>> see value in adding support for such an "ivshmem 2.0" device to QEMU as
>> well and expand its ecosystem towards Linux upstream, maybe also DPDK
>> again. If you see problems in the new design /wrt what QEMU provides so
>> far with its ivshmem device, let's discuss how to resolve them. Looking
>> forward to any feedback!
>
> My general opinion on ivshmem is well-known, but I repeat it for the
> record: merging it was a mistake, and using it is probably a mistake. I
> detailed my concerns in "Why I advise against using ivshmem"[*].
>
> My philosophical concerns remain. Perhaps you can assuage them.
>
> Only some of my practical concerns have since been addressed. In part
> by myself, because having a flawed implementation of a bad idea is
> strictly worse than the same with flaws corrected as far as practical.
> But even today, docs/specs/ivshmem-spec.txt is a rather depressing read.

I agree.

>
> However, there's one thing that's still worse than a more or less flawed
> implementation of a bad idea: two implementations of a bad idea. Could
> ivshmem2 be done in a way that permits *replacing* ivshmem?

If people see the need for a common ivshmem2, that should of course be
designed to replace the original version in QEMU. I wouldn't like to
design it to be backward compatible, but the new version should
provide all useful and required features of the old one.

Of course, I'm careful about investing much time into expanding the
existing, for Jailhouse possibly sufficient, design if there is no real
interest in continuing the ivshmem support in QEMU - because of
vhost-pci or other reasons. But if that interest exists, it would be
beneficial for us to have QEMU support a compatible version and use
the same guest drivers. Then I would start looking into concrete
patches for it as well.

Jan

Markus Armbruster

Jan 27, 2017, 2:36:13 PM
to Jan Kiszka, Jailhouse, qemu-devel
Jan Kiszka <jan.k...@web.de> writes:

> On 2017-01-23 15:19, Markus Armbruster wrote:
>> Jan Kiszka <jan.k...@siemens.com> writes:
>>
>>> Hi,
>>>
>>> some of you may know that we are using a shared memory device similar to
>>> ivshmem in the partitioning hypervisor Jailhouse [1].
>>>
>>> We started as being compatible to the original ivshmem that QEMU
>>> implements, but we quickly deviated in some details, and in the recent
>>> months even more. Some of the deviations are related to making the
>>> implementation simpler. The new ivshmem takes <500 LoC - Jailhouse is
>>
>> Compare: hw/misc/ivshmem.c ~1000 SLOC, measured with sloccount.
>
> That difference comes from remote/migration support and general QEMU
> integration - likely not very telling due to the different environments.

Plausible.
Have you considered putting your three shared memory regions in memory
consecutively, so they can be covered by a single BAR? Similar to how a
single BAR covers both MSI-X table and PBA.
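
A sketch of what that could look like, borrowing the MSI-X convention
of locating a sub-region by BAR indicator plus offset; this structure
is invented for illustration only, it is not from the ivshmem2 draft.

  /*
   * Illustrative only: locate each region as BIR + offset + size, so
   * that all three regions can live consecutively inside one BAR,
   * similar to how the MSI-X capability locates its table and PBA.
   */
  #include <stdint.h>

  struct ivshmem2_region_desc {
          uint8_t  bir;        /* index of the BAR holding the region */
          uint8_t  reserved[3];
          uint32_t offset;     /* offset of the region within that BAR */
          uint64_t size;       /* region size in bytes, 0 if absent */
  } __attribute__((packed));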

>>> - Changed PCI base class code to 0xff (unspecified class)
>>
>> Changed from 0x5 (memory controller).
>
> Right.
>
>>
>>> This allows us to define our own sub classes and interfaces. That is
>>> now exploited for specifying the shared memory protocol the two
>>> connected peers should use. It also allows the Linux drivers to match
>>> on that.
>>>
>>> - INTx interrupts support is back
>>>
>>> This is needed on target platforms without MSI controllers, i.e.
>>> without the required guest support. Namely some PCI-less ARM SoCs
>>> required the reintroduction. While doing this, we also took care of
>>> keeping the MMIO registers free of privileged controls so that a
>>> guest OS can map them safely into a guest userspace application.
>>
>> So you need interrupt capability. Current upstream ivshmem requires a
>> server such as the one in contrib/ivshmem-server/. What about yours?
>
> IIRC, the need for a server with QEMU/KVM is related to live migration.
> Jailhouse is simpler, all guests are managed by the same hypervisor
> instance, and there is no migration. That makes interrupt delivery much
> simpler as well. However, the device spec should not exclude other
> architectures.

The server doesn't really help with live migration. It's used to dole
out file descriptors for shared memory and interrupt signalling, and to
notify of peer connect/disconnect.

>> The interrupt feature enables me to guess a definition of "link": A and
>> B are peers of the same link if they can interrupt each other.
>>
>> Does your ivshmem2 support interrupt-less operation similar to
>> ivshmem-plain?
>
> Each receiver of interrupts is free to enable that - or leave it off as
> it is the default after reset. But currently the spec demands that
> either MSI-X or INTx is reported as available to the guests. We could
> extend it to permit reporting no interrupts support if there is a good
> case for it.

I think the case for interrupt-incapable ivshmem-plain is that
interrupt-capable ivshmem-doorbell requires a server, and is therefore a
bit more complex to set up, and has additional failure modes.

If that wasn't the case, a single device variant would make more sense.

Besides, contrib/ivshmem-server/ is not fit for production use.

> I will have to look into the details of the client-server structure of
> QEMU's ivshmem again to answer the question under with restriction we
> can make it both simpler and more robust. As Jailhouse has no live
> migration support, requirements on ivshmem related to that may only be
> addressed by chance so far.

Here's how live migration works with QEMU's ivshmem: exactly one peer
(the "master") migrates with its ivshmem device, all others need to hot
unplug ivshmem, migrate, hot plug it back after the master completed its
migration. The master connects to the new server on the destination on
startup, then live migration copies over the shared memory. The other
peers connect to the new server when they get their ivshmem hot plugged
again.

>>> And then there are some extensions of the original ivshmem:
>>>
>>> - Multiple shared memory regions, including unidirectional ones
>>>
>>> It is now possible to expose up to three different shared memory
>>> regions: The first one is read/writable for both sides. The second
>>> region is read/writable for the local peer and read-only for the
>>> remote peer (useful for output queues). And the third is read-only
>>> locally but read/writable remotely (ie. for input queues).
>>> Unidirectional regions prevent that the receiver of some data can
>>> interfere with the sender while it is still building the message, a
>>> property that is not only useful for safety critical communication,
>>> we are sure.
>>>
>>> - Life-cycle management via local and remote state
>>>
>>> Each device can now signal its own state in form of a value to the
>>> remote side, which triggers an event there.
>>
>> How are "events" related to interrupts?
>
> Confusing term chosen here: an interrupt is triggered on the remote side
> (if it has interrupts enabled).

Got it.

Nobody likes to provide backward compatibility, but everybody likes to
take advantage of it :)

Seriously, I can't say whether feature parity would suffice, or whether
we need full backward compatibility.

> Of course, I'm careful with investing much time into expanding the
> existing, for Jailhouse possibly sufficient design if there no real
> interest in continuing the ivshmem support in QEMU - because of
> vhost-pci or other reasons. But if that interest exists, it would be
> beneficial for us to have QEMU supporting a compatible version and using
> the same guest drivers. Then I would start looking into concrete patches
> for it as well.

Interest is difficult for me to gauge, not least because alternatives
are still being worked on.

Jan Kiszka

Jan 29, 2017, 3:43:42 AM
to Markus Armbruster, Jailhouse, qemu-devel
That would still require passing some size information three times
(each region can be different or empty/non-existent). Moreover, a) is
then not possible without ugly modifications to the guest, because
guests expect BAR-based regions to be relocatable.

>
>>>> - Changed PCI base class code to 0xff (unspecified class)
>>>
>>> Changed from 0x5 (memory controller).
>>
>> Right.
>>
>>>
>>>> This allows us to define our own sub classes and interfaces. That is
>>>> now exploited for specifying the shared memory protocol the two
>>>> connected peers should use. It also allows the Linux drivers to match
>>>> on that.
>>>>
>>>> - INTx interrupts support is back
>>>>
>>>> This is needed on target platforms without MSI controllers, i.e.
>>>> without the required guest support. Namely some PCI-less ARM SoCs
>>>> required the reintroduction. While doing this, we also took care of
>>>> keeping the MMIO registers free of privileged controls so that a
>>>> guest OS can map them safely into a guest userspace application.
>>>
>>> So you need interrupt capability. Current upstream ivshmem requires a
>>> server such as the one in contrib/ivshmem-server/. What about yours?
>>
>> IIRC, the need for a server with QEMU/KVM is related to live migration.
>> Jailhouse is simpler, all guests are managed by the same hypervisor
>> instance, and there is no migration. That makes interrupt delivery much
>> simpler as well. However, the device spec should not exclude other
>> architectures.
>
> The server doesn't really help with live migration. It's used to dole
> out file descriptors for shared memory and interrupt signalling, and to
> notify of peer connect/disconnect.

That should be solvable directly between two peers.

OK, hot-plug is a simple answer to this problem. It would be even
cleaner to support this from the guest POV with the new state
signalling mechanism of ivshmem2.

Given the deficits of the current design and the lack of driver support
in Linux, people should be happy if the new interface becomes the
default while the old one can still be selected for a while. But a
first step will likely be a separate implementation of the interface.

>
>> Of course, I'm careful with investing much time into expanding the
>> existing, for Jailhouse possibly sufficient design if there no real
>> interest in continuing the ivshmem support in QEMU - because of
>> vhost-pci or other reasons. But if that interest exists, it would be
>> beneficial for us to have QEMU supporting a compatible version and using
>> the same guest drivers. Then I would start looking into concrete patches
>> for it as well.
>
> Interest is difficult for me to gauge, not least because alternatives
> are still being worked on.

I'm considering suggesting this as a GSoC project now.

Jan

msuchanek

Jan 29, 2017, 6:56:26 AM
to Stefan Hajnoczi, Jan Kiszka, Jailhouse, Wei Wang, Marc-André Lureau, qemu-devel, Markus Armbruster, Qemu-devel
What's asymmetric about storage? IIRC both SCSI and FireWire, which can
be used for storage, are symmetric. All asymmetry comes only from usage
conventions or less capable buses like IDE/SATA.

Thanks

Michal

Marc-André Lureau

Jan 29, 2017, 9:00:48 AM
to Jan Kiszka, Jailhouse, Markus Armbruster, qemu-devel, Wei Wang
Hi

On Sun, Jan 29, 2017 at 12:44 PM Jan Kiszka <jan.k...@web.de> wrote:
>> Of course, I'm careful with investing much time into expanding the
>> existing, for Jailhouse possibly sufficient design if there no real
>> interest in continuing the ivshmem support in QEMU - because of
>> vhost-pci or other reasons. But if that interest exists, it would be
>> beneficial for us to have QEMU supporting a compatible version and using
>> the same guest drivers. Then I would start looking into concrete patches
>> for it as well.
>
> Interest is difficult for me to gauge, not least because alternatives
> are still being worked on.

I'm considering to suggest this as GSoC project now.

It's better for a student and for the community if the work gets accepted in the end.

So, I think that could be an interesting GSoC project (implementing your ivshmem 2 proposal). However, if the qemu community isn't ready to accept a new ivshmem and would rather have a vhost-pci based solution, I would suggest a different project (hopefully Wei Wang can help define it and mentor it): work on a vhost-pci using dedicated shared PCI BARs (and kernel support to avoid the extra copy - if I understand the extra-copy situation correctly).
--
Marc-André Lureau

Jan Kiszka

Jan 29, 2017, 9:14:16 AM
to Marc-André Lureau, Jailhouse, Markus Armbruster, qemu-devel, Wei Wang
It's still open whether vhost-pci can replace ivshmem (not to speak of
whether that is desirable for Jailhouse - I'm still studying it). In
that light, having both implementations available to do real
comparisons is valuable IMHO.

That said, we will play with open cards, explain the situation to the
student and let her/him decide knowingly.

Jan

PS: We have a mixed history w.r.t. actually merging student projects.

Markus Armbruster

unread,
Jan 30, 2017, 3:00:17 AM1/30/17
to Jan Kiszka, Jailhouse, qemu-devel
Yes. Precedent: the locations of the MSI-X table and PBA are specified
in the MSI-X Capability Structure as an offset plus BIR.
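
For concreteness, that register packs a 3-bit BAR Indicator (BIR) into
the low bits of a single dword and keeps the 8-byte-aligned offset in
the remaining bits; a shared-memory capability could describe its
regions along the same lines. A minimal decoding sketch (helper names
are made up here, this is not the actual ivshmem2 layout):

    #include <stdint.h>

    /* Decode an MSI-X-style "offset + BIR" dword: bits 2:0 select the
       BAR, the remaining bits form an 8-byte-aligned offset within it. */
    static inline uint8_t region_bir(uint32_t offset_bir)
    {
        return offset_bir & 0x7;
    }

    static inline uint32_t region_offset(uint32_t offset_bir)
    {
        return offset_bir & ~0x7u;
    }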

> Moreover, a) is not
> possible then without ugly modifications to the guest because they
> expect BAR-based regions to be relocatable.

Can you explain why it is a requirement that the guest must not map the
shared memory into its address space on its own, just like any other
piece of device memory?

>>>>> - Changed PCI base class code to 0xff (unspecified class)
>>>>
>>>> Changed from 0x5 (memory controller).
>>>
>>> Right.
>>>
>>>>
>>>>> This allows us to define our own sub classes and interfaces. That is
>>>>> now exploited for specifying the shared memory protocol the two
>>>>> connected peers should use. It also allows the Linux drivers to match
>>>>> on that.
>>>>>
>>>>> - INTx interrupts support is back
>>>>>
>>>>> This is needed on target platforms without MSI controllers, i.e.
>>>>> without the required guest support. Namely some PCI-less ARM SoCs
>>>>> required the reintroduction. While doing this, we also took care of
>>>>> keeping the MMIO registers free of privileged controls so that a
>>>>> guest OS can map them safely into a guest userspace application.
>>>>
>>>> So you need interrupt capability. Current upstream ivshmem requires a
>>>> server such as the one in contrib/ivshmem-server/. What about yours?
>>>
>>> IIRC, the need for a server with QEMU/KVM is related to live migration.
>>> Jailhouse is simpler, all guests are managed by the same hypervisor
>>> instance, and there is no migration. That makes interrupt delivery much
>>> simpler as well. However, the device spec should not exclude other
>>> architectures.
>>
>> The server doesn't really help with live migration. It's used to dole
>> out file descriptors for shared memory and interrupt signalling, and to
>> notify of peer connect/disconnect.
>
> That should be solvable directly between two peers.

Even between multiple peers, but it might complicate the peers.

Note that the current ivshmem client-server protocol doesn't support
graceful recovery from a server crash. The clients can hobble on with
reduced functionality, though (see ivshmem-spec.txt). Live migration
could be a way to recover, if the application permits it.

Yes, proper state signalling should make this cleaner. Without it,
every protocol built on top of ivshmem needs to come up with its own
state signalling. The robustness problems should be obvious.

This is one aspect of my objection to the idea "just share some memory,
it's simple": it's not a protocol. It's at best a building block for
protocols.

With ivshmem-doorbell, peers get notified of connects and disconnects.
However, the device can't notify guest software. Fixable with
additional registers and an interrupt.
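
As a rough sketch of what such additional registers could look like
(register indices, state values and names below are hypothetical, not
taken from any existing ivshmem spec):

    #include <stdint.h>

    enum peer_state { PEER_RESET = 0, PEER_INIT = 1, PEER_READY = 2 };

    #define REG_LOCAL_STATE   0  /* write: our state, mirrored to peer */
    #define REG_REMOTE_STATE  1  /* read: last state written by peer   */

    static void announce_state(volatile uint32_t *regs, enum peer_state s)
    {
        /* the device would store the value and raise an interrupt on
           the remote side, so connects/disconnects become visible to
           guest drivers */
        regs[REG_LOCAL_STATE] = s;
    }

    static int peer_is_ready(volatile uint32_t *regs)
    {
        return regs[REG_REMOTE_STATE] == PEER_READY;
    }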

The design of ivshmem-plain has each peer knowing nothing about the
others, so a fix would require a redesign.

[...]

Markus Armbruster

unread,
Jan 30, 2017, 3:02:25 AM1/30/17
to Jan Kiszka, Marc-André Lureau, Jailhouse, Wei Wang, qemu-devel
Jan Kiszka <jan.k...@web.de> writes:

> On 2017-01-29 15:00, Marc-André Lureau wrote:
>> Hi
>>
>> On Sun, Jan 29, 2017 at 12:44 PM Jan Kiszka <jan.k...@web.de
>> <mailto:jan.k...@web.de>> wrote:
>>
>> >> Of course, I'm careful with investing much time into expanding the
>> >> existing, for Jailhouse possibly sufficient design if there is no real
>> >> interest in continuing the ivshmem support in QEMU - because of
>> >> vhost-pci or other reasons. But if that interest exists, it would be
>> >> beneficial for us to have QEMU supporting a compatible version
>> and using
>> >> the same guest drivers. Then I would start looking into concrete
>> patches
>> >> for it as well.
>> >
>> > Interest is difficult for me to gauge, not least because alternatives
>> > are still being worked on.
>>
>> I'm considering suggesting this as a GSoC project now.
>>
>>
>> It's better for a student and for the community if the work gets accepted
>> in the end.

Yes.

>> So, I think that could be an interesting GSoC (implementing your ivshmem
>> 2 proposal). However, if the qemu community isn't ready to accept a new
>> ivshmem, and would rather have vhost-pci based solution, I would suggest
>> a different project (hopefully Wei Wang can help define it and mentor):
>> work on a vhost-pci using dedicated shared PCI BARs (and kernel support
>> to avoid extra copy - if I understand the extra copy situation correctly).
>
> It's still open if vhost-pci can replace ivshmem (not to speak of being
> desirable for Jailhouse - I'm still studying it). In that light, having
> both implementations available to do real comparisons is valuable IMHO.

Yes, but is it appropriate for GSoC?

> That said, we will play with open cards, explain the situation to the
> student and let her/him decide knowingly.

Both the student and the QEMU project need to consider the situation
carefully.

> Jan
>
> PS: We have a mixed history /wrt actually merging student projects.

Yes, but having screwed up is no license to screw up some more :)

Jan Kiszka

unread,
Jan 30, 2017, 3:05:46 AM1/30/17
to Markus Armbruster, Marc-André Lureau, Jailhouse, Wei Wang, qemu-devel
After having received this feedback from several sides, I will drop
that proposal from our list. So, don't worry. ;)

Jan

Jan Kiszka

unread,
Jan 30, 2017, 3:14:32 AM1/30/17
to Markus Armbruster, Jailhouse, qemu-devel
It requires reconfiguration of the sensitive 2nd level page tables
during runtime of the guest. We are avoiding the necessary checking and
synchronization measures so far, which reduces code complexity further.

BTW, PCI has a similar concept of static assignment (PCI EA), but that
is unfortunately incompatible with our needs [1].

True, but that is exactly the advantage we see for our case: The
hypervisor needs no knowledge about the protocol run over the link. That
was one reason to avoid virtio so far.

>
> With ivshmem-doorbell, peers get notified of connects and disconnects.
> However, the device can't notify guest software. Fixable with
> additional registers and an interrupt.
>
> The design of ivshmem-plain has peers knowing nothing about their peers,
> so a fix would require a redesign.
>
> [...]
>

Jan

[1] https://groups.google.com/forum/#!topic/jailhouse-dev/H62ahr0_bRk

Stefan Hajnoczi

unread,
Jan 30, 2017, 6:25:58 AM1/30/17
to msuchanek, Jan Kiszka, Jailhouse, Wei Wang, Marc-André Lureau, qemu-devel, Markus Armbruster, Qemu-devel
I'll also add Intel NVMe and virtio-blk to the list of interfaces that
are not symmetric.

Even for SCSI, separate roles for initiator and target are central to
the SCSI Architecture Model. The consequence is that hardware
interfaces and software stacks are not symmetric. For example, the
Linux SCSI target only supports a small set of FC HBAs with explicit
target mode support rather than all SCSI HBAs.

Intuitively this makes sense - if the I/O has clear "client" and
"server" roles then why should both sides implement both roles? It adds
cost and complexity for no benefit.

The same goes for other device types like graphics cards. They are
inherently asymmetric. Only one side has the actual hardware to perform
the I/O so it doesn't make sense to be symmetric.

You can pretend they are symmetric by restricting the hardware interface
and driver to just message passing. Then another layer of software
handles the asymmetric behavior. But then you may as well use iSCSI,
VNC, etc and not have a hardware interface for disk and graphics.

Stefan

Markus Armbruster

unread,
Jan 30, 2017, 7:19:09 AM1/30/17
to Jan Kiszka, Jailhouse, qemu-devel
You mean the hypervisor needs to act when the guest maps BARs, and that
gives the guest an attack vector?

Don't you have to deal with that anyway, for other PCI devices?

This is just out of curiosity, feel free to ignore me :)

> BTW, PCI has a similar concept of static assignment (PCI EA), but that
> is unfortunately incompatible with our needs [1].

Interesting.

I understand where you're coming from. I think the correct answer is to
layer protocols, and choose carefully how much of the stack to keep in
the hypervisor.

I feel you take (at least) two steps towards providing a (low-level)
protocol. One, you provide for an ID of the next higher protocol level
(see "Changed PCI base class code" above). Two, you include generic
state signalling.
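
On the first point, a Linux guest driver could then bind on the class
code alone, with the sub-class selecting the protocol. A hedged sketch
(the sub-class value 0x01 for a virtual network link is a placeholder,
not a defined assignment):

    #include <linux/module.h>
    #include <linux/pci.h>

    static const struct pci_device_id ivshmem_net_ids[] = {
        /* base class 0xff (unassigned), hypothetical sub-class 0x01,
           any programming interface */
        { PCI_DEVICE_CLASS(0xff0100, 0xffff00) },
        { 0 }
    };
    MODULE_DEVICE_TABLE(pci, ivshmem_net_ids);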

Jan Kiszka

unread,
Jan 30, 2017, 10:57:50 AM1/30/17
to Markus Armbruster, Jailhouse, qemu-devel
On 2017-01-30 13:19, Markus Armbruster wrote:
>>> Can you explain why not letting the guest map the shared memory into its
>>> address space on its own just like any other piece of device memory is a
>>> requirement?
>>
>> It requires reconfiguration of the sensitive 2nd level page tables
>> during runtime of the guest. We are avoiding the necessary checking and
>> synchronization measures so far which reduces code complexity further.
>
> You mean the hypervisor needs to act when the guest maps BARs, and that
> gives the guest an attack vector?

Possibly; at least correctness issues will arise. We would need to add
TLB flushes, for example, something that is not needed right now with
the mappings remaining static while a guest is running.

>
> Don't you have to deal with that anyway, for other PCI devices?

Physical devices are presented to the guest with their BARs already
programmed (as if the firmware had done that), and Jailhouse denies
reprogramming them (writes are accepted only for the purpose of size
discovery). Linux is fine with that, and RTOSes ported to Jailhouse only
become simpler.

Virtualized regions are trap-and-emulate anyway, so there is no need to
reprogram the mappings.
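
A rough sketch of how a guest driver might consume such a fixed
assignment, reading the region's address from a vendor capability
instead of relying on a relocatable BAR (the capability layout and
offsets below are made up for illustration, not the actual ivshmem2
registers; a real driver may also want a cacheable mapping instead of
plain ioremap):

    #include <linux/pci.h>
    #include <linux/io.h>

    #define SHMEM_CAP_ADDR_LO  0x08  /* hypothetical offsets in the cap */
    #define SHMEM_CAP_ADDR_HI  0x0c
    #define SHMEM_CAP_SIZE     0x10

    static void __iomem *map_fixed_shmem(struct pci_dev *pdev)
    {
        u32 lo, hi, size;
        int cap = pci_find_capability(pdev, PCI_CAP_ID_VNDR);

        if (!cap)
            return NULL;

        pci_read_config_dword(pdev, cap + SHMEM_CAP_ADDR_LO, &lo);
        pci_read_config_dword(pdev, cap + SHMEM_CAP_ADDR_HI, &hi);
        pci_read_config_dword(pdev, cap + SHMEM_CAP_SIZE, &size);

        /* the address is fixed by the hypervisor; the guest only maps it */
        return ioremap(((u64)hi << 32) | lo, size);
    }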

Jan

Wang, Wei W

unread,
Jan 30, 2017, 9:51:08 PM1/30/17
to Jan Kiszka, Marc-André Lureau, Jailhouse, Markus Armbruster, qemu-devel
Thanks for the suggestion. I'm glad to help with it.

For that sort of usage (static configuration extension [1]), I'm thinking that it's possible to build symmetric vhost-pci-net communication, as opposed to “vhost-pci-net <-> virtio-net”.

> It's still open if vhost-pci can replace ivshmem (not to speak of being desirable
> for Jailhouse - I'm still studying). In that light, having both implementations
> available to do real comparisons is valuable IMHO.
>
> That said, we will play with open cards, explain the student the situation and let
> her/him decide knowingly.

I think the static configuration of vhost-pci would be quite similar to your ivshmem-based proposal - it could be thought of as moving your proposal into the virtio device structure. Do you see any other big differences? Or is there any fundamental reason why it would not be good to do that based on virtio? Thanks.

Best,
Wei

[1] static configuration extension: set up the vhost-pci device via the QEMU command line (rather than hot-plugging it via the vhost-user protocol), and share a piece of memory between two VMs (rather than the whole VM's memory)
