Re: Vhost-pci RFC2.0


Jan Kiszka

Apr 19, 2017, 3:35:19 AM
to Wang, Wei W, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
On 2017-04-19 08:38, Wang, Wei W wrote:
> Hi,
>
> We made some design changes to the original vhost-pci design, and want
> to open a discussion about the latest design (labelled 2.0) and its
> extension (2.1).
> 2.0 design: One VM shares the entire memory of another VM.
> 2.1 design: One VM uses an intermediate memory shared with another VM
> for packet transmission.
>
> For the convenience of discussion, I have some pictures presented at
> this link:
> https://github.com/wei-w-wang/vhost-pci-discussion/blob/master/vhost-pci-rfc2.0.pdf
>
> Fig. 1 shows the common driver frame that we want to use to build the
> 2.0 and 2.1 designs. A TX/RX engine consists of a local ring and an
> exotic ring.
> Local ring:
> 1) allocated by the driver itself;
> 2) registered with the device (i.e. virtio_add_queue())
> Exotic ring:
> 1) ring memory comes from the outside (of the driver), and is exposed
> to the driver via a BAR MMIO;

Small additional requirement: in order to make this usable with
Jailhouse as well, we also need a side-channel configuration for the
regions, i.e. likely via a PCI capability. There are too few BARs, and
they suggest relocatability, which is not available under Jailhouse for
simplicity reasons (IOW, the shared regions are statically mapped by the
hypervisor into the affected guest address spaces).
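
As a rough illustration of what I mean (field names and layout are made
up here, not taken from the ivshmem 2.0 spec), such a capability would
basically just carry a guest-physical address and size per region:

#include <stdint.h>

/* Illustrative sketch only -- not the actual ivshmem 2.0 layout. */
struct shmem_region_cap {
	uint8_t  cap_vndr;     /* PCI_CAP_ID_VNDR (0x09) */
	uint8_t  cap_next;     /* pointer to the next capability */
	uint8_t  cap_len;      /* length of this capability */
	uint8_t  region_id;    /* which shared region is described */
	uint64_t region_addr;  /* guest-physical base of the region */
	uint64_t region_size;  /* region size in bytes */
} __attribute__((packed));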

> 2) does not have a registration in the device, so no ioeventfd/irqfd
> or configuration registers are allocated in the device
>
> Fig. 2 shows how the driver frame is used to build the 2.0 design.
> 1) Asymmetric: vhost-pci-net <-> virtio-net
> 2) VM1 shares the entire memory of VM2, and the exotic rings are the rings
> from VM2.
> 3) Performance (in terms of copies between VMs):
> TX: 0-copy (packets are put to VM2’s RX ring directly)
> RX: 1-copy (the green arrow line in the VM1’s RX engine)
>
> Fig. 3 shows how the driver frame is used to build the 2.1 design.
> 1) Symmetric: vhost-pci-net <-> vhost-pci-net

This is interesting!

> 2) Share an intermediate memory, allocated by VM1’s vhost-pci device,
> for data exchange, and the exotic rings are built on the shared memory
> 3) Performance:
> TX: 1-copy
> RX: 1-copy

I'm not yet sure I got this right: there are two different MMIO regions
involved, right? One is used for VM1's RX / VM2's TX, and the other for
the reverse path? That would allow us to meet our requirement of having
those regions mapped with asymmetric permissions (RX read-only, TX
read/write).

>
> Fig. 4 shows the inter-VM notification path for 2.0 (2.1 is similar).
> The four eventfds are allocated by virtio-net, and shared with
> vhost-pci-net:
> Uses virtio-net’s TX/RX kickfd as the vhost-pci-net’s RX/TX callfd
> Uses virtio-net’s TX/RX callfd as the vhost-pci-net’s RX/TX kickfd
> Example of how it works:
> After packets are put into vhost-pci-net’s TX, the driver kicks TX,
> which causes an interrupt associated with fd3 to be injected into
> virtio-net
>
> The draft code of the 2.0 design is ready, and can be found here:
> Qemu: https://github.com/wei-w-wang/vhost-pci-device
> Guest driver: https://github.com/wei-w-wang/vhost-pci-driver
>
> We tested the 2.0 implementation using the Spirent packet
> generator to transmit 64B packets. The results show that the
> throughput of vhost-pci reaches around 1.8 Mpps, which is roughly
> twice that of the legacy OVS+DPDK setup. Vhost-pci also shows
> better scalability than OVS+DPDK.
>

Do you have numbers for the symmetric 2.1 case as well? Or is the driver
not ready for that yet? Otherwise, I could try to make it work over
a simplistic vhost-pci 2.1 version in Jailhouse as well. That would give
a better picture of how much additional complexity this would mean
compared to our ivshmem 2.0.

Jan

--
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

Wei Wang

Apr 19, 2017, 4:40:38 AM
to Jan Kiszka, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
What kind of configuration would you need for the regions?
I think adding a PCI capability should be easy.

>> 2) does not have a registration in the device, so no ioeventfd/irqfd,
>> configuration
>> registers allocated in the device
>>
>> Fig. 2 shows how the driver frame is used to build the 2.0 design.
>> 1) Asymmetric: vhost-pci-net <-> virtio-net
>> 2) VM1 shares the entire memory of VM2, and the exotic rings are the rings
>> from VM2.
>> 3) Performance (in terms of copies between VMs):
>> TX: 0-copy (packets are put to VM2’s RX ring directly)
>> RX: 1-copy (the green arrow line in the VM1’s RX engine)
>>
>> Fig. 3 shows how the driver frame is used to build the 2.1 design.
>> 1) Symmetric: vhost-pci-net <-> vhost-pci-net
> This is interesting!
>
>> 2) Share an intermediate memory, allocated by VM1’s vhost-pci device,
>> for data exchange, and the exotic rings are built on the shared memory
>> 3) Performance:
>> TX: 1-copy
>> RX: 1-copy
> I'm not yet sure I got this right: there are two different MMIO regions
> involved, right? One is used for VM1's RX / VM2's TX, and the other for
> the reverse path? That would allow us to meet our requirement of having
> those regions mapped with asymmetric permissions (RX read-only, TX
> read/write).
The design presented here intends to use only one BAR to expose
both TX and RX. The two VMs share an intermediate memory here;
why couldn't we give the same permissions to TX and RX?


>>
>> Fig. 4 shows the inter-VM notification path for 2.0 (2.1 is similar).
>> The four eventfds are allocated by virtio-net, and shared with
>> vhost-pci-net:
>> Uses virtio-net’s TX/RX kickfd as the vhost-pci-net’s RX/TX callfd
>> Uses virtio-net’s TX/RX callfd as the vhost-pci-net’s RX/TX kickfd
>> Example of how it works:
>> After packets are put into vhost-pci-net’s TX, the driver kicks TX,
>> which causes an interrupt associated with fd3 to be injected into
>> virtio-net
>>
>> The draft code of the 2.0 design is ready, and can be found here:
>> Qemu: _https://github.com/wei-w-wang/vhost-pci-device_
>> Guest driver: _https://github.com/wei-w-wang/vhost-pci-driver_
>>
>> We tested the 2.0 implementation using the Spirent packet
>> generator to transmit 64B packets, the results show that the
>> throughput of vhost-pci reaches around 1.8Mpps, which is around
>> two times larger than the legacy OVS+DPDK. Also, vhost-pci shows
>> better scalability than OVS+DPDK.
>>
> Do you have numbers for the symmetric 2.1 case as well? Or is the driver
> not ready for that yet? Otherwise, I could try to make it work over
> a simplistic vhost-pci 2.1 version in Jailhouse as well. That would give
> a better picture of how much additional complexity this would mean
> compared to our ivshmem 2.0.
>

Implementation of 2.1 is not ready yet. We can extend it to 2.1 after
the common driver frame is reviewed.


Best,
Wei

Jan Kiszka

Apr 19, 2017, 4:49:09 AM
to Wei Wang, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
Basically address and size, see
https://github.com/siemens/jailhouse/blob/wip/ivshmem2/Documentation/ivshmem-v2-specification.md#vendor-specific-capability-id-09h

For security and/or safety reasons: the TX side can then safely prepare
and sign a message in-place because the RX side cannot mess around with
it while it is not yet signed (or check-summed). That saves one copy from
a secure place into the shared memory.

>
>>> Fig. 4 shows the inter-VM notification path for 2.0 (2.1 is similar).
>>> The four eventfds are allocated by virtio-net, and shared with
>>> vhost-pci-net:
>>> Uses virtio-net’s TX/RX kickfd as the vhost-pci-net’s RX/TX callfd
>>> Uses virtio-net’s TX/RX callfd as the vhost-pci-net’s RX/TX kickfd
>>> Example of how it works:
>>> After packets are put into vhost-pci-net’s TX, the driver kicks TX,
>>> which causes an interrupt associated with fd3 to be injected into
>>> virtio-net
>>> The draft code of the 2.0 design is ready, and can be found here:
>>> Qemu: _https://github.com/wei-w-wang/vhost-pci-device_
>>> Guest driver: _https://github.com/wei-w-wang/vhost-pci-driver_
>>> We tested the 2.0 implementation using the Spirent packet
>>> generator to transmit 64B packets, the results show that the
>>> throughput of vhost-pci reaches around 1.8Mpps, which is around
>>> two times larger than the legacy OVS+DPDK. Also, vhost-pci shows
>>> better scalability than OVS+DPDK.
>>>
>> Do you have numbers for the symmetric 2.1 case as well? Or is the driver
>> not yet ready for that yet? Otherwise, I could try to make it work over
>> a simplistic vhost-pci 2.1 version in Jailhouse as well. That would give
>> a better picture of how much additional complexity this would mean
>> compared to our ivshmem 2.0.
>>
>
> Implementation of 2.1 is not ready yet. We can extend it to 2.1 after
> the common driver frame is reviewed.

Can you assess the needed effort?

For us, this is a critical feature, because we need to decide if
vhost-pci can be an option at all. In fact, the "exotic ring" will be
the only way to provide secure inter-partition communication on Jailhouse.

Thanks,

Wei Wang

Apr 19, 2017, 5:07:38 AM
to Jan Kiszka, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
Got it, thanks. That should be easy to add to 2.1.

If we allow guest1 to write to RX, what safety issue would it cause to
guest2?

If what is here for 2.0 is suitable to be upstreamed, I think it will
be easy to extend it to 2.1 (probably within 1 month).

Best,
Wei




Jan Kiszka

Apr 19, 2017, 5:31:43 AM
to Wei Wang, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
This way, guest1 could trick guest2, in a race condition, into signing a
modified message instead of the original one.

Unfortunate ordering here, though, specifically if we need to modify
existing things instead of just adding something. We will need 2.1 prior
to committing to 2.0 being the right thing.

Wei Wang

Apr 19, 2017, 6:01:11 AM
to Jan Kiszka, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
Just to align on the context we are talking about: RX is the intermediate
shared ring that guest1 uses to receive packets and guest2 uses to send
packets.

Seems the issue is that guest1 would receive a hacked message from RX
(modified by itself). How would that affect guest2?

If you want, we can get the common part of the design ready first,
then we can start to build on the common part at the same time.
The draft code of 2.0 is ready. I can clean it up, making it easier for
us to continue and change.

Best,
Wei



Jan Kiszka

Apr 19, 2017, 6:36:24 AM
to Wei Wang, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
On 2017-04-19 12:02, Wei Wang wrote:
>>>>> The design presented here intends to use only one BAR to expose
>>>>> both TX and RX. The two VMs share an intermediate memory
>>>>> here, why couldn't we give the same permission to TX and RX?
>>>>>
>>>> For security and/or safety reasons: the TX side can then safely prepare
>>>> and sign a message in-place because the RX side cannot mess around with
>>>> it while not yet being signed (or check-summed). Saves one copy from a
>>>> secure place into the shared memory.
>>> If we allow guest1 to write to RX, what safety issue would it cause to
>>> guest2?
>> This way, guest1 could trick guest2, in a race condition, to sign a
>> modified message instead of the original one.
>>
> Just align the context that we are talking about: RX is the intermediate
> shared ring that guest1 uses to receive packets and guest2 uses to send
> packet.
>
> Seems the issue is that guest1 will receive a hacked message from RX
> (modified by itself). How would it affect guest2?

Retry: guest2 wants to send a signed/hashed message to guest1. For that
purpose, it starts to build that message inside the shared memory that
guest1 can at least read, then guest2 signs that message, also in-place.
If guest1 can modify the message inside the ring while guest2 has not
yet signed it, the result is invalid.

Now, if guest2 is the final receiver of the message, nothing is lost,
guest2 just shot itself in the foot. However, if guest2 is just a
router (gray channel) and the message travels further, guest2 has now
corrupted that channel without allowing the final receiver to detect
that. That's the scenario.
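
To make the race concrete, here is a minimal sketch (the helper names
are made up, this is not code from either project):

#include <stddef.h>
#include <stdint.h>

void build_payload(uint8_t *buf, size_t len);                    /* hypothetical */
void compute_hash(uint8_t *out, const uint8_t *buf, size_t len); /* hypothetical */

struct msg {
	uint8_t payload[64];
	uint8_t hash[32];
};

/* guest2 builds and signs the message directly in shared memory */
void tx_sign_in_place(struct msg *shared /* writable by both guests */)
{
	build_payload(shared->payload, sizeof(shared->payload));
	/* <-- if guest1 can write here, it can alter the payload ... */
	compute_hash(shared->hash, shared->payload, sizeof(shared->payload));
	/* ... and guest2 ends up signing data it never produced. */
}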

Without going into details yet, a meta requirement for us will be to
have advanced features optional and negotiable. Basically, we would like
to minimize the interface to an equivalent of what ivshmem 2.0 is about
(there is no need for more in a safe/secure partitioning scenario). At
the same time, the complexity for a guest should remain low as well.

From past experience, the only way to ensure that is having a working
prototype. So I will have to look into enabling that.
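
Just to sketch what "negotiable" could look like on the driver side (the
feature bit below is purely hypothetical, nothing that is defined
anywhere yet):

#include <linux/virtio.h>
#include <linux/virtio_config.h>

/* Hypothetical feature bit -- for illustration only. */
#define VIRTIO_VHOSTPCI_F_ASYM_REGIONS 40

static bool use_asym_regions(struct virtio_device *vdev)
{
	/* Fall back to the minimal interface unless both sides agreed. */
	return virtio_has_feature(vdev, VIRTIO_VHOSTPCI_F_ASYM_REGIONS);
}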

Wei Wang

Apr 19, 2017, 7:12:07 AM
to Jan Kiszka, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
If guest2 is a malicious guest, I think it wouldn't make a difference
whether we protect the shared RX or not. As a router, guest2 can play
tricks on the messages after reading them and then send the modified
messages to a third party, right?

OK. Looks like the ordering needs to be changed. This doesn't appear
to be a problem to me.

If the final design doesn't deviate a lot from what's presented here,
I think it should be easy to get 2.1 implemented quickly.
Let's first get the design ready, then assess the effort for
implementation.


Best,
Wei

Jan Kiszka

Apr 19, 2017, 7:21:18 AM
to Wei Wang, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
It can swallow it, "steal" it (redirect), but it can't manipulate the
signed content without being caught, that's the idea. It's particularly
relevant for safety-critical traffic from one safe application to
another over unreliable channels, but it may also be relevant for the
integrity of messages in a secure setup.

OK, thanks.

Wang, Wei W

Apr 19, 2017, 10:34:19 AM
to Jan Kiszka, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
OK, I see most of your story, thanks. To get to the bottom of it, is it
possible to sign the packet before putting it onto the unreliable
channel (e.g. the shared RX), instead of signing in-place? If that's
doable, we can have a simpler shared channel.
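
Roughly something like this (just a sketch, the helper names are made
up):

#include <string.h>
#include <stdint.h>

void compute_hash(uint8_t *out, const uint8_t *buf, size_t len);  /* hypothetical */

struct msg {
	uint8_t payload[64];
	uint8_t hash[32];
};

/* Build and sign in private memory, then copy the finished message
 * into the shared ring in one go -- at the cost of one extra copy.
 * Assumes len <= sizeof(local.payload). */
void tx_copy_then_sign(struct msg *shared, const uint8_t *data, size_t len)
{
	struct msg local;   /* private to the sender */

	memcpy(local.payload, data, len);
	compute_hash(local.hash, local.payload, sizeof(local.payload));
	memcpy(shared, &local, sizeof(local));
}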


Best,
Wei

Jan Kiszka

Apr 19, 2017, 10:52:55 AM
to Wang, Wei W, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
Of course, you can always add another copy... But as it was trivial to
add unidirectional shared memory support to ivshmem [1], I see no reason
this shouldn't be possible for vhost-pci as well.

Jan

[1] https://github.com/siemens/jailhouse/commit/cfbd0b96d9cdb1ab7246c64bc446be39deb3f087, hypervisor part:

hypervisor/include/jailhouse/cell-config.h | 4 ++--
hypervisor/include/jailhouse/ivshmem.h | 2 +-
hypervisor/ivshmem.c | 52 +++++++++++++++++++++++++++++++++++-----------------
3 files changed, 38 insertions(+), 20 deletions(-)

Wei Wang

Apr 20, 2017, 2:49:44 AM
to Jan Kiszka, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
IIUC, this requires the ring and its head/tail to be put into different
regions. It would be hard to fit the existing virtqueue into the shared
channel, since the vring and its pointers (e.g. idx) and flags are on
the same page. So we would probably need to use another ring type.


Best,
Wei





Jan Kiszka

Apr 20, 2017, 3:06:03 AM
to Wei Wang, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
The current virtio spec allows this split already - though it may not be
exercised by existing implementations (we do, though, in ivshmem-net).
The future virtio spec (1.1 IIRC) will require a third region that holds
the metadata and has to be writable by both sides. But it will remain
possible to keep outgoing and incoming payload in separate pages, thus
with different access permissions.
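
Roughly like this (only a sketch, not the actual ivshmem-net code; the
region pointers are placeholders):

#include <linux/virtio_ring.h>

/* Place the three parts of a split virtqueue in separate regions
 * instead of the contiguous layout that vring_init() would produce. */
static void setup_split_vring(struct vring *vr, unsigned int num,
			      void *tx_region, void *rx_region)
{
	vr->num   = num;
	vr->desc  = tx_region;                 /* written by the sender   */
	vr->avail = (void *)((char *)tx_region +
			     num * sizeof(struct vring_desc));
	vr->used  = rx_region;                 /* written by the receiver */
}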

Jan

Wei Wang

Apr 20, 2017, 4:56:36 AM
to Jan Kiszka, Marc-André Lureau, Michael S. Tsirkin, Stefan Hajnoczi, pbon...@redhat.com, qemu-...@nongnu.org, virti...@lists.oasis-open.org, Jailhouse
Yes, some of the implementation would need to change. Will give it a try
later.


Best,
Wei



