Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[PATCH] Support PCIe Alternative RID Interpretation (ARI)

436 views
Skip to first unread message

Ryan Stone

unread,
Mar 14, 2014, 10:06:12 PM3/14/14
to
http://people.freebsd.org/~rstone/patches/pci_ari.diff

This patch add support for PCIe Alternative RID Interpretation (ARI)
to our PCI implementation. ARI is an optional feature in PCIe; when
it is enabled on a endpoint device that device can have up to 256 PCI
functions (increased from 8). This is implemented by re-interpreting
the 5-bit slot number as being part of the function number. The slot
number for all such functions will implicitly be 0.

There are two main changes here. The first changes PCI enumeration to
explicitly probe slot 0, function 0 separate from all other devices.
This is necessary because we must check whether the device supports
ARI and enable it before enumerating the rest of the devices.

The other main change affects pci configuration space accesses. When
pcib_read_config/pcib_write_config is called on a downstream port that
has ARI enabled, we convert the 8-bit function number back into a
5-bit slot number and 3-bit function number before passing the request
up the device tree. This was the simplest way to insulate the rest of
the PCI subsystem from knowledge of ARI.

I have also added a new tunable, hw.pci.enable_ari, that controls
whether ARI will be enabled on devices that support it. I plan on
defaulting it to 1 on HEAD but will probably default to 0 when I MFC.

I've tested on a few different motherboards at $WORK, but the only
ARI-capable device that I have access to is Intel 82599 NICs (ixgbe
driver), so if anybody is able to test with other hardware that would
be really helpful. Testing that the driver is able to attach and
testing basic functionality (e.g. ping on a network interface) would
be sufficient.

In order for ARI to be enabled both the downstream port (the physical
PCIe slot) and the endpoint device (the PCIe card) have to both
support ARI. My patch adds support to pciconf -lc to have it print
whether a downstream port supports ARI. You can check for ARI support
by looking for:

cap 10[90] = PCI-Express 2 root port slot max data 128(256) link x0(x4)
speed 0.0(5.0) ASPM disabled(L0s/L1) ARI disabled

If it says "ARI enabled" or "ARI disabled" then the port supports ARI.
If ARI is disabled please confirm that no ARI-capable device is in
the slot (or else it's a bug and please report it). If ARI is enabled
then please confirm that there is an ARI-capable device in the slot
(otherwise please report it to me)

pciconf -lc already will print out if a endpoint device supports ARI.
Look for the ARI extended capability:

ecap 000e[150] = ARI 1
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hacke...@freebsd.org"

Konstantin Belousov

unread,
Mar 16, 2014, 10:12:17 AM3/16/14
to
On Fri, Mar 14, 2014 at 10:06:12PM -0400, Ryan Stone wrote:
> http://people.freebsd.org/~rstone/patches/pci_ari.diff
>
> This patch add support for PCIe Alternative RID Interpretation (ARI)
> to our PCI implementation. ARI is an optional feature in PCIe; when
> it is enabled on a endpoint device that device can have up to 256 PCI
> functions (increased from 8). This is implemented by re-interpreting
> the 5-bit slot number as being part of the function number. The slot
> number for all such functions will implicitly be 0.
>
> There are two main changes here. The first changes PCI enumeration to
> explicitly probe slot 0, function 0 separate from all other devices.
> This is necessary because we must check whether the device supports
> ARI and enable it before enumerating the rest of the devices.
Am I reading the patch correctly, that device (0, 0, 200) would return
slot 0 and function 200 from pci_get_slot() and pci_get_function() ?
This is expected, but it would break VT-d busdma, I think.

This is not said in the VT-d spec about ARI, but I believe that
DMAR would split the function number by 7-3/2-0 bits, same as for
the non-ARI devices. Then the transactions will be translated by
the wrong context.

From other minor notes, having additional line for "ARI enabled"
message under bootverbose would make already excessive PCI config
dump even more problematic.

Ryan Stone

unread,
Mar 16, 2014, 11:05:16 AM3/16/14
to
On Sun, Mar 16, 2014 at 10:12 AM, Konstantin Belousov
<kost...@gmail.com> wrote:
> This is expected, but it would break VT-d busdma, I think.
>
> This is not said in the VT-d spec about ARI, but I believe that
> DMAR would split the function number by 7-3/2-0 bits, same as for
> the non-ARI devices. Then the transactions will be translated by
> the wrong context.

Ah, good catch. I took a quick look at the spec and the code, and I
believe that I see the problem. I think that the proper solution is
to add a new method, pcib_get_rid(), that returns (bus << 8) | (slot
<< 5) | func for non-ARI devices and (bus << 8) | func for ARI
devices. Then we could add a pci_get_rid() that just calls the pcib
method, and the DMAR code could be changed to work in terms of the RID
instead of bus/slot/function. My reading of the spec is that VT-d is
really implemented in terms of the RID anyway, but the spec authors
took pains to give examples in terms of bus/slot/function because
that's how the software developers understand things.

Do you know if bhyve's pci passthrough code uses the same code to add
DMAR entries as the busdma code? If not I'll have to track down how
bhyve does it because it would likely have the same problem.

> From other minor notes, having additional line for "ARI enabled"
> message under bootverbose would make already excessive PCI config
> dump even more problematic.

I can remove it. At the time I wanted some kind of indication that
ARI was being used, but pciconf can tell you that now so it's not
really necessary.

Konstantin Belousov

unread,
Mar 16, 2014, 5:03:52 PM3/16/14
to
On Sun, Mar 16, 2014 at 11:05:16AM -0400, Ryan Stone wrote:
> On Sun, Mar 16, 2014 at 10:12 AM, Konstantin Belousov
> <kost...@gmail.com> wrote:
> > This is expected, but it would break VT-d busdma, I think.
> >
> > This is not said in the VT-d spec about ARI, but I believe that
> > DMAR would split the function number by 7-3/2-0 bits, same as for
> > the non-ARI devices. Then the transactions will be translated by
> > the wrong context.
>
> Ah, good catch. I took a quick look at the spec and the code, and I
> believe that I see the problem. I think that the proper solution is
> to add a new method, pcib_get_rid(), that returns (bus << 8) | (slot
> << 5) | func for non-ARI devices and (bus << 8) | func for ARI
> devices. Then we could add a pci_get_rid() that just calls the pcib
> method, and the DMAR code could be changed to work in terms of the RID
> instead of bus/slot/function. My reading of the spec is that VT-d is
> really implemented in terms of the RID anyway, but the spec authors
> took pains to give examples in terms of bus/slot/function because
> that's how the software developers understand things.
Well, I agree, but the current DMAR driver is also written in term
of bsf. So using the pci_get_rid() is possible, but adding a
methods like pci_get_fake_slot/func, which would return ARI rid
converted into fake slot and function at the 3/2 boundary is less
work.

Anyway, please add whatever you find suitable as the accessor method,
and I hope to be able to produce a patch that would convert Intel
IOMMU to use it instead of direct manipulations of bsf.

>
> Do you know if bhyve's pci passthrough code uses the same code to add
> DMAR entries as the busdma code? If not I'll have to track down how
> bhyve does it because it would likely have the same problem.
I am not aware of bhyve code. They definitely use their own code
for the passthrough right now.

>
> > From other minor notes, having additional line for "ARI enabled"
> > message under bootverbose would make already excessive PCI config
> > dump even more problematic.
>
> I can remove it. At the time I wanted some kind of indication that
> ARI was being used, but pciconf can tell you that now so it's not
> really necessary.

Or stuff it into the already printed line.

Neel Natu

unread,
Mar 17, 2014, 5:20:00 PM3/17/14
to
Hi Ryan,

On Sun, Mar 16, 2014 at 8:05 AM, Ryan Stone <rys...@gmail.com> wrote:
> On Sun, Mar 16, 2014 at 10:12 AM, Konstantin Belousov
> <kost...@gmail.com> wrote:
>> This is expected, but it would break VT-d busdma, I think.
>>
>> This is not said in the VT-d spec about ARI, but I believe that
>> DMAR would split the function number by 7-3/2-0 bits, same as for
>> the non-ARI devices. Then the transactions will be translated by
>> the wrong context.
>
> Ah, good catch. I took a quick look at the spec and the code, and I
> believe that I see the problem. I think that the proper solution is
> to add a new method, pcib_get_rid(), that returns (bus << 8) | (slot
> << 5) | func for non-ARI devices and (bus << 8) | func for ARI
> devices. Then we could add a pci_get_rid() that just calls the pcib
> method, and the DMAR code could be changed to work in terms of the RID
> instead of bus/slot/function. My reading of the spec is that VT-d is
> really implemented in terms of the RID anyway, but the spec authors
> took pains to give examples in terms of bus/slot/function because
> that's how the software developers understand things.
>
> Do you know if bhyve's pci passthrough code uses the same code to add
> DMAR entries as the busdma code? If not I'll have to track down how
> bhyve does it because it would likely have the same problem.
>

bhyve has a different implementation of the VT-d driver although
transitioning to x86/iommu is long overdue.

In any case it seems that the VT-d implementation in bhyve will work
fine with ARI enabled devices.

best
Neel

>> From other minor notes, having additional line for "ARI enabled"
>> message under bootverbose would make already excessive PCI config
>> dump even more problematic.
>
> I can remove it. At the time I wanted some kind of indication that
> ARI was being used, but pciconf can tell you that now so it's not
> really necessary.

Ryan Stone

unread,
Mar 18, 2014, 1:51:47 PM3/18/14
to
On Mon, Mar 17, 2014 at 5:20 PM, Neel Natu <neel...@gmail.com> wrote:
> Hi Ryan,
> In any case it seems that the VT-d implementation in bhyve will work
> fine with ARI enabled devices.

There was an assert that would trip for the function number being
greater than 8.


I've put together the following patch series:

http://people.freebsd.org/~rstone/patches/ari/0001-Add-a-method-to-get-the-PCI-RID-for-a-device.patch

This adds a method to get the RID that will be consumed by the VT-d
drivers. This patch is non-ARI only.


http://people.freebsd.org/~rstone/patches/ari/0002-Re-implement-the-DMAR-I-O-MMU-code-in-terms-of-PCI-R.patch

This reworks the busdma DMAR code to work in terms of RIDs where
necessary. This should be a no-op. I tested this with
hw.dmar.enable=1 on a Nehalem with the em driver and a Sandy Bridge
with the igb and ixgbe driver and was able to pass traffic.


http://people.freebsd.org/~rstone/patches/ari/0003-Re-write-bhyve-s-I-O-MMU-handling-in-terms-of-PCI-RI.patch

Same thing, but for bhyve. I'm not sure that passing the rid down to
the CPU-dependent layer is the right thing to do, because I'm not sure
what the AMD VT-d equivalent requires. Should I just pass down the
entire device_t and let the CPU layer deal with it? I tested loading
vmm.ko with a device assigned to the ppt driver but I didn't go as far
as starting a VM and using PCI passthrough.

(Also, as you'd probably expected doing this with hw.dmar.enable=1
causes all hell to break loose).


http://people.freebsd.org/~rstone/patches/ari/0004-Add-support-for-PCIe-ARI.patch

This is a slightly reworked version of the previous patch. The main
difference are that there is a new implementation of pcib_get_rid that
understands ARI RIDs. I also fixed a bug where the default
implementation of pcib_numslots was not actually being used because I
misspelled DEFAULT as default in pcib_if.m.


http://people.freebsd.org/~rstone/patches/ari/0005-Print-status-of-ARI-capability-in-pciconf-c.patch

This makes pciconf -c dump the status of ARI on PCIe downstream ports.

Konstantin Belousov

unread,
Mar 19, 2014, 10:02:36 AM3/19/14
to
On Tue, Mar 18, 2014 at 01:51:47PM -0400, Ryan Stone wrote:
> On Mon, Mar 17, 2014 at 5:20 PM, Neel Natu <neel...@gmail.com> wrote:
> > Hi Ryan,
> > In any case it seems that the VT-d implementation in bhyve will work
> > fine with ARI enabled devices.
>
> There was an assert that would trip for the function number being
> greater than 8.
>
>
> I've put together the following patch series:
>
> http://people.freebsd.org/~rstone/patches/ari/0001-Add-a-method-to-get-the-PCI-RID-for-a-device.patch
>
I do not like this, in fact. Not the general idea, but the implementation.
I think that what is needed is method which would return combined
slot/function number (or just function number for ARI-enabled device).
Then, existing pci_get_bus() and this new method are enough for proper
construction of the translating structures.

> This adds a method to get the RID that will be consumed by the VT-d
> drivers. This patch is non-ARI only.
>
>
> http://people.freebsd.org/~rstone/patches/ari/0002-Re-implement-the-DMAR-I-O-MMU-code-in-terms-of-PCI-R.patch
>
> This reworks the busdma DMAR code to work in terms of RIDs where
> necessary. This should be a no-op. I tested this with
> hw.dmar.enable=1 on a Nehalem with the em driver and a Sandy Bridge
> with the igb and ixgbe driver and was able to pass traffic.
Again, I do not like this. IMO the patch is too conservative.
Almost all occurences of the s/f in the IOMMU code must be removed,
and replaced by the half-rid returned by the new method from above.
In particular, context must be stripped of s/f at all.

As before, if you want, omit this part altogether, and I will try to
do the pass later.

Neel Natu

unread,
Mar 20, 2014, 12:47:03 PM3/20/14
to
Hi Ryan,

On Tue, Mar 18, 2014 at 10:51 AM, Ryan Stone <rys...@gmail.com> wrote:
> On Mon, Mar 17, 2014 at 5:20 PM, Neel Natu <neel...@gmail.com> wrote:
>> Hi Ryan,
>> In any case it seems that the VT-d implementation in bhyve will work
>> fine with ARI enabled devices.
>
> There was an assert that would trip for the function number being
> greater than 8.
>
>
> I've put together the following patch series:
>
> http://people.freebsd.org/~rstone/patches/ari/0001-Add-a-method-to-get-the-PCI-RID-for-a-device.patch
>
> This adds a method to get the RID that will be consumed by the VT-d
> drivers. This patch is non-ARI only.
>
>
> http://people.freebsd.org/~rstone/patches/ari/0002-Re-implement-the-DMAR-I-O-MMU-code-in-terms-of-PCI-R.patch
>
> This reworks the busdma DMAR code to work in terms of RIDs where
> necessary. This should be a no-op. I tested this with
> hw.dmar.enable=1 on a Nehalem with the em driver and a Sandy Bridge
> with the igb and ixgbe driver and was able to pass traffic.
>
>
> http://people.freebsd.org/~rstone/patches/ari/0003-Re-write-bhyve-s-I-O-MMU-handling-in-terms-of-PCI-RI.patch
>
> Same thing, but for bhyve. I'm not sure that passing the rid down to
> the CPU-dependent layer is the right thing to do, because I'm not sure
> what the AMD VT-d equivalent requires. Should I just pass down the
> entire device_t and let the CPU layer deal with it? I tested loading
> vmm.ko with a device assigned to the ppt driver but I didn't go as far
> as starting a VM and using PCI passthrough.
>

The bhyve portion of the patch looks fine to me.

A cursory read of the AMD IOMMU spec suggests that the translation
table is indexed using the 16-bit RID so passing it into the
CPU-dependent layer will work fine.

best
Neel

> (Also, as you'd probably expected doing this with hw.dmar.enable=1
> causes all hell to break loose).
>
>
> http://people.freebsd.org/~rstone/patches/ari/0004-Add-support-for-PCIe-ARI.patch
>
> This is a slightly reworked version of the previous patch. The main
> difference are that there is a new implementation of pcib_get_rid that
> understands ARI RIDs. I also fixed a bug where the default
> implementation of pcib_numslots was not actually being used because I
> misspelled DEFAULT as default in pcib_if.m.
>
>
> http://people.freebsd.org/~rstone/patches/ari/0005-Print-status-of-ARI-capability-in-pciconf-c.patch
>
> This makes pciconf -c dump the status of ARI on PCIe downstream ports.
0 new messages