AMD IO_PAGE_FAULT w/NTB on Write ops?


Eric Pilmore

Apr 20, 2019, 5:06:45 AM
to linux-ntb, linu...@vger.kernel.org, S Taylor, D Meyer
Hi Folks,

Before I ask my questions, here is a little background on the
environment I have:
- 2 hosts: 1 Xeon based (Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz),
1 AMD based (AMD EPYC 7401 24-Core Processor)
- Each host is interconnected via an external PCI-e (switchtec) switch.
- The two hosts are exporting memory to each other via NTB.
- IOMMU is enabled in both hosts. The Xeon platform requires some BIOS
settings and a kernel parameter (intel_iommu=on); however, as far as I
have been able to determine, the AMD only requires the IOMMU BIOS
setting to be enabled and no special kernel boot parameters. Does that
sound right for AMD?
- Region of memory exported to each host is acquired/mapped via
dma_alloc_coherent() using the "device" of the respective external
PCI-e switch.
- The dma_addr returned from dma_alloc_coherent() is relayed to the
peer host, which then adds that value (i.e. the IOVA offset) to its
local PCI BAR representing the switch, and then ioremap()'s the
resulting address to get a CPU virtual address through which it can
perform ioread/iowrite operations. (A rough sketch of this flow is
below.)
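
Roughly, the flow looks like the following sketch (simplified, and the
device/variable names here are illustrative, not our actual code):

/* Exporting side: allocate the shared buffer against the external
 * switch device so the IOMMU mapping is created for that device. */
dma_addr_t dma_addr;
void *buf = dma_alloc_coherent(&switch_pdev->dev, MW_SIZE,
                               &dma_addr, GFP_KERNEL);
/* dma_addr (the IOVA) is then relayed to the peer host */

/* Importing side: add the peer's IOVA to the local BAR backing the
 * switch, ioremap() the result, and do ioread/iowrite through it. */
resource_size_t bar = pci_resource_start(ntb_pdev, 2 /* example BAR */);
void __iomem *peer = ioremap(bar + peer_dma_addr, MW_SIZE);

iowrite32(1, peer);     /* this is the write that faults on the AMD peer */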

What we have found is that the Xeon based host can successfully ioread
from this mapped shared buffer, but whenever it attempts an iowrite to
this region, it results in an IO_PAGE_FAULT on the AMD based host:

AMD-Vi: Event logged [IO_PAGE_FAULT device=23:01.2 domain=0x0000
address=0x00000000fde1c18c flags=0x0070]

Going in the opposite direction there are no issues, i.e. the AMD
based host can successfully ioread/iowrite to the mapped-in buffer
exported by the Xeon host. Or if both hosts are Xeons, then
everything works fine also.

I have looked high and low, and have not been able to interpret what
the "flags=0x0070" represent. I assume they are indicating some write
permission error, but was wondering if anybody here might know?

More importantly, does anybody know why the AMD IOMMU might seemingly
default to not allow Write operations to the exported memory? Is there
some additional BIOS or kernel boot parameter setting that needs to be
set?

lspci on the AMD hosts of the external PCI-e switch:
23:00.0 PCI bridge: PMC-Sierra Inc. Device 8536
23:00.1 Bridge: PMC-Sierra Inc. Device 8536

The 23:00.1 BDF is the NTB bridge. The BDF (23:01.2) in the error
message represents the "NTB translated" BDF of the request that came
from the peer, i.e. the 01.2 is the proxy-id. Is there a chance that
this proxy-id is causing some confusion for the AMD IOMMU?

Would greatly appreciate any assistance!

Thanks!

--
Eric Pilmore
epil...@gigaio.com
http://gigaio.com
Phone: (858) 775 2514

This e-mail message is intended only for the individual(s) to whom it
is addressed and may contain information that is privileged,
confidential, proprietary, or otherwise exempt from disclosure under
applicable law. If you believe you have received this message in
error, please advise the sender by return e-mail and delete it from
your mailbox.
Thank you.

Logan Gunthorpe

Apr 22, 2019, 1:14:54 PM
to Eric Pilmore, linux-ntb, linu...@vger.kernel.org, S Taylor, D Meyer


On 2019-04-20 3:06 a.m., Eric Pilmore wrote:
> What we have found is that the Xeon based host can successfully ioread
> from this mapped shared buffer, but whenever it attempts an iowrite to
> this region, it results in an IO_PAGE_FAULT on the AMD based host:
>
> AMD-Vi: Event logged [IO_PAGE_FAULT device=23:01.2 domain=0x0000
> address=0x00000000fde1c18c flags=0x0070]
>
> Going in the opposite direction there are no issues, i.e. the AMD
> based host can successfully ioread/iowrite to the mapped-in buffer
> exported by the Xeon host. Or if both hosts are Xeons, then
> everything works fine also.
>
> I have looked high and low, and have not been able to interpret what
> the "flags=0x0070" represent. I assume they are indicating some write
> permission error, but was wondering if anybody here might know?

See Figure 51 in the AMD IOMMU spec[1]. 0x0070 indicates the PE, RW and
PR bits are set, which means a Write request to a present page was
denied because the peripheral did not have permission.
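
If it helps, decoding the flags by hand looks like this (bit positions
per my reading of Figure 51, so double-check against the spec):

/* IO_PAGE_FAULT event flags (subset), per Figure 51 */
#define EVT_FLAG_PR     (1 << 4)        /* page was present */
#define EVT_FLAG_RW     (1 << 5)        /* transaction was a write */
#define EVT_FLAG_PE     (1 << 6)        /* peripheral permission error */

/* 0x0070 == EVT_FLAG_PR | EVT_FLAG_RW | EVT_FLAG_PE */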

> More importantly, does anybody know why the AMD IOMMU might seemingly
> default to not allow Write operations to the exported memory? Is there
> some additional BIOS or kernel boot parameter setting that needs to be
> set?

Yeah, I don't think the IOMMU defaults to allow write operations to
exported memory. That would be extremely broken....

> lspci on the AMD hosts of the external PCI-e switch:
> 23:00.0 PCI bridge: PMC-Sierra Inc. Device 8536
> 23:00.1 Bridge: PMC-Sierra Inc. Device 8536
>
> The 23:00.1 BDF is the NTB bridge. The BDF (23:01.2) in the error
> message represents the "NTB translated" BDF of the request that came
> from the peer, i.e. the 01.2 is the proxy-id. Is there a chance that
> this proxy-id is causing some confusion for the AMD IOMMU?

I suspect the proxy IDs are the problem. On Intel hardware, we had to
add support so that it allowed requests from all proxy IDs for a given
device. We probably have to do something similar in the AMD IOMMU driver.

My guess is that the reason writes work and not reads is because the
write TLPs are posted and thus the switch doesn't apply the Proxy ID
seeing it doesn't expect a completion. Thus the IOMMU sees the TLPs as
coming from a permitted peripheral and doesn't complain.

Logan

[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf

Logan Gunthorpe

Apr 22, 2019, 1:31:27 PM
to Eric Pilmore, linux-ntb, linu...@vger.kernel.org, S Taylor, D Meyer


On 2019-04-22 11:14 a.m., Logan Gunthorpe wrote:
> My guess is that the reason writes work and not reads is because the
> write TLPs are posted and thus the switch doesn't apply the Proxy ID
> seeing it doesn't expect a completion. Thus the IOMMU sees the TLPs as
> coming from a permitted peripheral and doesn't complain.

Oh, oops, sounds like I got that backwards as you seem to indicate reads
work but not writes. That doesn't make as much sense to me, but I still
think it's a proxy_id problem.

Take a look at [1]. It reads to me like the AMD IOMMU only supports the
last DMA alias. So most of the proxy IDs for the switchtec device we
register are probably ignored...
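
The pattern at [1] is roughly the following (paraphrased, so go by the
actual source):

static int __last_alias(struct pci_dev *pdev, u16 alias, void *data)
{
        *(u16 *)data = alias;
        return 0;       /* keep walking; each call overwrites the last */
}

/* ... and when the driver resolves the device ID to use: */
pci_for_each_dma_alias(pdev, __last_alias, &pci_alias);
/* pci_alias now holds only whichever alias the walk reported last */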

One way or another I expect the working cases are because they come from
a specific proxy ID and the broken cases come from a proxy ID that the
AMD IOMMU doesn't consider.

Logan


[1]
https://elixir.bootlin.com/linux/latest/source/drivers/iommu/amd_iommu.c#L245

Sanjay R Mehta

Apr 23, 2019, 7:00:38 AM
to epil...@gigaio.com, S Taylor, D Meyer, linux-ntb, linu...@vger.kernel.org

> From: *Eric Pilmore* <epil...@gigaio.com>
> Date: Sat, Apr 20, 2019 at 2:36 PM
> Subject: AMD IO_PAGE_FAULT w/NTB on Write ops?
> To: linux-ntb <linu...@googlegroups.com>, <linu...@vger.kernel.org>
> Cc: S Taylor <sta...@gigaio.com>, D Meyer <dme...@gigaio.com>
>
>
> Hi Folks,
>
> Before I ask my questions, here is a little background on the
> environment I have:
> - 2 hosts: 1 Xeon based (Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz),
>                 1 AMD based (AMD EPYC 7401 24-Core Processor)
> - Each host is interconnected via an external PCI-e (switchtec) switch.
> - The two hosts are exporting memory to each other via NTB.
> - IOMMU is enabled in both hosts. The Xeon platform requires some BIOS
> settings and a kernel parameter (intel_iommu=on), however as far as I
> have been able to determine, the AMD only requires the IOMMU BIOS
> setting to be enabled and no special kernel boot parameters. Does that
> sound right for AMD?
Yes, you are correct, Eric.
> - Region of memory exported to each host is acquired/mapped via
> dma_alloc_coherent() using the "device" of the respective external
> PCI-e switch.
> - The dma_addr returned from the dma_alloc_coherent is relayed to the
> peer host who then adds that value (i.e. IOVA offset) to its local
> PCI BAR representing the switch, and then ioremap()'s that resulting
> address to get a CPU virtual address to which it can now perform
> ioread/iowrite operations.
>
> What we have found is that the Xeon based host can successfully ioread
> to this mapped shared buffer, but whenever it attempts an iowrite to
> this region, it results in an IO_PAGE_FAULT on the AMD based host:
>
> AMD-Vi: Event logged [IO_PAGE_FAULT device=23:01.2 domain=0x0000
> address=0x00000000fde1c18c flags=0x0070]

The address in the above log looks to be the physical address of the memory window. Am I right?

If so, then the first parameter of dma_alloc_coherent() should be passed as below:

dma_alloc_coherent(&ntb->pdev->dev, ...) instead of dma_alloc_coherent(&ntb->dev, ...).

Hope this solves your problem.


Eric Pilmore

Apr 24, 2019, 6:04:17 PM
to Sanjay R Mehta, S Taylor, D Meyer, linux-ntb, linu...@vger.kernel.org
On Tue, Apr 23, 2019 at 4:00 AM Sanjay R Mehta <sanm...@amd.com> wrote:
>
>
> > AMD-Vi: Event logged [IO_PAGE_FAULT device=23:01.2 domain=0x0000
> > address=0x00000000fde1c18c flags=0x0070]
>
> the address in the above log looks to be physical address of memory window. Am I Right?
>
> If yes then, the first parameter of dma_alloc_coherent() to be passed as below,
>
> dma_alloc_coherent(&ntb->pdev->dev, ...)instead of dma_alloc_coherent(&ntb->dev, ...).
>
> Hope this should solve your problem.

Hi Sanjay,

Thanks for the response. We are using the correct device for the
dma_alloc_coherent(). Upon further investigation what we are finding
is that apparently the AMD IOMMU support can only manage one alias, as
opposed to Intel IOMMU support which can support multiple. Not clear
at this time if it's a software limitation in the AMD IOMMU kernel
support or an imposed limitation of the hardware. Still investigating.

Thanks,
Eric

Gary R Hook

May 9, 2019, 4:03:22 PM
to Eric Pilmore, Mehta, Sanju, S Taylor, D Meyer, linux-ntb, linu...@vger.kernel.org
Please define 'alias'?

The IO_PAGE_FAULT error is described on page 142 of the AMD IOMMU spec,
document #48882. Easily found via a search.

The flags value of 0x0070 translates to PE, RW, PR. The page was
present, the transaction was a write, and the peripheral didn't have
permission. That implies that mapping hadn't been done.

Not being sure how that device presents, or what you're doing with IVHD
info, I can't comment further. I can say that the AMD IOMMU provides for
a single exclusion range, but as many unity ranges as you wish.

HTH

grh

Eric Pilmore

Jun 4, 2019, 5:16:06 PM
to Gary R Hook, Mehta, Sanju, S Taylor, D Meyer, linux-ntb, linu...@vger.kernel.org
On Thu, May 9, 2019 at 1:03 PM Gary R Hook <gh...@amd.com> wrote:
>
> On 4/24/19 5:04 PM, Eric Pilmore wrote:
> >
> > Thanks for the response. We are using the correct device for the
> > dma_alloc_coherent(). Upon further investigation what we are finding
> > is that apparently the AMD IOMMU support can only manage one alias, as
> > opposed to Intel IOMMU support which can support multiple. Not clear
> > at this time if it's a software limitation in the AMD IOMMU kernel
> > support or an imposed limitation of the hardware. Still investigating.
>
> Please define 'alias'?

Hi Gary,

I appreciate the response. Sorry for the late reply. Got sidetracked
with other stuff.

I will try to answer this as best I can. Sorry if my terminology might
be off as I'm still a relative newbie with some of this.

The "alias" is basically another BDF (or ProxyID) that wants to be
associated with the same IOMMU resources as some primary BDF.
Reference <drivers/pci/quirks.c>. In the scenario that we have we are
utilizing NTB and through this bridge will come requests (TLPs) that
will not necessarily have the ReqID as the BDF of the switch device
that contains this bridge. Instead, the ReqID will be a "translated"
(Proxy) BDF of sourcing devices on the other side of the
Non-Transparent Bridge. In our case our NTB is a Switchtec device and
the quirk quirk_switchtec_ntb_dma_alias() is used as a means of
associating these aliases (aka ProxyID or Translated ReqID) with the
NT endpoint in the local host. On Xeon platforms, the framework
supports allowing multiple aliases to be defined for a particular
IOMMU and everything works great. However, with the AMD cpu, it
appears the IOMMU framework is only accepting just one alias. Note
Logan's earlier response @ Mon, Apr 22, 10:31 AM. In our case the one
that is accepted is via the path for a processor Read, but Processor
Writes go through a slightly different path resulting in a different
ReqID. As Logan points out it seems since the AMD IOMMU code is only
accepting one alias, the Write ReqID looks foreign and thus results in
the IOMMU faults.
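
Conceptually, the quirk boils down to something like this (a simplified
sketch, not the verbatim code in quirks.c; proxy_devfn and nr_proxy_ids
are made-up names):

/* Register each translated ReqID (e.g. the 01.2 from the fault log) as
 * a DMA alias of the NT endpoint, so the IOMMU lets TLPs bearing those
 * ReqIDs use the NT endpoint's mappings. */
static void ntb_add_proxy_aliases(struct pci_dev *pdev,
                                  const u8 *proxy_devfn, int nr_proxy_ids)
{
        int i;

        for (i = 0; i < nr_proxy_ids; i++)
                pci_add_dma_alias(pdev, proxy_devfn[i]);
}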

>
> The IO_PAGE_FAULT error is described on page 142 of the AMD IOMMU spec,
> document #48882. Easily found via a search.
>
> The flags value of 0x0070 translates to PE, RW, PR. The page was
> present, the transaction was a write, and the peripheral didn't have
> permission. That implies that mapping hadn't been done.
>
> Not being sure how that device presents, or what you're doing with IVHD
> info, I can't comment further. I can say that the AMD IOMMU provides for
> a single exclusion range, but as many unity ranges as you wish.

I'm currently not doing anything with IVHD. The devices on the other
side of the NTB that need to be aliased can be anything from a remote
host processor to an NVMe drive or GPU: anything that wants to send a
memory transaction to the local host.

If you have any insight into how the AMD IOMMU support in the kernel
could be extended for multiple aliases, or whether there is a hardware
limitation that restricts it to just one, that would be greatly
appreciated.

Thanks,
Eric




--
Eric Pilmore

Kit Chow

Sep 6, 2019, 7:17:09 PM
to linux-ntb, linu...@vger.kernel.org, Logan Gunthorpe, Eric Pilmore, S Taylor, D Meyer
This is a follow-up on the initial problems encountered trying to get the AMD Epyc 7401
server to do host-to-host communication through NTB (please see the thread above for
background info).

The IO_PAGE_FAULT flags=0x0070 seen on write ops was in fact related to proxy ID setup,
as Logan had suggested. The AMD iommu code only processed the 'last' proxy ID/dma alias;
that last proxy ID was the one associated with Reads, which is why Read ops succeeded
while Write ops failed. After adding support to process all of the proxy IDs in the AMD
iommu code (plus adding dma_map_resource support; see the sketch below), the AMD Epyc
server can now be configured in a 4-host NTB setup and communicate over NTB (tcp/ip over
ntb_netdev) with the other 3 hosts.
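
(For context, the dma_map_resource piece is what lets a local DMA engine
target the peer's memory window through the IOMMU. The caller side looks
roughly like this, with dma_dev, bar_phys and mw_size as placeholder
names, and the direction depending on usage:)

/* Map the MMIO memory window behind the NTB BAR for the DMA engine */
dma_addr_t dst = dma_map_resource(dma_dev, bar_phys, mw_size,
                                  DMA_BIDIRECTIONAL, 0);
if (dma_mapping_error(dma_dev, dst))
        return -EIO;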

The problem that we are now experiencing with the AMD Epyc 7401 server, for which I
could use some help, is very poor iperf performance over NTB/ntb_netdev.

The iperf numbers over NTB start off at around 800 Mbits/s and quickly degrade
down to the 20 Mbits/s range. Running 'top' during iperf, I see many instances
(up to 25+) of ksoftirqd running, which suggests that interrupts are overwhelming
the interrupt processing.


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13321 root      20   0       0      0      0 I  33.2  0.0   0:15.62 kworker/58+
  528 root      20   0       0      0      0 S  31.6  0.0   0:02.80 ksoftirqd/+
  139 root      20   0       0      0      0 S  30.9  0.0   0:02.86 ksoftirqd/+
  536 root      20   0       0      0      0 S  29.9  0.0   0:02.60 ksoftirqd/+
  147 root      20   0       0      0      0 S  28.9  0.0   0:02.69 ksoftirqd/+
  131 root      20   0       0      0      0 S  28.3  0.0   0:02.74 ksoftirqd/+
  520 root      20   0       0      0      0 S  28.0  0.0   0:02.63 ksoftirqd/+
   82 root      20   0       0      0      0 S   3.0  0.0   0:00.35 ksoftirqd/9
   90 root      20   0       0      0      0 S   3.0  0.0   0:00.31 ksoftirqd/+
   98 root      20   0       0      0      0 S   3.0  0.0   0:00.37 ksoftirqd/+
  472 root      20   0       0      0      0 S   3.0  0.0   0:00.35 ksoftirqd/+
  488 root      20   0       0      0      0 S   3.0  0.0   0:00.37 ksoftirqd/+
  416 root      20   0       0      0      0 S   2.6  0.0   0:00.17 ksoftirqd/+
   25 root      20   0       0      0      0 S   2.3  0.0   0:00.17 ksoftirqd/2
  400 root      20   0       0      0      0 R   2.3  0.0   0:00.16 ksoftirqd/+
   17 root      20   0       0      0      0 R   2.0  0.0   0:00.17 ksoftirqd/1
  408 root      20   0       0      0      0 S   2.0  0.0   0:00.16 ksoftirqd/+
    8 root      20   0       0      0      0 R   1.6  0.0   0:00.16 ksoftirqd/0
  480 root      20   0       0      0      0 S   1.6  0.0   0:00.21 ksoftirqd/+
13434 user      20   0  246428  22248   2000 S   1.6  0.0   0:01.00 iperf
13516 user      20   0  163252   5792   3796 R   1.6  0.0   0:00.40 top
 7950 root      20   0       0      0      0 D   1.0  0.0   0:00.62 ccp-5-q2
 7951 root      20   0       0      0      0 S   1.0  0.0   0:00.67 ccp-5-q3
    9 root      20   0       0      0      0 I   0.3  0.0   0:00.31 rcu_sched
   57 root      20   0       0      0      0 S   0.3  0.0   0:00.01 ksoftirqd/6
   66 root      20   0       0      0      0 S   0.3  0.0   0:00.01 ksoftirqd/7
   74 root      20   0       0      0      0 S   0.3  0.0   0:00.01 ksoftirqd/8
  106 root      20   0       0      0      0 S   0.3  0.0   0:00.03 ksoftirqd/+
  115 root      20   0       0      0      0 S   0.3  0.0   0:00.02 ksoftirqd/+
  123 root      20   0       0      0      0 S   0.3  0.0   0:00.02 ksoftirqd/+



/proc/interrupts shows lots of 'ccp-5' dma interrupt activity as well as ntb_netdev
interrupt activity. After eliminating the netdev interrupts by configuring ntb_netdev
to 'use_poll' and leaving only ccp, the poor iperf performance persists.

As a comparison, I can replace the ccp dma with the plx dma (found on the host adapter card)
on the AMD server and get a steady 9.4 Gbits/s with iperf over NTB.

I've optimized for NUMA via numactl in all test runs.


So it appears that the iperf NTB performance issues on the AMD Epyc server are related
to the ccp dma and its interrupt processing.

Does anyone have experience with the ccp dma who might be able to help?

Any help or suggestions on how to proceed would be very much appreciated.

Thanks
Kit

kc...@gigaio.com
