[PATCH v2 00/12] Support using MSI interrupts in ntb_transport

74 views
Skip to first unread message

Logan Gunthorpe

unread,
Feb 13, 2019, 12:55:06 PM2/13/19
to linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe
Note: this version will likely trivially conflict with some cleanup
patches I sent to Bjorn. So this is meant for review purposes only.
If there are no objections, I'd like to look at getting it merged in
the next cycle through the NTB tree.

--

Changes in v2:

* Cleaned up the changes in intel_irq_remapping.c to make them
less confusing and add a comment. (Per discussion with Jacob and
Joerg)

* Fixed a nit from Bjorn and collected his Ack

* Added a Kconfig dependancy on CONFIG_PCI_MSI for CONFIG_NTB_MSI
as the Kbuild robot hit a random config that didn't build
without it.

* Worked in a callback for when the MSI descriptor changes so that
the clients can resend the new address and data values to the peer.
On my test system this was never necessary, but there may be
other platforms where this can occur. I tested this by hacking
in a path to rewrite the MSI descriptor when I change the cpu
affinity of an IRQ. There's a bit of uncertainty over the latency
of the change, but without hardware this can acctually occur on
we can't test this. This was the result of a discussion with Dave.

--

This patch series adds optional support for using MSI interrupts instead
of NTB doorbells in ntb_transport. This is desirable seeing doorbells on
current hardware are quite slow and therefore switching to MSI interrupts
provides a significant performance gain. On switchtec hardware, a simple
apples-to-apples comparison shows ntb_netdev/iperf numbers going from
3.88Gb/s to 14.1Gb/s when switching to MSI interrupts.

To do this, a couple changes are required outside of the NTB tree:

1) The IOMMU must know to accept MSI requests from aliased bused numbers
seeing NTB hardware typically sends proxied request IDs through
additional requester IDs. The first patch in this series adds support
for the Intel IOMMU. A quirk to add these aliases for switchtec hardware
was already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
this is necessary.

2) NTB transport (and other clients) may often need more MSI interrupts
than the NTB hardware actually advertises support for. However, seeing
these interrupts will not be triggered by the hardware but through an
NTB memory window, the hardware does not actually need support or need
to know about them. Therefore we add the concept of Virtual MSI
interrupts which are allocated just like any other MSI interrupt but
are not programmed into the hardware's MSI table. This is done in
Patch 2 and then made use of in Patch 3.

The remaining patches in this series add a library for dealing with MSI
interrupts, a test client and finally support in ntb_transport.

The series is based off of v5.0-rc4 and I've tested it on top of a
of the patches I've already sent to the NTB tree (though they are
independent changes). A git repo is available here:

https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v2

Thanks,

Logan

--

Logan Gunthorpe (12):
iommu/vt-d: Implement dma_[un]map_resource()
NTB: ntb_transport: Ensure the destination buffer is mapped for TX DMA
iommu/vt-d: Add helper to set an IRTE to verify only the bus number
iommu/vt-d: Allow interrupts from the entire bus for aliased devices
PCI/MSI: Support allocating virtual MSI interrupts
PCI/switchtec: Add module parameter to request more interrupts
NTB: Introduce functions to calculate multi-port resource index
NTB: Rename ntb.c to support multiple source files in the module
NTB: Introduce MSI library
NTB: Introduce NTB MSI Test Client
NTB: Add ntb_msi_test support to ntb_test
NTB: Add MSI interrupt support to ntb_transport

drivers/iommu/intel-iommu.c | 23 +-
drivers/iommu/intel_irq_remapping.c | 32 +-
drivers/ntb/Kconfig | 11 +
drivers/ntb/Makefile | 3 +
drivers/ntb/{ntb.c => core.c} | 0
drivers/ntb/msi.c | 415 +++++++++++++++++++++++
drivers/ntb/ntb_transport.c | 197 ++++++++++-
drivers/ntb/test/Kconfig | 9 +
drivers/ntb/test/Makefile | 1 +
drivers/ntb/test/ntb_msi_test.c | 433 ++++++++++++++++++++++++
drivers/pci/msi.c | 55 ++-
drivers/pci/switch/switchtec.c | 12 +-
include/linux/msi.h | 8 +
include/linux/ntb.h | 143 ++++++++
include/linux/pci.h | 9 +
tools/testing/selftests/ntb/ntb_test.sh | 54 ++-
16 files changed, 1379 insertions(+), 26 deletions(-)
rename drivers/ntb/{ntb.c => core.c} (100%)
create mode 100644 drivers/ntb/msi.c
create mode 100644 drivers/ntb/test/ntb_msi_test.c

--
2.19.0

Logan Gunthorpe

unread,
Feb 13, 2019, 12:55:07 PM2/13/19
to linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe, David Woodhouse, Jacob Pan
The current code uses set_irte_sid() with SVT_VERIFY_BUS and PCI_DEVID
to set the SID value. However, this is very confusing because, with
SVT_VERIFY_BUS, the SID value is not a PCI devfn address, but the start
and end bus numbers to match against.

According to the Intel Virtualization Technology for Directed I/O
Architecture Specification, Rev. 3.0, page 9-36:

The most significant 8-bits of the SID field contains the Startbus#,
and the least significant 8-bits of the SID field contains the Endbus#.
Interrupt requests that reference this IRTE must have a requester-id
whose bus# (most significant 8-bits of requester-id) has a value equal
to or within the Startbus# to Endbus# range.

So to make this more clear, introduce a new set_irte_verify_bus() that
explicitly takes a start bus and end bus so that we can stop abusing
the PCI_DEVID macro.

This helper function will be called a second time in an subsequent patch.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: David Woodhouse <dw...@infradead.org>
Cc: Joerg Roedel <jo...@8bytes.org>
Cc: Jacob Pan <jacob....@linux.intel.com>
---
drivers/iommu/intel_irq_remapping.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 24d45b07f425..5a55bef8e379 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -294,6 +294,18 @@ static void set_irte_sid(struct irte *irte, unsigned int svt,
irte->sid = sid;
}

+/*
+ * Set an IRTE to match only the bus number. Interrupt requests that reference
+ * this IRTE must have a requester-id whose bus number is between or equal
+ * to the start_bus and end_bus arguments.
+ */
+static void set_irte_verify_bus(struct irte *irte, unsigned int start_bus,
+ unsigned int end_bus)
+{
+ set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
+ (start_bus << 8) | end_bus);
+}
+
static int set_ioapic_sid(struct irte *irte, int apic)
{
int i;
@@ -391,9 +403,8 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
* original device.
*/
if (PCI_BUS_NUM(data.alias) != data.pdev->bus->number)
- set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
- PCI_DEVID(PCI_BUS_NUM(data.alias),
- dev->bus->number));
+ set_irte_verify_bus(irte, PCI_BUS_NUM(data.alias),
+ dev->bus->number);
else if (data.pdev->bus->number != dev->bus->number)
set_irte_sid(irte, SVT_VERIFY_SID_SQ, SQ_ALL_16, data.alias);
else
--
2.19.0

Logan Gunthorpe

unread,
Feb 13, 2019, 12:55:07 PM2/13/19
to linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe
Introduce the module parameter 'use_msi' which, when set uses
MSI interrupts instead of doorbells for each queue pair (QP). T
he parameter is only available if NTB MSI support is configured into
the kernel. We also require there to be more than one memory window
(MW) so that an extra one is available to forward the APIC region.

To use MSIs, we request one interrupt per QP and forward the MSI address
and data to the peer using scratch pad registers (SPADS) above the MW
spads. (If there are not enough SPADS the MSI interrupt will not be used.)

Once registered, we simply use ntb_msi_peer_trigger and the recieving
ISR simply queues up the rxc_db_work for the queue.

This addition can significantly improve performance of ntb_transport.
In a simple, untuned, apples-to-apples comparision using ntb_netdev
and iperf with switchtec hardware, I see 3.88Gb/s without MSI
interrupts and 14.1Gb/s which is a more than 3x improvement.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Jon Mason <jdm...@kudzu.us>
Cc: Dave Jiang <dave....@intel.com>
Cc: Allen Hubbe <all...@gmail.com>
---
drivers/ntb/ntb_transport.c | 169 +++++++++++++++++++++++++++++++++++-
1 file changed, 168 insertions(+), 1 deletion(-)

diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 526b65afc16a..90e3ea67d48a 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -93,6 +93,12 @@ static bool use_dma;
module_param(use_dma, bool, 0644);
MODULE_PARM_DESC(use_dma, "Use DMA engine to perform large data copy");

+static bool use_msi;
+#ifdef CONFIG_NTB_MSI
+module_param(use_msi, bool, 0644);
+MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
+#endif
+
static struct dentry *nt_debugfs_dir;

/* Only two-ports NTB devices are supported */
@@ -188,6 +194,11 @@ struct ntb_transport_qp {
u64 tx_err_no_buf;
u64 tx_memcpy;
u64 tx_async;
+
+ bool use_msi;
+ int msi_irq;
+ struct ntb_msi_desc msi_desc;
+ struct ntb_msi_desc peer_msi_desc;
};

struct ntb_transport_mw {
@@ -221,6 +232,10 @@ struct ntb_transport_ctx {
u64 qp_bitmap;
u64 qp_bitmap_free;

+ bool use_msi;
+ unsigned int msi_spad_offset;
+ u64 msi_db_mask;
+
bool link_is_up;
struct delayed_work link_work;
struct work_struct link_cleanup;
@@ -667,6 +682,114 @@ static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
return 0;
}

+static irqreturn_t ntb_transport_isr(int irq, void *dev)
+{
+ struct ntb_transport_qp *qp = dev;
+
+ tasklet_schedule(&qp->rxc_db_work);
+
+ return IRQ_HANDLED;
+}
+
+static void ntb_transport_setup_qp_peer_msi(struct ntb_transport_ctx *nt,
+ unsigned int qp_num)
+{
+ struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+ int spad = qp_num * 2 + nt->msi_spad_offset;
+
+ if (!nt->use_msi)
+ return;
+
+ if (spad >= ntb_spad_count(nt->ndev))
+ return;
+
+ qp->peer_msi_desc.addr_offset =
+ ntb_peer_spad_read(qp->ndev, PIDX, spad);
+ qp->peer_msi_desc.data =
+ ntb_peer_spad_read(qp->ndev, PIDX, spad + 1);
+
+ dev_dbg(&qp->ndev->pdev->dev, "QP%d Peer MSI addr=%x data=%x\n",
+ qp_num, qp->peer_msi_desc.addr_offset, qp->peer_msi_desc.data);
+
+ if (qp->peer_msi_desc.addr_offset) {
+ qp->use_msi = true;
+ dev_info(&qp->ndev->pdev->dev,
+ "Using MSI interrupts for QP%d\n", qp_num);
+ }
+}
+
+static void ntb_transport_setup_qp_msi(struct ntb_transport_ctx *nt,
+ unsigned int qp_num)
+{
+ struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+ int spad = qp_num * 2 + nt->msi_spad_offset;
+ int rc;
+
+ if (!nt->use_msi)
+ return;
+
+ if (spad >= ntb_spad_count(nt->ndev)) {
+ dev_warn_once(&qp->ndev->pdev->dev,
+ "Not enough SPADS to use MSI interrupts\n");
+ return;
+ }
+
+ ntb_spad_write(qp->ndev, spad, 0);
+ ntb_spad_write(qp->ndev, spad + 1, 0);
+
+ if (!qp->msi_irq) {
+ qp->msi_irq = ntbm_msi_request_irq(qp->ndev, ntb_transport_isr,
+ KBUILD_MODNAME, qp,
+ &qp->msi_desc);
+ if (qp->msi_irq < 0) {
+ dev_warn(&qp->ndev->pdev->dev,
+ "Unable to allocate MSI interrupt for qp%d\n",
+ qp_num);
+ return;
+ }
+ }
+
+ rc = ntb_spad_write(qp->ndev, spad, qp->msi_desc.addr_offset);
+ if (rc)
+ goto err_free_interrupt;
+
+ rc = ntb_spad_write(qp->ndev, spad + 1, qp->msi_desc.data);
+ if (rc)
+ goto err_free_interrupt;
+
+ dev_dbg(&qp->ndev->pdev->dev, "QP%d MSI %d addr=%x data=%x\n",
+ qp_num, qp->msi_irq, qp->msi_desc.addr_offset,
+ qp->msi_desc.data);
+
+ return;
+
+err_free_interrupt:
+ devm_free_irq(&nt->ndev->dev, qp->msi_irq, qp);
+}
+
+static void ntb_transport_msi_peer_desc_changed(struct ntb_transport_ctx *nt)
+{
+ int i;
+
+ dev_dbg(&nt->ndev->pdev->dev, "Peer MSI descriptors changed");
+
+ for (i = 0; i < nt->qp_count; i++)
+ ntb_transport_setup_qp_peer_msi(nt, i);
+}
+
+static void ntb_transport_msi_desc_changed(void *data)
+{
+ struct ntb_transport_ctx *nt = data;
+ int i;
+
+ dev_dbg(&nt->ndev->pdev->dev, "MSI descriptors changed");
+
+ for (i = 0; i < nt->qp_count; i++)
+ ntb_transport_setup_qp_msi(nt, i);
+
+ ntb_peer_db_set(nt->ndev, nt->msi_db_mask);
+}
+
static void ntb_free_mw(struct ntb_transport_ctx *nt, int num_mw)
{
struct ntb_transport_mw *mw = &nt->mw_vec[num_mw];
@@ -902,6 +1025,20 @@ static void ntb_transport_link_work(struct work_struct *work)
int rc = 0, i, spad;

/* send the local info, in the opposite order of the way we read it */
+
+ if (nt->use_msi) {
+ rc = ntb_msi_setup_mws(ndev);
+ if (rc) {
+ dev_warn(&pdev->dev,
+ "Failed to register MSI memory window: %d\n",
+ rc);
+ nt->use_msi = false;
+ }
+ }
+
+ for (i = 0; i < nt->qp_count; i++)
+ ntb_transport_setup_qp_msi(nt, i);
+
for (i = 0; i < nt->mw_count; i++) {
size = nt->mw_vec[i].phys_size;

@@ -959,6 +1096,7 @@ static void ntb_transport_link_work(struct work_struct *work)
struct ntb_transport_qp *qp = &nt->qp_vec[i];

ntb_transport_setup_qp_mw(nt, i);
+ ntb_transport_setup_qp_peer_msi(nt, i);

if (qp->client_ready)
schedule_delayed_work(&qp->link_work, 0);
@@ -1132,6 +1270,19 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
return -ENOMEM;

nt->ndev = ndev;
+
+ /*
+ * If we are using MSI, and have at least one extra memory window,
+ * we will reserve the last MW for the MSI window.
+ */
+ if (use_msi && mw_count > 1) {
+ rc = ntb_msi_init(ndev, ntb_transport_msi_desc_changed);
+ if (!rc) {
+ mw_count -= 1;
+ nt->use_msi = true;
+ }
+ }
+
spad_count = ntb_spad_count(ndev);

/* Limit the MW's based on the availability of scratchpads */
@@ -1145,6 +1296,8 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
max_mw_count_for_spads = (spad_count - MW0_SZ_HIGH) / 2;
nt->mw_count = min(mw_count, max_mw_count_for_spads);

+ nt->msi_spad_offset = nt->mw_count * 2 + MW0_SZ_HIGH;
+
nt->mw_vec = kcalloc_node(mw_count, sizeof(*nt->mw_vec),
GFP_KERNEL, node);
if (!nt->mw_vec) {
@@ -1175,6 +1328,12 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
qp_bitmap = ntb_db_valid_mask(ndev);

qp_count = ilog2(qp_bitmap);
+ if (nt->use_msi) {
+ qp_count -= 1;
+ nt->msi_db_mask = 1 << qp_count;
+ ntb_db_clear_mask(ndev, nt->msi_db_mask);
+ }
+
if (max_num_clients && max_num_clients < qp_count)
qp_count = max_num_clients;
else if (nt->mw_count < qp_count)
@@ -1598,7 +1757,10 @@ static void ntb_tx_copy_callback(void *data,

iowrite32(entry->flags | DESC_DONE_FLAG, &hdr->flags);

- ntb_peer_db_set(qp->ndev, BIT_ULL(qp->qp_num));
+ if (qp->use_msi)
+ ntb_msi_peer_trigger(qp->ndev, PIDX, &qp->peer_msi_desc);
+ else
+ ntb_peer_db_set(qp->ndev, BIT_ULL(qp->qp_num));

/* The entry length can only be zero if the packet is intended to be a
* "link down" or similar. Since no payload is being sent in these
@@ -2265,6 +2427,11 @@ static void ntb_transport_doorbell_callback(void *data, int vector)
u64 db_bits;
unsigned int qp_num;

+ if (ntb_db_read(nt->ndev) & nt->msi_db_mask) {
+ ntb_transport_msi_peer_desc_changed(nt);
+ ntb_db_clear(nt->ndev, nt->msi_db_mask);
+ }
+
db_bits = (nt->qp_bitmap & ~nt->qp_bitmap_free &
ntb_db_vector_mask(nt->ndev, vector));

--
2.19.0

Logan Gunthorpe

unread,
Feb 13, 2019, 12:55:09 PM2/13/19
to linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe
Introduce a tool to test NTB MSI interrupts similar to the other
NTB test tools. This tool creates a debugfs directory for each
NTB device with the following files:

port
irqX_occurrences
peerX/port
peerX/count
peerX/trigger

The 'port' file tells the user the local port number and the
'occurrences' files tell the number of local interrupts that
have been received for each interrupt.

For each peer, the 'port' file and the 'count' file tell you the
peer's port number and number of interrupts respectively. Writing
the interrupt number to the 'trigger' file triggers the interrupt
handler for the peer which should increment their corresponding
'occurrences' file. The 'ready' file indicates if a peer is ready,
writing to this file blocks until it is ready.

The module parameter num_irqs can be used to set the number of
local interrupts. By default this is 4. This is only limited by
the number of unused MSI interrupts registered by the hardware
(this will require support of the hardware driver) and there must
be at least 2*num_irqs + 1 spads registers available.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Jon Mason <jdm...@kudzu.us>
Cc: Dave Jiang <dave....@intel.com>
Cc: Allen Hubbe <all...@gmail.com>
---
drivers/ntb/test/Kconfig | 9 +
drivers/ntb/test/Makefile | 1 +
drivers/ntb/test/ntb_msi_test.c | 433 ++++++++++++++++++++++++++++++++
3 files changed, 443 insertions(+)
create mode 100644 drivers/ntb/test/ntb_msi_test.c

diff --git a/drivers/ntb/test/Kconfig b/drivers/ntb/test/Kconfig
index a5d0eda44438..a3f3e2638935 100644
--- a/drivers/ntb/test/Kconfig
+++ b/drivers/ntb/test/Kconfig
@@ -25,3 +25,12 @@ config NTB_PERF
to and from the window without additional software interaction.

If unsure, say N.
+
+config NTB_MSI_TEST
+ tristate "NTB MSI Test Client"
+ depends on NTB_MSI
+ help
+ This tool demonstrates the use of the NTB MSI library to
+ send MSI interrupts between peers.
+
+ If unsure, say N.
diff --git a/drivers/ntb/test/Makefile b/drivers/ntb/test/Makefile
index 9e77e0b761c2..d2895ca995e4 100644
--- a/drivers/ntb/test/Makefile
+++ b/drivers/ntb/test/Makefile
@@ -1,3 +1,4 @@
obj-$(CONFIG_NTB_PINGPONG) += ntb_pingpong.o
obj-$(CONFIG_NTB_TOOL) += ntb_tool.o
obj-$(CONFIG_NTB_PERF) += ntb_perf.o
+obj-$(CONFIG_NTB_MSI_TEST) += ntb_msi_test.o
diff --git a/drivers/ntb/test/ntb_msi_test.c b/drivers/ntb/test/ntb_msi_test.c
new file mode 100644
index 000000000000..99d826ed9c34
--- /dev/null
+++ b/drivers/ntb/test/ntb_msi_test.c
@@ -0,0 +1,433 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/ntb.h>
+#include <linux/pci.h>
+#include <linux/radix-tree.h>
+#include <linux/workqueue.h>
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.1");
+MODULE_AUTHOR("Logan Gunthorpe <log...@deltatee.com>");
+MODULE_DESCRIPTION("Test for sending MSI interrupts over an NTB memory window");
+
+static int num_irqs = 4;
+module_param(num_irqs, int, 0644);
+MODULE_PARM_DESC(num_irqs, "number of irqs to use");
+
+struct ntb_msit_ctx {
+ struct ntb_dev *ntb;
+ struct dentry *dbgfs_dir;
+ struct work_struct setup_work;
+
+ struct ntb_msit_isr_ctx {
+ int irq_idx;
+ int irq_num;
+ int occurrences;
+ struct ntb_msit_ctx *nm;
+ struct ntb_msi_desc desc;
+ } *isr_ctx;
+
+ struct ntb_msit_peer {
+ struct ntb_msit_ctx *nm;
+ int pidx;
+ int num_irqs;
+ struct completion init_comp;
+ struct ntb_msi_desc *msi_desc;
+ } peers[];
+};
+
+static struct dentry *ntb_msit_dbgfs_topdir;
+
+static irqreturn_t ntb_msit_isr(int irq, void *dev)
+{
+ struct ntb_msit_isr_ctx *isr_ctx = dev;
+ struct ntb_msit_ctx *nm = isr_ctx->nm;
+
+ dev_dbg(&nm->ntb->dev, "Interrupt Occurred: %d",
+ isr_ctx->irq_idx);
+
+ isr_ctx->occurrences++;
+
+ return IRQ_HANDLED;
+}
+
+static void ntb_msit_setup_work(struct work_struct *work)
+{
+ struct ntb_msit_ctx *nm = container_of(work, struct ntb_msit_ctx,
+ setup_work);
+ int irq_count = 0;
+ int irq;
+ int ret;
+ uintptr_t i;
+
+ ret = ntb_msi_setup_mws(nm->ntb);
+ if (ret) {
+ dev_err(&nm->ntb->dev, "Unable to setup MSI windows: %d\n",
+ ret);
+ return;
+ }
+
+ for (i = 0; i < num_irqs; i++) {
+ nm->isr_ctx[i].irq_idx = i;
+ nm->isr_ctx[i].nm = nm;
+
+ if (!nm->isr_ctx[i].irq_num) {
+ irq = ntbm_msi_request_irq(nm->ntb, ntb_msit_isr,
+ KBUILD_MODNAME,
+ &nm->isr_ctx[i],
+ &nm->isr_ctx[i].desc);
+ if (irq < 0)
+ break;
+
+ nm->isr_ctx[i].irq_num = irq;
+ }
+
+ ret = ntb_spad_write(nm->ntb, 2 * i + 1,
+ nm->isr_ctx[i].desc.addr_offset);
+ if (ret)
+ break;
+
+ ret = ntb_spad_write(nm->ntb, 2 * i + 2,
+ nm->isr_ctx[i].desc.data);
+ if (ret)
+ break;
+
+ irq_count++;
+ }
+
+ ntb_spad_write(nm->ntb, 0, irq_count);
+ ntb_peer_db_set(nm->ntb, BIT(ntb_port_number(nm->ntb)));
+}
+
+static void ntb_msit_desc_changed(void *ctx)
+{
+ struct ntb_msit_ctx *nm = ctx;
+ int i;
+
+ dev_dbg(&nm->ntb->dev, "MSI Descriptors Changed\n");
+
+ for (i = 0; i < num_irqs; i++) {
+ ntb_spad_write(nm->ntb, 2 * i + 1,
+ nm->isr_ctx[i].desc.addr_offset);
+ ntb_spad_write(nm->ntb, 2 * i + 2,
+ nm->isr_ctx[i].desc.data);
+ }
+
+ ntb_peer_db_set(nm->ntb, BIT(ntb_port_number(nm->ntb)));
+}
+
+static void ntb_msit_link_event(void *ctx)
+{
+ struct ntb_msit_ctx *nm = ctx;
+
+ if (!ntb_link_is_up(nm->ntb, NULL, NULL))
+ return;
+
+ schedule_work(&nm->setup_work);
+}
+
+static void ntb_msit_copy_peer_desc(struct ntb_msit_ctx *nm, int peer)
+{
+ int i;
+ struct ntb_msi_desc *desc = nm->peers[peer].msi_desc;
+ int irq_count = nm->peers[peer].num_irqs;
+
+ for (i = 0; i < irq_count; i++) {
+ desc[i].addr_offset = ntb_peer_spad_read(nm->ntb, peer,
+ 2 * i + 1);
+ desc[i].data = ntb_peer_spad_read(nm->ntb, peer, 2 * i + 2);
+ }
+
+ dev_info(&nm->ntb->dev, "Found %d interrupts on peer %d\n",
+ irq_count, peer);
+
+ complete_all(&nm->peers[peer].init_comp);
+}
+
+static void ntb_msit_db_event(void *ctx, int vec)
+{
+ struct ntb_msit_ctx *nm = ctx;
+ struct ntb_msi_desc *desc;
+ u64 peer_mask = ntb_db_read(nm->ntb);
+ u32 irq_count;
+ int peer;
+
+ ntb_db_clear(nm->ntb, peer_mask);
+
+ for (peer = 0; peer < sizeof(peer_mask) * 8; peer++) {
+ if (!(peer_mask & BIT(peer)))
+ continue;
+
+ irq_count = ntb_peer_spad_read(nm->ntb, peer, 0);
+ if (irq_count == -1)
+ continue;
+
+ desc = kcalloc(irq_count, sizeof(*desc), GFP_ATOMIC);
+ if (!desc)
+ continue;
+
+ kfree(nm->peers[peer].msi_desc);
+ nm->peers[peer].msi_desc = desc;
+ nm->peers[peer].num_irqs = irq_count;
+
+ ntb_msit_copy_peer_desc(nm, peer);
+ }
+}
+
+static const struct ntb_ctx_ops ntb_msit_ops = {
+ .link_event = ntb_msit_link_event,
+ .db_event = ntb_msit_db_event,
+};
+
+static int ntb_msit_dbgfs_trigger(void *data, u64 idx)
+{
+ struct ntb_msit_peer *peer = data;
+
+ if (idx >= peer->num_irqs)
+ return -EINVAL;
+
+ dev_dbg(&peer->nm->ntb->dev, "trigger irq %llu on peer %u\n",
+ idx, peer->pidx);
+
+ return ntb_msi_peer_trigger(peer->nm->ntb, peer->pidx,
+ &peer->msi_desc[idx]);
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_trigger_fops, NULL,
+ ntb_msit_dbgfs_trigger, "%llu\n");
+
+static int ntb_msit_dbgfs_port_get(void *data, u64 *port)
+{
+ struct ntb_msit_peer *peer = data;
+
+ *port = ntb_peer_port_number(peer->nm->ntb, peer->pidx);
+
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_port_fops, ntb_msit_dbgfs_port_get,
+ NULL, "%llu\n");
+
+static int ntb_msit_dbgfs_count_get(void *data, u64 *count)
+{
+ struct ntb_msit_peer *peer = data;
+
+ *count = peer->num_irqs;
+
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_count_fops, ntb_msit_dbgfs_count_get,
+ NULL, "%llu\n");
+
+static int ntb_msit_dbgfs_ready_get(void *data, u64 *ready)
+{
+ struct ntb_msit_peer *peer = data;
+
+ *ready = try_wait_for_completion(&peer->init_comp);
+
+ return 0;
+}
+
+static int ntb_msit_dbgfs_ready_set(void *data, u64 ready)
+{
+ struct ntb_msit_peer *peer = data;
+
+ return wait_for_completion_interruptible(&peer->init_comp);
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_ready_fops, ntb_msit_dbgfs_ready_get,
+ ntb_msit_dbgfs_ready_set, "%llu\n");
+
+static int ntb_msit_dbgfs_occurrences_get(void *data, u64 *occurrences)
+{
+ struct ntb_msit_isr_ctx *isr_ctx = data;
+
+ *occurrences = isr_ctx->occurrences;
+
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_occurrences_fops,
+ ntb_msit_dbgfs_occurrences_get,
+ NULL, "%llu\n");
+
+static int ntb_msit_dbgfs_local_port_get(void *data, u64 *port)
+{
+ struct ntb_msit_ctx *nm = data;
+
+ *port = ntb_port_number(nm->ntb);
+
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_local_port_fops,
+ ntb_msit_dbgfs_local_port_get,
+ NULL, "%llu\n");
+
+static void ntb_msit_create_dbgfs(struct ntb_msit_ctx *nm)
+{
+ struct pci_dev *pdev = nm->ntb->pdev;
+ char buf[32];
+ int i;
+ struct dentry *peer_dir;
+
+ nm->dbgfs_dir = debugfs_create_dir(pci_name(pdev),
+ ntb_msit_dbgfs_topdir);
+ debugfs_create_file("port", 0400, nm->dbgfs_dir, nm,
+ &ntb_msit_local_port_fops);
+
+ for (i = 0; i < ntb_peer_port_count(nm->ntb); i++) {
+ nm->peers[i].pidx = i;
+ nm->peers[i].nm = nm;
+ init_completion(&nm->peers[i].init_comp);
+
+ snprintf(buf, sizeof(buf), "peer%d", i);
+ peer_dir = debugfs_create_dir(buf, nm->dbgfs_dir);
+
+ debugfs_create_file_unsafe("trigger", 0200, peer_dir,
+ &nm->peers[i],
+ &ntb_msit_trigger_fops);
+
+ debugfs_create_file_unsafe("port", 0400, peer_dir,
+ &nm->peers[i], &ntb_msit_port_fops);
+
+ debugfs_create_file_unsafe("count", 0400, peer_dir,
+ &nm->peers[i],
+ &ntb_msit_count_fops);
+
+ debugfs_create_file_unsafe("ready", 0600, peer_dir,
+ &nm->peers[i],
+ &ntb_msit_ready_fops);
+ }
+
+ for (i = 0; i < num_irqs; i++) {
+ snprintf(buf, sizeof(buf), "irq%d_occurrences", i);
+ debugfs_create_file_unsafe(buf, 0400, nm->dbgfs_dir,
+ &nm->isr_ctx[i],
+ &ntb_msit_occurrences_fops);
+ }
+}
+
+static void ntb_msit_remove_dbgfs(struct ntb_msit_ctx *nm)
+{
+ debugfs_remove_recursive(nm->dbgfs_dir);
+}
+
+static int ntb_msit_probe(struct ntb_client *client, struct ntb_dev *ntb)
+{
+ struct ntb_msit_ctx *nm;
+ size_t struct_size;
+ int peers;
+ int ret;
+
+ peers = ntb_peer_port_count(ntb);
+ if (peers <= 0)
+ return -EINVAL;
+
+ if (ntb_spad_is_unsafe(ntb) || ntb_spad_count(ntb) < 2 * num_irqs + 1) {
+ dev_err(&ntb->dev, "NTB MSI test requires at least %d spads for %d irqs\n",
+ 2 * num_irqs + 1, num_irqs);
+ return -EFAULT;
+ }
+
+ ret = ntb_spad_write(ntb, 0, -1);
+ if (ret) {
+ dev_err(&ntb->dev, "Unable to write spads: %d\n", ret);
+ return ret;
+ }
+
+ ret = ntb_db_clear_mask(ntb, GENMASK(peers - 1, 0));
+ if (ret) {
+ dev_err(&ntb->dev, "Unable to clear doorbell mask: %d\n", ret);
+ return ret;
+ }
+
+ ret = ntb_msi_init(ntb, ntb_msit_desc_changed);
+ if (ret) {
+ dev_err(&ntb->dev, "Unable to initialize MSI library: %d\n",
+ ret);
+ return ret;
+ }
+
+ struct_size = sizeof(*nm) + sizeof(*nm->peers) * peers;
+
+ nm = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
+ if (!nm)
+ return -ENOMEM;
+
+ nm->isr_ctx = devm_kcalloc(&ntb->dev, num_irqs, sizeof(*nm->isr_ctx),
+ GFP_KERNEL);
+ if (!nm->isr_ctx)
+ return -ENOMEM;
+
+ INIT_WORK(&nm->setup_work, ntb_msit_setup_work);
+ nm->ntb = ntb;
+
+ ntb_msit_create_dbgfs(nm);
+
+ ret = ntb_set_ctx(ntb, nm, &ntb_msit_ops);
+ if (ret)
+ goto remove_dbgfs;
+
+ if (!nm->isr_ctx)
+ goto remove_dbgfs;
+
+ ntb_link_enable(ntb, NTB_SPEED_AUTO, NTB_WIDTH_AUTO);
+
+ return 0;
+
+remove_dbgfs:
+ ntb_msit_remove_dbgfs(nm);
+ devm_kfree(&ntb->dev, nm->isr_ctx);
+ devm_kfree(&ntb->dev, nm);
+ return ret;
+}
+
+static void ntb_msit_remove(struct ntb_client *client, struct ntb_dev *ntb)
+{
+ struct ntb_msit_ctx *nm = ntb->ctx;
+ int i;
+
+ ntb_link_disable(ntb);
+ ntb_db_set_mask(ntb, ntb_db_valid_mask(ntb));
+ ntb_msi_clear_mws(ntb);
+
+ for (i = 0; i < ntb_peer_port_count(ntb); i++)
+ kfree(nm->peers[i].msi_desc);
+
+ ntb_clear_ctx(ntb);
+ ntb_msit_remove_dbgfs(nm);
+}
+
+static struct ntb_client ntb_msit_client = {
+ .ops = {
+ .probe = ntb_msit_probe,
+ .remove = ntb_msit_remove
+ }
+};
+
+static int __init ntb_msit_init(void)
+{
+ int ret;
+
+ if (debugfs_initialized())
+ ntb_msit_dbgfs_topdir = debugfs_create_dir(KBUILD_MODNAME,
+ NULL);
+
+ ret = ntb_register_client(&ntb_msit_client);
+ if (ret)
+ debugfs_remove_recursive(ntb_msit_dbgfs_topdir);
+
+ return ret;
+}
+module_init(ntb_msit_init);
+
+static void __exit ntb_msit_exit(void)
+{
+ ntb_unregister_client(&ntb_msit_client);
+ debugfs_remove_recursive(ntb_msit_dbgfs_topdir);
+}
+module_exit(ntb_msit_exit);
--
2.19.0

Logan Gunthorpe

unread,
Feb 13, 2019, 12:55:12 PM2/13/19
to linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe
The NTB MSI library allows passing MSI interrupts across a memory
window. This offers similar functionality to doorbells or messages
except will often have much better latency and the client can
potentially use significantly more remote interrupts than typical hardware
provides for doorbells. (Which can be important in high-multiport
setups.)

The library utilizes one memory window per peer and uses the highest
index memory windows. Before any ntb_msi function may be used, the user
must call ntb_msi_init(). It may then setup and tear down the memory
windows when the link state changes using ntb_msi_setup_mws() and
ntb_msi_clear_mws().

The peer which receives the interrupt must call ntb_msim_request_irq()
to assign the interrupt handler (this function is functionally
similar to devm_request_irq()) and the returned descriptor must be
transferred to the peer which can use it to trigger the interrupt.
The triggering peer, once having received the descriptor, can
trigger the interrupt by calling ntb_msi_peer_trigger().

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Jon Mason <jdm...@kudzu.us>
Cc: Dave Jiang <dave....@intel.com>
Cc: Allen Hubbe <all...@gmail.com>
---
drivers/ntb/Kconfig | 11 ++
drivers/ntb/Makefile | 3 +-
drivers/ntb/msi.c | 415 +++++++++++++++++++++++++++++++++++++++++++
include/linux/ntb.h | 73 ++++++++
4 files changed, 501 insertions(+), 1 deletion(-)
create mode 100644 drivers/ntb/msi.c

diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
index 95944e52fa36..5760764052be 100644
--- a/drivers/ntb/Kconfig
+++ b/drivers/ntb/Kconfig
@@ -12,6 +12,17 @@ menuconfig NTB

if NTB

+config NTB_MSI
+ bool "MSI Interrupt Support"
+ depends on PCI_MSI
+ help
+ Support using MSI interrupt forwarding instead of (or in addition to)
+ hardware doorbells. MSI interrupts typically offer lower latency
+ than doorbells and more MSI interrupts can be made available to
+ clients. However this requires an extra memory window and support
+ in the hardware driver for creating the MSI interrupts.
+
+ If unsure, say N.
source "drivers/ntb/hw/Kconfig"

source "drivers/ntb/test/Kconfig"
diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
index 537226f8e78d..cc27ad2ef150 100644
--- a/drivers/ntb/Makefile
+++ b/drivers/ntb/Makefile
@@ -1,4 +1,5 @@
obj-$(CONFIG_NTB) += ntb.o hw/ test/
obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o

-ntb-y := core.o
+ntb-y := core.o
+ntb-$(CONFIG_NTB_MSI) += msi.o
diff --git a/drivers/ntb/msi.c b/drivers/ntb/msi.c
new file mode 100644
index 000000000000..5d4bd7a63924
--- /dev/null
+++ b/drivers/ntb/msi.c
@@ -0,0 +1,415 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include <linux/irq.h>
+#include <linux/module.h>
+#include <linux/ntb.h>
+#include <linux/msi.h>
+#include <linux/pci.h>
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.1");
+MODULE_AUTHOR("Logan Gunthorpe <log...@deltatee.com>");
+MODULE_DESCRIPTION("NTB MSI Interrupt Library");
+
+struct ntb_msi {
+ u64 base_addr;
+ u64 end_addr;
+
+ void (*desc_changed)(void *ctx);
+
+ u32 *peer_mws[];
+};
+
+/**
+ * ntb_msi_init() - Initialize the MSI context
+ * @ntb: NTB device context
+ *
+ * This function must be called before any other ntb_msi function.
+ * It initializes the context for MSI operations and maps
+ * the peer memory windows.
+ *
+ * This function reserves the last N outbound memory windows (where N
+ * is the number of peers).
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_init(struct ntb_dev *ntb,
+ void (*desc_changed)(void *ctx))
+{
+ phys_addr_t mw_phys_addr;
+ resource_size_t mw_size;
+ size_t struct_size;
+ int peer_widx;
+ int peers;
+ int ret;
+ int i;
+
+ peers = ntb_peer_port_count(ntb);
+ if (peers <= 0)
+ return -EINVAL;
+
+ struct_size = sizeof(*ntb->msi) + sizeof(*ntb->msi->peer_mws) * peers;
+
+ ntb->msi = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
+ if (!ntb->msi)
+ return -ENOMEM;
+
+ ntb->msi->desc_changed = desc_changed;
+
+ for (i = 0; i < peers; i++) {
+ peer_widx = ntb_peer_mw_count(ntb) - 1 - i;
+
+ ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr,
+ &mw_size);
+ if (ret)
+ goto unroll;
+
+ ntb->msi->peer_mws[i] = devm_ioremap(&ntb->dev, mw_phys_addr,
+ mw_size);
+ if (!ntb->msi->peer_mws[i]) {
+ ret = -EFAULT;
+ goto unroll;
+ }
+ }
+
+ return 0;
+
+unroll:
+ for (i = 0; i < peers; i++)
+ if (ntb->msi->peer_mws[i])
+ devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);
+
+ devm_kfree(&ntb->dev, ntb->msi);
+ ntb->msi = NULL;
+ return ret;
+}
+EXPORT_SYMBOL(ntb_msi_init);
+
+/**
+ * ntb_msi_setup_mws() - Initialize the MSI inbound memory windows
+ * @ntb: NTB device context
+ *
+ * This function sets up the required inbound memory windows. It should be
+ * called from a work function after a link up event.
+ *
+ * Over the entire network, this function will reserves the last N
+ * inbound memory windows for each peer (where N is the number of peers).
+ *
+ * ntb_msi_init() must be called before this function.
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_setup_mws(struct ntb_dev *ntb)
+{
+ struct msi_desc *desc;
+ u64 addr;
+ int peer, peer_widx;
+ resource_size_t addr_align, size_align, size_max;
+ resource_size_t mw_size = SZ_32K;
+ resource_size_t mw_min_size = mw_size;
+ int i;
+ int ret;
+
+ if (!ntb->msi)
+ return -EINVAL;
+
+ desc = first_msi_entry(&ntb->pdev->dev);
+ addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
+
+ for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+ peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+ if (peer_widx < 0)
+ return peer_widx;
+
+ ret = ntb_mw_get_align(ntb, peer, peer_widx, &addr_align,
+ NULL, NULL);
+ if (ret)
+ return ret;
+
+ addr &= ~(addr_align - 1);
+ }
+
+ for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+ peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+ if (peer_widx < 0) {
+ ret = peer_widx;
+ goto error_out;
+ }
+
+ ret = ntb_mw_get_align(ntb, peer, peer_widx, NULL,
+ &size_align, &size_max);
+ if (ret)
+ goto error_out;
+
+ mw_size = round_up(mw_size, size_align);
+ mw_size = max(mw_size, size_max);
+ if (mw_size < mw_min_size)
+ mw_min_size = mw_size;
+
+ ret = ntb_mw_set_trans(ntb, peer, peer_widx,
+ addr, mw_size);
+ if (ret)
+ goto error_out;
+ }
+
+ ntb->msi->base_addr = addr;
+ ntb->msi->end_addr = addr + mw_min_size;
+
+ return 0;
+
+error_out:
+ for (i = 0; i < peer; i++) {
+ peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+ if (peer_widx < 0)
+ continue;
+
+ ntb_mw_clear_trans(ntb, i, peer_widx);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(ntb_msi_setup_mws);
+
+/**
+ * ntb_msi_clear_mws() - Clear all inbound memory windows
+ * @ntb: NTB device context
+ *
+ * This function tears down the resources used by ntb_msi_setup_mws().
+ */
+void ntb_msi_clear_mws(struct ntb_dev *ntb)
+{
+ int peer;
+ int peer_widx;
+
+ for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+ peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+ if (peer_widx < 0)
+ continue;
+
+ ntb_mw_clear_trans(ntb, peer, peer_widx);
+ }
+}
+EXPORT_SYMBOL(ntb_msi_clear_mws);
+
+struct ntb_msi_devres {
+ struct ntb_dev *ntb;
+ struct msi_desc *entry;
+ struct ntb_msi_desc *msi_desc;
+};
+
+static int ntb_msi_set_desc(struct ntb_dev *ntb, struct msi_desc *entry,
+ struct ntb_msi_desc *msi_desc)
+{
+ u64 addr;
+
+ addr = entry->msg.address_lo +
+ ((uint64_t)entry->msg.address_hi << 32);
+
+ if (addr < ntb->msi->base_addr || addr >= ntb->msi->end_addr) {
+ dev_warn_once(&ntb->dev,
+ "IRQ %d: MSI Address not within the memory window (%llx, [%llx %llx])\n",
+ entry->irq, addr, ntb->msi->base_addr,
+ ntb->msi->end_addr);
+ return -EFAULT;
+ }
+
+ msi_desc->addr_offset = addr - ntb->msi->base_addr;
+ msi_desc->data = entry->msg.data;
+
+ return 0;
+}
+
+static void ntb_msi_write_msg(struct msi_desc *entry, void *data)
+{
+ struct ntb_msi_devres *dr = data;
+
+ WARN_ON(ntb_msi_set_desc(dr->ntb, entry, dr->msi_desc));
+
+ if (dr->ntb->msi->desc_changed)
+ dr->ntb->msi->desc_changed(dr->ntb->ctx);
+}
+
+static void ntbm_msi_callback_release(struct device *dev, void *res)
+{
+ struct ntb_msi_devres *dr = res;
+
+ dr->entry->write_msi_msg = NULL;
+ dr->entry->write_msi_msg_data = NULL;
+}
+
+static int ntbm_msi_setup_callback(struct ntb_dev *ntb, struct msi_desc *entry,
+ struct ntb_msi_desc *msi_desc)
+{
+ struct ntb_msi_devres *dr;
+
+ dr = devres_alloc(ntbm_msi_callback_release,
+ sizeof(struct ntb_msi_devres), GFP_KERNEL);
+ if (!dr)
+ return -ENOMEM;
+
+ dr->ntb = ntb;
+ dr->entry = entry;
+ dr->msi_desc = msi_desc;
+
+ devres_add(&ntb->dev, dr);
+
+ dr->entry->write_msi_msg = ntb_msi_write_msg;
+ dr->entry->write_msi_msg_data = dr;
+
+ return 0;
+}
+
+/**
+ * ntbm_msi_request_threaded_irq() - allocate an MSI interrupt
+ * @ntb: NTB device context
+ * @handler: Function to be called when the IRQ occurs
+ * @thread_fn: Function to be called in a threaded interrupt context. NULL
+ * for clients which handle everything in @handler
+ * @devname: An ascii name for the claiming device, dev_name(dev) if NULL
+ * @dev_id: A cookie passed back to the handler function
+ *
+ * This function assigns an interrupt handler to an unused
+ * MSI interrupt and returns the descriptor used to trigger
+ * it. The descriptor can then be sent to a peer to trigger
+ * the interrupt.
+ *
+ * The interrupt resource is managed with devres so it will
+ * be automatically freed when the NTB device is torn down.
+ *
+ * If an IRQ allocated with this function needs to be freed
+ * separately, ntbm_free_irq() must be used.
+ *
+ * Return: IRQ number assigned on success, otherwise a negative error number.
+ */
+int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
+ irq_handler_t thread_fn,
+ const char *name, void *dev_id,
+ struct ntb_msi_desc *msi_desc)
+{
+ struct msi_desc *entry;
+ struct irq_desc *desc;
+ int ret;
+
+ if (!ntb->msi)
+ return -EINVAL;
+
+ for_each_pci_msi_entry(entry, ntb->pdev) {
+ desc = irq_to_desc(entry->irq);
+ if (desc->action)
+ continue;
+
+ ret = devm_request_threaded_irq(&ntb->dev, entry->irq, handler,
+ thread_fn, 0, name, dev_id);
+ if (ret)
+ continue;
+
+ if (ntb_msi_set_desc(ntb, entry, msi_desc)) {
+ devm_free_irq(&ntb->dev, entry->irq, dev_id);
+ continue;
+ }
+
+ ret = ntbm_msi_setup_callback(ntb, entry, msi_desc);
+ if (ret) {
+ devm_free_irq(&ntb->dev, entry->irq, dev_id);
+ return ret;
+ }
+
+
+ return entry->irq;
+ }
+
+ return -ENODEV;
+}
+EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);
+
+static int ntbm_msi_callback_match(struct device *dev, void *res, void *data)
+{
+ struct ntb_dev *ntb = dev_ntb(dev);
+ struct ntb_msi_devres *dr = res;
+
+ return dr->ntb == ntb && dr->entry == data;
+}
+
+/**
+ * ntbm_msi_free_irq() - free an interrupt
+ * @ntb: NTB device context
+ * @irq: Interrupt line to free
+ * @dev_id: Device identity to free
+ *
+ * This function should be used to manually free IRQs allocated with
+ * ntbm_request_[threaded_]irq().
+ */
+void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id)
+{
+ struct msi_desc *entry = irq_get_msi_desc(irq);
+
+ entry->write_msi_msg = NULL;
+ entry->write_msi_msg_data = NULL;
+
+ WARN_ON(devres_destroy(&ntb->dev, ntbm_msi_callback_release,
+ ntbm_msi_callback_match, entry));
+
+ devm_free_irq(&ntb->dev, irq, dev_id);
+}
+EXPORT_SYMBOL(ntbm_msi_free_irq);
+
+/**
+ * ntb_msi_peer_trigger() - Trigger an interrupt handler on a peer
+ * @ntb: NTB device context
+ * @peer: Peer index
+ * @desc: MSI descriptor data which triggers the interrupt
+ *
+ * This function triggers an interrupt on a peer. It requires
+ * the descriptor structure to have been passed from that peer
+ * by some other means.
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+ struct ntb_msi_desc *desc)
+{
+ int idx;
+
+ if (!ntb->msi)
+ return -EINVAL;
+
+ idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
+
+ ntb->msi->peer_mws[peer][idx] = desc->data;
+
+ return 0;
+}
+EXPORT_SYMBOL(ntb_msi_peer_trigger);
+
+/**
+ * ntb_msi_peer_addr() - Get the DMA address to trigger a peer's MSI interrupt
+ * @ntb: NTB device context
+ * @peer: Peer index
+ * @desc: MSI descriptor data which triggers the interrupt
+ * @msi_addr: Physical address to trigger the interrupt
+ *
+ * This function allows using DMA engines to trigger an interrupt
+ * (for example, trigger an interrupt to process the data after
+ * sending it). To trigger the interrupt, write @desc.data to the address
+ * returned in @msi_addr
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+ struct ntb_msi_desc *desc,
+ phys_addr_t *msi_addr)
+{
+ int peer_widx = ntb_peer_mw_count(ntb) - 1 - peer;
+ phys_addr_t mw_phys_addr;
+ int ret;
+
+ ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr, NULL);
+ if (ret)
+ return ret;
+
+ if (msi_addr)
+ *msi_addr = mw_phys_addr + desc->addr_offset;
+
+ return 0;
+}
+EXPORT_SYMBOL(ntb_msi_peer_addr);
diff --git a/include/linux/ntb.h b/include/linux/ntb.h
index f5c69d853489..b9c61ee3c734 100644
--- a/include/linux/ntb.h
+++ b/include/linux/ntb.h
@@ -58,9 +58,11 @@

#include <linux/completion.h>
#include <linux/device.h>
+#include <linux/interrupt.h>

struct ntb_client;
struct ntb_dev;
+struct ntb_msi;
struct pci_dev;

/**
@@ -425,6 +427,10 @@ struct ntb_dev {
spinlock_t ctx_lock;
/* block unregister until device is fully released */
struct completion released;
+
+ #ifdef CONFIG_NTB_MSI
+ struct ntb_msi *msi;
+ #endif
};
#define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)

@@ -1572,4 +1578,71 @@ static inline int ntb_peer_highest_mw_idx(struct ntb_dev *ntb, int pidx)
return ntb_mw_count(ntb, pidx) - ret - 1;
}

+struct ntb_msi_desc {
+ u32 addr_offset;
+ u32 data;
+};
+
+#ifdef CONFIG_NTB_MSI
+
+int ntb_msi_init(struct ntb_dev *ntb, void (*desc_changed)(void *ctx));
+int ntb_msi_setup_mws(struct ntb_dev *ntb);
+void ntb_msi_clear_mws(struct ntb_dev *ntb);
+int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
+ irq_handler_t thread_fn,
+ const char *name, void *dev_id,
+ struct ntb_msi_desc *msi_desc);
+void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id);
+int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+ struct ntb_msi_desc *desc);
+int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+ struct ntb_msi_desc *desc,
+ phys_addr_t *msi_addr);
+
+#else /* not CONFIG_NTB_MSI */
+
+static inline int ntb_msi_init(struct ntb_dev *ntb,
+ void (*desc_changed)(void *ctx))
+{
+ return -EOPNOTSUPP;
+}
+static inline int ntb_msi_setup_mws(struct ntb_dev *ntb)
+{
+ return -EOPNOTSUPP;
+}
+static inline void ntb_msi_clear_mws(struct ntb_dev *ntb) {}
+static inline int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb,
+ irq_handler_t handler,
+ irq_handler_t thread_fn,
+ const char *name, void *dev_id,
+ struct ntb_msi_desc *msi_desc)
+{
+ return -EOPNOTSUPP;
+}
+static inline void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq,
+ void *dev_id) {}
+static inline int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+ struct ntb_msi_desc *desc)
+{
+ return -EOPNOTSUPP;
+}
+static inline int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+ struct ntb_msi_desc *desc,
+ phys_addr_t *msi_addr)
+{
+ return -EOPNOTSUPP;
+
+}
+
+#endif /* CONFIG_NTB_MSI */
+
+static inline int ntbm_msi_request_irq(struct ntb_dev *ntb,
+ irq_handler_t handler,
+ const char *name, void *dev_id,
+ struct ntb_msi_desc *msi_desc)
+{
+ return ntbm_msi_request_threaded_irq(ntb, handler, NULL, name,
+ dev_id, msi_desc);
+}
+
#endif
--
2.19.0

Logan Gunthorpe

unread,
Feb 13, 2019, 12:55:12 PM2/13/19
to linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe, David Woodhouse
Currently the Intel IOMMU uses the default dma_[un]map_resource()
implementations does nothing and simply returns the physical address
unmodified.

However, this doesn't create the IOVA entries necessary for addresses
mapped this way to work when the IOMMU is enabled. Thus, when the
IOMMU is enabled, drivers relying on dma_map_resource() will trigger
DMAR errors. We see this when running ntb_transport with the IOMMU
enabled, DMA, and switchtec hardware.

The implementation for intel_map_resource() is nearly identical to
intel_map_page(), we just have to re-create __intel_map_single().
dma_unmap_resource() uses intel_unmap_page() directly as the
functions are identical.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: David Woodhouse <dw...@infradead.org>
Cc: Joerg Roedel <jo...@8bytes.org>
---
drivers/iommu/intel-iommu.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 78188bf7e90d..ad737e16575b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3649,11 +3649,9 @@ static int iommu_no_mapping(struct device *dev)
return 0;
}

-static dma_addr_t __intel_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size, int dir,
- u64 dma_mask)
+static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
+ size_t size, int dir, u64 dma_mask)
{
- phys_addr_t paddr = page_to_phys(page) + offset;
struct dmar_domain *domain;
phys_addr_t start_paddr;
unsigned long iova_pfn;
@@ -3715,7 +3713,15 @@ static dma_addr_t intel_map_page(struct device *dev, struct page *page,
enum dma_data_direction dir,
unsigned long attrs)
{
- return __intel_map_page(dev, page, offset, size, dir, *dev->dma_mask);
+ return __intel_map_single(dev, page_to_phys(page) + offset, size,
+ dir, *dev->dma_mask);
+}
+
+static dma_addr_t intel_map_resource(struct device *dev, phys_addr_t phys_addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return __intel_map_single(dev, phys_addr, size, dir, *dev->dma_mask);
}

static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
@@ -3806,8 +3812,9 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
return NULL;
memset(page_address(page), 0, size);

- *dma_handle = __intel_map_page(dev, page, 0, size, DMA_BIDIRECTIONAL,
- dev->coherent_dma_mask);
+ *dma_handle = __intel_map_single(dev, page_to_phys(page), size,
+ DMA_BIDIRECTIONAL,
+ dev->coherent_dma_mask);
if (*dma_handle != DMA_MAPPING_ERROR)
return page_address(page);
if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
@@ -3924,6 +3931,8 @@ static const struct dma_map_ops intel_dma_ops = {
.unmap_sg = intel_unmap_sg,
.map_page = intel_map_page,
.unmap_page = intel_unmap_page,
+ .map_resource = intel_map_resource,
+ .unmap_resource = intel_unmap_page,
.dma_supported = dma_direct_supported,
};

--
2.19.0

Logan Gunthorpe

unread,
Feb 13, 2019, 12:57:33 PM2/13/19
to linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, David Woodhouse
Oops, sorry. Please ignore the first two patches in this series. They
have already been merged independently.

Logan

Joerg Roedel

unread,
Feb 26, 2019, 4:34:48 AM2/26/19
to Logan Gunthorpe, linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore
On Wed, Feb 13, 2019 at 10:54:42AM -0700, Logan Gunthorpe wrote:
> iommu/vt-d: Add helper to set an IRTE to verify only the bus number
> iommu/vt-d: Allow interrupts from the entire bus for aliased devices

Applied these two to the iommu tree, thanks.

Logan Gunthorpe

unread,
Feb 26, 2019, 11:18:46 AM2/26/19
to Joerg Roedel, linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore
Thanks!

Logan

Serge Semin

unread,
Mar 6, 2019, 3:26:37 PM3/6/19
to Logan Gunthorpe, linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Eric Pilmore
On Wed, Feb 13, 2019 at 10:54:51AM -0700, Logan Gunthorpe wrote:

Hi

> The NTB MSI library allows passing MSI interrupts across a memory
> window. This offers similar functionality to doorbells or messages
> except will often have much better latency and the client can
> potentially use significantly more remote interrupts than typical hardware
> provides for doorbells. (Which can be important in high-multiport
> setups.)
>
> The library utilizes one memory window per peer and uses the highest
> index memory windows. Before any ntb_msi function may be used, the user
> must call ntb_msi_init(). It may then setup and tear down the memory
> windows when the link state changes using ntb_msi_setup_mws() and
> ntb_msi_clear_mws().
>
> The peer which receives the interrupt must call ntb_msim_request_irq()
> to assign the interrupt handler (this function is functionally
> similar to devm_request_irq()) and the returned descriptor must be
> transferred to the peer which can use it to trigger the interrupt.
> The triggering peer, once having received the descriptor, can
> trigger the interrupt by calling ntb_msi_peer_trigger().
>

The library is very useful, thanks for sharing it with us.

Here are my two general concerns regarding the implementation.
(More specific comments are further in the letter.)

First of all, It might be unsafe to have some resources consumed by NTB
MSI or some other library without a simple way to warn NTB client drivers
about their attempts to access that resources, since it might lead to random
errors. When I thought about implementing a transport library based on the
Message/Spad+Doorbell registers, I had in mind to create an internal bits-field
array with the resources busy-flags. If, for instance, some message or
scratchpad register is occupied by the library (MSI, transport or some else),
then it would be impossible to access these resources directly through NTB API
methods. So NTB client driver shall retrieve an error in an attempt to
write/read data to/from busy message or scratchpad register, or in an attempt
to set some occupied doorbell bit. The same thing can be done for memory windows.

Second tiny concern is about documentation. Since there is a special file for
all NTB-related doc, it would be good to have some description about the
NTB MSI library there as well:
Documentation/ntb.txt
Shouldn't we use the __iomem attribute here since later the devm_ioremap() is
used to map MWs at these pointers?
Simpler and faster cleanup-code would be:

+ unroll:
+ for (--i; i >= 0; --i)
Alas calling the ntb_mw_set_trans() method isn't enough to fully initialize
NTB Memory Windows. Yes, the library will work for Intel/AMD/Switchtec
(two-ports legacy configuration), but will fail for IDT due to being based on
the outbound MW xlat interface. So the library at this stage isn't portable
across all NTB hardware. In order to make it working the translation address is
supposed to be transferred to the peer side, where a peer code should call
ntb_peer_mw_set_trans() method with the retrieved xlat address.
See documentation for details:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ntb.txt

ntb_perf driver can be also used as a reference of the portable NTB MWs setup.

So I'd suggest to add some method like ntb_msi_peer_setup_mws() or similar
which is supposed to be called on the peer side with a translation address
or some common descriptor containing the address passed to the function
argument.

It seems to me the test driver should be also altered to support this case.

> + }
> +
> + ntb->msi->base_addr = addr;
> + ntb->msi->end_addr = addr + mw_min_size;
> +
> + return 0;
> +
> +error_out:
> + for (i = 0; i < peer; i++) {
> + peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> + if (peer_widx < 0)
> + continue;
> +
> + ntb_mw_clear_trans(ntb, i, peer_widx);
> + }

The same cleanup pattern can be utilized here:
+error_out:
+ for (--peer; peer >= 0; --peer) {
+ peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+ ntb_mw_clear_trans(ntb, i, peer_widx);
+ }

So you won't need "i" variable here anymore. You also don't need to check the
return value of ntb_peer_highest_mw_idx() in the cleanup loop because it
was already checked in the main algo code.
Similarly something like ntb_msi_peer_clear_mws() should be added to
unset a translation address on the peer side.
Shouldn't we use iowrite32() here instead of direct access to the IO-memory?
I'd align the macro-condition to the most left position:
+#ifdef CONFIG_NTB_MSI
> --
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+...@googlegroups.com.
> To post to this group, send email to linu...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/20190213175454.7506-10-logang%40deltatee.com.
> For more options, visit https://groups.google.com/d/optout.

Serge Semin

unread,
Mar 6, 2019, 3:44:05 PM3/6/19
to Logan Gunthorpe, linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Eric Pilmore
Hi

On Wed, Feb 13, 2019 at 10:54:52AM -0700, Logan Gunthorpe wrote:
> Introduce a tool to test NTB MSI interrupts similar to the other
> NTB test tools. This tool creates a debugfs directory for each
> NTB device with the following files:
>
> port
> irqX_occurrences
> peerX/port
> peerX/count
> peerX/trigger
>
> The 'port' file tells the user the local port number and the
> 'occurrences' files tell the number of local interrupts that
> have been received for each interrupt.
>
> For each peer, the 'port' file and the 'count' file tell you the
> peer's port number and number of interrupts respectively. Writing
> the interrupt number to the 'trigger' file triggers the interrupt
> handler for the peer which should increment their corresponding
> 'occurrences' file. The 'ready' file indicates if a peer is ready,
> writing to this file blocks until it is ready.
>
> The module parameter num_irqs can be used to set the number of
> local interrupts. By default this is 4. This is only limited by
> the number of unused MSI interrupts registered by the hardware
> (this will require support of the hardware driver) and there must
> be at least 2*num_irqs + 1 spads registers available.
>

Alas the test driver is not going to work with IDT NTB hardware,
since it uses Spad only, which aren't supported by IDT devices. IDT NTB
PCIe functions provide message registers to communicate with peers. See
ntb_perf driver code for reference of portable driver design.

In order to make the test driver portable the following alterations
are needed:
1) Add NTB Message register API usage together with Scratchpad+Doorbell.
2) When NTB MSI library is updated with functions like
ntb_msi_peer_setup_mws()/ntb_msi_peer_clear_mws() they are needed to be
utilized in the test driver as well to set a translation address on the
peer side.
> --
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+...@googlegroups.com.
> To post to this group, send email to linu...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/20190213175454.7506-11-logang%40deltatee.com.

Logan Gunthorpe

unread,
Mar 6, 2019, 4:36:06 PM3/6/19
to Serge Semin, linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Eric Pilmore


On 2019-03-06 1:26 p.m., Serge Semin wrote:
> First of all, It might be unsafe to have some resources consumed by NTB
> MSI or some other library without a simple way to warn NTB client drivers
> about their attempts to access that resources, since it might lead to random
> errors. When I thought about implementing a transport library based on the
> Message/Spad+Doorbell registers, I had in mind to create an internal bits-field
> array with the resources busy-flags. If, for instance, some message or
> scratchpad register is occupied by the library (MSI, transport or some else),
> then it would be impossible to access these resources directly through NTB API
> methods. So NTB client driver shall retrieve an error in an attempt to
> write/read data to/from busy message or scratchpad register, or in an attempt
> to set some occupied doorbell bit. The same thing can be done for memory windows.

Yes, it would be nice to have a generic library to manage all the
resources, but right now we don't and it's unfair to expect us to take
on this work to get the features we care about merged. Right now, it's
not at all unsafe as the client is quite capable of ensuring it has the
resources for the MSI library. The changes for ntb_transport to ensure
this are quite reasonable.

> Second tiny concern is about documentation. Since there is a special file for
> all NTB-related doc, it would be good to have some description about the
> NTB MSI library there as well:
> Documentation/ntb.txt

Sure, I'll add a short blurb for v3. Though, I noticed it's quite out of
date since your changes. Especially in the ntb_tool section...

>> + u32 *peer_mws[];
>
> Shouldn't we use the __iomem attribute here since later the devm_ioremap() is
> used to map MWs at these pointers?

Yes, will change for v3.


> Simpler and faster cleanup-code would be:

> + unroll:
> + for (--i; i >= 0; --i)
> + devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);

Faster, maybe, but I would not consider this simpler. It's much more
complicated to reason about and ensure it's correct. I prefer my way
because I don't care about speed, but I do care about readability.


> Alas calling the ntb_mw_set_trans() method isn't enough to fully initialize
> NTB Memory Windows. Yes, the library will work for Intel/AMD/Switchtec
> (two-ports legacy configuration), but will fail for IDT due to being based on
> the outbound MW xlat interface. So the library at this stage isn't portable
> across all NTB hardware. In order to make it working the translation address is
> supposed to be transferred to the peer side, where a peer code should call
> ntb_peer_mw_set_trans() method with the retrieved xlat address.
> See documentation for details:

> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ntb.txt
>
> ntb_perf driver can be also used as a reference of the portable NTB MWs setup.

Gross. Well, given that ntb_transport doesn't even support this and we
don't really have a sensible library to transfer this information, I'm
going to leave it as is for now. Someone can update ntb_msi when they
update ntb_transport, preferably after we have a nice library to handle
the transfers for us seeing I absolutely do not want to replicate the
mess in ntb_perf.

Actually, if we had a generic spad/msg communication library, it would
probably be better to have a common ntb_mw_set_trans() function that
uses the communications library to send the data and automatically call
ntb_peer_mw_set_trans() on the peer. That way we don't have to push this
mess into the clients.

> The same cleanup pattern can be utilized here:
> +error_out:
> + for (--peer; peer >= 0; --peer) {
> + peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> + ntb_mw_clear_trans(ntb, i, peer_widx);
> + }
>
> So you won't need "i" variable here anymore. You also don't need to check the
> return value of ntb_peer_highest_mw_idx() in the cleanup loop because it
> was already checked in the main algo code.

See above.

>> +EXPORT_SYMBOL(ntb_msi_clear_mws);
>> +
>
> Similarly something like ntb_msi_peer_clear_mws() should be added to
> unset a translation address on the peer side.

Well, we can table that for when ntb_msi supports the peer MW setting
functions.
>> +int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
>> + struct ntb_msi_desc *desc)
>> +{
>> + int idx;
>> +
>> + if (!ntb->msi)
>> + return -EINVAL;
>> +
>> + idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
>> +
>> + ntb->msi->peer_mws[peer][idx] = desc->data;
>> +
>
> Shouldn't we use iowrite32() here instead of direct access to the IO-memory?

Yes, as above I'll fix it for v3.

>> @@ -425,6 +427,10 @@ struct ntb_dev {
>> spinlock_t ctx_lock;
>> /* block unregister until device is fully released */
>> struct completion released;
>> +
>> + #ifdef CONFIG_NTB_MSI
>> + struct ntb_msi *msi;
>> + #endif
>
> I'd align the macro-condition to the most left position:
> +#ifdef CONFIG_NTB_MSI
> + struct ntb_msi *msi;
> +#endif

Fixed for v3.


Logan

Logan Gunthorpe

unread,
Mar 6, 2019, 4:39:38 PM3/6/19
to Serge Semin, linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Eric Pilmore


On 2019-03-06 1:44 p.m., Serge Semin wrote:
> Alas the test driver is not going to work with IDT NTB hardware,
> since it uses Spad only, which aren't supported by IDT devices. IDT NTB
> PCIe functions provide message registers to communicate with peers. See
> ntb_perf driver code for reference of portable driver design.
>
> In order to make the test driver portable the following alterations
> are needed:
> 1) Add NTB Message register API usage together with Scratchpad+Doorbell.
> 2) When NTB MSI library is updated with functions like
> ntb_msi_peer_setup_mws()/ntb_msi_peer_clear_mws() they are needed to be
> utilized in the test driver as well to set a translation address on the
> peer side.

Per, the discussion on the previous email, we can update this later once
we actually have sensible infrastructure for it. The mess in ntb_perf is
not something I want to replicate and we don't have any interest in
creating a large amount of infrastructure to support hardware we don't
care about.

Logan

Serge Semin

unread,
Mar 6, 2019, 6:13:28 PM3/6/19
to Logan Gunthorpe, linux-...@vger.kernel.org, linu...@googlegroups.com, linu...@vger.kernel.org, io...@lists.linux-foundation.org, linux-k...@vger.kernel.org, Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang, Eric Pilmore
Ok. Thanks.
If you want you can add some info to the ntb_tool section as well. If you
don't have time, I'll update it next time I submit anything new to the
subsystem.

-Sergey
> --
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+...@googlegroups.com.
> To post to this group, send email to linu...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/5b420eb7-5010-aae3-e9bd-1c612af409ae%40deltatee.com.
Reply all
Reply to author
Forward
0 new messages