[RFC PATCH 00/13] Switchtec NTB Support

188 views
Skip to first unread message

Logan Gunthorpe

unread,
Jun 15, 2017, 4:37:54 PM6/15/17
to linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Allen Hubbe, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Logan Gunthorpe
Hi,

This patchset implements Non-Transparent Bridge (NTB) support for the
Microsemi Switchtec series of switches. We're looking for some
review from the community at this point but hope to get it upstreamed
for v4.14.

Switchtec NTB support is configured over the same function and bar
as the management endpoint. Thus, the new driver hooks into the
management driver which we had merged in v4.12. We use the class
interface API to register an NTB device for every switchtec device
which supports NTB (not all do).

The Switchtec hardware supports doorbells, memory windows and messages.
Seeing there is no native scratchpad support, 128 spads are emulated
through the use of a pre-setup memory window. The switch has 64
doorbells which are shared between the two partitions and a
configurable set of memory windows. While the hardware supports more
than 2 partitions, this driver only supports the first two seeing
the current NTB API only supports two hosts.

The driver has been tested with ntb_netdev and fully passes the
ntb_test script.

This patchset is based off of v4.12-rc5 and can be found in this
git repo:

https://github.com/sbates130272/linux-p2pmem.git switchtec_ntb

Thanks,

Logan


Logan Gunthorpe (13):
switchtec: move structure definitions into a common header
switchtec: export class symbol for use in upper layer driver
switchtec: add ntb hardware register definitions
switchtec: add link event notifier block
switchtec_ntb: introduce initial ntb driver
switchtec_ntb: initialize hardware for memory windows
switchtec_ntb: initialize hardware for doorbells and messages
switchtec_ntb: add skeleton ntb driver
switchtec_ntb: add link management
switchtec_ntb: implement doorbell registers
switchtec_ntb: implement scratchpad registers
switchtec_ntb: add memory window support
switchtec_ntb: update switchtec documentation with notes for ntb

Documentation/switchtec.txt | 12 +
MAINTAINERS | 2 +
drivers/ntb/hw/Kconfig | 1 +
drivers/ntb/hw/Makefile | 1 +
drivers/ntb/hw/mscc/Kconfig | 9 +
drivers/ntb/hw/mscc/Makefile | 1 +
drivers/ntb/hw/mscc/switchtec_ntb.c | 1144 +++++++++++++++++++++++++++++++++++
drivers/pci/switch/switchtec.c | 319 ++--------
include/linux/ntb.h | 3 +
include/linux/switchtec.h | 365 +++++++++++
10 files changed, 1601 insertions(+), 256 deletions(-)
create mode 100644 drivers/ntb/hw/mscc/Kconfig
create mode 100644 drivers/ntb/hw/mscc/Makefile
create mode 100644 drivers/ntb/hw/mscc/switchtec_ntb.c
create mode 100644 include/linux/switchtec.h

--
2.11.0

Allen Hubbe

unread,
Jun 16, 2017, 9:53:19 AM6/16/17
to Logan Gunthorpe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Serge Semin, Sergey...@t-platforms.ru
From: Logan Gunthorpe
> Hi,
>
> This patchset implements Non-Transparent Bridge (NTB) support for the
> Microsemi Switchtec series of switches. We're looking for some
> review from the community at this point but hope to get it upstreamed
> for v4.14.
>
> Switchtec NTB support is configured over the same function and bar
> as the management endpoint. Thus, the new driver hooks into the
> management driver which we had merged in v4.12. We use the class
> interface API to register an NTB device for every switchtec device
> which supports NTB (not all do).
>
> The Switchtec hardware supports doorbells, memory windows and messages.
> Seeing there is no native scratchpad support, 128 spads are emulated
> through the use of a pre-setup memory window. The switch has 64
> doorbells which are shared between the two partitions and a
> configurable set of memory windows. While the hardware supports more
> than 2 partitions, this driver only supports the first two seeing
> the current NTB API only supports two hosts.

See what is staged in https://github.com/jonmason/ntb.git ntb-next, with the addition of multi-peer support by Serge. It would be good at this stage to understand whether the api changes there would also support the Switchtec driver, and what if anything must change, or be planned to change, to support the Switchtec driver.

Thanks for providing the patch set for the Switchtec driver. My first impression is that it is a good patch set. Only to be included, it needs to be reconciled with the api changes in ntb-next. I will follow up with a more detailed review of patches in this series, but sending this now as I don't want to delay your review of ntb-next.

Logan Gunthorpe

unread,
Jun 16, 2017, 10:10:26 AM6/16/17
to Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Serge Semin, Sergey...@t-platforms.ru


On 16/06/17 07:53 AM, Allen Hubbe wrote:
> See what is staged in https://github.com/jonmason/ntb.git ntb-next, with the addition of multi-peer support by Serge. It would be good at this stage to understand whether the api changes there would also support the Switchtec driver, and what if anything must change, or be planned to change, to support the Switchtec driver.

Ah, yes I had seen that patchset some time ago but I wasn't aware of
it's status or that it was queued up in ntb-next. I think it will be no
problem to reconcile with the switchtec driver and I'll rebase onto
ntb-next for the next posting of the patch set. However, I *may* save
full multi-host switchtec support for a follow up submission. My initial
impression is the new API will support the switchtec hardware well.

Thanks,

Logan

Allen Hubbe

unread,
Jun 16, 2017, 11:34:58 AM6/16/17
to Logan Gunthorpe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Serge Semin, Sergey...@t-platforms.ru
From: Logan Gunthorpe
Alright!

In code review, I really only have found minor nits. Overall, the driver looks good.

In switchtec_ntb_part_op, there is a delay of up to 50s (1000 * 50ms). This looks like a thread context, so it could involve the scheduler for the delay instead of spinning for up to 50s before bailing.

There are a few instances like this:
> + dev_dbg(&stdev->dev, "%s\n", __func__);

Where the printing of __func__ could be controlled by dyndbg=+pf. The debug message could be more useful.

In switchtec_ntb_db_set_mask and friends, an in-memory copy of the mask bits is protected by a spinlock. Elsewhere, you noted that the db bits are shared between all ports, so the db bitset is chopped up to be shared between the ports. Is the db mask also shared, and how is the spinlock sufficient for synchronizing access to the mask bits between multiple ports?

The IDT switch also does not have hardware scratchpads. Could the code you wrote for emulated scratchpads be made into shared library code for ntb drivers? Also, some ntb clients may not need scratchpad support. If it is not natively supported by a driver, can the emulated scratchpad support be an optional feature?

>
> Thanks,
>
> Logan

Serge Semin

unread,
Jun 16, 2017, 12:33:04 PM6/16/17
to Logan Gunthorpe, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru
Hello Logan.
Thanks for the new hardware driver. It's really great to see NTB subsystem
being developed.

New NTB API is going to be merged to mainline kernel within next features
merge window, so it's really recommended to use that API for new hardware.
Could you please rabase your driver on top of the tree?
https://github.com/jonmason/ntb.git

> See what is staged in https://github.com/jonmason/ntb.git ntb-next, with
> the addition of multi-peer support by Serge. It would be good at this
> stage to understand whether the api changes there would also support the
> Switchtec driver, and what if anything must change, or be planned to change,
> to support the Switchtec driver.

My impression is that, there is no need for new NTB API changes, since it
was specifically designed for new type of multi-port switches with
additional message registers, which is exactly supported by switchtec
device.

> The Switchtec hardware supports doorbells, memory windows and messages.
> Seeing there is no native scratchpad support, 128 spads are emulated
> through the use of a pre-setup memory window.

According to the NTB philosophy, it's not good to have any hardware
emulation within hardware driver. Hardware driver must reflect the only
hardware abilities, nothing else. Could you please get rid of Scratchpad
emulation and add messaging as new NTB API has got a proper callback
functions for it?
Hmmm, I haven't seen the actual code (see my last comment), but
according to my impression of API, it's impossible to have memory window
accessed while NTB link is down, but Scratchpads still can be used.
In this case, if you got Scratchpads emulated over memory windows,
you must have got NTB link enabled before NTB device is registered, which
makes ntb_link_* methods kind of being useless unless Switchtec hardware
supports NTB link getting up/down for individual memory windows...

> configurable set of memory windows. While the hardware supports more
> than 2 partitions, this driver only supports the first two seeing
> the current NTB API only supports two hosts.

New NTB API is updated so to have access to any of peer ports. IDT driver
has got a special translation table to access peer functionality just by
providing an index to corresponding API callback. You can use it as
reference to have Switchtec driver accordingly altered. It would be vastly
useful to have one more multi-port hardware driver in the tree.

> switchtec: move structure definitions into a common header
> switchtec: export class symbol for use in upper layer driver
> switchtec: add ntb hardware register definitions
> switchtec: add link event notifier block
> switchtec_ntb: introduce initial ntb driver
> switchtec_ntb: initialize hardware for memory windows
> switchtec_ntb: initialize hardware for doorbells and messages
> switchtec_ntb: add skeleton ntb driver
> switchtec_ntb: add link management
> switchtec_ntb: implement doorbell registers
> switchtec_ntb: implement scratchpad registers
> switchtec_ntb: add memory window support
> switchtec_ntb: update switchtec documentation with notes for ntb

If I'm not mistaken, these patches can be combined the way so to have
just two big functionally split patches:
1) NTB: Microsemt Switchtec switch management driver alterations for NTB
2) NTB: Add Microsemi Switchtec PCIe-switches support
It would really simplify the review. Could you please combine them?


Thanks for submitting the patches. We really appreciate this and looking
forward to have it completely reviewed and added to the kernel tree.

Regards
-Sergey

Logan Gunthorpe

unread,
Jun 16, 2017, 12:47:43 PM6/16/17
to Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Serge Semin, Sergey...@t-platforms.ru


On 16/06/17 09:34 AM, Allen Hubbe wrote:
> In code review, I really only have found minor nits. Overall, the driver looks good.

Great, thanks for such a quick review!

> In switchtec_ntb_part_op, there is a delay of up to 50s (1000 * 50ms). This looks like a thread context, so it could involve the scheduler for the delay instead of spinning for up to 50s before bailing.

Good point. If I were to change this to msleep_interruptible would that
be acceptable?

> There are a few instances like this:
>> + dev_dbg(&stdev->dev, "%s\n", __func__);

> Where the printing of __func__ could be controlled by dyndbg=+pf. The debug message could be more useful.

Ok, I'll change that.

> In switchtec_ntb_db_set_mask and friends, an in-memory copy of the mask bits is protected by a spinlock. Elsewhere, you noted that the db bits are shared between all ports, so the db bitset is chopped up to be shared between the ports. Is the db mask also shared, and how is the spinlock sufficient for synchronizing access to the mask bits between multiple ports?

Well, there are 64 doorbells that are shared between ports but each port
has it's own in and out registers for the doorbells. So triggering
doorbell one on one port's ODB actually triggers it on every ports IDB.
So these are shared only in the sense that each port needs to know which
dbs it cares about. Seeing each port has their own registers they don't
have to worry about synchronization.

The mask is only protected by a spin lock seeing multiple callers of
db_set_mask and db_clr_mask on the same port may step on each others
toes. So if two processes try to mask different bits they both must get
masked in the end and therefore some kind of synchronization must be
involved.

> The IDT switch also does not have hardware scratchpads. Could the code you wrote for emulated scratchpads be made into shared library code for ntb drivers? Also, some ntb clients may not need scratchpad support. If it is not natively supported by a driver, can the emulated scratchpad support be an optional feature?

Hmm, interesting idea. A few pieces could possibly be made common but it
depends mostly on hardware having the resources to make use of it.
Switchtec has extra LUT memory windows that made this possible. Unless
you object I'm inclined to leave it as is and I'd be happy to work with
the IDT folks to create a common solution in the future.

Logan

Logan Gunthorpe

unread,
Jun 16, 2017, 1:09:15 PM6/16/17
to Serge Semin, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru


On 16/06/17 10:33 AM, Serge Semin wrote:
> New NTB API is going to be merged to mainline kernel within next features
> merge window, so it's really recommended to use that API for new hardware.
> Could you please rabase your driver on top of the tree?
> https://github.com/jonmason/ntb.git

Yes, Allen's already pointed this out. I'll be sure to fix it up before
a final submission.

> According to the NTB philosophy, it's not good to have any hardware
> emulation within hardware driver. Hardware driver must reflect the only
> hardware abilities, nothing else. Could you please get rid of Scratchpad
> emulation and add messaging as new NTB API has got a proper callback
> functions for it?

I disagree completely. Practicality trumps philosophy in every case. I
need emulated scratchpads for ntb_transport to work and I'm not getting
rid of it (thus breaking things that work well) just because of your
philosophical beliefs.

> Hmmm, I haven't seen the actual code (see my last comment), but
> according to my impression of API, it's impossible to have memory window
> accessed while NTB link is down, but Scratchpads still can be used.
> In this case, if you got Scratchpads emulated over memory windows,
> you must have got NTB link enabled before NTB device is registered, which
> makes ntb_link_* methods kind of being useless unless Switchtec hardware
> supports NTB link getting up/down for individual memory windows...

Nothing in-kernel actually uses the peer's scratchpads while the link is
down and all clients seem specifically designed to wait until the link
event to set them. So I think you're either wrong about that rule or we
should change the rule going forward.

I'm not sure what you're referring to about the link stuff; as
implemented, our link management works just fine.

> New NTB API is updated so to have access to any of peer ports. IDT driver
> has got a special translation table to access peer functionality just by
> providing an index to corresponding API callback. You can use it as
> reference to have Switchtec driver accordingly altered. It would be vastly
> useful to have one more multi-port hardware driver in the tree.

Yes, I expect we will get there eventually, it doesn't sound like much
work. However, it's client support that's really going to make this
interesting and worthwhile. That seems like the real challenge right now.

> If I'm not mistaken, these patches can be combined the way so to have
> just two big functionally split patches:
> 1) NTB: Microsemt Switchtec switch management driver alterations for NTB
> 2) NTB: Add Microsemi Switchtec PCIe-switches support
> It would really simplify the review. Could you please combine them?

Seems like every time I make a submission, someone either wants patches
to be smaller and split up more or bigger and combined. I happen to
agree with the people who prefer smaller patches and I think these
provide good chunks for reviewers to look at. So, no, I'm not going to
change this. Feel free to apply the patches to a git tree or view it on
our github branch if you want to see the code combined.

Thanks,

Logan

Serge Semin

unread,
Jun 16, 2017, 1:38:58 PM6/16/17
to Logan Gunthorpe, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru
On Fri, Jun 16, 2017 at 10:47:21AM -0600, Logan Gunthorpe <log...@deltatee.com> wrote:
>
>
> On 16/06/17 09:34 AM, Allen Hubbe wrote:
> > In code review, I really only have found minor nits. Overall, the driver looks good.
>
> Great, thanks for such a quick review!
>
> > In switchtec_ntb_part_op, there is a delay of up to 50s (1000 * 50ms). This looks like a thread context, so it could involve the scheduler for the delay instead of spinning for up to 50s before bailing.
>
> Good point. If I were to change this to msleep_interruptible would that
> be acceptable?
>
> > There are a few instances like this:
> >> + dev_dbg(&stdev->dev, "%s\n", __func__);
>
> > Where the printing of __func__ could be controlled by dyndbg=+pf. The debug message could be more useful.
>
> Ok, I'll change that.
>
> > In switchtec_ntb_db_set_mask and friends, an in-memory copy of the mask bits is protected by a spinlock. Elsewhere, you noted that the db bits are shared between all ports, so the db bitset is chopped up to be shared between the ports. Is the db mask also shared, and how is the spinlock sufficient for synchronizing access to the mask bits between multiple ports?
>
> Well, there are 64 doorbells that are shared between ports but each port
> has it's own in and out registers for the doorbells. So triggering
> doorbell one on one port's ODB actually triggers it on every ports IDB.
> So these are shared only in the sense that each port needs to know which
> dbs it cares about. Seeing each port has their own registers they don't
> have to worry about synchronization.
>

It's exactly the way the IDT hardware works. There is Global doorbell
registers, directly connected to doorbell registers of each port, unless
doorbell routing isn't configured differently by global switch configuration.
Each port has got it's own in and out doorbell registers as well as the
doorbell mask preventing the corresponding doorbell bit from generating
interrupt.

>
> The mask is only protected by a spin lock seeing multiple callers of
> db_set_mask and db_clr_mask on the same port may step on each others
> toes. So if two processes try to mask different bits they both must get
> masked in the end and therefore some kind of synchronization must be
> involved.
>
> > The IDT switch also does not have hardware scratchpads. Could the code you wrote for emulated scratchpads be made into shared library code for ntb drivers? Also, some ntb clients may not need scratchpad support. If it is not natively supported by a driver, can the emulated scratchpad support be an optional feature?
>
> Hmm, interesting idea. A few pieces could possibly be made common but it
> depends mostly on hardware having the resources to make use of it.
> Switchtec has extra LUT memory windows that made this possible. Unless
> you object I'm inclined to leave it as is and I'd be happy to work with
> the IDT folks to create a common solution in the future.
>
> Logan

Alas there are no IDT folks here. It's only me for now, who was responsible
for NTB API alteration (together with much of help from others NTB guys) and
IDT NTB driver development fitting that API.

Regards,
-Sergey

Allen Hubbe

unread,
Jun 16, 2017, 2:09:19 PM6/16/17
to Logan Gunthorpe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Serge Semin, Sergey...@t-platforms.ru
From: Logan Gunthorpe
> On 16/06/17 09:34 AM, Allen Hubbe wrote:
> > In code review, I really only have found minor nits. Overall, the driver looks good.
>
> Great, thanks for such a quick review!
>
> > In switchtec_ntb_part_op, there is a delay of up to 50s (1000 * 50ms). This looks like a thread
> context, so it could involve the scheduler for the delay instead of spinning for up to 50s before
> bailing.
>
> Good point. If I were to change this to msleep_interruptible would that
> be acceptable?

I would be satisfied.

>
> > There are a few instances like this:
> >> + dev_dbg(&stdev->dev, "%s\n", __func__);
>
> > Where the printing of __func__ could be controlled by dyndbg=+pf. The debug message could be more
> useful.
>
> Ok, I'll change that.
>
> > In switchtec_ntb_db_set_mask and friends, an in-memory copy of the mask bits is protected by a
> spinlock. Elsewhere, you noted that the db bits are shared between all ports, so the db bitset is
> chopped up to be shared between the ports. Is the db mask also shared, and how is the spinlock
> sufficient for synchronizing access to the mask bits between multiple ports?
>
> Well, there are 64 doorbells that are shared between ports but each port
> has it's own in and out registers for the doorbells. So triggering
> doorbell one on one port's ODB actually triggers it on every ports IDB.
> So these are shared only in the sense that each port needs to know which
> dbs it cares about. Seeing each port has their own registers they don't
> have to worry about synchronization.
>
> The mask is only protected by a spin lock seeing multiple callers of
> db_set_mask and db_clr_mask on the same port may step on each others
> toes. So if two processes try to mask different bits they both must get
> masked in the end and therefore some kind of synchronization must be
> involved.

Thanks for clearing that up. Now I understand, each port has its own independent set of mask bits. So, while the doorbell numbers are assigned globally, the registers themselves are per port. For the mask bits, the mask behavior only affects the local port.

>
> > The IDT switch also does not have hardware scratchpads. Could the code you wrote for emulated
> scratchpads be made into shared library code for ntb drivers? Also, some ntb clients may not need
> scratchpad support. If it is not natively supported by a driver, can the emulated scratchpad support
> be an optional feature?
>
> Hmm, interesting idea. A few pieces could possibly be made common but it
> depends mostly on hardware having the resources to make use of it.
> Switchtec has extra LUT memory windows that made this possible. Unless
> you object I'm inclined to leave it as is and I'd be happy to work with
> the IDT folks to create a common solution in the future.

Alright. I'll leave it to you to find and reconcile common functionalities of the drivers. What about making spad emulation optional?

>
> Logan

There was a comment on irc.oftc.net #ntb wishing for patch v2 to be fewer patches. Something like, 1/2: prep the existing switch driver, 2/2: introduce the ntb driver.

Serge Semin

unread,
Jun 16, 2017, 2:38:19 PM6/16/17
to Logan Gunthorpe, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru
On Fri, Jun 16, 2017 at 11:08:52AM -0600, Logan Gunthorpe <log...@deltatee.com> wrote:
>
>
> On 16/06/17 10:33 AM, Serge Semin wrote:
> > New NTB API is going to be merged to mainline kernel within next features
> > merge window, so it's really recommended to use that API for new hardware.
> > Could you please rabase your driver on top of the tree?
> > https://github.com/jonmason/ntb.git
>
> Yes, Allen's already pointed this out. I'll be sure to fix it up before
> a final submission.
>
> > According to the NTB philosophy, it's not good to have any hardware
> > emulation within hardware driver. Hardware driver must reflect the only
> > hardware abilities, nothing else. Could you please get rid of Scratchpad
> > emulation and add messaging as new NTB API has got a proper callback
> > functions for it?
>
> I disagree completely. Practicality trumps philosophy in every case. I
> need emulated scratchpads for ntb_transport to work and I'm not getting
> rid of it (thus breaking things that work well) just because of your
> philosophical beliefs.
>

It's the way the NTB API was created for, to have set of functions to access
NTB devices in the similar way. These aren't my beliefs, it's the way it was
created. I agree it can be optional, but it shouldn't be made as the basics
of the driver. It is called NTB "hardware" driver after all, not "emulating" or
"abstracting" driver.
ntb_transport could work without Scratchpads, if it's properly altered to
use NTB messaging. This should be the way to make things compatible, but not
making the hardware driver suitable for just one ntb_transport.

> > Hmmm, I haven't seen the actual code (see my last comment), but
> > according to my impression of API, it's impossible to have memory window
> > accessed while NTB link is down, but Scratchpads still can be used.
> > In this case, if you got Scratchpads emulated over memory windows,
> > you must have got NTB link enabled before NTB device is registered, which
> > makes ntb_link_* methods kind of being useless unless Switchtec hardware
> > supports NTB link getting up/down for individual memory windows...
>
> Nothing in-kernel actually uses the peer's scratchpads while the link is
> down and all clients seem specifically designed to wait until the link
> event to set them. So I think you're either wrong about that rule or we
> should change the rule going forward.
>
> I'm not sure what you're referring to about the link stuff; as
> implemented, our link management works just fine.
>
> > New NTB API is updated so to have access to any of peer ports. IDT driver
> > has got a special translation table to access peer functionality just by
> > providing an index to corresponding API callback. You can use it as
> > reference to have Switchtec driver accordingly altered. It would be vastly
> > useful to have one more multi-port hardware driver in the tree.
>
> Yes, I expect we will get there eventually, it doesn't sound like much
> work. However, it's client support that's really going to make this
> interesting and worthwhile. That seems like the real challenge right now.
>
> > If I'm not mistaken, these patches can be combined the way so to have
> > just two big functionally split patches:
> > 1) NTB: Microsemy Switchtec switch management driver alterations for NTB
> > 2) NTB: Add Microsemi Switchtec PCIe-switches support
> > It would really simplify the review. Could you please combine them?
>
> Seems like every time I make a submission, someone either wants patches
> to be smaller and split up more or bigger and combined. I happen to
> agree with the people who prefer smaller patches and I think these
> provide good chunks for reviewers to look at. So, no, I'm not going to
> change this. Feel free to apply the patches to a git tree or view it on
> our github branch if you want to see the code combined.

It's not like my whim or something, but the way it's usually done.
https://kernelnewbies.org/PatchPhilosophy

Cite from there:
"Each patch should group changes into a logical sequence. Bug fixes must
come first in the patchset, then new features. This is because we need to be
able to backport bug fixes to older kernels, and they should not depend on
new features."

You grouped the patches in according to your logical view or development
progress (I don't know for sure), but it's not obvious for reviewers.
From my perspective your new Microsemi Switchtec NTB driver is just one
feature. I don't know who would think differently so to split the solid
driver up for review. Switchtec management driver alteration might be the
same - just one fix. It's much easier for you to have your commits squashed,
than for me to look at your git tree, than get back to your patchset looking
for a necessary peace of patch and commenting it there.

Regards,
-Sergey

>
> Thanks,
>
> Logan
>
> --
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+...@googlegroups.com.
> To post to this group, send email to linu...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/883bdb76-972c-7de9-0208-2d0933f192d4%40deltatee.com.
> For more options, visit https://groups.google.com/d/optout.

Logan Gunthorpe

unread,
Jun 16, 2017, 3:17:42 PM6/16/17
to Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Serge Semin, Sergey...@t-platforms.ru


On 16/06/17 12:08 PM, Allen Hubbe wrote:
> Alright. I'll leave it to you to find and reconcile common functionalities of the drivers. What about making spad emulation optional?

Ok.

I don't see the point of making spad emulation optional. Who would want
to disable it and what would be the benefit?

Logan

Logan Gunthorpe

unread,
Jun 16, 2017, 3:35:22 PM6/16/17
to Serge Semin, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru


On 16/06/17 12:38 PM, Serge Semin wrote:
> On Fri, Jun 16, 2017 at 11:08:52AM -0600, Logan Gunthorpe <log...@deltatee.com> wrote:
> It's the way the NTB API was created for, to have set of functions to access
> NTB devices in the similar way. These aren't my beliefs, it's the way it was
> created. I agree it can be optional, but it shouldn't be made as the basics
> of the driver. It is called NTB "hardware" driver after all, not "emulating" or
> "abstracting" driver.

Just more philosophy. You haven't given any good reason to remove the
functionality. Vague references to the way things were created aren't
compelling arguments. Better to cite code and point out actual problems.

> ntb_transport could work without Scratchpads, if it's properly altered to
> use NTB messaging. This should be the way to make things compatible, but not
> making the hardware driver suitable for just one ntb_transport.

Ok, well when all the NTB clients no longer require using scratchpads
and we can all abide by the rule that clients must function without
them. Then, I'll remove the emulation. Until then, it stays.

> It's not like my whim or something, but the way it's usually done.
> https://kernelnewbies.org/PatchPhilosophy

> Cite from there:
> "Each patch should group changes into a logical sequence. Bug fixes must
> come first in the patchset, then new features. This is because we need to be
> able to backport bug fixes to older kernels, and they should not depend on
> new features."

You should probably read that again because it doesn't actually support
your point (in fact it's saying something quite unrelated). It is also
probably a good idea to read the rest of the seciton you cite:

"The idea here is that you should break changes up in such a way that it
will be easy to review."

"When creating a new feature patchset, you may need to break up your
changes into multiple commits. "

"Clean up patches that are over 200 lines long are discouraged, because
they are hard to review. Break those patches up into smaller patches. "

Also, to quote Greg Kroah-Hartman from my last series[1]:

"That's one big patch to review, would you want to do that?

Can you break it up into smaller parts?"

> You grouped the patches in according to your logical view or development
> progress (I don't know for sure), but it's not obvious for reviewers.
> From my perspective your new Microsemi Switchtec NTB driver is just one
> feature. I don't know who would think differently so to split the solid
> driver up for review. Switchtec management driver alteration might be the
> same - just one fix. It's much easier for you to have your commits squashed,
> than for me to look at your git tree, than get back to your patchset looking
> for a necessary peace of patch and commenting it there.

Well you're free to think that but, in my experience, your opinion
differs significantly from the rest of the kernel community which I
personally agree with.

Now, if you'd like to actually review the code I'd be happy to address
any concerns you find. I won't be responding to any more philosophical
arguments or bike-shedding over the format of the patch.

Logan

[1] https://lkml.org/lkml/2017/1/31/637

Serge Semin

unread,
Jun 16, 2017, 4:20:50 PM6/16/17
to Logan Gunthorpe, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru
On Fri, Jun 16, 2017 at 01:34:59PM -0600, Logan Gunthorpe <log...@deltatee.com> wrote:
>
>
> On 16/06/17 12:38 PM, Serge Semin wrote:
> > On Fri, Jun 16, 2017 at 11:08:52AM -0600, Logan Gunthorpe <log...@deltatee.com> wrote:
> > It's the way the NTB API was created for, to have set of functions to access
> > NTB devices in the similar way. These aren't my beliefs, it's the way it was
> > created. I agree it can be optional, but it shouldn't be made as the basics
> > of the driver. It is called NTB "hardware" driver after all, not "emulating" or
> > "abstracting" driver.
>
> Just more philosophy. You haven't given any good reason to remove the
> functionality. Vague references to the way things were created aren't
> compelling arguments. Better to cite code and point out actual problems.
>

Actual problem is the design of your driver. Of course you can disagree as much as
you want.

> > ntb_transport could work without Scratchpads, if it's properly altered to
> > use NTB messaging. This should be the way to make things compatible, but not
> > making the hardware driver suitable for just one ntb_transport.
>
> Ok, well when all the NTB clients no longer require using scratchpads
> and we can all abide by the rule that clients must function without
> them. Then, I'll remove the emulation. Until then, it stays.
>
> > It's not like my whim or something, but the way it's usually done.
> > https://kernelnewbies.org/PatchPhilosophy
>
> > Cite from there:
> > "Each patch should group changes into a logical sequence. Bug fixes must
> > come first in the patchset, then new features. This is because we need to be
> > able to backport bug fixes to older kernels, and they should not depend on
> > new features."
>
> You should probably read that again because it doesn't actually support
> your point (in fact it's saying something quite unrelated). It is also
> probably a good idea to read the rest of the seciton you cite:
>
> "The idea here is that you should break changes up in such a way that it
> will be easy to review."
>
> "When creating a new feature patchset, you may need to break up your
> changes into multiple commits. "
>
> "Clean up patches that are over 200 lines long are discouraged, because
> they are hard to review. Break those patches up into smaller patches. "
>

This doesn't prove your way of splitting patchset is correct, but supports
my point. As well as the sentence about the logical sentence in addition
to the thing about easy review.

> Also, to quote Greg Kroah-Hartman from my last series[1]:
>
> "That's one big patch to review, would you want to do that?
>
> Can you break it up into smaller parts?"
>
> > You grouped the patches in according to your logical view or development
> > progress (I don't know for sure), but it's not obvious for reviewers.
> > From my perspective your new Microsemi Switchtec NTB driver is just one
> > feature. I don't know who would think differently so to split the solid
> > driver up for review. Switchtec management driver alteration might be the
> > same - just one fix. It's much easier for you to have your commits squashed,
> > than for me to look at your git tree, than get back to your patchset looking
> > for a necessary peace of patch and commenting it there.
>
> Well you're free to think that but, in my experience, your opinion
> differs significantly from the rest of the kernel community which I
> personally agree with.
>

And your quotation doesn't prove you are right. Greg asked you to split at
least the documentation. He had point to ask it, since it's logically correct.
You wasn't arguing with him, was you? But in this case you have sent the
set of incremental patches of your own code, so I don't see how it can be
easier for review, than a combined text.

> Now, if you'd like to actually review the code I'd be happy to address
> any concerns you find. I won't be responding to any more philosophical
> arguments or bike-shedding over the format of the patch.
>

I don't want to review a patchset, which isn't properly formated.

> Logan
>
> [1] https://lkml.org/lkml/2017/1/31/637
>
> --
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+...@googlegroups.com.
> To post to this group, send email to linu...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/33b6c321-c0af-7340-8e8e-e929a00005c7%40deltatee.com.

Greg Kroah-Hartman

unread,
Jun 17, 2017, 1:10:06 AM6/17/17
to Serge Semin, Logan Gunthorpe, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru
On Fri, Jun 16, 2017 at 11:21:00PM +0300, Serge Semin wrote:
> On Fri, Jun 16, 2017 at 01:34:59PM -0600, Logan Gunthorpe <log...@deltatee.com> wrote:
> > Now, if you'd like to actually review the code I'd be happy to address
> > any concerns you find. I won't be responding to any more philosophical
> > arguments or bike-shedding over the format of the patch.
> >
>
> I don't want to review a patchset, which isn't properly formated.

Ah, but the patchset does seem to properly formatted. At least it's
easy for me to review as-published, while a much smaller number of
patches, making much larger individual patches, would be much much
harder to review.

But what do I know...

Oh wait, I review more kernel patches than anyone else :)

Logan, given that you need to rebase these on the "new" ntb api (and why
the hell is that tree on github? We can't take kernel git pulls from
github), is it worth reviewing this patch series as-is, or do you want
us to wait?

thanks,

greg k-h

Logan Gunthorpe

unread,
Jun 17, 2017, 12:11:58 PM6/17/17
to Greg Kroah-Hartman, Serge Semin, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru


On 16/06/17 11:09 PM, 'Greg Kroah-Hartman' wrote:
> Ah, but the patchset does seem to properly formatted. At least it's
> easy for me to review as-published, while a much smaller number of
> patches, making much larger individual patches, would be much much
> harder to review.
>
> But what do I know...
>
> Oh wait, I review more kernel patches than anyone else :)

Thanks Greg.

> Logan, given that you need to rebase these on the "new" ntb api (and why
> the hell is that tree on github? We can't take kernel git pulls from
> github), is it worth reviewing this patch series as-is, or do you want
> us to wait?

I think initial review at this time will still be useful. I don't expect
the patchset will change _that_ much.

Logan

Logan Gunthorpe

unread,
Jun 17, 2017, 12:15:41 PM6/17/17
to Greg Kroah-Hartman, Serge Semin, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Jon Mason, Dave Jiang, Bjorn Helgaas, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru


On 16/06/17 11:09 PM, 'Greg Kroah-Hartman' wrote:
> Ah, but the patchset does seem to properly formatted. At least it's
> easy for me to review as-published, while a much smaller number of
> patches, making much larger individual patches, would be much much
> harder to review.
>
> But what do I know...
>
> Oh wait, I review more kernel patches than anyone else :)

Thanks Greg!

> Logan, given that you need to rebase these on the "new" ntb api (and why
> the hell is that tree on github? We can't take kernel git pulls from
> github), is it worth reviewing this patch series as-is, or do you want
> us to wait?

I think review at this time is still appropriate. The new ntb api mostly
just adds indexes for the port so I don't expect things to change too much.

Thanks for the review so far.

Logan


Jon Mason

unread,
Jun 19, 2017, 3:14:31 PM6/19/17
to Greg Kroah-Hartman, Serge Semin, Logan Gunthorpe, Allen Hubbe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Dave Jiang, Bjorn Helgaas, Kurt Schwemmer, Stephen Bates, Sergey...@t-platforms.ru
Well, Linus has been taking my pull request from it since v3.12. He did
call me out initially for requesting it initially, but was amenible
after I GPG signed all of my pull requests (and had a sufficient number
of people he "knew" in my ring). But all of that has been sorted out
now.

The reason it is on Github is for the Wiki of NTB HW and it's usage
https://github.com/jonmason/ntb/wiki
It's gotten a bit stale, but was very useful back in the v3.12 days :)
(Also, I am using this as a call to update the Wiki!)

Thanks,
Jon

Jon Mason

unread,
Jun 19, 2017, 4:07:11 PM6/19/17
to Logan Gunthorpe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Dave Jiang, Allen Hubbe, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates
On Thu, Jun 15, 2017 at 02:37:16PM -0600, Logan Gunthorpe wrote:
> Hi,
>
> This patchset implements Non-Transparent Bridge (NTB) support for the
> Microsemi Switchtec series of switches. We're looking for some
> review from the community at this point but hope to get it upstreamed
> for v4.14.
>
> Switchtec NTB support is configured over the same function and bar
> as the management endpoint. Thus, the new driver hooks into the
> management driver which we had merged in v4.12. We use the class
> interface API to register an NTB device for every switchtec device
> which supports NTB (not all do).
>
> The Switchtec hardware supports doorbells, memory windows and messages.
> Seeing there is no native scratchpad support, 128 spads are emulated
> through the use of a pre-setup memory window. The switch has 64
> doorbells which are shared between the two partitions and a
> configurable set of memory windows. While the hardware supports more
> than 2 partitions, this driver only supports the first two seeing
> the current NTB API only supports two hosts.
>
> The driver has been tested with ntb_netdev and fully passes the
> ntb_test script.
>
> This patchset is based off of v4.12-rc5 and can be found in this
> git repo:
>
> https://github.com/sbates130272/linux-p2pmem.git switchtec_ntb

I think this code is of quality enough to go from an RFC to a standard
patch, and we can nit pick it to death there ;-)

Please rebase on ntb-next (which I believe you are already doing), and
resbutmit.

Also, we'll need to decide how to handle the patches to
drivers/pci/switches
We have 2 options here.
1. We can split it up into 2 separate series, and have Bjorn take the
drivers/pci/switches in his, and I'll take the NTB portions. The down
side of this is that I'll have to wait for his to get pulled in first,
which will push this back a full kernel release (which based on your
comment above doesn't seem your prefered choice)
2. I can pull in the drivers/pci/switches patches into my NTB tree,
but I'll want Bjorn's ack on those patches before I'll pull them in.

I'm thinking that we'll want to keep this series all in one place.
So, #2 sounds like the best option. But, I need Bjorns $0.02 on this.


FYI, I ran smatch on the patches and got this. Please correct them in
v2 (or v1 of the non-RFC...however you want to think of it).
drivers/pci/switch/switchtec.c:484 switchtec_dev_read() error: double unlock 'mutex:&stdev->mrpc_mutex'
drivers/pci/switch/switchtec.c:506 switchtec_dev_read() error: double unlock 'mutex:&stdev->mrpc_mutex'
drivers/pci/switch/switchtec.c:513 switchtec_dev_read() warn: inconsistent returns 'mutex:&stdev->mrpc_mutex'.


Thanks,
Jon
> --
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+...@googlegroups.com.
> To post to this group, send email to linu...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/20170615203729.9009-1-logang%40deltatee.com.

Logan Gunthorpe

unread,
Jun 19, 2017, 4:27:49 PM6/19/17
to Jon Mason, linu...@googlegroups.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, Dave Jiang, Allen Hubbe, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates


On 19/06/17 02:07 PM, Jon Mason wrote:
> I think this code is of quality enough to go from an RFC to a standard
> patch, and we can nit pick it to death there ;-)

Thanks!

> Please rebase on ntb-next (which I believe you are already doing), and
> resbutmit.

I'll try to get the rebase done and all the feedback so far applied by
the end of the week and resend a v1.

> I'm thinking that we'll want to keep this series all in one place.
> So, #2 sounds like the best option. But, I need Bjorns $0.02 on this.

I was thinking #2 was the best choice as well but really it's for you
maintainers to decide. And, yes, we'd want to get Bjorn's ack.

> FYI, I ran smatch on the patches and got this. Please correct them in
> v2 (or v1 of the non-RFC...however you want to think of it).
> drivers/pci/switch/switchtec.c:484 switchtec_dev_read() error: double unlock 'mutex:&stdev->mrpc_mutex'
> drivers/pci/switch/switchtec.c:506 switchtec_dev_read() error: double unlock 'mutex:&stdev->mrpc_mutex'
> drivers/pci/switch/switchtec.c:513 switchtec_dev_read() warn: inconsistent returns 'mutex:&stdev->mrpc_mutex'.

This looks like a false positive to me. The code looks correct. smatch
may have been confused by the fact that the lock is taken by two calls
to the static function 'lock_mutex_and_test_alive'.

This is also part of the switchtec management driver that's already in
the kernel and not part of the NTB related patches I sent.

Logan

Jon Mason

unread,
Jun 19, 2017, 5:09:34 PM6/19/17
to Logan Gunthorpe, linu...@googlegroups.com, linu...@vger.kernel.org, linux-kernel, Dave Jiang, Allen Hubbe, Bjorn Helgaas, Greg Kroah-Hartman, Kurt Schwemmer, Stephen Bates
On Mon, Jun 19, 2017 at 4:27 PM, Logan Gunthorpe <log...@deltatee.com> wrote:
>
>
> On 19/06/17 02:07 PM, Jon Mason wrote:
>> I think this code is of quality enough to go from an RFC to a standard
>> patch, and we can nit pick it to death there ;-)
>
> Thanks!
>
>> Please rebase on ntb-next (which I believe you are already doing), and
>> resbutmit.
>
> I'll try to get the rebase done and all the feedback so far applied by
> the end of the week and resend a v1.
>
>> I'm thinking that we'll want to keep this series all in one place.
>> So, #2 sounds like the best option. But, I need Bjorns $0.02 on this.
>
> I was thinking #2 was the best choice as well but really it's for you
> maintainers to decide. And, yes, we'd want to get Bjorn's ack.

Bjorn is a busy guy, but I'm sure he'll get to it soon enough.

>> FYI, I ran smatch on the patches and got this. Please correct them in
>> v2 (or v1 of the non-RFC...however you want to think of it).
>> drivers/pci/switch/switchtec.c:484 switchtec_dev_read() error: double unlock 'mutex:&stdev->mrpc_mutex'
>> drivers/pci/switch/switchtec.c:506 switchtec_dev_read() error: double unlock 'mutex:&stdev->mrpc_mutex'
>> drivers/pci/switch/switchtec.c:513 switchtec_dev_read() warn: inconsistent returns 'mutex:&stdev->mrpc_mutex'.
>
> This looks like a false positive to me. The code looks correct. smatch
> may have been confused by the fact that the lock is taken by two calls
> to the static function 'lock_mutex_and_test_alive'.
>
> This is also part of the switchtec management driver that's already in
> the kernel and not part of the NTB related patches I sent.

Ah, no problem. I'm not a master user of smatch, but I do find it
very useful. So, I do not know of a way to run it against only the
patches being applied, and I'm not sure that is possible to do so.
Either way, I'm sure someone will address that then, given that it is
already in the kernel.

BTW, I don't think I said so, but I'm really excited to have another
piece of NTB HW supported in the kernel.

Thanks,
Jon

>
> Logan

Logan Gunthorpe

unread,
Jun 22, 2017, 12:17:50 PM6/22/17
to Jon Mason, Allen Hubbe, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
Hey Guys,

I've run into some subtle issues with the new API:

It has to do with splitting mw_get_range into mw_get_align and
peer_mw_get_addr.

The original mw_get_range returned the size of the /local/ memory
window's size, address and alignment requirements. The ntb clients then
take the local size and transmit it via spads to the peer which would
use it in setting up the memory window. However, it made the assumption
that the alignment restrictions were symmetric on both hosts seeing they
were not sent across the link.

The new API makes a sensible change for this in that mw_get_align
appears to be intended to return the alignment restrictions (and now
size) of the peer. This helps a bit for the Switchtec driver but appears
to be a semantic change that wasn't really reflected in the changes to
the other NTB code. So, I see a couple of issues:

1) With our hardware, we can't actually know anything about the peer's
memory windows until the peer has finished its setup (ie. the link is
up). However, all the clients call the function during probe, before the
link is ready. There's really no good reason for this, so I think we
should change the clients so that mw_get_align is called only when the
link is up.

2) The changes to the Intel and AMD driver for mw_get_align sets
*max_size to the local pci resource size. (Thus making the assumption
that the local is the same as the peer, which is wrong). max_size isn't
actually used for anything so it's not _really_ an issue, but I do think
it's confusing and incorrect. I'd suggest we remove max_size until
something actually needs it, or at least set it to zero in cases where
the hardware doesn't support returning the size of the peer's memory
window (ie. in the Intel and AMD drivers).

Thoughts?

Logan

Allen Hubbe

unread,
Jun 22, 2017, 2:33:11 PM6/22/17
to Logan Gunthorpe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
From: Logan Gunthorpe
You're right, and the b2b_split in the Intel driver even makes use of different primary/secondary bar sizes. For Intel and AMD, it would make more sense to use the secondary bar size here. The size of the secondary bar still not necessarily valid end-to-end, because in b2b the peer's primary bar size could be even smaller.

I'm not entirely convinced that this should represent the end-to-end size of local and peer memory window configurations. I think it should represent the largest side that would be valid to pass to ntb_mw_set_trans(). Then, the peers should communicate their respective max sizes (along with translation addresses, etc) before setting up the translations, and that exchange will ensure that the size finally used is valid end-to-end.

>
> Thoughts?
>
> Logan

Logan Gunthorpe

unread,
Jun 22, 2017, 2:40:44 PM6/22/17
to Allen Hubbe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
But why would the client ever need to use the max_size instead of the
actual size of the bar as retrieved and exchanged from peer_mw_get_addr?

Logan

Allen Hubbe

unread,
Jun 22, 2017, 6:13:13 PM6/22/17
to Logan Gunthorpe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
From: Logan Gunthorpe
> On 6/22/2017 12:32 PM, Allen Hubbe wrote:
> > From: Logan Gunthorpe
> >> 2) The changes to the Intel and AMD driver for mw_get_align sets
> >> *max_size to the local pci resource size. (Thus making the assumption
> >> that the local is the same as the peer, which is wrong). max_size isn't
> >> actually used for anything so it's not _really_ an issue, but I do think
> >> it's confusing and incorrect. I'd suggest we remove max_size until
> >> something actually needs it, or at least set it to zero in cases where
> >> the hardware doesn't support returning the size of the peer's memory
> >> window (ie. in the Intel and AMD drivers).
> >
> > You're right, and the b2b_split in the Intel driver even makes use of different primary/secondary
> bar sizes. For Intel and AMD, it would make more sense to use the secondary bar size here. The size
> of the secondary bar still not necessarily valid end-to-end, because in b2b the peer's primary bar
> size could be even smaller.
> >
> > I'm not entirely convinced that this should represent the end-to-end size of local and peer memory
> window configurations. I think it should represent the largest side that would be valid to pass to
> ntb_mw_set_trans(). Then, the peers should communicate their respective max sizes (along with
> translation addresses, etc) before setting up the translations, and that exchange will ensure that the
> size finally used is valid end-to-end.
>
> But why would the client ever need to use the max_size instead of the
> actual size of the bar as retrieved and exchanged from peer_mw_get_addr?

The resource size given by peer_mw_get_addr might be different than the max_size given by ntb_mw_get_align.

I am most familiar with the ntb_hw_intel driver and that type of ntb hardware. The peer_mw_get_addr size is of the primary bar on the side to be the source of the translated writes (or reads). In b2b topology, at least, the first translation of that write lands it on the secondary bar of the peer ntb. That size of that bar could different than the first. The second translation lands the write in memory (eg). So, the end-to-end translation is limited by the first AND second sizes.

The first point is, the *max_size returned by intel_ntb_mw_get_align looks wrong. That should be the size of the secondary bar, not the resource size of the primary bar, of that device.

The second point is, because the sizes returned by peer_mw_get_addr, and ntb_mw_get_align, may be different, the two sides should communicate and reconcile the address and size information when setting up the translations.

Logan Gunthorpe

unread,
Jun 22, 2017, 6:19:21 PM6/22/17
to Allen Hubbe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
On 6/22/2017 4:12 PM, Allen Hubbe wrote:
> The resource size given by peer_mw_get_addr might be different than the max_size given by ntb_mw_get_align.
>
> I am most familiar with the ntb_hw_intel driver and that type of ntb hardware. The peer_mw_get_addr size is of the primary bar on the side to be the source of the translated writes (or reads). In b2b topology, at least, the first translation of that write lands it on the secondary bar of the peer ntb. That size of that bar could different than the first. The second translation lands the write in memory (eg). So, the end-to-end translation is limited by the first AND second sizes.
>
> The first point is, the *max_size returned by intel_ntb_mw_get_align looks wrong. That should be the size of the secondary bar, not the resource size of the primary bar, of that device.
>
> The second point is, because the sizes returned by peer_mw_get_addr, and ntb_mw_get_align, may be different, the two sides should communicate and reconcile the address and size information when setting up the translations.

Ok, that makes some sense. So drivers that don't need this should return
-1 or something like that? Though, I'd still suggest that until a driver
actually needs this, and it's implemented correctly in the clients, we
should leave it out.

Any thoughts on changing the semantics of mw_get_align so it must be
called with the link up?

Logan




Allen Hubbe

unread,
Jun 22, 2017, 6:43:08 PM6/22/17
to Logan Gunthorpe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
From: Logan Gunthorpe
> Any thoughts on changing the semantics of mw_get_align so it must be
> called with the link up?

The intention of these is that these calls return information from the local port. The calls themselves don't reach across the link to the peer, but the information returned from the local port needs to be communicated for setting up the translation end-to-end.

I would like to understand why this hardware needs link up. Are there registers on the local port that are only valid after link up?

Can you post snippets of how ntb_mw_get_align and ntb_peer_mw_get_addr might be implemented for the Switchtec?

Logan Gunthorpe

unread,
Jun 22, 2017, 6:49:28 PM6/22/17
to Allen Hubbe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
On 6/22/2017 4:42 PM, Allen Hubbe wrote:
> From: Logan Gunthorpe
>> Any thoughts on changing the semantics of mw_get_align so it must be
>> called with the link up?
>
> The intention of these is that these calls return information from the local port. The calls themselves don't reach across the link to the peer, but the information returned from the local port needs to be communicated for setting up the translation end-to-end.

Ok, well if it's from the local port, then splitting up mw_get_range
into peer_mw_get_addr and mw_get_align is confusing because one has the
peer designation and the other doesn't. And all the clients apply the
alignments to the remote bar so they'd technically need to transfer them
across the link somehow.

> I would like to understand why this hardware needs link up. Are there registers on the local port that are only valid after link up?

We only need the link up if we are trying to find the alignment
requirements (and max_size) of the peer's bar. In theory, switchtec
could have different sizes of bars on both sides of the link and
different alignment requirements. Though, in practice, this is probably
unlikely.

> Can you post snippets of how ntb_mw_get_align and ntb_peer_mw_get_addr might be implemented for the Switchtec?

Hmm, yes, but lets sort out my confusion on the semantics per above first.

Logan

Allen Hubbe

unread,
Jun 23, 2017, 9:18:33 AM6/23/17
to Logan Gunthorpe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
From: Logan Gunthorpe
> On 6/22/2017 4:42 PM, Allen Hubbe wrote:
> > From: Logan Gunthorpe
> >> Any thoughts on changing the semantics of mw_get_align so it must be
> >> called with the link up?
> >
> > The intention of these is that these calls return information from the local port. The calls
> themselves don't reach across the link to the peer, but the information returned from the local port
> needs to be communicated for setting up the translation end-to-end.
>
> Ok, well if it's from the local port, then splitting up mw_get_range
> into peer_mw_get_addr and mw_get_align is confusing because one has the
> peer designation and the other doesn't. And all the clients apply the
> alignments to the remote bar so they'd technically need to transfer them
> across the link somehow.

By "remote" do you mean the source or destination of a write?

Yes, clients should transfer the address and size information to the peer.

> > I would like to understand why this hardware needs link up. Are there registers on the local port
> that are only valid after link up?
>
> We only need the link up if we are trying to find the alignment
> requirements (and max_size) of the peer's bar. In theory, switchtec

Maybe this is the confusion. None of these api calls are to reach across to the peer port, as to get the size of the peer's bar. They are to get information from the local port, or to configure the local port.

Some mw configuration is done at the destination, as for Intel and AMD, and some configuration is done on the source side, for IDT. The local configuration of the port on one side could depend on information from the remote port on the other side. For example in IDT, the mw translation configured on the source side needs to know destination address information. Likewise, if there is any difference in the size of the range that can be translated by ports on opposing sides, that needs to be negotiated.

> could have different sizes of bars on both sides of the link and
> different alignment requirements. Though, in practice, this is probably
> unlikely.
>
> > Can you post snippets of how ntb_mw_get_align and ntb_peer_mw_get_addr might be implemented for the
> Switchtec?
>
> Hmm, yes, but lets sort out my confusion on the semantics per above first.

Some snippets of code would help me understand your interpretation of the api semantics more exactly.

> Logan

Logan Gunthorpe

unread,
Jun 23, 2017, 12:51:12 PM6/23/17
to Allen Hubbe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman


On 23/06/17 07:18 AM, Allen Hubbe wrote:
> By "remote" do you mean the source or destination of a write?

Look at how these calls are used in ntb_transport and ntb_perf:

They both call ntb_peer_mw_get_addr to get the size of the BAR. The size
is sent via spads to the other side. The other side then uses
ntb_mw_get_align and applies align_size to the received size.

> Yes, clients should transfer the address and size information to the peer.

But then they also need to technically transfer the alignment
information as well. Which neither of the clients do.

> Maybe this is the confusion. None of these api calls are to reach across to the peer port, as to get the size of the peer's bar. They are to get information from the local port, or to configure the local port.

I like the rule that these api calls are not to reach across the port.
But then API looks very wrong. Why are we calling one peer_mw_get addr
and the other mw_get_align? And why does mw_get_align have a peer index?


> Some snippets of code would help me understand your interpretation of the api semantics more exactly.

I'm not sure the code to best show this in code, but let me try
describing an example situation:

Lets say we have the following mws on each side (this is something that
is possible with Switchtec hardware):

Host A BARs:
mwA0: 2MB size, aligned to 4k, size aligned to 4k
mwA1: 4MB size, aligned to 4k, size aligned to 4k
mwA2: 64k size, aligned to 64k, size aligned to 64k

Host B BARs:
mwB0: 2MB size, aligned to 4k, size aligned to 4k
mwB1: 64k size, aligned to 64k, size aligned to 64k

So on Host A: peer_mw_get_addr(idx=1) should return size=4M (from mwA1),
but it's not clear what mw_get_align(widx=1) should do. I see two
possibilities:

1) Given that it has the opposite sense of peer_mw_get_addr (ie. there
is no peer in it's name) and that this new api also has a peer index, it
should return align_size=64k (from mwB1). However, in order to do this,
Host B must be fully configured (technically the link doesn't have to be
fully up, but having a link up is the only way for a client to tell if
Host B is configured or not).

2) Given your assertion that these APIs should never reach across the
link, then one could say it should return align_size=4k. However, in
this situation the name is really confusing. And the fact that it has a
pidx argument makes no sense. And the way ntb_transport and ntb_perf use
it is wrong because they will end up masking the 64K size of mwB1 with
the 4k align_size from mwA1.

Does that make more sense?

Thanks,

Logan


Allen Hubbe

unread,
Jun 23, 2017, 3:07:43 PM6/23/17
to Logan Gunthorpe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
From: Logan Gunthorpe
> On 23/06/17 07:18 AM, Allen Hubbe wrote:
> > By "remote" do you mean the source or destination of a write?
>
> Look at how these calls are used in ntb_transport and ntb_perf:
>
> They both call ntb_peer_mw_get_addr to get the size of the BAR. The size
> is sent via spads to the other side. The other side then uses
> ntb_mw_get_align and applies align_size to the received size.
>
> > Yes, clients should transfer the address and size information to the peer.
>
> But then they also need to technically transfer the alignment
> information as well. Which neither of the clients do.

The client's haven't been fully ported to the multi-port api yet. They were only minimally changed to call the new api, but so far other than that they have only been made to work as they had before.

> > Maybe this is the confusion. None of these api calls are to reach across to the peer port, as to
> get the size of the peer's bar. They are to get information from the local port, or to configure the
> local port.
>
> I like the rule that these api calls are not to reach across the port.
> But then API looks very wrong. Why are we calling one peer_mw_get addr
> and the other mw_get_align? And why does mw_get_align have a peer index?

I regret that the term "peer" is used to distinguish the mw api. Better names perhaps should be ntb_outbound_mw_foo, ntb_inbound_mw_foo; or ntb_src_mw_foo, ntb_dest_mw_foo. I like outbound/inbound, although the names are longer, maybe out/in would be ok.

> And why does mw_get_align have a peer index?

Good question. Serge?

For that matter, why do we not also have peer mw idx in the set of parameters. Working through the example below, it looks like we are lacking a way to say Outbound MW1 on A corresponds with Inbound MW0 on B. It looks like we can only indicate that Port A (not which Outbound MW of Port A) corresponds with Inbound MW0 on B.

> > Some snippets of code would help me understand your interpretation of the api semantics more
> exactly.
>
> I'm not sure the code to best show this in code, but let me try
> describing an example situation:
>
> Lets say we have the following mws on each side (this is something that
> is possible with Switchtec hardware):
>
> Host A BARs:
> mwA0: 2MB size, aligned to 4k, size aligned to 4k
> mwA1: 4MB size, aligned to 4k, size aligned to 4k
> mwA2: 64k size, aligned to 64k, size aligned to 64k
>
> Host B BARs:
> mwB0: 2MB size, aligned to 4k, size aligned to 4k
> mwB1: 64k size, aligned to 64k, size aligned to 64k

If those are BARs, that corresponds to "outbound", writing something to the BAR at mwA0.

A more complete picture might be:

Host A BARs (aka "outbound" or "peer" memory windows):
peer_mwA0: resource at 0xA00000000 - 0xA00200000 (2MB)
peer_mwA1: resource at 0xA10000000 - 0xA10400000 (4MB)
peer_mwA2: resource at 0xA20000000 - 0xa20010000 (64k)

Host A MWs (aka "inbound" memory windows):
mwA0: 64k max size, aligned to 64k, size aligned to 64k
mwA1: 2MB max size, aligned to 4k, size aligned to 4k

Host A sees Host B on port index 1


Host B BARs (aka "outbound" or "peer" memory windows):
peer_mwB0: resource at 0xB00000000 - 0xB00200000 (2MB)
peer_mwB1: resource at 0xB10000000 - 0xB10010000 (64k)

Host B MWs (aka "inbound" memory windows):
mwB0: 1MB size, aligned to 4k, size aligned to 4k
mwB1: 2MB size, aligned to 4k, size aligned to 4k

Host B sees Host A on port index 4


Outbound memory (aka "peer mw") windows come with a pci resource. We can get the size of the resource, it's physical address, and set up outbound translation if the hardware has that (IDT).

Inbound memory windows (aka "mw") are only used to set up inbound translation, if the hardware has that (Intel, AMD).

To set up end-to-end memory window so that A can write to B, let's use peer_mwA1 and mwB0.

A: ntb_peer_mw_get_addr(peer_mwA1) -> base 0xA10000000, size 4MB
B: ntb_mw_get_align(port4**, mwB0) -> aligned 4k, aligned 4k, max size 1MB
** Serge: do we need port info here, why?

Side A has a resource size of 4MB, but B only supports inbound translation up to 1MB. Side A can only use the first quarter of the 4MB resource.

Side B needs to allocate memory aligned to 4k (the dma address must be aligned to 4k after dma mapping), and a multiple of 4k in size. B may need to set inbound translation so that incoming writes go into this memory. A may also need to set outbound translation.

A: ntb_peer_mw_set_trans(port1**, peer_mwA1, dma_mem_addr, dma_mem_size)
B: ntb_mw_set_trans(port4**, mwB0, dma_mem_addr, dma_mem_size)
** Serge: do we also need the opposing side MW index here?

** Logan: would those changes to the api suit your needs?

Logan Gunthorpe

unread,
Jun 23, 2017, 4:39:08 PM6/23/17
to Allen Hubbe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman


On 23/06/17 01:07 PM, Allen Hubbe wrote:
> The client's haven't been fully ported to the multi-port api yet. They were only minimally changed to call the new api, but so far other than that they have only been made to work as they had before.

So is it intended to eventually send the align parameters via spads?
This seems like it would take a lot of spads or multiplexing the spads
with a few doorbells. This gets a bit nasty.


> If those are BARs, that corresponds to "outbound", writing something to the BAR at mwA0.
> A more complete picture might be:
>
> Host A BARs (aka "outbound" or "peer" memory windows):
> peer_mwA0: resource at 0xA00000000 - 0xA00200000 (2MB)
> peer_mwA1: resource at 0xA10000000 - 0xA10400000 (4MB)
> peer_mwA2: resource at 0xA20000000 - 0xa20010000 (64k)
>
> Host A MWs (aka "inbound" memory windows):
> mwA0: 64k max size, aligned to 64k, size aligned to 64k
> mwA1: 2MB max size, aligned to 4k, size aligned to 4k

I don't really like the separation of inbound and output as you describe
it. It doesn't really match my hardware. In switchtec, each partition
has some number of BARs and each BAR has a single translation which sets
the peer and destination address. The translation really exists inside
the switch hardware, not on either side. But any translation can be
programmed by any peer. Saying that there's an opposite inbound window
to every outbound window is not an accurate abstraction for us.

I _suspect_ the IDT hardware is similar but, based on Serge's driver, I
think the translation can only be programmed by the peer that the BAR is
resident in (as opposed to from any side like in the switchtec hardwer).
(This poses some problems for getting the IDT code to actually work with
existing clients.)

> Outbound memory (aka "peer mw") windows come with a pci resource. We can get the size of the resource, it's physical address, and set up outbound translation if the hardware has that (IDT).
>
> Inbound memory windows (aka "mw") are only used to set up inbound translation, if the hardware has that (Intel, AMD).
>
> To set up end-to-end memory window so that A can write to B, let's use peer_mwA1 and mwB0.
>
> A: ntb_peer_mw_get_addr(peer_mwA1) -> base 0xA10000000, size 4MB
> B: ntb_mw_get_align(port4**, mwB0) -> aligned 4k, aligned 4k, max size 1MB
> ** Serge: do we need port info here, why?
>
> Side A has a resource size of 4MB, but B only supports inbound translation up to 1MB. Side A can only use the first quarter of the 4MB resource.
>
> Side B needs to allocate memory aligned to 4k (the dma address must be aligned to 4k after dma mapping), and a multiple of 4k in size. B may need to set inbound translation so that incoming writes go into this memory. A may also need to set outbound translation.
>
> A: ntb_peer_mw_set_trans(port1**, peer_mwA1, dma_mem_addr, dma_mem_size)
> B: ntb_mw_set_trans(port4**, mwB0, dma_mem_addr, dma_mem_size)
> ** Serge: do we also need the opposing side MW index here?
>
> ** Logan: would those changes to the api suit your needs?

Not really, no. Except for the confusion with the mw_get_align issue the
new API, as it is, suits my hardware well. What you're proposing doesn't
fix my issue and doesn't match my hardware. Though, I interpreted
ntb_peer_mw_set_trans somewhat differently from what you describe. I did
not expect the client would need to call both functions but some clients
could optionally use ntb_peer_mw_set_trans to set the translation from
the opposite side (thus needing to send the DMA address over spads or
msgs). Though, without an actual in-kernel user it's hard to know what
is actually intended.

It's worth noticing that the IDT driver only provides peer_mw_set_trans
and not mw_set_trans. I assumed it's because the hardware's memory
windows can only be configured from the opposite side.

Pragmatically, the only change I need for everything to work as I expect
is for mw_get_align to be called only after link up. However, given all
the confusion I'm wondering if these changes are even ready for
upstream. Without actual in-kernel client code it's hard to know if the
API is correct or that everyone is even interpreting it in the same way.

Logan

Serge Semin

unread,
Jun 23, 2017, 5:46:48 PM6/23/17
to Allen Hubbe, Logan Gunthorpe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
On Fri, Jun 23, 2017 at 03:07:19PM -0400, Allen Hubbe <Allen...@dell.com> wrote:
Hello Allen,

> From: Logan Gunthorpe
> > On 23/06/17 07:18 AM, Allen Hubbe wrote:
> > > By "remote" do you mean the source or destination of a write?
> >
> > Look at how these calls are used in ntb_transport and ntb_perf:
> >
> > They both call ntb_peer_mw_get_addr to get the size of the BAR. The size
> > is sent via spads to the other side. The other side then uses
> > ntb_mw_get_align and applies align_size to the received size.
> >
> > > Yes, clients should transfer the address and size information to the peer.
> >
> > But then they also need to technically transfer the alignment
> > information as well. Which neither of the clients do.
>
> The client's haven't been fully ported to the multi-port api yet. They were only minimally changed to call the new api, but so far other than that they have only been made to work as they had before.
>
> > > Maybe this is the confusion. None of these api calls are to reach across to the peer port, as to
> > get the size of the peer's bar. They are to get information from the local port, or to configure the
> > local port.
> >
> > I like the rule that these api calls are not to reach across the port.
> > But then API looks very wrong. Why are we calling one peer_mw_get addr
> > and the other mw_get_align? And why does mw_get_align have a peer index?
>
> I regret that the term "peer" is used to distinguish the mw api. Better names perhaps should be ntb_outbound_mw_foo, ntb_inbound_mw_foo; or ntb_src_mw_foo, ntb_dest_mw_foo. I like outbound/inbound, although the names are longer, maybe out/in would be ok.
>

As you remember we discussed the matter when new NTB API was being developed.
So we decided to have the following designation:
ntb_mw_* - NTB MW API connected with inbound memory windows configurations,
ntb_peer_mw_* - NTB MW API connected with outbound memory windows configurations.
Yes, suffixes like "in/out" might give better notion of the methods purpose,
but the current notation is in coherency with the rest of API, and as long as
user gets into NTB API, one won't have much difficulties with understanding.

> > And why does mw_get_align have a peer index?
>
> Good question. Serge?
>
> For that matter, why do we not also have peer mw idx in the set of parameters. Working through the example below, it looks like we are lacking a way to say Outbound MW1 on A corresponds with Inbound MW0 on B. It looks like we can only indicate that Port A (not which Outbound MW of Port A) corresponds with Inbound MW0 on B.
>

Before I give any explanation, could you please study the new NTB API documentation:
https://github.com/jonmason/ntb/blob/ntb-next/Documentation/ntb.txt
Particularly you are interested in lines 31 - 111. It will give you a better
description of the way all MW-related methods work for the start. The detailed
multi-port example is given in the end of this email.
That's not actually right. We can't connect any outbound to any inbound MW,
since each of them can be of different type, thus having different
restrictions. Like IDT got MWs with direct address translation and MWs
based on Look-up tables (I suppose the same might be with Switchtec). All
inbound MWs information provided by new NTB API actually belongs to
corresponding outbound MWs.

>
> A: ntb_peer_mw_get_addr(peer_mwA1) -> base 0xA10000000, size 4MB
> B: ntb_mw_get_align(port4**, mwB0) -> aligned 4k, aligned 4k, max size 1MB
> ** Serge: do we need port info here, why?
>
> Side A has a resource size of 4MB, but B only supports inbound translation up to 1MB. Side A can only use the first quarter of the 4MB resource.
>
> Side B needs to allocate memory aligned to 4k (the dma address must be aligned to 4k after dma mapping), and a multiple of 4k in size. B may need to set inbound translation so that incoming writes go into this memory. A may also need to set outbound translation.
>
> A: ntb_peer_mw_set_trans(port1**, peer_mwA1, dma_mem_addr, dma_mem_size)
> B: ntb_mw_set_trans(port4**, mwB0, dma_mem_addr, dma_mem_size)
> ** Serge: do we also need the opposing side MW index here?
>
> ** Logan: would those changes to the api suit your needs?
>

Alright folks. I'll explain the whole architecture by example of simple
multi-port device. Suppose we got some abstract three-NTB-ports
device Z: {Port A, Port B, Port C}. For our particular case lets say, that each
port can have up to three outbound memory windows configured (obviously it's more
than enough to have ports connected to each other, but in general each port can
have different number of outbound memory windows). So to speak each port
can access up to three remote memory regions. In this way, each port
shall have the following set of inbound and outbound memory windows:
Port A: 3 outbound MWs to access Port B or Port C memory,
6 inbound MWs to have Port B and Port C accessing Port A memory
Port B: 3 outbound MWs to access Port A or Port C memory,
6 inbound MWs to have Port A and Port C accessing Port B memory
Port C: 3 outbound MWs to access Port A or Port B memory,
6 inbound MWs to have Port A and Port B accessing Port C memory
Suppose also, that each outbound memory window got different restrictions like:
translation address alignment, memory window size alignment and maximum size of
memory window. Lets say, that our abstract device has some similarity with
Intel/AMD device (I'll explain the difference with IDT later), but obviously
got multiple ports.

Now when description is over, I would like to have Port A accessing Port B memory
over outbound memory window with index 1 - midx1 (index starts from zero).
To set the connection up, the following steps must be performed within new NTB API.

1) Port B needs to allocated a memory region for inbound memory window. But first
it needs to know what kind of address and size should be of this region. For this
purpose it calls:
ntb_mw_get_align(Port A, midx1, &addr_align, &size_align, &size_max);
As you can see, we get translation address alignment (addr_align), memory region
alignment (size_align) and its maximum possible size (size_max) to have inbound
memory allocated properly for Port A outbound memory windows - midx1.

2) The actual inbound memory region can be allocated using say
dma_alloc_coherent() or something else. Using the restrictions from 1) we can
make sure, that it is correct for corresponding outbound MW.

3) Now the translation address of the memory window needs to be set to NTBi device.
For this purpose we need to call:
ntb_mw_set_trans(Port A, midx1, trans_addr, mw_size);
Which actually implies, that the memory region with translation address trans_addr
and size mw_size is set for outbound memory window midx1 of Port A.

4) Since Port B finished the inbound memory window configuration, Port A
needs to be somehow notified of it and informed about inbound memory window
parameters, particularly memory window index (if it isn't predefined by logic)
and memory window size (again if it isn't predefined by logic). It might be done
using any of preferable API methods like doorbell and scratchpads or message
registers. After this Port A start it's configuration.

5) Port A got information from Port B about inbound MW configured for its
outbound memory window midx1. The only thing left to do is to get actual
mapping info like outbound memory window base address and size. It's done
by calling
ntb_peer_mw_get_addr(midx1, map_base, map_size);
From now map_base and map_size can be used to map actual outbound MW using
for instance ioremap_wc(), so to have Port B memory accessed over retrieved
virtual address.
It must be noted in this matter, that in general map_size doesn't need to
be equal to mw_size. But obviously Port A logic needs to know mw_size for
proper communications.


Hopefully the explanation was full enough to grab a gist of multi-port NTB API.
As I said it was based on assumption, that the abstract device got more
similarities with Intel/AMD hardware (except multiportness of course).
On the other way, the IDT hardware is configured a bit different.
The difference is located at step 3) of the algorithm above. Due to
IDT hardware peculiarity (particularly due to Lookup table entry access
registers), the translation address and size of memory region can't
be setup on the inbound MW side. Instead the translation address (trans_addr)
and size (mw_addr) must be transfered to Port A, so it would set them up using
the method:
ntb_peer_mw_set_trans(Port B, midx1, trans_addr, mw_size);
The rest of the configuration steps are the same.

Regards,
-Sergey

> > So on Host A: peer_mw_get_addr(idx=1) should return size=4M (from mwA1),
> > but it's not clear what mw_get_align(widx=1) should do. I see two
> > possibilities:
> >
> > 1) Given that it has the opposite sense of peer_mw_get_addr (ie. there
> > is no peer in it's name) and that this new api also has a peer index, it
> > should return align_size=64k (from mwB1). However, in order to do this,
> > Host B must be fully configured (technically the link doesn't have to be
> > fully up, but having a link up is the only way for a client to tell if
> > Host B is configured or not).
> >
> > 2) Given your assertion that these APIs should never reach across the
> > link, then one could say it should return align_size=4k. However, in
> > this situation the name is really confusing. And the fact that it has a
> > pidx argument makes no sense. And the way ntb_transport and ntb_perf use
> > it is wrong because they will end up masking the 64K size of mwB1 with
> > the 4k align_size from mwA1.
> >
> > Does that make more sense?
> >
> > Thanks,
> >
> > Logan
> >
>
> --
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+...@googlegroups.com.
> To post to this group, send email to linu...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/000101d2ec53%24f2830840%24d78918c0%24%40dell.com.

Logan Gunthorpe

unread,
Jun 23, 2017, 5:59:55 PM6/23/17
to Serge Semin, Allen Hubbe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
Hey,

Thanks Serge for the detailed explanation. This is pretty much exactly
as I had thought it should be interpreted. My only problem remains that
my hardware can't provide ntb_mw_get_align until the port it is asking
about has configured itself. The easiest way to solve this is to only
allow access when the link to that port is up. It's not a complicated
change and would actually simplify ntb_transport and ntb_perf a little.

Is this ok with you Allen? If it is, I can include a patch in my next
switchtec submission.

Thanks,

Logan

Allen Hubbe

unread,
Jun 23, 2017, 6:02:19 PM6/23/17
to Logan Gunthorpe, Serge Semin, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
From: Logan Gunthorpe
Ok.

>
> Thanks,
>
> Logan

Allen Hubbe

unread,
Jun 23, 2017, 6:07:03 PM6/23/17
to Logan Gunthorpe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
From: Logan Gunthorpe
> But any translation can be
> programmed by any peer.

That doesn't seem safe. Even though it can be done as you say, would it not be better to have each specific translation under the control of exactly one driver?

If drivers can reach across and set the translation of any peer bar, they would still need to negotiate among N peers which one sets which other's translation.

Logan Gunthorpe

unread,
Jun 23, 2017, 7:06:49 PM6/23/17
to Allen Hubbe, Jon Mason, linu...@googlegroups.com, linux-...@vger.kernel.org, Dave Jiang, Serge Semin, Kurt Schwemmer, Stephen Bates, Greg Kroah-Hartman
Yup. In a dual host setup its not a problem seeing only the one peer
will ever set any of the memory windows. In the multi-host setup, there
has to be implicit or explicit negotiation as to which memory window
connects to which peer.

The hardware also implements locking so if two peers try to mess with
the same translation, the worst that can happen is the last peer's
settings will take effect or one peer will see an error.

Logan

Reply all
Reply to author
Forward
0 new messages