Infiniband specification


Alex Forencich

Jul 12, 2021, 4:56:52 PM
to corund...@googlegroups.com
For anyone interested in helping with the implementation of RoCEv2 in
Corundum, here are links to version 1.4 of the IB specification from
IBTA, which covers RoCE and RoCEv2:

volume 1 v1.4 (IB protocol layer spec):
https://cw.infinibandta.org/document/dl/8567

volume 2 v1.4 (IB physical layer spec):
https://cw.infinibandta.org/document/dl/8566

Volume 1 is the main thing to look at; volume 2 is really only useful
for the InfiniBand physical layer, which does not apply to RoCE.


On a standards-related note, if anyone needs a copy of IEEE 802.3
(Ethernet physical layer spec), it can be obtained for free from IEEE here:

https://ieeexplore.ieee.org/browse/standards/get-program/page


Additionally, it seems that the PCIe gen 4 spec can also be found via
some creative googling; just make sure you get the 1.0 version, not the
0.3 draft version.

--
Alex Forencich

Alex Forencich

Jul 19, 2021, 4:39:07 PM
to corund...@googlegroups.com

Yeah, I think we're also going to have to implement PFC pause frames for this to work correctly.  That's definitely something I'm going to have to take a look at.

Looks like PFC is Annex 31D in IEEE 802.3, and pause frames are Annex 31B.

It looks like the implementation of PFC should be relatively straightforward, although I will have to make some modifications to the MACs to handle PFC frames, since they need to jump over the queues for things to work correctly.  That part of the protocol mainly involves exchanging pause times over the link, expressed in pause quanta (where 1 pause quantum is 512 bit times), and the transmit side must stop sending the specified traffic class until the pause time expires.
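To put the pause quantum in concrete terms, here is a minimal Python sketch that converts quanta to wall-clock time at a given link rate. The function names are made up for illustration; the only protocol facts assumed are the 512-bit-time quantum mentioned above and that each pause time field is 16 bits wide.

```python
PAUSE_QUANTUM_BITS = 512   # one pause quantum is 512 bit times
MAX_PAUSE_QUANTA = 0xFFFF  # pause time fields are 16 bits wide


def quanta_to_ns(quanta: int, link_rate_bps: float) -> float:
    """Convert a pause time in quanta to nanoseconds at the given link rate."""
    return quanta * PAUSE_QUANTUM_BITS / link_rate_bps * 1e9


if __name__ == "__main__":
    # Print the duration of one quantum and of the longest possible pause.
    for rate in (10e9, 25e9, 100e9):
        one = quanta_to_ns(1, rate)
        longest = quanta_to_ns(MAX_PAUSE_QUANTA, rate)
        print(f"{rate / 1e9:.0f} Gbps: 1 quantum = {one:.2f} ns, "
              f"max pause = {longest / 1e3:.1f} us")
```

Because a quantum is defined in bit times, the same quanta value pauses for a shorter wall-clock time on a faster link.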

I think the more tricky part is going to be implementing support for traffic classes in the rest of Corundum; right now it's not totally clear exactly how that's going to work, aside from having an output FIFO for each traffic class, internal flow control to prevent the output FIFOs from filling up, and logic at the input of the TX MAC after the FIFO to merge the data coming out of the separate FIFOs.  I'm planning on implementing some form of internal flow control along with the shared interface datapath so that there isn't head-of-line blocking when using multiple ports per interface, so it makes sense to figure out how PFC would fit in alongside that since it requires similar functionality. 
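As a rough behavioral model of the arrangement described above (one output FIFO per traffic class, per-class pause timers, and a merge stage in front of the TX MAC), something like the following Python sketch may help when reasoning about it. This is not Corundum code: the class and method names are hypothetical, the round-robin arbitration is an assumption (the thread does not say how the merge would arbitrate), and time advances in whole pause quanta for simplicity.

```python
from collections import deque

NUM_CLASSES = 8  # PFC carries a pause time for each of eight priorities


class PfcTxModel:
    """Toy model: per-class FIFOs merged into one stream, gated by PFC timers."""

    def __init__(self):
        self.fifos = [deque() for _ in range(NUM_CLASSES)]
        self.pause_timers = [0] * NUM_CLASSES  # remaining pause, in quanta
        self.next_tc = 0                       # round-robin pointer (assumed policy)

    def enqueue(self, tc, frame):
        self.fifos[tc].append(frame)

    def receive_pfc(self, enable_vector, times):
        # Only classes whose enable bit is set are updated; a time of 0 resumes.
        for tc in range(NUM_CLASSES):
            if enable_vector & (1 << tc):
                self.pause_timers[tc] = times[tc]

    def tick_quantum(self):
        # Advance time by one pause quantum (512 bit times).
        self.pause_timers = [max(t - 1, 0) for t in self.pause_timers]

    def select_frame(self):
        # Pick the next frame from a class that is not paused, between frames.
        for offset in range(NUM_CLASSES):
            tc = (self.next_tc + offset) % NUM_CLASSES
            if self.pause_timers[tc] == 0 and self.fifos[tc]:
                self.next_tc = (tc + 1) % NUM_CLASSES
                return tc, self.fifos[tc].popleft()
        return None  # every class is either empty or paused
```

The point of gating before the merge is visible here: a paused class is simply skipped, so its queued frames never block the other classes.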

Alex Forencich
On 7/12/21 2:20 PM, maik peterson wrote:
Hello, please keep in mind that RoCE v1/v2 isn't stable at all (and never was).
The technology depends strongly on back-pressure/pause-frame handling, and this is only
reliably available in a handful of switches today (Mellanox, some Arista and Cisco). The rest will fail...

mp




Alex Forencich

Jul 19, 2021, 5:39:09 PM
to corund...@googlegroups.com

So, looking at the PFC spec in 802.3 Annex 31D, it doesn't say anything about response time.  There are some numbers in 31B.3.7 for pause frames, but TBH pause frames are pretty trivial to implement with extremely low delay: just deassert tready on the TX side of the MAC between frames whenever the pause timer is non-zero.  For PFC, this back-pressure needs to be asserted at a higher level, before the different traffic classes are merged, to avoid head-of-line blocking.

What I'm trying to determine is whether I can keep an async FIFO after the merge.  In other words, can I do the merge in the 250 MHz PCIe clock domain and then hand the packet off to the MAC, which runs at 322 MHz, through an async FIFO, or do I have to do the merge in the 322 MHz MAC clock domain, which would involve crossing a lot more stuff into the MAC clock domain?

I can't find any response time numbers in 802.3 for PFC, but there is some information in 802.1Qbb, and interestingly that seems to be much more restrictive than the numbers for pause frames: a 614.4 ns budget for pausing the queue, regardless of rate, which at 100 Gbps is 120 pause quanta versus 394 pause quanta for pause frames.  Does anyone have any concrete information on this, or other relevant sources?  Does properly implementing PFC mean that the requirements in 802.1Qbb need to be followed in addition to 802.3, or does 802.3 supersede 802.1Qbb?
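To make the comparison above concrete, here is a small Python sketch that converts the 614.4 ns budget into pause quanta and into bytes on the wire at a given rate. The constants are the figures quoted in this message, not values checked against the standards here, and the helper names are made up for illustration.

```python
PAUSE_QUANTUM_BITS = 512        # one pause quantum is 512 bit times
PFC_BUDGET_NS = 614.4           # 802.1Qbb figure quoted in this message
PAUSE_LIMIT_QUANTA_100G = 394   # pause frame figure at 100 Gbps quoted in this message


def ns_to_quanta(ns: float, rate_bps: float) -> float:
    """Express a time budget in pause quanta at the given link rate."""
    return ns * 1e-9 * rate_bps / PAUSE_QUANTUM_BITS


def bytes_in_budget(ns: float, rate_bps: float) -> float:
    """Bytes that can leave the wire at line rate within the time budget."""
    return ns * 1e-9 * rate_bps / 8


if __name__ == "__main__":
    for rate in (10e9, 100e9):
        print(f"{rate / 1e9:.0f} Gbps: budget = "
              f"{ns_to_quanta(PFC_BUDGET_NS, rate):.1f} quanta, "
              f"{bytes_in_budget(PFC_BUDGET_NS, rate):.0f} bytes at line rate")
    print(f"pause frame limit at 100 Gbps: {PAUSE_LIMIT_QUANTA_100G} quanta")
```

The byte figure is a rough upper bound on how much committed data may sit between the point where back-pressure is applied and the wire, which is what matters for the FIFO-after-merge question.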

Ideally, I would like to be able to do everything in the PCIe clock domain and then just have a single async frame FIFO to the MAC.  However, that FIFO would have to be large enough to store jumbo frames, and it won't be possible to stop the transmission of frames that have already been handed off to it.  The delay budget in 802.1Qbb is about half the transmission time of a 1500 byte frame at 10 Gbps (1.2 us) and less than the transmission time of a 9K jumbo frame at 100 Gbps (720 ns), so a whole frame sitting in that FIFO could already blow the budget.

Perhaps what I need to do is implement something more akin to an elastic buffer for TX that attempts to maintain a minimum fill level.  That way I don't need a frame-oriented FIFO, so the FIFO delay will be much less than an MTU-sized frame and should satisfy the response time requirements in 802.1Qbb.  For MTU frames on a 512-bit AXI stream interface at 100 Gbps, the clock speed only needs to be around 200 MHz, so I think this will work OK with a frame-oriented FIFO at 250 MHz followed by an async FIFO that enforces a minimum occupancy before releasing each start-of-frame to ensure the MAC gets a contiguous frame without any gaps.
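The numbers in this message can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it ignores preamble, inter-frame gap, and partially filled bus cycles at the end of a frame, and the function names are made up for illustration.

```python
PFC_BUDGET_NS = 614.4  # 802.1Qbb figure quoted earlier in the thread


def frame_time_ns(frame_bytes: int, rate_bps: float) -> float:
    """Time to serialize a frame of the given size at the given link rate."""
    return frame_bytes * 8 / rate_bps * 1e9


def min_clock_mhz(rate_bps: float, bus_bits: int) -> float:
    """Lowest bus clock that sustains the link rate with every cycle full."""
    return rate_bps / bus_bits / 1e6


if __name__ == "__main__":
    for size, rate in ((1500, 10e9), (9000, 100e9)):
        t = frame_time_ns(size, rate)
        print(f"{size} B at {rate / 1e9:.0f} Gbps: {t:.1f} ns "
              f"(budget is {PFC_BUDGET_NS / t:.2f} of a frame time)")
    print(f"512-bit bus at 100 Gbps: {min_clock_mhz(100e9, 512):.1f} MHz minimum")
```

Since the 250 MHz side comfortably exceeds the minimum clock, the async FIFO only needs enough occupancy to ride out the clock-domain crossing, not a full frame.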

Alex Forencich