Re: About corundum


Alex Forencich

Oct 13, 2020, 5:06:45 PM
to Tao Wang, corund...@googlegroups.com
Tao-

Thanks for your questions!  If you don't mind, I'm going to respond on
the list as this may be useful to other people as well. 

That's correct, each interface contains one set of queues and appears as
a separate network interface at the operating system level, while each
interface can then contain multiple ports that share the same set of
queues. 

In this case, which packets go out which ports is determined by the
schedulers, and since each port has its own scheduler, the mapping of
queues to ports can be changed dynamically under hardware control
without any input from the operating system.  This is intended to
support various experimental network architectures that utilize multiple
uplinks from each host.  In the receive direction, incoming packets from
multiple ports are simply multiplexed into the same set of queues.  The
hand-off to the ports is done atomically at the descriptor level - a
port makes a request to the interface descriptor fetch logic for a
descriptor for a certain queue, and either it gets a descriptor or the
operation fails because the queue is empty or disabled.  There are no
contention issues, provided there is sufficient PCIe bandwidth and the
host can keep the RX queues full. 
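To make the hand-off concrete, here is a minimal Python model of the idea (the class and method names are hypothetical; the real logic lives in the Corundum Verilog):

```python
from collections import deque

class DescriptorFetchLogic:
    """Per-interface descriptor queues, shared by all ports on the interface."""
    def __init__(self, num_queues):
        self.queues = {q: deque() for q in range(num_queues)}
        self.enabled = {q: True for q in range(num_queues)}

    def request_descriptor(self, queue_id):
        """Atomic hand-off: a port either gets a descriptor or the request fails."""
        q = self.queues[queue_id]
        if not self.enabled[queue_id] or not q:
            return None  # queue empty or disabled -> operation fails
        return q.popleft()  # descriptor is now committed to the requesting port

fetch = DescriptorFetchLogic(num_queues=4)
fetch.queues[0].extend(["desc_a", "desc_b"])

# Two ports drawing from the same queue: each request succeeds or fails atomically
assert fetch.request_descriptor(0) == "desc_a"   # first requester gets the first descriptor
assert fetch.request_descriptor(0) == "desc_b"   # second requester gets the next one
assert fetch.request_descriptor(0) is None       # queue now empty
```

Because a descriptor can only be popped once, there is nothing for the ports to contend over beyond the fetch request itself.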

Some of this may make a little bit more sense when you consider the
history of Corundum.  It was originally developed with the eventual goal
of evaluating RotorNet
(http://cseweb.ucsd.edu/~gmporter/papers/sigcomm17-rotornet.pdf) and
Opera (https://cseweb.ucsd.edu/~snoeren/papers/opera-nsdi20.pdf).  In
that network, each host has multiple uplinks, different traffic must be
sent over each of the uplinks during different timeslots, and all of
this must be done under hardware control to ensure sufficient precision
in packet transmit timing.  The idea is that software sends data to
different NIC queues based on the ultimate destination, and then it is
up to the NIC schedulers to decide not only when to send data from each
queue, but also which port to use. 
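A toy sketch of that RotorNet-style idea, assuming a hypothetical per-port timeslot table (none of these names come from the Corundum RTL):

```python
# Each port's scheduler follows its own TDMA schedule: in a given timeslot,
# only queues whose destination is reachable over that port may send.
# The table contents here are purely illustrative.
tdma_schedule = {
    # port -> {timeslot -> destination queues reachable in that slot}
    0: {0: {1, 2}, 1: {3}},
    1: {0: {3}, 1: {1, 2}},
}

def eligible_queues(port, timeslot, nonempty_queues):
    """Queues this port's scheduler may draw from in the given timeslot."""
    return tdma_schedule[port][timeslot] & nonempty_queues

nonempty = {1, 3}
assert eligible_queues(0, 0, nonempty) == {1}   # port 0, slot 0: only dest 1
assert eligible_queues(1, 0, nonempty) == {3}   # port 1 covers dest 3 in slot 0
assert eligible_queues(0, 1, nonempty) == {3}   # the schedule rotates per slot
```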

Another potential application would be something like P-FatTree
(https://cseweb.ucsd.edu/~snoeren/papers/ptree-hotnets16.pdf), again
where hosts have multiple uplinks and there could be some benefit to
presenting a unified interface to the OS while permitting flow steering
across multiple ports on the NIC under hardware control.  Performing the
flow steering after the transmit queue means potentially less reordering
and/or latency as you won't have packets already queued up in hardware
queues for transmission on one port when you want to migrate the flow to
a different port.  

There are potentially other applications for a NIC with this
capability.  Perhaps someone in the community has some innovative ideas
on what else can be done with this. 

If your network architecture doesn't need multiple ports on the same
interface, then just set PORTS_PER_IF to 1 and it will act like a normal
NIC.  IF_COUNT = 2, PORTS_PER_IF = 1 is exactly the same in terms of
functionality as an existing "dual-port" NIC. 

Alex Forencich

On 10/13/20 12:59 PM, Tao Wang wrote:
> Hi Alex,
>
> This is Tao Wang from NYU. I really appreciate that you open-sourced
> your 100G NIC design and ported it to many FPGA boards.
>
> But I am confused about the design of Corundum. Specifically,
> Corundum has a hierarchical structure: the FPGA at the top contains
> multiple interfaces, each of which can have multiple ports. Each port
> has its own TX/RX engine to schedule the TX/RX queues, but all ports
> share the same set of queues under the same interface.
>
> If my understanding is correct, my question is:
>
> Is there any contention between the ports? How is it resolved?
>
> Another thing: as you show in Figure 2, the typical design of this
> datapath gives every interface one port which is responsible for all
> the TX/RX queues. So what is the rationale for such a multi-port
> design?
>
> Could you please shed some light on this?
>
> Thanks in advance.
>
> Best,
> Tao

Alex Forencich

Oct 13, 2020, 6:47:54 PM
to Tao Wang, corund...@googlegroups.com
Each port has its own independent scheduler.  The scheduler is not
shared across ports.  So if you have one interface with two ports,
you'll have two independent schedulers, one for each port.  The idea is
that each port can independently determine which queues it wants to send
from at any given time. 

On the RX path, there is one RX engine associated with each physical
port.  When a packet arrives on the port, the RX engine associated with
that port requests a descriptor from the appropriate RX queue. 
Currently, the RX queue is selected via RSS flow hashing, but I'm hoping
to make this more general. 
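As a rough illustration of the queue selection, here is a simplified Python stand-in; real RSS uses a keyed Toeplitz hash rather than CRC32, so this only models the "same flow, same queue" property:

```python
import zlib

NUM_RX_QUEUES = 8

def select_rx_queue(src_ip, dst_ip, src_port, dst_port, proto):
    """Pick an RX queue from the flow 5-tuple.  CRC32 is just a stand-in
    for the keyed Toeplitz hash used by real RSS; the point is that all
    packets of one flow land in the same queue."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % NUM_RX_QUEUES

q1 = select_rx_queue("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
q2 = select_rx_queue("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
assert q1 == q2            # same flow -> same queue, regardless of arrival port
assert 0 <= q1 < NUM_RX_QUEUES
```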

Alex Forencich

On 10/13/20 3:36 PM, Tao Wang wrote:
> Thanks for your clarification.
>
> To zoom in, for the TX path, is the TX scheduler shared across multiple ports in the same interface?
>
> The TX scheduler will choose one transmit engine among those ports to send out packets from a specific queue, right?
>
> But for the RX path, how do we know which Rx engine (or port) should process that receive request?

Alex Forencich

Oct 13, 2020, 7:55:19 PM
to Tao Wang, corund...@googlegroups.com

Sure, you could have one scheduler that simply sends transmit commands to the individual ports if you want to go that route.  There is nothing inherent to the design that requires one scheduler per port.  For RotorNet, each port needs to run an independent TDMA schedule, so it makes sense to have separate schedulers for that application.  But other applications may have different requirements. 

Nobody invokes the schedulers, they operate independently.  The schedulers receive doorbell events from the queue handling logic when the driver enqueues packets, and are expected to track the queue empty/not empty state for each queue.  The schedulers then generate transmit commands that are sent to the TX engine on the port.  The TX engine then attempts to send a packet, and reports back to the scheduler either that the packet was sent, or that the send failed because the queue was empty or disabled.  

The schedulers can operate independently of each other, or you could theoretically have them communicate in some way.  They only impact each other in the sense that if you have two schedulers drawing from the same queue and they are each attempting to send at a specific rate, the queue will drain at a rate equal to the sum of the scheduler target rates.  So this may not necessarily be the best configuration if you want to precisely rate-limit specific queues. 

Also, the current schedulers in corundum are very simple round-robin schedulers; you may want to replace them with something more intelligent for your application.  I'm considering building a WFQ scheduler of some sort for Corundum at some point, but that could be a ways down the road. 

Alex Forencich
On 10/13/20 4:35 PM, Tao Wang wrote:
That makes sense.

But I am still confused about the TX side (maybe I need to read up on more background here).

Who invokes the TX scheduler on each port? And since every port has full control over all the queues, why not just have a single TX scheduler?

Or, say each TX scheduler implements a specific policy: if one scheduler can schedule all the queues, will it impact the other schedulers?

Alex Forencich

Oct 13, 2020, 10:15:22 PM
to Tao Wang, corund...@googlegroups.com

Another couple of things I should probably mention:

First, the NIC has no idea how big a packet is until the descriptor has been read, and once the descriptor has been read the NIC is committed to sending the packet (there is currently no mechanism to re-enqueue descriptors from the NIC end).  So, when the scheduler issues a transmit command, it has no idea how many packets are in the queue or how big those packets are.  It's only after the descriptor fetch request has completed that the NIC knows for sure whether the queue was empty, and, after interpreting the descriptor, how big the packet is.  At this point, the scheduler can be informed of the status of the operation - whether a packet is being sent, and how big it is. 

These constraints come from the fact that the NIC attempts to keep as little per-queue state as possible so that it can handle a larger number of queues.  Obviously if packets and/or descriptors were prefetched in a buffer on the NIC, then the NIC could know a lot more about the packets, but there are all sorts of issues associated with doing that (head of line blocking, etc.), it requires a lot more resources on the FPGA, and it could seriously limit the number of hardware queues supported by the design. 

Storing the queue state in the queue management logic requires 128 bits per queue, so two URAM instances can store the queue state for 4096 queues.  For reference, all current designs targeting UltraScale+ parts have a default TX queue count of 8192 queues per interface, and I have synthesis-tested the design to 32768 TX queues per interface. 
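The sizing arithmetic, spelled out (an UltraScale+ URAM block is 4096 entries by 72 bits):

```python
import math

# Queue-state sizing arithmetic from the message above.
URAM_DEPTH_ENTRIES = 4096
URAM_WIDTH_BITS = 72
QUEUE_STATE_BITS = 128

# Two URAMs side by side form a 144-bit-wide, 4096-deep memory,
# which is enough for 128 bits of state per queue.
assert 2 * URAM_WIDTH_BITS >= QUEUE_STATE_BITS   # 144 >= 128

def urams_needed(num_queues, state_bits=QUEUE_STATE_BITS):
    """URAM blocks needed to hold the per-queue state."""
    rows = math.ceil(num_queues / URAM_DEPTH_ENTRIES)
    cols = math.ceil(state_bits / URAM_WIDTH_BITS)
    return rows * cols

assert urams_needed(4096) == 2    # the two-URAM case above
assert urams_needed(8192) == 4    # default TX queue count per interface
assert urams_needed(32768) == 16  # synthesis-tested maximum
```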

Something that I want to look at at some point is how to support an essentially unlimited number of hardware transmit queues.  Most likely this would take the form of a decently large number of hardware queues + some way to dynamically assign hardware queues to flows/connections/paths/etc. as they become active and reconfigure the schedulers appropriately. 

Incidentally, this multiple-ports-per-interface architectural feature means that corundum will likely never support LSO.  I don't think this is a major shortcoming though as software GSO + scatter/gather DMA should still be able to provide a reasonable level of performance. 

Alex Forencich
On 10/13/20 6:28 PM, Tao Wang wrote:
Thanks. That makes more sense to me now!

Tao Wang

May 10, 2021, 5:26:13 PM
to corundum-nic
One minor question. Please correct me if I am wrong about the terminology.

Are those 4096 queues the same as "physical" queues in the NIC? 

What makes it possible for Corundum to support thousands of queues, compared with traditional NICs that typically support only 64 or 128 queues?

Is this because of the compact queue state you mentioned earlier?

Alex Forencich

May 10, 2021, 5:39:15 PM
to corund...@googlegroups.com

Yes, 4096 hardware [rx/tx/completion/event] queues per two URAM instances, at least on UltraScale+ devices.  I can't really say why traditional NICs don't implement gobs and gobs of queues, aside from that it probably just wasn't a design constraint (although some NICs do support large numbers of queues; I know the QDMA IP core supports 2048, and Mellanox NICs can support many thousands of queue pairs for RDMA).  For Corundum, we wanted to be able to control the flow of data on a per-destination basis with a hardware scheduler, so supporting a very large number of hardware queues is an important design point.  And I think this will be important for any application that wants to use one hardware queue per connection or similar.

Ideally, I would like to find a way to support an arbitrarily large number of queues in the future, most likely based on a reasonably large number of hardware queues + some sort of dynamic allocation/remapping.  Currently, the main constraints are on the software side - the queue index field in the Linux kernel is only 16 bits, and there aren't particularly good methods for classifying traffic into different hardware queues at the moment. 

Alex Forencich

Tao Wang

Jun 23, 2021, 11:01:57 AM
to corundum-nic
Sorry to bring this up again.

I do not understand the queue scheduling (i.e., transmit scheduler) part. Please correct me if I am wrong.

The scheduler works on the queue states, which point to the descriptor queues residing in host memory.

But the elements in one descriptor queue do not necessarily come from the same flow, right?

So is it possible to do flow scheduling in Corundum?

Alex Forencich

Jun 23, 2021, 3:17:47 PM
to corund...@googlegroups.com

Yes, the NIC does not know anything about the packets in the queues until it decides to send them, and at that point it's committed to sending them.  So it's not possible to do flow scheduling completely in hardware.  Instead, what you have to do is assign flows to queues in software (perhaps in the qdisc layer).  Then, you have a 1:1 mapping between flows and queues, and the scheduler can control the flows separately by controlling the queues.  There is really no other reasonable way to implement this: sending all of the packets to the NIC unconditionally so that the NIC hardware can classify them suffers from increased latency, head-of-line blocking, storage limitations, and potentially memory bandwidth limitations, and it doesn't provide any method for the NIC to assert backpressure against applications on a per-flow basis. 
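A minimal sketch of the software side of that 1:1 flow-to-queue assignment (the allocation policy here is made up for illustration; it is not Corundum driver code):

```python
class FlowQueueMap:
    """Assign each flow its own hardware queue (1:1), so the hardware
    scheduler can control flows by controlling queues."""
    def __init__(self, num_queues):
        self.free = list(range(num_queues))
        self.flow_to_queue = {}

    def queue_for_flow(self, flow_tuple):
        # First packet of a new flow claims a free queue; later packets
        # of the same flow always go to the same queue.
        if flow_tuple not in self.flow_to_queue:
            if not self.free:
                raise RuntimeError("out of hardware queues")
            self.flow_to_queue[flow_tuple] = self.free.pop(0)
        return self.flow_to_queue[flow_tuple]

m = FlowQueueMap(num_queues=4)
flow_a = ("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
flow_b = ("10.0.0.1", "10.0.0.3", 40001, 80, "tcp")
assert m.queue_for_flow(flow_a) == 0
assert m.queue_for_flow(flow_b) == 1
assert m.queue_for_flow(flow_a) == 0   # stable 1:1 mapping
```

With a mapping like this in place, disabling or rate-limiting a queue in hardware effectively disables or rate-limits exactly one flow, and the NIC can assert per-flow backpressure by leaving descriptors unfetched.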

Alex Forencich

Tao Wang

Jul 6, 2021, 3:24:23 PM
to corundum-nic
Thanks for the explanation. But how does a Qdisc map to a descriptor queue in hardware?

Alex Forencich

Jul 6, 2021, 3:44:34 PM
to corund...@googlegroups.com

Tao Wang

Jul 6, 2021, 3:55:06 PM
to corundum-nic
Thanks. I was thinking about doing this at the driver level. Maybe implementing ndo_select_queue could also help here.

Alex Forencich

Jul 12, 2021, 11:25:45 PM
to corund...@googlegroups.com

I mean sure, you can do it at the driver level.  It's just harder to change the mapping if it's baked into the driver.  But perhaps there is some way to make that more flexible, for instance by exposing some sort of configuration options in sysfs. 

Alex Forencich