Sure, you could have one scheduler that simply sends transmit
commands to the individual ports if you want to go that route.
There is nothing inherent to the design that requires one
scheduler per port. For RotorNet, each port needs to run an
independent TDMA schedule, so it makes sense to have separate
schedulers for that application. But other applications may have
different requirements.
Nobody invokes the schedulers; they operate continuously and independently. The schedulers receive doorbell events from the queue-handling logic when the driver enqueues packets, and are expected to track the empty/not-empty state of each queue. The schedulers then generate transmit commands that are sent to the TX engine on the port. The TX engine then attempts to send a packet and reports back to the scheduler either that the packet was sent, or that the send failed because the queue was empty or disabled.
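For intuition, here is a toy software model of that handshake. All names here are hypothetical (the real schedulers are Verilog modules in the Corundum source), but it captures the doorbell → transmit command → status loop described above:

```python
from collections import deque

class RoundRobinScheduler:
    """Toy model of a Corundum-style port scheduler (illustrative only;
    the actual implementation is hardware, not software)."""

    def __init__(self):
        self.active = deque()   # queues believed to be non-empty, in RR order
        self.members = set()    # membership check, so duplicate doorbells are ignored

    def doorbell(self, queue):
        # Queue-handling logic signals that the driver enqueued a packet.
        if queue not in self.members:
            self.members.add(queue)
            self.active.append(queue)

    def next_command(self):
        # Pick the next queue round-robin and issue a transmit command for it.
        if not self.active:
            return None
        queue = self.active.popleft()
        self.members.remove(queue)
        return queue

    def status(self, queue, sent, still_nonempty):
        # TX engine reports back: packet sent, or queue empty/disabled.
        if sent and still_nonempty:
            self.doorbell(queue)   # keep the queue in the rotation
```

Tracking membership separately means repeated doorbells for the same queue don't create duplicate entries in the rotation, which matches the "track empty/not-empty state" role described above.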
The schedulers can operate independently of each other, or you
could theoretically have them communicate in some way. They only
impact each other in the sense that if you have two schedulers
drawing from the same queue and they are each attempting to send
at a specific rate, the queue will drain at a rate equal to the
sum of the scheduler target rates. So this may not necessarily be
the best configuration if you want to precisely rate-limit
specific queues.
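As a concrete example of that additive drain (the numbers here are made up, not from the Corundum design):

```python
# Two schedulers pulling from the same queue, each pacing
# to its own target rate (illustrative numbers):
scheduler_rates_gbps = [2.0, 3.0]

# The shared queue drains at the sum of the targets, so neither
# scheduler alone controls the per-queue rate any more.
drain_rate_gbps = sum(scheduler_rates_gbps)   # 5.0 Gbps
```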
Also, the current schedulers in Corundum are very simple
round-robin schedulers; you may want to replace them with
something more intelligent for your application. I'm considering
building a WFQ scheduler of some sort for Corundum at some point,
but that could be a ways down the road.
Alex Forencich
That makes sense.
But I am still confused about the Tx side (maybe I need to dig into more background on this).
Who invokes the Tx scheduler on each port? And since every port has full control over all the queues, why don't we just have a single Tx scheduler?
Or, say each Tx scheduler implements a specific policy: if one scheduler can schedule all the queues, will it impact the other schedulers?
Another couple of things I should probably mention:
First, the NIC has no idea how big a packet is until the
descriptor has been read, and once the descriptor has been read
the NIC is then committed to sending the packet (there is
currently no mechanism to re-enqueue descriptors from the NIC
end). So, when the scheduler issues a transmit command, it has no
idea how many packets are in the queue or how big those packets
are. It's only after the descriptor fetch request has completed
that the NIC knows for sure whether the queue was empty, and,
after interpreting the descriptor, how big the packet is. At this
point, the scheduler can be informed of the status of the
operation - whether a packet is being sent, and how big it is.
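A minimal sketch of that ordering, with hypothetical data structures — the point is just that the descriptor fetch both commits the NIC to sending and is the first moment the packet size is known:

```python
# Hypothetical descriptor rings: queue 3 has one 1500-byte packet
# queued, queue 7 is empty (illustrative model, not real driver state).
tx_queues = {3: [{"len": 1500}], 7: []}

def transmit(queue_id):
    """Model one transmit command issued blind by the scheduler."""
    ring = tx_queues.get(queue_id, [])
    if not ring:
        # The descriptor fetch is what discovers the queue was empty.
        return ("empty", 0)
    desc = ring.pop(0)          # descriptor read: NIC is now committed
    return ("sent", desc["len"])  # status (and size) fed back to the scheduler
```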
These constraints come from the fact that the NIC attempts to
keep as little per-queue state as possible so that it can handle a
larger number of queues. Obviously if packets and/or descriptors
were prefetched in a buffer on the NIC, then the NIC could know a
lot more about the packets, but there are all sorts of issues
associated with doing that (head-of-line blocking, etc.); it
requires a lot more resources on the FPGA, and it could seriously
limit the number of hardware queues supported by the design.
Storing the queue state in the queue management logic requires
128 bits per queue, so two URAM instances can store the queue
state for 4096 queues. For reference, all current designs
targeting UltraScale+ parts have a default TX queue count of 8192
queues per interface, and I have synthesis-tested the design to
32768 TX queues per interface.
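The arithmetic, assuming the standard UltraScale+ URAM geometry of 4096 words x 72 bits (288 Kb) per block:

```python
URAM_DEPTH = 4096    # words per URAM block on UltraScale+
URAM_WIDTH = 72      # bits per word
STATE_BITS = 128     # queue state per queue, per the text above

# Two URAMs side by side form a 4096 x 144-bit memory: one word of
# combined width per queue comfortably holds the 128-bit state.
combined_width = 2 * URAM_WIDTH
queues_per_uram_pair = URAM_DEPTH
assert combined_width >= STATE_BITS
```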
Something that I want to look at at some point is how to support
an essentially unlimited number of hardware transmit queues. Most
likely this would take the form of a decently large number of
hardware queues + some way to dynamically assign hardware queues
to flows/connections/paths/etc. as they become active and
reconfigure the schedulers appropriately.
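One way that dynamic assignment could look — a purely hypothetical sketch, not anything implemented in Corundum:

```python
class DynamicQueueAllocator:
    """Hypothetical sketch: map active flows onto a fixed pool of
    hardware queues, recycling queues as flows go idle."""

    def __init__(self, num_hw_queues):
        self.free = list(range(num_hw_queues))
        self.flow_to_queue = {}

    def queue_for_flow(self, flow_id):
        q = self.flow_to_queue.get(flow_id)
        if q is None:
            if not self.free:
                raise RuntimeError("no free hardware queues")
            q = self.free.pop()
            self.flow_to_queue[flow_id] = q
            # This is where the driver would also reconfigure the
            # scheduler for the newly assigned queue.
        return q

    def flow_idle(self, flow_id):
        # Return the queue to the pool once the flow goes inactive.
        q = self.flow_to_queue.pop(flow_id, None)
        if q is not None:
            self.free.append(q)
```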
Incidentally, this multiple-ports-per-interface architectural
feature means that Corundum will likely never support LSO. I
don't think this is a major shortcoming, though, as software GSO +
scatter/gather DMA should still be able to provide a reasonable
level of performance.
Alex Forencich
Thanks. That makes more sense to me now!
Yes, 4096 hardware [rx/tx/completion/event] queues per two URAM instances, at least on UltraScale+ devices. I can't really say why traditional NICs don't implement gobs and gobs of queues, aside from the fact that it probably just wasn't a design constraint (although some NICs do support large numbers of queues; I know the QDMA IP core supports 2048, and Mellanox NICs can support many thousands of queue pairs for RDMA). For Corundum, we wanted to be able to control the flow of data on a per-destination basis with a hardware scheduler, so supporting a very large number of hardware queues is an important design point. And I think this will be important for any application that wants to use one hardware queue per connection or similar.
Ideally, I would like to find a way to support an arbitrarily
large number of queues in the future, most likely based on a
reasonably large number of hardware queues + some sort of dynamic
allocation/remapping. Currently, the main constraints are on the
software side - the queue index field in the Linux kernel is only
16 bits, and there aren't particularly good methods for
classifying traffic into different hardware queues at the moment.
Alex Forencich
To view this discussion on the web visit https://groups.google.com/d/msgid/corundum-nic/43cacfb1-5d4d-45fe-a4e4-7795bd5c76b2n%40googlegroups.com.
Yes, the NIC does not know anything about the packets in the
queues until it decides to send them, and then at that point it's
committed to sending them. So it's not possible to do flow
scheduling completely in hardware. Instead, what you have to do
is assign flows to queues in software (perhaps in the qdisc
layer). Then, you have a 1:1 mapping between flows and queues,
and the scheduler can control the flows separately by controlling
the queues. There is really no other reasonable way to implement
this; sending all of the packets to the NIC unconditionally,
where the NIC hardware would classify them, suffers from increased
latency, head-of-line blocking, storage limitations, and
potentially memory bandwidth limitations, and it doesn't provide
any method for the NIC to assert backpressure against applications
on a per-flow basis.
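A sketch of the software side of that 1:1 mapping (hypothetical helper; in practice this would live in the qdisc layer or driver):

```python
# Illustrative only: assign each flow its own hardware queue (1:1),
# so the hardware scheduler can control a flow by controlling its queue.
NUM_TX_QUEUES = 8192        # per-interface default mentioned earlier

flow_to_queue = {}

def queue_for_flow(flow):
    """flow is any hashable key, e.g. a 5-tuple (hypothetical helper)."""
    if flow not in flow_to_queue:
        if len(flow_to_queue) >= NUM_TX_QUEUES:
            raise RuntimeError("out of hardware queues")
        flow_to_queue[flow] = len(flow_to_queue)   # next free queue index
    return flow_to_queue[flow]
```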
Alex Forencich
https://man7.org/linux/man-pages/man8/tc-skbedit.8.html
Alex Forencich
I mean sure, you can do it at the driver level. It's just harder
to change the mapping if it's baked into the driver. But perhaps
there is some way to make that more flexible, for instance by
exposing some sort of configuration options in sysfs.
Alex Forencich