
netmap design question - accessing netmap:X-n individual queues on FreeBSD


Eduardo Meyer

Jan 20, 2016, 10:09:12 AM
Hello all,

I have some questions about netmap's design regarding direct queue usage.

If I open netmap:ix0, am I opening all queues 0-7? Are those queues FIFO among
themselves? I mean, will the first packets be available on netmap:ix0-0 and,
if that queue fills up, will the next packets go to netmap:ix0-1? And via
netmap:ix0 do I get all queues from 0 to 7? Is this understanding correct?

Or something else happens, like, only 1 queue is used if I open netmap:ix0?

A related question: if I open netmap:ix0-0 and nothing else, is it supposed
to work?

In my tests I can see that "pkt-gen -f tx -i ix0-0" will work, but "pkt-gen
-f rx -i ix0-0" will not. I can transmit but can't receive on a given queue.
Why is that? And how could I make something like this work (is a code change
required?):

bridge -i netmap:ix0-0 -i netmap:ix1-0

Or should it already work?

Pavel Odintsov, the author of fastnetmon, mentioned he can run this on Linux
and it works:

"kipfw netmap:eth0-n netmap:eth1-n"

But neither this nor the above bridge example works on FreeBSD. Is FreeBSD
any different here? (I did not try on Linux.)

I also noticed performance differences I would like to understand. If I run:

pkt-gen -i ix0 -f tx -s 192.168.0.2 -d 192.168.0.1

I get 14.8Mpps (line rate).

If I run:

pkt-gen -i ix0-1 -f tx -s 192.168.0.2 -d 192.168.0.1

I get only 11Mpps. In fact I get 11Mpps if I run on ix0-2, ix0-3, ...
ix0-7.

I can understand that running all queues gives better pps rates than a
single queue, sure. However, if I run:

pkt-gen -i ix0-0 -f tx -s 192.168.0.2 -d 192.168.0.1

I also get 14.8Mpps. So yeah, ix0 and ix0-0 both give me 14.8Mpps, while any
other queue gives me 11Mpps. How should I understand this?

I am asking this because I want to hack into the bridge code (for learning)
to make it multithreaded, with one thread opening each of the 8 available
queues, pinned to each of my 8 CPUs.

However, whether I open ix0-1 and ix1-1 in the source code or pass them as
parameters to the bridge application, I can't get it working.

I also tried "pkt-gen -p 8 -c 8", and although the debug output shows 8
threads on 8 CPUs with 8 queues, I still see only 1 thread working; all the
other threads go idle. I expected to see at least 2 threads working, say for
ix0-0 and ix0-1, to reach line rate.

Thank you in advance.

--
===========
Eduardo Meyer
_______________________________________________
freeb...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net...@freebsd.org"

Pavel Odintsov

Jan 20, 2016, 1:46:57 PM
Hi

Yes, this approach works really well on Linux. But I have never tried to
do the same on FreeBSD.

I'm using a similar approach in fastnetmon and read data from the network
card in X threads, where each thread is assigned to a physical queue. So for
Linux you should use my custom drivers (based on Intel's drivers from
SourceForge, with netmap patches) because the vanilla drivers don't support
multi-queue mode.


--
Sincerely yours, Pavel Odintsov

Adrian Chadd

Jan 20, 2016, 3:35:03 PM
Ok, so, I mostly did this already:

https://github.com/erikarn/netmap-tools/

It has a multi-threaded, multi-queue bridge + ipv4 decap for testing.



-a

Pavel Odintsov

Jan 23, 2016, 4:31:34 AM
Hi!

Great job! Do you have performance estimations?

On Wednesday, 20 January 2016, Adrian Chadd <adrian...@gmail.com> wrote:

> Ok, so, I mostly did this already:
>
> https://github.com/erikarn/netmap-tools/
>
> it has a multi-threaded, multi-queue bridge + ipv4 decap for testing.
>
>
>
> -a
>


--
Sincerely yours, Pavel Odintsov

Adrian Chadd

Jan 23, 2016, 10:31:38 AM
For random src/dst ports and IPs and on the chelsio t5 40gig hardware,
I was getting what, uhm, 40mil tx pps and around 25ish mil rx pps?

The chelsio rx path really wants to be coalescing rx buffers, which
the netmap API currently doesn't support. I've no idea if luigi has
plans to add that. So, it has the hilarious side effect of "adding
more RX queues" translates to "drops in RX performance." :(

Thanks,

-a

Marcus Cenzatti

Jan 23, 2016, 2:17:28 PM


On 1/23/2016 at 1:31 PM, "Adrian Chadd" <adrian...@gmail.com> wrote:
>
>For random src/dst ports and IPs and on the chelsio t5 40gig
>hardware,
>I was getting what, uhm, 40mil tx pps and around 25ish mil rx pps?
>
>The chelsio rx path really wants to be coalescing rx buffers, which
>the netmap API currently doesn't support. I've no idea if luigi has
>plans to add that. So, it has the hilarious side effect of "adding
>more RX queues" translates to "drops in RX performance." :(
>
>Thanks,

hello,

I am sorry, are you saying intel and chelsio distribute the RX packet load differently? If I am not mistaken, intel will distribute traffic among queues based on ip addresses/flows/hashes/whatever; does chelsio do it per packet or something else?

How could this behavior you noticed affect a single queue (or applications on netmap:if0 rather than netmap:if0-n) on chelsio?

thanks

Adrian Chadd

Jan 23, 2016, 2:27:57 PM
ok, so it's .. a little more complicated than that.

The chelsio hardware (thanks jim!) and intel hardware (thanks
sean/limelight!) do support various kinds of traffic hashing into
different queues. The common subset of behaviour is the microsoft RSS
requirement spec. You can hash on v4, v6 headers as well as
v4+(tcp,udp) ports and v6+(tcp,udp) ports. It depends on the traffic
and the RSS config.

Now, each NIC and driver has different defaults. This meant that yes,
each NIC and driver would distribute traffic differently, leading to
some pretty amusing differences in workload behaviour that were purely
due to how the driver defaults worked.

This is one of the motivations behind me pushing the freebsd rss stuff
along - I wanted the RX queue config code and the RSS hashing config
code to be explicitly done in the driver to match what the system
configuration is so that we /have/ that code. Otherwise it's arbitrary
- some drivers hash on just L3 headers, some hash on L3/L4 headers,
some hash fragments, some may not; some drivers populate the RSS hash
id in the mbuf flowid (cxgbe); some just put the queue id in there
(intel drivers.)

So when you use "RSS" in -HEAD, the NICs that support it hopefully
obey the config in sys/net/rss_config.c - which is to hash on TCP
ports for v4/v6 traffic, and just the L3 v4/v6 headers for everything
else. The NICs support hashing on UDP, but the challenge is that
fragments hash differently from non-fragments, so you have to
reconstitute the fragments back into a normal packet and rehash it
in software. Now, for TCP, fragments are infrequent, but for UDP
they're quite frequent in some workloads. So I defaulted UDP RSS off
and just let UDP be hashed on the L3 addresses.

This means that if you use RSS enabled, and you're using a supported
NIC (igb, ixgbe, cxgbe, ixl is currently broken, and I think the
mellanox driver does RSS now?) then it'll distribute like this:

* TCP traffic: will hash based on L3 (src,dst) and L4 (port);
* UDP traffic, will hash on L3 (src, dst) only;
* other (eg ICMP, GRE, etc) - hash on L3 (src, dst) only;
* non-IP traffic is currently hashed however the NIC does it. This
isn't currently expressed via RSS.

If you use RSS on a non-supported NIC, then it'll re-hash the packet
in software and distribute work to multiple netisr queues as
appropriate.

Now, my eventual (!) plan is to expose the RSS queue/key/hash config
per NIC rather than global, and then users can select what they want
for things like netmap, and let the kernel TCP/UDP stack code rehash
things as appropriate. But it's all spare time project for me at the
moment, and right now I'm debugging the ixl RSS code and preparing
some UDP stack changes to cope with uhm, "behaviour."

Ok, so with all of that - if you don't use RSS in HEAD, then the
traffic distribution will depend purely upon the driver defaults.
intel and chelsio should be hashing on TCP/UDP headers for those
packets, and just straight L3 only for others. There's no shared RSS
key - sometimes it's random (older intel drivers), sometimes it's
fixed (cxgbe) so you can end up with reboot-to-reboot variations with
intel drivers. If you're doing straight iperf testing, with one or a
small number of connections, it's quite likely you'll end up with only
a small subset of the RX queues being used. If you want to see the NIC
really distribute things, you should use pkt-gen with the random
port/IP options that I added in -HEAD about a year ago now.

Next - how well does it scale across multiple queues. So I noticed on
the cxgbe hardware that adding more RX queues actually caused the
aggregate RX throughput to drop by a few million pps. After Navdeep/I
poked at it, we and the chelsio hardware people concluded that because
netmap is doing one-buffer-is-one-packet on receive, the RX DMA engine
may be not keeping up feeding it descriptors and we really should be
supporting RX buffer batching. It doesn't show up at 10g line rate,
only when you're trying to hit the NIC theoretical max on 40g. Yeah,
we hit the NIC theoretical max on 40g with one RX queue, but then we
couldn't do any work - that's why I was trying to farm it out to
multiple queues via hardware.

Finally - some drivers had some very .. silly defaults for interrupt
handling. The chelsio driver was generating a lot of notifications in
netmap mode, which navdeep heavily tuned and fixed whilst we were
digging into 40g behaviour. The intel interrupt moderation code
assumes you're not using netmap and the netmap code doesn't increment
the right counters, so AIM is just plainly broken and you end up with
crazy high varying interrupt rates which really slow things down.

(there's also NUMA related things at high pps rates, but I'm going to
ignore this for now.)

I hope this helps.



-adrian

Adrian Chadd

Jan 23, 2016, 2:38:28 PM
Oh and one other thing - on the cxgbe hardware, the netmap interfaces
(ncxl) have a different MAC. Things like broadcast traffic are
duplicated to cxlX AND ncxlX. So, if you're only using netmap and
you're testing promisc/bridging, you should bring /down/ the cxlX
interface and leave ncxlX up - otherwise yeah, the cxgbe MAC will
duplicate packets in hardware, which halves the RX bandwidth available
on the NIC /and/ chews CPU in FreeBSD as the normal ethernet input
path drops all of those packets.

(I think the same thing holds with the virtual devices via SR-IOV on
ixgbe hardware, btw..)


-a

Luigi Rizzo

Jan 23, 2016, 2:49:49 PM
On Sat, Jan 23, 2016 at 10:43 AM, Marcus Cenzatti <cenz...@hush.com> wrote:
>
>
> On 1/23/2016 at 1:31 PM, "Adrian Chadd" <adrian...@gmail.com> wrote:
>>
>>For random src/dst ports and IPs and on the chelsio t5 40gig
>>hardware,
>>I was getting what, uhm, 40mil tx pps and around 25ish mil rx pps?
>>
>>The chelsio rx path really wants to be coalescing rx buffers, which
>>the netmap API currently doesn't support. I've no idea if luigi has
>>plans to add that. So, it has the hilarious side effect of "adding
>>more RX queues" translates to "drops in RX performance." :(
>>
>>Thanks,
>
> hello,
>
> I am sorry, are you saying intel and chelsio distribute RX packet load differently? If I am not mistaken intel will distributed traffic among queues based on ip addresses flow/hashes/whatever, does chelsio make it per packet or somethig other?
>

I think there are several orthogonal issues here:
- traffic distribution has been discussed by Adrian
so please look at the email he just sent;

- when you use netmap on a single queue, i.e. netmap:ix0-X,
the software side is as efficient as it can be, as it only needs
to check the status of a single queue on poll() or ioctl(..RXSYNC..).
On the contrary, when you access netmap:if0 (i.e. all
queues on a single file descriptor) every system call
has to check all the queues, so you are better off with
a smaller number of queues.

- on the hardware side, distributing traffic to multiple RX queues
also has a cost that increases with the number of queues, as the
NIC needs to update the ring pointers and fetch buffers for
multiple queues, and you can easily run out of PCIe bandwidth for
these transactions. This affects all NICs.
Some (ix ?) have parameters to configure how often to update the rings
and fetch descriptors, mitigating the problem. Some (ixl) don't.

My opinion is that you should use multiple queues only if you want
to rely on hw-based traffic steering, and/or your workload is
bottlenecked by the CPU rather than bus I/O bandwidth. Even so,
use as few queues as possible.

Sometimes people use multiple queues to increase the number of
receive buffers and tolerate more latency in the software side, but
this really depends on the traffic distribution, so in the worst case
you are still dealing with a single ring.

Often you are better off using a single hw queue and have a
process read from it using netmap and demultiplex to different
netmap pipes (zero copy). That reduces bus transactions.

Another option, which I am experimenting with these days, is to forget about
individual packets once you are off the wire, and connect the various
processes in your pipeline with a stream (TCP or similar) where packets
and descriptors sit back to back. CPUs and OSes are very efficient at
dealing with streams of data.

cheers
luigi




> how does this behavior you noticed could affect single queue (or applications on netmap:if0 and not netmap:if0-n) on chelsio?
>
> thanks
>



--
-----------------------------------------+-------------------------------
Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione
http://www.iet.unipi.it/~luigi/ . Universita` di Pisa
TEL +39-050-2217533 . via Diotisalvi 2
Mobile +39-338-6809875 . 56122 PISA (Italy)
-----------------------------------------+-------------------------------

Eduardo Meyer

Jan 23, 2016, 3:22:32 PM
Thanks for the explanation.

What I was trying to achieve is more performance, using more than one CPU to
actually bridge at line rate, since I have several cores (16) but a low
CPU clock, and scaling horizontally is easier and cheaper than getting
faster CPUs. I thought that using 2 queues with two bridge instances, or two
threads (adrian's bridge), running on one of the other 15 idle cores,
would let me grow from 9Mpps to 14Mpps. It looks like it's not
that simple.

If I were writing a multithreaded netmap application to increase pps
rates, is there any other/better strategy than using multiple queues?
Should I distribute the load within the application, reading and writing to
one single queue, or is there something better? I mean, a multithreaded
bridge that checks how many packets are in the queue and distributes a
constant number of packets to each thread: is that possible/efficient? If I
don't have a constant number of packets, I should be able to see TAIL via
netmap, right?

Thank you all for all the details on how things actually work.



--
===========
Eduardo Meyer
pessoal: dudu....@gmail.com
profissional: ddm.fa...@saude.gov.br