netmap over virtio giving packets with extra 12 bytes

Avinash Sridharan

Jan 5, 2015, 12:54:58 AM

I am using netmap with the Click modular router, running Click in user
space. A while back I used this combination with the e1000 device driver
and a slightly older netmap code base.

Recently I updated my netmap code base and am trying to use the Click
modular router with netmap over a virtio-net device driver (under KVM).
With this combination I was able to receive packets, but I could not
interpret any packet coming from the FromDevice element.

To debug this issue (and to rule out any changes I made to the Click
modular router), I ran the pkt-gen application with the "dump payload"
option:

sudo ~/pkt-gen -i eth1 -f rx -X

This showed that packets are being received correctly from the
netmap-enabled interface, but there are 12 extra bytes prepended to each
packet.

381.088570 main_thread [1446] 1 pps (1 pkts in 1001088 usec)
ring 0x7f133bca6000 cur 1 [buf 516 flags 0x0000 len 72]
 0: 00 00 00 00 00 00 00 00 00 00 01 00 01 80 c2 00 ................  << extra 12 bytes
16: 00 00 40 16 7e 5b 50 f0 00 26 42 42 03 00 00 00 ..@.~[P..&BB....
32: 00 00 80 00 40 16 7e 5b 50 f0 00 00 00 00 80 00 ....@.~[P.......
48: 40 16 7e 5b 50 f0 80 01 00 00 14 00 02 00 00 00 @.~[P...........
64: 00 00 00 00 bc 9b f6 74

As we can see, the above is an STP BPDU, and there are 12 leading bytes in
the payload.

The extra leading bytes screw up the packet interpretation.

So is this an artifact of the virtio-net driver, or has something changed
in the netmap device driver?

Thanks,

Avinash
_______________________________________________
freeb...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net...@freebsd.org"

Luigi Rizzo

Jan 5, 2015, 7:19:59 AM

What you see is a virtio issue.

virtio prepends a 10 or 12-byte "virtio header"
to all packets, which is used to define what sort
of NIC accelerations (checksum, tso etc.) are
expected on the link.

I do not remember if there is a way in qemu-kvm to
remove the header. Maybe Vincenzo (in Cc) remembers.

cheers
luigi

--
-----------------------------------------+-------------------------------
Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione
http://www.iet.unipi.it/~luigi/ . Universita` di Pisa
TEL +39-050-2211611 . via Diotisalvi 2
Mobile +39-338-6809875 . 56122 PISA (Italy)
-----------------------------------------+-------------------------------

Adrian Chadd

Jan 5, 2015, 11:33:27 AM

... surely virtio should be skipping over those bytes in the netmap rx
side before handing them up?

(It won't be the only hardware that puts the RX descriptor status in
the RX frame itself..)

-adrian

Luigi Rizzo

Jan 5, 2015, 12:01:01 PM

On Mon, Jan 05, 2015 at 08:33:17AM -0800, Adrian Chadd wrote:
> ... surely virtio should be skipping over those bytes in the netmap rx
> side before handing them up?
>
> (It won't be the only hardware that puts the RX descriptor status in
> the RX frame itself..)

it is not the rx descriptor, those 12 bytes are also present on the tx side.
Think of them as an encapsulation whose presence is negotiated when KVM connects
to the TAP port, and after that is present on all packets bidirectionally.

Now, surely we could add/remove those bytes in the virtio-netmap code
(at the price of an additional data copy).

I need to investigate further.
Avinash, could you tell us your exact configuration: what is the
network backend for QEMU/KVM, and are you using virtio in
native or emulated netmap mode?

cheers
luigi

Avinash Sridharan

Jan 5, 2015, 12:24:53 PM

I used virsh to start the VM over qemu-kvm. Here is a dump of the network
XML that was fed to libvirt, while creating the domain:

<interface type='bridge'>
  <mac address='52:54:00:0f:8f:af'/>
  <source bridge='br0'/>
  <target dev='vnet3'/>
  <model type='virtio'/>
  <alias name='net0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
<interface type='bridge'>
  <mac address='52:54:00:0f:8f:bf'/>
  <source bridge='br0'/>
  <target dev='vnet4'/>
  <model type='virtio'/>
  <alias name='net1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</interface>

The network backend is a bridge interface on a Gentoo host. The bridge
has a single port, which is a Realtek r8169 NIC.

I didn't quite understand the question about using "virtio in native or
emulated netmap mode".

While running the above experiment I just replaced the virtio-net driver
with the netmap-enabled virtio-net driver.

asridharan@stitch-dp-1 ~/stitch-click/kernel $ lsmod
Module                  Size  Used by
virtio_net             24612  0
netmap                 95166  1 virtio_net
mii                     3875  0

asridharan@stitch-dp-1 ~/stitch-click/kernel $ modinfo ./virtio_net.ko
filename: /home/asridharan/stitch-click/kernel/./virtio_net.ko
license: GPL
description: Virtio network driver
alias: virtio:d00000001v*
depends: netmap
vermagic: 3.14.14-gentoo SMP mod_unload
parm: napi_weight:int
parm: csum:bool
parm: gso:bool

asridharan@stitch-dp-1 ~/stitch-click/kernel $ modinfo ./netmap.ko
filename: /home/asridharan/stitch-click/kernel/./netmap.ko
license: Dual BSD/GPL
description: The netmap packet I/O framework
author: http://info.iet.unipi.it/~luigi/netmap/
depends:
vermagic: 3.14.14-gentoo SMP mod_unload
parm: verbose:int
parm: no_timestamp:int
parm: mitigate:int
parm: no_pendintr:int
parm: txsync_retry:int
parm: adaptive_io:int
parm: flags:int
parm: fwd:int
parm: mmap_unreg:int
parm: admode:int
parm: generic_mit:int
parm: generic_ringsize:int
parm: generic_rings:int
parm: default_pipes:int
parm: bridge_batch:int
parm: if_size:int
parm: if_curr_size:int
parm: if_num:int
parm: if_curr_num:int
parm: priv_if_size:int
parm: priv_if_num:int
parm: ring_size:int
parm: ring_curr_size:int
parm: ring_num:int
parm: ring_curr_num:int
parm: priv_ring_size:int
parm: priv_ring_num:int
parm: buf_size:int
parm: buf_curr_size:int
parm: buf_num:int
parm: buf_curr_num:int
parm: priv_buf_size:int
parm: priv_buf_num:int

Please ping me if I haven't given enough data.

Thanks,

Avinash

Adrian Chadd

Jan 5, 2015, 3:39:07 PM

On 5 January 2015 at 09:05, Luigi Rizzo <ri...@iet.unipi.it> wrote:
> On Mon, Jan 05, 2015 at 08:33:17AM -0800, Adrian Chadd wrote:
>> ... surely virtio should be skipping over those bytes in the netmap rx
>> side before handing them up?
>>
>> (It won't be the only hardware that puts the RX descriptor status in
>> the RX frame itself..)
>
> it is not the rx descriptor, those 12 bytes are also present on the tx side.
> Think of them as an encapsulation whose presence is negotiated when KVM connects
> to the TAP port, and after that is present on all packets bidirectionally.
>
> Now, surely we could add/remove those bytes in the virtio-netmap code
> (at the price of an additional data copy).

Right, but I have similar issues with other wifi devices that put RX
descriptor completion info in the header of the mbuf you pass to the
NIC.

So in the driver RX path, I have to advance the data pointer /past/ the header.

So, maybe for netmap you need to consider adding something that lets
you specify how many bytes into the RX buffer the payload starts.

-adrian

Vincenzo Maffione

Jan 6, 2015, 11:18:07 AM

Hello,

From what I can guess you are dealing with a QEMU-KVM guest that
uses virtio-net device(s) and runs netmap over that device(s).
Then, you connect the guest to the host (gentoo) network stack using a
standard linux bridge: a TAP device is used by QEMU to forward guest
traffic from/to the host network stack.

Is that correct?

Following Luigi's explanation, the virtio-net header is part of the
virtio standard, and its purpose is to carry offloading info
(checksum, TSO) between the guest and host kernels. For instance, your
guest kernel can offload the TCP checksum to the virtio-net device,
which in turn uses the virtio-net header (this requires TAP driver
support) to defer the checksum to the host kernel. If packets then
reach a physical NIC that supports checksum offloading (e.g. an
r8169 NIC attached to the same bridge as the TAP device),
you have effectively offloaded the checksum computation from the guest
kernel straight to the physical NIC in the physical host.

If you see the virtio-net header with "pkt-gen -f rx", it means that
you are using netmap in "native" mode, that is you use the specific
virtio netmap adapter to send/receive packets from the (virtual) NIC.
If you used netmap over virtio-net in "emulated" mode you wouldn't see
the virtio-net header, because netmap would be using the standard
driver (slow) datapath under the hood: In the rx datapath, the driver
converts the virtio-net header into skbuffs/mbufs metadata, so you
don't see it.

I don't remember having tried to make QEMU use a TAP with no
virtio-net-header extension, but I see that it is possible to disable
it by invoking qemu from the command line:

$ x86_64-softmmu/qemu-system-x86_64 --help | grep tap

-net tap[,vlan=n][,name=str][,fd=h][,fds=x:y:...:z][,ifname=name][,script=file][,downscript=dfile][,helper=helper][,sndbuf=nbytes][,vnet_hdr=on|off][,vhost=on|off][,vhostfd=h][,vhostfds=x:y:...:z][,vhostforce=on|off][,queues=n]
use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag
-netdev [user|tap|bridge|vde|netmap|vhost-user|socket|hubport],id=str[,option][,option][,...]

where you see that you can specify "vnet_hdr=off" when declaring the
qemu "backend" associated to the virtio-net guest device.
Never tried, but it should work. In the worst case you can recompile
the tap driver without IFF_VNET_HDR extension, so that QEMU does not
find it.

Cheers,
Vincenzo

--
Vincenzo Maffione

Avinash Sridharan

Jan 6, 2015, 11:31:19 AM

Hi Vincenzo,
Thanks for the explanation. From your explanation it seems that netmap
in "native" mode over virtio-net should give some indication of how
many extra bytes have been added by the virtio-net driver (or, for that
matter, by any other driver that provides this type of rx-descriptor).
Otherwise, the application has to store knowledge about the specifics
of the underlying devices, which doesn't seem that clean. (I think Adrian
was referring to the same issue.)

That said, how do we handle TX in this case? Since the underlying driver
(netmap + virtio-net) expects an extra 12 bytes of header that the
application should know when to add. Or is this optional?


Avinash Sridharan

Jan 6, 2015, 11:33:29 AM

By the way, Vincenzo, your assumptions about my system setup:

"From what I can guess you are dealing with a QEMU-KVM guest that
uses virtio-net device(s) and runs netmap over that device(s).
Then, you connect the guest to the host (gentoo) network stack using a
standard linux bridge: a TAP device is used by QEMU to forward guest
traffic from/to the host network stack."

are correct.


Vincenzo Maffione

Jan 6, 2015, 11:57:33 AM

2015-01-06 17:31 GMT+01:00 Avinash Sridharan <avinash....@gmail.com>:
> Hi Vincenzo,
> Thanks for the explanation. From your explanation it seems like the netmap
> in "native" mode over virtio-net should be giving some indication of how
> many extra bytes have been added by the virtio-net driver (or for that
> matter any other driver that provides this type of rx-descriptor).
> Otherwise, the application will have to store knowledge about the specifics
> of the underlying devices which doesn't seem that clean. (I think Adrian was
> referring to the same issue)

I understand the problem (this was left as an open problem), and it's
not clear to me what the best solution would be.

On one hand, one could modify the native virtio-net adapter so that it
discards the extra header. This can be done with a copy - as Luigi
suggests - or with some trick involving scatter-gather virtio support
- e.g. trying to make the virtio-net headers go in some other buffers
parallel to the "official" netmap buffers (i.e. the ones your
application reads from).

On the other hand, the virtio-net header carries some information (mainly
TCP/UDP related) that you may not want to discard, even when using
netmap, and this depends on the specific application.

So to avoid making a one-size-fits-all decision, I thought it was better to
leave it to the application.

>
> That said, how do we handle TX in this case? Since the underlying driver
> (netmap + virtio-net) expects an extra 12 bytes of header that the
> application should know when to add. Or is this optional?

Yes you have to add it, otherwise it won't work! It's not optional. In
order to make tests for virtio-net tx, I added a "-H" option to
pkt-gen, which pushes an empty virtio-net-header before the ethernet
frame. You can use "-H 12" for your tests.

Again, it's not clear (at least to me) how we should manage this
virtio-net peculiarity at the netmap API level.

Cheers,
Vincenzo

Luigi Rizzo

Jan 6, 2015, 3:05:56 PM

On Tue, Jan 06, 2015 at 10:15:02AM -0800, Adrian Chadd wrote:
...

> This won't be the first time that there'll be useful data at the front
> end of an RX mbuf that isn't related to the mbuf payload.
>
> It'd be nice if there were something in each rx ring slot saying how
> far to skip into the buffer to get the beginning of the packet.

I am not opposed in principle; this is something we have been looking
at since day one. The blocking issue is that incompatible hw constraints
make it hard to make a decent choice.

Examples:
1. the rx buffer size you tell to ixgbe must be a power of two.
   If you want to write at some offset into the netmap buffer,
   you need to allocate one twice the size you pass to the driver.
2. some NICs may want buffers aligned to 4, 8, 16 bytes, so input
   offsets for headers cannot be arbitrary (12 is almost as bad as 14!)
3. irrespective of functionality, performance drops badly with small
   packets (where it matters the most) when buffers are not aligned
   to 64 byte boundaries.

The above makes me think that for small packets, copying is the
only reasonable way to go, and for large packets i have no idea
how to deal with #1 and #2 without having to do scatter-gather.

If you have a good suggestion please speak up.

cheers
luigi

Adrian Chadd

Jan 6, 2015, 3:06:19 PM

This won't be the first time that there'll be useful data at the front
end of an RX mbuf that isn't related to the mbuf payload.

It'd be nice if there were something in each rx ring slot saying how
far to skip into the buffer to get the beginning of the packet.

-adrian