
Packet loss over loopback


tric...@accusoft.com

Aug 29, 2019, 5:25:25 AM
Hi folks,

I'm observing packet loss on UDP transmissions here, even when they are routed over the loopback device. Yes, of course, it is UDP, so it is understood that packets can be dropped. In this specific case, however, the receiver is just a dummy that does nothing more than call "recvfrom" in a tight loop and check the packet sequence number of the protocol. The sender is nothing but a tiny program that sends out packets in a loop, calling usleep every N packets to limit the bandwidth.
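
To give an idea, the dummy receiver boils down to something like this sketch (port number and sequence-number layout are simplified placeholders here, not the real protocol):

/* Minimal sketch of the dummy receiver: recvfrom() in a loop, checking
   a 32-bit sequence number assumed to sit at the start of each datagram.
   Port and header layout are placeholders, not the real protocol. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(5000);              /* placeholder port */
    bind(fd, (struct sockaddr *)&addr, sizeof addr);

    uint32_t expected = 0;
    unsigned long lost = 0;
    char buf[2048];

    for (;;) {
        ssize_t n = recvfrom(fd, buf, sizeof buf, 0, NULL, NULL);
        if (n < (ssize_t)sizeof(uint32_t))
            continue;
        uint32_t seq;
        memcpy(&seq, buf, sizeof seq);
        seq = ntohl(seq);
        if (seq != expected) {
            lost += seq - expected;           /* gap in sequence numbers */
            fprintf(stderr, "expected %u, got %u, lost so far: %lu\n",
                    expected, seq, lost);
        }
        expected = seq + 1;
    }
}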

Nevertheless, the receiver tells me that packets are missing. Naturally, the problem gets worse as the load on the system increases, i.e. if I start a couple of dummy programs that just occupy the CPU. And this is a quite capable server...

Unfortunately, TCP is not negotiable, this needs to be a UDP connection.

I already reconfigured the kernel buffer sizes in /etc/sysctl.conf, but without much success. The problem remains.
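
To be concrete about what I mean by buffer sizes: per socket, SO_RCVBUF requests are capped by net.core.rmem_max, and the kernel reports back what it actually granted. A quick check along these lines (the value is only an example):

/* What the kernel actually grants for the per-socket RX buffer:
   SO_RCVBUF requests are capped by net.core.rmem_max, so raising
   only the sysctl (or only the socket option) is not enough on its
   own. The 8 MiB value below is just an example. */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    int rcvbuf = 8 * 1024 * 1024;             /* example value */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof rcvbuf) < 0)
        perror("setsockopt(SO_RCVBUF)");

    /* The kernel doubles the requested value for bookkeeping and
       clamps it to net.core.rmem_max; read back the effective size. */
    socklen_t len = sizeof rcvbuf;
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    printf("effective SO_RCVBUF: %d bytes\n", rcvbuf);
    return 0;
}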

Is there any other interface that would allow me to receive packets more efficiently, directly from the kernel buffer? Windows has this "RIO" (registered I/O) interface, which helped a lot - essentially, the application sets buffers aside and lets the kernel fill them, instead of requiring polling through recvfrom().

Is there something similar on the Linux side that would minimize packet loss?
Is there some other setting I should try?

Jorgen Grahn

Aug 29, 2019, 8:26:37 AM
On Thu, 2019-08-29, tric...@accusoft.com wrote:
> Hi folks,
>
> I'm observing packet loss on UDP transmissions here, even when they
> are routed over the loopback device. Yes, of course, it is UDP, so
> it is understood that packets can be dropped. In this specific case,
> however, the receiver is just a dummy that does nothing more than
> call "recvfrom" in a tight loop and check the packet sequence number
> of the protocol. The sender is nothing but a tiny program that sends
> out packets in a loop, calling usleep every N packets to limit the
> bandwidth.
>
> Nevertheless, the receiver tells me that packets are
> missing. Naturally, the problem gets worse as the load on the
> system increases, i.e. if I start a couple of dummy programs that
> just occupy the CPU. And this is a quite capable server...
>
> Unfortunately, TCP is not negotiable, this needs to be a UDP connection.

It seems to me this has to be about lack of flow control: the sender
is faster than the receiver, it doesn't care about the receiver, and
there's (with UDP) a fixed RX buffer. Staying on the loopback doesn't
change that.

You could change the protocol over UDP to include a flow control
mechanism, but then you might as well switch to TCP.

> I already reconfigured the kernel buffer sizes in /etc/sysctl.conf,
> but without much success. The problem remains.

Those buffers tend to be huge even in the default case, so it's mildly
surprising that they fill up, if your receiver is so fast. Note that
'netstat -uan' will show you how full the RX buffer is at any given
time.
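
There is also a Linux-specific socket option, SO_RXQ_OVFL, which makes
the kernel attach the socket's drop counter to each received datagram
as ancillary data. Roughly like this (untested; assumes fd is your
already-bound UDP socket):

/* SO_RXQ_OVFL (Linux): the kernel attaches a cmsg carrying a running
   count of datagrams dropped on this socket because its RX buffer
   was full. Rough sketch, error handling omitted. */
int on = 1;
setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &on, sizeof on);

char buf[2048];
char ctrl[CMSG_SPACE(sizeof(uint32_t))];
struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                      .msg_control = ctrl, .msg_controllen = sizeof ctrl };

ssize_t n = recvmsg(fd, &msg, 0);   /* payload lands in buf, length n */
for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
     c = CMSG_NXTHDR(&msg, c)) {
    if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SO_RXQ_OVFL) {
        uint32_t dropped;
        memcpy(&dropped, CMSG_DATA(c), sizeof dropped);
        fprintf(stderr, "%u datagrams dropped so far\n", dropped);
    }
}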

> Is there any other interface that would allow me to receive packets
> more efficiently, directly from the kernel buffer? Windows has
> this "RIO" (registered I/O) interface, which helped a lot -
> essentially, the application sets buffers aside and lets the kernel
> fill them, instead of requiring polling through recvfrom().

Don't know. Linux has recvmmsg(2) which lets you consume N UDP
messages in one call, instead of being awoken by poll/select N times
and calling read() N times.
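
Something along these lines, if I remember the interface correctly
(untested, error handling omitted):

/* recvmmsg(): drain up to BATCH datagrams with a single system call.
   MSG_WAITFORONE blocks for the first datagram and then returns
   whatever else is already queued without blocking again. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>

#define BATCH 32

static char bufs[BATCH][2048];
static struct iovec iovs[BATCH];
static struct mmsghdr msgs[BATCH];

static int drain(int fd)
{
    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len  = sizeof bufs[i];
        memset(&msgs[i].msg_hdr, 0, sizeof msgs[i].msg_hdr);
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    int n = recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
    for (int i = 0; i < n; i++) {
        /* datagram i is in bufs[i], length msgs[i].msg_len */
    }
    return n;
}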

Also, if you only need to read a sequence number at the start of the
message, you don't have to read all of it ... although I don't know
if that would help performance much.
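
E.g. something like this, assuming a 32-bit sequence number in network
byte order at the very start of the datagram (which may not match your
actual protocol):

/* Copy only the first 4 bytes; the kernel discards the rest of the
   datagram anyway. With MSG_TRUNC the return value is the datagram's
   real length, so you can still see how big it was. */
uint32_t seq_net;
ssize_t full_len = recv(fd, &seq_net, sizeof seq_net, MSG_TRUNC);
if (full_len >= (ssize_t)sizeof seq_net) {
    uint32_t seq = ntohl(seq_net);
    /* ... compare seq with the expected counter ... */
}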

> Is there something similar on the Linux side that would minimize packet loss?
> Is there some other setting I should try?

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

tric...@accusoft.com

Aug 29, 2019, 8:54:36 AM
Thanks Jorgen,

yes, indeed, there is no flow control, though the source is a constant bitrate source (by design) - essentially video over RTP. The bandwidth is, however, quite large, about 140 MByte per second.
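
(For scale: assuming payloads of roughly 1400 bytes per datagram, 140 MByte/s is on the order of 100,000 datagrams per second, i.e. a budget of about 10 microseconds per packet on the receive path.)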

What I should add is that the small "dummy" receiver I implemented is single-threaded, whereas the real receiver uses 10 threads, of which two are set aside exclusively for polling the network and the remaining eight for video decoding and, if they have nothing better to do, for filling the input ring buffer of the decoder. The machine has sufficient physical cores.

Hence, there is certainly some CPU horsepower on the receiving end. Given that the load-average is not yet maxed out, I would assume (probably incorrectly so?) that the receiver can take the load.

One thing I found is that power management spoils the receiver. The CPU load of the decoder isn't high enough to make the kernel clock up the cores all the time, so power management needs to be turned off.

What I still find irritating is that the kernel needs to copy the payload into my buffer (or at least I assume so) - which is probably an issue at this data rate. In the actual decoder operation, copying is avoided as much as possible, and the decoder itself is a "lock-free" implementation.

So, are there any other network layers I might be able to disable to avoid latency and speed up processing?

Are there some network or kernel configurations I could try to avoid copying data around and read "as close to the hardware" as possible?

Greetings,
Thomas

Jorgen Grahn

Aug 29, 2019, 2:32:40 PM
On Thu, 2019-08-29, tric...@accusoft.com wrote:
> Thanks Jorgen,
>
> yes, indeed, there is no flow control, though the source is a
> constant bitrate source (by design) - essentially video over
> RTP. The bandwidth is, however, quite large, about 140 MByte per second.

> What I should add is that the small "dummy" receiver I implemented
> is single-threaded, whereas the real receiver uses 10 threads, of
> which two are set aside exclusively for polling the network and the
> remaining eight for video decoding and, if they have nothing better
> to do, for filling the input ring buffer of the decoder. The machine
> has sufficient physical cores.
>
> Hence, there is certainly some CPU horsepower on the receiving
> end. Given that the load-average is not yet maxed out, I would
> assume (probably incorrectly so?) that the receiver can take the
> load.

If you see the UDP socket's RX buffer fill up, that's the proof it
can't. I mentioned netstat, but perhaps there are even better tools
to profile socket buffer usage.

Regarding horsepower, at least in the past the Linux kernel placed the
network RX work on a single core, which could easily become the
bottleneck. That kind of work shows up as 'softirq'; you can use
e.g. 'mpstat -P ALL 1' to monitor it.

> One thing I found is that power management spoils the receiver. The
> CPU load of the decoder isn't high enough to make the kernel clock up
> the cores all the time, so power management needs to be turned off.
>
> What I still find irritating is that the kernel needs to copy the
> payload into my buffer (or at least I assume so) - which is probably
> an issue at this data rate. In the actual decoder operation, copying
> is avoided as much as possible, and the decoder itself is a
> "lock-free" implementation.
>
> So, are there any other network layers I might be able to disable to
> avoid latency and speed up processing?
>
> Are there some network or kernel configurations I could try to avoid
> copying data around and read "as close to the hardware" as possible?

You're assuming copying is the problem, but you haven't measured yet.
It might be a problem, but it could just as easily be inefficient use
of select(2) and read(2), as I mentioned earlier.

There's a Linux feature called PF_RING which I think can offer zero-copy,
but I think that means working on the Ethernet level.

These are just my five cents, by the way. Around ten years ago I
wrote simulators for (more or less) routers, and had to process tiny
UDP packets as efficiently as possible. I've mercifully forgotten
most of it (except I still loathe UDP) and things in the Linux
kernel may have improved since the 2.4 days. I hope others can
comment.