How can I get the behavior of NIC buffer back-pressure to IP layer or TCP layer in NS-3?


Qiao Zhang

May 12, 2016, 1:06:40 AM
to ns-3-users
Hi all,

I'm trying to simulate a certain kind of workload for a data center network, where an end-host can have up to thousands of active TCP connections to other end-hosts, e.g. a large MapReduce shuffle.

I've been using BulkSendApplication to create application traffic; it sends data as fast as it can over TCP sockets until the TCP transmit buffer is full and the application blocks.

However, when the TCP layer tries to send data onto the wire via the IP layer, there is no back-pressure, and packets get dropped at the end-host net device before they reach the wire.

I'm using DropTailQueue as the transmit queue for the net device on the host. When there are thousands of TCP sockets, each with a large window of data to send, they all call Ipv4->Send() in turn to put packets onto the NetDevice transmit queue. When the queue is overrun, there can be a large number of drops, and TCP can start to behave in weird ways when the drop rate gets high. It seems strange to me that packets have to be dropped needlessly on the end-host side.

If I understand correctly, on real machines, when the NIC buffer is full, the kernel thread that tries to put data into the NIC buffer gets blocked, and that back-pressure "propagates" backwards so that the TCP layer stops sending. This has two benefits. One, TCP wouldn't see so many drops on the end-host side. Two, when TCP is blocked from sending data to the IP layer and below, it doesn't start timers for those packets, so the RTT measurements don't include much queueing delay from the end-host net device transmit buffer, and the RTT actually measures the time spent in the network.

I wonder if there is any way to get back-pressure from end-host net device transmit buffer to TCP layer?

I've tried to get back-pressure using some really ugly hacks, e.g. when TCP tries to SendDataPacket or SendEmptyPacket, check whether the DropTailQueue is full, and if so, either give up sending or somehow queue up those send calls and invoke them again when the queue frees up (basically trying to simulate condition-variable sleep and wake-up behavior in ns-3's event-driven model).

If someone can clarify whether
1) I'm getting this completely wrong: when there are thousands of active TCP flows, I should expect packet drops even at the end-hosts on a real machine
2) I'm using ns-3 wrong: I shouldn't use DropTailQueue for the end-host NetDevice. (I'm using v3.23, so there is no traffic-control layer at the IP layer yet; but that would only give me some kind of queue management, and it would still drop packets when the underlying buffer is full.)
3) there's some way to get back-pressure as I described

I've tried to fix this issue for months without a solution, so if someone can help me, I'd be really grateful!

Thanks!
Qiao


Tommaso Pecorella

May 12, 2016, 4:04:52 AM
to ns-3-users
Hi,

your analysis is wrong. If the drop is in the end node, then the packet already left the wire. There could be a drop in the end node only if your receive code is not doing its job (i.e., it's leaving stuff in the buffers). This is true for ns-3 and for real machines.

Back pressure - it is possible but not currently implemented. The underlying assumption is that the NetDevice output buffer is big enough to host the bursts of the sending apps. This could be wrong for some use-cases, and we plan to introduce back pressure.
Still, I'd double check where your drops are coming from.

Cheers,

T

Nat P

May 12, 2016, 4:29:29 AM
to ns-3-users
On Thursday, May 12, 2016 at 07:06:40 UTC+2, Qiao Zhang wrote:
Hi all,

I'm trying to simulate a certain kind of workload for a data center network, where an end-host can have up to thousands of active TCP connections to other end-hosts, e.g. a large MapReduce shuffle.

[cut]

However, when the TCP layer tries to send data onto the wire via the IP layer, there is no back-pressure, and packets get dropped at the end-host net device before they reach the wire.


[cut]

I wonder if there is any way to get back-pressure from end-host net device transmit buffer to TCP layer?

I've tried to get back-pressure using some really ugly hacks, e.g. when TCP tries to SendDataPacket or SendEmptyPacket, check and see if the DropTailQueue is full, and if so, either give up sending or somehow queue up those send calls, and invoke them again when the queue frees up (basically trying to simulate condition variable sleep and wake up behavior in NS-3 event driven model).

If someone can clarify whether
1) I'm getting this completely wrong: when there are thousands of active TCP flows, I should expect packet drops even at the end-hosts on a real machine
2) I'm using NS-3 wrong: I shouldn't use DropTailQueue for end-host NetDevice. (I'm using v3.23, so there is no TrafficController at the IP layer yet, but that only gives me some kind of queue management, and it would still drop packets when the underlying buffer is full).
3) there's some way to get back-pressure as I described

Hi Qiao,
 this is a very interesting point! As you have already discovered, the main issue here is the backpressure between TCP and IP (I disagree with Tommaso's post here :D)

First of all, I'll use the concept of the traffic-control layer, so this applies to ns-3.25 onward. It makes no sense to reason about ns-3.23: sorry, but that has to be said. Looking at a router, it is perfectly fine that packets are dropped at the TC level: if packets coming from interface eth0 and directed to eth1 (after the routing process) find a full DropTail queue on that interface, they should be dropped. Backpressure of TC works between L2 and L3, and it allows packets to be stored in the L2.5 queue.

Now, there's another kind of backpressure in the end host: between TCP and TC, as you said. That is, if the TC queue is full, there's no need for TCP to send those packets down; they can wait in the TCP buffers until there's some free space. This is where bufferbloat enters the picture, and Linux developers addressed it by implementing TCP Small Queues.
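For what it's worth, the TCP Small Queues idea can be illustrated with a self-contained toy model (plain C++; this is neither ns-3 nor Linux code, and the class name, method names, and limit value are all invented for illustration): each socket tracks how many bytes it currently has queued in the layers below TCP, and refuses to hand down more once a cap is hit.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a TCP Small Queues style limit: each socket may keep at
// most `limit` bytes queued below TCP (qdisc + driver queue combined).
// All names and values here are illustrative, not a real API.
class SmallQueueAccounting {
public:
  explicit SmallQueueAccounting(std::size_t limitBytes)
      : m_limit(limitBytes), m_queuedBelow(0) {}

  // Check before handing a segment to the lower layer.
  bool CanSend(std::size_t segBytes) const {
    return m_queuedBelow + segBytes <= m_limit;
  }

  // Account a segment handed down to the qdisc/device queue.
  void OnSend(std::size_t segBytes) { m_queuedBelow += segBytes; }

  // Account a segment actually transmitted by the device.
  void OnTransmitted(std::size_t segBytes) { m_queuedBelow -= segBytes; }

  std::size_t QueuedBelow() const { return m_queuedBelow; }

private:
  std::size_t m_limit;
  std::size_t m_queuedBelow;
};
```

When CanSend() returns false, the segment simply stays in the TCP TxBuffer, and the socket is rechecked when OnTransmitted() fires; that is the essence of the TCP-to-TC backpressure being discussed.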

It would certainly be an interesting addition; if you are interested in developing it, please let me know so we can share the workload.

Nat

Qiao Zhang

May 12, 2016, 2:35:45 PM
to ns-3-users
Hi Tommaso,

Thanks so much for your quick reply! I've used ns-3 for over a year now, and never tried to post a question on the mailing list. I'm so happy that the community is so responsive :)

I think I wasn't precise with my terminology "onto the wire". I meant that Ipv4L3Protocol gets a unicast packet with a known route entry and a known gateway, then finds the correct interface and sends it out. In my case (v3.23), the packet gets queued in a PointToPointNetDevice's transmit DropTailQueue, and when that queue is full, the packet is dropped before it gets onto the PointToPointChannel it connects to. In the case Nat mentions below (v3.25), the packet gets queued in the traffic-control layer and can potentially be dropped there too.

I have tried a few things before:
1) making the NetDevice buffer much larger to absorb bursts, as you've suggested; but if there are enough BulkSend applications active, the buffer will sooner or later be overrun, unless the bottleneck for those TCP flows is further downstream in the network and congestion control kicks in.
2) replacing BulkSend with OnOffApplication at a constant but controllable bit rate, so that I can make sure all applications on the same host never send data faster than the rate at which it is drained (the data rate of the PointToPointNetDevice attached to the end-host). This mostly avoids the excessive drops on the end-host side, but then my applications are not so realistic anymore; after all, I'm trying to simulate a real workload here.

So I do think the drops I'm seeing are due to the lack of backpressure from the L2 or L2.5 queues back to TCP. I'm really looking forward to seeing that introduced at some point.

But in the meantime, do you know of any way I can work around this limitation (potentially with "hacks") and somehow get back-pressure? I'm trying to get this behavior fixed so that I can run some experiments and get the results for a paper I'm trying to submit next month. 

It seems that making the L2 or L2.5 queue a bounded queue is not enough, because by the time the IP layer tries to put a packet in those queues, TCP thinks the packet is already out and has started the timer for RTT measurements. So I need to block inside TCP, and make sure that TCP doesn't even take the data off the TxBuffer.

I struggled to find the right place to block and wake up TCP in TcpSocketBase. I can see two places, SendDataPacket and SendEmptyPacket, where TcpSocketBase decides whether to take some data off the TxBuffer and send a packet to the IP layer. I can make it check whether the NetDevice output buffer is full, and if so, queue itself as a pending event and return. Then, when the output buffer drains, it finds the pending sends and schedules them again. I think this is an ugly hack, and it potentially makes TCP behave incorrectly: I essentially make SendDataPacket and SendEmptyPacket no-ops and let them return, then reschedule them at some arbitrary point in the future, when that send event could be stale/invalid.

Sorry for this super long email. But I've been hacking at this for quite a while, and I'd really appreciate it if I could get another take on this problem.

Thanks a lot!
Qiao

Qiao Zhang

May 12, 2016, 3:43:51 PM
to ns-3-users
Hi Nat,

Thanks a lot for your clarifications! The traffic-control layer is a fantastic addition! I'm super happy to see it introduced last month. I notice that without a qdisc, there's no way to guarantee flow-level fairness when many applications send at the same time on a host: when a TCP socket was scheduled to send, it used to just blast away at the device buffer and leave no room for the next socket at the next scheduling time slice. A qdisc fixes that. I started using ns-3.23 in March last year, so I had to hack up an FQ or SFQ to get flow-level fairness, and it's a little too late for me to port to ns-3.25 now. I have 8k LOC and about a month to go before a paper deadline :(

But I'm indeed talking about the backpressure from TC back to TCP. You suggested TCP Small Queues, which basically puts a per-TCP-flow limit on how much of the L2.5 or L2 queue it can use. With Byte Queue Limits, we can even adapt that limit dynamically. However, since ns-3 is a single-threaded event-driven simulator, I wonder how we can get the blocking behavior right.

TcpSocketBase is scheduled to run whenever an ACK comes back for it, but it has to decide right away how much new application data to send. If SendDataPacket or SendEmptyPacket sees that there is no space in the L2.5 or L2 queue, can it safely return without doing anything? I can see that when the ACKs run out because the sender produced too many no-ops, it stops getting scheduled to send unless the TCP persist timer fires. So that's problematic.

So I created a queue that keeps track of the "blocked" send events (we know which socket wants to send, and what: either SYN/ACK control packets or an intent to send more data packets). Now, when the underlying L2.5 or L2 queue transmits a packet and has an available slot, it fetches a pending send event and schedules it next. This hack has eliminated drops from TCP to the TC layer, but I'm just not sure it does the same thing the Linux kernel does. Linux doesn't have the same issue we do, because when the kernel thread handling the TCP send blocks, it really blocks: it will not handle subsequent ACKs for that socket, so the TCP state machine is essentially paused, and it resumes (via condition variables) when the L2.5/L2 queue frees up.
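A stripped-down, self-contained sketch of that "park the blocked send, replay it on dequeue" idea (plain C++, not the actual ns-3 classes; every name here is invented for illustration) might look like:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <queue>

// Toy model of a bounded transmit queue that, instead of dropping on
// overflow, parks a "wake me" callback and replays it when a slot
// frees up. Illustrative only; not an ns-3 API.
class BackpressureQueue {
public:
  explicit BackpressureQueue(std::size_t maxPackets) : m_max(maxPackets) {}

  // Returns true if the packet was accepted; otherwise the caller's
  // retry callback is parked until the queue drains (no drop).
  bool Enqueue(int packet, std::function<void()> onSpaceAvailable) {
    if (m_q.size() >= m_max) {
      m_blocked.push(std::move(onSpaceAvailable));
      return false;
    }
    m_q.push_back(packet);
    return true;
  }

  // Models the device transmitting one packet; wakes one blocked sender.
  void Dequeue() {
    if (m_q.empty()) return;
    m_q.pop_front();
    if (!m_blocked.empty()) {
      auto wake = std::move(m_blocked.front());
      m_blocked.pop();
      wake();  // in ns-3 this would be a Simulator::ScheduleNow of the send
    }
  }

  std::size_t Size() const { return m_q.size(); }

private:
  std::size_t m_max;
  std::deque<int> m_q;
  std::queue<std::function<void()>> m_blocked;
};
```

The open question raised above still applies to this sketch: the replayed callback captures the socket's intent at park time, so by the time it runs, the TCP state may have moved on and the send may be stale.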

I'm mostly thinking about hacks for now, as I'm a little hard-pressed to get my code working soon. But perhaps, you have another take on how to get this back-pressure behavior if you were to design and implement it in ns-3, whether with great software engineering practice or with hacks :)

I really appreciate your help!

Qiao

Tommaso Pecorella

May 12, 2016, 5:43:16 PM
to ns-3-users
Hi Nat and Qiao,

first and foremost, thanks for your kind words.

Next, let me say that I strongly disagree with Nat's statement that he disagrees with me. We are, indeed, in agreement.
My point was about *where* the drops happen. I was pointing out that they can't be in the receiving node; they must be in the transmitting node, or in a node in between... and you both confirmed my point.

About how to fix the problem: that's a very challenging thing.
- BulkSend is "smart" enough to fill the TCP buffer and keep it full until its data block is sent.
- TCP has an "inner" back pressure from the congestion control, but not from local queues.

The problem is that there's no way (yet) to do a generic back pressure from a NetDevice.
What you can do is add something to the NetDevice you're using, in order to know when the buffer is full. However, I wouldn't know where to react to this. The best option would be TCP, but I can't evaluate all the possible drawbacks.
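As a hypothetical illustration of "add something to the NetDevice to know when the buffer is full" (this is not an existing ns-3 API; the watermark design is borrowed in spirit from Linux's netif_stop_queue()/netif_wake_queue(), and all names are invented):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Toy high/low watermark notifier for a device transmit queue: fire a
// "stop" callback when occupancy crosses the high mark, and a "wake"
// callback once it drains back below the low mark. Illustrative only.
class TxQueueWatermarks {
public:
  TxQueueWatermarks(std::size_t high, std::size_t low,
                    std::function<void()> onStop,
                    std::function<void()> onWake)
      : m_high(high), m_low(low), m_stopped(false),
        m_onStop(std::move(onStop)), m_onWake(std::move(onWake)) {}

  // Call whenever the device queue occupancy changes.
  void Update(std::size_t occupancy) {
    if (!m_stopped && occupancy >= m_high) {
      m_stopped = true;
      m_onStop();   // tell upper layers to stop handing down packets
    } else if (m_stopped && occupancy <= m_low) {
      m_stopped = false;
      m_onWake();   // queue drained: upper layers may resume
    }
  }

  bool Stopped() const { return m_stopped; }

private:
  std::size_t m_high, m_low;
  bool m_stopped;
  std::function<void()> m_onStop, m_onWake;
};
```

The two callbacks are exactly the hook points whose consumer is unclear: they could be wired to TCP, to the IP layer, or to a traffic-control layer, each with its own drawbacks.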

Hope this helps a bit,

T.

Qiao Zhang

May 12, 2016, 7:25:54 PM
to ns-3-users
Hi Tommaso and Nat,

I see that the TCP socket's Send() blocks and wakes up through a callback when the TCP TxBuffer is full. Do you think it would properly apply back-pressure if I put a dynamic limit on the TCP TxBuffer that shrinks ahead of time when the underlying L2.5 or L2 buffer is close to full (simulating the behavior of TCP Small Queues)? I'm hoping that BulkSend would see that its socket buffer is full and stop sending. However, I'm not sure whether making the TCP socket buffer smaller would actually slow down sending, or simply add some function-call overhead that doesn't take up logical time in ns-3.
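As a self-contained sketch of that shrinking-limit idea (the function name and the linear shrink policy are assumptions made up for illustration, not ns-3 behavior):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical policy: scale the effective TCP TxBuffer limit down as
// the device queue fills, so BulkSend blocks on a "full" socket buffer
// before the device queue overflows. Linear scaling is just one choice.
std::size_t EffectiveTxBufferLimit(std::size_t nominalLimit,
                                   std::size_t devQueueLen,
                                   std::size_t devQueueMax) {
  if (devQueueLen >= devQueueMax) {
    return 0;  // device queue full: accept no new application data
  }
  // Allowance shrinks linearly with the remaining device-queue space.
  std::size_t freeSlots = devQueueMax - devQueueLen;
  return nominalLimit * freeSlots / devQueueMax;
}
```

The limit would be re-evaluated on every device-queue enqueue/dequeue, and Send() would block (via its existing full-buffer callback path) whenever the buffered data exceeds the current effective limit.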

Thanks for your input!
Qiao

PS: I got my understanding of TCP Small Queues from the article below rather than from reading the patch itself, so I might be wrong.

Natale Patriciello

May 13, 2016, 12:18:36 PM
to ns-3-...@googlegroups.com, ns-dev...@isi.edu
Hi, CC'ing the developers; this is a very important discussion and should be
kept in mind for the future.

On 12/05/16 at 04:25pm, Qiao Zhang wrote:
> Hi Tommaso and Nat,
>
> I see that TCP socket Send() blocks and wakes up through callback when the
> TCP TxBuffer is full. Do you think it would properly back-pressure if I
> have a dynamic limit on the TCP TxBuffer that shrinks ahead of time when
> the underlying L2.5 or L2 buffer is close to full (simulating the behavior
> of TCP small buffer)? I'm hoping that BulkSend would see that its socket
> buffer is full and stop sending. However, I'm not sure if making TCP socket
> buffer smaller would actually slow down sending or simply adds some
> function call overhead *that doesn't really take up logical time* in ns-3.
>
> Thanks for your input!
> Qiao
>
> PS: I got my understanding of TCP small buffer through the article below,
> rather than reading the patch itself, so I might be wrong.
> https://www.coverfire.com/articles/queueing-in-the-linux-network-stack/

Hi both,

first of all, of course we should make sure that the losses are happening in the node :). Anyway, in any case, TCP's congestion control should kick in when we exceed the rate allowed by anything along the path. Without TCP-TC backpressure, these packets are lost; with TCP-TC backpressure, they remain in the application. That means that if the TCP congestion control discovers local congestion, it could take less drastic action.

Qiao, if you correctly implement what you are describing (shrinking the sender buffer size), the bulk send application will stop transmitting. Anyway, the part that seems not so easy is the implementation itself: for instance, how much space in the TC queue do you assign to each TCP flow? A static value (e.g. 64, or queuedisc_max_size() / #TCP-flows)? A dynamic one, based on cwnd? But depending on the rewarding strategy, you could lock out the flow with the smallest cwnd or cut off the flow with the biggest cwnd... To be honest, I don't know; I should take a look at the Linux code.

Unfortunately, right now there's no way to know the maximum size of the underlying queue disc; you would have to cast to every possible type of queue disc (the same applies to ns-3.23, with s/QueueDisc/Queue/g). And another limitation is the API of the IP layer...

Really interesting challenge by the way :)
Nat

Tommaso Pecorella

May 13, 2016, 7:43:47 PM
to ns-3-users
I totally agree, and I think this is the TCP variant of this known bug:

1006: UDP socket tx buffer back pressure needed

We definitely need back pressure for TCP and UDP.

Cheers,

T.

Qiao Zhang

May 15, 2016, 12:42:05 AM
to ns-3-users
Hi both,

Thanks for your input on this problem. I can see how a general solution is in fact quite challenging. I think in most cases, as long as the local L2.5 buffers are larger than buffers elsewhere in the network and there aren't too many flows sharing the same outgoing link, TCP's bottleneck will be in the network, and congestion control will kick in before we start losing packets at the local egress.

For my particular case, I might have to experiment with conservatively cutting back the sender buffer size when the local L2 buffer is close to full. But I can see how this is like re-implementing TCP congestion control at the socket-buffer level. So it might not work out, and I may have to try something else in that case.

Thanks again for all your input!
Qiao

Eitan Zahavi

May 19, 2020, 12:01:22 PM
to ns-3-users
Hi All,

Was back-pressure from the traffic-control layer to TCP eventually implemented in the ns-3 model?
Thanks

Eitan