Re: Web10g and connection stall analysis

Aaditeshwar Seth

Apr 13, 2014, 11:32:25 PM

to Arvind Mahla, Vinay Ribeiro, Zahir Koradia, asheesh...@me.com, ruralne...@googlegroups.com

Comments inline:

On 13-04-2014 15:45, Arvind Mahla wrote:

On Sat, Apr 5, 2014 at 11:01 AM, Aaditeshwar Seth <as...@cse.iitd.ernet.in> wrote:

What I recall about these connection stalls is that:

1. The server (sender) would keep sending data and run into timeouts, but the receiver did not receive the data within 0.5 RTT (as measured at that time); it arrived much, much later.

2. These stalls were triggered by a burst data release that was likely to be a combination of CUBIC stepping up the congestion window, delayed acks, and possibly TSO.

So a number of things need to be investigated:
- For 1, what is the RTO calculated by TCP at the sender when these stalls occur, to confirm that the RTO is similar to the RTTs and much smaller than the delay at which the data is actually received

We have gone through tcpdumps and Web10g logs. The observation is in sync with your suggestions. The RTO is similar to the RTTs, but slightly higher. The RTO observed in plots is less than half of the delay at which the data is received.

Please give evidence, i.e., show the reported RTO over time, RTT over time, etc., so that we can make this claim.

- For 2, can we precisely identify the trigger point, which could be based on the amount of congestion window increase and the rate of increase

The dumps around the trigger point of the stalls do not reveal any stepping up of the congestion window; in fact, in many cases the congestion window shows a gradual decrease before the connection stall.

Same comment as earlier: show the congwin variable over time.

Specifically to understand the cause:
- Check that no middlebox in the way is buffering out-of-order packets, something that was noticed in the 'middleboxes in cellular networks' paper where firewalls were seen to be buffering out of order packets for deep packet inspection to check for virus signatures. In our case, we should check if out of order packets sent artificially are being buffered

The tests were performed and out-of-order packet buffering was not observed.

- Whether signal strength at the receiver is a problem. Do we have signal strength measurements during this time? Otherwise having multiple transfers going is a possibility, but why do you say upload? Here it is a download scenario. I don't think we investigated stalls in upload

Signal strength measurements are there, but the frequency is once in three seconds and reliability has not been ascertained.
Tests will be performed for competing flows: TCP Vs TCP to different destinations, TCP Vs UDP to same destination. These tests will be performed in the current week. Stalls are present in both uploads and downloads and we are investigating in both directions.

Ok

- How about the modem, could this be an issue?

This will be investigated once the above mentioned tests are completed.

Make sure you order other modems in advance if you don't already have different models.

Please also give timelines for these things.

Asheesh and I worked on these issues. Today I am going to Hyderabad on official duty and will be back on 19th Apr. I have explained the further tests to Asheesh and have asked him to complete the task by the end of this week.

Arvind
Aaditeshwar
On 03-04-2014 20:44, Arvind Mahla wrote:

Sir,

We, along with Zahir, analysed the Web10g logs for connection stalls. Web10g variables such as congestion window size, RTT, and RTO were plotted alongside the tcpdumps, but the reason for the connection stalls could not be established.
We need to explore further by looking for answers to the following questions:

1. Are the connection stalls due to poor signal quality or fading? To get insight into this we need to run parallel UDP/TCP uploads along with ping. Signal fading will definitely affect the parallel flows. It will also reveal whether the stalling is protocol specific. Parallel uploads should be done using different servers simultaneously. It will help in isolating the reason for connection stalls.

2. Are the connection stalls OS specific? We need to run the tests on a Windows machine.

Arvind

On Tue, Apr 1, 2014 at 9:27 AM, Aaditeshwar Seth <as...@cse.iitd.ernet.in> wrote:

Folks, Vinay is not available, so please catch up among yourselves and with Zahir, and send me a good report on the web10g analysis, especially if it can explain the connection stalling.

On 31-03-2014 08:31, Aaditeshwar Seth wrote:

Folks, are you free tomorrow (Tuesday) at 5pm to discuss these and the web10g results? Arvind/Asheesh, please catch up before that and send us an updated report on the web10g results that Asheesh has sent.

On 31-03-2014 08:16, Arvind Mahla wrote:

Sir,

In the report I had sent, the Conclusion section was modified as per your comments.
I will further refine it for better understanding.

Arvind

--
Aaditeshwar Seth
Co-founder, Gram Vaani Community Media

http://gramvaani.org
http://www.linkedin.com/company/gram-vaani-community-media

-- 
Aaditeshwar Seth
Assistant Professor
Computer Science and Engineering
IIT Delhi

http://www.cse.iitd.ernet.in/~aseth

Arvind Mahla

Apr 19, 2014, 2:32:05 PM

to Aaditeshwar Seth, Vinay Ribeiro, Zahir Koradia, asheesh...@me.com, ruralne...@googlegroups.com

Sir,

Please find attached a document with the RTO, RTT, and congestion window plots.

The answers to your queries are inline.

On Sun, Apr 13, 2014 at 11:32 PM, Aaditeshwar Seth <as...@cse.iitd.ernet.in> wrote:

Comments inline:

On 13-04-2014 15:45, Arvind Mahla wrote:

On Sat, Apr 5, 2014 at 11:01 AM, Aaditeshwar Seth <as...@cse.iitd.ernet.in> wrote:

What I recall about these connection stalls is that:

1. The server (sender) would keep sending data and run into timeouts, but the receiver did not receive the data within 0.5 RTT (as measured at that time); it arrived much, much later.

2. These stalls were triggered by a burst data release that was likely to be a combination of CUBIC stepping up the congestion window, delayed acks, and possibly TSO.

So a number of things need to be investigated:
- For 1, what is the RTO calculated by TCP at the sender when these stalls occur, to confirm that the RTO is similar to the RTTs and much smaller than the delay at which the data is actually received

We have gone through tcpdumps and Web10g logs. The observation is in sync with your suggestions. The RTO is similar to the RTTs, but slightly higher. The RTO observed in plots is less than half of the delay at which the data is received.

Please give evidence, i.e., show the reported RTO over time, RTT over time, etc., so that we can make this claim.

The RTO and RTT are plotted against time in Figure 1 of the attached document.
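
For reference when reading the RTO and RTT curves, here is a minimal sketch of the standard retransmission timer computation from RFC 6298, which shows why the RTO follows the smoothed RTT and sits somewhat above it. It is an illustration only, not the kernel's code: the function name and the sample RTT values are made up, clock granularity is ignored, and the 0.2 s floor and 120 s ceiling are assumptions (the RFC itself recommends a 1 s minimum).

```python
def rto_series(rtt_samples, min_rto=0.2, max_rto=120.0):
    """Return the RTO (seconds) after each RTT sample, following RFC 6298.

    min_rto and max_rto are assumed bounds, not values taken from our logs.
    """
    alpha, beta = 1 / 8, 1 / 4  # smoothing gains given in RFC 6298
    srtt = rttvar = None
    out = []
    for r in rtt_samples:
        if srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2
            srtt, rttvar = r, r / 2
        else:
            # RTTVAR is updated first, using the old SRTT
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
            srtt = (1 - alpha) * srtt + alpha * r
        rto = srtt + 4 * rttvar
        out.append(min(max(rto, min_rto), max_rto))
    return out


# Hypothetical, fairly steady RTT samples in seconds
samples = [0.30, 0.32, 0.31, 0.29, 0.30, 0.31]
rtos = rto_series(samples)
```

With steady RTT samples the variance term shrinks, so the computed RTO drifts down toward the RTT while staying above it.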

- For 2, can we precisely identify the trigger point, which could be based on the amount of congestion window increase and the rate of increase

The dumps around the trigger point of the stalls do not reveal any stepping up of the congestion window; in fact, in many cases the congestion window shows a gradual decrease before the connection stall.

Same comment as earlier: show the congwin variable over time.

The Congestion window is plotted in Figure 2 in the attached document.

Specifically to understand the cause:
- Check that no middlebox in the way is buffering out-of-order packets, something that was noticed in the 'middleboxes in cellular networks' paper where firewalls were seen to be buffering out of order packets for deep packet inspection to check for virus signatures. In our case, we should check if out of order packets sent artificially are being buffered

The tests were performed and out-of-order packet buffering was not observed.

- Whether signal strength at the receiver is a problem. Do we have signal strength measurements during this time? Otherwise having multiple transfers going is a possibility, but why do you say upload? Here it is a download scenario. I don't think we investigated stalls in upload

Signal strength measurements are there, but the frequency is once in three seconds and reliability has not been ascertained.
Tests will be performed for competing flows: TCP Vs TCP to different destinations, TCP Vs UDP to same destination. These tests will be performed in the current week. Stalls are present in both uploads and downloads and we are investigating in both directions.

Ok

The tests have been written and deployed. The logs will be analysed and reported soon.

- How about the modem, could this be an issue?

This will be investigated once the above mentioned tests are completed.

Make sure you order other modems in advance if you don't already have different models.

OK

report.pdf

Aaditeshwar Seth

Apr 21, 2014, 4:06:52 AM

to Arvind Mahla, Vinay Ribeiro, Zahir Koradia, asheesh...@me.com, ruralne...@googlegroups.com

In these graphs, you also need to show the acks received and packets sent at the sender, and separately show the packets received and acks dispatched at the receiver. A connection stall is when the sender has sent packets that haven't been received. What you've marked as a connection stall here seems to be a pause in dispatching data at the sender, which appears to be standard behaviour since no new acks came in.
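
The distinction between a stall and a sender-side pause can be checked mechanically once the sender and receiver traces are on a common clock. The sketch below is only an illustration: the function name, the trace format (lists of (timestamp, sequence number) pairs), and the lateness threshold are assumptions, not the format of our tcpdump or Web10g logs.

```python
def classify_gap(sent, received, gap_start, gap_end, late_after):
    """Classify the interval [gap_start, gap_end) on the sender's timeline.

    sent, received: lists of (timestamp_seconds, seq) on a common clock.
    late_after: delay in seconds beyond which a packet counts as not
        received in time (an assumed threshold).
    Returns 'sender_pause', 'stall', or 'normal'.
    """
    in_gap = [(t, seq) for t, seq in sent if gap_start <= t < gap_end]
    if not in_gap:
        # Nothing was dispatched in the interval: the sender was idle,
        # for example waiting for acks. This is not a stall.
        return "sender_pause"

    # Earliest time each sequence number was seen at the receiver
    first_arrival = {}
    for t, seq in received:
        if seq not in first_arrival or t < first_arrival[seq]:
            first_arrival[seq] = t

    for t_sent, seq in in_gap:
        arrival = first_arrival.get(seq)
        # Sent, but missing or very late at the receiver
        if arrival is None or arrival - t_sent > late_after:
            return "stall"
    return "normal"


# Hypothetical traces: packet 2 is sent at t=1.0 but arrives at t=4.0
sent = [(0.0, 1), (1.0, 2), (5.0, 3)]
received = [(0.1, 1), (4.0, 2), (5.1, 3)]
```

This sketch does not separate late packets from lost ones and does not treat retransmissions specially; both would need handling on real traces.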

Arvind Mahla

Apr 27, 2014, 4:18:53 PM

to ruralne...@googlegroups.com, Vinay Ribeiro, Zahir Koradia, asheesh...@me.com

Sir,

The weekly update of the work done will be sent by evening.

Arvind


Arvind Mahla

Apr 28, 2014, 3:19:59 PM

to ruralne...@googlegroups.com, Vinay Ribeiro, Zahir Koradia, asheesh...@me.com

Sir,

On Mon, Apr 21, 2014 at 1:36 PM, Aaditeshwar Seth <as...@cse.iitd.ernet.in> wrote:

In these graphs, you also need to show the acks received and packets sent at the sender, and separately show the packets received and acks dispatched at the receiver. A connection stall is when the sender has sent packets that haven't been received. What you've marked as a connection stall here seems to be a pause in dispatching data at the sender, which appears to be standard behaviour since no new acks came in.

Please find the report attached with graphs of both sender and receiver sides.

I will report on the TCP & UDP parallel flows as soon as I get test data from Asheesh.

Arvind

TCPStalls.pdf

Aaditeshwar Seth

May 4, 2014, 9:49:09 AM

to ruralne...@googlegroups.com, Vinay Ribeiro, Zahir Koradia, asheesh...@me.com

So this is not the same as a connection stall, where the packets are sent but received after a huge delay (not retransmitted). This could be another pattern, in which both data and ack packets are lost, possibly due to a loss of connectivity, but it is not the connection stall phenomenon that was described in the paper. Have we noticed any connection stalls? Zahir, need your intervention here -- I am guessing you wrote scripts to automatically extract instances of stall events.

Zahir Koradia

May 5, 2014, 12:15:22 AM

to ruralnet-act4d, Vinay Ribeiro, asheesh...@me.com

The graph looks very similar to stalls. Even in our description of stalls, we claimed that some data gets lost and other data is delayed for a very long time. In these graphs the proportion of delayed packets is lower but not zero. This is seen through the SACKs of delayed packets observed right after the 17:00:00 marker in Figure 1. One key difference, though, is that acks are also lost, something we had not observed before. This may indicate that the problem is at a lower layer of the protocol stack.

I am curious to see the dependence on protocol and local conditions, something Arvind, Asheesh, and I had discussed. If the problem is indeed loss of connectivity, then it will be protocol independent. Also, if the problem is with the client itself, then another client nearby will not have a problem. If, however, the problem is at the base station (or maybe along a common path beyond the base station), then we will see both clients have a problem at the same time.
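
The reasoning above amounts to a small decision table. A sketch follows, with a hypothetical function name and boolean inputs that would come from the parallel-flow and two-client experiments. It encodes only the cases described in this thread and is not a complete diagnosis.

```python
def likely_cause(other_protocol_stalls, nearby_client_stalls):
    """Map two experiment outcomes to the hypotheses discussed here.

    other_protocol_stalls: a parallel flow using a different protocol
        (e.g. UDP next to TCP) on the same client is disrupted at the
        same time as the stalled flow.
    nearby_client_stalls: a second client close by is disrupted at the
        same time.
    """
    if not other_protocol_stalls:
        # Loss of connectivity would hit every protocol, so a stall that
        # spares the parallel flow points at protocol-specific behaviour.
        return "protocol-specific"
    if nearby_client_stalls:
        # Both clients affected together: shared infrastructure.
        return "base station or common path"
    # Protocol independent, but only this client is affected.
    return "client itself"
```

For example, `likely_cause(True, False)` returns `"client itself"`, the case where repeating the tests with a different modem would be the natural next step.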

The scripts for stall detection are with Aravindh and I think they have already been shared with Asheesh/Arvind.

Zahir

Aaditeshwar Seth

May 5, 2014, 6:25:36 AM

to ruralne...@googlegroups.com, Vinay Ribeiro, asheesh...@me.com

Sounds reasonable. Arvind, we must complete these tests ASAP, along the following lines:
- Automatically detect stalls in the data used in the paper, and the new data you have collected
- Categorize stall events in terms of the % of delayed packets, % of lost packets/acks
- Separately confirm whether stalls are an artifact of local connectivity (by running experiments in parallel), and whether we can rule out the influence of the modem (by repeating tests with a different model)
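
For the categorization step, a rough sketch of how each detected stall event could be summarized. The record layout, the field names, and the 50% cut-off used for labelling are assumptions for illustration; the existing stall detection scripts may work differently.

```python
from dataclasses import dataclass


@dataclass
class StallEvent:
    data_sent: int      # data packets sent during the event
    data_delayed: int   # arrived at the receiver, but very late
    data_lost: int      # never seen at the receiver
    acks_sent: int      # acks dispatched by the receiver during the event
    acks_lost: int      # acks never seen at the sender


def pct(part, whole):
    """Percentage of part in whole; 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def summarize(event):
    """Per-event percentages of delayed data, lost data, and lost acks."""
    return {
        "pct_delayed": pct(event.data_delayed, event.data_sent),
        "pct_lost": pct(event.data_lost, event.data_sent),
        "pct_acks_lost": pct(event.acks_lost, event.acks_sent),
    }


def label(summary, mostly=50.0):
    """Coarse label for an event; the 50% cut-off is arbitrary."""
    if summary["pct_delayed"] >= mostly:
        return "delay-dominated"
    if summary["pct_lost"] >= mostly:
        return "loss-dominated"
    return "mixed"


# Hypothetical event: most data packets arrive late, a few are lost
example = StallEvent(data_sent=100, data_delayed=80, data_lost=10,
                     acks_sent=40, acks_lost=4)
example_summary = summarize(example)
```

Running the same summary over the data used in the paper and over the newly collected data would make the two sets directly comparable.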

Also write a comprehensive report that includes web10g analysis, rules out out-of-order buffering in some middlebox, etc.

Aaditeshwar

Vinay Ribeiro

May 6, 2014, 12:51:51 AM

to Aaditeshwar Seth, ruralne...@googlegroups.com, asheesh...@me.com

Hi,

Some students in a class project reported unusually high RTTs (as high as 50 secs for Reliance) on the uplink for Airtel and Reliance 3G. Have we performed any experiments on the uplink? This is probably due to a bufferbloat-type issue, but this time the buffer is at the client device.