Global TCP load balancer times out connection when only downloading.


Anand Mistry

unread,
Dec 5, 2017, 11:49:22 PM
to gce-discussion
Hi,

I have a server that exposes a TCP service, which is then exposed to the internet using the global TCP load balancer. When a client connects, the server transmits data on a regular basis (every 10 seconds), which the client receives. However, I've noticed that if the client doesn't send any data to the server, the connection times out after the backend timeout in the load balancer config. If I increase the timeout, the connection stays around longer. This seems counterintuitive, since the timeout is meant for idle connections and this connection clearly isn't idle. I can set the timeout to something large (e.g. 86400 seconds), but this seems like a very poor workaround.
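
To illustrate, the client essentially does nothing but read; a minimal sketch (hypothetical host and port, not my actual code):

    import socket

    # Hypothetical reproduction sketch: connect through the load balancer's
    # frontend and only ever read; nothing is sent upstream after the handshake.
    HOST, PORT = 'lb.example.com', 9000  # placeholders

    with socket.create_connection((HOST, PORT)) as sock:
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # connection closed (e.g. by the load balancer)
                break
            print(f'received {len(chunk)} bytes')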

I have another client-server pair which also uses the global load balancer, but it does in-band health checks, and the connection there can stay open for hours. In the problem case this isn't possible, since I don't control the client.

Is what I'm seeing the expected behaviour for the global load balancer? And if so, where is this documented?

Thanks.

Fady (Google Cloud Platform)

unread,
Dec 6, 2017, 6:01:10 PM
to gce-discussion
Hello Anand, 

As explained in this document, which should also apply to global TCP load balancers, the timeout of the backend service is not an idle or keep-alive timeout. It is a fixed time (default 30 seconds) that the backend service will wait for the backend instance(s) before it considers a request to have failed.


On the other hand, the default TCP session timeout on a GCE instance is 10 minutes (600 seconds). Per this document, “idle TCP connections are disconnected after 10 minutes. If your instance initiates or accepts long-lived connections with an external host, you can adjust TCP keep-alive settings to prevent these timeouts from dropping connections. You can configure the keep-alive settings on the Compute Engine instance, your external client, or both, depending on the host that typically initiates the connection. Set the keep-alives to less than 600 seconds to ensure that connections are refreshed before the timeout occurs.”  To set keep-alives, you may check this Stack Overflow discussion. I hope this helps.
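
As an illustration, keep-alives can be enabled on the instance side along these lines (a minimal sketch with assumed values, for a Linux host):

    import socket

    # Minimal sketch (assumed values): enable TCP keep-alives so probes are
    # sent well before the 600-second idle timeout mentioned above.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux-specific options: 60s idle before the first probe, probes every
    # 15s, give up after 4 failed probes. These numbers are illustrative.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 15)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4)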


Anand Mistry

unread,
Dec 6, 2017, 8:16:24 PM
to gce-discussion
On Thursday, 7 December 2017 10:01:10 UTC+11, Fady (Google Cloud Platform) wrote:
Hello Anand, 

As explained in this document, which should also apply to global TCP load balancers, the timeout of the backend service is not an idle or keep-alive timeout. It is a fixed time (default 30 seconds) that the backend service will wait for the backend instance(s) before it considers a request to have failed.


That is extremely ill-defined. What does it mean for a request to have "failed"? My service is constantly sending data, which is clearly being received by the client (I assume TCP ACK packets are being properly sent). Clearly, the request hasn't "failed".

Carlos (Cloud Platform Support)

unread,
Dec 8, 2017, 5:14:27 PM
to gce-discussion
Hi Anand,

I agree with Fady that the timeout mentioned in the documentation describes the time the LB will wait for the backend's answer before, let us say, returning a 5xx error to the client. It is not an idle timeout. That being said, could you provide some additional insight into what you are trying to achieve? What specific type of LB are you using? Is your service running on GKE?

Anand Mistry

unread,
Dec 8, 2017, 6:19:29 PM
to gce-discussion
I'm using the TCP proxy balancer. The connections are meant to be long-lived with intermittent traffic. The specific thing I was doing was streaming a log to the client. The client sends a request through the TCP connection, and the server streams a log for a while. There's continuous downstream traffic, which is the log, but there's no upstream traffic until the log stream is terminated. If I send upstream traffic, the timeout appears to reset. Another server behind the same type of balancer doesn't experience this problem because it does regular in-band health checks. This isn't HTTP traffic, so no 5xx responses are possible, and there's no concept of "request" and "response" unless you're talking about the initial TCP handshake, which is not the problem.

I would expect the TCP proxy to keep the connection open as long as there's TCP traffic or keep-alives on both ends of the proxy. For now, I've set the timeout to 1 week, which doesn't seem like a good workaround.
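
For reference, the in-band health check on that other client-server pair amounts to something like this on the client side (a rough sketch with made-up names, not the real code):

    import socket
    import time

    # Rough sketch: the client sends a small in-band ping at a fixed interval,
    # which the server ignores. Upstream traffic like this appears to reset
    # the load balancer's timeout.
    PING_INTERVAL_SEC = 20  # illustrative; keep it below the backend timeout

    def send_pings(sock: socket.socket) -> None:
        while True:
            try:
                sock.sendall(b'PING\n')
            except OSError:
                return  # connection is gone; stop pinging
            time.sleep(PING_INTERVAL_SEC)

    # After connecting, run the pinger in the background, e.g.:
    # import threading
    # threading.Thread(target=send_pings, args=(sock,), daemon=True).start()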

Fady (Google Cloud Platform)

unread,
Dec 12, 2017, 6:29:19 PM
to gce-discussion

Hello Anand,


After checking with the backline team, and per this document, the backend service timeout for the global TCP proxy load balancer, like the Global SSL Proxy load balancer, is actually an idle timeout. As you mentioned, the timeout here is different from the HTTP load balancer, where it's the period to wait for the backend before it considers an HTTP request failed. That said, is it possible to privately send me a tcpdump while reproducing the same behavior?



Gerrit DeWitt

unread,
Jan 12, 2018, 8:56:25 PM
to gce-discussion
Hello Anand,

I'm Gerrit, a Solutions Engineer in Seattle.  I work in GCP Support with Fady who asked me to provide a little bit of a more detailed explanation for you.  I can clarify some of the concepts around timeouts with our global load balancer offerings.  With that in mind, here goes...

We have three global load balancer solutions, and all of them rely on a conceptual object called a backend service.  (The Global HTTP(S) load balancer offering can use a backend storage bucket instead, but that's outside of the scope of this answer.) The three global load balancer offerings are:

* Global HTTP(S) load balancer [1]
* (Global) SSL Proxy load balancer [2]
* (Global) TCP Proxy load balancer [3]

Each of these types of load balancers acts as a proxy, terminating HTTP or TCP requests from clients at the load balancer, and creating new requests to send to backend instances.  The backend service defines the backends and how they are to be used, and it also provides some logic for how to connect to them.  Included with that logic is a concept of a backend service timeout [4].

I understand that your question is directed to the TCP Proxy load balancer, but I think it will be easier to explain timeouts and retries if we start with the Global HTTP(S) load balancer and work backwards.

For the Global HTTP(S) load balancer offering, the backend service timeout represents a response timeout; that is, the amount of time the load balancer will wait for a backend instance to send a response to a request.  It's the time the load balancer will wait before giving up on the backend and synthesizing an HTTP 502 response.  HTTP/1.1 connections are considered “alive” (persistent) by default [5], so the concept of a keepalive or idle timeout is separate.  For the Global HTTP(S) load balancer, we have a fixed keepalive timeout of 600 seconds.  For HTTP, the TCP session timeout is the keepalive timeout, not the response timeout.

The two types of timeout - response and idle/keepalive - are unique to the HTTP protocol (due to [5]).

The notion of a timeout becomes simpler outside of HTTP.  For example, if the Global HTTP(S) load balancer is used to balance a WebSocket connection, response and idle/keepalive timeouts are not separate.  For WebSockets, the backend service timeout still defines the response timeout, but that's the only timeout, so it is also the TCP session timeout.  By “TCP session timeout,” we mean the maximum amount of time that a TCP session between the load balancer and a backend instance is allowed to remain open, whether the connection is active or idle.  More detail for the WebSocket example is here [7].

The same single timeout concept applies to the SSL Proxy and TCP Proxy load balancers.  For both of these, the backend service timeout defines the response timeout, which is the TCP session timeout.  This means that the backend service timeout defines the maximum amount of time a TCP session can remain open between the load balancer and its backend instance, regardless of activity.  A special case is that idle connections can only persist for the duration of the backend service/response/TCP session timeout, but this timeout also applies to active connections for which the load balancer has not yet received a response.  In your example where you're doing in-band health checks, I'm guessing you're sending data, getting a response, sending a health check, getting a response, etc. where all responses are returned within a period of time less than or equal to the backend service (response) timeout.

I realize there's a bit of subtlety here and that we could better state what we've documented for the backend service with respect to SSL and TCP Proxy load balancers [8].

One other point I'll address is this:  We're sometimes asked where the backend service fits in the OSI networking stack.  It's best to think of the backend service as a collection of configuration parameters for our load balancers.  For the three I've discussed, you might be using one to proxy HTTP or SSL traffic (application layer) or TCP traffic (layer four).  The backend service object is also used by our non-proxy internal load balancer offering (which is outside the scope of this answer).

Hopefully this helps you out!

Sincerely,

Gerrit
Cloud Solutions Engineer, Seattle

Anand Mistry

unread,
Jan 15, 2018, 12:50:02 AM
to Gerrit DeWitt, gce-discussion
Thanks for the response, but this doesn't seem to address the behaviour I'm seeing.

On 13 January 2018 at 12:56, 'Gerrit DeWitt' via gce-discussion <gce-dis...@googlegroups.com> wrote:
Hello Anand,

I'm Gerrit, a Solutions Engineer in Seattle.  I work in GCP Support with Fady who asked me to provide a little bit of a more detailed explanation for you.  I can clarify some of the concepts around timeouts with our global load balancer offerings.  With that in mind, here goes...

We have three global load balancer solutions, and all of them rely on a conceptual object called a backend service.  (The Global HTTP(S) load balancer offering can use a backend storage bucket instead, but that's outside of the scope of this answer.) The three global load balancer offerings are:

* Global HTTP(S) load balancer [1]
* (Global) SSL Proxy load balancer [2]
* (Global) TCP Proxy load balancer [3]

Each of these types of load balancers acts as a proxy, terminating HTTP or TCP requests from clients at the load balancer, and creating new requests to send to backend instances.  The backend service defines the backends and how they are to be used, and it also provides some logic for how to connect to them.  Included with that logic is a concept of a backend service timeout [4].

I understand that your question is directed to the TCP Proxy load balancer, but I think it will be easier to explain timeouts and retries if we start with the Global HTTP(S) load balancer and work backwards.

For the Global HTTP(S) load balancer offering, the backend service timeout represents a response timeout; that is, the amount of time the load balancer will wait for a backend instance to send a response to a request.  It's the time the load balancer will wait before giving up on the backend and synthesizing an HTTP 502 response.  HTTP/1.1 connections are considered “alive” (persistent) by default [5], so the concept of a keepalive or idle timeout is separate.  For the Global HTTP(S) load balancer, we have a fixed keepalive timeout of 600 seconds.  For HTTP, the TCP session timeout is the keepalive timeout, not the response timeout.

The two types of timeout - response and idle/keepalive - are unique to the HTTP protocol (due to [5]).

The notion of a timeout becomes simpler outside of HTTP.  For example, if the Global HTTP(S) load balancer is used to balance a WebSocket connection, response and idle/keepalive timeouts are not separate.  For WebSockets, the backend service timeout still defines the response timeout, but that's the only timeout, so it is also the TCP session timeout.  By “TCP session timeout,” we mean the maximum amount of time that a TCP session between the load balancer and a backend instance is allowed to remain open, whether the connection is active or idle.  More detail for the WebSocket example is here [7].

The same single timeout concept applies to the SSL Proxy and TCP Proxy load balancers.  For both of these, the backend service timeout defines the response timeout, which is the TCP session timeout.  This means that the backend service timeout defines the maximum amount of time a TCP session can remain open between the load balancer and its backend instance, regardless of activity

If this is the case, how can I keep a TCP session open for hours when I set the backend timeout to 60 seconds?
 
A special case is that idle connections can only persist for the duration of the backend service/response/TCP session timeout, but this timeout also applies to active connections for which the load balancer has not yet received a response.  In your example where you're doing in-band health checks, I'm guessing you're sending data, getting a response, sending a health check, getting a response, etc. where all responses are returned within a period of time less than or equal to the backend service (response) timeout.

Yes, but according to the part I highlighted above, the connection should still be dropped after the backend service timeout. 

But, more importantly, what denotes a "request" and a "response" in the TCP proxy load balancer? Given that the traffic itself is completely opaque to the balancer (in my case, it's encrypted), the balancer cannot properly comprehend the notion of a request or a response. In my problematic case, I'm sending a periodic response to a single request. What if the response is a single large file? What if I reversed the situation and made periodic requests, but rarely sent a response? Both of these cases are valid, and common, especially with legacy protocols that don't do any form of in-band flow control (expecting TCP to handle it). In my case, I have no control over the TCP traffic, but I expect that if there's continuous traffic in either direction, the connection should remain open. TCP-level ACKs confirm the receipt of the data, which is itself a "response" of sorts and verifies the connection is still open and valid.

At this point, I've resorted to setting a 1 week backend timeout. Probably longer than most of my users' needs, but it seems to work.
 



Gerrit DeWitt

unread,
Jan 16, 2018, 3:22:14 PM
to gce-discussion
Hello Anand,

It sounds like you've solved your issue by increasing the backend service timeout, which is what I would have suggested.  I'm glad to hear that's working for you.  You may not need it set to one week, but, with experimentation, you can determine a value that meets your needs.

To answer your follow-up questions:

Follow-up Question 1:  ...[H]ow can I keep a TCP session open for hours when I set the backend timeout to 60 seconds?

If you keep the backend timeout set to 60 seconds, you'll need to make sure your backend instances (servers) are returning responses to the load balancer (to pass on to clients) in less than 60 seconds.  As soon as a response is returned, the session can once again remain open for the duration of the backend service timeout.  (More about what constitutes a “response” is discussed in the next answer.)

Another strategy, which it seems you've employed, is to set the backend service timeout to greater than the maximum amount of time you'd expect to keep the TCP session open.
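
For example, the backend service timeout can be raised programmatically along these lines (a rough sketch using the Compute Engine API client library for Python; the project and backend service names are placeholders):

    from googleapiclient import discovery

    # Rough sketch: patch the backend service's timeoutSec field.
    # One week = 604800 seconds, matching the workaround in this thread.
    compute = discovery.build('compute', 'v1')
    compute.backendServices().patch(
        project='my-project',              # placeholder
        backendService='my-tcp-backend',   # placeholder
        body={'timeoutSec': 604800},
    ).execute()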

Follow-up Question 2:  ...[W]hat denotes a "request" and "response" in the TCP proxy load balancer?

As we have both noted, the concept of a “response” is easily defined for HTTP traffic, but it's more complex for TCP traffic.

I think the best way to conceptualize this is to consider a TCP session as being open as long as the sequence number of the packets sent by the instance behind the load balancer is increasing within the time frame specified by the backend service (response) timeout.  A simple client SYN, server SYN-ACK, client ACK establishes a TCP session.  For the start of a new HTTP request, this brings the TCP sequence number of the server to 1, until the server sends more data to the client.  As the server sends more data, its TCP sequence number increases.

If the server's TCP sequence number doesn't increase, and a period of time greater than the backend service timeout elapses, I would expect the TCP session to be terminated by the load balancer.  In other words, the load balancer passes traffic for a session as long as TCP sequence numbers increase; those should be the “responses” necessary to “reset the counter.”  The TCP session should remain open as long as these continue in a periodic fashion where the period is less than the backend service timeout.  Also keep in mind that the load balancer is acting as a proxy for the three types we discussed here.
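
To make that concrete, under this model a backend that keeps emitting data at an interval shorter than the backend service timeout should keep the session alive; a hypothetical sketch (the values are illustrative, not configuration from this thread):

    import socket
    import time

    # Hypothetical backend sketch of the model above: each sendall() advances
    # the server's TCP sequence number, so as long as the interval stays below
    # the backend service timeout the session should be treated as live.
    BACKEND_SERVICE_TIMEOUT_SEC = 30  # assumed configured value
    SEND_INTERVAL_SEC = 10            # matches the "every 10 seconds" above
    assert SEND_INTERVAL_SEC < BACKEND_SERVICE_TIMEOUT_SEC

    def stream_log(conn: socket.socket) -> None:
        n = 0
        while True:
            conn.sendall(f'log line {n}\n'.encode())
            n += 1
            time.sleep(SEND_INTERVAL_SEC)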

This link [1] does an excellent job of explaining TCP sequence numbers with respect to HTTP, but the same principle applies for other types of traffic.  A good takeaway here is that a simple SYN-ACK from a server may not, by itself, be a “response.”

Does that help clear things up?

--Gerrit
