UDP network load balancing issues with response / outbound traffic.


Simon Morley

Jul 24, 2014, 1:26:53 PM7/24/14
to gce-dis...@googlegroups.com
We've created a UDP load balancer for our radius servers using a pretty standard set of rules. Ports 1812-1813 (UDP) are forwarded to the instances.

The inbound traffic is received ok and I see it hitting the radius server.

It's expected then that the radius server responds (with an Access-Accept packet for example). 

The latter doesn't seem to work - the server loops continuously and then fails.

If I remove the load-balancer and hit the server's external IP, I can see the packet is sent and received correctly.

Is there something I need to do on the load balancer side of things to facilitate this?

Marilu

Jul 24, 2014, 1:47:31 PM7/24/14
to gce-dis...@googlegroups.com
Hello Simon,

Run a tcpdump on your host and see how the connection from 169.254.169.254 is closed (this is the source IP of the health check). As you said, the server is expected to respond with an Access-Accept, but if you see an [R] flag, that means the connection is not being closed properly, and therefore your instance is considered unhealthy.

Marilu

Simon Morley

Jul 24, 2014, 2:15:20 PM7/24/14
to gce-dis...@googlegroups.com
Hey Marilu, thanks for the reply.

I've run that and can see [R] appearing:

18:07:25.081860 IP metadata.google.internal.51118 > 43.7.148.146.bc.googleusercontent.com.http: Flags [S], seq 96844179, win 8096, options [mss 1024], length 0
18:07:25.081897 IP 43.7.148.146.bc.googleusercontent.com.http > metadata.google.internal.51118: Flags [S.], seq 2239516008, ack 96844180, win 14200, options [mss 1420], length 0
18:07:25.081993 IP metadata.google.internal.51118 > 43.7.148.146.bc.googleusercontent.com.http: Flags [.], ack 1, win 8096, length 0
18:07:25.082016 IP metadata.google.internal.51118 > 43.7.148.146.bc.googleusercontent.com.http: Flags [P.], seq 1:39, ack 1, win 8096, length 38
18:07:25.082023 IP 43.7.148.146.bc.googleusercontent.com.http > metadata.google.internal.51118: Flags [.], ack 39, win 14200, length 0
18:07:25.082248 IP 43.7.148.146.bc.googleusercontent.com.http > metadata.google.internal.51118: Flags [P.], seq 1:238, ack 39, win 14200, length 237
18:07:25.082403 IP metadata.google.internal.51118 > 43.7.148.146.bc.googleusercontent.com.http: Flags [R.], seq 39, ack 238, win 8096, length 0

But the console says the host is healthy.

So, I've run curl in a loop and the output is as follows:

curl: (28) Connection timed out after 1001 milliseconds
www
curl: (28) Connection timed out after 1000 milliseconds
www
curl: (28) Connection timed out after 1000 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
www
www
curl: (28) Connection timed out after 1000 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
curl: (28) Connection timed out after 1001 milliseconds

I was affected by the issue this afternoon but thought that was resolved.

What can I do to fix this?

Marilu

Jul 24, 2014, 3:53:45 PM7/24/14
to gce-dis...@googlegroups.com
Hi Simon,

Was the tcpdump done on your VM or against the IP of the load balancer?
When you mention 'But the console says the host is healthy', are you referring to your VM or to the Load Balancer?

The results that you provided seem to be from your VM. Can you run this: sudo tcpdump -A -n host your-IP-for-load-balance

If you see a result like 
19:49:29.986581 IP 169.254.169.254.49634 > 23.236.169.236.80: Flags [R.], seq 41, ack 834, win 8096, length 0

It means that your connection is not being closed properly, and you need to find a way to force it to close properly. Even if your instance is healthy, if this connection is not closed properly you will end up with an unhealthy load balancer.

Marilu

Simon Morley

Jul 24, 2014, 6:02:50 PM7/24/14
to gce-dis...@googlegroups.com
Ok, that makes more sense now. Was a little confused.

I meant the health check in the target pools section says the instance is passing the health check and is part of the cluster.

I've rerun it as suggested and the output looks more logical:

22:56:43.431339 IP 146.148.7.43.80 > 169.254.169.254.51797: Flags [.], ack 39, win 14200, length 0
E
..(.9@.@......+.....P.Ur..A..`.P.7x....
22:56:43.431506 IP 146.148.7.43.80 > 169.254.169.254.51797: Flags [P.], seq 1:258, ack 39, win 14200, length 257
E..).:@.@......+.....P.Ur..A..`
.P.7x....HTTP/1.1 200 OK
Server: nginx/1.6.0
Date: Thu, 24 Jul 2014 21:56:43 GMT
Content-Type: text/html
Content-Length: 22
Last-Modified: Thu, 24 Jul 2014 18:26:45 GMT
Connection: keep-alive
ETag: "53d14fe5-16"
Accept-Ranges: bytes


Hello from Debian 7.6




22:56:43.431622 IP 169.254.169.254.51797 > 146.148.7.43.80: Flags [R.], seq 39, ack 258, win 8096, length 0
E
..(c.....jc.......+.U.P..`.r..BP.......
22:56:48.431885 IP 169.254.169.254.51798 > 146.148.7.43.80: Flags [S], seq 115434630, win 8096, options [mss 1024], length 0
E..,c.....j^.......+.V.P..d.....`
...Vp......
22:56:48.431920 IP 146.148.7.43.80 > 169.254.169.254.51798: Flags [S.], seq 2076666232, ack 115434631, win 14200, options [mss 1420], length 0
E
..,..@.@.M....+.....P.V{.ix..d.`.7x........
22:56:48.432024 IP 169.254.169.254.51798 > 146.148.7.43.80: Flags [.], ack 1, win 8096, length 0
E..(c.....ja.......+.V.P..d.{.iyP....(..
22:56:48.432057 IP 169.254.169.254.51798 > 146.148.7.43.80: Flags [P.], seq 1:39, ack 1, win 8096, length 38
E..Nc.....j:.......+.V.P..d.{.iyP...i%..GET / HTTP/1.1
Host: 146.148.7.43



Are you saying the health check isn't getting closed properly or the radius request isn't being closed properly?

Sorry for being a bit slow - I'm a little confused where the problem is: radius, lb or nginx.

S

Simon Morley

Jul 25, 2014, 9:03:33 AM7/25/14
to gce-dis...@googlegroups.com
Hi again

Have spoken with the radius peeps and they've said that, since UDP traffic isn't 'closed' as such, it sounds like there's a load-balancer problem.

I'm therefore assuming the answer to my last question is that the HTTP health check is failing and, as you said, the server's being removed?

But this doesn't make sense - we have a simple nginx server responding to the health check.

It really looks like our UDP traffic outbound is just blocked. Have just tested on a number of clients and do not ever get the reply.

How can I troubleshoot this further?


Marilu

Jul 25, 2014, 11:26:12 AM7/25/14
to gce-dis...@googlegroups.com
Hello Simon,

From the information provided, it seems like your load balancer is working just fine. I assume that you have created a firewall rule allowing UDP traffic to your instance, since you mention that you can see the traffic being received on your server.
I'm looking into what could be causing the outbound traffic to be 'blocked'.

Marilu

Simon Morley

Jul 25, 2014, 11:44:20 AM7/25/14
to gce-dis...@googlegroups.com
Hi

Yeah, the server works fine on its own - all the firewall rules look ok. Have successfully authenticated against it.

The problems are only when I'm authenticating through the load balancer.

The debug logs on the radius server in both scenarios are almost identical. The last entry is the radius server sending back the UDP access-accept packet to the source IP. 

On the standalone server, I can see this received on the client. Behind the load-bal. it never arrives.

S

Marilu

Jul 25, 2014, 1:38:07 PM7/25/14
to gce-dis...@googlegroups.com
Hi Simon,

When referring to the instance behind the load balancer (LB), is it the same instance that you're accessing directly? Or do you have 2 separate servers? One behind the LB and one not?

When sending traffic through the LB, the instance replies directly; traffic is not returned through the LB, so this shouldn't affect you.

Can you compare the traffic that is being received by the instance behind the LB and the traffic that is being received directly? Is it different in any way?

Marilu

Simon Morley

Jul 25, 2014, 2:25:48 PM7/25/14
to gce-dis...@googlegroups.com
Oh, that's odd then.

Yes, we have removed all servers except one which we're using to test. I have tested with two but not right now.

If I authenticate against the ephemeral ip, all is well. 

If I authenticate against the lb ip, all looks well on the radius but the packet is not recv. on the client.

The logs on the server look the same for each request.

Like I said, behind the lb, we don't get the access-accept packet.

Thanks again for your help :) I hope I make sense!

Jeff Kingyens

Jul 26, 2014, 1:32:34 AM7/26/14
to gce-dis...@googlegroups.com
I think I'm seeing similar issues when attempting UDP requests through a load balancer originating from behind the load balancer. I have a 53/udp DNS server setup on each instance as well as a DNS load balancer. If I send a DNS request directly to another instance's ip, I get a valid DNS response. However, if I send a DNS request from an instance through the load balancer, dig reports an error:

dig google.com @23.251.159.188

;; reply from unexpected source: 172.17.42.1#53, expected 23.251.159.188#53

It seems UDP packets are having a hard time going through the load balancer when originating from behind it. Is this a similar issue to what you are seeing?

Simon Morley

Jul 26, 2014, 9:24:47 AM7/26/14
to gce-dis...@googlegroups.com
Yes, that's the same issue I'm seeing. Apparently, replies shouldn't be sent back through the LB on the way out (see further up the thread). However, we've tested behind and directly, and it only fails behind. The packet is sent but not received.

Have you tested without the LB to prove it works?

Simon Morley

Jul 26, 2014, 11:39:58 AM7/26/14
to gce-dis...@googlegroups.com
Ok so I've run tcpdump on both the server and client when authenticating in the following scenarios:

1. Server without load balancer
2. Server behind load balancer

Here's a copy of the dumps:

User Standalone

On the server:

16:31:25.496575 IP 213.205.227.181.45985 > 10.240.66.188.1812: RADIUS, Access Request (1), id: 0x2d length: 152
........3..
.B..........-...U.}....j....;d...RgRlns...x.D;8.q5.r.7.SK &4f57ce9e-ff1c-4601-8003-4d666cc372ba..
...=.......11-A4-3C-4C-1C-84..24-A4-3C-4C-1C-85P....ru.~.{....6|.
16:31:26.498120 IP 10.240.66.188.1812 > 213.205.227.181.45985: RADIUS, Access Reject (3), id: 0x2d length: 68
E
..`....@...
.B..........L...-.DUv....._$.u..F...0Your maximum never usage time has been reached


And the response on client

16:32:46.754671 IP 172.20.10.2.51292 > 130.211.82.107.1812: RADIUS, Access Request (1), id: 0x2d length: 152
.r@..d<.......E...*/..@.....
...Rk.\.......-.........Y..V.f.....RgRlns.....A>..dK{..87Y. &4f57ce9e-ff1c-4601-8003-4d666cc372ba..
...=.......11-A4-3C-4C-1C-84..24-A4-3C-4C-1C-85P...=(..27........
16:32:48.234361 IP 130.211.82.107.1812 > 172.20.10.2.51292: RADIUS, Access Reject (3), id: 0x2d length: 68
<......r@..d..E..`....+./^..Rk..
....\.L...-.D.7||..h.\t;..=*V.0Your maximum never usage time has been reached


And then behind the load balancer

On the server:

16:29:43.306198 IP 213.205.227.181.59297 > 146.148.7.43.1812: RADIUS, Access Request (1), id: 0x2b length: 152
E
`...3..1..c.......+.........+....R.M..:.w.hnt....RgRlns...s.Hk....B..l.d. &4f57ce9e-ff1c-4601-8003-4d666cc372ba..
...=.......11-A4-3C-4C-1C-84..24-A4-3C-4C-1C-85P..D.\..<.?1!.gI+.
16:29:44.307406 IP 10.240.66.188.1812 > 213.205.227.181.59297: RADIUS, Access Reject (3), id: 0x2b length: 68
E..`
....@...
.B..........L...+.D...O..wg9..59*V..0Your maximum never usage time has been reached



And on the client:

16:33:56.385748 IP 172.20.10.2.52176 > 146.148.7.43.1812: RADIUS, Access Request (1), id: 0x2d length: 152
.r@..d<.......E...[...@.....
....+.........-..`}........,...._..RgRlns..}...4B..]/...M.. &4f57ce9e-ff1c-4601-8003-4d666cc372ba..
...=.......11-A4-3C-4C-1C-84..24-A4-3C-4C-1C-85P..a.....4.EM}..t.


You can clearly see the last packet is sent but never received, only when using the load balancer.

16:32:48.234361 IP 130.211.82.107.1812 > 172.20.10.2.51292: RADIUS, Access Reject (3), id: 0x2d length: 68
<......r@..d..E..`....+./^..Rk..
....\.L...-.D.7||..h.\t;..=*V.0Your maximum never usage time has been reached

Jeff Kingyens

Jul 26, 2014, 5:13:55 PM7/26/14
to gce-dis...@googlegroups.com
Yes, it works fine without the load balancer:

 dig google.com @146.148.44.204

; <<>> DiG 9.9.5-3-Ubuntu <<>> google.com @146.148.44.204

;; global options: +cmd

;; Got answer:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64112

;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1

..

Marilu

Jul 28, 2014, 10:22:03 AM7/28/14
to gce-dis...@googlegroups.com
Hello Simon and Jeff,

When you defined your target pool for the LB, did you define 'sessionAffinity' for it?

If not, would creating a new target pool with the 'sessionAffinity' attribute set make a difference in the LB?
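(For reference, session affinity is chosen when the target pool is created. A sketch with today's gcloud CLI; the pool name and region below are placeholders, not values from this thread:)

```
# Hypothetical pool name and region; CLIENT_IP pins each client
# to the same backend instance for all its connections.
gcloud compute target-pools create radius-pool \
    --region us-central1 \
    --session-affinity CLIENT_IP
```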

Marilu
 

Marilu

Jul 28, 2014, 2:22:42 PM7/28/14
to gce-dis...@googlegroups.com
Hello Simon,

As per the tcpdump you ran on the server, I notice that the issue may be caused by the way your Radius server is responding when behind the LB.

On the server without the LB, your client IP (213.205.227.181) is connecting to the internal IP of your instance, 10.240.66.188, so the reply is returned from your internal IP.

In the case behind the LB, your client is connecting to the external IP address of your instance, 146.148.7.43; however, the reply is coming back from your internal IP. Since a session hasn't been opened previously with this internal IP, the connection is not allowed.

Can you configure your Radius server to bind to the public IP address instead of the internal IP? This way the connection pattern will always use the external IP.
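If the server is FreeRADIUS (an assumption - the thread never names the implementation), the bind address is set in a listen section of radiusd.conf. A sketch using the LB-facing IP from this thread:

```
# Sketch only: FreeRADIUS is assumed, not confirmed in the thread.
listen {
        type = auth
        ipaddr = 146.148.7.43   # the load-balanced IP, so replies use it as source
        port = 1812
}
```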

Marilu



 

Simon Morley

Jul 28, 2014, 6:15:32 PM7/28/14
to gce-dis...@googlegroups.com
You are one smart person :)

The smallest change and all is well. Thank you for taking the time to help me.

Jeff Kingyens

Jul 28, 2014, 8:38:18 PM7/28/14
to gce-dis...@googlegroups.com
Creating a new target pool with sessionAffinity set and assigning the udp/53 load balancer to this new target pool still does not fix the problem. 

However, I have been able to isolate the issue a bit further. It seems everything works fine when the DNS client request (from dig) originates from a network interface attached directly to the GCE private network (assigned an ip address in the private network address space). The problem occurs when the DNS client request from dig originates from a virtual network interface attached to a virtual network bridge (in this case, created by docker):

works fine with docker host networking (--net=host)
 
core@core0 ~ $ docker run -t -i --net=host jkingyens/dig /bin/bash
root@core0:/# dig google.com @23.251.159.188
; <<>> DiG 9.9.5-3-Ubuntu <<>> google.com @23.251.159.188
;; global options: +cmd
;; Got answer:
....
 
However, it doesn't work with docker bridge networking (the default):
 
core@core0 ~ $ docker run -t -i jkingyens/dig /bin/bash
root@f1d775d5a6af:/# dig google.com @23.251.159.188
;; reply from unexpected source: 172.17.42.1#53, expected 23.251.159.188#53
;; reply from unexpected source: 172.17.42.1#53, expected 23.251.159.188#53
;; reply from unexpected source: 172.17.42.1#53, expected 23.251.159.188#53

Marilu

Jul 29, 2014, 1:20:45 PM7/29/14
to gce-dis...@googlegroups.com
Hello Jeff,

From that error message, reply from unexpected source: 172.17.42.1#53, expected 23.251.159.188#53: when using the virtual network bridge, your DNS server is replying from your internal IP 172.17.42.1 when the reply is expected from your external IP 23.251.159.188.
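The pattern behind both the RADIUS and DNS symptoms can be shown with plain UDP sockets: a client only accepts a reply whose source address matches the address it queried, so the server has to send from exactly that IP. A minimal local sketch (loopback only; the addresses are illustrative, not the thread's real IPs):

```python
import socket

# Server explicitly bound to one address: replies go out stamped
# with that address, which is what the client expects.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))   # ephemeral port on loopback
addr = server.getsockname()     # the address clients will target

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"query", addr)
payload, peer = server.recvfrom(1024)
server.sendto(b"reply", peer)   # reply leaves from `addr`

data, source = client.recvfrom(1024)
# dig's "reply from unexpected source" check is essentially this:
print(source == addr)  # True: reply source matches the queried address
```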

Marilu