Nginx as Load Balancer Connection Issues

gtuhl

Jan 6, 2012, 4:49:16 PM
to ng...@nginx.org
We have a box running nginx and two boxes running apache. The apache
boxes are configured as an upstream for nginx.

The nginx box has a public IP, and then it talks to the upstream apaches
using the private network (same switch). We are sustaining a couple
hundred requests/sec.

We've had several issues with the upstreams being marked down by nginx,
causing "no live upstreams" messages in the error log and end users
seeing 502 errors. When this happens the machines are barely being
used: single-digit load averages on 16-core boxes.

Initially we were seeing a ton of "connect() failed (110: Connection
timed out)" errors, one every couple of seconds. I added these to
sysctl.conf and that seemed to solve the problem:

net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 20480
net.core.netdev_max_backlog = 4096
net.ipv4.tcp_max_tw_buckets = 400000
net.core.somaxconn = 4096
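
For anyone wanting to replicate this: the values go in /etc/sysctl.conf and
can be loaded without a reboot, roughly:

sysctl -p                    # reload /etc/sysctl.conf
sysctl net.core.somaxconn    # spot-check that a value actually took effect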

Now things generally run fine, but every once in a while we get a huge
burst of "upstream prematurely closed connection while reading response
header from upstream" followed by a "no live upstreams". Again, there is
no apparent load on the machines involved, and these bursts only last a
minute or so. We also still get an occasional "connect() failed (110:
Connection timed out)", but they are far less frequent, perhaps one or
two per hour.

Anyone have recommendations for tuning the networking side to improve
the situation here? These are some of the nginx.conf settings we have
in place; I've removed the ones that don't seem related to the issue:

worker_processes 4;
worker_rlimit_nofile 30000;

events {
    worker_connections 4096;
    # multi_accept on;
    use epoll;
}

http {
    client_max_body_size 200m;

    proxy_read_timeout 600s;
    proxy_send_timeout 600s;
    proxy_connect_timeout 60s;

    proxy_buffer_size 128k;
    proxy_buffers 4 128k;

    keepalive_timeout 0;
    tcp_nodelay on;
}

Happy to provide any other details. This is the "ulimit -a" output on
all boxes:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 20
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 300000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,220894,220894#msg-220894

gtuhl

Jan 23, 2012, 6:00:20 PM
to ng...@nginx.org
gtuhl Wrote:
-------------------------------------------------------

> Initially we were seeing a ton of "connect() failed (110: Connection
> timed out)" errors, one every couple of seconds. I added these to
> sysctl.conf and that seemed to solve the problem:
>
> net.ipv4.tcp_syncookies = 1
> net.ipv4.tcp_fin_timeout = 20
> net.ipv4.tcp_max_syn_backlog = 20480
> net.core.netdev_max_backlog = 4096
> net.ipv4.tcp_max_tw_buckets = 400000
> net.core.somaxconn = 4096
>
> Now things generally run fine, but every once in a while we get a huge
> burst of "upstream prematurely closed connection while reading response
> header from upstream" followed by a "no live upstreams". Again, there
> is no apparent load on the machines involved, and these bursts only
> last a minute or so. We also still get an occasional "connect() failed
> (110: Connection timed out)", but they are far less frequent, perhaps
> one or two per hour.

On looking at this again recently, we made two adjustments that
eliminated the connection issues completely:

net.nf_conntrack_max = 262144
net.ipv4.ip_local_port_range = 1024 65000
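
For anyone hitting the same wall: current conntrack usage can be compared
against the limit with something like

sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

and when the table overflows the kernel typically logs a "table full,
dropping packet" warning visible in dmesg.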

After making those two changes things became quite stable. However, we
still have massive numbers of TIME_WAIT connections both on the nginx
machine and on the upstream apache machines.

The nginx machine is accepting roughly 1000 requests/s, and has 40,000
connections in TIME_WAIT.
The apache machines are each accepting roughly 250 requests/s, and have
15,000 connections in TIME_WAIT.
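
For anyone wanting to check their own boxes, a quick way to get that count
is something like:

netstat -ant | grep -c TIME_WAIT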

We tried setting net.ipv4.tcp_tw_reuse to 1 and restarting networking.
That did not cause any trouble, but also didn't drop the TIME_WAIT
count. I have read that net.ipv4.tcp_tw_recycle is dangerous but we may
try that if others have had good experiences.
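
(For what it's worth, tcp_tw_reuse can also be flipped at runtime without
restarting anything: sysctl -w net.ipv4.tcp_tw_reuse=1)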

Is there a way to have these cleaned up more quickly? My concern is
that even with the expanded ip_local_port_range 40k is cutting it rather
close. Before we bumped ip_local_port_range the whole system was
falling down right as the TIME_WAIT count approached 32k. Is it normal
for nginx to cause this many TIME_WAIT connections? If we're only doing
1k requests/s and nearly exhausting the available port range what would
sites with heavier volume do?

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,220894,221550#msg-221550

ggrensteiner

Jan 24, 2012, 12:59:54 PM
to ng...@nginx.org
net.ipv4.tcp_tw_recycle = 1

is what you're looking for

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,220894,221583#msg-221583

Andrey Korolyov

Jan 24, 2012, 1:12:56 PM
to ng...@nginx.org

This may cause trouble if multiple clients are trying to reach the server
through the same NAT, so be careful. I have had negative experiences with
it even at ~10 HTTP requests/min from a NATed machine.

gtuhl

Jan 24, 2012, 1:23:38 PM
to ng...@nginx.org
Andrey Korolyov Wrote:
-------------------------------------------------------

> On Tue, Jan 24, 2012 at 9:59 PM, ggrensteiner <nginx...@nginx.us> wrote:
> > net.ipv4.tcp_tw_recycle = 1
> >
> > is what you're looking for
>
> This may cause trouble if multiple clients are trying to reach the server
> through the same NAT, so be careful. I have had negative experiences with
> it even at ~10 HTTP requests/min from a NATed machine.

This is what I had read everywhere as well, so I've been hesitant to try
it. We definitely have a lot of users that would be hitting our servers
from the same building/NAT.

Has anyone tried using "net.ipv4.tcp_tw_reuse = 1" in a larger
connection count environment before?

I have it enabled now, but it did not seem to have any impact on the
number of TIME_WAIT connections. Does it wait until it actually needs
to reuse one (due to port exhaustion) before doing so? Or should it be
keeping the number lower?

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,220894,221587#msg-221587

ggrensteiner

Jan 25, 2012, 6:14:43 PM
to ng...@nginx.org
Have you tried using HTTP/1.1 keepalive connections from nginx to
apache? They became available in nginx 1.1.4 and will reuse sockets
rather than closing them and leaving them in TIME_WAIT.

Be sure to remember to turn on keepalive in your apache config as well.

http://nginx.org/en/docs/http/ngx_http_upstream_module.html
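
Roughly, it looks like this on the nginx side (the upstream name, addresses,
and connection count below are just placeholders, not anything from your
actual config):

upstream apache_backend {
    server 10.0.0.11:80;
    server 10.0.0.12:80;
    keepalive 32;                       # idle keepalive connections cached per worker
}

server {
    location / {
        proxy_pass http://apache_backend;
        proxy_http_version 1.1;         # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection ""; # don't pass "Connection: close" to the backend
    }
}

On the apache side that usually means KeepAlive On, with KeepAliveTimeout and
MaxKeepAliveRequests set high enough that apache doesn't drop idle
connections out from under nginx.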

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,220894,221646#msg-221646

Rami Essaid

Jan 25, 2012, 6:21:54 PM
to ng...@nginx.org
Out of curiosity, why would it keep the connection in TIME_WAIT if it is closing it?

gtuhl

Mar 20, 2012, 5:33:44 PM
to ng...@nginx.org
I'm thinking about giving the development version with the upstream
keepalive over http 1.1 a try.

Are people using that version in production? Is there a release
schedule/estimate anywhere that indicates when that feature might
trickle over to stable?

We're using nginx heavily in a pretty vanilla load-balancer role: in
front of apache servers, with SSL termination in nginx, and that's it in
terms of the features we are using.

It's worked fantastically well overall; we're just flirting with the
ephemeral port limit on a few of our sites (we've worked around it by
setting up multiple A records pointed at multiple nginx pairs). If we
could get keepalive connections between nginx and the upstream apaches,
I believe we would be in very good shape and could keep our
configuration simple moving forward.

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,220894,224118#msg-224118

Alexandr Gomoliako

Mar 20, 2012, 5:42:20 PM
to ng...@nginx.org
On Tue, Mar 20, 2012 at 11:33 PM, gtuhl <nginx...@nginx.us> wrote:
> I'm thinking about giving the development version with the upstream
> keepalive over http 1.1 a try.
>
> Are people using that version in production?  Is there a release
> schedule/estimate anywhere that indicates when that feature might
> trickle over to stable?

According to their roadmap -- in 6 days :)
http://trac.nginx.org/nginx/roadmap

David Yu

Mar 20, 2012, 5:46:30 PM
to ng...@nginx.org
On Thu, Jan 26, 2012 at 7:21 AM, Rami Essaid <rami....@gmail.com> wrote:
> Out of curiosity, why would it keep the connection in TIME_WAIT if it is
> closing it?

+1. Also, if the connection is closed, why is the upstream (apache) in
TIME_WAIT as well?

On Wednesday, January 25, 2012 at 5:14 PM, ggrensteiner wrote:
> Have you tried using HTTP/1.1 keepalive connections from nginx to
> apache? They became available in nginx 1.1.4 and will reuse sockets
> rather than closing them and leaving them in TIME_WAIT.
>
> Be sure to remember to turn on keepalive in your apache config as well.

--
When the cat is away, the mouse is alone.
- David Yu

gtuhl

Mar 21, 2012, 8:56:54 AM
to ng...@nginx.org
Alexandr Gomoliako Wrote:
-------------------------------------------------------

> On Tue, Mar 20, 2012 at 11:33 PM, gtuhl <nginx...@nginx.us> wrote:
> > I'm thinking about giving the development version with the upstream
> > keepalive over http 1.1 a try.
> >
> > Are people using that version in production? Is there a release
> > schedule/estimate anywhere that indicates when that feature might
> > trickle over to stable?
>
> According to their roadmap -- in 6 days :)
> http://trac.nginx.org/nginx/roadmap

This is excellent news. Also, apologies for somehow missing this page;
it was exactly what I was looking for.

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,220894,224171#msg-224171

gtuhl

Mar 28, 2012, 10:27:36 AM
to ng...@nginx.org
Looks like that was for the 1.1.18 development release. Is this what
will become the 1.2.0 stable in a couple weeks? Seems I'll need to wait
for that one to get http 1.1 keepalive upstreams in stable.

gtuhl Wrote:
-------------------------------------------------------

> Alexandr Gomoliako Wrote:
> -------------------------------------------------------
> > On Tue, Mar 20, 2012 at 11:33 PM, gtuhl <nginx...@nginx.us> wrote:
> > > I'm thinking about giving the development version with the upstream
> > > keepalive over http 1.1 a try.
> > >
> > > Are people using that version in production? Is there a release
> > > schedule/estimate anywhere that indicates when that feature might
> > > trickle over to stable?
> >
> > According to their roadmap -- in 6 days :)
> > http://trac.nginx.org/nginx/roadmap
>
> This is excellent news. Also, apologies for somehow missing this page;
> it was exactly what I was looking for.

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,220894,224560#msg-224560
