$upstream_response_time和[$request_time]时间较久，且在这段时间内，访问nginx所在主机连接被拒绝

Yi Li

unread,

Apr 24, 2019, 11:07:29 PM4/24/19

to openresty

大家好，线上运行的时候，nginx运行出现主题描述的问题，具体信息如下：

操作系统信息：

Red Hat Enterprise Linux Server release 6.5 (Santiago)

Linux inta 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

nginx.conf部分内容

worker_processes auto;

worker_rlimit_nofile 65535;

events {

worker_connections 65535;

}

log_format main '[$req_timestamp]-[$request_time]-[$status]-[$upstream_response_time]-[$upstream_status]';

upstream backend {

server 0.0.0.0;

balancer_by_lua_file lua/balance.lua;

keepalive 300;

}

location / {

access_log logs/access.log main;

rewrite_by_lua_file lua/rewrite.lua;

proxy_pass http://backend;

proxy_read_timeout 1600ms;

proxy_send_timeout 1600ms;

proxy_connect_timeout 500ms;

proxy_request_buffering off;

proxy_http_version 1.1;

proxy_set_header Connection "";

log_by_lua_file lua/log.lua;

}

rewrite.lua里主要是对多个共享内存的访问。

access.log如下：

log_format main '[$req_timestamp]-[$request_time]-[$status]-[$upstream_response_time]-[$upstream_status]';

$req_timestamp的值是在进入rewrite.lua时获得的时间

[2019-04-21 17:44:01.211]-[0.076]-[200]-[0.076]-[200]

[2019-04-21 17:44:01.216]-[35.435]-[504]-[35.435]-[504]

[2019-04-21 17:44:01.211]-[35.440]-[504]-[35.440]-[504]

[2019-04-21 17:44:01.181]-[35.470]-[200]-[35.470]-[200]

[2019-04-21 17:44:01.287]-[35.364]-[504]-[35.364]-[504]

[2019-04-21 17:44:01.260]-[35.391]-[504]-[35.391]-[504]

[2019-04-21 17:44:01.268]-[35.383]-[499]-[-]-[-]

[2019-04-21 17:44:01.265]-[35.386]-[504]-[35.386]

[2019-04-21 17:44:01.244]-[35.407]-[200]-[35.407]-[200]

[2019-04-21 17:44:01.160]-[35.489]-[200]-[35.489]-[200]

[2019-04-21 17:44:01.218]-[35.431]-[200]-[35.431]-[200]

[2019-04-21 17:44:01.262]-[35.390]-[504]-[35.390]-[504]

[2019-04-21 17:44:01.211]-[0.088]-[200]-[0.088]-[200]

[2019-04-21 17:44:01.214]-[35.440]-[504]-[35.440]-[504]

[2019-04-21 17:44:01.216]-[35.438]-[504]-[35.438]-[504]

[2019-04-21 17:44:01.267]-[35.387]-[504]-[35.387]-[504]

[2019-04-21 17:44:01.283]-[35.371]-[504]-[35.371]-[504]

[2019-04-21 17:44:01.299]-[35.355]-[504]-[35.355]-[504]

[2019-04-21 17:44:01.254]-[35.397]-[504]-[35.397]-[504]

略去一些17:44:01时间点日志信息

[2019-04-21 17:44:36.657]-[0.000]-[499]-[-]-[-]

[2019-04-21 17:44:36.656]-[0.000]-[502]-[0.000]-[502]

[2019-04-21 17:44:36.657]-[0.000]-[499]-[-]-[-]

[2019-04-21 17:44:36.651]-[0.000]-[502]-[0.000]-[502]

[2019-04-21 17:44:36.657]-[0.000]-[502]-[0.000]-[502]

在17:44:02-17:44:35 时间内没有日志信息，这段使劲内客户端访问nginx，提示连接被拒绝。

日志的错误信息如下：

epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while connecting t

o upstream--------->499错误原因

recv() failed (104: Connection reset by peer) while sending to client --------->？？？

upstream prematurely closed connection while reading response header from upstream--------->504错误原因

upstream timed out (110: Connection timed out) while reading response header from upstream--------->502错误原因

程序大部分时间运行正常，偶尔出现一段不翼而飞的几十秒。

有种想法是代码阻塞在哪里，导致nginx一直不去accept socket连接，导致TCP队列满了，连接拒绝。

可是分析代码，并没有什么耗时的逻辑，rewrite.lua里的竞争也只是共享内存的使用。

大家有什么想法吗？

tokers

unread,

Apr 24, 2019, 11:31:06 PM4/24/19

to openresty

Hello!

如果怀疑哪部分代码运行的时候持续占用过多的 CPU 时间，

可以考虑使用 OpenResty 官方提供的 systemtap 工具集抓几张火焰图

观察分析下热点代码路径。不过阻塞几十秒这个比较夸张了，除非是 Lua

代码里出现了一些 bug 导致的。

> 导致TCP队列满了，连接拒绝。

backlog 配置了多少？另外有观察过半连接队列长度吗？

另外可以看下 tcp_abort_on_overflow 这个参数是不是 1，

如果不是 1，那么 reset 应该不是因为连接队列满了引起的。

从你的日志来看，耗时基本都是在等后端，504 很多

而且时间都接近 35s，这个很有规律，是否是你的后端配置的某种超时时间？

Yi Li

unread,

Apr 25, 2019, 1:20:28 AM4/25/19

to openresty

Hello，谢谢你的回复~

>backlog 配置了多少？另外有观察过半连接队列长度吗？

nginx所在操作系统参数；

半连接长度：net.ipv4.tcp_max_syn_backlog 2048

全连接长度：net.core.somaxconn 3000

半连接队列长度实时变化还没有观察过，接下来会关注这个事情。

->tcp_abort_on_overflow 这个参数是不是 1，如果不是 1，那么 reset 应该不是因为连接队列满了引起的

nginx侧：tcp_abort_on_overflow=0，所以在客户端侧并没有看到Connection reset by peer。

recv() failed (104: Connection reset by peer) while sending to client 这里的reset应该是被上游服务器reset

生产环境里上游服务器的信息看不到，有个疑问是这里被上游服务器reset，有没有可能是因为

upstream backend {

server 0.0.0.0;

balancer_by_lua_file lua/balance.lua;

keepalive 300;

}这里空闲的长连接保持的个数比较多，导致新的连接建立不了？

->从你的日志来看，耗时基本都是在等后端，504 很多

而且时间都接近 35s，这个很有规律，是否是你的后端配置的某种超时时间？

这个间歇出现的情况还有30秒，25秒，24秒。在一次出现的过程中时间一致。

在 2019年4月25日星期四 UTC+8上午11:31:06，tokers写道：

tokers

unread,

Apr 25, 2019, 1:36:49 AM4/25/19

to openresty

> nginx所在操作系统参数；

> 半连接长度：net.ipv4.tcp_max_syn_backlog 2048

> 全连接长度：net.core.somaxconn 3000

nginx 配置的是多少？默认只有 512。

> recv() failed (104: Connection reset by peer) while sending to client 这里的reset应该是被上游服务器reset

> 生产环境里上游服务器的信息看不到，有个疑问是这里被上游服务器reset，有没有可能是因为

> upstream backend {

> server 0.0.0.0;

> balancer_by_lua_file lua/balance.lua;

> keepalive 300;

> }这里空闲的长连接保持的个数比较多，导致新的连接建立不了？

有观察过真实场景下 nginx 和后端维持的连接数量吗？可能还是要去分析下后端的一些配置。

Yi Li

unread,

Apr 25, 2019, 1:47:34 AM4/25/19

to openresty

Hello,

补充一下刚刚的回复

通过netstat -s | egrep "listen|LISTEN" 命令发现ngixn所在服务器，出现了半连接队列和全连接队列溢出的情况，

猜测这是客户端连接nginx，连接被拒绝的原因。

而之所以出现溢出，是因为nginx有35秒一直没有accept连接？

另外，backlog的值，nginx默认是511，这里nginx服务器所在的服务器也是511