Lots of connections left in CLOSE_WAIT


Simon Roberts

Oct 5, 2015, 5:35:28 AM
to Fluentd Google Group
Hi team,

I've got a couple of fluentd servers that accept incoming connections from an AWS ELB (load balancer), with a couple of dozen servers beyond that.  This worked fine until a couple of weeks ago, when it started to fail.  Looking at the server, it has ~1000 connections stuck in CLOSE_WAIT, which is causing Linux to overflow the listen queue and drop incoming TCP connections.

# netstat -a | awk '/24224/ {print $6}' | sort | uniq -c
      1
   1026 CLOSE_WAIT
      1 LISTEN
      3 SYN_RECV

#  nstat -a | grep -i listen
TcpExtListenOverflows           3832357            0.0
TcpExtListenDrops               3832357            0.0

The listener itself is literally the default, TCP 24224:

<source>
  type forward
</source>

I've tried the network setting changes recommended at http://docs.fluentd.org/articles/before-install, and checked that they are currently applied, as well as a few others that I've since commented out:

#net.ipv4.tcp_max_syn_backlog = 4096
#net.ipv4.tcp_syncookies = 1
#net.core.somaxconn = 1024
#net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240    65535

But no help.  I think the incoming traffic is semi-genuine, but the server is now getting hammered by TCP retries as well as the original traffic.  Another listener continues to work even as :24224 fails, so the problem seems to be port/listener specific.  CPU usage is low.  This is getting to be a bit of a problem for me - I introduced fluentd, which is now apparently causing a problem :(


Mr. Fiber

Oct 5, 2015, 9:21:46 AM
to Fluentd Google Group
Hi Simon,

That's weird. I checked one fluentd instance behind an ELB, and CLOSE_WAIT stays at 0 or 1 in an environment with 20-40 established connections.
I've heard of this situation before, but the reporter never replied to us, so we stopped the investigation.
A number of users run fluentd behind an ELB without hitting this problem, so you may have hit a corner case.

Could you tell me more details? I want to reproduce this problem.

- ELB setting
- What do the clients use to send events to fluentd? out_forward or another application?
- Connection stats: how many connections reach in_forward via the ELB?

To mitigate the problem, changing the tcp_keepalive_xxx parameters may help.
To investigate it, compare the `close socket` messages in the fluentd log (with 'log_level debug')
against the number of FIN packets seen in tcpdump.
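
As a rough starting point for that comparison, something like this small Ruby script can tally the accept/close lines (just a sketch - the log path and exact message text are assumptions based on in_forward's debug output, so adjust them to your setup):

    # Sketch: count in_forward accept/close debug lines in the td-agent log,
    # so the totals can be compared with the FIN packets counted in a tcpdump
    # capture. Log path and message text are assumptions.
    accepted = closed = 0
    File.foreach('/var/log/td-agent/td-agent.log') do |line|
      accepted += 1 if line.include?('accepted fluent socket')
      closed   += 1 if line.include?('closed socket')
    end
    puts "accepted=#{accepted} closed=#{closed} (still open=#{accepted - closed})"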


Masahiro



Simon Roberts

Oct 5, 2015, 2:10:20 PM
to Fluentd Google Group
Thanks for your reply.  I will provide all the details I can (the problem is likely to happen again in 3 hours, when all the client machines are started for the day).

The ELB normally has two m3.medium Amazon Linux servers behind it, and is a simple TCP ELB with two listeners/ports (one native fluentd on 24224, and one additional on 24223 for Windows/JSON - see below). The health check is:

Ping Target: TCP:24224
Timeout: 15 seconds
Interval: 30 seconds
Unhealthy Threshold: 5
Healthy Threshold: 5

When the instances "fill up" with the ~1000 connections in CLOSE_WAIT, no more connections can be made to 24224, so the ELB marks the machine as offline.  This flops back and forth, as it "gets lucky" with getting a connection or not.  I have no reason to believe this is a false alarm, since even on the instance itself I cannot "telnet localhost 24224".  It seems that the ~1000 connections make the machine "full".

(FWIW, I checked the process with strace to see which polling call it was using - IIRC select() only supports 1024 file descriptors - and it looks like it's using epoll.)

Each of the instances is running fluentd with the following configuration: three listeners - one for local syslog, one for native fluentd input (from ~70 fluentd agents on Linux), and one for input from Windows nxlog. The last two both come via the ELB:

  <source>
    format /^(?<time>[^ ]* +[^ ]* +[^ ]*) +(?<xhost>[^ \[]*)(?:\[(?<pid>[0-9]+)\]?)? (?<short_message>(?<ident>[^ ]*):? *.*)$/
    include_source_host true
    bind localhost
    type syslog
    tag system.syslog
  </source>
  <source>
    type forward
  </source>
  <source>
    format json
    port 24223
    tag windows.events
    type tcp
  </source>

There are a couple of event transformers: one to transform the Windows events into better events (copy Message -> short_message and hostname_fqdn -> host, and parse the EventTime in the local timezone), and one to fix up any events with a missing/empty short_message, because the GELF output plugin chokes on these:

 <filter windows.events>
    enable_ruby
    renew_time_key newtime
    type record_transformer
    remove_keys Message
    <record>
      short_message ${__send__('Message')}
      host ${__send__('hostname_fqdn')}
      newtime ${Time.parse(__send__('EventTime'), Time.now).to_i}
    </record>
  </filter>
  <filter *>
    enable_ruby
    type record_transformer
    <record>
      short_message ${(defined? short_message and !short_message.nil? and !short_message.to_s.empty? ) ? short_message : "empty")}
    </record>
  </filter>

And then there is a copy output with a few stores: writing events to stdout, archiving to S3, and sending on to Graylog in GELF format:

  <match **>
    type copy
    <store>
      type stdout
    </store>
    <store>
      buffer_path /var/log/td-agent/buffer/gelf
      type gelf
      buffer_chunk_limit 1000m
      flush_interval 5s
      port 12201
      host graylog.aws.mycompany.local
      buffer_type file
      use_record_host true
    </store>
    <store>
      s3_region ap-southeast-2
      buffer_path /var/log/td-agent/buffer/s3
      time_slice_format %Y%m%d
      type s3
      s3_bucket mycompany-fluentd-logs
      buffer_chunk_limit 1000m
      path xlogs/
      time_slice_wait 10m
      format out_file
    </store>
  </match>

When it's working, that all seems to function fine.

- What do the clients use to send events to fluentd? out_forward or another application?

Each of the agents has the following configuration, which sends to fluentd via the ELB:

[root@fnd-mail-l-prod-01 config.d]# cat 80-match-forward.conf
<match **>
  type copy
  <store>
        heartbeat_type tcp
        buffer_path /var/log/td-agent/buffer/forward
        heartbeat_interval 60s
        type forward
        buffer_chunk_limit 1000m
        flush_interval 10s
        <server>
          host log-farm.aws.mycompany.local
          port 24224
        </server>
        buffer_type file
  </store>
</match>

(Before I added heartbeat_interval 60s, they used the default of 1s, which created a lot of heartbeats. I originally thought that was the whole problem, but now that I've backed that off, we still seem to have the issue.)

- Connection stats: how many connections reach in_forward via the ELB?

Lots!  I turned on ELB logging, and it recorded about 10,000 connections over five minutes from 68 clients.  Many of these are failed connections (and retries), since the fluentd servers were in a stuck state at the time. Only 1500 requests were actually passed to the backend.

- To mitigate the problem, changing the tcp_keepalive_xxx parameters may help.

Do the clients attempt to keep their TCP connections alive?  If not, I thought those parameters would only apply to TCP clients, not to a server like fluentd.

- To investigate it, compare the `close socket` messages in the fluentd log (with 'log_level debug')
against the number of FIN packets seen in tcpdump.

I turned on trace logging last night, but the system was working OK at the time (not enough load? most machines are shut down overnight).  When they boot at 0700, they start their application servers and generate a LOT of traffic - I'd estimate a hundred thousand messages over the space of 30 minutes.

The trace logging all looked like it was closing connections:

2015-10-05 22:07:58 +1100 [trace]: plugin/in_forward.rb:194:initialize: accepted fluent socket from '10.168.9.129:48632': object_id=70077080447380
2015-10-05 22:07:58 +1100 [trace]: plugin/in_forward.rb:262:on_close: closed socket
2015-10-05 22:07:58 +1100 fluent.trace: {"message":"accepted fluent socket from '10.168.9.129:48632': object_id=70077080447380"}
2015-10-05 22:07:58 +1100 fluent.trace: {"message":"closed socket"}

Just visually, it looked like there were a bunch of these four-line groups, most of which weren't followed by any log lines, so I'm assuming they are heartbeat checks.  Not a silly number of them, maybe 4-5/second.  And then every few seconds I got a bunch of real log events (which happen to be from a squid access log, monitored on one of the machines using in_tail).

Simon Roberts

Oct 5, 2015, 2:26:11 PM
to Fluentd Google Group
I've just split the event logging out to an explicit file, rather than stdout, so I can see whether fluentd itself emits something that I didn't notice:

  #<store>
  #      type stdout
  #</store>
  <store>
    type file
    path /var/log/td-agent/copy
  </store>

We'll see what happens when the clients start in 1.5 hours.

Simon Roberts

Oct 5, 2015, 5:45:57 PM
to Fluentd Google Group
Yup, it failed again this morning.  Only 136 connections, and yet dead again. Restarting now with -vv

# netstat -an | grep 24224 | wc -l
136
# echo | telnet localhost 24224
Trying 127.0.0.1...
...

Simon Roberts

Oct 5, 2015, 6:10:36 PM
to Fluentd Google Group
Restarted with trace logging. Within a minute, there were over 1000 connections in CLOSE_WAIT:

# netstat -a | awk '/:24224/ {print $6}' | sort | uniq -c
      1
   1026 CLOSE_WAIT
      1 LISTEN
     54 SYN_RECV

But only 59 socket accepts were logged.  Summary:

2015-10-06 08:45:03 +1100 [debug]: plugin/in_syslog.rb:170:listen: listening syslog socket on localhost:5140 with udp
2015-10-06 08:45:03 +1100 [info]: plugin/in_forward.rb:74:listen: listening fluent socket on 0.0.0.0:24224
2015-10-06 08:45:03 +1100 [debug]: plugin/in_tcp.rb:27:listen: listening tcp socket on 0.0.0.0:24223
2015-10-06 08:45:10 +1100 [trace]: plugin/in_forward.rb:194:initialize: accepted fluent socket from '10.168.8.16:64096': object_id=70339982669560
2015-10-06 08:45:10 +1100 [trace]: plugin/in_forward.rb:262:on_close: closed socket
2015-10-06 08:45:26 +1100 [trace]: plugin/in_forward.rb:194:initialize: accepted fluent socket from '10.168.9.129:58155': object_id=70339982633860
2015-10-06 08:45:26 +1100 [trace]: plugin/in_forward.rb:262:on_close: closed socket
2015-10-06 08:45:40 +1100 [trace]: plugin/in_forward.rb:194:initialize: accepted fluent socket from '10.168.8.16:64106': object_id=70339982605180
2015-10-06 08:45:40 +1100 [trace]: plugin/in_forward.rb:262:on_close: closed socket
...
2015-10-06 08:47:10 +1100 [trace]: plugin/socket_util.rb:64:initialize: accepted fluent socket object_id=70339982037720
2015-10-06 08:47:11 +1100 [trace]: plugin/socket_util.rb:64:initialize: accepted fluent socket object_id=70339982033240
2015-10-06 08:47:11 +1100 [trace]: plugin/socket_util.rb:64:initialize: accepted fluent socket object_id=70340027412460
...
2015-10-06 08:48:11 +1100 [trace]: plugin/socket_util.rb:87:on_close: closed fluent socket object_id=70340027412460
2015-10-06 08:49:15 +1100 [trace]: plugin/socket_util.rb:87:on_close: closed fluent socket object_id=70339982037720
2015-10-06 08:50:03 +1100 [trace]: plugin/socket_util.rb:87:on_close: closed fluent socket object_id=70339982033240
...

i.e. two kinds of log messages - presumably one is the native fluentd forward input (24224) and the other is the JSON/TCP input (24223).


And then I restarted it again, to try to match up the ephemeral port against something in netstat, and now it doesn't appear to be failing. Literally nothing changed.

Simon Roberts

Oct 5, 2015, 6:42:06 PM
to Fluentd Google Group
OK, it started failing again, lots of connections in CLOSE_WAIT. RAGE

Simon Roberts

Oct 5, 2015, 7:53:42 PM
to Fluentd Google Group
OK, I have a complete set of td-agent trace logs, the events themselves, and a tcpdump covering the same window.  It worked for a while and saved lots of events (37k).  Everything looks OK in the TCP dump for about 230 seconds, when I see the first "TCP Window Full" packet, meaning that the receiver (fluentd) is full. The connection that happened to hit that message is carrying quite a big event: 150K, and it looks incomplete. Is there some size limitation in fluentd?

Looking at TCP packet lengths, 90% are 40-79 bytes, presumably heartbeats.

At 309 seconds I started seeing TCP retransmissions - presumably by then the server was stuffed, and this is the consequence of that.

====

I repeated the experiment, and it failed almost immediately.


[root@log-relay-l-prod td-agent]# tcpdump -s0 -w startup-fluentd.pcap port not ssh
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
10645 packets captured
11465 packets received by filter
820 packets dropped by kernel
[root@log-relay-l-prod td-agent]#



[root@log-relay-l-prod td-agent]# service td-agent start
Starting td-agent:                                         [  OK  ]
[root@log-relay-l-prod td-agent]# netstat -a > netstat.txt
[root@log-relay-l-prod td-agent]# service td-agent stop
Shutting down td-agent:                                    [  OK  ]
[root@log-relay-l-prod td-agent]#




[root@log-relay-l-prod td-agent]# while [ 1 ]; do date ; netstat -an | awk '/:24224/ {print $6}' | sort | uniq -c; sleep 5; done
Tue Oct  6 10:36:58 AEDT 2015
Tue Oct  6 10:37:03 AEDT 2015
Tue Oct  6 10:37:08 AEDT 2015
      1
    365 CLOSE_WAIT
      2 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:13 AEDT 2015
      1
    429 CLOSE_WAIT
      3 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:18 AEDT 2015
      1
    504 CLOSE_WAIT
      4 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:23 AEDT 2015
      1
    584 CLOSE_WAIT
      4 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:28 AEDT 2015
      1
    663 CLOSE_WAIT
      4 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:34 AEDT 2015
      1
    738 CLOSE_WAIT
      5 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:39 AEDT 2015
      1
    808 CLOSE_WAIT
      5 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:44 AEDT 2015
      1
    880 CLOSE_WAIT
      6 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:49 AEDT 2015
Tue Oct  6 10:37:54 AEDT 2015
Tue Oct  6 10:37:59 AEDT 2015
Tue Oct  6 10:38:04 AEDT 2015



Looking at one of the open connections at random:
tcp        1      0 10.168.10.91:24224          10.168.9.129:25511          CLOSE_WAIT

It looked like a perfectly reasonable conversation (with no actual data):

[m45848@base2prd fail2]$ tcpdump -r startup-fluentd.pcap port 25511
reading from file startup-fluentd.pcap, link-type EN10MB (Ethernet)
10:37:04.640440 IP 10.168.9.129.25511 > 10.168.10.91.24224: Flags [S], seq 3429011007, win 17922, options [mss 8961,sackOK,TS val 43145500 ecr 0,nop,wscale 8], length 0
10:37:04.640451 IP 10.168.10.91.24224 > 10.168.9.129.25511: Flags [S.], seq 3585448042, ack 3429011008, win 26847, options [mss 8961,sackOK,TS val 16432431 ecr 43145500,nop,wscale 7], length 0
10:37:04.641477 IP 10.168.9.129.25511 > 10.168.10.91.24224: Flags [.], ack 1, win 71, options [nop,nop,TS val 43145500 ecr 16432431], length 0
10:37:04.641597 IP 10.168.9.129.25511 > 10.168.10.91.24224: Flags [F.], seq 1, ack 1, win 71, options [nop,nop,TS val 43145500 ecr 16432431], length 0
10:37:04.642233 IP 10.168.10.91.24224 > 10.168.9.129.25511: Flags [.], ack 2, win 210, options [nop,nop,TS val 16432432 ecr 43145500], length 0
[m45848@base2prd fail2]$

So the TCP exchange was:
client SYN
fluentd SYN+ACK
client ACK
client FIN+ACK
fluentd ACK

Shouldn't fluentd have sent a FIN? 
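
(For what it's worth, my understanding is that CLOSE_WAIT means the kernel has already ACKed the client's FIN and is waiting for the application to call close(); fluentd's own FIN only goes out at that point. A rough Ruby sketch - nothing to do with fluentd's real implementation, port number purely for illustration - shows the same pattern:

    # Rough sketch: a TCP server that accepts connections but never closes them.
    # After each client disconnects, the kernel ACKs the client's FIN and the
    # socket sits in CLOSE_WAIT until this process calls close() or exits.
    require 'socket'

    server = TCPServer.new('0.0.0.0', 24224)  # port chosen only to mirror in_forward
    held = []
    loop do
      held << server.accept                   # accept, then deliberately never close
      # `netstat -an | grep 24224` shows one more CLOSE_WAIT per disconnected client
    end

So the missing FIN suggests fluentd accepted the socket but never got around to closing it.)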

Simon Roberts

Oct 5, 2015, 11:41:38 PM
to Fluentd Google Group
I think I figured out the problem.  And it was totally my fault :(

When I ran strace against td-agent, starting it as a subprocess rather than attaching to an existing process, I saw the following output:

[root@log-relay-l-prod td-agent]# strace -o new.strace -f /opt/td-agent/embedded/bin/ruby /usr/sbin/td-agent --log /var/log/td-agent/td-agent.log --use-v1-config -vv --group root
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:214:in `eval': /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:214: syntax error, unexpected ')', expecting tSTRING_DEND (SyntaxError)
...y? ) ? short_message : "empty")}"
...                               ^
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:214: unterminated string meets end of file
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:214:in `expand'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:123:in `expand_placeholders'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:127:in `block in expand_placeholders'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:126:in `each_pair'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:126:in `expand_placeholders'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:115:in `reform'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:84:in `block in filter_stream'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:147:in `feed_each'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:147:in `each'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:82:in `filter_stream'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:150:in `block in emit'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:149:in `each'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:149:in `emit'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:88:in `emit_stream'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:147:in `on_message'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:243:in `call'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:243:in `block in on_read_msgpack'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:242:in `feed_each'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:242:in `on_read_msgpack'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:228:in `call'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:228:in `on_read'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/io.rb:123:in `on_readable'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/io.rb:186:in `on_readable'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:91:in `run'
[root@log-relay-l-prod td-agent]# less /etc/td-agent/config.d/

So the problem was an extra closing bracket in my ruby filter:

<filter *>
    enable_ruby
    <record>
      short_message ${(defined? short_message and !short_message.nil? and !short_message.to_s.empty? ) ? short_message : "empty")}
    </record>
    type record_transformer
</filter>

The error message about this was not captured by fluentd and didn't appear in any logs; the only way I happened to see it was by running td-agent directly rather than as a service, which is how it's normally run.  For reference, this is the expand method in filter_record_transformer that does the eval - note the rescue that I would have expected to log a warning:

    class RubyPlaceholderExpander
      ...
      def expand(str)
        interpolated = str.gsub(/\$\{([^}]+)\}/, '#{\1}') # ${..} => #{..}
        eval "\"#{interpolated}\"", @placeholders.instance_eval { binding }
      rescue => e
        log.warn "failed to expand `#{str}`", :error_class => e.class, :error => e.message
        log.warn_backtrace
        nil
      end
      ...
    end
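
A stand-alone version of that logic (heavily simplified, values made up) reproduces what I'm seeing: eval of the broken template raises SyntaxError, and since `rescue => e` only catches StandardError (SyntaxError is a ScriptError), the warning above never fires and the exception escapes into in_forward's event loop instead:

    # Simplified stand-in for RubyPlaceholderExpander#expand (values are made up).
    def expand(str, short_message)
      interpolated = str.gsub(/\$\{([^}]+)\}/, '#{\1}') # ${..} => #{..}
      eval "\"#{interpolated}\"", binding
    rescue => e
      # Only StandardError is caught here; SyntaxError (a ScriptError) is not.
      "rescued: #{e.message}"
    end

    fixed  = '${(defined? short_message) ? short_message : "empty"}'
    broken = '${(defined? short_message) ? short_message : "empty")}'  # extra ')'

    puts expand(fixed, "hello")    # => hello
    puts expand(broken, "hello")   # SyntaxError propagates past the rescue and kills the script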

Mr. Fiber

Oct 6, 2015, 12:49:35 AM
to Fluentd Google Group
Thanks for the investigation.
So after fixing your configuration, there are no longer lots of connections in CLOSE_WAIT, right?
Continuously raising exceptions seems to have choked fluentd's behaviour.

The error message about this was not captured by fluentd and didn't appear in any logs; the only way I happened to see it was by running td-agent directly rather than as a service, which is how it's normally run.

This is bad. We should fix it. I will check it with your configuration.
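
For reference, one possible direction (just a sketch against the expand method quoted above, not a committed patch) would be to widen the rescue so eval-time syntax errors are logged and swallowed like any other expansion failure:

    def expand(str)
      interpolated = str.gsub(/\$\{([^}]+)\}/, '#{\1}') # ${..} => #{..}
      eval "\"#{interpolated}\"", @placeholders.instance_eval { binding }
    rescue StandardError, ScriptError => e
      # SyntaxError < ScriptError, so a broken template is now logged as a
      # warning instead of escaping into the input plugin's event loop.
      log.warn "failed to expand `#{str}`", :error_class => e.class, :error => e.message
      log.warn_backtrace
      nil
    end
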
Again, thanks for the thorough investigation.


Masahiro


Simon Roberts

Oct 6, 2015, 3:05:05 AM
to Fluentd Google Group
Yes, fixing the typo fixed all the problems with CLOSE_WAIT. I only saw one exception on the console, so unless they were suppressed, there weren't huge numbers.  It seems more likely the error wasn't handled and caused a deadlock or something?  I'm not really a ruby developer, but the stack trace leading to the exception might prove useful.