Lots of connections left in CLOSE_WAIT


Simon Roberts

Oct 5, 2015, 5:35:28 AM
to Fluentd Google Group
Hi team,

I've got a couple of fluentd servers that accept incoming connections from an AWS ELB (load balancer), with a couple of dozen servers beyond that.  This worked fine until a couple of weeks ago, when it started to fail.  Looking at the server, it has ~1000 connections stuck in CLOSE_WAIT, which is causing Linux to overflow the listen queue and drop incoming TCP connections.

# netstat -a | awk '/24224/ {print $6}' | sort | uniq -c
      1
   1026 CLOSE_WAIT
      1 LISTEN
      3 SYN_RECV

#  nstat -a | grep -i listen
TcpExtListenOverflows           3832357            0.0
TcpExtListenDrops               3832357            0.0

The listener itself is literally the default, TCP 24224:

<source>
  type forward
</source>

I've tried the network setting changes recommended at http://docs.fluentd.org/articles/before-install, and checked that they are currently applied, as well as a few others that I've since commented out:

#net.ipv4.tcp_max_syn_backlog = 4096
#net.ipv4.tcp_syncookies = 1
#net.core.somaxconn = 1024
#net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240    65535

But no help.  I think the incoming traffic is semi-genuine, but the server is now getting hammered by TCP retries as well as the original traffic.  Another listener continues to work even as :24224 fails, so the problem seems to be port/listener specific.  CPU usage is low.  This is getting to be a bit of a problem for me - I introduced fluentd, which is now apparently causing a problem :(


Mr. Fiber

Oct 5, 2015, 9:21:46 AM
to Fluentd Google Group
Hi Simon,

That's weird. I checked one fluentd instance behind an ELB, and CLOSE_WAIT stays at 0 or 1 in an environment with 20-40 established connections.
I've heard of this situation before, but the reporter never replied to us, so we stopped the investigation.
A number of users run fluentd behind an ELB without hitting this problem, so you may have hit a corner case.

Could you tell me more details? I want to reproduce this problem.

- ELB setting
- What do the clients use to send events to fluentd? out_forward or another application?
- Connection stats: how many connections reach in_forward via the ELB?

To mitigate the problem, changing the tcp_keepalive_xxx parameters may help.
To investigate it, compare the `close socket` messages in the fluentd log (with 'log_level debug')
against the number of FIN packets seen in tcpdump.
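
As a rough starting point for that comparison, something like this small Ruby script can tally the accept/close lines (just a sketch - the log path and exact message text are assumptions based on in_forward's debug output, so adjust them to your setup):

    # Sketch: count in_forward accept/close debug lines in the td-agent log,
    # so the totals can be compared with the FIN packets counted in a tcpdump
    # capture. Log path and message text are assumptions.
    accepted = closed = 0
    File.foreach('/var/log/td-agent/td-agent.log') do |line|
      accepted += 1 if line.include?('accepted fluent socket')
      closed   += 1 if line.include?('closed socket')
    end
    puts "accepted=#{accepted} closed=#{closed} (still open=#{accepted - closed})"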


Masahiro



Simon Roberts

Oct 5, 2015, 2:10:20 PM
to Fluentd Google Group
Thanks for your reply.  I will provide all the details I can (the problem is likely to happen again in 3 hours, when all the client machines are started for the day).

The ELB normally has two m3.medium Amazon Linux servers behind it, and is a simple TCP ELB with two listeners/ports (one native fluentd on 24224, and one additional on 24223 for Windows/JSON - see below). The health check is:

Ping Target: TCP:24224
Timeout: 15 seconds
Interval: 30 seconds
Unhealthy Threshold: 5
Healthy Threshold: 5

When the instances "fill up" with the ~1000 connections in CLOSE_WAIT, no more connections can be made to 24224, so the ELB marks the machine as offline.  This flops back and forth, as it "gets lucky" with getting a connection or not.  I have no reason to believe this is a false alarm, since even on the instance itself I cannot "telnet localhost 24224".  It seems that the ~1000 connections make the machine "full".

(FWIW, I checked the process with strace to see which polling call it was using - IIRC select() only supports 1024 file descriptors - and it looks like it's using epoll.)

Each of the instances is running fluentd with the following configuration: three listeners - one for local syslog, one for native fluentd input (from ~70 fluentd agents on Linux), and one for input from Windows nxlog. The last two both come via the ELB:

  <source>
    format /^(?<time>[^ ]* +[^ ]* +[^ ]*) +(?<xhost>[^ \[]*)(?:\[(?<pid>[0-9]+)\]?)? (?<short_message>(?<ident>[^ ]*):? *.*)$/
    include_source_host true
    bind localhost
    type syslog
    tag system.syslog
  </source>
  <source>
    type forward
  </source>
  <source>
    format json
    port 24223
    tag windows.events
    type tcp
  </source>

There are a couple of event transformers: one to transform the Windows events into better events (copy Message -> short_message and hostname_fqdn -> host, and parse the EventTime in the local timezone), and one to fix up any events with a missing/empty short_message, because the GELF output plugin chokes on these:

 <filter windows.events>
    enable_ruby
    renew_time_key newtime
    type record_transformer
    remove_keys Message
    <record>
      short_message ${__send__('Message')}
      host ${__send__('hostname_fqdn')}
      newtime ${Time.parse(__send__('EventTime'), Time.now).to_i}
    </record>
  </filter>
  <filter *>
    enable_ruby
    type record_transformer
    <record>
      short_message ${(defined? short_message and !short_message.nil? and !short_message.to_s.empty? ) ? short_message : "empty")}
    </record>
  </filter>

And then there is a copy output with a few stores: writing events to stdout, archiving to S3, and sending on to Graylog in GELF format:

  <match **>
    type copy
    <store>
      type stdout
    </store>
    <store>
      buffer_path /var/log/td-agent/buffer/gelf
      type gelf
      buffer_chunk_limit 1000m
      flush_interval 5s
      port 12201
      host graylog.aws.mycompany.local
      buffer_type file
      use_record_host true
    </store>
    <store>
      s3_region ap-southeast-2
      buffer_path /var/log/td-agent/buffer/s3
      time_slice_format %Y%m%d
      type s3
      s3_bucket mycompany-fluentd-logs
      buffer_chunk_limit 1000m
      path xlogs/
      time_slice_wait 10m
      format out_file
    </store>
  </match>

When it's working, that all seems to function fine.

- What do the clients use to send events to fluentd? out_forward or another application?

Each of the agents has the following configuration, which sends to fluentd via the ELB:

[root@fnd-mail-l-prod-01 config.d]# cat 80-match-forward.conf
<match **>
  type copy
  <store>
        heartbeat_type tcp
        buffer_path /var/log/td-agent/buffer/forward
        heartbeat_interval 60s
        type forward
        buffer_chunk_limit 1000m
        flush_interval 10s
        <server>
          host log-farm.aws.mycompany.local
          port 24224
        </server>
        buffer_type file
  </store>
</match>

(Before I added heartbeat_interval 60s, they used the default of 1s, which created a lot of heartbeats. I originally thought that was the whole problem, but now that I've backed that off, we still seem to have the issue.)

- Connection stats: how many connections reach in_forward via the ELB?

Lots!  I turned on ELB logging, and it recorded about 10,000 connections over five minutes from 68 clients.  Many of these are failed connections (and retries), since the fluentd servers were in a stuck state at the time. Only 1500 requests were actually passed to the backend.

- To mitigate the problem, changing the tcp_keepalive_xxx parameters may help.

Do the clients attempt to keep their TCP connections alive?  If not, I thought those parameters would only apply to TCP clients, not to a server like fluentd.

- To investigate it, compare the `close socket` messages in the fluentd log (with 'log_level debug')
against the number of FIN packets seen in tcpdump.

I turned on trace logging last night, but the system was working OK at the time (not enough load? most machines are shut down overnight).  When they boot at 0700, they start their application servers and generate a LOT of traffic - I'd estimate a hundred thousand messages over the space of 30 minutes.

The trace logging all looked like it was closing connections:

2015-10-05 22:07:58 +1100 [trace]: plugin/in_forward.rb:194:initialize: accepted fluent socket from '10.168.9.129:48632': object_id=70077080447380
2015-10-05 22:07:58 +1100 [trace]: plugin/in_forward.rb:262:on_close: closed socket
2015-10-05 22:07:58 +1100 fluent.trace: {"message":"accepted fluent socket from '10.168.9.129:48632': object_id=70077080447380"}
2015-10-05 22:07:58 +1100 fluent.trace: {"message":"closed socket"}

Just visually, it looked like there were a bunch of these four-line groups, most of which weren't followed by any log lines, so I'm assuming they are heartbeat checks.  Not a silly number of them, maybe 4-5/second.  And then every few seconds I got a bunch of real log events (which happen to be from a squid access log, monitored on one of the machines using in_tail).

Simon Roberts

Oct 5, 2015, 2:26:11 PM
to Fluentd Google Group
I've just split the event logging out to an explicit file, rather than stdout, so I can see whether fluentd itself emits something that I didn't notice:

  #<store>
  #      type stdout
  #</store>
  <store>
    type file
    path /var/log/td-agent/copy
  </store>

We'll see what happens when the clients start in 1.5 hours.

Simon Roberts

Oct 5, 2015, 5:45:57 PM
to Fluentd Google Group
Yup, it failed again this morning.  Only 136 connections, and yet dead again. Restarting now with -vv

# netstat -an | grep 24224 | wc -l
136
# echo | telnet localhost 24224
Trying 127.0.0.1...
...

Simon Roberts

Oct 5, 2015, 6:10:36 PM
to Fluentd Google Group
Restarted with trace logging. Within a minute, there were over 1000 connections in CLOSE_WAIT:

# netstat -a | awk '/:24224/ {print $6}' | sort | uniq -c
      1
   1026 CLOSE_WAIT
      1 LISTEN
     54 SYN_RECV

But only 59 socket accepts were logged.  Summary:

2015-10-06 08:45:03 +1100 [debug]: plugin/in_syslog.rb:170:listen: listening syslog socket on localhost:5140 with udp
2015-10-06 08:45:03 +1100 [info]: plugin/in_forward.rb:74:listen: listening fluent socket on 0.0.0.0:24224
2015-10-06 08:45:03 +1100 [debug]: plugin/in_tcp.rb:27:listen: listening tcp socket on 0.0.0.0:24223
2015-10-06 08:45:10 +1100 [trace]: plugin/in_forward.rb:194:initialize: accepted fluent socket from '10.168.8.16:64096': object_id=70339982669560
2015-10-06 08:45:10 +1100 [trace]: plugin/in_forward.rb:262:on_close: closed socket
2015-10-06 08:45:26 +1100 [trace]: plugin/in_forward.rb:194:initialize: accepted fluent socket from '10.168.9.129:58155': object_id=70339982633860
2015-10-06 08:45:26 +1100 [trace]: plugin/in_forward.rb:262:on_close: closed socket
2015-10-06 08:45:40 +1100 [trace]: plugin/in_forward.rb:194:initialize: accepted fluent socket from '10.168.8.16:64106': object_id=70339982605180
2015-10-06 08:45:40 +1100 [trace]: plugin/in_forward.rb:262:on_close: closed socket
...
2015-10-06 08:47:10 +1100 [trace]: plugin/socket_util.rb:64:initialize: accepted fluent socket object_id=70339982037720
2015-10-06 08:47:11 +1100 [trace]: plugin/socket_util.rb:64:initialize: accepted fluent socket object_id=70339982033240
2015-10-06 08:47:11 +1100 [trace]: plugin/socket_util.rb:64:initialize: accepted fluent socket object_id=70340027412460
...
2015-10-06 08:48:11 +1100 [trace]: plugin/socket_util.rb:87:on_close: closed fluent socket object_id=70340027412460
2015-10-06 08:49:15 +1100 [trace]: plugin/socket_util.rb:87:on_close: closed fluent socket object_id=70339982037720
2015-10-06 08:50:03 +1100 [trace]: plugin/socket_util.rb:87:on_close: closed fluent socket object_id=70339982033240
...

i.e. two kinds of log messages - presumably one is the native fluentd forward input (24224) and the other is the JSON/TCP input (24223).


And then I restarted it again, to try to match up the ephemeral port against something in netstat, and now it doesn't appear to be failing. Literally nothing changed.

Simon Roberts

Oct 5, 2015, 6:42:06 PM
to Fluentd Google Group
OK, it started failing again, lots of connections in CLOSE_WAIT. RAGE

Simon Roberts

Oct 5, 2015, 7:53:42 PM
to Fluentd Google Group
OK, I have a complete set of td-agent trace logs, the events themselves, and a tcpdump covering the same window.  It worked for a while and saved lots of events (37k).  Everything looks OK in the TCP dump for about 230 seconds, when I see the first "TCP Window Full" packet, meaning that the receiver (fluentd) is full. The connection that happened to hit that message is carrying quite a big event: 150K, and it looks incomplete. Is there some size limitation in fluentd?

Looking at TCP packet lengths, 90% are 40-79 bytes, presumably heartbeats.

At 309 seconds I started seeing TCP retransmissions - presumably by then the server was stuffed, and this is the consequence of that.

====

I repeated the experiment, and it failed almost immediately.


[root@log-relay-l-prod td-agent]# tcpdump -s0 -w startup-fluentd.pcap port not ssh
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
10645 packets captured
11465 packets received by filter
820 packets dropped by kernel
[root@log-relay-l-prod td-agent]#



[root@log-relay-l-prod td-agent]# service td-agent start
Starting td-agent:                                         [  OK  ]
[root@log-relay-l-prod td-agent]# netstat -a > netstat.txt
[root@log-relay-l-prod td-agent]# service td-agent stop
Shutting down td-agent:                                    [  OK  ]
[root@log-relay-l-prod td-agent]#




[root@log-relay-l-prod td-agent]# while [ 1 ]; do date ; netstat -an | awk '/:24224/ {print $6}' | sort | uniq -c; sleep 5; done
Tue Oct  6 10:36:58 AEDT 2015
Tue Oct  6 10:37:03 AEDT 2015
Tue Oct  6 10:37:08 AEDT 2015
      1
    365 CLOSE_WAIT
      2 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:13 AEDT 2015
      1
    429 CLOSE_WAIT
      3 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:18 AEDT 2015
      1
    504 CLOSE_WAIT
      4 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:23 AEDT 2015
      1
    584 CLOSE_WAIT
      4 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:28 AEDT 2015
      1
    663 CLOSE_WAIT
      4 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:34 AEDT 2015
      1
    738 CLOSE_WAIT
      5 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:39 AEDT 2015
      1
    808 CLOSE_WAIT
      5 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:44 AEDT 2015
      1
    880 CLOSE_WAIT
      6 ESTABLISHED
      1 LISTEN
Tue Oct  6 10:37:49 AEDT 2015
Tue Oct  6 10:37:54 AEDT 2015
Tue Oct  6 10:37:59 AEDT 2015
Tue Oct  6 10:38:04 AEDT 2015



Looking at one of the open connections at random:
tcp        1      0 10.168.10.91:24224          10.168.9.129:25511          CLOSE_WAIT

It looked like a perfectly reasonable conversation (with no actual data):

[m45848@base2prd fail2]$ tcpdump -r startup-fluentd.pcap port 25511
reading from file startup-fluentd.pcap, link-type EN10MB (Ethernet)
10:37:04.640440 IP 10.168.9.129.25511 > 10.168.10.91.24224: Flags [S], seq 3429011007, win 17922, options [mss 8961,sackOK,TS val 43145500 ecr 0,nop,wscale 8], length 0
10:37:04.640451 IP 10.168.10.91.24224 > 10.168.9.129.25511: Flags [S.], seq 3585448042, ack 3429011008, win 26847, options [mss 8961,sackOK,TS val 16432431 ecr 43145500,nop,wscale 7], length 0
10:37:04.641477 IP 10.168.9.129.25511 > 10.168.10.91.24224: Flags [.], ack 1, win 71, options [nop,nop,TS val 43145500 ecr 16432431], length 0
10:37:04.641597 IP 10.168.9.129.25511 > 10.168.10.91.24224: Flags [F.], seq 1, ack 1, win 71, options [nop,nop,TS val 43145500 ecr 16432431], length 0
10:37:04.642233 IP 10.168.10.91.24224 > 10.168.9.129.25511: Flags [.], ack 2, win 210, options [nop,nop,TS val 16432432 ecr 43145500], length 0
[m45848@base2prd fail2]$

So the TCP exchange was:
client SYN
fluentd SYN+ACK
client ACK
client FIN+ACK
fluentd ACK

Shouldn't fluentd have sent a FIN? 
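
(For what it's worth, my understanding is that CLOSE_WAIT means the kernel has already ACKed the client's FIN and is waiting for the application to call close(); fluentd's own FIN only goes out at that point. A rough Ruby sketch - nothing to do with fluentd's real implementation, port number purely for illustration - shows the same pattern:

    # Rough sketch: a TCP server that accepts connections but never closes them.
    # After each client disconnects, the kernel ACKs the client's FIN and the
    # socket sits in CLOSE_WAIT until this process calls close() or exits.
    require 'socket'

    server = TCPServer.new('0.0.0.0', 24224)  # port chosen only to mirror in_forward
    held = []
    loop do
      held << server.accept                   # accept, then deliberately never close
      # `netstat -an | grep 24224` shows one more CLOSE_WAIT per disconnected client
    end

So the missing FIN suggests fluentd accepted the socket but never got around to closing it.)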

Simon Roberts

Oct 5, 2015, 11:41:38 PM
to Fluentd Google Group
I think I figured out the problem.  And it was totally my fault :(

When I ran strace against td-agent, starting it as a subprocess rather than attaching to an existing process, I saw the following output:

[root@log-relay-l-prod td-agent]# strace -o new.strace -f /opt/td-agent/embedded/bin/ruby /usr/sbin/td-agent --log /var/log/td-agent/td-agent.log --use-v1-config -vv --group root
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:214:in `eval': /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:214: syntax error, unexpected ')', expecting tSTRING_DEND (SyntaxError)
...y? ) ? short_message : "empty")}"
...                               ^
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:214: unterminated string meets end of file
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:214:in `expand'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:123:in `expand_placeholders'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:127:in `block in expand_placeholders'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:126:in `each_pair'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:126:in `expand_placeholders'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:115:in `reform'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:84:in `block in filter_stream'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:147:in `feed_each'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:147:in `each'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/filter_record_transformer.rb:82:in `filter_stream'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:150:in `block in emit'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:149:in `each'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:149:in `emit'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:88:in `emit_stream'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:147:in `on_message'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:243:in `call'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:243:in `block in on_read_msgpack'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:242:in `feed_each'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:242:in `on_read_msgpack'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:228:in `call'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:228:in `on_read'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/io.rb:123:in `on_readable'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/io.rb:186:in `on_readable'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
        from /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_forward.rb:91:in `run'
[root@log-relay-l-prod td-agent]# less /etc/td-agent/config.d/

So the problem was an extra closing bracket in my ruby filter:

<filter *>
    enable_ruby
    <record>
      short_message ${(defined? short_message and !short_message.nil? and !short_message.to_s.empty? ) ? short_message : "empty")}
    </record>
    type record_transformer
</filter>

The error message about this was not captured by fluentd and didn't appear in any logs; the only way I happened to see it was by running td-agent directly rather than as a service, which is how it's normally run.  For reference, this is the expand method in filter_record_transformer that does the eval - note the rescue that I would have expected to log a warning:

    class RubyPlaceholderExpander
      ...
      def expand(str)
        interpolated = str.gsub(/\$\{([^}]+)\}/, '#{\1}') # ${..} => #{..}
        eval "\"#{interpolated}\"", @placeholders.instance_eval { binding }
      rescue => e
        log.warn "failed to expand `#{str}`", :error_class => e.class, :error => e.message
        log.warn_backtrace
        nil
      end
      ...
    end
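
A stand-alone version of that logic (heavily simplified, values made up) reproduces what I'm seeing: eval of the broken template raises SyntaxError, and since `rescue => e` only catches StandardError (SyntaxError is a ScriptError), the warning above never fires and the exception escapes into in_forward's event loop instead:

    # Simplified stand-in for RubyPlaceholderExpander#expand (values are made up).
    def expand(str, short_message)
      interpolated = str.gsub(/\$\{([^}]+)\}/, '#{\1}') # ${..} => #{..}
      eval "\"#{interpolated}\"", binding
    rescue => e
      # Only StandardError is caught here; SyntaxError (a ScriptError) is not.
      "rescued: #{e.message}"
    end

    fixed  = '${(defined? short_message) ? short_message : "empty"}'
    broken = '${(defined? short_message) ? short_message : "empty")}'  # extra ')'

    puts expand(fixed, "hello")    # => hello
    puts expand(broken, "hello")   # SyntaxError propagates past the rescue and kills the script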

Mr. Fiber

Oct 6, 2015, 12:49:35 AM
to Fluentd Google Group
Thanks for the investigation.
So after fixing your configuration, there are no longer lots of connections in CLOSE_WAIT, right?
Continuously raising exceptions seems to have choked fluentd's behaviour.

The error message about this was not captured by fluentd and didn't appear in any logs; the only way I happened to see it was by running td-agent directly rather than as a service, which is how it's normally run.

This is bad. We should fix it. I will check it with your configuration.
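
For reference, one possible direction (just a sketch against the expand method quoted above, not a committed patch) would be to widen the rescue so eval-time syntax errors are logged and swallowed like any other expansion failure:

    def expand(str)
      interpolated = str.gsub(/\$\{([^}]+)\}/, '#{\1}') # ${..} => #{..}
      eval "\"#{interpolated}\"", @placeholders.instance_eval { binding }
    rescue StandardError, ScriptError => e
      # SyntaxError < ScriptError, so a broken template is now logged as a
      # warning instead of escaping into the input plugin's event loop.
      log.warn "failed to expand `#{str}`", :error_class => e.class, :error => e.message
      log.warn_backtrace
      nil
    end
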
Again, thanks for the thorough investigation.


Masahiro


Simon Roberts

Oct 6, 2015, 3:05:05 AM
to Fluentd Google Group
Yes, fixing the typo fixed all the problems with CLOSE_WAIT. I only saw one exception on the console, so unless they were suppressed, there weren't huge numbers.  It seems more likely the error wasn't handled and caused a deadlock or something?  I'm not really a ruby developer, but the stack trace leading to the exception might prove useful.