Fluentd memory leak?


jc...@atlassian.com

unread,
Aug 2, 2015, 9:59:08 PM8/2/15
to Fluentd Google Group
I'm running fluentd in a production environment. Our setup is the same as the one described here: http://docs.fluentd.org/articles/high-availability. Individual nodes send logs to an aggregator node, and the aggregator sends the logs on to Kinesis. The aggregator consists of 3 t2.small EC2 instances sitting behind an ELB. However, as you can see from the attached files, network in/out was fine but fluentd's memory usage kept going up from ~25% to ~60% (see the machine_matrics attachment); eventually it climbs to ~90% and the process gets killed by the OOM killer.
 
This is my fluentd config:
<source>
    type forward
    port 8080
</source>

# Output
<match logs.**>
    type kinesis
    stream_name "#{ENV['STREAM_NAME']}"
    region "#{ENV['AWS_REGION']}"
    aws_key_id "#{ENV['AWS_ACCESS_KEY']}"
    aws_sec_key "#{ENV['AWS_ACCESS_SECRET_KEY']}"
    http_proxy "#{ENV['HTTP_PROXY']}"

    random_partition_key

    # kinesis retries
    retries_on_putrecords 10

    # multi-thread config
    detach_process 2
    num_threads 15

    # buffering
    buffer_type file
    buffer_path /var/log/fluent/log-aggregator.*.buffer
    buffer_chunk_limit 8m
    buffer_queue_limit 512
    flush_at_shutdown true
    flush_interval 10s

    # fluentd retries
    disable_retry_limit false
    retry_limit 10
</match>



Another strange thing I see: there are ~150 open TCP connections, and ~130 of them are in the TIME_WAIT state (the ELB uses a TCP health check on port 8080).
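
For reference, one way to get a breakdown of connection states on the aggregator (a sketch; column positions may differ slightly between netstat versions):

$ netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c   # count sockets per state
$ ss -s                                                      # summary counters, including timewait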

See the attachments for more info about the gc_stat and object_space output.

Has anyone had similar problems?

machine_matrics.png
object_space.png
gc_stat.png

Mr. Fiber

unread,
Aug 2, 2015, 10:46:39 PM8/2/15
to Fluentd Google Group
I'm not sure why there are so many sockets in the TIME_WAIT state.
How does the ELB handle the RST / FIN packets of connections?
How many connections are there from the forwarders to the aggregator?

How about setting net.ipv4.tcp_tw_reuse = 1?
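
For reference, a minimal sketch of applying that sysctl on the aggregator hosts (assuming Linux; persist it in /etc/sysctl.conf only if it turns out to help):

$ sudo sysctl -w net.ipv4.tcp_tw_reuse=1                    # apply immediately
$ echo 'net.ipv4.tcp_tw_reuse = 1' | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p                                            # reload persisted settings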


Masahiro


jc...@atlassian.com

unread,
Aug 2, 2015, 11:52:48 PM8/2/15
to Fluentd Google Group
1. We have ~140 forwarders at the moment, so the ~150 TCP connections now seem reasonable to me.

Here are the TIME_WAIT connections:
$ netstat | grep TIME_WAIT | wc -l
123
$ netstat | grep TIME_WAIT | wc -l
127
$ netstat | grep TIME_WAIT | wc -l
129
$ netstat | grep TIME_WAIT | wc -l
133
$ netstat | grep TIME_WAIT | wc -l
133
$ netstat | grep TIME_WAIT | wc -l
135
$ netstat | grep TIME_WAIT | wc -l
119
$ netstat | grep TIME_WAIT | wc -l
121
So the connections did get closed.

2. I can't find any online documentation describing how the ELB handles FIN/RST packets, but there is a connection timeout if there is no data for 60 seconds.

3. I will try net.ipv4.tcp_tw_reuse.

Let me know what other information you need.

jc...@atlassian.com

unread,
Aug 2, 2015, 11:57:48 PM8/2/15
to Fluentd Google Group
Also, here are the stats from top:

$ top
top - 03:13:55 up 1 day,  1:30,  1 user,  load average: 0.00, 0.01, 0.05
Tasks: 116 total,   2 running, 114 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.7 us,  0.7 sy,  0.0 ni, 96.0 id,  0.0 wa,  0.0 hi,  1.3 si,  0.3 st
KiB Mem:   2048524 total,  1947088 used,   101436 free,   192192 buffers
KiB Swap:        0 total,        0 used,        0 free.   643236 cached Mem

  PID USER       PR  NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
 3021 root       20   0  662364  63760  2720 S  1.0  3.1  21:17.79 fluentd
    7 root       20   0       0      0     0 S  0.3  0.0   1:34.70 rcu_sched
    8 root       20   0       0      0     0 R  0.3  0.0   4:50.08 rcuos/0
 2265 td-agent   20   0  248432  67028  5004 S  0.3  3.3   3:25.26 ruby
 2570 dd-agent   20   0  148476  22540  5328 S  0.3  1.1   1:43.68 python
 3026 root       20   0  728896 337224  3848 S  0.3 16.5   7:47.26 fluentd
 3032 root       20   0  731760 344124  3848 S  0.3 16.8   7:49.64 fluentd
    1 root       20   0   33616   2956  1496 S  0.0  0.1   0:01.50 init


Mr. Fiber

unread,
Aug 3, 2015, 3:35:42 AM8/3/15
to Fluentd Google Group
> 1. We have ~140 forwarders at the moment, so the ~150 TCP connections now seem reasonable to me.

I see.

From your attached image, incoming traffic is increasing.
Does memory usage increase even with the same traffic?


jc...@atlassian.com

unread,
Aug 4, 2015, 12:38:46 AM8/4/15
to Fluentd Google Group
The memory usage keeps going up even though the traffic is constant, as you can see from the attachment.

Currently I have set RUBY_GC_OLDMALLOC_LIMIT_MAX and RUBY_GC_MALLOC_LIMIT_MAX to small values to see if more frequent GC helps the situation.
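
For reference, a minimal sketch of how those variables might be set before starting fluentd (the values and config path below are only illustrative, not the ones actually used here):

$ export RUBY_GC_MALLOC_LIMIT_MAX=67108864        # cap malloc limit growth at 64MB
$ export RUBY_GC_OLDMALLOC_LIMIT_MAX=134217728    # cap oldmalloc limit growth at 128MB
$ fluentd -c /etc/fluent/fluent.conf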
Screen Shot 2015-08-04 at 2.35.18 pm.png

jc...@atlassian.com

unread,
Aug 5, 2015, 3:19:56 AM8/5/15
to Fluentd Google Group
Memory usage still keeps going up :( The number of fluentd forward handler, CoolIOWatcher, and TCP socket objects looks suspicious; I don't think there should be that many incoming connections. However, I don't really see a direct correlation between all these metrics.
Screen Shot 2015-08-05 at 5.12.30 pm.png
Screen Shot 2015-08-05 at 5.12.21 pm.png
Screen Shot 2015-08-05 at 5.12.40 pm.png
Screen Shot 2015-08-05 at 5.12.06 pm.png

Mr. Fiber

unread,
Aug 5, 2015, 6:37:12 AM8/5/15
to Fluentd Google Group
That's weird.
How about using another output plugin instead of kinesis?
I want to know where the problem is, in the input or the output.

Lance N.

unread,
Aug 8, 2015, 8:07:47 PM8/8/15
to Fluentd Google Group
Perhaps old threads are piling up and not being harvested? Can you graph the number of threads in the Fluentd process? 
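
For reference, one way to sample that (a sketch; 3026 here stands in for one of the fluentd worker PIDs from the top output above, so substitute the real PID):

$ ps -o nlwp= -p 3026            # number of threads in the process
$ ls /proc/3026/task | wc -l     # same count via procfs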

jc...@atlassian.com

unread,
Aug 9, 2015, 7:38:06 PM8/9/15
to Fluentd Google Group
I deployed a version with the null output plugin over the weekend. Here is the config:

<system>
    suppress_repeated_stacktrace
    suppress_config_dump
</system>

# Input
<source>
    type forward
    port 8080
</source>

<match fluent.**>
    type file
    path /var/log/fluent/log-aggregator
</match>

# Output
<match logs.**>
   type null
</match>

See the attachment for the result. The memory usage still keeps going up. (It didn't go up as fast, but this is not in production, so the load is much lower.)
Screen Shot 2015-08-10 at 9.34.20 am.png

Naotoshi Seo

unread,
Aug 9, 2015, 11:48:34 PM8/9/15
to Fluentd Google Group
This needs investigation.
Can you try removing the <system></system> config?
If memory still grows, can you also try the --no-supervisor option?
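
For reference, a minimal sketch of launching fluentd without the supervisor process (the config path is illustrative):

$ fluentd -c /etc/fluent/fluent.conf --no-supervisor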

Regards,
Naotoshi a.k.a sonots

jc...@atlassian.com

unread,
Aug 17, 2015, 6:22:29 AM8/17/15
to Fluentd Google Group
Sorry for taking so long to reply; things are busy here. I've created an aggregator that doesn't actually aggregate logs (no traffic from individual nodes, only from the ELB).

I've attached some metrics I collected.

Next, I will try what you suggested and see if it makes a difference.

Thanks!
Screen Shot 2015-08-17 at 8.14.22 pm.png
Screen Shot 2015-08-17 at 8.15.32 pm.png
Screen Shot 2015-08-17 at 8.16.13 pm.png
Screen Shot 2015-08-17 at 8.16.04 pm.png
Screen Shot 2015-08-17 at 8.14.55 pm.png
Screen Shot 2015-08-17 at 8.16.56 pm.png

jc...@atlassian.com

unread,
Aug 17, 2015, 6:23:43 AM8/17/15
to Fluentd Google Group
Forgot one more attachment
Screen Shot 2015-08-17 at 8.14.33 pm.png

no jihun

unread,
Sep 2, 2015, 6:48:35 PM9/2/15
to Fluentd Google Group
any updates?

Mr. Fiber

unread,
Sep 2, 2015, 11:45:29 PM9/2/15
to Fluentd Google Group
Do you hit this problem too?


On Thu, Sep 3, 2015 at 7:48 AM, no jihun <jee...@gmail.com> wrote:
any updates?

no jihun

unread,
Sep 3, 2015, 1:37:47 AM9/3/15
to Fluentd Google Group
I'm not sure whether this is the same situation as jc's, but as you can see, the fluentd process suddenly consumed all of the machine's RAM.

Usually, fluentd consumes about 50MB of RAM (approximately 0.84% of total RAM).

Here are the fluentd RAM consumption stats by minute: the MEM (RAM) field stays at 0.85% until 2015-08-31 16:04.

Currently I don't know exactly why the memory tracking produced no logs during those 30 minutes (16:05~16:32).

Not only on this day but also on other days, fluentd has consumed a very large amount of memory.




Here is my conf:


<source>
  type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

<source>
  type tail
  path /home1/logs/offerwall/offerwall.20*.log
  pos_file /home1/.fluentd/offerwall_tail_pos.log
  refresh_interval 5
  tag offerwall.log
  format json
  log_level error
</source>

<match offerwall.log>
  type elasticsearch
  buffer_type file
  buffer_path /home1/.fluentd/buffer/offerwall.*.buffer
  buffer_chunk_limit 32m
  buffer_queue_limit 1024
  logstash_format true
  utc_index true
  flush_interval 3
  disable_retry_limit true
  retry_wait 5
  request_timeout 10
  num_threads 1
  hosts 10.98---------------
  port 9200
  type_name dodol.analytics.real
</match>


$ gem list

*** LOCAL GEMS ***

bigdecimal (1.2.6)
cool.io (1.3.0)
elasticsearch (1.0.9)
elasticsearch-api (1.0.9)
elasticsearch-transport (1.0.9)
excon (0.45.3)
faraday (0.9.1)
fluent-plugin-elasticsearch (0.8.0)
fluentd (0.12.8)
http_parser.rb (0.6.0)
io-console (0.4.3)
json (1.8.1)
msgpack (0.5.11)
multi_json (1.11.0)
multipart-post (2.0.0)
psych (2.0.8)
rake (10.4.2)
rdoc (4.2.0)
sigdump (0.2.2)
string-scrub (0.0.5)
thread_safe (0.3.5)
tzinfo (1.2.2)
tzinfo-data (1.2015.4)
yajl-ruby (1.2.1)



This is fluentd's log from around that time:

2015-08-31 11:16:07 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-08-31.log
2015-08-31 11:16:07 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-08-24.log
2015-08-31 11:16:07 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-08-28.log
2015-08-31 11:16:07 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-08-23.log
2015-08-31 11:16:07 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-08-26.log
2015-08-31 11:16:11 +0900 [info]: plugin/out_elasticsearch.rb:67:client: Connection opened to Elasticsearch cluster => {:host=>"10.9...", :port=>9200, :scheme=>"http"}, {:host=>"10.98....", :port=>9200, :scheme=>"http"}, {:host=>"10.98.1...", :port=>9200, :scheme=>"http"}, {:host=>"10.9...1", :port=>9200, :scheme=>"http"}
2015-08-31 16:04:42 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.20.restore.2nd.log
2015-09-01 00:01:01 +0900 [info]: plugin/in_tail.rb:387:on_rotate: detected rotation of /home1/logs/offerwall/offerwall.2015-08-22.log; waiting 5 seconds
2015-09-01 09:00:04 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-09-01.log
2015-09-01 12:06:29 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.20.restore.3nd.log


I suspect the in_tail or elasticsearch plugin.

At 16:04:42 I ran commands like this:

$ touch offerwall.20.restore.2nd.log
  (waited until fluentd started tailing the new file)
$ cat somelargefile >> offerwall.20.restore.2nd.log
  (somelargefile's size is nearly 200MB)
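
(For what it's worth, a hedged sketch: if catting the ~200MB file is the trigger, in_tail's read_lines_limit parameter caps how many lines are read in one I/O loop; the value below is only illustrative, not something taken from this thread.)

<source>
  type tail
  path /home1/logs/offerwall/offerwall.20*.log
  pos_file /home1/.fluentd/offerwall_tail_pos.log
  refresh_interval 5
  tag offerwall.log
  format json
  # illustrative: read fewer lines per I/O loop than the default (1000)
  read_lines_limit 500
</source>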



Since I restarted the fluentd process after getting the SMS notification (24 hours ago now), fluentd's RAM usage has not gone high (it stays under 50MB).


What other information can I provide?





Naotoshi Seo

unread,
Sep 3, 2015, 9:05:27 AM9/3/15
to flu...@googlegroups.com
It looks like logs from old days appeared all at once, and that caused the increase in memory. So it does not look like a memory leak.

> 2015-08-31 11:16:07 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-08-24.log
> 2015-08-31 11:16:07 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-08-28.log
> ....

If these log files appeared at that moment, whoever writes the offerwall logs is doing something wrong.
If these files already existed but in_tail only started following them at that moment, in_tail is doing something wrong.

Can you check which case applies in your situation?

Regards,
Naotoshi

no jihun

unread,
Sep 3, 2015, 6:34:25 PM9/3/15
to flu...@googlegroups.com

These log lines were caused by restarting fluentd, because old files are kept for 7 days.

> 2015-08-31 11:16:07 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-08-24.log
> 2015-08-31 11:16:07 +0900 [info]: plugin/in_tail.rb:477:initialize: following tail of /home1/logs/offerwall/offerwall.2015-08-28.log
> ....

And that time is 3 hours earlier than the moment memory went high.
Also, no duplicated documents were indexed in Elasticsearch.

In the picture, memory usage is very low until 14:00.

Mr. Fiber

unread,
Sep 7, 2015, 1:43:41 AM9/7/15
to Fluentd Google Group
From your image, CPU is being consumed at 16:04:55, so it was reading lots of logs.
Maybe it was the following command:

$ cat somelargefile >> offerwall.20.restore.2nd.log
  (somelargefile's size is nearly 200MB)

You can check the buffer status via in_monitor_agent and
Ruby's object allocations via in_object_space.
By comparing these metrics, you can investigate the problem more deeply,
for example, which objects are not being GCed.
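
For reference, a minimal sketch of enabling both (in_monitor_agent is already in no jihun's config above; the object_space interval and tag are illustrative, and you would also need a <match> for that tag to see its output):

<source>
  type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

<source>
  type object_space
  emit_interval 60
  tag object.space
</source>

The buffer status can then be polled with, for example:

$ curl http://localhost:24220/api/plugins.json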

