Hi All,
We have an ongoing issue with Fluentd (td-agent) memory usage on our three Fluentd Aggregators. We have been tweaking our environment but keep hitting the same problem: ingest works for almost a week, then the Aggregators run out of memory and ingest comes to a halt. From our monitoring graphs we can see the memory usage on each Aggregator climbing slowly, roughly 5-10% per day, until it consumes most of the memory available to td-agent and the OS and ingest stops. At that point we usually have to stop the Fluentd Forwarder, flush its buffer, restart the Aggregators, and wait until they are all up before restarting ingest on the Forwarder.
We don't see any issues with our file buffer or S3 buffer sizes; they all appear to be working fine.
Our three Aggregators have the following specs:
16 cores (currently 8 workers assigned in our Fluentd config)
32GB RAM
I added the following OS environment variable:
RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.2
and set the following on the buffer for our ingest into Elasticsearch:
flush_thread_count 4
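For anyone trying to reproduce, the GC variable needs to be visible to the td-agent service itself, not just a login shell. A minimal sketch of one way to set it (a systemd drop-in; the exact path/mechanism will vary by install, some use /etc/sysconfig/td-agent instead):
# /etc/systemd/system/td-agent.service.d/override.conf (example path)
[Service]
Environment="RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.2"
# then: systemctl daemon-reload && systemctl restart td-agent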
Using td-agent version 3.7.0 (Amazon Linux 2).
Config per Aggregator attached below:
<worker 0-7>
<source>
@type forward
port 24224
bind 0.0.0.0
<transport tls>
cert_path /etc/td-agent/certs/fluentd.crt
private_key_path /etc/td-agent/certs/fluentd.key
private_key_passphrase ############
</transport>
<security>
self_hostname fluentd-aggregator
shared_key ###########
user_auth true
<user>
username test
password ##########
</user>
</security>
</source>
<filter ######.all>
@type parser
key_name message
<parse>
@type grok
grok_name_key grok_name
grok_failure_key grokfailure
# reserve_data true
# reserve_time true
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
pattern %{################}
</grok>
<grok>
name "################"
pattern %{################}
</grok>
<grok>
name "################"
pattern %{################}
</grok>
</parse>
</filter>
<filter ######.all>
@type elasticsearch_genid
hash_id_key _hash
</filter>
<match ######.all>
@type copy
<store>
@type s3
store_as gzip_command
s3_bucket ############
path ############/
s3_region us-west-2
s3_object_key_format %{path}################_%{time_slice}_%{index}.%{file_extension}
<buffer>
@type file
path /var/log/td-agent/s3_########
chunk_limit_size 16MB
total_limit_size 5120MB
flush_at_shutdown true
timekey 30
timekey_use_utc true
timekey_wait 10s
</buffer>
</store>
<store>
@type elasticsearch
host ################
user ################
password ################
include_timestamp true
time_key @timestamp
id_key _hash # same key name as specified in hash_id_key
remove_keys _hash # Elasticsearch doesn't like keys that start with _
index_name ################
log_es_400_reason true
include_tag_key true
tag_key @log_name
reconnect_on_error true
reload_on_failure true
reload_connections false
<buffer>
@type file
path /var/log/td-agent/buffer_################
# chunk + enqueue
total_limit_size 10240MB
chunk_limit_size 16MB
flush_at_shutdown true
flush_mode interval
flush_thread_count 4
flush_interval 5s
retry_timeout 1h
retry_max_interval 30
overflow_action drop_oldest_chunk
</buffer>
<secondary>
@type secondary_file
directory /var/log/td-agent/error
basename ################.*
</secondary>
</store>
</match>
</worker>
<system>
# equal to -v option
# log_level debug
workers 8
</system>
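One thing we are thinking of adding to help diagnose this is the built-in monitor_agent input on each Aggregator, so we can see whether buffer_queue_length / buffer_total_queued_size / retry_count climb in step with the memory usage. A minimal sketch of what that would look like (pinned to worker 0 so only one process binds the port; 24220 is just the conventional monitor_agent port):
<worker 0>
<source>
@type monitor_agent
bind 0.0.0.0
port 24220
</source>
</worker>
The metrics can then be pulled with curl http://<aggregator>:24220/api/plugins.json while the memory is climbing.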
Thanks,
Daniel