Got "buffer flush took longer time than slow_flush_log_threshold" error


chikinchoi

Sep 6, 2020, 9:40:37 PM
to Fluentd Google Group

Hi Team,

I am getting a lot of "buffer flush took longer time than slow_flush_log_threshold: elapsed_time=60.16122445899964 slow_flush_log_threshold=20.0 plugin_id="firelens_es"" warnings in my Fluentd, which forwards logs to Elasticsearch.
I saw a document [1] that says the reason for this error is that Elasticsearch is exhausted. However, I checked all the Elasticsearch clusters and their CPU usage is low and healthy. Is there any other reason that could cause this error?
Please help. Thank you very much.

Below is my Fluentd config:

<system>
  @log_level debug
  workers 4
  root_dir /var/log/fluent/
</system>

<source>
  @type  forward
  @id    input1
  @log_level debug
  port  24224
  bind 0.0.0.0
</source>

# Used for docker health check
<source>
  @type http
  port 8888
  bind 0.0.0.0
</source>

# records sent for health checking won't be forwarded anywhere
<match health**>
  @type null
</match>

# Prometheus Configuration
# count number of incoming records per tag
<filter **>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag}
      hostname ${hostname}
    </labels>
  </metric>
</filter>

<filter **firelens**>
  @type concat
  key log
  multiline_start_regexp '^\{\\"@timestamp'
  multiline_end_regexp '/\}/'
  separator ""
  flush_interval 1
  timeout_label @NORMAL
</filter>

<filter **firelens**>
  @type parser
  key_name log
  reserve_data true
  emit_invalid_record_to_error false
  <parse>
  @type json
  </parse>
</filter>

<filter **firelens**>
  @type record_transformer
  enable_ruby
  <record>
    taskDef ${record["ecs_task_definition"].gsub(/:.*/, '')}
  </record>
</filter>

<filter lambdaNode**>
  @type parser
  key_name data
  reserve_data true
  emit_invalid_record_to_error false
  <parse>
  @type json
  </parse>
</filter>

<filter lambdaPython**>
  @type record_transformer
  enable_ruby
  <record>
    function_name ${record["function_name"].gsub(/:.*/, '')}
  </record>
</filter>

<filter lambdaPython**>
  @type parser
  key_name message
  reserve_data true
  emit_invalid_record_to_error false
  <parse>
  @type json
  </parse>
</filter>

# count number of outgoing records per tag
<match **firelens**>
  @type copy
  @id firelens
  <store>
    @type elasticsearch
    @id firelens_es
    scheme https
    ssl_version TLSv1_2
    host  "#{ENV['ELASTIC_HOST']}"
    port  "#{ENV['ELASTIC_POST']}"
    user  "#{ENV['ELASTIC_USER']}"
    password "#{ENV['ELASTIC_PWD']}"
    ssl_verify false
    log_es_400_reason true
    index_name ${taskDef}-%Y.%m
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    suppress_type_name true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    request_timeout 2147483648
    <buffer tag, time, ecs_task_definition, taskDef>
      @type file
      flush_mode lazy
      flush_thread_count 8
      # path /var/log/fluent/firelens.*.buffer
      total_limit_size 8GB
      chunk_limit_size 80MB
      timekey 5s
      timekey_wait 1s
      overflow_action drop_oldest_chunk
      retry_max_interval 16s
      disable_chunk_backup true
      retry_forever false
      retry_max_times 1
    </buffer>
  </store>
  <store>
    @type prometheus
    @id firelens_pro
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
      <labels>
        tag ${tag}
        hostname ${hostname}
      </labels>
    </metric>
  </store>
</match>

<match lambdaNode**>
  @type copy
  @id lambdaNode
  <store>
    @type elasticsearch
    @id lambdaNode_es
    scheme https
    ssl_version TLSv1_2
    host  "#{ENV['ELASTIC_HOST']}"
    port  "#{ENV['ELASTIC_POST']}"
    user  "#{ENV['ELASTIC_USER']}"
    password "#{ENV['ELASTIC_PWD']}"
    include_timestamp true
    ssl_verify false
    log_es_400_reason true
    index_name ${$.context.functionName}-%Y.%m
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    suppress_type_name true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    reload_after 100
    <buffer tag, time, $.context.functionName>
      flush_mode lazy
      chunk_limit_size 5MB
      flush_thread_count 8
      total_limit_size 512MB
      timekey 5s
      timekey_wait 1s
      retry_max_interval 16s
      disable_chunk_backup true
    </buffer>
  </store>
  <store>
    @type prometheus
    @id lambdaNode_pro
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
      <labels>
        tag ${tag}
        hostname ${hostname}
      </labels>
    </metric>
  </store>
</match>

<match lambdaPython**>
  @type copy
  @id lambdaPython
  <store>
    @type elasticsearch
    @id lambdaPython_es
    scheme https
    ssl_version TLSv1_2
    host  "#{ENV['ELASTIC_HOST']}"
    port  "#{ENV['ELASTIC_POST']}"
    user  "#{ENV['ELASTIC_USER']}"
    password "#{ENV['ELASTIC_PWD']}"
    include_timestamp true
    ssl_verify false
    log_es_400_reason true
    index_name ${function_name}-%Y.%m
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    suppress_type_name true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    reload_after 100
    <buffer tag, time, function_name>
      flush_mode lazy
      chunk_limit_size 5MB
      flush_thread_count 8
      total_limit_size 512MB
      timekey 5s
      timekey_wait 1s
      retry_max_interval 16s
      disable_chunk_backup true
    </buffer>
  </store>
  <store>
    @type prometheus
    @id lambdaPython_pro
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
      <labels>
        tag ${tag}
        hostname ${hostname}
      </labels>
    </metric>
  </store>
</match>

<label @FLUENT_LOG>
  <match fluent.*>
    @type null
  </match>
</label>

<label @NORMAL>
  <match **>
    @type null
  </match>
</label>

# expose metrics in prometheus format
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
  interval 10
  <labels>
    hostname ${hostname}
  </labels>
</source>
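
(For reference, the 20.0 second threshold named in the warning is itself a per-output parameter; raising it, as in this minimal sketch against the elasticsearch <store> above, only changes when the warning is logged, not how long the flush actually takes.)

<store>
  @type elasticsearch
  @id firelens_es
  # existing elasticsearch and buffer settings unchanged
  slow_flush_log_threshold 60.0
</store>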



Mr. Fiber

Sep 7, 2020, 6:55:42 AM
to Fluentd Google Group
If Fluentd's CPU usage is normal, other possible causes are:

- a network issue: a large number of requests consumes network bandwidth
- the Elasticsearch queue is filled by many small requests
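
(One way to check the second point, assuming direct access to the cluster, is a rough sketch using the thread pool cat API: bulk indexing goes through the "write" thread pool on recent Elasticsearch versions, and non-zero rejections or a persistently full queue there points at Elasticsearch rather than Fluentd.)

# <es-host> is a placeholder; the pool is named "bulk" on older ES versions
curl -s 'https://<es-host>:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed'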




chikinchoi

Sep 7, 2020, 10:07:24 PM
to Fluentd Google Group
Hi repeatedly,

Thank you for your reply!
I found that the slow_flush warnings do not occur continuously [1]; for example, on 5/9 the error appears roughly once an hour.
I also checked the Fluentd CPU usage: the maximum is 25%, and it stays between 5% and 15% the rest of the time.
As for the Elasticsearch queue, which metrics should I monitor for the queue? Is there any parameter I can add or update in my Fluentd config?

[1]
2020-09-05
01:26:32
2020-09-04 17:26:32 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.50423289102037 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
02:44:50
2020-09-04 18:44:50 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.22521398495883 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
04:25:21
2020-09-04 20:25:21 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.47856833890546 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
04:40:41
2020-09-04 20:40:41 +0000 [warn]: #0 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.49219648807775 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
06:52:47
2020-09-04 22:52:47 +0000 [warn]: #0 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.26145618001465 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
08:06:02
2020-09-05 00:06:02 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.93444448092487 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
09:32:11
2020-09-05 01:32:11 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.2911620939849 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
11:10:27
2020-09-05 03:10:27 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.65923361200839 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
11:51:31
2020-09-05 03:51:31 +0000 [warn]: #0 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.28296210500412 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
14:00:12
2020-09-05 06:00:12 +0000 [warn]: #0 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.41564884700347 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
15:15:28
2020-09-05 07:15:28 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.2021695279982 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
15:23:35
2020-09-05 07:23:35 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=63.77776914194692 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
17:16:42
2020-09-05 09:16:42 +0000 [warn]: #3 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.49122794298455 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
18:14:48
2020-09-05 10:14:48 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.67751345399301 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
18:31:07
2020-09-05 10:31:07 +0000 [warn]: #3 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.22096731502097 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
20:00:02
2020-09-05 12:00:02 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.36720078997314 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
20:12:15
2020-09-05 12:12:15 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=63.81634930497967 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
23:14:21
2020-09-05 15:14:21 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.5038641099818 slow_flush_log_threshold=20.0 plugin_id="firelens_es"

chikinchoi

Sep 7, 2020, 11:11:24 PM
to Fluentd Google Group
Hi repeatedly,

I found that my non-prd Fluentd CPU usage is 100% when the slow_flush error shows up.
May I know how to solve it? Just increase the CPU for my Fluentd?
Also, why are we looking at Fluentd CPU usage instead of Elasticsearch CPU usage?
Thank you very much!


On Monday, September 7, 2020 at 6:55:42 PM UTC+8, repeatedly wrote:

chikinchoi

Sep 9, 2020, 9:13:40 PM
to Fluentd Google Group
Hi repeatedly,

Please help with this question. Thank you very much!


On Monday, September 7, 2020 at 6:55:42 PM UTC+8, repeatedly wrote:

Mr. Fiber

Sep 10, 2020, 5:24:57 PM
to Fluentd Google Group
> May I know how to solve it? Just increase the CPU for my Fluentd?

This is one way. The other way is to optimize the data pipeline. For example:

- Use record_modifier instead of record_transformer for the enable_ruby cases (see the sketch below)
- Remove the unused tag from the buffer chunk keys
- Merge the filter / output prometheus plugins.
  This may be hard, though, because you use these counts to check for error records between input and output.
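
A rough sketch of the first two points, assuming fluent-plugin-record-modifier is installed (its ${...} placeholders run Ruby against the record without needing enable_ruby), and based on the firelens filter and buffer shown earlier:

<filter **firelens**>
  @type record_modifier
  <record>
    taskDef ${record["ecs_task_definition"].gsub(/:.*/, '')}
  </record>
</filter>

# drop the unused tag chunk key, since index_name only uses taskDef and the time format
<buffer time, ecs_task_definition, taskDef>
  # other buffer parameters unchanged
</buffer>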


chikinchoi

Sep 14, 2020, 4:52:05 AM
to Fluentd Google Group
Hi repeatedly,

I have increased the CPU size to 4096 and switched from record_transformer to record_modifier.
However, I can still see some slow_flush errors.
I checked the incoming record count per second: the maximum is only 3000 records per second, which does not seem like a large volume.
Do you have any other ideas to fix it? Thank you very much!

chikinchoi

Sep 16, 2020, 10:19:17 PM
to Fluentd Google Group
Hi,
I am still struggling to find a solution to this issue.
I have increased flush_thread_count from 8 to 32 but still get slow_flush errors. What is the ideal value for this parameter?
Also, I found that I didn't add the "queue_limit_length" parameter to my Fluentd config. Is it the same parameter as "total_limit_size" according to [1]?
Prometheus was added to monitor Fluentd performance, but there is no detailed description of each metric. Which metrics should I check when I get a slow_flush warning?
For example, which metric should I check to know whether all flush threads are busy?
Please help.
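
(Since prometheus_output_monitor is already configured above, a rough starting point, assuming the metric names exposed by recent fluent-plugin-prometheus versions and the metrics endpoint on port 24231 from the config, is to watch the per-output buffer and retry gauges around the time of a slow_flush warning.)

# a growing queue length / total bytes while flush threads are busy suggests
# the destination or the network is the bottleneck rather than Fluentd itself
curl -s http://localhost:24231/metrics \
  | grep -E 'fluentd_output_status_(buffer_queue_length|buffer_total_bytes|retry_count)'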

Vazel

Dec 17, 2021, 3:56:21 AM
to Fluentd Google Group
Have you resolved this?