Got "buffer flush took longer time than slow_flush_log_threshold" error


chikinchoi

Sep 6, 2020, 9:40:37 PM
to Fluentd Google Group

Hi Team,

I am getting a lot of "buffer flush took longer time than slow_flush_log_threshold: elapsed_time=60.16122445899964 slow_flush_log_threshold=20.0 plugin_id="firelens_es"" warnings in my Fluentd, which forwards logs to Elasticsearch.
I saw a document [1] that says the reason for this error is that Elasticsearch is exhausted. However, I checked all the Elasticsearch clusters and their CPU usage is low and healthy. Is there any other reason that could cause this error?
Please help. Thank you very much.

Below is my Fluentd config:

<system>
  @log_level debug
  workers 4
  root_dir /var/log/fluent/
</system>

<source>
  @type  forward
  @id    input1
  @log_level debug
  port  24224
  bind 0.0.0.0
</source>

# Used for docker health check
<source>
  @type http
  port 8888
  bind 0.0.0.0
</source>

# records sent for health checking won't be forwarded anywhere
<match health**>
  @type null
</match>

# Prometheus Configuration
# count number of incoming records per tag
<filter **>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag}
      hostname ${hostname}
    </labels>
  </metric>
</filter>

<filter **firelens**>
  @type concat
  key log
  multiline_start_regexp '^\{\\"@timestamp'
  multiline_end_regexp '/\}/'
  separator ""
  flush_interval 1
  timeout_label @NORMAL
</filter>

<filter **firelens**>
  @type parser
  key_name log
  reserve_data true
  emit_invalid_record_to_error false
  <parse>
  @type json
  </parse>
</filter>

<filter **firelens**>
  @type record_transformer
  enable_ruby
  <record>
    taskDef ${record["ecs_task_definition"].gsub(/:.*/, '')}
  </record>
</filter>

<filter lambdaNode**>
  @type parser
  key_name data
  reserve_data true
  emit_invalid_record_to_error false
  <parse>
  @type json
  </parse>
</filter>

<filter lambdaPython**>
  @type record_transformer
  enable_ruby
  <record>
    function_name ${record["function_name"].gsub(/:.*/, '')}
  </record>
</filter>

<filter lambdaPython**>
  @type parser
  key_name message
  reserve_data true
  emit_invalid_record_to_error false
  <parse>
  @type json
  </parse>
</filter>

# count number of outgoing records per tag
<match **firelens**>
  @type copy
  @id firelens
  <store>
    @type elasticsearch
    @id firelens_es
    scheme https
    ssl_version TLSv1_2
    host  "#{ENV['ELASTIC_HOST']}"
    port  "#{ENV['ELASTIC_POST']}"
    user  "#{ENV['ELASTIC_USER']}"
    password "#{ENV['ELASTIC_PWD']}"
    ssl_verify false
    log_es_400_reason true
    index_name ${taskDef}-%Y.%m
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    suppress_type_name true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    request_timeout 2147483648
    <buffer tag, time, ecs_task_definition, taskDef>
      @type file
      flush_mode lazy
      flush_thread_count 8
      # path /var/log/fluent/firelens.*.buffer
      total_limit_size 8GB
      chunk_limit_size 80MB
      timekey 5s
      timekey_wait 1s
      overflow_action drop_oldest_chunk
      retry_max_interval 16s
      disable_chunk_backup true
      retry_forever false
      retry_max_times 1
    </buffer>
  </store>
  <store>
    @type prometheus
    @id firelens_pro
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
      <labels>
        tag ${tag}
        hostname ${hostname}
      </labels>
    </metric>
  </store>
</match>

<match lambdaNode**>
  @type copy
  @id lambdaNode
  <store>
    @type elasticsearch
    @id lambdaNode_es
    scheme https
    ssl_version TLSv1_2
    host  "#{ENV['ELASTIC_HOST']}"
    port  "#{ENV['ELASTIC_POST']}"
    user  "#{ENV['ELASTIC_USER']}"
    password "#{ENV['ELASTIC_PWD']}"
    include_timestamp true
    ssl_verify false
    log_es_400_reason true
    index_name ${$.context.functionName}-%Y.%m
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    suppress_type_name true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    reload_after 100
    <buffer tag, time, $.context.functionName>
      flush_mode lazy
      chunk_limit_size 5MB
      flush_thread_count 8
      total_limit_size 512MB
      timekey 5s
      timekey_wait 1s
      retry_max_interval 16s
      disable_chunk_backup true
    </buffer>
  </store>
  <store>
    @type prometheus
    @id lambdaNode_pro
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
      <labels>
        tag ${tag}
        hostname ${hostname}
      </labels>
    </metric>
  </store>
</match>

<match lambdaPython**>
  @type copy
  @id lambdaPython
  <store>
    @type elasticsearch
    @id lambdaPython_es
    scheme https
    ssl_version TLSv1_2
    host  "#{ENV['ELASTIC_HOST']}"
    port  "#{ENV['ELASTIC_POST']}"
    user  "#{ENV['ELASTIC_USER']}"
    password "#{ENV['ELASTIC_PWD']}"
    include_timestamp true
    ssl_verify false
    log_es_400_reason true
    index_name ${function_name}-%Y.%m
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    suppress_type_name true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    reload_after 100
    <buffer tag, time, function_name>
      flush_mode lazy
      chunk_limit_size 5MB
      flush_thread_count 8
      total_limit_size 512MB
      timekey 5s
      timekey_wait 1s
      retry_max_interval 16s
      disable_chunk_backup true
    </buffer>
  </store>
  <store>
    @type prometheus
    @id lambdaPython_pro
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
      <labels>
        tag ${tag}
        hostname ${hostname}
      </labels>
    </metric>
  </store>
</match>

<label @FLUENT_LOG>
  <match fluent.*>
    @type null
  </match>
</label>

<label @NORMAL>
  <match **>
    @type null
  </match>
</label>

# expose metrics in prometheus format
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
  interval 10
  <labels>
    hostname ${hostname}
  </labels>
</source>
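
(For reference, the 20.0 second threshold named in the warning is itself a per-output parameter; raising it, as in this minimal sketch against the elasticsearch <store> above, only changes when the warning is logged, not how long the flush actually takes.)

<store>
  @type elasticsearch
  @id firelens_es
  # existing elasticsearch and buffer settings unchanged
  slow_flush_log_threshold 60.0
</store>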



Mr. Fiber

Sep 7, 2020, 6:55:42 AM
to Fluentd Google Group
If Fluentd's CPU usage is normal, other possible causes are:

- a network issue: a large number of requests consumes network bandwidth
- the Elasticsearch queue is filled by many small requests
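
(One way to check the second point, assuming direct access to the cluster, is a rough sketch using the thread pool cat API: bulk indexing goes through the "write" thread pool on recent Elasticsearch versions, and non-zero rejections or a persistently full queue there points at Elasticsearch rather than Fluentd.)

# <es-host> is a placeholder; the pool is named "bulk" on older ES versions
curl -s 'https://<es-host>:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed'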




chikinchoi

Sep 7, 2020, 10:07:24 PM
to Fluentd Google Group
Hi repeatedly,

Thank you for your reply!
I found that the slow_flush warnings do not occur continuously [1]; for example, on 5/9 the error appears roughly once an hour.
I also checked the Fluentd CPU usage: the maximum is 25%, and it stays between 5% and 15% the rest of the time.
As for the Elasticsearch queue, which metrics should I monitor for the queue? Is there any parameter I can add or update in my Fluentd config?

[1]
2020-09-05
01:26:32
2020-09-04 17:26:32 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.50423289102037 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
02:44:50
2020-09-04 18:44:50 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.22521398495883 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
04:25:21
2020-09-04 20:25:21 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.47856833890546 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
04:40:41
2020-09-04 20:40:41 +0000 [warn]: #0 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.49219648807775 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
06:52:47
2020-09-04 22:52:47 +0000 [warn]: #0 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.26145618001465 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
08:06:02
2020-09-05 00:06:02 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.93444448092487 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
09:32:11
2020-09-05 01:32:11 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.2911620939849 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
11:10:27
2020-09-05 03:10:27 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.65923361200839 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
11:51:31
2020-09-05 03:51:31 +0000 [warn]: #0 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.28296210500412 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
14:00:12
2020-09-05 06:00:12 +0000 [warn]: #0 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.41564884700347 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
15:15:28
2020-09-05 07:15:28 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.2021695279982 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
15:23:35
2020-09-05 07:23:35 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=63.77776914194692 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
17:16:42
2020-09-05 09:16:42 +0000 [warn]: #3 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.49122794298455 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
18:14:48
2020-09-05 10:14:48 +0000 [warn]: #1 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.67751345399301 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
18:31:07
2020-09-05 10:31:07 +0000 [warn]: #3 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.22096731502097 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
20:00:02
2020-09-05 12:00:02 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=65.36720078997314 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
20:12:15
2020-09-05 12:12:15 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=63.81634930497967 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
23:14:21
2020-09-05 15:14:21 +0000 [warn]: #2 [firelens_es] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.5038641099818 slow_flush_log_threshold=20.0 plugin_id="firelens_es"

chikinchoi

Sep 7, 2020, 11:11:24 PM
to Fluentd Google Group
Hi repeatedly,

I found that my non-prd Fluentd CPU usage is 100% when the slow_flush error shows up.
May I know how to solve it? Just increase the CPU for my Fluentd?
Also, why are we looking at Fluentd CPU usage instead of Elasticsearch CPU usage?
Thank you very much!


On Monday, September 7, 2020 at 6:55:42 PM UTC+8, repeatedly wrote:

chikinchoi

Sep 9, 2020, 9:13:40 PM
to Fluentd Google Group
Hi repeatedly,

Please help with this question. Thank you very much!


On Monday, September 7, 2020 at 6:55:42 PM UTC+8, repeatedly wrote:

Mr. Fiber

Sep 10, 2020, 5:24:57 PM
to Fluentd Google Group
> May I know how to solve it? Just increase the CPU for my Fluentd?

This is one way. The other way is to optimize the data pipeline. For example:

- Use record_modifier instead of record_transformer for the enable_ruby cases (see the sketch below)
- Remove the unused tag from the buffer chunk keys
- Merge the filter / output prometheus plugins.
  This may be hard, though, because you use these counts to check for error records between input and output.
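
A rough sketch of the first two points, assuming fluent-plugin-record-modifier is installed (its ${...} placeholders run Ruby against the record without needing enable_ruby), and based on the firelens filter and buffer shown earlier:

<filter **firelens**>
  @type record_modifier
  <record>
    taskDef ${record["ecs_task_definition"].gsub(/:.*/, '')}
  </record>
</filter>

# drop the unused tag chunk key, since index_name only uses taskDef and the time format
<buffer time, ecs_task_definition, taskDef>
  # other buffer parameters unchanged
</buffer>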


chikinchoi

Sep 14, 2020, 4:52:05 AM
to Fluentd Google Group
Hi repeatedly,

I have increased the CPU size to 4096 and switched from record_transformer to record_modifier.
However, I can still see some slow_flush errors.
I checked the incoming record count per second: the maximum is only 3000 records per second, which does not seem like a large volume.
Do you have any other ideas to fix it? Thank you very much!

chikinchoi

Sep 16, 2020, 10:19:17 PM
to Fluentd Google Group
Hi,
I am still struggling to find a solution to this issue.
I have increased flush_thread_count from 8 to 32 but still get slow_flush errors. What is the ideal value for this parameter?
Also, I found that I didn't add the "queue_limit_length" parameter to my Fluentd config. Is it the same parameter as "total_limit_size" according to [1]?
Prometheus was added to monitor Fluentd performance, but there is no detailed description of each metric. Which metrics should I check when I get a slow_flush warning?
For example, which metric should I check to know whether all flush threads are busy?
Please help.
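
(Since prometheus_output_monitor is already configured above, a rough starting point, assuming the metric names exposed by recent fluent-plugin-prometheus versions and the metrics endpoint on port 24231 from the config, is to watch the per-output buffer and retry gauges around the time of a slow_flush warning.)

# a growing queue length / total bytes while flush threads are busy suggests
# the destination or the network is the bottleneck rather than Fluentd itself
curl -s http://localhost:24231/metrics \
  | grep -E 'fluentd_output_status_(buffer_queue_length|buffer_total_bytes|retry_count)'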

Vazel

Dec 17, 2021, 3:56:21 AM
to Fluentd Google Group
Have you resolved this?