queued_chunks_limit_size does not limit the number of chunk files created


Mahesh Gowda

Sep 10, 2018, 7:28:35 AM
to Fluentd Google Group
Hi,

We are using fluentd-1.2.2 (td-agent-3.1.1-0.el7.x86_64.rpm).

We use the fluentd file buffer. But we were facing issues when Elasticsearch was down: the number of buffer files created kept increasing, which resulted in all of the FD (file descriptor) resources being consumed.

So we updated to fluentd-1.2.2, where the queued_chunks_limit_size parameter is available, which should limit the number of chunk files created. Unfortunately, it does not limit the number of files created when the destination is not reachable, and the FD resources keep being consumed.

Below is a test td-agent.conf

  <source>
    @type dummy
    tag "log.test"
    dummy {"hello":"world"}
  </source>
  <match log.*>
    @type copy
    <store>
      @type "elasticsearch"
      @log_level "info"
      include_tag_key true
      host "elasticsearch"
      port 9200
      logstash_format true
      logstash_prefix "journal"
      <buffer tag>
        @type "file"
        path "/tmp/fluentd122/"
        flush_mode interval
        flush_interval 1s
        retry_forever true
        retry_max_interval 5s
        chunk_limit_size 1k
        queued_chunks_limit_size 2
      </buffer>
    </store>
    <store>
      @type "stdout"
    </store>
  </match>

Here we have set queued_chunks_limit_size to 2 and made the Elasticsearch destination unreachable. As time goes on, the number of files created keeps increasing.

[root@vm-10-197-171-9 /tmp/fluentd122]$ ls -lhrt
total 172K
-rw-r--r--. 1 root root   60 Sep 10 10:37 buffer.q57581f1d58d817e450e4938bac6a273c.log
-rw-r--r--. 1 root root   72 Sep 10 10:37 buffer.q57581f1d58d817e450e4938bac6a273c.log.meta
-rw-r--r--. 1 root root   60 Sep 10 10:37 buffer.q57581f1f428284446d11b211f6b9760a.log
-rw-r--r--. 1 root root   72 Sep 10 10:37 buffer.q57581f1f428284446d11b211f6b9760a.log.meta
-rw-r--r--. 1 root root 9.6K Sep 10 10:42 buffer.q57581f212c12afd97f6a569ba588a68a.log
-rw-r--r--. 1 root root   74 Sep 10 10:42 buffer.q57581f212c12afd97f6a569ba588a68a.log.meta
-rw-r--r--. 1 root root   74 Sep 10 10:47 buffer.q5758204e7abe0bc0538fb389037fb397.log.meta
-rw-r--r--. 1 root root 9.5K Sep 10 10:47 buffer.q5758204e7abe0bc0538fb389037fb397.log
-rw-r--r--. 1 root root   74 Sep 10 10:52 buffer.q575821705b1558bf5f9173ff52d3dad2.log.meta
-rw-r--r--. 1 root root 9.5K Sep 10 10:52 buffer.q575821705b1558bf5f9173ff52d3dad2.log
-rw-r--r--. 1 root root   74 Sep 10 10:58 buffer.q575822924bc6724b4e00cba12fb89bec.log.meta
-rw-r--r--. 1 root root 9.5K Sep 10 10:58 buffer.q575822924bc6724b4e00cba12fb89bec.log
-rw-r--r--. 1 root root   74 Sep 10 11:03 buffer.q575823b43a1dbe2e4f388be78c6e2ed6.log.meta
-rw-r--r--. 1 root root 9.5K Sep 10 11:03 buffer.q575823b43a1dbe2e4f388be78c6e2ed6.log
-rw-r--r--. 1 root root 9.5K Sep 10 11:08 buffer.q575824d61df3e15bf315f62b3d19b55c.log
-rw-r--r--. 1 root root   74 Sep 10 11:08 buffer.q575824d61df3e15bf315f62b3d19b55c.log.meta
-rw-r--r--. 1 root root   74 Sep 10 11:13 buffer.q575825f8103290248f868b2ea2205e09.log.meta
-rw-r--r--. 1 root root 9.5K Sep 10 11:13 buffer.q575825f8103290248f868b2ea2205e09.log
-rw-r--r--. 1 root root 9.5K Sep 10 11:18 buffer.q57582719fe079e3e15de0d301c0a6cb3.log
-rw-r--r--. 1 root root   74 Sep 10 11:18 buffer.q57582719fe079e3e15de0d301c0a6cb3.log.meta
-rw-r--r--. 1 root root   74 Sep 10 11:23 buffer.q5758283be622dc9ec6ed9956f649561f.log.meta
-rw-r--r--. 1 root root 9.5K Sep 10 11:23 buffer.q5758283be622dc9ec6ed9956f649561f.log
-rw-r--r--. 1 root root   73 Sep 10 11:26 buffer.b5758295dd8bd9852d73f2751fdaea056.log.meta
-rw-r--r--. 1 root root 6.0K Sep 10 11:26 buffer.b5758295dd8bd9852d73f2751fdaea056.log

New buffer files keep being created indefinitely; a new one appears once the current file size reaches 10K.

Please let us know if there is something wrong with our configuration.

Thanks in advance,
Mahesh

Mahesh Gowda

Sep 10, 2018, 8:35:45 AM
to Fluentd Google Group
Hi,
When running the configuration in debug mode, I noticed these messages in the logs:

2018-09-10 12:10:21 +0000 [warn]: #0 fluent/log.rb:336:warn: both of Plugin @id and path for <storage> are not specified. Using on-memory store.
2018-09-10 12:10:21 +0000 [warn]: fluent/log.rb:336:warn: parameter 'queued_chunks_limit_size' in <buffer tag>

  @type "file"
  path "/tmp/fluentd122/"
  flush_mode interval
  flush_interval 1s
  retry_forever true
  retry_max_interval 5s
  chunk_limit_size 10k
  queued_chunks_limit_size 2
</buffer> is not used.

Can you please let us know why this parameter is not being considered?

Thanks,

Mr. Fiber

Sep 10, 2018, 8:37:54 AM
to Fluentd Google Group
Please paste the full startup log.


Mahesh Gowda

Sep 11, 2018, 8:59:35 AM
to Fluentd Google Group

I'm very sorry, I posted the wrong logs and was looking at the wrong version of td-agent; please ignore the previous post.

The fluentd version is 1.2.2, with td-agent-3.2.0-0.el7.x86_64.rpm.

The chunk files keep getting created indefinitely when the destination is not reachable, even with queued_chunks_limit_size 2.

The fluentd logs are as below:
[root@vm-10-197-171-9 /home/cloud-user/mahesh]$ td-agent -c td-agent.conf
2018-09-11 12:44:21 +0000 [info]: parsing config file is succeeded path="td-agent.conf"
2018-09-11 12:44:22 +0000 [warn]: both of Plugin @id and path for <storage> are not specified. Using on-memory store.
2018-09-11 12:44:22 +0000 [warn]: both of Plugin @id and path for <storage> are not specified. Using on-memory store.
2018-09-11 12:44:22 +0000 [info]: using configuration file: <ROOT>

  <source>
    @type dummy
    tag "log.test"
    auto_increment_key "test"

    dummy {"hello":"world"}
  </source>
  <match log.*>
    @type copy
    <store>
      @type "elasticsearch"
      @log_level "info"
      include_tag_key true
      host "172.17.0.2"

      port 9200
      logstash_format true
      logstash_prefix "journal"
      <buffer tag>
        @type "file"
        path "/tmp/fluentd122/"
        flush_mode interval
        flush_interval 1s
        retry_forever true
        retry_max_interval 5s
        chunk_limit_size 2k

        queued_chunks_limit_size 2
      </buffer>
    </store>
    <store>
      @type "stdout"
    </store>
  </match>
</ROOT>
2018-09-11 12:44:22 +0000 [info]: starting fluentd-1.2.2 pid=5446 ruby="2.4.4"
2018-09-11 12:44:22 +0000 [info]: spawn command to main:  cmdline=["/opt/td-agent/embedded/bin/ruby", "-Eascii-8bit:ascii-8bit", "/sbin/td-agent", "-c", "td-agent.conf", "--under-supervisor"]
2018-09-11 12:44:22 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '2.10.3'
2018-09-11 12:44:22 +0000 [info]: gem 'fluent-plugin-kafka' version '0.7.3'
2018-09-11 12:44:22 +0000 [info]: gem 'fluent-plugin-record-modifier' version '1.1.0'
2018-09-11 12:44:22 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.1.0'
2018-09-11 12:44:22 +0000 [info]: gem 'fluent-plugin-s3' version '1.1.3'
2018-09-11 12:44:22 +0000 [info]: gem 'fluent-plugin-td' version '1.0.0'
2018-09-11 12:44:22 +0000 [info]: gem 'fluent-plugin-td-monitoring' version '0.2.4'
2018-09-11 12:44:22 +0000 [info]: gem 'fluent-plugin-webhdfs' version '1.2.3'
2018-09-11 12:44:22 +0000 [info]: gem 'fluentd' version '1.2.2'
2018-09-11 12:44:22 +0000 [info]: adding match pattern="log.*" type="copy"
2018-09-11 12:44:22 +0000 [info]: adding source type="dummy"
2018-09-11 12:44:22 +0000 [warn]: #0 both of Plugin @id and path for <storage> are not specified. Using on-memory store.
2018-09-11 12:44:22 +0000 [warn]: #0 both of Plugin @id and path for <storage> are not specified. Using on-memory store.
2018-09-11 12:44:22 +0000 [info]: #0 starting fluentd worker pid=5450 ppid=5446 worker=0
2018-09-11 12:44:22 +0000 [info]: #0 fluentd worker is now running worker=0
2018-09-11 12:44:23.031627925 +0000 log.test: {"hello":"world","test":0}
2018-09-11 12:44:24.038717060 +0000 log.test: {"hello":"world","test":1}
2018-09-11 12:44:25.040413341 +0000 log.test: {"hello":"world","test":2}
2018-09-11 12:44:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2018-09-11 12:44:26 +0000 chunk="57597d552b47fa0eb9fc417c8fe24de8" error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Can not reach Elasticsearch cluster ({:host=>\"172.17.0.2\", :port=>9200, :scheme=>\"http\"})!"
  2018-09-11 12:44:25 +0000 [warn]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:262:in `client'
  2018-09-11 12:44:25 +0000 [warn]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:521:in `rescue in send_bulk'
  2018-09-11 12:44:25 +0000 [warn]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:512:in `send_bulk'
  2018-09-11 12:44:25 +0000 [warn]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:414:in `write'
  2018-09-11 12:44:25 +0000 [warn]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.2/lib/fluent/plugin/output.rb:1099:in `try_flush'
  2018-09-11 12:44:25 +0000 [warn]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.2/lib/fluent/plugin/output.rb:1378:in `flush_thread_run'
  2018-09-11 12:44:25 +0000 [warn]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.2/lib/fluent/plugin/output.rb:440:in `block (2 levels) in start'
  2018-09-11 12:44:25 +0000 [warn]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.2/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2018-09-11 12:44:26.044883361 +0000 log.test: {"hello":"world","test":3}
2018-09-11 12:44:26 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2018-09-11 12:44:26 +0000 chunk="57597d552b47fa0eb9fc417c8fe24de8" error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Can not reach Elasticsearch cluster ({:host=>\"172.17.0.2\", :port=>9200, :scheme=>\"http\"})!"
  2018-09-11 12:44:26 +0000 [warn]: #0 suppressed same stacktrace
2018-09-11 12:44:26 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2018-09-11 12:44:26 +0000 chunk="57597d552b47fa0eb9fc417c8fe24de8" error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Can not reach Elasticsearch cluster ({:host=>\"172.17.0.2\", :port=>9200, :scheme=>\"http\"})!"
  2018-09-11 12:44:26 +0000 [warn]: #0 suppressed same stacktrace
2018-09-11 12:44:27.046902096 +0000 log.test: {"hello":"world","test":4}
2018-09-11 12:44:28.048874091 +0000 log.test: {"hello":"world","test":5}
2018-09-11 12:44:28 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2018-09-11 12:44:28 +0000 chunk="57597d552b47fa0eb9fc417c8fe24de8" error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Can not reach Elasticsearch cluster ({:host=>\"172.17.0.2\", :port=>9200, :scheme=>\"http\"})!"
  2018-09-11 12:44:28 +0000 [warn]: #0 suppressed same stacktrace
2018-09-11 12:44:29.050509308 +0000 log.test: {"hello":"world","test":6}
2018-09-11 12:44:30.052141194 +0000 log.test: {"hello":"world","test":7}
2018-09-11 12:44:31.053766455 +0000 log.test: {"hello":"world","test":8}
2018-09-11 12:44:32.055380838 +0000 log.test: {"hello":"world","test":9}

The buffer directory listing is as below:

[root@vm-10-197-171-9 /tmp/fluentd122]$ ls -lhrt
total 96K
-rw-------. 1 root root   60 Sep 11 12:44 buffer.q57597d552b47fa0eb9fc417c8fe24de8.log
-rw-------. 1 root root   72 Sep 11 12:44 buffer.q57597d552b47fa0eb9fc417c8fe24de8.log.meta
-rw-------. 1 root root   60 Sep 11 12:44 buffer.q57597d5715aa1d017eec6480debcd428.log
-rw-------. 1 root root   72 Sep 11 12:44 buffer.q57597d5715aa1d017eec6480debcd428.log.meta
-rw-------. 1 root root 2.0K Sep 11 12:45 buffer.q57597d58ff8c5d286795c35b92c0f5d5.log
-rw-------. 1 root root   72 Sep 11 12:45 buffer.q57597d58ff8c5d286795c35b92c0f5d5.log.meta
-rw-------. 1 root root 2.0K Sep 11 12:46 buffer.q57597d96f7d13f9a3e980ef3acd775e2.log
-rw-------. 1 root root   72 Sep 11 12:46 buffer.q57597d96f7d13f9a3e980ef3acd775e2.log.meta
-rw-------. 1 root root 2.0K Sep 11 12:47 buffer.q57597dd5fbe5432ac7182fcbb6e5c6aa.log
-rw-------. 1 root root   72 Sep 11 12:47 buffer.q57597dd5fbe5432ac7182fcbb6e5c6aa.log.meta
-rw-------. 1 root root   72 Sep 11 12:48 buffer.q57597e11fd2f45b013d18804d599549e.log.meta
-rw-------. 1 root root 2.0K Sep 11 12:48 buffer.q57597e11fd2f45b013d18804d599549e.log
-rw-------. 1 root root 2.0K Sep 11 12:49 buffer.q57597e4e15fda7acc217dfbc0febeedd.log
-rw-------. 1 root root   72 Sep 11 12:49 buffer.q57597e4e15fda7acc217dfbc0febeedd.log.meta
-rw-------. 1 root root 2.0K Sep 11 12:50 buffer.q57597e8847c6aecd2ce87369fdea5957.log
-rw-------. 1 root root   72 Sep 11 12:50 buffer.q57597e8847c6aecd2ce87369fdea5957.log.meta
-rw-------. 1 root root   72 Sep 11 12:51 buffer.q57597ec27b25c2e8202d252ee642d68c.log.meta
-rw-------. 1 root root 2.0K Sep 11 12:51 buffer.q57597ec27b25c2e8202d252ee642d68c.log
-rw-------. 1 root root   72 Sep 11 12:52 buffer.q57597efc949547fdb5baf37db8bf67f4.log.meta
-rw-------. 1 root root 2.0K Sep 11 12:52 buffer.q57597efc949547fdb5baf37db8bf67f4.log
-rw-------. 1 root root 2.0K Sep 11 12:53 buffer.q57597f36d0849cd6d3808ec9c090775c.log
-rw-------. 1 root root   72 Sep 11 12:53 buffer.q57597f36d0849cd6d3808ec9c090775c.log.meta
-rw-------. 1 root root   72 Sep 11 12:54 buffer.b57597f70f2f7f7034cccb24890701285.log.meta
-rw-------. 1 root root 1.5K Sep 11 12:54 buffer.b57597f70f2f7f7034cccb24890701285.log


  Please suggest.

Thanks & Regards,
Mahesh

Mr. Fiber

Sep 18, 2018, 5:10:31 AM
to Fluentd Google Group
We implemented queued_chunks_limit_size for the short flush_interval issue, so it does not cover the very-small-chunk-size case for now. In your case, the problem is the small chunk size, not queued_chunks_limit_size, because the total number of buffer files does not change. If you want to avoid the FD limitation problem, you need to set a proper total_limit_size in your buffer configuration.
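
A minimal sketch of what that could look like, reusing the buffer section from the test config above (the total_limit_size value and the overflow_action choice are illustrative assumptions, not values recommended in this thread):

  <buffer tag>
    @type file
    path /tmp/fluentd122/
    flush_mode interval
    flush_interval 1s
    retry_forever true
    retry_max_interval 5s
    chunk_limit_size 1m
    queued_chunks_limit_size 2
    # Cap the total on-disk buffer size. Once the cap is reached, new data is
    # handled according to overflow_action instead of creating chunk files
    # without bound.
    total_limit_size 512m
    # block / throw_exception / drop_oldest_chunk
    overflow_action drop_oldest_chunk
  </buffer>

With a cap like this, the buffer stops growing (and stops opening new chunk files) once the limit is hit, at the cost of blocking or dropping data depending on the overflow_action.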

Mahesh Gowda

Sep 19, 2018, 4:03:57 AM
to Fluentd Google Group
Hi,
Thanks for the reply.

<match nokia.logging.**>
    @type copy
    <store>
        @type elasticsearch
        host elasticsearch.default
        port 9200
        resurrect_after 5s
        type_name fluentd
        time_key time
        utc_index true
        time_key_exclude_timestamp
        index_name ${namespace}${type}%Y.%m.%d
        <buffer tag, time, namespace, type>
            @type file
            path /var/log/td-agent/elasticsearch-buffer/nokia.logging.all.all
            flush_mode interval
            flush_interval 1s
            timekey 3600
        </buffer>
    </store>
    <store>
        @type stdout
    </store>
</match>
With the above configuration in our environment, where fluentd is expected to collect all the Kubernetes logs, the number of FDs opened by fluentd crossed 3 lakh (300,000) when Elasticsearch was down, and the system became unresponsive. When we checked the buffer directory, there were around 3 lakh files, with sizes ranging from a few KB to several MB.

After reading the documentation, our understanding was that the queued_chunks_limit_size parameter should limit the number of files created, so we added this parameter and ran fluentd with the configuration changes below:
<match nokia.logging.**>
    @type copy
    <store>
        @type elasticsearch
        host elasticsearch.default
        port 9200
        resurrect_after 5s
        type_name fluentd
        time_key time
        utc_index true
        time_key_exclude_timestamp
        index_name ${namespace}${type}%Y.%m.%d
        <buffer tag, time, namespace, type>
            @type file
            path /var/log/td-agent/elasticsearch-buffer/nokia.logging.all.all
            flush_mode interval
            flush_interval 1s
            timekey 3600
            chunk_limit_size 1m
            queued_chunks_limit_size 2
        </buffer>
    </store>
    <store>
        @type stdout
    </store>
</match>

But even with these changes, the number of files created did not stop growing; we hit the same FD issue and the system went down again.
Please correct us if our understanding of queued_chunks_limit_size is wrong.

Also, can you please suggest how we can solve this FD issue when the fluentd destination (e.g. Elasticsearch) is not reachable?

Thanks in advance,
Mahesh