Buffer Never Flushed


Jacob Williams

Jan 25, 2017, 5:46:22 PM
to Fluentd Google Group
Our company uses fluentd to feed logs to our proprietary database/search server. For reasons I have not been able to identify, fluentd stopped flushing the memory buffer for our plugin that sends the aggregated records to our database.

It was working just fine for weeks, and then suddenly stopped at 5:06 am on Monday morning. No errors are logged. I've restarted fluentd, but it still will not flush the buffer to send the log entries to our database.

This is our config:

<source>
  @type tail
  path /opt/search/appliance5/logs/ps-usage.*.log
  pos_file /opt/search/appliance5/logs/ps-usage.log.pos
  format /(?<logEntry>{.*)/
  tag logentry
  read_from_head true
</source>

<filter logentry>
  @type psxml
</filter>

<match logentry>
  @type http_buffered
  <buffer>
    flush_interval 30s
    chunk_limit_size 2m
  </buffer>
  endpoint_url http://localhost:30000/cache/addrecords?store=%Y
  http_retry_statuses 500, 403, 404, 400
  http_read_timeout 30
  http_open_timeout 30
</match>


I've verified that the psxml filter is working properly and that the records are making it past the filter and into the buffer queue.
I doubt it's the http_buffered plugin, since it had been working perfectly fine. But if you want to see the code, it's on our GitHub: https://github.com/perfectsearch/fluent-plugin-out-http-buffered


The buffer fills up and starts throwing overflow exceptions everywhere, but the write method in the http_buffered plugin is never being called.

Mr. Fiber

Jan 25, 2017, 8:11:55 PM
to Fluentd Google Group
>  I've restarted fluentd, but it still will not flush the buffer to send the log entries to our database.

That means this problem is reproducible, right?
If so, could you check the sigdump result?

http://docs.fluentd.org/articles/trouble-shooting#dump-fluentd-internal-information

If a deadlock is happening, the thread stacktraces will show it.
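
For reference, fluentd's sigdump integration dumps thread backtraces when the process receives SIGCONT, writing them to /tmp/sigdump-<pid>.log by default. A minimal sketch of triggering it from the same host, in Ruby (the pidfile path here is only an assumption; use however you track the fluentd PID, or simply run the equivalent kill -CONT):

  # Send SIGCONT to the running fluentd process; sigdump then writes
  # thread backtraces to /tmp/sigdump-<pid>.log.
  pidfile = "/var/run/fluent/fluentd.pid"          # assumed path; adjust to your setup
  fluentd_pid = Integer(File.read(pidfile).strip)
  Process.kill(:CONT, fluentd_pid)
  puts "check /tmp/sigdump-#{fluentd_pid}.log for the dump"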


> but the write method in the http_buffered plugin is never being called.

How did you check this? Did you insert logging code, or is your endpoint simply not receiving requests?


Masahiro


Jacob Williams

Jan 26, 2017, 2:24:15 PM
to Fluentd Google Group
There were no sigdump logs from the 23rd. There were a couple from the 20th, but I don't know that those are important, since fluentd continued to work properly for the next few days. The first real sigdump logging did not occur until after I restarted fluentd on the 24th around noon. Because my company works with hospitals/care centers, much of the data logged contains people's personal health information, some of which has been included in the sigdump files. So I have removed anything that could possibly contain sensitive info before sending the sigdump log.

On the machine, I added a $log.debug call as the first line of the write method in the http_buffered plugin, then started fluentd in full debug mode. That log message was never logged, which is how I know the write method was not called.
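
Concretely, the probe was along these lines (a sketch only; the method name and chunk argument follow the v0.12-style buffered output API that fluent-plugin-out-http-buffered uses, and the rest of the method body is unchanged and elided):

  def write(chunk)
    $log.debug "http_buffered#write called"   # this line never showed up in the debug output
    # ... original request-building and POST logic unchanged ...
  end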
sigdump-4069-clean.log

Mr. Fiber

Jan 27, 2017, 12:04:39 AM
to Fluentd Google Group
> There were a couple from the 20th

So the attached sigdump result doesn't contain thread info from the stopped state?

> but it still will not flush the buffer to send the log entries to our database.

To be clear, in this situation, http_buffered receives new events but it doesn't flush, right?

> On the machine, I added a $log.debug call as the first line of the write method in the http_buffered plugin.

Thanks for the explanation.



Mr. Fiber

Jan 27, 2017, 12:27:51 AM
to Fluentd Google Group
If we can get a sigdump result from the stopped state,
we may be able to detect the cause of the problem.
Currently, there are several possibilities:

- the buffer has a race condition between enqueue and flush
- Net::HTTP is blocking while writing data; Net::HTTP doesn't have a write_timeout, unlike httpclient (see the sketch below)
- etc...

So I want to narrow down the possible causes to resolve the problem.
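
As a rough illustration of the second point (a sketch only, reusing the host, port, and timeouts from the config earlier in this thread): Net::HTTP lets you set open and read timeouts, but on the Ruby versions current at the time there was no write timeout, so a connection that stalls while the request body is being sent can block the flush thread indefinitely.

  require "net/http"

  http = Net::HTTP.new("localhost", 30000)   # endpoint host/port from the config above
  http.open_timeout = 30                     # corresponds to http_open_timeout
  http.read_timeout = 30                     # corresponds to http_read_timeout
  # http.write_timeout = 30                  # not available until Ruby 2.6's Net::HTTP
  http.start do |h|
    # a stalled socket here (while sending the body) has no timeout to fall back on
    h.request_post("/cache/addrecords?store=2017", "payload")
  end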



Jacob Williams

Jan 27, 2017, 12:58:52 PM
to Fluentd Google Group
> To be clear, in this situation, http_buffered receives new events but it doesn't flush, right?

How would I check this? I know the plugin is started, and the buffer is being filled, but #write is never called. How do I tell if it is receiving new events?

> If we can get a sigdump result from the stopped state, we may be able to detect the cause of the problem.

I'll send all the other sigdump logs I have, just in case. After glancing through them myself (specifically sigdump-22727.log), it looks like it can't create the thread that it would use to flush the buffer, if I understand the logs correctly...
sigdump-4064.log
sigdump-4069.log
sigdump-22722.log
sigdump-22727.log
sigdump-142829.log
sigdump-142834.log
sigdump-147136.log
sigdump-147141.log

Jacob Williams

Jan 27, 2017, 1:02:51 PM
to Fluentd Google Group
It was brought to my attention that before the fluentd process was killed, it had been using 100% of a processor core on the machine. This may be why it couldn't create a thread to flush the buffer. Are there specific reasons why fluentd would be using 100% of a CPU? This was running on a production machine with 128 GB of RAM and a 32-core processor (not sure about the CPU clock speed, though). In our small-scale tests, we never saw fluentd using very much of a machine's resources, even under strenuous conditions with massive amounts of logs to filter and send...

Jacob Williams

Jan 30, 2017, 5:04:30 PM
to Fluentd Google Group
So we managed to identify the problem. There are a large number of very large files being processed by fluentd. The documentation states that with read_from_head set to true, the in_tail plugin will process each file before other plugins continue starting. As a result, the out-http-buffered plugin was never started, and therefore the buffer was never flushed, since the plugin that handles the flushes wasn't running.

My question for you now is: what happens if we set read_from_head to false? Several of the files being processed are actively being appended to. What happens to the new data added to those files when fluentd does not read from the top of the file?

Or is there some way to still read from the head, but not have the tailing plugin process the files before our other plugins continue starting?

Mr. Fiber

Feb 2, 2017, 5:49:39 AM
to Fluentd Google Group
> As a result, the out-http-buffered plugin was never started and therefore

That's weird. Fluentd's start sequence is

- start output
- start filter
- start input

So the output plugin should be started before in_tail starts reading files.

> What happens if we set read_from_head to false?

Fluentd starts reading the file from the bottom, the same as the tail command.
If the pos_file contains previous position information for the file, that position is used instead of the bottom.

Jacob Williams

Feb 2, 2017, 12:37:48 PM
to Fluentd Google Group
We stepped through the Ruby debugger line by line and discovered that the #start method was not being called on the output plugin we're using until after in_tail had finished processing the files. So even though fluentd was creating the buffer chunks, it was never flushing them, since the output plugin hadn't started and therefore couldn't flush them.

We were using fluentd 0.14.9, although I've upgraded to 0.14.12 and the behavior appears to be the same. It will not flush any buffers until after in_tail has finished.