Gentlemen, I have an issue with a couple of nodes on a flaky network (likely connected to a bad switch) that constantly drop their connection to our central log server. Because of this, their logs (mostly syslog) never reach the log server. When the network comes back up and works properly, there is a burst of logs as buffers flush. Here's where it gets interesting, though: some logs just NEVER make it to the log server. The buffering and queuing appears to be dropping them. Normally that would be acceptable, because my forwarder config uses a secondary output to a file which we alert on. The problem is that the failed records never make it to that file either.
This situation is making me worry about how much trust I can put in fluentd to reliably deliver all my logs.
The error logs and forwarder configs:
The syntax for the failed records file was from here: http://docs.fluentd.org/articles/out_forward
Can someone help me understand why flushed/failed logs would not make it to this file? Right now we're losing logs in production, and that's bad.
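For context, here's a minimal sketch of the kind of out_forward config the linked article describes, with a file secondary. The hostnames, paths, and values here are illustrative assumptions, not my actual config:

```
<match **>
  type forward
  # Hypothetical upstream log server
  <server>
    host log-server.example.com
    port 24224
  </server>
  buffer_type file
  buffer_path /var/log/td-agent/buffer/forward
  # Chunks that exhaust their retries should be handed to <secondary>
  <secondary>
    type file
    path /var/log/td-agent/failed_records
  </secondary>
</match>
```

My expectation was that anything the forward output gives up on lands in that secondary file.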
--
You received this message because you are subscribed to the Google Groups "Fluentd Google Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fluentd+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Ok, the "failed_records" file is created by us with the appropriate permissions when we install td-agent (we use Chef to provision nodes, and that's part of the Chef process).
If td-agent is supposed to write to a different file just prefixed with failed_records, then that's simply a misunderstanding on our part. That said, I have no machines on which failed records are being written to a file with the tag appended to the end of the filename.
I was finally able to get my failed_records files populated by decreasing the number of retries fluentd makes. So it appears that's working.
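For anyone following along, the knobs I mean are the retry settings on the buffered output. The values here are illustrative, not a recommendation:

```
# Inside the <match> block using out_forward:
retry_wait 1s        # initial wait, doubled after each failed flush
retry_limit 5        # default is 17; fewer retries means <secondary> kicks in sooner
max_retry_wait 300   # cap the exponential backoff (seconds)
```

With a lower retry_limit, chunks get handed off to the secondary output in minutes instead of hours.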
I'm going to retract my statement about losing records. I think what's really happening is that fluentd buffers everything for up to the full ~37 hours of retries, but since it never flushes successfully, we replace the server before then (using AWS autoscaling, no server ever lasts more than 36 hours). Since the server gets replaced, fluentd tries to flush itself one last time on shutdown, and if it can't, it just tosses the logs and then *poof*, the server disappears forever. Heh.
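That ~36-37 hour figure lines up with fluentd's default retry schedule, assuming the documented defaults of retry_limit 17 and retry_wait 1s with doubling backoff:

```python
# Total time spent retrying with exponential backoff starting at 1 second,
# doubled each attempt, for 17 attempts: 1 + 2 + 4 + ... + 2**16 seconds.
total_wait_s = sum(2 ** i for i in range(17))
print(total_wait_s)                    # 131071 seconds
print(round(total_wait_s / 3600, 1))   # 36.4 hours
```

So a server recycled every 36 hours can die with chunks still waiting on their final retries.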
So.. Here’s my next question.
Now that I can induce flushing to my secondary failed_records file, is there a tool for getting those logs back into fluentd? I could use a different host that's not in the subnet having the network issues.
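In case it helps the discussion: since the secondary is out_file, its default line format is time, tag, and JSON separated by tabs, which is easy to parse and re-send yourself. A minimal sketch, where the sample line and the collector endpoint are made-up assumptions:

```python
import json

def parse_out_file_line(line):
    """Parse one line of out_file's default format: time<TAB>tag<TAB>json."""
    time_str, tag, payload = line.rstrip("\n").split("\t", 2)
    return time_str, tag, json.loads(payload)

# Example line in the shape out_file would write (made up for illustration):
sample = '2014-06-01T12:00:00+00:00\tsyslog.local0\t{"message":"hello"}\n'
t, tag, record = parse_out_file_line(sample)
print(t, tag, record["message"])

# To re-ingest, you could POST each record to an in_http input on a
# healthy fluentd on another host (hypothetical host/port):
#   requests.post("http://collector.example.com:8888/" + tag,
#                 data={"json": json.dumps(record)})
```

There's also the bundled `fluent-cat` command for pushing records at an in_forward input, though it expects one JSON object per line, so you'd strip the time/tag columns first.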
Thanks guys!