Gentlemen, I have an issue with a couple of nodes on a flaky network (likely connected to a bad switch) that constantly drop their connection to our central log server. Because of this, their logs (mostly syslog) never reach the log server. When the network comes back up and works properly, there is a burst of logs as buffers flush. Here's where it gets interesting, though: some logs just NEVER make it to the log server. The buffering and queuing appears to be dropping them. Normally that would be acceptable, because my forwarder config uses a secondary output to a file which we alert on. The problem is that the failed records never make it to that file either.
This situation is making me worry about how much trust I can put in fluentd to reliably deliver all my logs.
The error logs and forwarder configs:
The syntax for the failed records file was from here: http://docs.fluentd.org/articles/out_forward
Can someone help me understand why flushed/failed logs would not make it to this file? Right now we're losing logs in production, and that's bad.
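For context, here's a minimal sketch of the kind of out_forward config the linked article describes, with a file secondary. The hostnames, paths, and values here are illustrative assumptions, not my actual config:

```
<match **>
  type forward
  # Hypothetical upstream log server
  <server>
    host log-server.example.com
    port 24224
  </server>
  buffer_type file
  buffer_path /var/log/td-agent/buffer/forward
  # Chunks that exhaust their retries should be handed to <secondary>
  <secondary>
    type file
    path /var/log/td-agent/failed_records
  </secondary>
</match>
```

My expectation was that anything the forward output gives up on lands in that secondary file.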
--
You received this message because you are subscribed to the Google Groups "Fluentd Google Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fluentd+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Ok, the "failed_records" file is created by us with the appropriate permissions when we install td-agent (we use Chef to provision nodes, and that's part of the Chef process).
If td-agent is supposed to write to a different file just prefixed with failed_records, then that's simply a misunderstanding on our part. That said, I have no machines on which failed records are being written to a file with the tag appended to the end of the filename.
I was finally able to get my failed_records files populated by decreasing the number of retries fluentd makes. So it appears that's working.
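For anyone following along, the knobs I mean are the retry settings on the buffered output. The values here are illustrative, not a recommendation:

```
# Inside the <match> block using out_forward:
retry_wait 1s        # initial wait, doubled after each failed flush
retry_limit 5        # default is 17; fewer retries means <secondary> kicks in sooner
max_retry_wait 300   # cap the exponential backoff (seconds)
```

With a lower retry_limit, chunks get handed off to the secondary output in minutes instead of hours.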
I'm going to retract my statement about losing records. I think what's really happening is that fluentd buffers everything for up to the full ~37 hours of retries, but since it never flushes successfully, we replace the server before then (using AWS autoscaling, no server ever lasts more than 36 hours). Since the server gets replaced, fluentd tries to flush itself one last time on shutdown, and if it can't, it just tosses the logs and then *poof*, the server disappears forever. Heh.
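That ~36-37 hour figure lines up with fluentd's default retry schedule, assuming the documented defaults of retry_limit 17 and retry_wait 1s with doubling backoff:

```python
# Total time spent retrying with exponential backoff starting at 1 second,
# doubled each attempt, for 17 attempts: 1 + 2 + 4 + ... + 2**16 seconds.
total_wait_s = sum(2 ** i for i in range(17))
print(total_wait_s)                    # 131071 seconds
print(round(total_wait_s / 3600, 1))   # 36.4 hours
```

So a server recycled every 36 hours can die with chunks still waiting on their final retries.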
So.. Here’s my next question.
Now that I can induce flushing to my secondary failed_records file, is there a tool for getting those logs back into fluentd? I could use a different host that's not in the subnet having the network issues.
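In case it helps the discussion: since the secondary is out_file, its default line format is time, tag, and JSON separated by tabs, which is easy to parse and re-send yourself. A minimal sketch, where the sample line and the collector endpoint are made-up assumptions:

```python
import json

def parse_out_file_line(line):
    """Parse one line of out_file's default format: time<TAB>tag<TAB>json."""
    time_str, tag, payload = line.rstrip("\n").split("\t", 2)
    return time_str, tag, json.loads(payload)

# Example line in the shape out_file would write (made up for illustration):
sample = '2014-06-01T12:00:00+00:00\tsyslog.local0\t{"message":"hello"}\n'
t, tag, record = parse_out_file_line(sample)
print(t, tag, record["message"])

# To re-ingest, you could POST each record to an in_http input on a
# healthy fluentd on another host (hypothetical host/port):
#   requests.post("http://collector.example.com:8888/" + tag,
#                 data={"json": json.dumps(record)})
```

There's also the bundled `fluent-cat` command for pushing records at an in_forward input, though it expects one JSON object per line, so you'd strip the time/tag columns first.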
Thanks guys!