Hello,
This morning our td-agent aggregator logged an error in the queue that takes in all our data and does some GeoIP processing. This is what was logged:
2016-05-11 05:44:29 -0500 [warn]: retry succeeded. plugin_id="object:3faa0246e6a0"
2016-05-11 08:49:15 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-05-11 08:49:15 -0500 error_class="TypeError" error="no implicit conversion of String into Integer" plugin_id="object:3faa0255de1c"
2016-05-11 08:49:15 -0500 [warn]: suppressed same stacktrace
2016-05-11 08:49:15 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-05-11 08:49:18 -0500 error_class="TypeError" error="no implicit conversion of String into Integer" plugin_id="object:3faa0255de1c"
2016-05-11 08:49:15 -0500 [warn]: suppressed same stacktrace
2016-05-11 08:49:18 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-05-11 08:49:21 -0500 error_class="TypeError" error="no implicit conversion of String into Integer" plugin_id="object:3faa0255de1c"
2016-05-11 08:49:18 -0500 [warn]: suppressed same stacktrace
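In case it helps with triage, here is a minimal sketch for pulling the quoted key="value" fields out of warnings like the ones above and counting flush failures per plugin. It assumes only the log format shown in this post; the function names (parse_fields, count_flush_failures) are made up for illustration and are not part of td-agent or Fluentd.

```python
import re
from collections import Counter

# Matches quoted key="value" pairs such as error_class="TypeError".
# Unquoted fields (e.g. next_retry=2016-05-11 ...) are deliberately ignored.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_fields(line):
    """Return the quoted key="value" fields of one log line as a dict."""
    return dict(FIELD_RE.findall(line))


def count_flush_failures(lines):
    """Count 'failed to flush the buffer' warnings per (plugin_id, error_class)."""
    counts = Counter()
    for line in lines:
        if "failed to flush the buffer" not in line:
            continue
        fields = parse_fields(line)
        counts[(fields.get("plugin_id"), fields.get("error_class"))] += 1
    return counts


# Sample lines copied from the log excerpt in this post.
sample = [
    '2016-05-11 05:44:29 -0500 [warn]: retry succeeded. plugin_id="object:3faa0246e6a0"',
    '2016-05-11 08:49:15 -0500 [warn]: temporarily failed to flush the buffer. '
    'next_retry=2016-05-11 08:49:15 -0500 error_class="TypeError" '
    'error="no implicit conversion of String into Integer" plugin_id="object:3faa0255de1c"',
    '2016-05-11 08:49:15 -0500 [warn]: suppressed same stacktrace',
    '2016-05-11 08:49:15 -0500 [warn]: temporarily failed to flush the buffer. '
    'next_retry=2016-05-11 08:49:18 -0500 error_class="TypeError" '
    'error="no implicit conversion of String into Integer" plugin_id="object:3faa0255de1c"',
]

if __name__ == "__main__":
    for key, n in count_flush_failures(sample).items():
        print(key, n)
```

Run against the full td-agent log, this shows which plugin_id is stuck and with which error class. It does not identify the record that triggers the error.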
Once this was logged, the buffer kept growing until it hit the queue size limit, and the aggregator started dropping all our incoming data:
2016-05-11 09:00:22 -0500 [error]: forward error error=#<Fluent::BufferQueueLimitError: queue size exceeds limit> error_class=Fluent::BufferQueueLimitError
2016-05-11 09:00:22 -0500 [error]: suppressed same stacktrace
2016-05-11 09:00:22 -0500 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error="queue size exceeds limit" tag="tag_name"
I've restarted the agent to get data flowing again, since we were dropping data.
So, a couple of questions:
1) If this happens again, how can I find out exactly what went wrong?
2) If some bad data got in there somehow, is there a way to remove the troublesome data from the buffer and get it to start sending again?
3) Is there a way for the td-agent instances on our app servers to detect that the aggregator is raising BufferQueueLimitError, so that they keep the data in their own buffers and retry later?