td-agent: advice on dealing with bad data


Will Platnick

May 11, 2016, 12:57:29 PM5/11/16
to Fluentd Google Group
Hello,
This morning our td-agent aggregator, which takes in all our data and does some geoip processing, logged an error in one of its buffer queues. This is the error that was logged:

2016-05-11 05:44:29 -0500 [warn]: retry succeeded. plugin_id="object:3faa0246e6a0"
2016-05-11 08:49:15 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-05-11 08:49:15 -0500 error_class="TypeError" error="no implicit conversion of String into Integer" plugin_id="object:3faa0255de1c"
  2016-05-11 08:49:15 -0500 [warn]: suppressed same stacktrace
2016-05-11 08:49:15 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-05-11 08:49:18 -0500 error_class="TypeError" error="no implicit conversion of String into Integer" plugin_id="object:3faa0255de1c"
  2016-05-11 08:49:15 -0500 [warn]: suppressed same stacktrace
2016-05-11 08:49:18 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-05-11 08:49:21 -0500 error_class="TypeError" error="no implicit conversion of String into Integer" plugin_id="object:3faa0255de1c"
  2016-05-11 08:49:18 -0500 [warn]: suppressed same stacktrace

Once this was logged, the buffer kept growing until it hit its queue limit and started to drop all our data:
2016-05-11 09:00:22 -0500 [error]: forward error error=#<Fluent::BufferQueueLimitError: queue size exceeds limit> error_class=Fluent::BufferQueueLimitError
  2016-05-11 09:00:22 -0500 [error]: suppressed same stacktrace
2016-05-11 09:00:22 -0500 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error="queue size exceeds limit" tag="tag_name"


Since we were dropping data, I've restarted our agent to get it processing again.

So, a couple questions:

1) If this happens again, how can I find out exactly what went wrong?
2) If some bad data got in there somehow, is there a way to remove the troublesome data from the buffer and get it to start sending again?
3) Is there a way for the td-agent on our app servers to know that the aggregator instance is raising BufferQueueLimitError, so it knows it has to keep the data in its buffer to retry later?


Mr. Fiber

May 12, 2016, 4:02:32 PM5/12/16
to Fluentd Google Group
Hi Will,

Sorry for the late response.

1) If this happens again, how can I find out exactly what went wrong?

Basically, you should check that the fluentd configuration and plugins can handle the incoming events correctly, using actual data.
The first exception stacktrace should show which position in the plugin caused the problem.
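
If you need to reproduce it, here is a minimal sketch of a standalone test config you could run next to the aggregator; the port, the tag, and the sample record are placeholders, and you would paste the real geoip section from your aggregator config where indicated:

<source>
  @type forward
  port 24225                  # placeholder test port, not the production one
</source>

# paste the geoip <filter>/<match> section from the aggregator config here,
# unchanged, so the same plugin processes the replayed events

<match **>
  @type stdout                # print whatever survives processing
</match>

# replay a captured record (this sample record is made up) and watch for the
# full TypeError stacktrace in the test instance's output:
#   echo '{"remote_addr":"not-an-ip"}' | fluent-cat -p 24225 tag_name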

2) If some bad data got in there somehow, is there a way to remove the troublesome data from the buffer and get it to start sending again?

Currently, we have the <@ERROR> label and the emit_error_event API to rescue such bad data, but the plugin has to support it.


So sending a patch to the plugin is one way.
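
For reference, a rough sketch of the routing side once a plugin calls emit_error_event; the file path below is just an example:

<label @ERROR>
  # records that a plugin hands to emit_error_event are re-routed here
  <match **>
    @type file
    path /var/log/td-agent/error_records   # example path, adjust as you like
  </match>
</label>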

Another way is using <secondary> with a small retry_limit, so that chunks which contain bad data are written to a local file instead of blocking the queue.
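
For example, something like this on the aggregator's failing output; the tag, destination, and paths are placeholders, and @type forward stands in for whatever output plugin that <match> actually uses:

<match tag_name>
  @type forward              # placeholder for the aggregator's real output plugin
  retry_limit 5              # give up quickly instead of retrying forever
  buffer_type file
  buffer_path /var/lib/td-agent/buffer/forward   # example buffer path

  <server>
    host destination.example.com   # placeholder destination
    port 24224
  </server>

  # chunks that still fail after retry_limit are handed to <secondary>
  <secondary>
    @type file
    path /var/log/td-agent/failed_chunks   # example path for bad chunks
  </secondary>
</match>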

3) Is there a way for the td-agent on our app servers to know that the aggregator instance is raising BufferQueueLimitError, so it knows it has to keep the data in its buffer to retry later?
Currently there is no direct way. Using "require_ack_response true" on the app servers gives similar behavior: if the aggregator does not return an ack, the forwarder keeps the chunk in its own buffer and retries later.
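
For example, on the app servers something like the following; the host and timeout are placeholders:

<match **>
  @type forward
  require_ack_response true      # keep the chunk and retry if no ack arrives
  ack_response_timeout 30        # example timeout in seconds

  <server>
    host aggregator.example.com  # placeholder for your aggregator
    port 24224
  </server>

  buffer_type file
  buffer_path /var/lib/td-agent/buffer/forward   # file buffer survives restarts
</match>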
Hmm... catching BufferQueueLimitError and returning it to the forwarder seems like a nice idea.
I will consider it.


Masahiro
