td-agent incorrectly determines that no nodes are available

Dave McKenna

Nov 2, 2016, 1:49:31 PM

to Fluentd Google Group

Has anyone experienced a problem with a td-agent “sender” process incorrectly determining that no nodes are available when it tries to flush the buffer? It looks like it experienced connection issues with the "receiver" process (which is on another machine) and from that point on, even after the receiver became reachable again, it thinks it's still unavailable.

I can confirm the receiver is accepting data, because when I send

```echo -e '{"message":"TEST MESSAGE","host":"hd1app1","service":"test_service"}\0' | nc 10.0.1.138 42185```

the receiver gets it.

Here are logs from td-agent.log when the connection failure first happened:

```

2016-11-02 03:58:18 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-02 03:57:16 +0000 error_class="Errno::ETIMEDOUT" error="Connection timed out - connect(2) for \"10.0.1.138\" port 42185" plugin_id="object:3fed2e39429c"

2016-11-02 03:58:18 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-02 03:57:17 +0000 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fed2e39429c"

2016-11-02 03:58:18 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-02 03:57:21 +0000 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fed2e39429c"

```

And after seeing the receiver work with netcat, I send a signal to the process to flush the buffer and it can't see the node:

```sudo kill -s USR1 18815```

produces this in td-agent.log:

```

2016-11-02 17:31:52 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-02 20:01:06 +0000 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fed2e39429c"

```

Here is the config on the sender:

```

@type copy

<store>

@type stdout

</store>

<store>

@type forward

<server>

# primary host

host 10.0.1.138

port 42185

</server>

buffer_type file

buffer_path /var/log/td-agent/buffer/hd.*.buffer

buffer_chunk_limit 128m

buffer_queue_limit 64

flush_interval 20s

</store>

</match>

```

and on the receiver:

```

<source>

type forward

port 42185

protocol_type tcp

tag hd

format none

</source>

```

Restarting the sending td-agent flushes the buffer, but I don't want to have to do that. I'd like td-agent to realize on its own that the receiving node/service is back up.

Mr. Fiber

Nov 2, 2016, 3:01:26 PM

to Fluentd Google Group

I'm not sure why the UDP heartbeat still fails after the receiver is reachable.

How about using "heartbeat_type tcp"?
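As a rough sketch, assuming the sender config posted above, the option would sit next to the buffer settings inside the forward `<store>`. The netcat test only exercises TCP, while the out_forward heartbeat defaults to UDP in this version, which may be why the two disagree. The `<server>` opening tag and the indentation are filled in here for readability, and the surrounding `<match>` block is left out because its pattern was not posted:

```
<store>
  @type forward

  # send heartbeats over TCP instead of the default UDP
  heartbeat_type tcp

  <server>
    # primary host
    host 10.0.1.138
    port 42185
  </server>

  buffer_type file
  buffer_path /var/log/td-agent/buffer/hd.*.buffer
  buffer_chunk_limit 128m
  buffer_queue_limit 64
  flush_interval 20s
</store>
```

td-agent needs to be restarted (or reloaded) once for the changed setting to take effect.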

Masahiro

--
You received this message because you are subscribed to the Google Groups "Fluentd Google Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fluentd+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dave McKenna

Nov 2, 2016, 4:50:34 PM

to Fluentd Google Group

Yes! "heartbeat_type tcp" works. Sender now knows when the receiver is available again. Thanks a lot.

Dave

Jeet Pandya

Jan 31, 2017, 7:58:06 AM

to Fluentd Google Group

Hi,

Even after setting heartbeat_type tcp, I still see the same issue in my case:

2017-01-31 17:52:20 +0530 [warn]: detached forwarding server '10.10.3.234:10443' host="10.10.3.234" port=10443 phi=55.882959776860744
2017-01-31 17:52:20 +0530 [warn]: temporarily failed to flush the buffer. next_retry=2017-01-31 17:52:21 +0530 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3f88a461fb54"
  2017-01-31 17:52:20 +0530 [warn]: suppressed same stacktrace
2017-01-31 17:52:21 +0530 [warn]: temporarily failed to flush the buffer. next_retry=2017-01-31 17:52:23 +0530 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3f88a461fb54"
  2017-01-31 17:52:21 +0530 [warn]: suppressed same stacktrace
2017-01-31 17:52:23 +0530 [warn]: temporarily failed to flush the buffer. next_retry=2017-01-31 17:52:27 +0530 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3f88a461fb54"
  2017-01-31 17:52:23 +0530 [warn]: suppressed same stacktrace
2017-01-31 17:52:27 +0530 [warn]: temporarily failed to flush the buffer. next_retry=2017-01-31 17:52:35 +0530 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3f88a461fb54"
  2017-01-31 17:52:27 +0530 [warn]: suppressed same stacktrace
2017-01-31 17:52:30 +0530 [warn]: recovered forwarding server '10.10.3.234:10443' host="10.10.3.234" port=10443
2017-01-31 17:52:35 +0530 [warn]: retry succeeded. plugin_id="object:3f88a461fb54"

Any ideas?
