td-agent incorrectly determines that no nodes are available

Dave McKenna

Nov 2, 2016, 1:49:31 PM
to Fluentd Google Group
Has anyone experienced a problem with a td-agent “sender” process incorrectly determining that no nodes are available when it tries to flush the buffer? It looks like it experienced connection issues with the "receiver" process (which is on another machine) and from that point on, even after the receiver became reachable again, it thinks it's still unavailable.

I can confirm the receiver is accepting data, because when I send

```echo -e '{"message":"TEST MESSAGE","host":"hd1app1","service":"test_service"}\0' | nc 10.0.1.138 42185```

the receiver gets it.

Here are logs from td-agent.log when the connection failure first happened:

```
2016-11-02 03:58:18 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-02 03:57:16 +0000 error_class="Errno::ETIMEDOUT" error="Connection timed out - connect(2) for \"10.0.1.138\" port 42185" plugin_id="object:3fed2e39429c"
2016-11-02 03:58:18 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-02 03:57:17 +0000 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fed2e39429c"
2016-11-02 03:58:18 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-02 03:57:21 +0000 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fed2e39429c"
```

After confirming with netcat that the receiver works, I sent a signal to the sender process to flush the buffer, but it still can't see the node:

```sudo kill -s USR1 18815```

produces this in td-agent.log:

```
2016-11-02 17:31:52 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-02 20:01:06 +0000 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fed2e39429c"
```

Here is the config on the sender:
```
<match hd.**>
  @type copy
  <store>
    @type stdout
  </store>
  <store>
    @type forward
    # primary host
    <server>
      host 10.0.1.138
      port 42185
    </server>

    buffer_type file
    buffer_path /var/log/td-agent/buffer/hd.*.buffer
    buffer_chunk_limit 128m
    buffer_queue_limit 64
    flush_interval 20s
  </store>
</match>
```

and on the receiver:

```
<source>
  type forward
  port 42185
  protocol_type tcp
  tag hd
  format none
</source>
```

A restart of the sending td-agent results in the buffer being flushed, but I don’t want to have to do that. I’d like td-agent to be able to realize that the receiving node/service is back up.

Mr. Fiber

Nov 2, 2016, 3:01:26 PM
to Fluentd Google Group
I'm not sure why the udp heartbeat still fails after the receiver is reachable.
How about using "heartbeat_type tcp"?


Masahiro

--
You received this message because you are subscribed to the Google Groups "Fluentd Google Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fluentd+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dave McKenna

Nov 2, 2016, 4:50:34 PM
to Fluentd Google Group
Yes! "heartbeat_type tcp" works. Sender now knows when the receiver is available again. Thanks a lot.
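
For reference, the change amounts to one extra line in the forward `<store>`. This is a minimal sketch based on the sender config posted above; only the `heartbeat_type` line is new, and the buffer settings are assumed to stay exactly as they were:

```
<store>
  @type forward
  # Assumption: the heartbeat previously used the default (udp).
  # With tcp, the heartbeat goes over TCP to the same host and port as the data.
  heartbeat_type tcp
  # primary host
  <server>
    host 10.0.1.138
    port 42185
  </server>

  # buffer_type, buffer_path, buffer_chunk_limit, buffer_queue_limit
  # and flush_interval unchanged from the original config
</store>
```

If UDP traffic to the receiver's port is being dropped (for example by a firewall), this would explain why the netcat test over TCP succeeded while the heartbeat kept failing, though I have not confirmed that this was the cause here.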

Dave

Jeet Pandya

Jan 31, 2017, 7:58:06 AM
to Fluentd Google Group
Hi,

Even after setting heartbeat_type tcp, the same issue occurs in my case:


2017-01-31 17:52:20 +0530 [warn]: detached forwarding server '10.10.3.234:10443' host="10.10.3.234" port=10443 phi=55.882959776860744
2017-01-31 17:52:20 +0530 [warn]: temporarily failed to flush the buffer. next_retry=2017-01-31 17:52:21 +0530 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3f88a461fb54"
  2017-01-31 17:52:20 +0530 [warn]: suppressed same stacktrace
2017-01-31 17:52:21 +0530 [warn]: temporarily failed to flush the buffer. next_retry=2017-01-31 17:52:23 +0530 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3f88a461fb54"
  2017-01-31 17:52:21 +0530 [warn]: suppressed same stacktrace
2017-01-31 17:52:23 +0530 [warn]: temporarily failed to flush the buffer. next_retry=2017-01-31 17:52:27 +0530 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3f88a461fb54"
  2017-01-31 17:52:23 +0530 [warn]: suppressed same stacktrace
2017-01-31 17:52:27 +0530 [warn]: temporarily failed to flush the buffer. next_retry=2017-01-31 17:52:35 +0530 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3f88a461fb54"
  2017-01-31 17:52:27 +0530 [warn]: suppressed same stacktrace
2017-01-31 17:52:30 +0530 [warn]: recovered forwarding server '10.10.3.234:10443' host="10.10.3.234" port=10443
2017-01-31 17:52:35 +0530 [warn]: retry succeeded. plugin_id="object:3f88a461fb54"


Any ideas?