"detached forwarding server" every time on Forward and unable to flush buffer


Reza Qorbani

Oct 31, 2013, 6:33:13 PM
to flu...@googlegroups.com
I have a problem with my setup:

App Server + Local td-agent  ---->  Log Aggregator (td-agent) 

The app server sends logs to the local td-agent, which forwards them to the log aggregators. Everything works fine, but after running for a few hours (4-5) I start seeing "detached forwarding server" in td-agent.log, and the local td-agent starts accumulating lots of file buffers without flushing them. Every couple of minutes I see forwarding servers attach/detach, but no buffer is flushed! The aggregators stop flushing to local storage too! It looks like the aggregators become unstable and cause the app layer to disconnect from them, but I have no idea how to resolve this. There are no errors in the aggregators' td-agent.log, and they are still receiving some logs, just very slowly.

The traffic this environment handles is about 15 MB/sec. I would appreciate any help resolving this.

Here is my local td-agent configuration:

<source>
  type forward
  port 24224
  bind 0.0.0.0
</source>
<match bidder.*>
  type forward

  # Forward
  send_timeout 60s
  hard_timeout 60s
  recover_wait 10s
  heartbeat_type udp
  heartbeat_interval 1s
  phi_threshold 8

  # Buffer
  buffer_type file
  buffer_path /var/log/bidder/buffer
  flush_interval 1s
  flush_at_shutdown true

  # Servers
  <server>
    host 10.50.1.123
    port 24224
    weight 60
  </server>
  <server>
    host 10.50.1.124
    port 24224
    weight 60
  </server>

  # Backup Failure
  <secondary>
    type file
    path /var/log/bidder/failed
  </secondary>
</match>
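
At 15 MB/sec, the buffered-output defaults can matter too: if chunks are created faster than the two aggregators drain them, the file buffer grows even while forwarding nominally works. A sketch of explicit buffer and concurrency settings to add inside the match block above (parameter names from Fluentd v0.10's buffered output; the values are illustrative, not tuned for this workload):

  # Buffer sizing and flush concurrency (illustrative values)
  buffer_chunk_limit 8m
  buffer_queue_limit 512
  # flush chunks to the aggregators in parallel
  num_threads 4
  # initial back-off between failed flush attempts
  retry_wait 1s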


My Log Aggregator Setup:

# TCP input
<source>
  type forward
  port 24224
</source>

# Save Bidder Logs
<match bidder.*>
  type tag_file
  path /var/log/bidder
  time_slice_format %Y/%m/%d/%H
  buffer_path /var/log/bidder/buffer
  buffer_chunk_limit 128m
  buffer_queue_limit 256
  flush_interval 1m
  flush_at_shutdown true
</match>


Kiyoto Tamura

Oct 31, 2013, 10:08:55 PM
to flu...@googlegroups.com
Hi Reza,

Is your UDP port blocked? td-agent (and Fluentd) uses UDP heartbeats by default for node-to-node health checks. If the UDP port is not open, the heartbeats fail, and the sender node assumes the receiver node is dead and detaches it.
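
A quick way to test this from an app server (hosts taken from your config above; nc's UDP probe is only a rough signal, since an open UDP port sends back no positive acknowledgment):

nc -vzu 10.50.1.123 24224

On the aggregator side, the firewall needs to accept UDP on the same port in_forward uses for TCP data, e.g. iptables -A INPUT -p udp --dport 24224 -j ACCEPT.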

Also, you can configure the out_forward plugin to use TCP for heartbeats like this:

<match bidder.*>
  type forward
  heartbeat_type tcp
  # the rest of the options
  ...
</match>
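
(With heartbeat_type tcp, the sender checks liveness by opening a TCP connection instead of sending a UDP packet, so only the TCP data port needs to be reachable through the firewall.)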

Kiyoto



Reza Qorbani

Nov 1, 2013, 11:59:03 AM
to flu...@googlegroups.com
Hi Kiyoto,

Thanks for your response. I tried TCP, but no luck. Here is what I did:

1. Set up 4 app servers logging to a local td-agent that forwards to the log collectors
2. Set up 2 log collectors to receive via forward and save to local disk (actually a mounted NFS location)
3. Routed a portion of traffic to the new app servers
4. For about 7 hours everything was flowing without any problems.
5. When I checked the app servers, I found the buffers were not flushing - td-agent.log showed the log collectors being detached/attached every couple of minutes
6. Checked the log collectors, and there were no errors in their td-agent.log! So I restarted them one by one, but no luck!
7. Went back to the app servers, changed the heartbeat to TCP, and restarted td-agent - still no luck

I initially set up this environment in AWS and ran into this problem. Then I decided to bring everything in-house, so I rebuilt it in our datacenter, but it gave me the same problem again! At this point I'm considering one of these options:

1. Rebuild the log collectors with Scribe (scribed) and change our app servers to output to the Scribe servers (this way we would only use td-agent for transferring logs from the apps to the log collectors)
2. Change the app servers to log locally to disk, then push the logs over an SSH tunnel using SCP or rsync to the log collectors, and process them with our own log parser (this way we would only use td-agent to collect logs from the apps)

We already opened UDP/TCP in our firewalls, and since it was working for a couple of hours, I believe something else must be causing this. But how can I be sure? Is there any way to check it in Fluentd?
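
One built-in way to check, assuming your td-agent is recent enough to bundle the in_monitor_agent plugin, is to expose an HTTP monitoring endpoint on each node:

<source>
  type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

Then curl http://localhost:24220/api/plugins.json reports buffer_queue_length, buffer_total_queued_size, and retry_count per output plugin, which shows whether a forwarder's queue is growing or draining.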

Currently we're planning to handle ~10k+ logs/sec per app server, and based on the Fluentd documentation it should be able to handle that, but I'm concerned about the log collectors, since they have to handle 10k * number_of_servers, and I'm not sure they can handle more than 20k+. I would appreciate it if you could share your experience with high-volume traffic and what a good way to use Fluentd would be. We really like Fluentd's features and plugin-based architecture; that's why I'd like to use it wherever possible.
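
On the capacity question, one scaling pattern (a sketch under the same out_forward setup as above): out_forward load-balances across all listed <server> entries by weight, so adding aggregators divides the per-collector load roughly evenly.

  # Hypothetical third aggregator; out_forward spreads load by weight
  <server>
    host 10.50.1.125
    port 24224
    weight 60
  </server>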

Regards,
Reza

Kiyoto Tamura

Nov 1, 2013, 1:34:55 PM
to flu...@googlegroups.com
Hi Reza,

Thank you for your detailed explanation. It looks like we need to enable verbose logging to get more information.

To enable verbose logging, please

1. Run

echo "DAEMON_ARGS=-vv" > /etc/default/td-agent

2. Restart td-agent
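
(Note that the single > in step 1 overwrites /etc/default/td-agent; if that file already carries other settings, edit it and append -vv to the existing DAEMON_ARGS instead.)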

After the failure you previously described occurs, let's take a look at /var/log/td-agent.log.

Thanks,

Kiyoto

Reza Qorbani

Nov 1, 2013, 3:39:34 PM
to flu...@googlegroups.com
Hi Kiyoto,

I really appreciate your help. I added -vv, and here is a link to download td-agent.log for the 4 servers (app01, app02, agg01, agg02):


Thanks again,
Reza



Joshua

Mar 2, 2021, 5:59:06 PM
to Fluentd Google Group
I am getting a somewhat similar message, but mine is (some info redacted):
      detached forwarding server 'mngment-server' host="imngment-server-123456.elb.myregion.amazonaws.com" port=514 hard_timeout=true
Was there a resolution to this?
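
(An observation, not a confirmed fix: the host in that message is an ELB endpoint, and classic ELBs do not forward UDP, so the default UDP heartbeat can never succeed through one. heartbeat_type tcp as discussed above, or heartbeat_type none on current Fluentd versions, avoids the UDP path.)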