"detached forwarding server" every time on Forward and unable to flush buffer


Reza Qorbani

Oct 31, 2013, 6:33:13 PM
to flu...@googlegroups.com
I have a problem with my setup:

App Server + Local td-agent  ---->  Log Aggregator (td-agent) 

The app server sends logs to the local td-agent, which forwards them to the log aggregators. Everything works fine, but after running for a few hours (4-5) I start seeing "detached forwarding server" in td-agent.log, and the local td-agent starts accumulating lots of file buffers without flushing them. Every couple of minutes I see forwarding servers attach/detach, but no buffer is flushed! The aggregators stop flushing to local storage too! It looks like the aggregators become unstable and cause the app layer to disconnect from them, but I have no idea how to resolve this. There are no errors in the aggregators' td-agent.log, and they are still receiving some logs, just very slowly.

The traffic this environment handles is about 15 MB/sec. I would appreciate any help resolving this.

Here is my local td-agent configuration:

<source>
  type forward
  port 24224
  bind 0.0.0.0
</source>
<match bidder.*>
  type forward

  # Forward
  send_timeout 60s
  hard_timeout 60s
  recover_wait 10s
  heartbeat_type udp
  heartbeat_interval 1s
  phi_threshold 8

  # Buffer
  buffer_type file
  buffer_path /var/log/bidder/buffer
  flush_interval 1s
  flush_at_shutdown true

  # Servers
  <server>
    host 10.50.1.123
    port 24224
    weight 60
  </server>
  <server>
    host 10.50.1.124
    port 24224
    weight 60
  </server>

  # Backup Failure
  <secondary>
    type file
    path /var/log/bidder/failed
  </secondary>
</match>
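
At 15 MB/sec, the buffered-output defaults can matter too: if chunks are created faster than the two aggregators drain them, the file buffer grows even while forwarding nominally works. A sketch of explicit buffer and concurrency settings to add inside the match block above (parameter names from Fluentd v0.10's buffered output; the values are illustrative, not tuned for this workload):

  # Buffer sizing and flush concurrency (illustrative values)
  buffer_chunk_limit 8m
  buffer_queue_limit 512
  # flush chunks to the aggregators in parallel
  num_threads 4
  # initial back-off between failed flush attempts
  retry_wait 1s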


My Log Aggregator Setup:

# TCP input
<source>
  type forward
  port 24224
</source>

# Save Bidder Logs
<match bidder.*>
  type tag_file
  path /var/log/bidder
  time_slice_format %Y/%m/%d/%H
  buffer_path /var/log/bidder/buffer
  buffer_chunk_limit 128m
  buffer_queue_limit 256
  flush_interval 1m
  flush_at_shutdown true
</match>


Kiyoto Tamura

Oct 31, 2013, 10:08:55 PM
to flu...@googlegroups.com
Hi Reza,

Is your UDP port blocked? td-agent (and Fluentd) uses UDP heartbeats by default for node-to-node health checks. If the UDP port is not open, the heartbeats fail, and the sender node assumes the receiver node is dead and detaches it.
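
A quick way to test this from an app server (hosts taken from your config above; nc's UDP probe is only a rough signal, since an open UDP port sends back no positive acknowledgment):

nc -vzu 10.50.1.123 24224

On the aggregator side, the firewall needs to accept UDP on the same port in_forward uses for TCP data, e.g. iptables -A INPUT -p udp --dport 24224 -j ACCEPT.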

Also, you can configure the out_forward plugin to use TCP for heartbeats like this:

<match bidder.*>
  type forward
  heartbeat_type tcp
  # the rest of the options
  ...
</match>
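
(With heartbeat_type tcp, the sender checks liveness by opening a TCP connection instead of sending a UDP packet, so only the TCP data port needs to be reachable through the firewall.)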

Kiyoto



Reza Qorbani

Nov 1, 2013, 11:59:03 AM
to flu...@googlegroups.com
Hi Kiyoto,

Thanks for your response. I tried TCP, but no luck. Here is what I did:

1. Set up 4 app servers logging to a local td-agent that forwards to the log collectors
2. Set up 2 log collectors to receive via forward and save to local disk (actually a mounted NFS location)
3. Routed a portion of traffic to the new app servers
4. For about 7 hours everything was flowing without any problems.
5. When I checked the app servers, I found the buffers were not flushing - td-agent.log showed the log collectors being detached/attached every couple of minutes
6. Checked the log collectors, and there were no errors in their td-agent.log! So I restarted them one by one, but no luck!
7. Went back to the app servers, changed the heartbeat to TCP, and restarted td-agent - still no luck

I initially set up this environment in AWS and ran into this problem. Then I decided to bring everything in-house, so I rebuilt it in our datacenter, but it gave me the same problem again! At this point I'm considering one of these options:

1. Rebuild the log collectors with Scribe (scribed) and change our app servers to output to the Scribe servers (this way we would only use td-agent for transferring logs from the apps to the log collectors)
2. Change the app servers to log locally to disk, then push the logs over an SSH tunnel using SCP or rsync to the log collectors, and process them with our own log parser (this way we would only use td-agent to collect logs from the apps)

We already opened UDP/TCP in our firewalls, and since it was working for a couple of hours, I believe something else must be causing this. But how can I be sure? Is there any way to check it in Fluentd?
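
One built-in way to check, assuming your td-agent is recent enough to bundle the in_monitor_agent plugin, is to expose an HTTP monitoring endpoint on each node:

<source>
  type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

Then curl http://localhost:24220/api/plugins.json reports buffer_queue_length, buffer_total_queued_size, and retry_count per output plugin, which shows whether a forwarder's queue is growing or draining.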

Currently we're planning to handle ~10k+ logs/sec per app server, and based on the Fluentd documentation it should be able to handle that, but I'm concerned about the log collectors, since they have to handle 10k * number_of_servers, and I'm not sure they can handle more than 20k+. I would appreciate it if you could share your experience with high-volume traffic and what a good way to use Fluentd would be. We really like Fluentd's features and plugin-based architecture; that's why I'd like to use it wherever possible.
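
On the capacity question, one scaling pattern (a sketch under the same out_forward setup as above): out_forward load-balances across all listed <server> entries by weight, so adding aggregators divides the per-collector load roughly evenly.

  # Hypothetical third aggregator; out_forward spreads load by weight
  <server>
    host 10.50.1.125
    port 24224
    weight 60
  </server>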

Regards,
Reza

Kiyoto Tamura

Nov 1, 2013, 1:34:55 PM
to flu...@googlegroups.com
Hi Reza,

Thank you for your detailed explanation. It looks like we need to enable verbose logging to get more information.

To enable verbose logging, please

1. Run

echo "DAEMON_ARGS=-vv" > /etc/default/td-agent

2. Restart td-agent
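
(Note that the single > in step 1 overwrites /etc/default/td-agent; if that file already carries other settings, edit it and append -vv to the existing DAEMON_ARGS instead.)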

After the failure you previously described occurs, let's take a look at /var/log/td-agent.log.

Thanks,

Kiyoto

Reza Qorbani

Nov 1, 2013, 3:39:34 PM
to flu...@googlegroups.com
Hi Kiyoto,

I really appreciate your help. I added -vv, and here is a link to download td-agent.log for the 4 servers (app01, app02, agg01, agg02):


Thanks again,
Reza



Joshua

Mar 2, 2021, 5:59:06 PM
to Fluentd Google Group
I am getting a somewhat similar message, but mine is (some info redacted):
      detached forwarding server 'mngment-server' host="imngment-server-123456.elb.myregion.amazonaws.com" port=514 hard_timeout=true
Was there a resolution to this?
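
(An observation, not a confirmed fix: the host in that message is an ELB endpoint, and classic ELBs do not forward UDP, so the default UDP heartbeat can never succeed through one. heartbeat_type tcp as discussed above, or heartbeat_type none on current Fluentd versions, avoids the UDP path.)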