Hi Kiyoto,
Thanks for your response. I tried TCP, but no luck. Here is what I did:
1. Set up 4 App Servers, each logging to a local td-agent that forwards logs to the Log Collectors
2. Set up 2 Log Collectors to receive the forwarded logs and save them to local disk (actually a mounted NFS location)
3. Routed a portion of our traffic to the new App Servers
4. For about 7 hours everything flowed without any problem.
5. When I checked the App Servers, I found that the buffers were not flushing. I reviewed td-agent.log and it showed the Log Collectors being detached/attached every couple of minutes.
6. Checked the Log Collectors and there was no error in td-agent.log! So I restarted them one by one, but no luck!
7. Went back to the App Servers, changed the heartbeat to TCP, and restarted td-agent. Still no luck.
Initially I set up this environment in AWS and ran into this problem. Then I decided to bring everything in-house, so I rebuilt it in our datacenter, but it gave me the same problem again! At this point I'm thinking about implementing one of these options:
1. Rebuild the Log Collectors with Scribed and change our App Servers to output to the Scribed servers (this way we only use td-agent for transferring logs from the apps to the log collectors)
2. Change the App Servers to log locally to disk, then push the logs through an SSH tunnel using SCP or rsync to the Log Collectors, and then process them with our own log parser (this way we only use td-agent to collect logs from the apps)
We have already opened UDP/TCP in our firewalls, and since it worked for a couple of hours, I believe something else must be causing it. But how can I be sure? Is there any way to check this in Fluentd?
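One way I could rule out the network side is a small probe run from each App Server against each Log Collector. The following is only a rough sketch: it assumes the collectors listen on Fluentd's default forward port 24224, and the helper names and host names are placeholders. It only shows whether a TCP connection can be opened; it says nothing about UDP heartbeats or about whether td-agent is actually processing data.

```python
import socket
import time


def probe_tcp(host, port, timeout=3.0):
    """Return (reachable, seconds_elapsed) for one TCP connection attempt."""
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start


def probe_all(collectors, timeout=3.0):
    """Probe every (host, port) pair and return one report line per collector."""
    lines = []
    for host, port in collectors:
        ok, elapsed = probe_tcp(host, port, timeout)
        lines.append(f"{host}:{port} reachable={ok} elapsed={elapsed:.3f}s")
    return lines
```

For example, `probe_all([("collector1.example.com", 24224), ("collector2.example.com", 24224)])` (hypothetical host names) could be run every minute and the result compared against the detach/attach times in td-agent.log.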
Currently we're planning for ~10k+ logs/sec per app server. Based on the Fluentd documentation, it should be able to handle that, but I'm concerned about the log collectors, since they have to handle 10k * Number_of_Servers, and I'm not sure whether a log collector can handle more than 20k logs/sec. I would appreciate it if you could share your experience with high-volume traffic and what would be a good way to use Fluentd. We really like Fluentd's features and plugin-based architecture, which is why I'd like to use it wherever possible.
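To make the sizing concrete, here is the arithmetic as a small sketch. It assumes traffic is spread evenly across the collectors, which may not hold in practice, and the function name is just for illustration.

```python
def per_collector_rate(logs_per_sec_per_app, app_servers, collectors):
    """Return (total rate, rate per collector), assuming an even spread."""
    total = logs_per_sec_per_app * app_servers
    return total, total / collectors


# Our current setup: 4 App Servers at ~10k logs/sec each, 2 Log Collectors.
total, each = per_collector_rate(10_000, app_servers=4, collectors=2)
print(total, each)  # 40000 20000.0

# If one collector is detached, the remaining one has to absorb everything.
_, failover = per_collector_rate(10_000, app_servers=4, collectors=1)
print(failover)  # 40000.0
```

So with both collectors healthy each one sees about 20k logs/sec, and about 40k logs/sec while the other is detached.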
Regards,
Reza