Hey guys,
I'm facing an annoying behavior in fluentd that I'm not completely able to explain. Here's the current architecture:
app boxes -> aggregator boxes -> log processor boxes
So, application boxes send logs to aggregator boxes, which run the S3 and forward plugins. The S3 plugin aggregates the data by hour and saves it to S3. The same data is forwarded to the log processor boxes using the forward plugin. On the log processor boxes we run a fluentd server with a file output plugin that just receives the data and saves it to disk. That data is then consumed by a real-time agent that keeps tailing the files and processing them.
The real-time agent includes an alert system that sends a message when the log stream is interrupted for, say, 2 minutes. This works fine. The only problem is that after the hour switch (e.g. from 9:59 to 10:00 PM), data takes longer than that to arrive, and when it does arrive, it comes all at once. For example, between 10:00 PM and 10:03 PM there are no log entries, and then suddenly thousands of entries come in at once. The problem here is not data loss; the data really does come in. The problem is the delay.
I tried different values for time_slice_wait and flush_interval, setting them as low as 30s, but nothing changed.
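For reference, this is roughly what the processor-side file output looked like with the lowest values I tried (same match block as pasted below, just with the timers reduced):

<match app.**>
  type file
  path /var/log/fluent/app.log
  time_slice_format %Y%m%d-%H
  time_slice_wait 30s     # lowest value tried
  flush_interval 30s      # lowest value tried
  compress gzip
  utc
</match>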
My best guess is that the aggregator boxes are busy sending data to S3 at the top of the hour, and the forward plugin gets stuck until the upload finishes. Is this possible?
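If that's what is happening, one experiment I've been considering on the aggregator side is giving the S3 store more flush threads and smaller chunks, so the hour-end upload doesn't monopolize a single flush thread. This is only a sketch, and it assumes the buffered S3 output honors num_threads (rest of the store settings unchanged):

<store>
  type s3
  # ... same s3/aws settings as below ...
  num_threads 4            # assumption: parallel flush threads for uploads
  buffer_chunk_limit 256m  # smaller chunks, so hour-end flushes start sooner
</store>

If the delay disappears with this, that would point at the S3 upload blocking; if not, the cause is probably elsewhere.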
Do you guys have any insight on this problem?
Thank you a lot.
==============================
Here's the conf file for the log processors:
<match app.**>
  type file
  path /var/log/fluent/app.log
  time_slice_format %Y%m%d-%H
  time_slice_wait 1m
  compress gzip
  utc
</match>
####
## Source descriptions:
####
## catch TCP messages from webservers
<source>
  type forward
  port 24224
  bind 0.0.0.0
</source>
=================================
Here's part of the config file for the log aggregator boxes:
<match matching_rule>
  type copy
  <store>
    type s3
    aws_key_id XXXXXXXXXXXXXXXXXXXXXXXXXX
    aws_sec_key XXXXXXXXXXXXXXXXXXXXXXXXXXX
    s3_bucket bucket-name
    path s3-path
    buffer_path /var/log/fluent/s3.buffer
    buffer_chunk_limit 1024m
    store_as lzo
    time_slice_format %Y%m%d-%H
    time_slice_wait 5m
    utc
  </store>
  <store>
    type forward
    send_timeout 10s
    recover_wait 10s
    heartbeat_interval 1s
    phi_threshold 8
    hard_timeout 10s
    flush_interval 1s
    <server>
      name processor1
      host host1
      port 24224
      weight 100
    </server>
    <secondary>
      type file
      path /var/log/fluent/forward-failed.site_id
    </secondary>
  </store>
</match>