Fluentd taking longer than time_slice

Barreto, Rafael

unread,

Oct 18, 2013, 3:56:39 PM10/18/13

to flu...@googlegroups.com

Hey guys,

I'm facing an annoying behavior of fluentd that I'm not completely able to explain. Here's current architecture:

app boxes -> aggregator boxes -> log processor boxes

So, application boxes send logs to aggregator boxes which run S3 and forward plugins. The S3 plugin aggregates the data by hour and saves it to S3. The same data is forwarded to log processor boxes using fluent_forward plugin. In the log processor boxes we have a fluentd server with a file plugin. It just receives the data and saves to disk. This data is then processed by a real time agent that keeps tailing this data and processing it.

In the real time agent we have an alert system that sends a message when the log stream is interrupted for, let's say, 2 minutes. This works fine. The only problem is that after the hour switch (like from 9:59 to 10:00 PM), data delays more than that to come in. And when data comes in, it comes at once. For example, between 10:00 PM and 10:03 PM there are no log entries and suddenly, thousands of log entries come in at once. Problem here is not losing data. Data is really coming in. Problem is this delay.

I tried different configurations for time_slice_wait and flush_interval. I tried setting them as low as 30s, but nothing changes.

My best guess is that aggregator boxes are busy sending data to S3 and forward plugin gets stuck until it finishes. Is this possible?

Do you guys have any insight on this problem?

Thank you a lot.

==============================

Here's the conf file for the log processors:

type file

path /var/log/fluent/app.log

time_slice_format %Y%m%d-%H

time_slice_wait 1m

compress gzip

utc

</match>

####

## Source descriptions:

###

## catch tcp messages from webservers

type forward

port 24224

bind 0.0.0.0

</source>

=================================

Here's part of the config file for the log aggregator boxes:

type copy

<store>

type s3

aws_key_id XXXXXXXXXXXXXXXXXXXXXXXXXX

aws_sec_key XXXXXXXXXXXXXXXXXXXXXXXXXXX

s3_bucket bucket-name

s3_endpoint s3.amazonaws.com

path s3-path

buffer_path /var/log/fluent/s3.buffer

buffer_chunk_limit 1024m

store_as lzo

time_slice_format %Y%m%d-%H

time_slice_wait 5m

utc

</store>

<store>

type forward

send_timeout 10s

recover_wait 10s

heartbeat_interval 1s

phi_threshold 8

hard_timeout 10s

flush_interval 1s

name processor1

host host1

port 24224

weight 100

</server>

type file

path /var/log/fluent/forward-failed.site_id

</secondary>

</store>

</match>

Masahiro Nakagawa

unread,

Oct 20, 2013, 4:00:00 PM10/20/13

to flu...@googlegroups.com

Hi Barreto,

> My best guess is that aggregator boxes are busy sending data to S3 and forward plugin gets stuck until it finishes. Is this possible?

Yes. out_copy plugin emits records to <store>'s plugin in order from the top and waits each plugin flush.

Could you check spent time of S3 plugin?

If S3 plugin takes some times to upload records,

then changing <store> order or something is needed to resolve this delay problem.

Masahiro

--
You received this message because you are subscribed to the Google Groups "Fluentd Google Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fluentd+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Barreto, Rafael

unread,

Oct 21, 2013, 1:48:10 AM10/21/13

to flu...@googlegroups.com

Thanks Masahiro. But I'm now wondering if your suggestion really solves the problem. After all, pulling forward plugin up will not avoid locking when executing S3 plugin. So, by what I understand, if a new log entry comes in while S3 plugin is stuck, this new log entry will not be processed by the copy plugin anyway. Is that right? If so, is there any more general solution?

Thanks a lot in advance.

--
Sincerely,
Rafael

Masahiro Nakagawa

unread,

Oct 22, 2013, 12:36:43 AM10/22/13

to flu...@googlegroups.com

Sorry, I mistake.

I checked the code of S3 plugin.

This plugin uses TimeSlicedOutput and this output separate 'emit' and 'flush' thread.

Even if uploading records to S3 takes some times, it doesn't affect other <store>s.

Maybe, delay is another problem.

Barreto, Rafael

unread,

Oct 25, 2013, 1:54:48 PM10/25/13

to flu...@googlegroups.com

Indeed the delay was in another part of the system, unrelated to fluentd actually. Thanks for your pointers though. They helped a lot.

Reply all

Reply to author

Forward

Fluentd taking longer than time_slice_wait to flush data

Barreto, Rafael

Masahiro Nakagawa

Barreto, Rafael

Masahiro Nakagawa

Barreto, Rafael