Fluentd taking longer than time_slice_wait to flush data

Barreto, Rafael

Oct 18, 2013, 3:56:39 PM
to flu...@googlegroups.com
Hey guys,

I'm facing an annoying behavior of fluentd that I'm not completely able to explain. Here's the current architecture:

app boxes -> aggregator boxes -> log processor boxes

So, application boxes send logs to aggregator boxes, which run the S3 and forward plugins. The S3 plugin aggregates the data by hour and saves it to S3; the same data is forwarded to the log processor boxes using the forward plugin. On the log processor boxes we have a fluentd server with a file output that just receives the data and saves it to disk. This data is then consumed by a real-time agent that keeps tailing the file and processing it.

In the real-time agent we have an alert system that sends a message when the log stream is interrupted for, let's say, 2 minutes. This works fine. The only problem is that after the hour switch (like from 9:59 to 10:00 PM), data takes longer than that to come in, and when it does come in, it all arrives at once. For example, between 10:00 PM and 10:03 PM there are no log entries, and then suddenly thousands of log entries come in at once. The problem here is not losing data; the data really does come in. The problem is the delay.
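For context, the stall check in the agent is essentially the following (a rough sketch, not our actual code; process and alert stand in for our real handlers, and the real file path includes the time-slice suffix):

def process(line); end                        # real handling elided
def alert(msg); puts msg; end                 # real notification elided

log = File.open('/var/log/fluent/app.log')    # simplified path
log.seek(0, IO::SEEK_END)                     # start tailing from the end
last_event = Time.now

loop do
  if (line = log.gets)
    last_event = Time.now
    process(line)
  elsif Time.now - last_event > 120           # no entries for ~2 minutes
    alert('log stream interrupted')
    last_event = Time.now                     # do not re-alert every second
  else
    sleep 1                                   # at EOF, wait for new data
  end
end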

I tried different configurations for time_slice_wait and flush_interval, setting them as low as 30s, but nothing changed.

My best guess is that the aggregator boxes are busy sending data to S3 and the forward plugin gets stuck until that finishes. Is this possible?

Do you guys have any insight on this problem?

Thank you a lot.

==============================
Here's the conf file for the log processors:

## write incoming events to hourly files on disk
<match app.**>
  type file
  path /var/log/fluent/app.log
  time_slice_format %Y%m%d-%H
  time_slice_wait 1m

  compress gzip
  utc
</match>

####
## Source descriptions:
###

## catch tcp messages from webservers
<source>
  type forward
  port 24224
  bind 0.0.0.0
</source>
=================================
Here's part of the config file for the log aggregator boxes:

<match matching_rule>
  type copy
  # first store: buffer hourly slices and upload to S3
  <store>
    type s3

    aws_key_id XXXXXXXXXXXXXXXXXXXXXXXXXX
    aws_sec_key XXXXXXXXXXXXXXXXXXXXXXXXXXX
    s3_bucket bucket-name
    s3_endpoint s3.amazonaws.com
    path s3-path
    buffer_path /var/log/fluent/s3.buffer
    buffer_chunk_limit 1024m

    store_as lzo

    time_slice_format %Y%m%d-%H
    time_slice_wait 5m
    utc
  </store>
  # second store: forward the same stream to the log processor boxes
  <store>
    type forward
    send_timeout 10s
    recover_wait 10s
    heartbeat_interval 1s
    phi_threshold 8
    hard_timeout 10s
    flush_interval 1s
    <server>
      name processor1
      host host1
      port 24224
      weight 100
    </server>
    <secondary>
      type file
      path /var/log/fluent/forward-failed.site_id
    </secondary>
  </store>
</match>

Masahiro Nakagawa

Oct 20, 2013, 4:00:00 PM
to flu...@googlegroups.com
Hi Barreto,

> My best guess is that the aggregator boxes are busy sending data to S3 and the forward plugin gets stuck until that finishes. Is this possible?

Yes. The out_copy plugin emits records to each <store>'s plugin in order from the top and waits for each plugin's flush.
Could you check how much time the S3 plugin spends?
If the S3 plugin takes a long time to upload records,
then changing the <store> order, or something similar, may be needed to resolve this delay problem.
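Roughly, out_copy's emit path looks like this (a simplified sketch, not the exact plugin source):

class CopyOutput
  def initialize(stores)
    @stores = stores                # the <store> outputs, in config order
  end

  def emit(tag, es, chain)
    @stores.each do |store|
      store.emit(tag, es, chain)    # each store runs in turn, top to bottom
    end
  end
end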


Masahiro


Barreto, Rafael

Oct 21, 2013, 1:48:10 AM
to flu...@googlegroups.com
Thanks Masahiro. But I'm now wondering whether your suggestion really solves the problem. After all, moving the forward plugin up will not avoid blocking while the S3 plugin runs. So, as I understand it, if a new log entry comes in while the S3 plugin is stuck, that entry will not be processed by the copy plugin anyway. Is that right? If so, is there a more general solution?

Thanks a lot in advance.
--
Sincerely,
Rafael

Masahiro Nakagawa

Oct 22, 2013, 12:36:43 AM
to flu...@googlegroups.com
Sorry, my mistake.

I checked the code of the S3 plugin.
It uses TimeSlicedOutput, which runs 'emit' and 'flush' on separate threads.
Even if uploading records to S3 takes a long time, it doesn't affect the other <store>s.
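In rough terms the structure is like this (a simplified sketch, not the real TimeSlicedOutput code):

require 'thread'

class TimeSlicedOutputSketch
  def initialize
    @buffer = Queue.new
    # flushing runs on its own thread, so a slow S3 upload
    # does not block emit, and therefore not the other <store>s
    @flush_thread = Thread.new do
      loop { write(@buffer.pop) }
    end
  end

  def emit(tag, es)
    @buffer << [tag, es]    # emit only appends to the buffer: fast
  end

  def write(chunk)
    # the slow work (e.g. the S3 upload) happens here, off the emit path
  end
end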

Maybe the delay is another problem.

Barreto, Rafael

Oct 25, 2013, 1:54:48 PM
to flu...@googlegroups.com
Indeed, the delay was in another part of the system, actually unrelated to fluentd. Thanks for your pointers though; they helped a lot.