Fluentd behind an NLB in ECS


Joshua Atkins

Aug 8, 2019, 1:40:55 PM8/8/19
to Fluentd Google Group
Hi all,

I've revisited running fluentd in ECS behind an NLB, and I'm still seeing a lot of TCP RSTs. I'm hoping someone has run into this so we can put our heads together and come up with a solution.
  • Fluentd running in ECS, built from the Docker image fluent/fluentd:v1.6-debian-1
  • in_forward config:
  • <source>
      @type forward
      @label @raw

      port 10500
    </source>
  • fluentbit 1.2.2 config (note that I also experience this with fluentd 1.6 as the forwarder, not just fluent-bit):
  • [OUTPUT]
        Name          forward
        Match         *
        Host          $dest_fqdn
        Port          10500
        tls           off
        Require_ack_response  True
I see a huge number of TCP RSTs on port 10500. I've confirmed I see this on the fluent-bit sending end -- not just between the NLB and the container -- using:

 tcpdump -i eno1 -n 'tcp[tcpflags] & (tcp-rst) != 0'

The NLB is seeing around 200 TCP RSTs a minute.

I'm out of ideas on how to troubleshoot this. I've tried forcing /proc/sys/net/ipv4/tcp_keepalive_time below 350 seconds (the NLB's unconfigurable idle timeout), and I've tried fluentd with various combinations of heartbeat on, heartbeat off, keepalive on, keepalive off, keepalive_timeout 300, etc.
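(For reference, forcing the kernel keepalive timers below the NLB's ~350-second idle timeout might look like the sysctl fragment below. The file path and the 300/30/5 values are illustrative, not from this thread; the only hard requirement is keepalive_time < 350.)

```
# /etc/sysctl.d/99-nlb-keepalive.conf (illustrative values)
# Start probing idle connections before the NLB's ~350 s idle timeout.
net.ipv4.tcp_keepalive_time = 300
# Then probe every 30 s, up to 5 unanswered probes before the
# connection is declared dead.
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
```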

Has anyone solved this?

Cheers,
Josh

Markus Bergholz

Aug 8, 2019, 1:50:34 PM8/8/19
to flu...@googlegroups.com
We're also running fluentd in an EC2-based ECS cluster behind an NLB, and we have containers in the same cluster using fluent-bit to log to fluentd over the NLB.
And yeah, I've seen a lot of strange effects in this setup.
What helped for us was running the fluentd service container in awsvpc mode, which means fluentd gets its own ENI and IP address.
You should try it, but you'll have to redeploy the whole service, because the network mode can't be changed once the service is set up.


Joshua Atkins

Aug 8, 2019, 1:52:28 PM8/8/19
to flu...@googlegroups.com
Thanks Markus. I should have included that: we are running the ECS cluster in awsvpc mode and have the NLB set to target_type=ip rather than instance. The NLB and ECS cluster are in the same VPC and the same subnets/availability zones, with cross-AZ load balancing disabled to reduce complexity while I troubleshoot this.

If you have a look at the monitoring tab on your load balancer, are you seeing a huge number of client resets like the graph below?

[Screenshot attachment: Screen Shot 2019-08-08 at 10.38.47 AM.png]

Cheers,
Josh

Markus Bergholz

Aug 8, 2019, 2:05:48 PM8/8/19
to flu...@googlegroups.com
Yeah, around 2,500, sometimes a lot more. But we have more services behind that NLB (proxysql, SMTP), and unfortunately you can't narrow the metric down to a single target group.
In our test account I see around 1,000 reset counts, and only fluentd is running behind that NLB.
So I'm not sure whether we should worry or not :D So far we haven't lost any logs.


Joshua Atkins

Aug 8, 2019, 4:20:10 PM8/8/19
to flu...@googlegroups.com
I would check your fluentd/fluent-bit forwarding metrics, because with this many TCP RSTs, I think we are actually losing logs:

fluentbit_output_retries_total{name="forward.0"} 622
fluentbit_output_retries_failed_total{name="forward.0"} 64

622 retries on the output, 64 of them failed.
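(For anyone reproducing this: the failure rate can be pulled straight out of fluent-bit's Prometheus text output. The sketch below just inlines the two counters quoted above; in a live setup you'd fetch them from the built-in HTTP server, e.g. `curl -s http://localhost:2020/api/v1/metrics/prometheus`.)

```shell
# Sample counters from this thread; replace with a live curl of the
# fluent-bit metrics endpoint when the HTTP server is enabled.
metrics='fluentbit_output_retries_total{name="forward.0"} 622
fluentbit_output_retries_failed_total{name="forward.0"} 64'

# Pick out the two counters and compute the share of retries that
# ultimately failed (i.e. chunks dropped after exhausting retries).
echo "$metrics" | awk '
  /fluentbit_output_retries_total/        { total  = $2 }
  /fluentbit_output_retries_failed_total/ { failed = $2 }
  END { printf "%d/%d retries failed (%.1f%%)\n", failed, total, 100 * failed / total }
'
```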

J

Mr. Fiber

Aug 8, 2019, 10:37:52 PM8/8/19
to Fluentd Google Group
>  I've tried forcing /proc/sys/net/ipv4/tcp_keepalive_time

Maybe you also need to set the corresponding parameter in in_forward for it.
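(In case it helps others find it: the in_forward option presumably being referred to here is the boolean send_keepalive_packet, off by default. The surrounding values are the ones from the config quoted earlier in the thread.)

```
<source>
  @type forward
  @label @raw

  port 10500
  # Enable TCP keepalive on accepted forward connections (default: false).
  send_keepalive_packet true
</source>
```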


Joshua Atkins

Aug 9, 2019, 12:56:48 AM8/9/19
to flu...@googlegroups.com
Huh, I didn't even see that option the last time I went through the documentation, thanks for pointing it out!

Switching it to true hasn't fixed the issue, but that's with fluent-bit as the forwarder, and I can't find any documentation suggesting that fluent-bit actually supports keepalives, so I'll try switching back to fluentd-to-fluentd with keepalives set and see if that resolves it.

Thanks!

J

Markus Bergholz

Aug 9, 2019, 4:24:45 AM8/9/19
to flu...@googlegroups.com
Hm, at the moment I'm fairly sure that we don't lose logs. I guess the TCP RSTs are normal connection cleanups by the TCP/IP stack. To be sure, we'd need to do a tcpdump between fluent-bit and the NLB.

I wanted to enable the monitoring too, but it failed: the HTTP server simply does not start when I add

    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

to our config.
Our full fluent-bit configuration looks like this:

[SERVICE]
    Flush                     1
    Log_Level                 info
    Parsers_File              /etc/fluent-bit/parsers.conf
    Daemon                    off
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M

[INPUT]
    Name     syslog
    Parser   syslog-rfc3164-local
    Listen   127.0.0.1
    Port     5140
    Mode     tcp
    Tag      rawsyslog

[INPUT]
    Name     tail
    Path     /var/log/suricata/fast.log
    Tag      suricata
    DB       /var/log/flb-storage/keep_track.db

[FILTER]
    Name     record_modifier
    Match    suricata
    Record   hostname ${HOSTNAME}

[OUTPUT]
    Name        forward
    Match       *
    Host        fluentd.xxxx
    Port        24224
    Retry_Limit False

Any ideas about this?