Describe the bug
We are up against a really strange and frustrating problem. I have no experience with FluentD at all, so I will try to give as complete a picture as possible.
We have deployed FluentD as a DaemonSet in a Kubernetes cluster. FluentD is configured to gather logs from multiple sources (Docker daemon, network, etc.) and send them to a hosted AWS Elasticsearch domain.
Along with the logging mentioned above, we have in-app mechanisms that log directly to FluentD through a separate @type forward source created only for these in-app logs, which are then forwarded to Elasticsearch through a match with @type elasticsearch.
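For reference, the flow described above would roughly correspond to a FluentD configuration like the one below. This is a hedged reconstruction, not the actual config from the report (which is not shown here); the tag pattern, port, and Elasticsearch endpoint are all assumptions:

```
# Forward input used only by the in-app loggers
# (24224 is fluentd's default forward port; assumed here)
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

# Ship those events to the hosted AWS Elasticsearch domain
# (tag pattern and host below are placeholders)
<match app.**>
  @type elasticsearch
  host example-domain.eu-west-1.es.amazonaws.com
  port 443
  scheme https
  logstash_format true
</match>
```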
The problem is that this in-app log flow creates a steady but slow memory leak on the node it runs on. Even stranger, the leak does not happen in userspace application memory: both the apps' and Fluentd's process memory remain stable. What constantly increases is kernel memory, which means the node's available memory steadily decreases until memory starvation problems begin. Note that I am referring to non-cache kernel memory that is not freed when requested. The applications are not particularly logging-heavy; peak throughput should be around 10 log lines/sec from all of them combined.
This is not happening with any of the other logging configurations in Fluentd, where Docker, system, Kubernetes, and other logs are scraped. If I turn off this in-app mechanism, there is no memory leak!
I have installed various monitoring tools on the server, trying to find another metric whose trend correlates with the memory decrease. The only close match I found is IPv4 TCP memory usage, which makes some sense, since TCP is both how the in-app logs are sent to FluentD and kernel-related. However, although the trends are similar, the actual amounts do not match: in the screenshots attached below, over the same time period, system memory decreases by around 700MB while TCP memory usage increases by only 30MB. The trend, though, is a complete match!
Any help with this problem would be really appreciated! Feel free to ask any extra information that you might need.
Below are the details about my configuration and set up.
To Reproduce
A simple pod running a NodeJS app that sends logs directly to FluentD using the fluent-logger npm package is enough to cause the memory problem.
Expected behavior
I expect the kernel memory to remain stable when usage is also stable, as is the case with the rest of the logging configuration.
Your Environment
Your Configuration
The FluentD DaemonSet is deployed using the latest (v11.3.0) chart version found at https://github.com/kokuwaio/helm-charts/blob/main/charts/fluentd-elasticsearch/Chart.yaml
Since there is a lot of configuration, I will only include here the relevant config that triggers the problem. If you need all of it, let me know and I will paste it in a pastebin or similar.