Custom Fluent Config for Logging


Michael Ben-David

Jan 27, 2021, 11:27:34 AM
to Google Stackdriver Discussion Forum
Hello, I am new here.  I was happy to discover this forum! :)

Should we ever expect to be able to inject a custom Fluentd config into an existing GKE cluster for field extraction / multi-line event merging?

Our clusters are set up with:
[X] Enable Cloud Operations for GKE
        System and workload logging and monitoring

Current concerns:
  • No timestamp extraction
  • No log severity extraction (everything is Info)
  • No multi-line event merging (e.g., Java exception stack traces). This is a cost factor: we end up with many events per log record, and each event carries a copy of fields that should appear once per record.
  • Having to launch new clusters to get a customizable logging config is an operational challenge
Thoughts?

Mary Koes

Jan 27, 2021, 1:32:13 PM
to Michael Ben-David, Nathan Beach, Charles Baer, Google Stackdriver Discussion Forum
@Nathan Beach and @Charles Baer re: GKE logging user experience


Charles Baer

Jan 27, 2021, 3:18:16 PM
to Michael Ben-David, Igor Peshansky, Nathan Beach, Google Stackdriver Discussion Forum, Mary Koes
Michael,

I'm a product manager in Cloud Logging. 

First, welcome and thanks for your questions. 

With the default "System and workload logging and monitoring" setting, several special fields in a JSON log entry that you write are mapped by default; severity and timestamp are both on this list. The documentation includes examples mapping the JSON fields you write to the timestamp and severity fields on the resulting log entries.

+Igor Peshansky for structured logging mapping as well.

Can you share a bit more about what you mean by multi-line? Is this an issue where a single logical log record ends up split across multiple log entries?

Also, on the last comment, can you share what you mean by "Having to launch new clusters to get a customizable logging config is an operational challenge"?

Thanks,
-Charles

Michael Ben-David

Jan 27, 2021, 6:22:33 PM
to Charles Baer, Igor Peshansky, Nathan Beach, Google Stackdriver Discussion Forum, Mary Koes
Thanks Charles,

Here is an example screenshot of Log Explorer showing:
  • Top line:
    • LogEvent Severity Info that should be Error
    • LogEvent Timestamp that has not been extracted from the record
  • Remaining lines: should be merged with the top line as a single LogEvent
[screenshot omitted]

Raw content of /var/lib/docker/containers/... for lines 3-7 of the screenshot:
{"log":"Was expecting one of:\n","stream":"stdout","time":"2021-01-21T14:38:11.098829115Z"} 
{"log":"    \"and\" ...\n","stream":"stdout","time":"2021-01-21T14:38:11.098834063Z"} 
{"log":"    \"or\" ...\n","stream":"stdout","time":"2021-01-21T14:38:11.098838921Z"} 
{"log":"    \")\" ...\n","stream":"stdout","time":"2021-01-21T14:38:11.098843516Z"} 
{"log":"    \n","stream":"stdout","time":"2021-01-21T14:38:11.098848258Z"} 

re: "Having to launch new clusters to get a customizable logging config is an operational challenge"?
  • Perhaps I am mistaken, but I have the impression that it is not supported to add custom fluent config to the out-of-the-box /etc/google-fluentd configs.
    A quick survey shows most of our GKE clusters are on 1.16.15-gke.6000 and 1.15.12-gke.20, if that matters.  I'm not sure whether we have older ones.
    Our Google Engineering rep is helping me confirm whether this is a real limitation or my misunderstanding.
    My first impressions of the limitation came from these:
       - https://cloud.google.com/solutions/customizing-stackdriver-logs-fluentd 
       - https://cloud.google.com/community/tutorials/kubernetes-engine-customize-fluentbit
  • If it is necessary to tear down all of the google-fluentd config, or to launch new clusters, then I would prefer to proceed with Fluent Bit.  However, if we can simply augment the out-of-the-box fluentd config to get multi-line merging and extraction of timestamp / severity, then I'm happy to stick with fluentd.
    Fluentd / Fluent Bit / Google Ops are all new to me (my past experience in this space includes Splunk, Logstash and Filebeat for forwarding, filtering and transforms).
Thanks,
Michael.

Igor Peshansky

Jan 28, 2021, 12:36:52 AM
to Michael Ben-David, Charles Baer, Nathan Beach, Google Stackdriver Discussion Forum, Mary Koes
Michael,

Generally, you will not be able to add custom configs to the managed fluentd or fluent-bit DaemonSet. To fully customize the configuration, you'd need to disable the managed DaemonSet and deploy your own, as you've already discovered.
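For example, on an existing cluster, turning off the managed logging integration would look roughly like this (a sketch; the cluster name and zone are placeholders, and you should verify the flag against your gcloud version):

    # Stop sending logs via the managed agent; monitoring is unaffected.
    gcloud container clusters update my-cluster \
        --zone us-central1-a \
        --logging-service=none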

However, you may not need to customize the configs at all, at least for some of your use cases. The way the managed configs are set up, any logs sent to stdout default to severity INFO, and any logs sent to stderr default to ERROR. As Charles said, if your log line is written as a serialized JSON object, certain fields in that object will be used to control the resulting log entry, as described in [1] and [2]. Specifically, fields named "timestamp", "timestampSeconds"+"timestampNanos", and "time" control the entry timestamp (in various formats), while "severity" controls the entry severity. So if you write the following line to stdout:

{"message":"This should be at severity ERROR\n","severity":"ERROR","time":"2021-01-27T18:00:00Z"}

You should get a log entry like the following:

{"jsonPayload":{"message":"This should be at severity ERROR\n"},"timestamp": "2021-01-27T18:00:00Z","severity": "ERROR","receiveTimestamp": "2021-01-28T01:06:40.709652601Z"}

Regarding multiline exception detection, the logging agent does detect Java exception stack traces by default, but it does not handle multiline exception messages, which is what you seem to have. For reference, the full state machine for detecting Java exceptions in the fluentd plugin sources is at [3]. The fluent-bit-based agent implements something similar. FWIW, even if the agent could detect and join that exception trace, I am not sure that Cloud Error Reporting would recognize it as a valid stack trace once it's ingested. You can verify the latter by piping your full stack trace (as written to stdout/stderr by the application) into the following command:

gcloud beta error-reporting events report \
  --service manual \
  --service-version test-version \
  --project=$(gcloud config get-value project) \
  --message-file=/dev/stdin

and checking the results on the Error Reporting page in the Cloud Console. Sadly, if it doesn't work, I don't really have a good mitigation for you other than changing the format of that exception message to not contain newlines.
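If you do end up running your own fluentd DaemonSet, the detection logic at [3] ships as the detect_exceptions output plugin; a minimal sketch (the "raw" tag prefix and the flush/limit values here are just examples):

    # Merge consecutive lines that form a Java exception trace into one event.
    <match raw.**>
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      languages java
      multiline_flush_interval 0.1
      max_bytes 500000
      max_lines 1000
    </match>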

Hope this helps,
        Igor

Michael Ben-David

Feb 3, 2021, 11:59:50 AM
to Google Stackdriver Discussion Forum
Thanks for your detailed response, Igor.  My response was delayed because I took some time to ramp up further on Kubernetes, to be better positioned to continue this effort.

I may consider converting the apps that do not currently log in JSON; however, there is some opposition to that due to the additional verbosity and cost it adds when flowing logs through products that meter by ingested bytes.  From first principles I would rather log in a versioned, pipe-delimited structured format for standardized fields, with an optional terminal JSON field for non-standard fields.  This worked very well on my last project, which collected 50,000 events/second in production across polyglot services.
    logversion|timestamp|severity|<10+ context fields>|message|{"service_name":{"field1":"value1", "field2":"value2"}}

The logversion field drove parsing rules applied to each event.
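A concrete event in that format might look like this (all field values hypothetical, with the context fields elided):

    2|2021-01-21T14:38:11Z|ERROR|...|Was expecting one of:|{"checkout":{"cart_id":"abc123"}}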

The Log Router only offers sampling at the event level.  I would also like to sample at the field level, and to be positioned to apply other log transformations outside the application, before logs hit the Log Router.

I'm going to create test clusters today, configured as our existing ones are, and then update them with --no-enable-stackdriver-kubernetes to see if I can retrofit a custom fluentd or fluent-bit config without having to launch new clusters, following the approach of one of the tutorials linked above.
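Something like this, I assume (cluster name and zone are placeholders; I still need to verify the flag against our gcloud version):

    gcloud container clusters update my-cluster \
        --zone us-central1-a \
        --no-enable-stackdriver-kubernetes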
I'll let you know how it goes.

Thanks,
Michael.