On Tue, Nov 16, 2021 at 07:59:45AM -0800, 'Christian Turri' via gce-discussion wrote:
> It really is quite silly to say that "that the logs are not duplicated at
> all" but rather "it is now printing double". Of course if print double you
> WILL duplicate the logs. You can call that "printing double" instead of
> duplicating the logs but it's the same thing. In any case the fact that
> "This was included in order to solve other logging issues customers had, so
> it is a known consequence of that" doesn't make this change right. It is
> still a horrible Kludge <
https://en.wikipedia.org/wiki/Kludge>. I ask
> myself why can't whoever made this change not find the root cause of the
> missing logging events rather than duplicating all logs by printing double?
>
> It is really disappointing that a company that claims to have the best
> software engineering in the world ends up with "solutions" like this one.
> At least if someone else does a Google search and finds this post they will
> not waste their time with Google Support and will understand why the GCE
> Console Serial Output is now a mess.
I can't speak to exactly why the Google engineers that are working on
agent made the design choices they did, but I can make a few
observations:
[ 14.685254] google_metadata_script_runner[2043]: No startup scripts to run.
This message indicates that it was routed through the kernel. There
are multiple ways this can be done, but one of the ways is to issue
the message through /dev/kmesg. For example:
# echo test message > /dev/kmsg
# dmesg | tail -1
[437049.229082] test message
The number is square brackets is the number of seconds since the
kernel was booted.
Oct 15 17:51:00 instance-1 google_metadata_script_runner[2043]: No startup scripts to run.
This message indicates that the message was routed through syslog, and
there is probably a /etc/syslogd.conf entry that is causing the
message to be written to the console. It could *also* mean that the
message was forwarded to the console via systemd-journal (see
/etc/systemd/journald.conf for either a ForwardToConsole or
ForwardToWall specification.) This depends on the Linux distribution,
and how the user may have customized their Linux distribution. (Many
distributions are now using systemd-journald, but many enterprise
customers are still used to using syslogd, and they may made changes
to how their system was configured away from the default.)
As far as why the Google Engineers might prefer log entries being
potentially printed to the console twice, this is just speculation,
but....
Some customers may be using syslogd; other customers might be using
systemd-journald. Some customers might be using both. Some customers
might have configured dmesg[1] to suppress some kernel messages from
being sent to the console. Some customers may have configured
systemd-journald or syslogd not to send messags to the console either.
Furthermore, different distribution may be using different defaults,
and so even if the customers weren't making changes to the logging
facilities, it's hard to know how the overall system logging has been
set up.
[1] See the man page for the dmesg command where it describes how
"dmesg --console-level <LEVEL>" works.
Furthermore, after the VM dies, usually the PD containing the root
file system gets automatically deleted, at which point all of the log
files in /var/log/* and stored by systemd-journald will be gone.
There are solutions such as Stackdriver, but they are not free, and so
many customers choose not to use them.
There is a better chance that the output of the serial console will be
saved, so if the customer is looking for help, it's more likely that
the customer can be helped if the log messages are sent to the serial
console at least once, regardless of how the distro has been
configured, either by default or by the customer after installing the
Linux distro and creating a custom image. After all, if the messages
are sent twice, it's possible to filter them out, but if the messages
aren't visible to the serial console at *all*, and the user asks for
help, even if it was the user's choices in configuration that caused
one or both of the messages to be suppressed, the user is not likely
to blame themselves for the fact that their problem couldn't be easily
root caused.
Cheers,
- Ted