Duplicated Serial Console events on latest RHEL image

131 views
Skip to first unread message

Christian Turri

unread,
Oct 15, 2021, 5:51:11 PMOct 15
to gce-discussion
Hi,

We have rebuilt our Linux RHEL images based on new GCP Linux images and noticed the Serial port 1 (console) output of the VM is now duplicating lines. In addition to this there are lines without date/time prefix. And finally the prefix used for these log events has been changed from "GCEMetadataScripts" to "google_metadata_script_runner: startup-script" which is now different from Windows VMs. This change seems to have happened between these two GCP images: rhel-7-v20210721 (console output is fine) rhel-7-v20210916 (console output is not fine) 

Here is a sample console output from a VM built based on rhel-7-v20210721: 

Oct 15 09:39:57 dev1-dtaiku-du10 GCEMetadataScripts[1353]: 2021/10/15 09:39:57 GCEMetadataScripts: startup-script: ***************************************************************** 
Oct 15 09:39:57 dev1-dtaiku-du10 GCEMetadataScripts[1353]: 2021/10/15 09:39:57 GCEMetadataScripts: startup-script: Starting startup script 
Oct 15 09:39:57 dev1-dtaiku-du10 GCEMetadataScripts[1353]: 2021/10/15 09:39:57 GCEMetadataScripts: startup-script: ***************************************************************** 
Oct 15 09:39:57 dev1-dtaiku-du10 GCEMetadataScripts[1353]: 2021/10/15 09:39:57 GCEMetadataScripts: startup-script: Project ID: emerging-seahorse-584557

Here is the same sample console output from a VM built based on rhel-7-v20210916: 

 [ 9.719888] google_metadata_script_runner[1481]: startup-script: ***************************************************************** [ 9.842026] google_metadata_script_runner[1481]: startup-script: Starting startup script [ 9.842193] google_metadata_script_runner[1481]: startup-script: ***************************************************************** [ 9.842344] google_metadata_script_runner[1481]: startup-script: Project ID: emerging-seahorse-584557 
Oct 15 09:51:22 dev1-dtaiku-du10 google_metadata_script_runner: Starting startup scripts (version 20210908.1). 
Oct 15 09:51:22 dev1-dtaiku-du10 google_metadata_script_runner: Found startup-script in metadata. 
Oct 15 09:51:22 dev1-dtaiku-du10 google_metadata_script_runner: startup-script: ***************************************************************** [ 9.842540] google_metadata_script_runner[1481]: startup-script: Environment Name: dev1 [ 9.842767] google_metadata_script_runner[1481]: startup-script: Vault Folder: non-prod [ 9.842910] google_metadata_script_runner[1481]: startup-script: Check /fusion exists Oct 15 09:51:22 dev1-dtaiku-du10 google_metadata_script_runner: startup-script: Starting startup script 
Oct 15 09:51:22 dev1-dtaiku-du10 google_metadata_script_runner: startup-script: ***************************************************************** 
Oct 15 09:51:22 dev1-dtaiku-du10 google_metadata_script_runner: startup-script: Project ID: emerging-seahorse-584557 


This is causing issues in our tools that analise the console output looking for errors due to the change in prefix, the lack of date/time on some console events and the duplication of console events. Why has this been changed and can these issues be fixed? Thanks

Christian Turri

unread,
Oct 15, 2021, 8:15:51 PMOct 15
to gce-discussion
Here is a way to reproduce this issue:

2) Search for RHEL
3) On the rhel-8-v20210721 image click on the Actions ... and select Create instance.
4) Once the instance is created click on it
5) Then click on "Serial port 1 (console)"
6) Verify the console output when the startup script is run is similar to this:

Oct 15 17:54:41 instance-2 systemd[1]: Starting Google Compute Engine Startup Scripts...
Oct 15 17:54:42 instance-2 GCEGuestAgent[1043]: 2021-10-15T17:54:42.0464Z GCEGuestAgent Info: Created google sudoers file
Oct 15 17:54:42 instance-2 GCEMetadataScripts[1376]: 2021/10/15 17:54:42 GCEMetadataScripts: Starting startup scripts (version 20210629.00).
Oct 15 17:54:42 instance-2 GCEMetadataScripts[1376]: 2021/10/15 17:54:42 GCEMetadataScripts: No startup scripts to run.
Oct 15 17:54:42 instance-2 systemd[1]: google-startup-scripts.service: Succeeded.

All lines have the date/time prefix, the GCEMetadataScripts prefix is used same as Windows VMs, there are no duplicate console output lines. 

7) Now repeat the same steps for the new rhel-8-v20210916 image. 
8) Verify the console output when the startup script is run is similar to this:

[   14.685254] google_metadata_script_runner[2043]: No startup scripts to run.
Oct 15 17:51:00 instance-1 google_metadata_script_runner[2043]: Starting startup scripts (version 20210908.1).
Oct 15 17:51:00 instance-1 kernel: fbcon: Taking over console
Oct 15 17:51:00 instance-1 google_metadata_script_runner[2043]: No startup scripts to run.

Issues:

a) Some lines don't have the date/time prefix, 
b) The GCEMetadataScripts prefix has been changed to google_metadata_script_runner
c) Some logging lines are duplicated, like "No startup scripts to run."

Why is this happening?

Christian Turri

unread,
Oct 19, 2021, 10:42:50 AMOct 19
to gce-discussion

So I have raised Case 28987919 with Google Support. Is it worth raising the issue on issuetracker.google.com as well? 

Adebisi Ibirogba

unread,
Oct 19, 2021, 2:27:01 PMOct 19
to gce-discussion
Since you have opened case 28987919, you don't need to open an issue tracker case.

And from checking the case, it has already been assigned to a support person and it would be handled effectively from there on

Christian Turri

unread,
Oct 19, 2021, 3:15:46 PMOct 19
to gce-discussion
Thanks. Is anyone able to replicate the issue and confirm it? Support are saying that they can't replicate but I was able to replicate it on my personal Cloud environment. It only takes 5 mins... 

Thanks!

Marc Vintró Alonso

unread,
Oct 20, 2021, 1:15:09 PMOct 20
to gce-discussion
Hello, Google Support will contact you through the case with updates.
They were able to reproduce the issue.

Christian Turri

unread,
Nov 17, 2021, 3:33:45 AM (12 days ago) Nov 17
to gce-discussion
Well I have tried to get this fixed and I am afraid I have failed miserably. After some back and forward this is what Google Support said about my support case:

After discussing the issue with the GCE team, they've confirmed that this is working as intended meaning that the logs are not duplicated at all. The only thing that has recently changed is the guest agent started printing to standard output, so it is now printing double. This was included in order to solve other logging issues customers had, so it is a known consequence of that.

It really is quite  silly to say that "that the logs are not duplicated at all" but rather "it is now printing double". Of course if print double you WILL duplicate the logs. You can call that "printing double" instead of duplicating the logs but it's the same thing. In any case the fact that "This was included in order to solve other logging issues customers had, so it is a known consequence of that" doesn't make this change right. It is still a horrible Kludge. I ask myself why can't whoever made this change not find the root cause of the missing logging events rather than duplicating all logs by printing double?

It is really disappointing that a company that claims to have the best software engineering in the world ends up with "solutions" like this one. At least if someone else does a Google search and finds this post they will not waste their time with Google Support and will understand why the GCE Console Serial Output is now a mess.

Theodore Ts'o

unread,
Nov 17, 2021, 12:59:59 PM (11 days ago) Nov 17
to Christian Turri, gce-discussion
On Tue, Nov 16, 2021 at 07:59:45AM -0800, 'Christian Turri' via gce-discussion wrote:
> It really is quite silly to say that "that the logs are not duplicated at
> all" but rather "it is now printing double". Of course if print double you
> WILL duplicate the logs. You can call that "printing double" instead of
> duplicating the logs but it's the same thing. In any case the fact that
> "This was included in order to solve other logging issues customers had, so
> it is a known consequence of that" doesn't make this change right. It is
> still a horrible Kludge <https://en.wikipedia.org/wiki/Kludge>. I ask
> myself why can't whoever made this change not find the root cause of the
> missing logging events rather than duplicating all logs by printing double?
>
> It is really disappointing that a company that claims to have the best
> software engineering in the world ends up with "solutions" like this one.
> At least if someone else does a Google search and finds this post they will
> not waste their time with Google Support and will understand why the GCE
> Console Serial Output is now a mess.

I can't speak to exactly why the Google engineers that are working on
agent made the design choices they did, but I can make a few
observations:

[ 14.685254] google_metadata_script_runner[2043]: No startup scripts to run.

This message indicates that it was routed through the kernel. There
are multiple ways this can be done, but one of the ways is to issue
the message through /dev/kmesg. For example:

# echo test message > /dev/kmsg
# dmesg | tail -1
[437049.229082] test message

The number is square brackets is the number of seconds since the
kernel was booted.

Oct 15 17:51:00 instance-1 google_metadata_script_runner[2043]: No startup scripts to run.

This message indicates that the message was routed through syslog, and
there is probably a /etc/syslogd.conf entry that is causing the
message to be written to the console. It could *also* mean that the
message was forwarded to the console via systemd-journal (see
/etc/systemd/journald.conf for either a ForwardToConsole or
ForwardToWall specification.) This depends on the Linux distribution,
and how the user may have customized their Linux distribution. (Many
distributions are now using systemd-journald, but many enterprise
customers are still used to using syslogd, and they may made changes
to how their system was configured away from the default.)

As far as why the Google Engineers might prefer log entries being
potentially printed to the console twice, this is just speculation,
but....

Some customers may be using syslogd; other customers might be using
systemd-journald. Some customers might be using both. Some customers
might have configured dmesg[1] to suppress some kernel messages from
being sent to the console. Some customers may have configured
systemd-journald or syslogd not to send messags to the console either.
Furthermore, different distribution may be using different defaults,
and so even if the customers weren't making changes to the logging
facilities, it's hard to know how the overall system logging has been
set up.

[1] See the man page for the dmesg command where it describes how
"dmesg --console-level <LEVEL>" works.

Furthermore, after the VM dies, usually the PD containing the root
file system gets automatically deleted, at which point all of the log
files in /var/log/* and stored by systemd-journald will be gone.
There are solutions such as Stackdriver, but they are not free, and so
many customers choose not to use them.

There is a better chance that the output of the serial console will be
saved, so if the customer is looking for help, it's more likely that
the customer can be helped if the log messages are sent to the serial
console at least once, regardless of how the distro has been
configured, either by default or by the customer after installing the
Linux distro and creating a custom image. After all, if the messages
are sent twice, it's possible to filter them out, but if the messages
aren't visible to the serial console at *all*, and the user asks for
help, even if it was the user's choices in configuration that caused
one or both of the messages to be suppressed, the user is not likely
to blame themselves for the fact that their problem couldn't be easily
root caused.

Cheers,

- Ted

Christian Turri

unread,
Nov 17, 2021, 3:05:52 PM (11 days ago) Nov 17
to gce-discussion
Thanks for the additional information Ted, it is useful. I would like to add some points:

1) The issue is happening in the newer GCP RHEL Linux images (ie  rhel-7-v20210916) so all the different variations in terms of how different Linux distributions handle logging are not relevant since we are using RHEL. I don't know RHEL enough to understand if the double printing to the serial console is really required to have at least one log event printed in all scenarios but the fact that I am seeing every message twice (duplicated) seems to indicate that it is not really required. Clearly there could be Linux distributions that behave differently but surely this change could be done selectively to those distribution images rather than to all distribution images. 

2) We have no control on what gets sent to serial console. One of our Unix SAs looked at this and he is sure the logging to the serial console is done by a binary in /usr/bin/google_metadata_script_runner which is embedded in the distribution image Google produces. So even if you were to make changes to the way the distribution logs to syslog or dmesg this double printing to the console will still remain. Again a more targeted approach would have been a better suited solution, perhaps by analysing the  syslog/dmesg configuration to determine which events might be missing in the serial console and only log them if required rather than blindly logging both syslog and dmesg. An even better approach will be to allow GCE users to specify exactly what the /usr/bin/google_metadata_script_runner logs to the console via the VM's custom metadata definition. That way Google could default to "double printing" but still allow customers that know exactly what they want in the console output to customise it if needed. 

3) The serial console buffer in Cloud Console is limited in size so when that limit is reached Console keeps the most recent serial console events, not all. By logging both syslog and dmesg in the RHEL image duplicating all messages Google is reducing the available serial console output in Console by half. If the objective as you say is to make sure the serial console is preserved then this change doesn't help much in the RHEL image. 

4) The serial console is only available while the VM is running if you try to see the serial console with the VM down you will get "Could not display serial port output. The instance is not running. Start the instance to see the serial port output." error. So your speculation that "there is a better chance that the output of the serial console will be
saved" is not applicable to VMs that are stopped or deleted.

5) You say that "it's possible to filter them out", actually we tried to do this. That was the first we tried before raising the support case. We were able to filter a lot of the added kernel dmesg messages by grepping for "google_metadata_script_runner: startup-script". When we do this we get mostly a clean log aside for few hundred lines. Have a look at these two lines:
[ 475.283532] google_metadata_script_runner[2130]: Oct 31 23:30:32 dev-dtaiku-du01 google_metadata_script_runner: startup-script: /usr/bin/install -c -m 644 ./Lib/multiprocessing/resource_sharer.py /usr/local/lib/python3.7/multiprocessing [ 475.622387] google_metadata_script_runnerOct 31 23:30:32 dev-dtaiku-du01 google_metadata_script_runner: startup-script: /usr/bin/install -c -m 644 ./Lib/unittest/util.py /usr/local/lib/python3.7/unittest As you can see we now seem to have both logging events from syslog format and dmesg format in the single line. This is making filtering pretty much impossible as we will need to start parsing these lines to separate the syslog event from the dmesg event. As there are no clear demarcations between the events this is not trivial.

6) While you said it's possible to use Cloud Logging to capture these events and review them later we prefer to use "gcloud compute instances tail-serial-port-output" and have a live "tail -f" of the console. We apply custom filtering so we can highlight specific events that look like errors to the person doing the build. I don't believe there is an interactive way of watching logs tail live using Cloud Logging.

7) These changes were not included in the image release notes, see https://groups.google.com/g/gce-image-notifications/c/tVBT_QPLDLg

Thanks,
Christian
Reply all
Reply to author
Forward
0 new messages