Apr 22 15:53:21 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:21 vm_runtime_init: start 'ah_start'.
Apr 22 15:53:21 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:21 vm_runtime_init: ah_start: INFO: app container running
Apr 22 15:53:21 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:21 vm_runtime_init: ah_start: app not healthy, won't send /_ah/start yet.
... repeated multiple times, with access.log showing 500 errors fetching /_ah/health
Apr 22 15:55:13 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:55:13 vm_runtime_init: ah_start: WARNING: never got healthy response from app, but sending /_ah/start query anyway.
Apr 22 15:55:15 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:55:15 vm_runtime_init: Done start 'ah_start'.
At this point it starts successfully, and nothing seems to try hitting /_ah/health again in the access.log.
"Stuck" running docker pull for 20 minutes
One instance appeared to fail to accept traffic (CPU utilization was 0%), so I SSHed to it and found it stuck running docker pull. ps axf showed the following process tree (truncated):
2771 ? S 0:00 | \_ /bin/bash /usr/share/google/run-scripts /var/run/google.startup.script startup
2772 ? S 0:00 | \_ /bin/bash /usr/share/google/run-scripts /var/run/google.startup.script startup
2907 ? S 0:00 | \_ /bin/sh /usr/local/bin/gcloud docker pull us.gcr.io/(TRUNCATED)
2915 ? S 0:00 | \_ python -S /usr/local/bin/../share/google/google-cloud-sdk/./lib/googlecloudsdk/gcloud/gcloud.py docker pull us.gcr.io/(TRUNCATED)
2923 ? Sl 0:00 | \_ docker pull us.gcr.io/(TRUNCATED)
syslog shows this process was stuck for 18 minutes:
Apr 22 15:36:33 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:36:33 vm_runtime_init: start 'pull_app'.
Apr 22 15:36:35 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:36:35 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf
Apr 22 15:45:33 gae-integration--track--gce-20160422t105255-pd6s vm_unlocker: Restarting OpenBSD Secure Shell server: sshd.
Apr 22 15:53:09 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:09 Done pulling app container
Apr 22 15:53:09 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:09 vm_runtime_init: Done start 'pull_app'.
API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().
I may have another clue: I found an instance stuck at 0% CPU consumption because it somehow got stuck while attempting to reboot. This means I could capture the entire serial console log. I have the entire file in case it is useful, but I *think* it tried to restart because of these API errors? It also entered single user mode for some reason instead of actually restarting, which seems like a separate bug:
===== Unexpected error during VM startup =====
=== Dump of VM runtime system logs follows ===
WARNING: HTTP 404 error while fetching metadata key gae_redirect_appengine_googleapis_com. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_api_proxy_image_name. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_app_container_memory_mb. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_cloud_sql_instances. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_cloud_sql_proxy_image_name. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_extra_nginx_confs. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_http_loadbalancer_enabled. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer_ip. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_memcache_proxy. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_monitoring_image_name. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_use_api_proxy. Will treat it as an empty string.
WARNING: HTTP 404 error while fetching metadata key gae_use_cloud_monitoring. Will treat it as an empty string.
vm_runtime_init: Apr 24 09:38:43 Invoking all VM runtime components. /dev/fd/63
vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'.
vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'.
vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'unlocker'.
vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'.
vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'.
vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: Done start 'fluentd_logger'.
vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: start 'pull_app'.
vm_runtime_init: Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf
vm_runtime_init: Apr 24 09:39:19 Done pulling app container
vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'.
vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'.
vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': skipping for empty patch.
vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'patch_dockerfile'.
vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'build_app'.
vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for gcloud based deployment.
vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'.
vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'.
vm_runtime_init: Apr 24 09:39:19 checking the status of api_verifier
vm_runtime_init: Apr 24 09:39:19 starting api_verifier
No handlers could be found for logger "google.appengine.ext.vmruntime.vmstub"
Apr 24 09:38:43 Invoking all VM runtime components. /dev/fd/63
Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'.
Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'.
Apr 24 09:38:43 vm_runtime_init: start 'unlocker'.
Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'.
Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'.
8a7e9bf2f84fdf6a9381d00dc6766e2cf8766dc9c927952223bc3547ddd9913a
Apr 24 09:38:44 vm_runtime_init: Done start 'fluentd_logger'.
Apr 24 09:38:44 vm_runtime_init: start 'pull_app'.
Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf
sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf: Pulling from triggeredmail/appengine/integration-track-gce.20160422t105255
... lots of docker checksum messages
Digest: sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf
Status: Downloaded newer image for us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf
Apr 24 09:39:19 Done pulling app container
Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'.
Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'.
Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': skipping for empty patch.
Apr 24 09:39:19 vm_runtime_init: Done start 'patch_dockerfile'.
Apr 24 09:39:19 vm_runtime_init: start 'build_app'.
Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for gcloud based deployment.
Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'.
Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'.
Apr 24 09:39:19 checking the status of api_verifier
Apr 24 09:39:19 starting api_verifier
API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().
... about 100 more lines of this
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d8962ad1c641 gcr.io/google_appengine/api-verifier "./api_verifier.py" 2 minutes ago Exited (1) 2 seconds ago sad_ptolemy
8a7e9bf2f84f gcr.io/google_appengine/fluentd-logger "/opt/google-fluentd/" 2 minutes ago Up 2 minutes desperate_bell
Container: d8962ad1c641
No handlers could be found for logger "google.appengine.ext.vmruntime.vmstub"
API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().
... about another 100 lines
API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().
Container: 8a7e9bf2f84f
========= rebooting. ========================
INIT:
INIT: Sending processes the TERM signal
INIT: Sending processes the KILL signal
Apr 24 09:41:27 gae-integration--track--gce-20160422t105255-o4qv init: Switching to runlevel: 1
gcm-StatusUpdate:TIME=1461490887000;STATUS=COMMAND_FAILED;INVOCATION_ID=0
[ [36minfo [39;49m] Using makefile-style concurrent boot in runlevel 1.
Stopping supervisor: supervisord.
udhcpd: Disabled. Edit /etc/default/udhcpd to enable it.
Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv rpc.statd[1783]: Caught signal 15, un-registering and exiting
[....] Stopping rpcbind daemon... [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.
[....] Unmounting iscsi-backed filesystems: Unmounting all devices marked _netdev [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.
Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv google: shutdown script found in metadata.
Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv shutdownscript: Running shutdown script /var/run/google.shutdown.script
[....] Stopping NFS common utilities: idmapd statd [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.
Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv iscsid: iscsid shutting down.
[....] Unmounting iscsi-backed filesystems: Unmounting all devices marked _netdev [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.
[....] Disconnecting iSCSI targets:iscsiadm: No matching sessions found
[?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.
[....] Stopping iSCSI initiator service: [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.
Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv shutdownscript: Finished running shutdown script /var/run/google.shutdown.script
[....] Stopping Docker: docker [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.
[....] Stopping The Kubernetes container manager: kubelet [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.
[....] Stopping enhanced syslogd: rsyslogd [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.
[ 187.624222] docker0: port 1(vethffde071) entered disabled state
[ 187.692308] docker0: port 1(vethffde071) entered disabled state
[ 187.699568] device vethffde071 left promiscuous mode
[ 187.705014] docker0: port 1(vethffde071) entered disabled state
gcm-Heartbeat:1461490903000
Terminating on signal number 15
[....] Asking all remaining processes to terminate...acpid: exiting
[?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0cdone.
[....] All processes ended within 1 seconds.... [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0cdone.
[ [36minfo [39;4
INIT
INIT: Sending processes the KILL signal
sulogin: root account is locked, starting shell
root@gae-integration--track--gce-20160422t105255-o4qv:~#