Flexible environment auto-scaling: Instances stopped after ~2 minutes, before app even runs


Evan Jones

Apr 20, 2016, 6:14:46 PM
to Google App Engine
TL;DR: I'm seeing the flexible environment autoscaler start a machine, then stop it 2 minutes later, before it has even finished starting the application. Is this a misconfiguration on my side or a bug? Could the issue be that our application loads too slowly? Can I reconfigure this somehow? We see lots of instances "fail to start" like this, and I can't figure out why exactly.


Longer description

We've been experimenting with moving parts of our application to the new "Flexible" environment. So far it works well! However, I've noticed that the autoscaler is extremely aggressive at stopping instances, which contradicts the documentation. I'm assuming the Flexible environment is using the GCE autoscaler. The documentation there states "the 10 minute delay prevents virtual machine instances from being terminated before the instance has been running for 10 minutes. Cool down periods for new instances are ignored when deciding whether to scale down a group":


However, I've been looking at when instances are created and stopped, and I see many that get stopped after ~2-3 minutes, before the machine has even finished booting! From the description above, it seems the autoscaler should let my machine run for at least 10 minutes before stopping it. Could these instances be getting killed by some sort of health checking? This module is using the defaults, which look like they should wait at least 5 minutes before declaring an instance unhealthy.
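For reference, here is roughly how I understand the health check knobs in app.yaml (I have not set any of these; the values shown are my reading of the defaults, so treat them as approximate):

health_check:
  enable_health_check: True
  check_interval_sec: 5      # how often /_ah/health is polled
  timeout_sec: 4             # per-check timeout
  unhealthy_threshold: 2     # consecutive failures before marking the instance unhealthy
  healthy_threshold: 2       # consecutive successes before marking it healthy again
  restart_threshold: 60      # failed checks before the VM is restarted (~5 minutes at the default interval)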

I've actually seen a number of other issues that seem to cause VMs to fail to start:

* Long sequence of "API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances()" errors, followed by shutdown after about 


Technical details

In the Compute Engine API logs, I see create/delete records like the following, showing an instance created and then deleted within ~2 minutes:


17:11:15.264 compute.instances.insert type:GCE_API_CALL resource id:"2166106899089084061"
17:11:51.535   type:GCE_OPERATION_DONE (matching the previous call)
17:13:40.427 compute.instances.delete type:GCE_API_CALL resource id:"2166106899089084061"
17:14:30.572   type:GCE_OPERATION_DONE


When I search for logs from this instance itself, the last messages across all logs are from syslog, showing it was still running the "startup" process provided by Google when the shutdown from the delete operation arrived. From this log it appears the instance never even got to our code; it was killed while still running Google's startup code. I see similar issues where instances run for ~5 minutes and show a sequence of errors from "api_verifier".

Any ideas what is causing this, and what we can do to make it happen less often?

Thank you!




Apr 20 21:11:50 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:11:50 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/TRUNCATED@sha256:f70b09a8217430bde2e45d6ccc868e3699b92f965eff12cc6ed19865ce131585
Apr 20 21:12:03 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   27.964060] docker0: port 1(veth75e5c89) entered forwarding state
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   72.892834] aufs au_opts_verify:1570:docker[2262]: dirperm1 breaks the protection by the permission bits on the lower branch
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: 'build_app': skipping for gcloud based deployment.
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: 'patch-dockerfile': skipping for empty patch.
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   73.034744] aufs au_opts_verify:1570:docker[2262]: dirperm1 breaks the protection by the permission bits on the lower branch
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: start 'patch_dockerfile'.
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   73.348051] docker0: port 2(veth0dcc920) entered forwarding state
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   73.076657] aufs au_opts_verify:1570:docker[2262]: dirperm1 breaks the protection by the permission bits on the lower branch
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: start 'api_verifier'.
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: Done start 'build_app'.
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   73.354433] docker0: port 2(veth0dcc920) entered forwarding state
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: start 'build_app'.
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   73.094180] device veth0dcc920 entered promiscuous mode
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: Done start 'pull_app'.
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 starting api_verifier
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 checking the status of api_verifier
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   73.106564] IPv6: ADDRCONF(NETDEV_UP): veth0dcc920: link is not ready
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   73.340512] IPv6: ADDRCONF(NETDEV_CHANGE): veth0dcc920: link becomes ready
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 Done pulling app container
Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: Done start 'patch_dockerfile'.
Apr 20 21:13:03 gae-integration--track--gce-20160415t135311-cyl8 kernel: [   88.380041] docker0: port 2(veth0dcc920) entered forwarding state
Apr 20 21:13:42 gae-integration--track--gce-20160415t135311-cyl8 init: Switching to runlevel: 0
Apr 20 21:13:42 gae-integration--track--gce-20160415t135311-cyl8 shutdown[3370]: shutting down for system halt
Apr 20 21:13:42 gae-integration--track--gce-20160415t135311-cyl8 shutdown[3352]: shutting down for system halt
Apr 20 21:13:43 gae-integration--track--gce-20160415t135311-cyl8 shutdownscript: Running shutdown script /var/run/google.shutdown.script
Apr 20 21:13:43 gae-integration--track--gce-20160415t135311-cyl8 iscsid: iscsid shutting down.
Apr 20 21:13:43 gae-integration--track--gce-20160415t135311-cyl8 shutdownscript: Finished running shutdown script /var/run/google.shutdown.script
Apr 20 21:13:43 gae-integration--track--gce-20160415t135311-cyl8 google: shutdown script found in metadata.
---------------------------------------------------------------------
--------------------------Tail of app logs---------------------------
"  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current"
---------------------------------------------------------------------
Current app health: 0
"---------------App was unhealthy, grabbing debug logs----------------"
App logs not found.
"                                 Dload  Upload   Total   Spent    Left  Speed"
Sending SIGUSR1 to fluentd to trigger a log flush.
App shutdown handlers done.
Container CID file not found.
---------------------------------------------------------------------
"
0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) couldn't connect to host"
47e7078ed32619023f23f15022d4965381f0abc634ea9e48d173e312beb43b51
Triggering app shutdown handlers.
-------------------------VM shutdown initiated------------------------





Nicholas (Google Cloud Support)

Apr 21, 2016, 3:19:37 PM
to Google App Engine
Thanks for posting your issue here and including the various logs.

Regarding your question about the autoscaler, I do not know if the flexible runtime infrastructure uses a standard GCE autoscaler for managing VM instances.

Given the errors you're getting, we'd need to investigate deeper to understand what has been happening:
  • Is this still happening?
  • Is this limited to a specific module or all your flexible runtime modules?
  • What health checking configuration do you have set up in your yaml file?
  • Can you provide a snippet of the code responding to health checks?
  • Are you using a standard runtime or a custom runtime? If custom, how is your Dockerfile set up?
Forgive me for asking so many general questions, but it's not entirely clear where this issue is arising. The information requested above should help narrow it down.

Evan Jones

Apr 21, 2016, 3:27:54 PM
to Google App Engine
Thanks for the response.

1. Yes, this is still happening. It happens roughly 3-30 times per hour.
2. Currently we only have one "flexible" module in production, so it currently only happens on that one.
3. I have no health checking configured; I'm relying on the default configuration.
4. We have no custom code responding to health checks (a rough sketch of what an explicit handler might look like follows the config below).
5. We are using the standard runtime:

runtime: python-compat
threadsafe: true
vm: true
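
(For question 4: to be explicit, we have nothing like the following today. If we did add a handler ourselves, I believe a minimal webapp2 sketch would look roughly like this; the names here are made up, and my understanding is that the python-compat runtime normally answers /_ah/health on its own.)

# Hypothetical sketch only -- we do not run anything like this today.
import webapp2

class HealthHandler(webapp2.RequestHandler):
    def get(self):
        # Anything other than a 200 here would count as an unhealthy response.
        self.response.set_status(200)
        self.response.write('ok')

health_app = webapp2.WSGIApplication([('/_ah/health', HealthHandler)])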


One other fact that might matter: our application container is pretty large. Since the flexible environment doesn't seem to obey skip_files, it's ~100 MB. Maybe it just takes a while to download.


I'm happy to file a support ticket with our Silver level enterprise support if that is a better forum than these groups. Thanks!

Evan

Nicholas (Google Cloud Support)

Apr 21, 2016, 4:03:39 PM
to Google App Engine
Thanks for the information.

Could you try setting up health checking? If not, try explicitly disabling it in app.yaml:
health_check:
  enable_health_check: False

Regarding your Docker image size, this can be reduced by using a .dockerignore file. Unfortunately, I can't find any official documentation stating explicitly that skip_files is ignored for such runtimes, but this is the case. Try populating a .dockerignore with the files and directories you wish to exclude and see how it affects your build; it should at least reduce build time. A rough example follows below.
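
As a rough sketch (these patterns are only examples; adjust them to whatever your source tree actually contains), a .dockerignore placed next to your app.yaml might look like:

# Example exclusions only -- anything the app does not need at runtime
.git
*.pyc
__pycache__/
tests/
docs/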

Hope this helps.



Evan Jones

Apr 22, 2016, 12:47:06 PM
to Google App Engine
I just redeployed. It's been running for the last few hours, and so far I haven't seen this happen again. I'll report back after it has run for a few days. I did notice some slightly unusual and possibly buggy behaviour, though:

1. During the "set up" process, the VM boot code still tries to health check. It eventually prints a message about giving up, then sends /_ah/start. It seems to me that if health checking is disabled, it shouldn't do this (logs below).

2. I observed one instance that didn't accept traffic for a long time. It was stuck pulling the Docker image (details below). It eventually did continue and start correctly, so this isn't a huge concern.


VM Boot Health Checking

Logs on all machines have the following:

Apr 22 15:53:21 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:21 vm_runtime_init: start 'ah_start'.

Apr 22 15:53:21 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:21 vm_runtime_init: ah_start: INFO: app container running

Apr 22 15:53:21 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:21 vm_runtime_init: ah_start: app not healthy, won't send /_ah/start yet.

   ... repeated multiple times, with access.log showing 500 errors fetching /_ah/health 

Apr 22 15:55:13 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:55:13 vm_runtime_init: ah_start: WARNING: never got healthy response  from app, but sending /_ah/start query anyway.

Apr 22 15:55:15 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:55:15 vm_runtime_init: Done start 'ah_start'.


At this point it starts successfully, and nothing seems to try hitting /_ah/health again in the access.log.



"Stuck" running docker pull for 20 minutes


One instance appeared to fail to accept traffic (CPU utilization was 0%), so I SSHed to it and found it stuck running docker pull. ps axf showed the following process tree (truncated):


 2771 ?        S      0:00  |   \_ /bin/bash /usr/share/google/run-scripts /var/run/google.startup.script startup

 2772 ?        S      0:00  |       \_ /bin/bash /usr/share/google/run-scripts /var/run/google.startup.script startup

 2907 ?        S      0:00  |           \_ /bin/sh /usr/local/bin/gcloud docker pull us.gcr.io/(TRUNCATED)

 2915 ?        S      0:00  |               \_ python -S /usr/local/bin/../share/google/google-cloud-sdk/./lib/googlecloudsdk/gcloud/gcloud.py docker pull us.gcr.io/(TRUNCATED)

 2923 ?        Sl     0:00  |                   \_ docker pull us.gcr.io/(TRUNCATED)



syslog shows this process was stuck for 18 minutes:



Apr 22 15:36:33 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:36:33 vm_runtime_init: start 'pull_app'.

Apr 22 15:36:35 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:36:35 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf

Apr 22 15:45:33 gae-integration--track--gce-20160422t105255-pd6s vm_unlocker: Restarting OpenBSD Secure Shell server: sshd.

Apr 22 15:53:09 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:09 Done pulling app container

Apr 22 15:53:09 gae-integration--track--gce-20160422t105255-pd6s vm_runtime_init: Apr 22 15:53:09 vm_runtime_init: Done start 'pull_app'.


Evan Jones

Apr 24, 2016, 12:44:40 PM
to Google App Engine
After a day, the problem is still happening. If there are any workarounds, I'd love to hear them, because I think this is costing us real money. Might I avoid this issue by using manual scaling and calling the modules.SetNumInstances() API myself?
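
To make that concrete, this is the sort of thing I have in mind (a sketch only: the module name and the scaling decision are placeholders, and it assumes the module is switched to manual_scaling in app.yaml):

# Sketch of driving instance counts ourselves with the Modules API.
# 'integration-track-gce' and target_instances are placeholders.
from google.appengine.api import modules

def resize_module(target_instances):
    current = modules.get_num_instances(module='integration-track-gce')
    if current != target_instances:
        modules.set_num_instances(target_instances, module='integration-track-gce')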

I can see the problem very clearly on the chart of CPU usage of these instances over time, as charted by Stackdriver. I'm attaching a graph that shows many instances getting started and then killed within a few minutes (see all the instances whose usage never goes over 50%; none of them ever started serving traffic). Unfortunately, the Stackdriver chart becomes completely illegible (just horizontal lines) if I set the time period to ~1 day, so I can't tell how long after the deploy this problem started occurring. If it is useful, I'm happy to pull the raw data and look at it.



I checked the logs, and the symptoms are the same: the instance is started and then deleted by the autoscaler very quickly. This does not seem to be how it should work.

For one example instance, the first syslog message from the kernel is timestamped "Apr 24 15:25:03". I can see that vm_runtime_init is making progress, and then the instance is killed about 3 minutes after it starts. (The create API was called at 11:24:37.572 and the delete completed at 11:30:50.494, so it ran for less than 7 minutes, meaning we got charged for 10 minutes of work that was totally useless.)


Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'build_app'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'patch_dockerfile'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'pull_app'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'api_verifier'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'build_app'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'patch_dockerfile'.
Apr 24 15:26:09 gae-integration--track--gce-20160422t105255-j9wu kernel: [   76.160053] docker0: port 2(veth9d7335e) entered forwarding state
Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [  183.088252] docker0: port 2(veth9d7335e) entered disabled state
Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [  183.184346] docker0: port 2(veth9d7335e) entered disabled state
Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [  183.192162] device veth9d7335e left promiscuous mode
Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [  183.197552] docker0: port 2(veth9d7335e) entered disabled state
Apr 24 15:28:00 gae-integration--track--gce-20160422t105255-j9wu init: Switching to runlevel: 1


The crash.log does have many messages like the following, so maybe there is a bug in the App Engine API proxy and the health checking is correctly killing "bad" instances?

API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().



I may have another clue: I found an instance stuck at 0% CPU consumption because it somehow got stuck while attempting to reboot, which means I could capture the entire serial console log. I have the entire file in case it is useful, but I *think* it tried to restart because of these API errors. It also entered single-user mode for some reason instead of actually restarting, which seems like a separate bug:


===== Unexpected error during VM startup =====

=== Dump of VM runtime system logs follows ===

WARNING: HTTP 404 error while fetching metadata key gae_redirect_appengine_googleapis_com. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_api_proxy_image_name. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_app_container_memory_mb. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_cloud_sql_instances. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_cloud_sql_proxy_image_name. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_extra_nginx_confs. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_http_loadbalancer_enabled. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer_ip. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_memcache_proxy. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_monitoring_image_name. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_use_api_proxy. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_use_cloud_monitoring. Will treat it as an empty string.

vm_runtime_init: Apr 24 09:38:43 Invoking all VM runtime components. /dev/fd/63

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'.

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'.

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'unlocker'.

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'.

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'.

vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: Done start 'fluentd_logger'.

vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: start 'pull_app'.

vm_runtime_init: Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf

vm_runtime_init: Apr 24 09:39:19 Done pulling app container

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': skipping for empty patch.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'patch_dockerfile'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'build_app'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for gcloud based deployment.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'.

vm_runtime_init: Apr 24 09:39:19 checking the status of api_verifier

vm_runtime_init: Apr 24 09:39:19 starting api_verifier

No handlers could be found for logger "google.appengine.ext.vmruntime.vmstub"

Apr 24 09:38:43 Invoking all VM runtime components. /dev/fd/63

Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'.

Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'.

Apr 24 09:38:43 vm_runtime_init: start 'unlocker'.

Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'.

Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'.

8a7e9bf2f84fdf6a9381d00dc6766e2cf8766dc9c927952223bc3547ddd9913a

Apr 24 09:38:44 vm_runtime_init: Done start 'fluentd_logger'.

Apr 24 09:38:44 vm_runtime_init: start 'pull_app'.

Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf

sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf: Pulling from triggeredmail/appengine/integration-track-gce.20160422t105255


  ... lots of docker checksum messages


Digest: sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf

Status: Downloaded newer image for us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf

Apr 24 09:39:19 Done pulling app container

Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'.

Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'.

Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': skipping for empty patch.

Apr 24 09:39:19 vm_runtime_init: Done start 'patch_dockerfile'.

Apr 24 09:39:19 vm_runtime_init: start 'build_app'.

Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for gcloud based deployment.

Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'.

Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'.

Apr 24 09:39:19 checking the status of api_verifier

Apr 24 09:39:19 starting api_verifier

API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().


... about 100 more lines of this


CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS                     PORTS               NAMES

d8962ad1c641        gcr.io/google_appengine/api-verifier     "./api_verifier.py"      2 minutes ago       Exited (1) 2 seconds ago                       sad_ptolemy

8a7e9bf2f84f        gcr.io/google_appengine/fluentd-logger   "/opt/google-fluentd/"   2 minutes ago       Up 2 minutes                                   desperate_bell

Container: d8962ad1c641

No handlers could be found for logger "google.appengine.ext.vmruntime.vmstub"

API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().


... about another 100 lines


API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().

Container: 8a7e9bf2f84f

========= rebooting. ========================


INIT: 

INIT: Sending processes the TERM signal



INIT: Sending processes the KILL signal


Apr 24 09:41:27 gae-integration--track--gce-20160422t105255-o4qv init: Switching to runlevel: 1

gcm-StatusUpdate:TIME=1461490887000;STATUS=COMMAND_FAILED;INVOCATION_ID=0

[ [36minfo [39;49m] Using makefile-style concurrent boot in runlevel 1.

Stopping supervisor: supervisord.

udhcpd: Disabled. Edit /etc/default/udhcpd to enable it.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv rpc.statd[1783]: Caught signal 15, un-registering and exiting

[....] Stopping rpcbind daemon... [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.

[....] Unmounting iscsi-backed filesystems: Unmounting all devices marked _netdev [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv google: shutdown script found in metadata.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv shutdownscript: Running shutdown script /var/run/google.shutdown.script

[....] Stopping NFS common utilities: idmapd statd [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv iscsid: iscsid shutting down.

[....] Unmounting iscsi-backed filesystems: Unmounting all devices marked _netdev [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.

[....] Disconnecting iSCSI targets:iscsiadm: No matching sessions found

[?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.

[....] Stopping iSCSI initiator service: [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv shutdownscript: Finished running shutdown script /var/run/google.shutdown.script

[....] Stopping Docker: docker [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.

[....] Stopping The Kubernetes container manager: kubelet [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.

[....] Stopping enhanced syslogd: rsyslogd [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c.

[  187.624222] docker0: port 1(vethffde071) entered disabled state

[  187.692308] docker0: port 1(vethffde071) entered disabled state

[  187.699568] device vethffde071 left promiscuous mode

[  187.705014] docker0: port 1(vethffde071) entered disabled state

gcm-Heartbeat:1461490903000

Terminating on signal number 15

[....] Asking all remaining processes to terminate...acpid: exiting


[?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0cdone.

[....] All processes ended within 1 seconds.... [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0cdone.

[ [36minfo [39;4

INIT

INIT: Sending processes the KILL signal


sulogin: root account is locked, starting shell

root@gae-integration--track--gce-20160422t105255-o4qv:~# 



Evan Jones

May 2, 2016, 11:35:54 AM
to Google App Engine
This is still happening, although we are re-deploying the module more frequently now, which mitigates the problem somewhat. I filed an issue on the issue tracker for this:


I'm hoping that maybe if we write our own autoscaler, we might avoid this bug.

Nicholas (Google Cloud Support)

May 13, 2016, 2:58:58 PM
to Google App Engine
Thank you for this thorough investigation and for posting this on our issue tracker. That is indeed the suitable forum for this type of issue, and the work to determine the root cause will continue there.