I am seeing a new issue with Java Managed VMs: the additional VM instances started by the auto-scaler never make it past the api_verifier stage during startup.
From the VM console output it's clear that api_verifier never finishes and that vm_runtime_init never moves on to the next service (cloud_sql_proxy, memcache_proxy, etc.).
By attaching strace to the running "python api_verifier.py" process, I could see that the verifier is stuck in an endless loop: it connects to 169.254.169.253 on port 10001, sends a POST request to /rpc_http, and gets back a 500 response every time:
[pid 2858] connect(3, {sa_family=AF_INET, sin_port=htons(10001), sin_addr=inet_addr("169.254.169.253")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 2858] poll([{fd=3, events=POLLOUT}], 1, 61000) = 1 ([{fd=3, revents=POLLOUT}])
[pid 2858] getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 2858] poll([{fd=3, events=POLLOUT}], 1, 61000) = 1 ([{fd=3, revents=POLLOUT}])
[pid 2858] sendto(3, "POST /rpc_http HTTP/1.1\r\nHost: a"..., 440, 0, NULL, 0) = 440
[pid 2858] fcntl(3, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 2858] fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 2858] poll([{fd=3, events=POLLIN}], 1, 61000 <unfinished ...>
[pid 2916] <... select resumed> ) = 0 (Timeout)
[pid 2916] select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
.....
[pid 2858] recvfrom(3, "HTTP/1.1 500 Internal Server Err"..., 8192, 0, NULL, NULL) = 289
[pid 2858] gettimeofday({1449086107, 553383}, NULL) = 0
[pid 2858] gettimeofday({1449086107, 553490}, NULL) = 0
[pid 2858] gettimeofday({1449086107, 553578}, NULL) = 0
[pid 2858] close(3) = 0
This happens with every single VM instance started by the auto-scaler. Yet the initial instance, the one created when I deploy the app, starts and works fine.
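To check the endpoint by hand from an affected VM, something like the rough Python 3 sketch below might help. The address, port, path and 61-second timeout are taken from the strace output above; everything else is an assumption. In particular, the empty request body is only a placeholder (the real 440-byte request is truncated in the trace), and the names post_once and probe are made up for illustration, so the response may differ from what api_verifier actually gets. Unlike the verifier, it gives up after a fixed number of attempts.

```python
import http.client
import time

API_HOST = "169.254.169.253"  # taken from the strace output
API_PORT = 10001              # taken from the strace output


def post_once(host=API_HOST, port=API_PORT, body=b"", timeout=61):
    """Send one POST to /rpc_http and return the HTTP status code.

    The empty body is a placeholder: the real payload is not visible
    in the truncated trace.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("POST", "/rpc_http", body=body)
        return conn.getresponse().status
    finally:
        conn.close()


def probe(send=post_once, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call send() up to max_attempts times, stopping at the first non-5xx status.

    Returns the list of statuses seen, so a run of 500s is easy to spot.
    """
    statuses = []
    for attempt in range(max_attempts):
        status = send()
        statuses.append(status)
        if status < 500:
            break
        # Pause between attempts, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(delay)
    return statuses


if __name__ == "__main__":
    print(probe())
```

Running it on both a working instance and a stuck one should show whether the 500 is specific to the auto-scaled VMs or also appears on the initial instance.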
Is anyone else seeing this?
-Drago