Endpoints-runtime-serverless container exited


David Haupt

Jun 8, 2021, 7:47:14 AM
to Google Cloud Endpoints
My team works on a platform that's gradually migrating away from App Engine on Python 2.7 to a combination of a gRPC server container managed by Cloud Run, a gateway service composed of an ESPv2 container managed by Cloud Run, and Cloud Functions.

As you've likely guessed, our App Engine server uses gRPC clients to make calls to our gRPC server, authenticated by and proxied through the ESPv2 container.

On 7 June at around 16:40 GMT, our ESPv2 container appears to have stopped handling requests entirely. The effect on our platform was immediate, as a good chunk of critical functionality is handled by our gRPC server. I haven't been able to find much information; however, there are some interesting log messages from a few minutes before this began:

[Attachment: Screen Shot 2021-06-07 at 7.28.42 PM.png]

This is really all I've been able to discover. A few minutes after these "Container called exit(1)" messages, the service stopped handling requests altogether, and Cloud Run also stopped reporting any idle or active containers (despite a minimum of 1 being specified in the service config).

The only way we were able to mitigate this issue was by deploying a new revision of the service, which started new containers.
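
For reference, the redeploy was essentially just pushing the same image again to force new containers, roughly like the following (the service name, image tag, and region here are placeholders, not our real values):

    gcloud run deploy esp-gateway \
        --image gcr.io/PROJECT_ID/esp-gateway:latest \
        --region us-central1 \
        --min-instances 1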

I'm pursuing the Cloud Run side of this as well, but I'd also like to know more about how our endpoints-runtime-serverless container apparently exited badly.

If there's any other information I can provide, I'd be happy to. This container was built on 2021-05-26 using a version of gcloud_build_image.sh that defaults to using gcr.io/endpoints-release/endpoints-runtime-serverless:2, which I believe means it would have used the 2.26 release.
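
For completeness, the build was a standard invocation of that script, roughly like this (the Cloud Run hostname, config ID, and project are placeholders):

    ./gcloud_build_image.sh -s CLOUD_RUN_HOSTNAME -c CONFIG_ID -p ESP_PROJECT_ID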

David

David Haupt

Jun 8, 2021, 8:07:47 AM
to Google Cloud Endpoints
Apologies, I thought the image would be expandable. I believe this link should work.

Teju Nareddy

Jun 8, 2021, 10:55:11 AM
to David Haupt, Google Cloud Endpoints
This is indeed odd. Are there no other logs from the ESPv2 Cloud Run service? What were the last log lines before the "Container called exit(1)" messages?

I suggest opening a bug with Cloud Support for Cloud Run. It's possible this is a Cloud Run issue, as I've never seen "Container called exit(1)" messages before.

--
Teju Nareddy
nare...@google.com
Software Engineer

David Haupt

Jun 8, 2021, 11:42:58 AM
to Google Cloud Endpoints
There are some more logs; my mistake, I didn't realize that some relevant Envoy logging wasn't at the expected level. I'm still relatively new to Envoy, having only used it on a personal project and then with my (still relatively new to me) team for the last six or so months, so I've parsed out what I think are the correct messages and included some below. Right before this started, we saw a large spike in usage, so quite a few (~30) containers started at roughly the same time and logged a ton of information.

I've attached an export of what seem to be the Envoy-related logs from this timeframe; some of the interesting lines are below. This is the output from a single instance:

2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.matching.http.input: request-headers, request-trailers, response-headers, response-trailers
2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.filters.network: envoy.filters.network.http_connection_manager, envoy.http_connection_manager
2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.tls.cert_validator: envoy.tls.cert_validator.default
2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.stats_sinks: envoy.metrics_service, envoy.stat_sinks.metrics_service, envoy.stat_sinks.statsd, envoy.statsd
2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.matching.action: skip
2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.tracers: envoy.tracers.opencensus
2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.resolvers: envoy.ip
2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.access_loggers: envoy.access_loggers.file, envoy.access_loggers.http_grpc, envoy.file_access_log, envoy.http_grpc_access_log
2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.bootstrap: envoy.extensions.network.socket_interface.default_socket_interface
2021-06-07 16:41:37.231 GMT I0607 16:41:36.811 15 envoy] [15][main] envoy.upstream_options: envoy.extensions.upstreams.http.v3.HttpProtocolOptions, envoy.upstreams.http.http_protocol_options
2021-06-07 16:41:37.231 GMT I0607 16:41:37.005 15 envoy] [15][main]HTTP header map info:
2021-06-07 16:41:37.231 GMT I0607 16:41:37.206 15 envoy] [15][main] request header map: 536 bytes: :authority,:method,:path,:protocol,:scheme,accept,access-control-request-method,authorization,connection,content-length,content-type,expect,grpc-accept-encoding,grpc-timeout,keep-alive,origin,proxy-connection,referer,te,transfer-encoding,upgrade,user-agent,via,x-client-trace-id,x-envoy-attempt-count,x-envoy-decorator-operation,x-envoy-downstream-service-cluster,x-envoy-downstream-service-node,x-envoy-expected-rq-timeout-ms,x-envoy-external-address,x-envoy-force-trace,x-envoy-hedge-on-per-try-timeout,x-envoy-internal,x-envoy-ip-tags,x-envoy-max-retries,x-envoy-original-path,x-envoy-original-url,x-envoy-retriable-header-names,x-envoy-retriable-status-codes,x-envoy-retry-grpc-on,x-envoy-retry-on,x-envoy-upstream-alt-stat-name,x-envoy-upstream-rq-per-try-timeout-ms,x-envoy-upstream-rq-timeout-alt-response,x-envoy-upstream-rq-timeout-ms,x-forwarded-client-cert,x-forwarded-for,x-forwarded-proto,x-request-id
2021-06-07 16:42:30.460 GMT CRITICAL: Config Manager is down, killing all processes.
2021-06-07 16:42:30.641 GMT I0607 16:41:37.206 15 envoy] [15][main] request trailer map: 144 bytes:
2021-06-07 16:42:30.641 GMT I0607 16:41:37.206 15 envoy] [15][main] response header map: 384 bytes: :status,access-control-allow-credentials,access-control-allow-headers,access-control-allow-methods,access-control-allow-origin,access-control-expose-headers,access-control-max-age,connection,content-length,content-type,date,grpc-message,grpc-status,keep-alive,location,proxy-connection,server,transfer-encoding,upgrade,via,x-envoy-attempt-count,x-envoy-decorator-operation,x-envoy-degraded,x-envoy-immediate-health-check-fail,x-envoy-ratelimited,x-envoy-upstream-canary,x-envoy-upstream-healthchecked-cluster,x-envoy-upstream-service-time,x-request-id
2021-06-07 16:42:30.641 GMT I0607 16:41:37.206 15 envoy] [15][main] response trailer map: 168 bytes: grpc-message,grpc-status
2021-06-07 16:42:30.641 GMT W0607 16:41:37.306 15 envoy] [15][main]No admin address given, so no admin HTTP server started.
2021-06-07 16:42:30.641 GMT I0607 16:41:37.306 15 envoy] [15][main]runtime: layers:\n - name: deprecation\n static_layer:\n re2.max_program_size.error_level: 1000
2021-06-07 16:42:30.641 GMT I0607 16:41:37.338 15 envoy] [15][config]loading tracing configuration
2021-06-07 16:42:30.641 GMT I0607 16:41:37.338 15 envoy] [15][config]loading 0 static secret(s) 
2021-06-07 16:42:30.641 GMT I0607 16:41:37.338 15 envoy] [15][config]loading 1 cluster(s)
2021-06-07 16:42:30.641 GMT I0607 16:41:37.609 15 envoy] [15][config]loading 0 listener(s)
2021-06-07 16:42:30.641 GMT I0607 16:41:37.609 15 envoy] [15][config]loading stats configuration
2021-06-07 16:42:30.641 GMT I0607 16:41:37.609 15 envoy] [15][runtime]RTDS has finished initialization
2021-06-07 16:42:30.641 GMT I0607 16:41:37.609 15 envoy] [15][upstream]cm init: initializing cds
2021-06-07 16:42:30.641 GMT W0607 16:41:37.630 15 envoy] [15][main]there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
2021-06-07 16:42:30.641 GMT I0607 16:41:37.652 15 envoy] [15][main]starting main dispatch loop
2021-06-07 16:42:30.641 GMT W0607 16:41:37.669 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:30.641 GMT W0607 16:41:37.781 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:30.641 GMT W0607 16:41:38.287 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:30.641 GMT W0607 16:41:38.793 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:30.641 GMT W0607 16:41:39.506 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:30.641 GMT F0607 16:41:41.150784 2 server.go:54] fail to initialize config manager: fail to initialize ServiceInfo, error processing authentication provider (google_id_token): failed OpenID Connect Discovery protocol: Failed to fetch jwks_uri from https://accounts.google.com/.well-known/openid-configuration: Get "https://accounts.google.com/.well-known/openid-configuration": net/http: TLS handshake timeout
2021-06-07 16:42:30.641 GMT W0607 16:41:44.790 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:30.641 GMT W0607 16:41:52.611 15 envoy] [15][config]gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.cluster.v3.Cluster
2021-06-07 16:42:30.641 GMT I0607 16:41:52.617 15 envoy] [15][upstream]cm init: all clusters initialized
2021-06-07 16:42:30.641 GMT I0607 16:41:52.618 15 envoy] [15][main]all clusters initialized. initializing init manager
2021-06-07 16:42:30.641 GMT W0607 16:41:57.845 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:30.641 GMT W0607 16:42:07.613 15 envoy] [15][config]gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener
2021-06-07 16:42:30.641 GMT I0607 16:42:07.613 15 envoy] [15][config]all dependencies initialized. starting workers
2021-06-07 16:42:30.641 GMT W0607 16:42:08.809 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:30.641 GMT W0607 16:42:09.372 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:30.641 GMT W0607 16:42:11.250 15 envoy] [15][config]StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-06-07 16:42:31.097 GMT Container called exit(1).
[Attachment: downloaded-logs-20210608-113600.csv]

Teju Nareddy

Jun 8, 2021, 12:04:17 PM
to David Haupt, Google Cloud Endpoints
Thank you, this is helpful. It does seem like an issue with ESPv2, caused by the spike in container startups.

This log line indicates why the container crashed:
2021-06-07 16:42:30.641 GMT F0607 16:41:41.150784 2 server.go:54] fail to initialize config manager: fail to initialize ServiceInfo, error processing authentication provider (google_id_token): failed OpenID Connect Discovery protocol: Failed to fetch jwks_uri from https://accounts.google.com/.well-known/openid-configuration: Get "https://accounts.google.com/.well-known/openid-configuration": net/http: TLS handshake timeout

It seems like you configured authentication in ESPv2, but in a way that requires ESPv2 to use OpenID Connect Discovery, which means making remote calls during startup. This results in higher cold start times when new containers are created. In this case, it seems like https://accounts.google.com/.well-known/openid-configuration denied requests from ESPv2 (maybe rate limiting?), which resulted in ESPv2 failing to start up.
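
You can see the remote call ESPv2 makes at startup by fetching the discovery document yourself (response trimmed to the relevant fields):

    curl -s https://accounts.google.com/.well-known/openid-configuration
    # {
    #   "issuer": "https://accounts.google.com",
    #   "jwks_uri": "https://www.googleapis.com/oauth2/v3/certs",
    #   ...
    # }

ESPv2 reads jwks_uri out of that response; if that fetch fails or times out (as in the TLS handshake timeout in your log), the config manager cannot initialize and the container exits.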

The last few paragraphs in this section of the OpenAPI documentation describe this cold start behavior:
Note that omitting x-google-jwks_uri will result in higher cold start times, as ESP has to make an extra remote call on startup. Therefore, it is only recommended to omit this field if the JWKS URI changes often. Most certified OpenID providers (such as Google, Auth0, and Okta) have stable JWKS URIs.

I recommend modifying your gRPC service configuration to explicitly include the JWKS URI; I think for your service it should be https://www.googleapis.com/oauth2/v3/certs. You would configure this in AuthProvider.jwks_uri in the gRPC service configuration, roughly as sketched below.
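
As a rough sketch (the provider id google_id_token comes from your error message; the issuer shown is the usual one for Google ID tokens, so double-check it against your actual config), the authentication section of the service config YAML would look something like:

    authentication:
      providers:
      - id: google_id_token
        issuer: https://accounts.google.com
        jwks_uri: https://www.googleapis.com/oauth2/v3/certs
      rules:
      - selector: "*"
        requirements:
        - provider_id: google_id_token

With jwks_uri set explicitly, ESPv2 skips OpenID Connect Discovery at startup, so cold starts no longer depend on that remote call.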

It seems like our documentation is lacking on this cold start behavior for gRPC. We can improve it.

David Haupt

Jun 8, 2021, 12:33:47 PM
to Google Cloud Endpoints
This is incredibly helpful - thank you very much! I'll take your suggestions and make sure we implement them to hopefully avoid this in the future.