Hi,
I'm running into an error that I don't understand and can't fully debug, so I would appreciate any helpful hints.
Here's the setup. I have deployed a gRPC server written in Java to a Kubernetes cluster on Google Kubernetes Engine (GKE). The deployment consists of one Docker container for my gRPC server and another one for the Extensible Service Proxy (ESP) v1, which runs as a sidecar to the gRPC server. A service of type NodePort forwards HTTP/2 traffic to the ESP, which in turn forwards it to my gRPC server. Finally, I am using an ingress with a Google-managed certificate to make my gRPC service available via TLS on the Internet (which, among other things, creates and configures an external HTTPS load balancer).
The gRPC server uses bidirectional streaming. Simplifying matters a little: after an initial request from the gRPC client (written in C# and targeting the .NET Framework, running on my local machine), the gRPC server downloads files from another web service and reports each file it downloaded in a gRPC response. This works fine as long as the number of reported files stays below some threshold (which is not fixed, though). However, once the server has reported around 76-82 files in the corresponding number of responses, my client throws an exception because it received an RST_STREAM frame from the server.
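To illustrate the pattern, the server-side handler looks roughly like the sketch below. The names SignatureServiceImpl, SignatureServiceGrpc, CreateSignatureRequest, CreateSignatureResponse and downloadFiles are only illustrative placeholders for my actual generated classes and implementation:

import io.grpc.stub.StreamObserver;
import java.util.List;

// Simplified sketch, not my actual implementation.
public class SignatureServiceImpl extends SignatureServiceGrpc.SignatureServiceImplBase {

    @Override
    public StreamObserver<CreateSignatureRequest> createSignature(
            StreamObserver<CreateSignatureResponse> responseObserver) {

        return new StreamObserver<CreateSignatureRequest>() {
            @Override
            public void onNext(CreateSignatureRequest request) {
                // Download the files referenced by the initial request from the
                // other web service and report each downloaded file in its own
                // streamed response.
                for (String fileName : downloadFiles(request)) {
                    responseObserver.onNext(CreateSignatureResponse.newBuilder()
                            .setFileName(fileName)
                            .build());
                }
            }

            @Override
            public void onError(Throwable t) {
                // Not invoked, even though the client has already received the RST_STREAM.
            }

            @Override
            public void onCompleted() {
                responseObserver.onCompleted();
            }
        };
    }

    // Placeholder for the calls to the external web service.
    private List<String> downloadFiles(CreateSignatureRequest request) {
        return List.of();
    }
}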
Having enabled logging by setting GRPC_VERBOSITY to DEBUG and GRPC_TRACE to all, the only line item I could find in my gRPC client's log that says something about this error has the following information:
{
  "created":"@1590858674.654000000",
  "description":"Error received from peer ipv4:[load-balancer's-public-ip-address]:443",
  "file":"T:\src\github\grpc\workspace_csharp_ext_windows_x64\src\core\lib\surface\call.cc",
  "file_line":1056,
  "grpc_message":"Received RST_STREAM with error code 2",
  "grpc_status":13
}
My gRPC server does not report any error. In fact, it happily continues streaming further responses after my client has received the RST_STREAM; the cancellation of the request is only reported once the server waits for the next client request.
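In case it is relevant: the service implementation does not install an explicit cancellation handler. As far as I understand the grpc-java API, adding something like the following at the top of the createSignature method from the sketch above should make the server notice the cancellation as soon as it arrives (I have not tried this yet):

import io.grpc.stub.ServerCallStreamObserver;

// Register a cancellation callback before returning the request observer.
// grpc-java hands server-side code a ServerCallStreamObserver, so the cast is safe.
ServerCallStreamObserver<CreateSignatureResponse> serverObserver =
        (ServerCallStreamObserver<CreateSignatureResponse>) responseObserver;
serverObserver.setOnCancelHandler(() ->
        System.err.println("CreateSignature stream was cancelled by the client or a proxy"));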
Having enabled logging for ESP, the error log shows the following line items, among others:
2020/05/30 11:45:31 [debug] 9#9: worker cycle
2020/05/30 11:45:31 [debug] 9#9: epoll timer: 498
2020/05/30 11:45:31 [debug] 9#9: epoll: fd:21 ev:0001 d:00007F63709A54E8
2020/05/30 11:45:31 [debug] 9#9: *51 http2 read handler
2020/05/30 11:45:31 [debug] 9#9: *51 SSL_read: 13
2020/05/30 11:45:31 [debug] 9#9: *51 SSL_read: -1
2020/05/30 11:45:31 [debug] 9#9: *51 SSL_get_error: 2
2020/05/30 11:45:31 [debug] 9#9: *51 http2 frame type:3 f:0 l:4 sid:1
2020/05/30 11:45:31 [debug] 9#9: *51 http2 RST_STREAM frame, sid:1 status:8
2020/05/30 11:45:31 [info] 9#9: *51 client canceled stream 1, client: [my-client's-public-ip-address], server: , request: "POST /dokumate.dss.v1.SignatureService/CreateSignature HTTP/2.0", host: "signature-service.dss.dokumate.com"
The last line item says that the client canceled the stream, but my code never cancels the stream, and I could not find any indication of a cancellation in my client's log either.
I tried to find something in the ingress or load balancer logs, but those are silent about any errors. Thus, I am at a loss as to what is happening here.
Everything works perfectly fine when I (1) run the client and the server on my local machine, or (2) run the server on GKE and expose the pod and my server's port through a service of type LoadBalancer. In the latter case, however, there is no TLS, which is a no-go for production use.
Regards, Thomas