gRPC connection not working after server restarted


Shawn Cao

Jul 25, 2023, 2:31:47 PM
to grpc.io
Hi,

I'm facing a problem that I think is mostly related to gRPC, but I'm not 100% sure. I need some help finding the right direction.

Here is my setup (in GKE):

service-A runs in one k8s pod (Node.js) and uses a gRPC client to connect to service-B (C++, gRPC server), using a service name for the connection. Both services are defined with the NodePort type in the k8s spec.

[Problem]

Everything works fine most of the time, but sometimes when service-B restarts (crashes), the connection stops working until I restart service-A manually. Note that this only happens sometimes; other times the connection keeps working after service-B restarts.

Obviously, this issue won't let me sleep, because we can't guarantee that service-B never restarts/crashes.

I have a few thoughts, but can't tell which is more likely the issue.

  1. Is it because the gRPC client doesn't re-resolve the DNS name (service name)?
  2. Is it because the connection to the old pod is still not closed?
  3. Did I miss a gRPC client option that handles this type of case?
  4. Did I misuse the service type - should I use a headless service instead of NodePort?

Since there are many pieces, I still don't have a good picture of how to debug this issue properly; your input would be very much appreciated.

Reference:

  1. gRPC client (this client is reused; makeClient is called when an RPC call has an error)
const serviceAddr = 'nebula:9190';
const cred = nebula.grpc.credentials.createInsecure();
const options = {
  // MESSAGE_LENGTH and the backoff constants are defined elsewhere.
  'grpc.max_receive_message_length': MESSAGE_LENGTH,
  'grpc.max_send_message_length': MESSAGE_LENGTH,
  'grpc.min_reconnect_backoff_ms': MIN_RECONNECT_BACKOFF_MS,
  'grpc.max_reconnect_backoff_ms': MAX_RECONNECT_BACKOFF_MS
};
const makeClient = () => new nebula.V1Client(serviceAddr, cred, options);

  2. K8s spec
apiVersion: v1
kind: Service
metadata:
  name: nebula
spec:
  type: NodePort
  selector:
    app: nebula-server
  ports:
    - port: 9190
      name: server
      targetPort: 9190

Any suggestion is appreciated. (I also posted this question on StackOverflow.)

Michael Lumish

Jul 26, 2023, 4:28:13 PM
to Shawn Cao, grpc.io
The diagnosis here depends partly on what you mean by "the connection doesn't work anymore". What symptoms are you observing on service A that are leading you to that conclusion?

You said "this client is reused, makeClient is called when RPC call has error". If you are observing request errors on service A, then that should be triggering those calls to makeClient. Creating a new client forces DNS re-resolution, so if you are seeing those errors on multiple successive requests that would have used the newly created client, the problem is definitely not about whether the client is re-resolving DNS names. However, the gRPC library reuses connections to the same IP address when possible, so if the connection is broken and DNS resolution is returning the same addresses, it will continue to use the same broken connection.

Since a broken connection is a likely cause of this sort of problem, I would recommend using gRPC's keepalives feature to try to detect and resolve that. That will have the client periodically ping the server while a connection is in use, and discard the connection if the server does not respond. First, update your dependencies to get the latest version of @grpc/grpc-js, because the recent patches have some fixes to the keepalive behavior. Then you should add to your options object an entry with the key "grpc.keepalive_time_ms". This controls how frequently the client sends the pings. You should coordinate with the service owner to determine what the right number is, but 60,000 (1 per minute) is probably reasonable. You can also set the option "grpc.keepalive_timeout_ms" to control how long to wait for a response after sending a ping before the connection is considered broken, but the default value of 20000 (20 seconds) is probably reasonable.
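For example, a minimal sketch of what the options object might look like with both keepalive options set (the message-length and backoff entries are carried over from your snippet; the keepalive values are just the suggestions above, not requirements):

const options = {
  'grpc.max_receive_message_length': MESSAGE_LENGTH,
  'grpc.max_send_message_length': MESSAGE_LENGTH,
  'grpc.min_reconnect_backoff_ms': MIN_RECONNECT_BACKOFF_MS,
  'grpc.max_reconnect_backoff_ms': MAX_RECONNECT_BACKOFF_MS,
  // Ping the server every 60 seconds while a connection is in use.
  'grpc.keepalive_time_ms': 60000,
  // Consider the connection broken if a ping goes unanswered for 20
  // seconds (this is the default, shown here for clarity).
  'grpc.keepalive_timeout_ms': 20000
};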

I also want to circle back to "this client is reused, makeClient is called when RPC call has error". In general, we recommend using a single gRPC client object to communicate with a single service for the whole lifetime of a process. It generally does the work of connection management and handling connection drops, and you rarely gain much from recreating it.
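Concretely, that pattern would be a sketch like the following, reusing the names from your snippet: create the client once and reuse it, instead of calling makeClient on errors.

// Created once at process startup and reused for every RPC; the
// library handles reconnecting when a connection drops.
const client = new nebula.V1Client(serviceAddr, cred, options);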


Shawn Cao

Jul 26, 2023, 7:08:54 PM
to Michael Lumish, grpc.io
Hi Michael,

This is an awesome answer, I appreciate it very much.

First, the symptoms are that all RPC calls stop reaching service B. I was trying to figure out whether those calls ended up with errors in the reply or just got stuck, so I instrumented logs in the RPC call callbacks to check the error code when service B restarts. It looks like the callbacks were not invoked when this happened.

FYI, code snippet attached:
[Two screenshot attachments of the code snippet.]


Currently, I'm still using the old grpc package (https://www.npmjs.com/package/grpc) at version 1.24.4. I'll definitely follow your suggestion and upgrade to the latest version of @grpc/grpc-js.

Regarding the DNS resolution part, your explanation sounds reasonable to me. Sometimes the connection still works when service B restarts and other times it doesn't, so I guess the broken connection stays in use when service B's IP address doesn't change across the restart. Am I understanding this correctly?

However, regarding the keepalive timeout: whether I set it or not, it sounds like there is always a default value (the 20s). Yet the connection never comes back when that break happens until I manually restart service A, probably because the package I use doesn't have the keepalive feature enabled, or the feature isn't working well?

Thanks for pointing out the client object creation part too; if that's the case, I can see that the makeClient method doesn't add much value.




Michael Lumish

Jul 26, 2023, 7:27:14 PM
to Shawn Cao, grpc.io
I cannot give any more specific information about the behavior of the old grpc library. It was deprecated years ago and I have not touched it in a long time.

I think you misunderstood my suggestion about keepalives. There are two relevant options for that feature: "grpc.keepalive_time_ms" and "grpc.keepalive_timeout_ms". The first of those needs to be set to enable the feature. The second one has a default value of 20 seconds, but can be modified to a value that makes more sense for your usage.

Shawn Cao

Jul 26, 2023, 7:54:28 PM
to Michael Lumish, grpc.io
Ah, I see, thanks a lot for the clarification, Michael. I thought it was enabled by default.
Let me upgrade to the latest version of @grpc/grpc-js and try those options. I appreciate your help very much!



Shawn Cao

Jul 27, 2023, 1:46:38 PM
to grpc.io
Hi Michael,

I have finished migrating from the legacy 'grpc' package to the latest '@grpc/grpc-js' and have set the keepalive options.
Our service A (the API) is now happily working with service B even though service B has restarted many times (which is a separate issue).

Just sending this update to let you know that you helped fix this issue for us. Thank you very much!