I am working on an asynchronous server-side integration of gRPC in C++. I have already fixed quite a few mistakes and misunderstandings, and overall it is very stable. Just one issue in the startup behavior is making my life difficult for the time being.
Introduction:
I wrote a test that starts one client and restarts the server side multiple times. By restarting I mean shutting down the completion queues and their attached threads, as well as the 'grpc::Server', and then re-creating them. The client is never restarted and just reconnects. This happens consistently, without any lockups or complaints from gRPC.
Server-side there are 2 CompletionQueues, handled in 2 separate threads:
1. Accepts requests from the client and responds using ServerAsyncResponseWriter.
2. Accepts streams from the client and sends updates from server to client using ServerAsyncWriter.
Client-side there is 1 CompletionQueue that handles ClientAsyncReader events in its own thread. Requests to the server are implemented synchronously.
The backoff algorithm is configured so that the client reconnects to the server within 1 s ± 0.2 s. The client monitors the channel state using (async) NotifyOnStateChange with a timeout of 2 seconds and sends the stream requests as soon as the channel is up.
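To make the timing concrete, here is a minimal sketch of the reconnect window I am assuming. It only models the resulting delay range (1 s ± 0.2 s) and is not gRPC's internal backoff code; `JitteredDelay` is a hypothetical helper and not part of the gRPC API.

```cpp
#include <cassert>
#include <chrono>
#include <random>

// Hypothetical helper (not a gRPC function): draws a reconnect delay
// uniformly from [base - jitter, base + jitter].
std::chrono::milliseconds JitteredDelay(std::chrono::milliseconds base,
                                        std::chrono::milliseconds jitter,
                                        std::mt19937& rng) {
  // The jitter must be non-negative and must not exceed the base delay.
  assert(jitter.count() >= 0 && jitter <= base);
  std::uniform_int_distribution<long long> dist(-jitter.count(),
                                                jitter.count());
  return base + std::chrono::milliseconds(dist(rng));
}
```

With a base of 1000 ms and a jitter of 200 ms, every delay falls between 800 ms and 1200 ms, so the 2 second NotifyOnStateChange timeout is longer than the worst-case reconnect delay.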
I've separated the client and server implementation into 2 separate processes to ensure there is no interference whatsoever.
The issue:
Sometimes, the server seems to block all events in the 'stream CompletionQueue' (thread 2) while the request thread (thread 1) is blocked. More specifically: thread 2 is blocked until ::grpc::CompletionQueue::Next is called in thread 1. To confirm this, I deliberately added a long sleep just before calling cq1->Next in thread 1, and the issue still reproduces. The printf in thread 2 just after cq2->Next is not triggered until the sleep finishes.
While thread 1 is sleeping, multiple stream connection attempts arrive from the client (presumably on the second CompletionQueue). I verified this by capturing the TCP stream. These events arrive directly after the request message. As soon as 'Next' is called in thread 1, the connection attempts in thread 2 are handled immediately.
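For comparison, this is the behavior I expected, written as a standard-library-only model rather than real gRPC code. `BlockingQueue` and `MeasureStreamLatencyMs` are hypothetical names, and the queue is only a stand-in: it says nothing about how grpc::CompletionQueue is implemented. The model stalls the consumer of queue 1 before its Next call and measures how long the consumer of queue 2 waits for its own event.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Stand-in for a completion queue (illustration only).
class BlockingQueue {
 public:
  void Push(int tag) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      items_.push_back(tag);
    }
    cv_.notify_one();
  }

  // Blocks until an event is available, then returns its tag.
  int Next() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !items_.empty(); });
    int tag = items_.front();
    items_.pop_front();
    return tag;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<int> items_;
};

// Returns how many milliseconds thread 2 waited for its event while
// thread 1 slept for `stall` before calling Next on its own queue.
long long MeasureStreamLatencyMs(std::chrono::milliseconds stall) {
  using Clock = std::chrono::steady_clock;
  BlockingQueue cq1, cq2;
  long long latency_ms = -1;
  const auto start = Clock::now();

  std::thread t1([&] {
    std::this_thread::sleep_for(stall);  // the deliberate sleep before Next
    cq1.Next();
  });
  std::thread t2([&] {
    cq2.Next();
    latency_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                     Clock::now() - start)
                     .count();
  });

  cq1.Push(1);  // the "request" event
  cq2.Push(2);  // the "stream" event that arrives right after it
  t1.join();
  t2.join();
  assert(latency_ms >= 0);  // thread 2 must have recorded a value
  return latency_ms;
}
```

In this model the two queues share no state, so thread 2 receives its event almost immediately even when thread 1 sleeps for several hundred milliseconds. What I observe with gRPC is the opposite: thread 2 only wakes up once thread 1 calls Next.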
For me it reproduces about once every 5-10 cycles. Is there a proper way to debug this behavior in gRPC? Which verbosity flags should I enable? Am I making any incorrect assumptions about multiple CompletionQueues?