c++: gRPC Shutdown() hangs forever


Prasanta Behera

Jan 20, 2022, 7:02:43 AM
to grpc.io
gRPC version : 1.30.2 (c++)
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

We use gRPC in synchronous mode and stop our gRPC services with this code:

m_server->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(1));
m_server->Wait();

On rare occasions (about once in 1000 tries) the Shutdown() call does NOT return and the process just hangs.

Is there anything I am missing to get a reliable shutdown of the server?

Anh Tran

Feb 1, 2022, 12:45:04 AM
to grpc.io
Unfortunately I came across the same problem.
When I try to isolate the problem with a simple hello-world server and client, it seems to go away, i.e. the server can be shut down.

But in my own project it still happens, and despite my efforts to track down the problem, I can't spot the cause.
This is weird, since this overload of Shutdown should force a shutdown after the deadline, as described in the docs.

Prasanta, did you happen to find a solution to your problem? Maybe it will give a hint for mine.

Anh Tran

Feb 1, 2022, 3:12:43 AM
to grpc.io
I should also add that this behavior only happens when multiple threads are involved.

Prasanta Behera

Feb 1, 2022, 3:15:54 AM
to grpc.io
No, I've not found any solution yet! I was trying to build a hello-world kind of example to reproduce it so I could open a ticket, but as you said, with such simple examples it never reproduces...

Easwar Swaminathan

Feb 2, 2022, 1:51:16 PM
to grpc.io
Have any of y'all seen this in a more recent version of gRPC? `1.30.2` is really old.

Anh Tran

Feb 3, 2022, 1:24:14 AM
to grpc.io
The last commit on my local copy of gRPC is 06 Jan 2022, so not that old.

veb...@google.com

Feb 9, 2022, 3:54:53 PM
to grpc.io
Would you capture the call stack when the Shutdown function appears to hang?

Ctmahapa95

Feb 9, 2022, 6:08:39 PM
to veb...@google.com, grpc.io
I have also experienced the same issue in my work environment. Our workaround was to wait with an N-second timeout, check a shutdown flag, and then call Shutdown().



To view this discussion on the web visit https://groups.google.com/d/msgid/grpc-io/96350454-3e52-4068-b3e1-1fea25537098n%40googlegroups.com.

Anh Tran

Feb 10, 2022, 4:10:12 AM
to grpc.io

Thanks for getting back to me on this. Please see my stack trace from Valgrind:

==287419== Memcheck, a memory error detector
==287419== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==287419== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==287419== Command: /home/atran/codes/workflow_client/dist/workflow/job_client_cli --job-type slam
==287419== Parent PID: 287418
==287419==
==287419== Conditional jump or move depends on uninitialised value(s)
==287419==    at 0x91743A: void** NextStackFrame<true, false>(void**, void const*, unsigned long, unsigned long) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x916F98: int UnwindImpl<false, false>(void**, int*, int, int, void const*, int*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x916D8D: absl::lts_20211102::GetStackTrace(void**, int, int) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x90F27C: absl::lts_20211102::(anonymous namespace)::GetStack(void**, int) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x924569: absl::lts_20211102::synchronization_internal::GraphCycles::UpdateStackTrace(absl::lts_20211102::synchronization_internal::GraphId, int, int (*)(void**, int)) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x90F342: absl::lts_20211102::DeadlockCheck(absl::lts_20211102::Mutex*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x90F84C: absl::lts_20211102::DebugOnlyDeadlockCheck(absl::lts_20211102::Mutex*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x90FB6E: absl::lts_20211102::Mutex::Lock() (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x8A3AD0: gpr_mu_lock (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x7DFA25: begin_worker(grpc_pollset*, grpc_pollset_worker*, grpc_pollset_worker**, long) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x7DFFDA: pollset_work(grpc_pollset*, grpc_pollset_worker**, long) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x7716F4: pollset_work(grpc_pollset*, grpc_pollset_worker**, long) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==  Uninitialised value was created by a stack allocation
==287419==    at 0x1C3FD0: Emesent::GRPCQueue::Stub::Register(grpc::ClientContext*, Emesent::Empty const&, Emesent::RegistrationResponse*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==
==287419== Thread 7 grpcpp_sync_ser:
==287419== Conditional jump or move depends on uninitialised value(s)
==287419==    at 0x916F7F: int UnwindImpl<false, false>(void**, int*, int, int, void const*, int*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x916D8D: absl::lts_20211102::GetStackTrace(void**, int, int) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x90F27C: absl::lts_20211102::(anonymous namespace)::GetStack(void**, int) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x924569: absl::lts_20211102::synchronization_internal::GraphCycles::UpdateStackTrace(absl::lts_20211102::synchronization_internal::GraphId, int, int (*)(void**, int)) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x90F342: absl::lts_20211102::DeadlockCheck(absl::lts_20211102::Mutex*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x90F84C: absl::lts_20211102::DebugOnlyDeadlockCheck(absl::lts_20211102::Mutex*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x90FB6E: absl::lts_20211102::Mutex::Lock() (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x8A3AD0: gpr_mu_lock (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x5121A4: grpc_cq_begin_op(grpc_completion_queue*, void*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x5199C4: grpc_core::Server::ShutdownAndNotify(grpc_completion_queue*, void*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x51CD63: grpc_server_shutdown_and_notify (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==    by 0x231AA2: grpc::Server::ShutdownInternal(gpr_timespec) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==  Uninitialised value was created by a stack allocation
==287419==    at 0x1D37E0: std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), workflow_client::GRPCIcomm::start_and_wait()::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#1}>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
==287419==

In case this is interesting, this is how I start my server and wait until shutdown:

std::string server_addr_and_port = base_address_ + ":" + id_;
grpc::ServerBuilder server_builder;
server_builder.AddListeningPort(server_addr_and_port, grpc::InsecureServerCredentials());

GRPCQueueImpl grpc_queue_service;
server_builder.RegisterService(&grpc_queue_service);
server_ = server_builder.BuildAndStart();
server_->Wait();

veb...@google.com

Feb 11, 2022, 6:00:29 PM
to grpc.io
Hi Anh,

Thanks for the stack trace, but I'm a bit confused. Is this what you got while your program was stuck waiting forever? It looks a bit off, because Valgrind complains about "Conditional jump or move depends on uninitialised value" and the call stack appears to show Abseil's deadlock checking inside.

Anh Tran

Feb 13, 2022, 10:37:12 PM
to grpc.io
Hi team,

Yes, that is the case. I am also confused...

Anh Tran

Mar 4, 2022, 8:18:19 AM
to grpc.io
I had some time to revisit this problem today. Running with GRPC_VERBOSITY set to DEBUG gives me this (anything starting with [2022-03-04T23 is my own logging):

[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] Logger level is set at info.
[2022-03-04T23:15:41+10:00] [workflow_client] [warning] [process 501457, thread 501457] Persistent not specified and will default to false
D0304 23:15:41.640009936  501457 ev_posix.cc:172]            Using polling engine: epollex
D0304 23:15:41.640079032  501457 lb_policy_registry.cc:42]   registering LB policy factory for "grpclb"
D0304 23:15:41.640106084  501457 lb_policy_registry.cc:42]   registering LB policy factory for "priority_experimental"
D0304 23:15:41.640124027  501457 lb_policy_registry.cc:42]   registering LB policy factory for "weighted_target_experimental"
D0304 23:15:41.640127329  501457 lb_policy_registry.cc:42]   registering LB policy factory for "pick_first"
D0304 23:15:41.640150128  501457 lb_policy_registry.cc:42]   registering LB policy factory for "round_robin"
D0304 23:15:41.640153574  501457 lb_policy_registry.cc:42]   registering LB policy factory for "ring_hash_experimental"
D0304 23:15:41.640162650  501457 dns_resolver_ares.cc:624]   Using ares dns resolver
D0304 23:15:41.640230087  501457 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
D0304 23:15:41.640236782  501457 lb_policy_registry.cc:42]   registering LB policy factory for "cds_experimental"
D0304 23:15:41.640241375  501457 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_impl_experimental"
D0304 23:15:41.640258747  501457 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_resolver_experimental"
D0304 23:15:41.640262652  501457 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_manager_experimental"
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] gRPC client at localhost:4500 is registering with the server...
I0304 23:15:41.640909979  501458 subchannel.cc:945]          subchannel 0x55e6713dcea0 {address=ipv6:[::1]:4500, args=grpc.client_channel_factory=0x55e6713d6a20, grpc.default_authority=localhost:4500, grpc.internal.subchannel_pool=0x55e6713bb810, grpc.primary_user_agent=grpc-c++/1.44.0-dev, grpc.resource_quota=0x55e6713d7b40, grpc.server_uri=dns:///localhost:4500}: connect failed: {"created":"@1646399741.640833261","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/home/hoover/workspace/DPAV/Linux/OpenSource/gRPC-focal/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv6:[::1]:4500"}
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] ... gRPC client at localhost:4500 successfully registered with server
I0304 23:15:41.642209150  501462 server_builder.cc:327]      Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
I0304 23:15:41.642437546  501462 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] JobClient client constructed.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] HelloWorld client constructed.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] New id assigned: localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 is trying to register client for job type: : hello_world
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501463] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 registered with Job Server 4501
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 received ACK for RegisterNewClient.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 identified JSON message as issueNewJob.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 accepted job.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 acknowledged the job.
[2022-03-04T23:15:42+10:00] [workflow_client] [info] [process 501457, thread 501485] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 identified JSON message as jobProgressUpdateAck.
[2022-03-04T23:15:42+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 received ACK for final JobProgressUpdate.
[2022-03-04T23:15:42+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 finished processing a job and now shutting down.
D0304 23:15:42.482888232  501484 chttp2_transport.cc:1715]   ipv4:127.0.0.1:56916: Sending goaway err={"created":"@1646399742.482860742","description":"Server shutdown","file":"/home/hoover/workspace/DPAV/Linux/OpenSource/gRPC-focal/src/core/lib/surface/server.cc","file_line":481,"grpc_status":0}

Plin Chen

Mar 6, 2022, 10:06:32 PM
to grpc.io
Our situation is that when we try to close the app without shutting the gRPC client down, the app hangs there.
Of course, if we shut down the gRPC client manually several seconds before quitting the app, things go as normal.
----
We are using the latest gRPC-CSharp version. I notice that the shutdown method is an async method.
When the "OnApplicationQuit" event is caught and we try to invoke shutdown (gRPC) before the app really quits,
the gRPC ThreadPool seems to block the quitting process.
----
We've tried 2 workarounds, and both work:
1. Catch OnApplicationQuit, invoke gRPC's "shutdown", let the application wait for 2 seconds, then close the app.
2. Every time the client finishes calling a remote procedure on the gRPC server, shut down the gRPC client at once.

Anh Tran

Mar 7, 2022, 6:56:17 PM
to grpc.io
It seems like there's some deadlock bug or undocumented expected behaviour in the gRPC source code.

However for the life of me, I can't make a minimal example to reproduce the problem.

Anh Tran

Mar 9, 2022, 8:10:04 PM
to grpc.io
That problem seems to go away when I upgrade to protobuf 3.20 for some reason.

Prasanta Behera

Jun 7, 2022, 11:07:10 AM
to grpc.io

We are using gRPC with a purpose built GCC v7.5-2019.12 cross-compiler targeting ARM Cortex-A15 or Cortex-A9 based upon this software release from Linaro http://releases.linaro.org/components/toolchain/gcc-linaro/7.5-2019.12 

The hang can easily be seen by running our gTests on these devices for just a few iterations. Whenever it happens, the following message is printed to the console in a never-ending loop, once every few seconds:

E0607 14:59:50.307433749    6785 ev_epollex_linux.cc:519]    Error shutting down fd 29. errno: 9
E0607 15:00:17.878449028    6775 ev_epollex_linux.cc:519]    Error shutting down fd 31. errno: 9
E0607 15:01:02.998995428    6775 ev_epollex_linux.cc:519]    Error shutting down fd 31. errno: 9
E0607 15:01:18.042279014    6775 ev_epollex_linux.cc:519]    Error shutting down fd 31. errno: 9

When I searched for this message on the internet, I found this open gRPC bug: https://github.com/grpc/grpc/issues/25289
