gRPC frozen state

ry...@quizlet.com

Sep 19, 2017, 11:41:27 AM
to grpc.io

Hello all -

I've been developing a gRPC-HHVM extension. While stress testing this extension, we've consistently seen a condition where gRPC gets into a state in which any request that is made gets stuck indefinitely on the epoll_wait syscall (or times out, if a timeout is set on grpc_completion_queue_next). Everything works fine and chugs along, and then this problem happens and all of the requests get stuck or time out. This is what it looks like in gdb:

Thread 21 (Thread 0x7f54f6bff700 (LWP 51033)):

#0  0x00007f554ad90d43 in epoll_wait () from /lib64/libc.so.6 <-- indefinitely stuck or times out if one is set

#1  0x00007f55416969e5 in add_poll_object (exec_ctx=0x2482ad88, bag=0x188924, bag_type=(POLL_OBJ_POLLSET | unknown: 32596), item=0x0, item_type=(POLL_OBJ_POLLSET | unknown: 1097427428)) at src/core/lib/iomgr/ev_epollsig_linux.c:1484

#2  0x00007f5541697972 in fd_ref (fd=0x4, reason=0x33b8b87c0 <Address 0x33b8b87c0 out of bounds>, file=0x0, line=0) at src/core/lib/iomgr/ev_poll_posix.c:447

#3  0x00007f55416a4691 in tcp_drop_uncovered_then_handle_write (exec_ctx=0x7f54f6bf8680, arg=0x7f4c4e2c9010, error=0x0) at src/core/lib/iomgr/tcp_posix.c:239

#4  0x00007f55416c71a8 in grpc_server_request_registered_call (server=0x0, rmp=0x0, call=0x2482ad88, deadline=0x188924, initial_metadata=0x7f4c4e2c8f00, optional_payload=0x0, cq_bound_to_call=0x4, cq_for_notification=0x33b8b87c0, 

    tag=0x0) at src/core/lib/surface/server.c:1486


#5  0x00007f55416c75d2 in conforms_to (slice=..., legal_bits=0x7f4c22ce2e50 "\270\064\245\214L\177", err_desc=0x7f54f6bf8701 "\211\030") at src/core/lib/surface/validate_metadata.c:40

#6  0x00007f5541b97105 in HPHP::c_Call_ni_startBatch (this_=0x7f4c22ce2e60, actions=...) at /home/ryangordon/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:529


The stack trace doesn't always look exactly like that, but it is always stuck on epoll_wait/epoll_pwait, often reached via gpr_cv_wait.
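
For reference, here is roughly the pattern we use to wait on the completion queue (a minimal sketch with illustrative names like wait_for_batch and timeout_secs; it is not the extension's actual code):

#include <grpc/grpc.h>
#include <grpc/support/time.h>

/* Wait for a single batch tag with a finite deadline. In the frozen state,
   grpc_completion_queue_next only ever returns GRPC_QUEUE_TIMEOUT. */
static int wait_for_batch(grpc_completion_queue *cq, void *tag, int timeout_secs) {
  gpr_timespec deadline = gpr_time_add(
      gpr_now(GPR_CLOCK_REALTIME),
      gpr_time_from_seconds(timeout_secs, GPR_TIMESPAN));
  for (;;) {
    grpc_event ev = grpc_completion_queue_next(cq, deadline, NULL);
    if (ev.type == GRPC_QUEUE_TIMEOUT) return 0;  /* the "frozen" case */
    if (ev.type == GRPC_QUEUE_SHUTDOWN) return 0;
    if (ev.type == GRPC_OP_COMPLETE && ev.tag == tag) return ev.success;
    /* otherwise: a completion for another tag; keep draining until the deadline */
  }
}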


I've managed to capture GRPC_TRACE=all GRPC_VERBOSITY=DEBUG logs of a good request and of a request in the bad state, to compare side by side.


Good Request: https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Not-Frozen%2520Log%2520Snippet

Bad Request: https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Frozen%2520log%2520snippet


The bad request looks like this at the point where its execution deviates:


D0919 14:54:36.312349629   51030 combiner.c:163]             C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995
D0919 14:54:36.312353011   51030 exec_ctx.c:73]              closure 0x7f4c22edbec0 finished
I0919 14:54:36.312358626   51030 completion_queue.c:834]     grpc_completion_queue_next(cq=0x7f4c22e9e200, deadline=gpr_timespec { tv_sec: 4, tv_nsec: 999000000, clock_type: 3 }, reserved=(nil))
...

Whereas, the good request does this:
D0915 15:19:50.178995777   13689 combiner.c:163]             C:0x7f932172f700 grpc_combiner_execute c=0x7f9313d67fb0 last=1
D0915 15:19:50.179005880   13689 exec_ctx.c:73]              closure 0x7f9313d67ec0 finished
D0915 15:19:50.179074826   13689 combiner.c:225]             C:0x7f932172f700 grpc_combiner_continue_exec_ctx contended=0 exec_ctx_ready_to_finish=1 time_to_execute_final_list=0
D0915 15:19:50.179077765   13689 combiner.c:243]             C:0x7f932172f700 maybe_finish_one n=0x7f9313d67fb0
... (and a whole lot more)



Effectively it seems like something is not getting scheduled when it should be (grpc_combiner_continue_exec_ctx never gets run). I've been working on this for a few weeks now and haven't been able to track down the cause or a fix. I have the full logs if they would be helpful. I've been testing on my branch, which is kept pretty close to grpc/grpc master.

Does anyone have any idea what is going on? I have the bad process open on a test server in gdb and have timeouts implemented so I can run repeated requests at it to continue debugging if necessary.

Thanks,
Ryan

Michael Lumish

Sep 19, 2017, 12:19:11 PM
to ry...@quizlet.com, grpc.io
It looks like your stack trace includes a call to grpc_server_request_registered_call with a NULL server. That is suspicious.

ry...@quizlet.com

Sep 19, 2017, 12:55:10 PM
to grpc.io
Err... actually, I just realized that my source files had changed, so that stack isn't entirely accurate. Here is a proper stack trace:

#0  0x00007f6ddb4a1a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6dcbf4471c in gpr_cv_wait (cv=0x7f6db13f8a90, mu=0x7f6a65b9cd90, abs_deadline=...) at src/core/lib/support/sync_posix.c:80
#2  0x00007f6dcbe9bea8 in begin_worker (pollset=0x7f6a65b9cd90, worker=0x7f6db13f8a70, worker_hdl=0x0, now=0x7f6db13f8a30, deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:747
#3  0x00007f6dcbe9c7c9 in pollset_work (exec_ctx=0x7f6db13f8b80, ps=0x7f6a65b9cd90, worker_hdl=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:941
#4  0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6db13f8b80, pollset=0x7f6a65b9cd90, worker=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_posix.c:210
#5  0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65b9cc80, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:914
#6  0x00007f6dcbecc5d2 in grpc_completion_queue_next (cq=0x7f6a65b9cc80, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:992
#7  0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch (this_=0x7f6a9288a670, actions=...) at /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583

and

#0  0x00007f6dd56b6d43 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f6dcbe9b9e5 in do_epoll_wait (exec_ctx=0x7f6d99ff8b80, ps=0x7f6a65867710, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:640
#2  0x00007f6dcbe9c972 in pollset_work (exec_ctx=0x7f6d99ff8b80, ps=0x7f6a65867710, worker_hdl=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:964
#3  0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6d99ff8b80, pollset=0x7f6a65867710, worker=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_posix.c:210
#4  0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65867600, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:914
#5  0x00007f6dcbecc5d2 in grpc_completion_queue_next (cq=0x7f6a65867600, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:992
#6  0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch (this_=0x7f6a624e0670, actions=...) at /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583 

These are both proper examples of it just timing out on epoll_wait/pthread_cond_timedwait.

The other thing I've noticed is on the bad request:

D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995

last is 61995, which seems to indicate the number of scheduled items in the queue. In a normal request this number stays low.
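
If I'm reading combiner.c right (this is just my rough reading of its state word, so treat it as an assumption), the low bit of last is an "unorphaned" flag and the remaining bits count the closures already queued, which would put this combiner at roughly 31k queued closures:

#include <stdio.h>

int main(void) {
  long last = 61995;           /* value from the frozen-request log */
  long unorphaned = last & 1;  /* low bit: combiner not yet orphaned */
  long queued = last >> 1;     /* remaining bits: closures queued before this one */
  printf("unorphaned=%ld queued=%ld\n", unorphaned, queued);  /* prints 1 and 30997 */
  return 0;
}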

ry...@quizlet.com

Sep 19, 2017, 7:58:13 PM
to grpc.io

Hmm, so it looks like the reason grpc_combiner_continue_exec_ctx doesn't get called is that exec_ctx->active_combiner is actually NULL, so the function just returns false. Not yet sure why or how that became NULL.

Sree Kuchibhotla

Sep 20, 2017, 12:32:54 PM
to ry...@quizlet.com, grpc.io
The line does look a bit suspicious to me (it means the combiner has 30997 queued closures, which is a lot; the least significant bit in "last" is ignored and the actual count is stored in the remaining bits):
D0919 14:54:36.312349629   51030 combiner.c:163]             C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995
Not sure why we ran into this situation. Putting that aside for now: based on the logs in the failed case, the threads are waiting for the batch to complete.

I0919 14:54:36.312170818   51030 call.c:2069]                grpc_call_start_batch(call=0x7f4c22edb020, ops=0x7f550e39fde0, nops=4, tag=0x7f4c22edb020, reserved=(nil))
I0919 14:54:36.312199708   51030 call.c:1677]                ops[0]: SEND_INITIAL_METADATA
..
I0919 14:54:36.312210605   51030 call.c:1677]                ops[1]: RECV_INITIAL_METADATA ptr=0x7f4c22e9e510
I0919 14:54:36.312214004   51030 call.c:1677]                ops[2]: SEND_MESSAGE ptr=0x7f4c262df200
I0919 14:54:36.312216804   51030 call.c:1677]                ops[3]: SEND_CLOSE_FROM_CLIENT
..
D0919 14:54:36.312362846   51030 completion_queue.c:812]     PENDING TAGS: 0x7f4c22edb020

The real question here is why the call isn't making any progress (I am assuming the sends are successful but the receives aren't coming in). In other words, I think the threads are waiting in epoll_wait and cv_wait because nothing is arriving on the fds.

1) Could you turn on the "tcp" traces as well? i.e. export GRPC_TRACE="tcp,api,call_combiner,op_failure,client_channel" (a sketch of one way to set these follows below)
2) Could you sync to the latest HEAD? (We fixed a few bugs in the timers and the polling engine very recently, and I want to make sure you are on the latest HEAD. Based on the line numbers in the call stacks, it looks like your branch is a few days behind.)
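
In case exporting those variables in the HHVM server's environment is awkward, one option (just a sketch on my part, and it assumes the extension calls grpc_init() itself during module init) is to set them right before grpc_init(), since the tracer flags and log verbosity are read from the environment during init:

#include <stdlib.h>
#include <grpc/grpc.h>

static void enable_grpc_tracing(void) {
  setenv("GRPC_TRACE", "tcp,api,call_combiner,op_failure,client_channel", 1 /* overwrite */);
  setenv("GRPC_VERBOSITY", "DEBUG", 1);
  grpc_init();  /* tracers and verbosity are picked up from the env here */
}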


thanks,
Sree

PS: 

>> Hmm, so it looks like the reason grpc_combiner_continue_exec_ctx doesn't get called is that exec_ctx->active_combiner is actually NULL, so the function just returns false. Not yet sure why or how that became NULL.
This happens when we have executed all the closures on the combiner, so this may be a red herring (maybe not, but it would help to look at the tcp traces first).




ry...@quizlet.com

Sep 20, 2017, 3:10:32 PM
to grpc.io
1) Yep, will do - I should continue to run with GRPC_VERBOSITY=DEBUG right?
2) Just merged the latest master in

Sree Kuchibhotla

Sep 20, 2017, 3:57:35 PM
to ry...@quizlet.com, grpc.io
Yes, GRPC_VERBOSITY=DEBUG.


ry...@quizlet.com

Sep 21, 2017, 2:53:17 AM
to grpc.io
I just got it to reproduce. Here are the last ~3,000 lines of the debug output: https://gist.github.com/RyanGordon/e3d211a6430daac560296a4b301f121b

At the end, it seemed like all of the requests timed out, so we call `grpc_call_cancel_with_status` to clean the request up. After that we call `grpc_completion_queue_pluck` (justification) with an infinite timeout, and it gets stuck there forever. After about 4 minutes of nothing happening, gRPC reports a "Socket closed" error, but nothing after that.

The one line that sticks out to me is L2677 from the gist. Typically `grpc_call_combiner_stop()` gets called right after that line, but not in this case, and the behavior shown in the logs looks atypical after that.

Michael Lumish

Sep 21, 2017, 12:13:57 PM
to ry...@quizlet.com, grpc.io

The primary cause of your problem is that you're calling completion_queue_pluck after calling cancel_with_status. Assuming you're following the model of the current PHP wrapper and calling pluck with an infinite timeout, you should only do so exactly once for each operation that creates a tag on the queue. The function start_batch creates a tag; cancel_with_status does not. So blocking forever is the expected behavior in that situation.
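
To spell that tag discipline out in code (a minimal sketch, not the wrapper's actual implementation; the helper name run_batch_and_wait is made up here):

#include <grpc/grpc.h>
#include <grpc/support/time.h>

/* One pluck per start_batch tag: `tag` is the pointer passed to
   grpc_call_start_batch for this batch, and nothing else ever plucks it. */
static int run_batch_and_wait(grpc_call *call, grpc_completion_queue *cq,
                              const grpc_op *ops, size_t nops, void *tag) {
  if (grpc_call_start_batch(call, ops, nops, tag, NULL) != GRPC_CALL_OK) {
    return 0;  /* the batch was never enqueued, so do NOT pluck for it */
  }
  grpc_event ev = grpc_completion_queue_pluck(
      cq, tag, gpr_inf_future(GPR_CLOCK_REALTIME), NULL);
  return ev.type == GRPC_OP_COMPLETE && ev.success;
}

/* By contrast, grpc_call_cancel_with_status(call, GRPC_STATUS_CANCELLED,
   "timed out", NULL) puts nothing on the queue, so following it with another
   pluck for the same tag waits for an event that will never arrive. */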

Also, you should not need to call cancel_with_status at all. If the call has timed out, that should mean that the call has already finished and that there is already a status.


ry...@quizlet.com

Sep 21, 2017, 1:20:55 PM
to grpc.io
Hm, it was recommended to me here https://github.com/grpc/grpc/pull/12306#issuecomment-330337644 that I call pluck after cancel in order to avoid a memory-corrupting race condition (which I can observe in Valgrind without the subsequent pluck after cancel). Am I misunderstanding something?

This problem is also reproducible if I remove the call cancel and subsequent pluck and switch back to an infinite timeout on the initial pluck. The only difference is that the freeze then happens on the initial pluck instead.

If I keep the timeout on the initial pluck, then instead of freezing, once the problem starts, every call after that times out.

I'll remove the cancel and subsequent pluck and get new logs again, but I've seen the freeze happen many times before we added the cancel and the pluck timeout. The only reason we even added a pluck timeout was that we were hitting this freeze problem.

You also mentioned that we were using the PHP way of doing this, and yes, that is correct; though I am curious what the other architectures for doing this are. I saw that the other extensions use a different approach, but I haven't quite understood it at a high level.

Michael Lumish

Sep 21, 2017, 5:52:06 PM
to ry...@quizlet.com, grpc.io
As far as I understand, your original problem was that completion_queue_pluck was occasionally blocking indefinitely.

The use-after-free issue that you reported is caused by freeing memory blocks that were passed to the gRPC core before the core was done using them. The gRPC C core requires that memory references that are passed to it are valid until the corresponding batch is complete and the completion has been received via completion_queue_next or completion_queue_pluck. When you added a finite timeout to completion_queue_pluck, it meant that you would sometimes end the function and delete the associated objects while the completion was still pending. This basically transformed your original problem from "completion_queue_pluck blocks indefinitely" to "we left a pending batch in the completion queue because we stopped waiting for it forever".
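
As a concrete illustration of that lifetime rule (again a hedged sketch rather than your extension's code, and the exact grpc_op field names can vary slightly between core revisions):

#include <string.h>
#include <grpc/grpc.h>
#include <grpc/slice.h>
#include <grpc/support/time.h>

/* The out-params wired into the batch below may only be released once the
   batch's tag has actually been received from the completion queue. */
static void recv_status_example(grpc_call *call, grpc_completion_queue *cq, void *tag) {
  grpc_metadata_array trailing_md;
  grpc_metadata_array_init(&trailing_md);
  grpc_status_code status;
  grpc_slice status_details;

  grpc_op op;
  memset(&op, 0, sizeof(op));
  op.op = GRPC_OP_RECV_STATUS_ON_CLIENT;
  op.data.recv_status_on_client.trailing_metadata = &trailing_md;
  op.data.recv_status_on_client.status = &status;
  op.data.recv_status_on_client.status_details = &status_details;

  grpc_call_start_batch(call, &op, 1, tag, NULL);

  /* Block until this batch's tag arrives. Returning early (e.g. on a finite
     pluck timeout) would leave the core writing through the stack pointers
     above after they go out of scope, which is exactly the use-after-free
     described. */
  grpc_completion_queue_pluck(cq, tag, gpr_inf_future(GPR_CLOCK_REALTIME), NULL);

  grpc_metadata_array_destroy(&trailing_md);
  grpc_slice_unref(status_details);
}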

Then, when you added the cancel call and the corresponding pluck call, you most likely solved the memory corruption problem and reintroduced the original failure, just moved from the first pluck to the second. This does provide some information: it implies that the issue is not that the batch is simply waiting forever, but rather that there is some other problem. One possibility is that the batch is failing to start, so you're waiting on a batch that isn't in the queue.

In any case, I would strongly recommend reverting the changes to add finite deadlines and cancel calls, and trying to find the cause of the original failures.

ry...@quizlet.com

Sep 21, 2017, 6:50:51 PM
to grpc.io
Yep, that's totally correct on all counts - I already went ahead this morning and switched back to the original scenario: https://github.com/quizlet/grpc/commit/0792caeec4cd1e4ab837c4b4c975f5acc969fe87

Actually, I just checked and I got the freeze again in the original scenario as well. I had GRPC_TRACE="tcp,api,call_combiner,op_failure,client_channel" and GRPC_VERBOSITY=DEBUG set on it. The process is still running in this stuck state. Any thoughts on where I should prod to debug further? Happy to provide the logs as well.