Channel going into TRANSIENT_FAILURE unless keepalive is used


Daniel Rivas

Mar 13, 2024, 4:10:57 PM
to grpc.io

Hi all,


I’m experiencing a problem with a channel that I’m unable to explain based on what I know and what I could find in the documentation. Maybe someone here can shed some light on the matter :)


I’m using gRPC in my Python project, and seemingly out of the blue some tests started failing with “Connection refused”. I asked two colleagues to try, but neither could reproduce the issue in their local environments, both on macOS x86. So far, it fails on my machine (Arch Linux) and in the GitHub CI (Ubuntu 23.10).


I use pytest as the testing framework and create the gRPC channel in a session-scoped fixture (i.e. the channel is created once at the beginning of the test session and re-used throughout it).


Environment

Language/Runtime: Python 3.10.3

gRPC version: 1.59.3

OS: Arch Linux. Kernel 6.6.18-1-lts



After a long debugging session, this is what I’ve gathered so far:

  • When run in isolation, all tests pass.

  • If I change the scope of the fixture to function (every test creates the channel and closes it at the end), all tests pass.

  • When re-using the channel, the first few tests run fine, and after a certain point all subsequent tests fail with “Connection refused” (UNAVAILABLE).


The point after which tests start to fail is consistent: it comes right after the execution of a test module that takes about 11-13 seconds to run and doesn’t make any gRPC calls.


Let’s say I test 3 modules:

  • Module A ← uses the gRPC channel.

  • Module B ← does not use the gRPC channel.

  • Module C ← uses the same gRPC channel again.



I monitored the state changes of the channel, and it looks like at some point during the execution of Module B the gRPC channel transitions to IDLE and immediately afterwards to TRANSIENT_FAILURE. New calls are then refused.


I added the following options to the channel:

options = {
    "grpc.keepalive_time_ms": 5000,  # Send a keepalive ping every 5 seconds
    "grpc.keepalive_permit_without_calls": True,  # Allow keepalive pings when there are no calls
    "grpc.http2.max_pings_without_data": 0,  # Unlimited pings without data
}

And the problem went away.
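
For reference, the Python API documents the `options` parameter of `grpc.insecure_channel` as a list of key-value pairs, so a dict like the one above has to be converted before it is passed in. A minimal sketch of that conversion follows. The helper name `to_channel_args` is made up for illustration, and the bool-to-int cast is a defensive assumption rather than a documented requirement:

```python
def to_channel_args(options):
    """Turn an options dict into a list of (key, value) channel arguments."""
    # Booleans are cast to plain ints (an assumption: integer-valued
    # channel arguments are the safest form to hand over).
    return [
        (key, int(value) if isinstance(value, bool) else value)
        for key, value in options.items()
    ]


options = {
    "grpc.keepalive_time_ms": 5000,
    "grpc.keepalive_permit_without_calls": True,
    "grpc.http2.max_pings_without_data": 0,
}

channel_args = to_channel_args(options)
print(channel_args)
```

The resulting list would then be passed as `grpc.insecure_channel(target, options=channel_args)`, where `target` stands for the server address used by the tests.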

    

However, two things keep bothering me:

    1. From what I’ve gathered, the default IDLE_TIMEOUT is somewhere between 5 min and 30 min, as stated in the documentation.

    2. Regardless of the IDLE_TIMEOUT, the gRPC channel should be able to transition from IDLE to READY and accept new calls.



I added a callback to monitor the change of states and the following happens:

    1. Before the first test in Module A gets executed (when the channel is created), the channel goes from CONNECTING to READY, as expected.

[2024-03-12 12:44:42.221478+00:00] Channel state changed to CONNECTING.

[2024-03-12 12:44:42.224098+00:00] Channel state changed to READY.


    2. Right before the first test in Module C (which uses the channel again), the channel goes into IDLE and immediately afterwards into TRANSIENT_FAILURE:

      - [2024-03-12 12:44:53.473898+00:00] Channel state changed to IDLE.

      - [2024-03-12 12:44:53.474747+00:00] Channel state changed to TRANSIENT_FAILURE.
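
The monitoring callback is not shown here, but a minimal sketch of one that would produce log lines in this shape could look like the following. The class name `StateLogger` and its `history` attribute are hypothetical. The only gRPC-specific assumption is that the callback receives a `grpc.ChannelConnectivity` enum member, whose `.name` is the state name:

```python
from datetime import datetime, timezone


class StateLogger:
    """Callable that records and prints every connectivity state it is given."""

    def __init__(self):
        # List of (timestamp, state name) tuples, oldest first.
        self.history = []

    def __call__(self, state):
        # Enum members expose .name; fall back to str() for anything else.
        name = getattr(state, "name", str(state))
        now = datetime.now(timezone.utc)
        self.history.append((now, name))
        print(f"[{now}] Channel state changed to {name}.")
```

It would be registered with `channel.subscribe(StateLogger(), try_to_connect=False)` on an existing `grpc.Channel`. Keeping the history in a list also makes it possible to inspect the sequence of states from a test after the fact.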

   

When grpc.keepalive_time_ms is set, the channel still transitions to IDLE, but it doesn't enter TRANSIENT_FAILURE and goes back to READY immediately.


According to the documentation:

When there has been no RPC activity on a channel for a specified IDLE_TIMEOUT, i.e., no new or pending (active) RPCs for this period, channels that are READY or CONNECTING switch to IDLE. Additionally, channels that receive a GOAWAY when there are no active or pending RPCs should also switch to IDLE to avoid connection overload at servers that are attempting to shed connections. We will use a default IDLE_TIMEOUT of 300 seconds (5 minutes).

Since tests don’t run for longer than the IDLE_TIMEOUT, I suspect it might have something to do with the GOAWAY, but looking at the traces I wasn’t able to find anything conclusive.
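
As a sanity check on the timing, the gap between the READY and IDLE log lines above can be computed directly from their timestamps. This is only arithmetic on the logged values. The 300-second figure is the default quoted from the documentation, and the variable names are illustrative:

```python
from datetime import datetime

# Timestamps copied from the state-change log lines.
ready_at = datetime.fromisoformat("2024-03-12 12:44:42.224098+00:00")
idle_at = datetime.fromisoformat("2024-03-12 12:44:53.473898+00:00")

# Default IDLE_TIMEOUT as quoted from the documentation.
DEFAULT_IDLE_TIMEOUT_SECONDS = 300

gap_seconds = (idle_at - ready_at).total_seconds()
print(f"READY -> IDLE after {gap_seconds:.2f} s")
print(f"Below default IDLE_TIMEOUT: {gap_seconds < DEFAULT_IDLE_TIMEOUT_SECONDS}")
```

The gap comes out at roughly 11 seconds, which matches the runtime of Module B and is far below the quoted default, so the idle timer on its own does not explain the transition.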


I’ve kind of fixed the issue by adding the keepalive options but I'd like to get to the bottom of this because I’m still missing something. Any ideas?


Thanks,

Daniel
