strategies for debugging hung event loop

Eric Gradman

Dec 19, 2022, 1:06:17 PM
to gevent: coroutine-based Python network library
TL;DR: I have a gevent program that hangs. Neither the "monitor thread" nor `gevent.util.print_run_info` is shedding light on why or where, probably because I don't really understand the output. I need strategies for finding this hang!

This is the project that's troubling me: https://www.latimes.com/entertainment-arts/story/2019-09-06/operate-on-a-puppet-in-dr-botchers-minute-medical-school  It's pretty unusual!

It doesn't spawn thousands of greenlets to handle web requests! Rather, there's just a handful of greenlets, and those greenlets mainly block waiting for messages from redis pubsub subscriptions, `sleep`ing, and `join`ing on each other. All the I/O goes through a monkey-patched connection to a redis server.

The entire gevent loop will occasionally freeze up, requiring a restart of the Python process. When this happens, every greenlet stops working and nothing helpful is printed to the console. In this state, however, I have once seen a `Timeout` exception raised and printed to the console.

I'm basically looking for strategies to find this bug. I'm a long-time user of gevent. I know how to write greenlets that cooperate. But I'm completely unsophisticated about gevent's internals. Debugging this issue is the first time I've encountered a "hub."

My code kills greenlets.  I might be falling victim to this caution from the `Greenlet.kill` documentation:

> Use care when killing greenlets. If the code executing is not exception safe (e.g., makes proper use of `finally`) then an unexpected exception could result in corrupted state. Using a `link()` or `rawlink()` (cheaper) may be a safer way to clean up resources.

I've hooked a SIGUSR1 handler to `gevent.util.print_run_info` and I've gone through the output line by line, but all the greenlets "look like" they're doing reasonable things (like, they're inside blocking I/O calls, joining, sleeping, or doing other ordinary things).  But honestly I don't know what I'm looking for.  Here's a copy of the output from one crash run: https://www.dropbox.com/s/gmeqj9s6ad9r37y/gevent_log.txt?dl=0
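
For reference, a minimal sketch of that kind of hook. It is not my exact code: the function name `dump_run_info` is made up for illustration, and it assumes the handler is registered with the standard library's `signal.signal` from the main thread.

```python
import signal
import sys


def dump_run_info(signum, frame):
    # Hypothetical handler name. gevent is imported at call time, so this
    # function is the only place the sketch depends on gevent.
    from gevent.util import print_run_info

    # Prints thread stacks and greenlet stacks to stderr.
    print_run_info(file=sys.stderr)


# Register the handler; `kill -USR1 <pid>` then triggers a dump.
signal.signal(signal.SIGUSR1, dump_run_info)
```

A Python-level signal handler like this only runs once the interpreter gets a chance to execute Python code again, so a dump appearing at all is itself a hint that the process is not stuck inside a single C call.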

I tried enabling the monitor thread early on in my code:
```
from gevent import config
config.monitor_thread = True
```
But it produces no output...

So I'm stumped!

Some extra info:

This project runs in Python 3.10 on Ubuntu 22.04 x86_64 in Docker. I'm using whatever the latest gevent is. I'm not pinning my versions, though I might start doing that. I'm connecting to a redis server that is also in Docker. Monkey-patching happens first thing. I have quite a few systems with this same configuration that all run like a champ.