Seeking help - Python process stuck at _exec_once_mutex during first_connect in sqlalchemy/event/attr.py


Ping Zhang

Jan 28, 2023, 1:02:13 AM
to sqlalchemy
Hi all,

Our production system is experiencing some very strange behavior. At peak load, some of our Python processes get stuck at: <built-in method __enter__ of _thread.lock object at remote 0x7f8e021156c0>, i.e. at `with self._exec_once_mutex` in sqlalchemy/event/attr.py (here is the full stack trace that I got via gdb with the Python extension).
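As an aside, a lighter-weight way to capture Python-level tracebacks for all threads of a running process (which may help with diagnosing this) is the standard library's faulthandler module; a minimal sketch, with an arbitrary choice of signal:

import faulthandler
import signal
import sys

# Dump Python-level tracebacks for all threads to stderr whenever the
# process receives SIGUSR1, e.g. via `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Or dump them immediately from code:
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)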

Here is some high-level information about the production setup.

Python: 3.7.9
SQLAlchemy: 1.3.6
PyMySQL: 1.0.2
Connection string: mysql+pymysql://HOST:PORT/DB?connect_timeout=30&charset=utf8
Docker container where the Python processes run: Ubuntu 18.04.6
VM host where Docker runs: amzn2

When creating the engine, we use NullPool as the poolclass (SQL_ALCHEMY_CONN, engine_args and setup_event_handlers come from our surrounding code):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import scoped_session, sessionmaker
    from sqlalchemy.pool import NullPool

    engine_args['poolclass'] = NullPool
    engine = create_engine(SQL_ALCHEMY_CONN, **engine_args)
    setup_event_handlers(engine)
    Session = scoped_session(
        sessionmaker(autocommit=False,
                     autoflush=False,
                     bind=engine,
                     expire_on_commit=False))


There are 5 processes running in the container; please see this link for the Python process model running in the Docker container. The stuck Python process is created by its parent process (which is also Python) via `subprocess.Popen`; right after `subprocess.Popen`, a thread is created to read the logs of the subprocess (please see here for more info). Reading the source code at that point in the stack trace, it looks like it is trying to grab a lock, but I am not sure why it would fail to acquire it, given that only one process uses the session.

At peak load, there are around ~200 Docker containers running on a single amzn2 host. From the database-side metrics, we do see some connections being dropped at the peak. I am not sure whether this could cause the threading lock issue, but the two do correlate. Or is it possible that the event handlers registered on the engine and the session query in the stack trace try to grab this lock at the same time?
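For reference, setup_event_handlers(engine) above registers SQLAlchemy event listeners on the engine; the snippet below is only an illustrative sketch of that kind of listener (a classic "pre-ping" style engine_connect handler), not our actual implementation:

from sqlalchemy import event, exc, select


def setup_event_handlers(engine):
    # Illustrative pre-ping style listener: run a trivial SELECT on each
    # engine-level connect and retry once if the connection turns out to
    # be stale.
    @event.listens_for(engine, "engine_connect")
    def ping_connection(connection, branch):
        if branch:
            # A "branch" is a sub-connection of an existing connection;
            # no need to ping it.
            return
        try:
            connection.scalar(select([1]))
        except exc.DBAPIError as err:
            if err.connection_invalidated:
                # The connection was invalidated; running the statement
                # again transparently reconnects.
                connection.scalar(select([1]))
            else:
                raise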

Potential related issues that I have checked:
but I could not figure out why it happens in our system. We also tried to reproduce it in our stress-test environment, but could not.

Could you please share some insights? Any suggestions would be really appreciated.

Thanks,

Ping


Mike Bayer

Jan 28, 2023, 11:23:51 AM
to noreply-spamdigest via sqlalchemy
that particular routine runs exactly once, when the engine is used for the very first time. I only see one traceback here, and it does not seem to indicate the lock is being acquired in a re-entrant fashion, so to look for deadlocks we would need to see tracebacks for all threads.

if process creation is having some unusual effect here, ensure this routine runs in the parent process exactly once. if you have a create_engine() whose engine is then being shared with separate processes, do this init *before* you create any of those processes:

engine = create_engine(...)
engine.connect().close()   # <--- engine is initialized

spawn_new_processes()   # now spawn sub-processes

next, within each child process, set the engine up so that all new connections start fresh before the child process does its work, by calling .dispose(close=False); that way, within the child process, any connections inherited from the parent are forgotten:

def child_process():
    engine.dispose(close=False)   # forget connections inherited from the parent
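putting those two steps together, a rough sketch of the pattern (function names here are placeholders; also note that the close=False argument to Engine.dispose() only exists in newer releases of the 1.4 series, not in 1.3.6, so upgrading would be part of this):

import multiprocessing

from sqlalchemy import create_engine

# placeholder URL for illustration
engine = create_engine("mysql+pymysql://HOST:PORT/DB")


def child_process():
    # forget any connection/pool state inherited from the parent without
    # closing the parent's actual DBAPI connections
    engine.dispose(close=False)
    with engine.connect() as conn:
        ...  # the child's real work goes here


if __name__ == "__main__":
    # initialize the engine exactly once, in the parent, *before* spawning
    engine.connect().close()

    procs = [multiprocessing.Process(target=child_process) for _ in range(5)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()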



next, w/ Airflow: I have not used Airflow, but the code you point to there looks like the old "pre-ping" recipe, which is no longer needed, as SQLAlchemy now includes a pre-ping option. Report an issue to Apache Airflow that they should support SQLAlchemy 1.4 and above and remove this old event handler.
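for reference, the built-in replacement is the pool_pre_ping flag on create_engine() (available since the 1.2 series); a minimal sketch using the connection string from above:

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# pool_pre_ping issues a lightweight ping on each connection checkout and
# transparently reconnects if the connection has gone stale, replacing the
# old engine_connect "SELECT 1" event-handler recipe
engine = create_engine(
    "mysql+pymysql://HOST:PORT/DB?connect_timeout=30&charset=utf8",
    poolclass=NullPool,
    pool_pre_ping=True,
)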

Finally, when the lock-up does actually occur, you'd also want to look at what the actual database and/or proxy servers are up to; if those servers are blocking new connections from proceeding, that would also be a potential cause of what you are seeing.

Ping Zhang

Jan 30, 2023, 12:48:26 PM
to sqlal...@googlegroups.com
Thanks Mike for your kind response.

> so to look for deadlocks we would need to see tracebacks for all threads.
Please see this link: there are 48 threads, many of them `blas_thread_server` threads, which I am not sure where they come from.

For the engine management, it looks like, as long as we are using NullPool, it should be fine. I will double-check that as well.

Regarding Airflow, it has started to support SQLAlchemy 1.4 and above.

Thanks,

Ping

