I would like to tell sqlalchemy that if a query fails with this error,
it should wait a few seconds and then retry the query (and probably
give up if it fails again). Does SQLA provide some sort of hooks that
would allow me to do this without gnarly monkey patching?
You'd have to organize your code such that the desired operation can
be attempted again when this exception is raised.
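
A rough sketch of what "organize your code" could look like, assuming
the operation is wrapped in a callable that acquires its own connection
each time it runs. run_with_retry and its parameters are made-up names
for illustration, not part of SQLAlchemy; with SQLAlchemy the exception
to catch would most likely be sqlalchemy.exc.OperationalError, but
check what your traceback actually shows.

```python
import time


def run_with_retry(operation, retry_on, retries=1, delay=3.0, sleep=time.sleep):
    """Call operation(); if it raises retry_on, wait and try again.

    Gives up and re-raises after `retries` additional attempts.
    `operation` must be safe to run from scratch, i.e. it should get
    a fresh connection itself rather than reuse a broken one.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= retries:
                raise  # out of retries, let the caller see the error
            attempt += 1
            sleep(delay)
```

You would call it as run_with_retry(do_query, SomeDBError), where
do_query is your own function and SomeDBError is whatever exception
class the failure surfaces as.
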
This is a pretty tough road to travel, though, since if the connection
is lost, so is your entire transaction and everything you've loaded/
persisted within it. A better approach would be to isolate the cause
of the error. This error is commonly caused by MySQL timing out an
idle connection (by default after 8 hours, per the server's
wait_timeout setting) and is alleviated using the
pool_recycle=<some number of seconds> option.
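
For reference, a minimal sketch of passing pool_recycle to
create_engine(). The make_engine helper, the URL, and the 3600-second
value are placeholders I picked for illustration; choose a value
comfortably below the server's idle timeout. The sqlite URL is only
there so the snippet runs without a MySQL server.

```python
from sqlalchemy import create_engine


def make_engine(db_url, recycle_seconds=3600):
    # pool_recycle: a pooled connection older than this many seconds
    # is discarded and replaced with a fresh one at checkout time.
    return create_engine(db_url, pool_recycle=recycle_seconds)


# In the real job db_url would be a MySQL URL such as
# "mysql://user:password@host/dbname" (hypothetical). An in-memory
# sqlite URL stands in here so the sketch runs anywhere.
engine = make_engine("sqlite://")
```
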
>
> This code isn't using transactions so retrying a failed query should
> be as simple as creating a new connection to replace the failed one
> and executing the query again.
>
> Still, I would much prefer to figure out the real cause, as you say. I
> had sort of given up on that because after a little while researching
> this error, I couldn't find much helpful info. It's hard to debug
> because the issue happens in a daily cron job, but it happens less
> than once a month and the rest of the time everything works fine. I
> have no way of consistently reproducing the problem or knowing if I've
> fixed it.
>
> I'm pretty sure there is no way that 8 hours could have gone by
> between the last query and the one that blew up.
>
> The basic structure of the cron job is:
> 1) It starts up, does some sql stuff.
> 2) It forks a worker process using the python processing module.
> 3a) The worker calls metadata.bind.dispose() so that it won't try to
> reuse the connection it inherited from the parent. Worker then does
> some sql stuff. Worker always finishes successfully.
It might be better to just call create_engine() and not use bound
metadata here.
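
Something along these lines, as a sketch only: the worker builds its
own engine after the fork instead of touching anything inherited from
the parent. The worker function, the db_url argument, and the SELECT 1
query are stand-ins for whatever your job really does.

```python
from sqlalchemy import create_engine, text


def worker(db_url):
    # Create the engine inside the child process, so none of its
    # connections are shared with the parent.
    engine = create_engine(db_url)
    try:
        with engine.connect() as conn:
            # Placeholder for the worker's real sql stuff.
            return conn.execute(text("SELECT 1")).scalar()
    finally:
        # Close this process's own connections when the work is done.
        engine.dispose()
```

The function would be what you hand to the forked process as its
target, with the database URL passed in as an argument.
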
>
> 3b) Parent process goes into a loop doing sql stuff. Parent usually
> finishes successfully, but occasionally dies with the aforementioned
> MySQL error. I can't tell from the traceback whether it happens during
> the first iteration of the loop immediately after spawning the child
> or if it happens later.
>
> In principle, this structure is safe, right? 3a and 3b are happening
> in parallel, so it is indeterminate whether the worker calls dispose()
> before or during the sql stuff going on in the parent, but that
> shouldn't matter, right? Is it possible that the call to dispose() is
> somehow closing the connection in a way that sabotages the parent?
I wouldn't think so, but I'm not intricately familiar with the
mechanics of database connections passed between parent/child forks.
If it's just a cron job you might want to consider using the NullPool,
which doesn't pool any connections.
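
A minimal sketch of the NullPool setup, assuming you build the engine
yourself with create_engine(). make_unpooled_engine is a made-up
helper name, and the sqlite URL again just stands in for your MySQL
URL so the snippet can run on its own.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool


def make_unpooled_engine(db_url):
    # With NullPool every connect() opens a new DB connection and
    # closing it really closes it; nothing sits idle in a pool.
    return create_engine(db_url, poolclass=NullPool)


engine = make_unpooled_engine("sqlite://")
with engine.connect() as conn:
    # Placeholder query; the connection is closed on leaving the block.
    result = conn.execute(text("SELECT 1")).scalar()
```

The tradeoff is a new connection per checkout, which is usually fine
for a job that runs once a day.
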