Problems handling an Amazon RDS Multi-AZ failover with SqlSoup and SqlAlchemy 0.7.8

JonathanQ

unread,

Nov 14, 2012, 2:38:05 PM11/14/12

to sqlal...@googlegroups.com

We have some code running against MySQL running in Amazon RDS. During an outage or the maintenance window - Amazon issues an automatic failover of our RDS instance. However our code doesn't failover with it.

We are having issues with 2 of our applications (2 different issues):

First one: Unable to rollback when we get the error "2006, 'MySQL server has gone away'"

The issue we are having is that a DB operation will fail when the failover occurs. So in the event of an error we issue a rollback on the SqlSoup object (which delegates to the underlying session).

sqlsoup_engine.rollback()

However this rollback fails with the error:

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/ext/sqlsoup.py", line 602, in rollback

self.session.rollback()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 114, in do

return getattr(self.registry(), name)(*args, **kwargs)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 625, in rollback

self.transaction.rollback()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 339, in rollback

transaction._rollback_impl()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 367, in _rollback_impl

t[1].rollback()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2035, in rollback

self._do_rollback()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2072, in _do_rollback

self.connection._rollback_impl()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1265, in _rollback_impl

self._handle_dbapi_exception(e, None, None, None, None)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1262, in _rollback_impl

self.engine.dialect.do_rollback(self.connection)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 1903, in do_rollback

connection.rollback()

OperationalError: (OperationalError) (2006, 'MySQL server has gone away') None None

Subsequently our application continues on an then starts failing with:

StatementError("Can't reconnect until invalid transaction is rolled back (original cause: InvalidRequestError: Can't reconnect until invalid transaction is rolled back)",)

The second error is: 2003, "Can't connect to MySQL server on 'mydbname.us-east-1.rds.amazonaws.com' (111)

File "/home/ec2-user/myapp/python/myapp.py", line 188, in _write

sqlsoup_engine.commit()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/ext/sqlsoup.py", line 610, in commit

self.session.commit()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 114, in do

return getattr(self.registry(), name)(*args, **kwargs)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 656, in commit

self.transaction.commit()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 314, in commit

self._prepare_impl()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 298, in _prepare_impl

self.session.flush()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1583, in flush

self._flush(objects)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1654, in _flush

flush_context.execute()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 331, in execute

rec.execute(self)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 475, in execute

uow

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 45, in save_obj

uowtransaction)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 138, in _organize_states_for_save

states):

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 759, in _connections_for_states

base_mapper)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 194, in connection

return self._connection_for_bind(engine)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 256, in _connection_for_bind

conn = self._parent._connection_for_bind(bind)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 267, in _connection_for_bind

conn = bind.contextual_connect()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2490, in contextual_connect

self.pool.connect(),

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/pool.py", line 224, in connect

return _ConnectionFairy(self).checkout()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/pool.py", line 387, in __init__

rec = self._connection_record = pool._do_get()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/pool.py", line 741, in _do_get

con = self._create_connection()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/pool.py", line 188, in _create_connection

return _ConnectionRecord(self)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/pool.py", line 270, in __init__

self.connection = self.__connect()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/pool.py", line 330, in __connect

connection = self.__pool._creator()

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect

return dialect.connect(*cargs, **cparams)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 281, in connect

return self.dbapi.connect(*cargs, **cparams)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/MySQLdb/__init__.py", line 81, in Connect

return Connection(*args, **kwargs)

File "/home/ec2-user/myapp/pyenv/lib/python2.7/site-packages/MySQLdb/connections.py", line 187, in __init__

super(Connection, self).__init__(*args, **kwargs2)

OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on 'mydbname.us-east-1.rds.amazonaws.com' (111)") None None

It appears that the DNS lookup is cached (speculation?). RDS does a DNS swap on the url so it points to the slave instance. So if a DNS lookup was already done and the IP was cached if this would be the cause. This might not be anything to do with SqlAlchemy (maybe not even MySQLdb) - but wanted to check if anyone has seen similar behaviour before.

These applications can handle the rollback as long as it is able to re-connect and continue on. It will just re-process the records it was trying to operate on. But it gets in this stuck state and requires us to manually restart them for it to recover. So my questions are:

1) How should we be recovering when the server disappears? If a rollback is not going to succeed when the underlying DB has gone away - what is the best action to take?

2) Is there a good way to be handling the connection (with SqlSoup) so that it will auto-reconnect AND re-resolve DNS lookups?

Thanks,

Jonathan

JonathanQ

unread,

Nov 14, 2012, 2:45:02 PM11/14/12

to sqlal...@googlegroups.com

Ok - it appears that the 2nd issue (DNS resolving) may be related to AWS: https://forums.aws.amazon.com/thread.jspa?threadID=109414&tstart=0

But I am sure the first issue is something wrong with how we are using SqlAlchemy/SqlSoup connections.

Richie Foreman

unread,

Nov 14, 2012, 7:53:44 PM11/14/12

to sqlal...@googlegroups.com

Hey Jonathan,

When I was creating the Google AppEngine Cloud SQL driver I was having a very similar issue. The nature of the "cloud" is to not really hold open long-lived connections, it seems. I switched my driver code to FORCE SqlAlchemy's 'NullPool', which essentially does NO connection polling and requires a brand new connection each time.

I haven't tinkered with SQLSoup, but you might want to look where the SQLAlchemy connection is being established and set your pool class to NullPool.

Not sure if that helps, but wanted to throw my two cents in!

Cheers,

Richie

Michael Bayer

unread,

Nov 14, 2012, 8:16:56 PM11/14/12

to sqlal...@googlegroups.com

On Nov 14, 2012, at 2:38 PM, JonathanQ wrote:

> We have some code running against MySQL running in Amazon RDS. During an outage or the maintenance window - Amazon issues an automatic failover of our RDS instance. However our code doesn't failover with it.
>
> We are having issues with 2 of our applications (2 different issues):
>
> First one: Unable to rollback when we get the error "2006, 'MySQL server has gone away'"
>
> The issue we are having is that a DB operation will fail when the failover occurs. So in the event of an error we issue a rollback on the SqlSoup object (which delegates to the underlying session).
>
> sqlsoup_engine.rollback()
>
> However this rollback fails with the error:
>

> connection.rollback()
> OperationalError: (OperationalError) (2006, 'MySQL server has gone away') None None
>
> Subsequently our application continues on an then starts failing with:
> StatementError("Can't reconnect until invalid transaction is rolled back (original cause: InvalidRequestError: Can't reconnect until invalid transaction is rolled back)",)

OK well when you say "in the event of an error", is that an error that precedes the rollback, and is it the same 2006 error ? or is the 2006 error new with the rollback?

Basically, what should happen is, any DB operation, like execute, or commit, rollback, can potentially throw a 2006. SQLAlchemy will detect this as a "disconnect" situation, and discard all open connections that aren't checked out. The next operation will use a brand new connection.

So without knowing if you're getting this same error multiple times or not, you'd want to either rollback() a second time, or just throw away the Session after the rollback() fails. With SQLSoup you'd want to get at it by sqlsoup_engine.session.remove(), which will replace the current thread's Session with a new one.

JonathanQ

unread,

Nov 15, 2012, 12:28:48 PM11/15/12

to sqlal...@googlegroups.com

Thanks Michael - I think I am starting to see where we have gone wrong with our session handling. Looks like calling remove() on the session at the end of a unit of work is probably the best approach.

Here is a bit more detail on what these applications are doing - and perhaps you can comment on whether calling a remove() on the session at the end of the scope of work is the best action to take.

Application 1 runs 24/7 and consumes from a messaging queue. It pulls 10 messages off the queue and processes each one individually. Upon completion of processing for all 10 messages - we will ACK the ones that succeeded and let the ones that failed be re-driven to another consumer (after a few attempts we put them aside in a DLQ).

The unit of work here could be each message, or the batch of 10.

Application 2 runs 24/7 and loops over a list of enums and does queries based on them. For example: [blue, green, red, yellow]. We will loop over each and do a query for all records with color: blue, process them and then move on to the next. If processing for a single color fails, it shouldn't stop the processing of the next one - and assumption is that the next time around the failing ones will work (or will work soon). We sleep between outer loops for ~15 seconds.

The unit of work would be each color, or a single loop through all colors.

Currently in both applications we are not doing anything with the session after we finish processing the unit of work, so we are reusing the same session for all requests (for many days). That is why when something goes wrong and rollback fails for whatever reason - our application gets into a bad state and can't process anything until it is restarted.

Neither application is multi-threaded.

Is there any performance hit in doing a session.remove() very often? I am assuming not - after reading your reply I did some searches on session.remove() and that seems to be the regular behaivor for a web request. We can safely break up the unit of work into either very small chunks to minimize impact of a db issue, or batch them together safely without much impact on data integrity if there is a performance impact. Either way - re-using the same session for long periods of time (days) looks like its the root cause of our issues.