Multi-AZ RDS on AWS, failover and SQLAlchemy


Jonas Oscarsson

Oct 17, 2014, 4:47:31 AM
to sqlal...@googlegroups.com
Hi,

We are running SQLAlchemy with a Multi-AZ Postgres RDS database on AWS and are having trouble with database connections that hang when a failover occurs. According to the Amazon documentation for Multi-AZ RDS:

When operations such as DB Instance scaling or system upgrades like OS patching are initiated for Multi-AZ deployments, for enhanced availability, they are applied first on the standby prior to an automatic failover. As a result, your availability impact is limited only to the time required for automatic failover to complete.

and

When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB Instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer.

What we are wondering is what "best practices and implement database connection retry" means when using SQLAlchemy. To test what happens during a Multi-AZ RDS failover, we created a new Multi-AZ Postgres instance and a simple script that we run on a server on AWS (connecting over the internal network to the RDS instance). The script queries a few rows and inserts a new row in a while loop.

Running this script and then rebooting the RDS instance (with the "Reboot With Failover?" flag ticked) causes the script to hang on session.commit(). When rebooting the instance **without** "Reboot With Failover?" ticked, the script outputs "could not connect to server: Connection refused" for a few seconds and then resumes as normal.

We have tried:
 1. Using the QueuePool instead of the NullPool (we switched to NullPool just to isolate the error).
 2. Setting socket.setdefaulttimeout() to a low value.
 3. Passing PGOPTIONS="-c statement_timeout=1000" to make the pgdriver timeout.
 4. Using autocommit=True on sessionmaker().

If we run lsof -i on the application server, we can see that the connection closes after each iteration of the loop. When the script hangs, we can see one connection that stays open and never closes, so the issue seems to be a missing timeout somewhere.
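
For completeness, the driver-level timeouts can also be passed straight to the psycopg2/libpq layer through create_engine()'s connect_args rather than via environment variables. A minimal sketch (untested against a failover on our side; the values and the extra keepalive parameters are illustrative only):

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

engine = create_engine(
    'postgresql+psycopg2://local:<pw>@<aws-rds-host>/test',
    poolclass=NullPool,
    connect_args={
        'connect_timeout': 5,                     # libpq: give up on connection attempts after 5s
        'keepalives': 1,                          # libpq: enable TCP keepalives
        'keepalives_idle': 5,                     # libpq: first keepalive probe after 5s of idle
        'keepalives_interval': 2,                 # libpq: probe every 2s after that
        'keepalives_count': 3,                    # libpq: drop the connection after 3 failed probes
        'options': '-c statement_timeout=1000',   # server-side statement timeout in ms
    })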

Does anyone know what could be causing this?

The script we run is included below.

Thanks in advance! / Jonas

import sys
from datetime import datetime
from time import time, sleep
from sqlalchemy import Column, DateTime, String, Integer, ForeignKey, UnicodeText, create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm.scoping import scoped_session
from sqlalchemy.pool import NullPool
from sqlalchemy.ext.declarative import declarative_base


current_milli_time = lambda: int(round(time() * 1000))
Base = declarative_base()


class Message(Base):
    __tablename__ = 'messages'
    id = Column(Integer, primary_key=True)
    body = Column(UnicodeText, nullable=False)


# NullPool: every checkout opens a brand-new connection, every checkin closes it
engine = create_engine('postgresql+psycopg2://local:<pw>@<aws-rds-host>/test', poolclass=NullPool)
session_factory = sessionmaker()
session_factory.configure(bind=engine)

session = scoped_session(session_factory)

while True:
    try:
        # fetch the five most recent messages
        ids = ''
        start = current_milli_time()
        for msg in session.query(Message).order_by('id desc').limit(5):
            ids += str(msg.id) + ', '
        print '(%s) (took %s ms) fetched ids: %s' % (datetime.now().time().isoformat(), current_milli_time() - start, ids)

        # insert a new message
        start = current_milli_time()
        m = Message()
        m.body = 'some text'
        session.add(m)
        session.commit()
        print '(%s) (took %s ms) inserted new message' % (datetime.now().time().isoformat(), current_milli_time() - start)

    except Exception as e:
        print '(%s) failed due to %s' % (datetime.now().time().isoformat(), e)
        session.rollback()
    finally:
        session.remove()

    sleep(0.25)



Jonathan Vanasco

Oct 17, 2014, 11:38:21 AM
to sqlal...@googlegroups.com

Michael Bayer

Oct 17, 2014, 12:03:38 PM
to sqlal...@googlegroups.com
On Oct 17, 2014, at 4:47 AM, Jonas Oscarsson <jona...@rsson.nu> wrote:

Hi,

We are running SQLAlchemy with a Multi-AZ Postgres RDS database on AWS and are having trouble with database connections that hang when a failover occurs. According to the Amazon documentation for Multi-AZ RDS:

When operations such as DB Instance scaling or system upgrades like OS patching are initiated for Multi-AZ deployments, for enhanced availability, they are applied first on the standby prior to an automatic failover. As a result, your availability impact is limited only to the time required for automatic failover to complete.

and

When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB Instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer.

What we are wondering is what "best practices and implement database connection retry" means when using SQLAlchemy. To test what happens during a Multi-AZ RDS failover, we created a new Multi-AZ Postgres instance and a simple script that we run on a server on AWS (connecting over the internal network to the RDS instance). The script queries a few rows and inserts a new row in a while loop.

Running this script and then rebooting the RDS instance (with the "Reboot With Failover?" flag ticked) causes the script to hang on session.commit(). When rebooting the instance **without** "Reboot With Failover?" ticked, the script outputs "could not connect to server: Connection refused" for a few seconds and then resumes as normal.

We have tried:
 1. Using the QueuePool instead of the NullPool (we switched to NullPool just to isolate the error).
 2. Setting socket.setdefaulttimeout() to a low value.
 3. Passing PGOPTIONS="-c statement_timeout=1000" to make the pgdriver timeout.
 4. Using autocommit=True on sessionmaker().

If we run lsof -i on the application server, we can see that the connection closes after each iteration of the loop. When the script hangs, we can see one connection that stays open and never closes, so the issue seems to be a missing timeout somewhere.

Does anyone know what could be causing this?

I’d note that “autocommit=True” on sessionmaker() does not change much at all in the way the DBAPI is used.  If you’re looking to use psycopg2’s “autocommit” feature, I’d suggest experimenting first with raw psycopg2 so that you can familiarize yourself with how this behavior impacts what you’re doing with AWS; the SQLAlchemy integration with this feature is documented at http://docs.sqlalchemy.org/en/rel_0_9/dialects/postgresql.html#psycopg2-transaction-isolation-level.

Since you’re dealing with a system that hangs, it’s essential that you get a handle on exactly what connectivity and transactional pattern you want to see first at the DBAPI level, then we can talk about SQLAlchemy APIs to make sure these patterns are used.
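
A bare-bones loop along these lines (host and credentials are placeholders; this is only a sketch) would let you observe what psycopg2 itself does during the failover:

import time
import psycopg2

while True:
    conn = None
    try:
        # connect fresh on each iteration, mirroring the NullPool behavior
        conn = psycopg2.connect(
            host='<aws-rds-host>', dbname='test',
            user='local', password='<pw>',
            connect_timeout=5)
        cur = conn.cursor()
        cur.execute("SELECT 1")
        print(cur.fetchone())
        conn.commit()
    except psycopg2.Error as e:
        print('failed: %s' % e)
    finally:
        if conn is not None:
            conn.close()
    time.sleep(0.25)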



Jonas Oscarsson

Oct 18, 2014, 8:42:30 AM
to sqlal...@googlegroups.com
Thanks Michael, we will redo the test using psycopg2 directly and see what behavior we get there. Will report back here when done.


Jonas Oscarsson

Oct 18, 2014, 8:42:32 AM
to sqlal...@googlegroups.com
Thanks for the tip. We haven't tried that yet, although I think we would see the same behavior, since it seems to be the next connection after the database goes down that hangs (no matter whether it's a SELECT 1 or an actual query).

On Fri, Oct 17, 2014 at 5:38 PM, Jonathan Vanasco <jvan...@gmail.com> wrote:

Joshua Ma

Dec 21, 2015, 3:19:32 PM
to sqlalchemy
Hi Jonas,

Just wondering if you were able to characterize the behavior and had any ideas for gracefully handling RDS failover. From what I'm seeing, RDS doesn't raise errors but simply hangs, so SQLA never hits its "optimistic disconnect handling" path (http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-optimistic).

Have you found a simple way to distinguish a failover from a merely slow query? It would be super useful for us and, I'm guessing, other folks using SQLA + RDS. (We're also using Postgres.)

- Josh

Jonas Oscarsson

Feb 6, 2016, 1:22:53 PM
to sqlal...@googlegroups.com
Hi Josh,

Sorry for the late reply.

Unfortunately we haven't had the time (or prioritized it enough) to look into this further. For the very few times we've had to use the failover functionality, it has recovered quickly enough (within two minutes or so).

Best,
Jonas

Joshua Ma

Feb 7, 2016, 7:47:40 PM
to sqlalchemy
Forgot to send an update myself. We actually figured it out: it was due to SQLA not properly detecting a failed connection. We get an "SSL SYSCALL error", and the connection pool doesn't reopen the connection and stays borked. It looks like SQLA detects disconnects via a few hard-coded string checks, and "SSL SYSCALL error" was a newer one it didn't know about.

0.9.5 and 0.9.8 fixed a bunch of related issues:
    0.9.5: Added a new "disconnect" message, "connection has been closed unexpectedly"
    0.9.5: The psycopg2 .closed accessor is now consulted when determining if an exception is a "disconnect" error
    0.9.8: A revisit to the issue first patched in 0.9.5; apparently psycopg2's .closed accessor is not as reliable as assumed, so an explicit check was added for the exception messages "SSL SYSCALL error: Bad file descriptor" and "SSL SYSCALL error: EOF detected" when detecting an is-disconnect scenario

So upgrading to 0.9.8+ solved our issues. When a failover happens:

1) the primary stops accepting requests, which triggers SSL SYSCALL errors
2) the connection pool invalidates the broken connection and opens a new one
3) the DNS record has a 5s TTL, so by then it points at the new primary (if not, another error and repeat)
4) a new connection is established and we've recovered

So there are a few errors during the failover window, which it seems you could smooth over with pessimistic disconnect handling, but at this point the behavior seems reasonable enough for us.
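
For anyone who does want to go the pessimistic route, the recipe from the pooling docs is roughly the following (a sketch; we haven't exercised it against an actual failover, and a ping that itself hangs would of course not help):

from sqlalchemy import exc, event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    # issue a cheap ping before handing the connection to the application
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except Exception:
        # DisconnectionError tells the pool to discard this connection
        # and retry the checkout with a fresh one
        raise exc.DisconnectionError()
    cursor.close()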

- Josh

Harry

Jan 20, 2021, 5:42:35 AM
to sqlalchemy
With apologies for resuscitating an old thread: I came across this issue last week, a 15-minute hang while an Amazon RDS instance was doing a failover (we are on SQLA 1.3.22).  I need to do some more work on reproducing the issue, but:

This post on SO from last month, https://stackoverflow.com/questions/65505031/sqlalchemy-ssl-syscall-timeout-coping-mechanism, mentions that PGBouncer had a similar issue (linked from the post).
The PGBouncer developers decided it was down to low-level TCP settings and fixed it by adjusting the TCP_USER_TIMEOUT setting, which can apparently be set on a per-connection basis.

Would it be worth investigating and adding to SQLA?

hp

Harry

Jan 20, 2021, 6:05:02 AM
to sqlalchemy
Setting socket options like this might be more of a psycopg2 thing? Some more investigation here: https://stackoverflow.com/questions/26741175/psycopg2-db-connection-hangs-on-lost-network-connection

Mike Bayer

Jan 20, 2021, 11:05:16 AM
to noreply-spamdigest via sqlalchemy


On Wed, Jan 20, 2021, at 5:42 AM, Harry wrote:
With apologies for resuscitating an old thread: I came across this issue last week, a 15-minute hang while an Amazon RDS instance was doing a failover (we are on SQLA 1.3.22).  I need to do some more work on reproducing the issue, but:

This post on SO from last month, https://stackoverflow.com/questions/65505031/sqlalchemy-ssl-syscall-timeout-coping-mechanism, mentions that PGBouncer had a similar issue (linked from the post).
The PGBouncer developers decided it was down to low-level TCP settings and fixed it by adjusting the TCP_USER_TIMEOUT setting, which can apparently be set on a per-connection basis.

Would it be worth investigating and adding to SQLA?

hey there -

All libpq parameters may already be passed in the URL or in connect_args, so just pass it in!

from sqlalchemy import create_engine

e = create_engine("postgresql+psycopg2://scott:tiger@localhost/test?tcp_user_timeout=30")

c = e.connect()
print(c.connection.connection)

output, there it is:

<connection object at 0x7f7b8dee0af0; dsn: 'user=scott password=xxx dbname=test host=localhost tcp_user_timeout=30', closed: 0>
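
(The same parameter can also go through connect_args instead of the URL; an untested sketch, assuming a libpq new enough to recognize tcp_user_timeout:)

from sqlalchemy import create_engine

e = create_engine(
    "postgresql+psycopg2://scott:tiger@localhost/test",
    connect_args={"tcp_user_timeout": "30"},  # same value as in the URL example above
)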






