from datetime import datetime
from time import time, sleep

from sqlalchemy import Column, Integer, UnicodeText, create_engine
from sqlalchemy.orm import sessionmaker, scoped_session
from sqlalchemy.pool import NullPool
from sqlalchemy.ext.declarative import declarative_base

current_milli_time = lambda: int(round(time() * 1000))

Base = declarative_base()


class Message(Base):
    __tablename__ = 'messages'
    id = Column(Integer, primary_key=True)
    body = Column(UnicodeText, nullable=False)


# NullPool gives a fresh connection on every checkout (used here to rule the
# connection pool out as the source of the hang).
engine = create_engine('postgresql+psycopg2://local:<pw>@<aws-rds-host>/test', poolclass=NullPool)
session_factory = sessionmaker()
session_factory.configure(bind=engine)
session = scoped_session(session_factory)

while True:
    try:
        # Read the five most recent rows, timing the query.
        ids = ''
        start = current_milli_time()
        for msg in session.query(Message).order_by(Message.id.desc()).limit(5):
            ids += str(msg.id) + ', '
        print '(%s) (took %s ms) fetched ids: %s' % (datetime.now().time().isoformat(), current_milli_time() - start, ids)

        # Insert one new row, timing the commit.
        start = current_milli_time()
        m = Message()
        m.body = 'some text'
        session.add(m)
        session.commit()
        print '(%s) (took %s ms) inserted new message' % (datetime.now().time().isoformat(), current_milli_time() - start)
    except Exception, e:
        print '(%s) failed due to %s' % (datetime.now().time().isoformat(), e)
        session.rollback()
    finally:
        session.remove()
        sleep(0.25)
Hi,

We are running SQLAlchemy with a Multi-AZ Postgres RDS database on AWS and are having trouble with database connections that hang when a failover occurs. According to the Amazon documentation for Multi-AZ RDS:

"When operations such as DB Instance scaling or system upgrades like OS patching are initiated for Multi-AZ deployments, for enhanced availability, they are applied first on the standby prior to an automatic failover. As a result, your availability impact is limited only to the time required for automatic failover to complete."

and

"When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB Instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer."

What we are wondering is what "best practices and implement database connection retry" means when using SQLAlchemy.

To test what happens during a Multi-AZ RDS failover, we created a new Multi-AZ Postgres instance and the simple script shown above, which we run on a server on AWS (connecting over the internal network to the RDS instance). The script queries a few rows and inserts a new row in a while loop.

Running this script and then rebooting the RDS instance (with the "Reboot With Failover?" flag ticked) causes the script to hang on session.commit(). When rebooting the RDS instance without "Reboot With Failover?" ticked, the script outputs "could not connect to server: Connection refused" for a few seconds, and then resumes as normal.

We have tried:

1. Using the QueuePool instead of the NullPool (we switched to NullPool just to isolate the error).
2. Setting socket.setdefaulttimeout() to a low value.
3. Passing PGOPTIONS="-c statement_timeout=1000" to make the pg driver time out (see the sketch below).
4. Using autocommit=True on sessionmaker().

If we run lsof -i on the application server, we can see that the connection closes after each iteration of the loop. When the script hangs, we can see that there is one connection open which does not close, so the issue seems to be a missing timeout somewhere.

Does anyone know what could be causing this?
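For reference, here is roughly how the statement timeout and a connection timeout can be handed straight to psycopg2 through SQLAlchemy's connect_args instead of via the PGOPTIONS environment variable. This is only a sketch; the URL and timeout values are placeholders.

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# "options" and "connect_timeout" are libpq connection parameters that
# psycopg2 forwards on connect; the values here are placeholders.
engine = create_engine(
    'postgresql+psycopg2://local:<pw>@<aws-rds-host>/test',
    poolclass=NullPool,
    connect_args={
        'options': '-c statement_timeout=1000',  # ms; server-side cap on statement runtime
        'connect_timeout': 5,                    # seconds; bounds the initial connection attempt
    },
)

Neither setting bounds an already-established connection whose peer has silently gone away, which is consistent with the hang described above.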
Have you tried a Pessimistic Disconnect strategy?
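For reference, a minimal sketch of that strategy as it looks on SQLAlchemy 1.2 or later, where it is built in as pool_pre_ping (the connection URL is the same placeholder as in the script above):

from sqlalchemy import create_engine

# Pessimistic disconnect handling: each pooled connection is tested with a
# lightweight "ping" when it is checked out, and a dead connection is
# discarded and replaced transparently before the query runs.
engine = create_engine(
    'postgresql+psycopg2://local:<pw>@<aws-rds-host>/test',
    pool_pre_ping=True,
)

Note that the pre-ping only runs at checkout time, so it catches connections that died between iterations but not a connection that stalls in the middle of a commit.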
With apologies for resuscitating an old thread, I came across this issue last week: a 15-minute hang while an Amazon RDS instance was doing a failover (we are on SQLA 1.3.22). I still need to do some more work on reproducing the issue, but:

This post on SO from last month, https://stackoverflow.com/questions/65505031/sqlalchemy-ssl-syscall-timeout-coping-mechanism, mentions that PGBouncer had a similar issue (linked there). The PGBouncer developers concluded it was down to low-level TCP settings and fixed it by adjusting the TCP_USER_TIMEOUT setting, which can apparently be set on a per-connection basis.

Would it be worth investigating and adding to SQLA?
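For reference, with the libpq shipped with PostgreSQL 12 or newer on Linux this can already be done per connection without any change to SQLAlchemy, by passing the parameter through connect_args. A sketch, with placeholder values:

from sqlalchemy import create_engine

# tcp_user_timeout is a libpq connection parameter (PostgreSQL 12+ libpq,
# Linux only); the keepalive options also work with older libpq versions.
engine = create_engine(
    'postgresql+psycopg2://local:<pw>@<aws-rds-host>/test',
    connect_args={
        'tcp_user_timeout': 10000,   # ms; drop the connection if ACKs stop for 10 s
        'keepalives': 1,
        'keepalives_idle': 5,        # seconds of idle before the first keepalive probe
        'keepalives_interval': 2,    # seconds between probes
        'keepalives_count': 3,       # unanswered probes before the connection is dropped
    },
)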