Saving a lot of data to a PostgreSQL database in Pyramid using SQLAlchemy


Emerson Barea

Mar 20, 2020, 3:42:04 PM
to pylons-...@googlegroups.com
Hi there.

Sometimes my app needs to create and save almost a million records in a Postgres database. I'm looking for the best way to do this, because the approaches I've used so far are very slow.

I'll describe the approaches I tried that turned out to be very slow, as well as the workarounds I attempted and the problems that came up:

1 - A for loop to generate the records and add them to the transaction manager

import argparse
import getopt
import sys

from pyramid.paster import bootstrap, setup_logging
from sqlalchemy.exc import OperationalError


class Topology(object):
    def __init__(self, dbsession):
        self.dbsession = dbsession

    def autonomous_system(self):

        # SOME PROCESS THAT RETURNS array1 AND array2

        for item in array1:
            record = Table(field1=item,
                           # other fields here
                           )
            self.dbsession.add(record)


def parse_args(config_file):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'config_uri',
        help='Configuration file, e.g., pyramid.ini',
    )
    return parser.parse_args(config_file.split())


def main(argv=sys.argv[1:]):
    try:
        opts, args = getopt.getopt(argv, 'h', ['config-file='])
    except getopt.GetoptError:
        print('* Usage: topology --config-file=pyramid.ini')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('* Usage: topology --config-file=pyramid.ini')
            sys.exit()
        elif opt == '--config-file':
            config_file = arg

    args = parse_args(config_file)
    setup_logging(args.config_uri)
    env = bootstrap(args.config_uri)
    try:
        with env['request'].tm:
            # get the tm-managed dbsession before handing it to Topology
            dbsession = env['request'].dbsession
            t = Topology(dbsession)
            t.autonomous_system()
    except OperationalError:
        print('Database error')

In this case, I noticed that the "dbsession.add(record)" call inside the loop made the process very slow. I tried building the records into a list inside the loop and adding everything to the transaction manager just once outside the loop using "bulk_save_objects" (https://docs.sqlalchemy.org/en/13/orm/persistence_techniques.html#bulk-operations), but I don't understand what "s = Session()" in the documentation example is:

s = Session()
objects = [
    models.AutonomousSystem(autonomous_system=2, id_topology=68),
    models.AutonomousSystem(autonomous_system=3, id_topology=68),
    models.AutonomousSystem(autonomous_system=4, id_topology=68),
    models.AutonomousSystem(autonomous_system=5, id_topology=68),
]
s.bulk_save_objects(objects)
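
Reading further, my guess is that Session comes from a sessionmaker bound to an engine, something like the sketch below (the connection URL is just a placeholder, not my real settings), but I don't see how that fits with the dbsession that Pyramid's transaction manager controls:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# placeholder URL, not my real connection settings
engine = create_engine('postgresql://user:password@localhost/mydb')
Session = sessionmaker(bind=engine)

s = Session()
s.bulk_save_objects(objects)
s.commit()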

2 - Insert a pandas DataFrame into the database

Another way I tried, and the one I think has the best performance for working with almost a million records, is to use pandas DataFrames. Generating all the records in a DataFrame is very fast, but I haven't had any success inserting all the data from the DataFrame into the Postgres database.

I tried "to_sql" procedure (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html), but I don't know what is the "engine" in the "df.to_sql('users', con=engine)" command.

Please, can someone help me with these questions?

Emerson


Michael Merickel

Mar 20, 2020, 3:47:38 PM
to pylons-...@googlegroups.com
dbsession.add on each row is pretty much the worst-case scenario. Start with https://docs.sqlalchemy.org/en/13/_modules/examples/performance/bulk_inserts.html which shows you how to use dbsession.bulk_insert_mappings and several other approaches. The only caveat that's Pyramid-related is that since the dbsession is controlled by the tm, if you drop down to some of the core APIs you'll need to call zope.sqlalchemy.mark_changed(dbsession) at the end to have your data committed.
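
An untested sketch of what that might look like in your script (model and field names borrowed from your earlier snippet):

from zope.sqlalchemy import mark_changed

with env['request'].tm:
    dbsession = env['request'].dbsession
    mappings = [{'autonomous_system': asys, 'id_topology': 68}
                for asys in range(1000000)]
    dbsession.bulk_insert_mappings(models.AutonomousSystem, mappings)
    # bulk operations can bypass the session's normal change tracking,
    # so tell the tm explicitly that there is something to commit
    mark_changed(dbsession)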

Of course there's always the option of using the raw psycopg2 connection as well, which can get you closer to COPY-like performance - it kinda depends on your needs, but I've had success just using bulk_insert_mappings on the ORM.
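
If you do go the raw-connection route, a rough sketch using Postgres COPY (table and column names are placeholders):

import io

from zope.sqlalchemy import mark_changed

with env['request'].tm:
    dbsession = env['request'].dbsession
    # the raw psycopg2 connection underneath the SQLAlchemy session
    raw_conn = dbsession.connection().connection
    buf = io.StringIO()
    for asys in range(1000000):
        buf.write('%d\t%d\n' % (asys, 68))
    buf.seek(0)
    with raw_conn.cursor() as cur:
        cur.copy_from(buf, 'autonomous_system',
                      columns=('autonomous_system', 'id_topology'))
    # raw DBAPI writes are invisible to the tm, so mark the session changed
    mark_changed(dbsession)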

- Michael


Theron Luhn

Mar 20, 2020, 4:02:17 PM
to 'Peter Lada' via pylons-discuss
For CLI scripts like this, I usually skip Pyramid entirely and instantiate the SQLAlchemy session manually.  Won’t help your performance woes, but it is less machinery to deal with.

I second Michael's bulk_insert_mappings; it's what I usually reach for in cases like this, and the performance is good. However, with Postgres you'll need to enable batch mode, especially if the DB is over a network, otherwise performance will suffer greatly.

So my scripts usually look like this:

import sys

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

dburl = sys.argv[1]
engine = create_engine(dburl, executemany_mode='batch')
db = Session(bind=engine)
db.bulk_insert_mappings(MyTable, get_data())
db.commit()

I wouldn’t use Pandas unless you’re already handling your data with it.

Emerson Barea

Mar 20, 2020, 9:42:25 PM
to pylons-...@googlegroups.com
Thank you, Michael Merickel and Theron Luhn, for the answers.

I'll try the bulk_insert_mappings approach, but I don't know how to pass the database parameters (server name, username, password, database schema) to a connection script like the one Theron posted. I know it's a silly question, but can you help me again?
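
My guess is that they all go into the URL passed to create_engine(), something like this (all values are placeholders), but I'm not sure, especially about the schema:

from sqlalchemy import create_engine

# placeholder values for server, user, password and database
engine = create_engine(
    'postgresql+psycopg2://username:password@servername:5432/database',
    executemany_mode='batch',
)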

Thank you.

Theron Luhn

Mar 27, 2020, 1:19:31 PM
to 'J G' via pylons-discuss