Hi there.
Sometimes my app needs to create and save almost a million records in a Postgres database. I'm looking for the best way to do this, because the approaches I've used so far are very slow.
Below are the approaches I tried that were very slow, as well as the workarounds I attempted and the errors that occurred:
1 - a for loop to generate the records and add them to the transaction manager
import argparse
import getopt
import sys

from pyramid.paster import bootstrap, setup_logging
from sqlalchemy.exc import OperationalError


class Topology(object):
    def __init__(self, dbsession):
        self.dbsession = dbsession

    def autonomous_system(self):
        # SOME PROCESS THAT RETURNS array1 AND array2
        for i in range(len(array1)):
            record = Table(field1=array1[i],
                           # other fields here
                           )
            self.dbsession.add(record)


def parse_args(config_file):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'config_uri',
        help='Configuration file, e.g., pyramid.ini',
    )
    return parser.parse_args(config_file.split())


def main(argv=sys.argv[1:]):
    try:
        opts, args = getopt.getopt(argv, 'h', ['config-file='])
    except getopt.GetoptError:
        print('* Usage: topology --config-file=pyramid.ini')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('* Usage: topology --config-file=pyramid.ini')
            sys.exit()
        elif opt == '--config-file':
            config_file = arg
    args = parse_args(config_file)
    setup_logging(args.config_uri)
    env = bootstrap(args.config_uri)
    try:
        with env['request'].tm:
            dbsession = env['request'].dbsession
            t = Topology(dbsession)
            t.autonomous_system()
    except OperationalError:
        print('Database error')
As a workaround, I also tried bulk_save_objects:

s = Session()  # Session here is a sessionmaker bound to my engine
objects = [
    models.AutonomousSystem(autonomous_system=2, id_topology=68),
    models.AutonomousSystem(autonomous_system=3, id_topology=68),
    models.AutonomousSystem(autonomous_system=4, id_topology=68),
    models.AutonomousSystem(autonomous_system=5, id_topology=68),
]
s.bulk_save_objects(objects)
s.commit()
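For what it's worth, here is a minimal, self-contained sketch of the kind of bulk insert I was attempting, but using bulk_insert_mappings with plain dicts instead of ORM objects, which I understand avoids the per-object overhead. The AutonomousSystem class below is only a stand-in for my real models.AutonomousSystem, and I use an in-memory SQLite database just so the sketch runs on its own; in my app the engine would point at Postgres:

```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

# Stand-in for my real models.AutonomousSystem, just for this sketch.
class AutonomousSystem(Base):
    __tablename__ = "autonomous_system"
    id = Column(Integer, primary_key=True)
    autonomous_system = Column(Integer)
    id_topology = Column(Integer)

# In my app the engine points at Postgres; SQLite in-memory is used
# here only so the sketch runs standalone.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

s = Session()
# bulk_insert_mappings takes plain dicts and skips constructing ORM
# objects, which should be cheaper than bulk_save_objects.
rows = [{"autonomous_system": n, "id_topology": 68} for n in range(2, 6)]
s.bulk_insert_mappings(AutonomousSystem, rows)
s.commit()
print(s.query(AutonomousSystem).count())  # 4
```

Even so, I don't know whether this is fast enough at the million-record scale.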
2 - insert a pandas dataframe into the database
Another approach I tried, and the one I think has the best performance for working with almost a million records, is pandas DataFrames. Generating all the records in a DataFrame is very fast, but I haven't managed to insert the data from the DataFrame into the Postgres database.
I tried the "to_sql" method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html), but I don't know what the "engine" is in the "df.to_sql('users', con=engine)" call.
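From what I can tell from the docs, the engine is a SQLAlchemy engine created with create_engine. Here is my untested sketch of what I believe the call would look like; I use an in-memory SQLite URL only so it runs on its own, whereas my real URL would be something like "postgresql://user:password@host/dbname":

```python
import pandas as pd
import sqlalchemy

# Hypothetical URL: in my app this would be something like
# "postgresql://user:password@localhost/mydb". SQLite in-memory
# is used here only so the sketch runs standalone.
engine = sqlalchemy.create_engine("sqlite://")

df = pd.DataFrame({
    "autonomous_system": [2, 3, 4, 5],
    "id_topology": [68, 68, 68, 68],
})

# chunksize batches the INSERTs and method="multi" packs many rows
# into each statement, which should beat row-by-row inserts.
df.to_sql("autonomous_system", con=engine, if_exists="append",
          index=False, chunksize=10000, method="multi")

with engine.connect() as conn:
    count = conn.execute(
        sqlalchemy.text("SELECT COUNT(*) FROM autonomous_system")
    ).scalar()
print(count)  # 4
```

But I don't know if this is the right way to wire it up, or whether it will be fast enough.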
Please, can someone help me with these questions?
Emerson