executemany + postgresql

457 views
Skip to first unread message

Jon Nelson

unread,
Nov 5, 2009, 8:46:11 PM11/5/09
to sqlal...@googlegroups.com
I recently ran into an issue today where batched (inside a
transaction) I was able to achieve not more than about 9000
inserts/second to a postgresql database (same machine running the
test).

With everything exactly the same, I was able to achieve over 50,000
inserts/s to sqlite.

Now, I won't argue the relative merits of each database, but this is a
big problem for postgresql. I believe I have determined that the
psycopg2 module is to blame, and the substantial portion of the time
spent was being spent in IPC/RPC. Basically, every single insert in
this test is identical except for the values (same table and columns),
but psycopg2 (or possibly SQLAlchemy) was performing an individual
INSERT for every single row. I was *not* using the ORM.

The code was something like this:


row_values = build_a_bunch_of_dictionaries()
ins = table.insert()
t = conn.begin()
conn.execute(ins, row_values)
t.commit()

where row_values is (of course) a list of dictionaries.

What can be done here to improve the speed of bulk inserts? For
postgresql to get walloped by a factor of 5 in this area is a big
bummer.

--
Jon

Michael Bayer

unread,
Nov 5, 2009, 9:40:05 PM11/5/09
to sqlal...@googlegroups.com
it depends on the source of the speed problem. if your table has
types which do utf-8 encoding on each value, for example, that takes
up a lot of time. the sqlite backend doesn't have this requirement
but the PG one in 0.5 currently does.

we've done some work on this in 0.6 to reduce this - we now use
psycopg2's UNICODE extension, so that we expect result rows to come
back as unicode objects already. In response to this question I just
made the same change for bind parameters so that they wont be encoded
into utf-8 on the way in, so feel free to try r6484 of trunk.

Also psycopg2 is a very fast, native DBAPI so I doubt there's any
bottleneck there.





>
> --
> Jon
>
> >

Jon

unread,
Nov 6, 2009, 12:03:47 AM11/6/09
to sqlalchemy
I gave that a try and did receive a mild speed boost - from ~9000
inserts/s to 9500 +/- 200.
However, 9500 is still substantially lower than 50,000. In this case
(pathological), *all* of the values are strings, and in fact the table
doesn't even have a primary key.

> Also psycopg2 is a very fast, native DBAPI so I doubt there's any
> bottleneck there.

Granted, I'm using SA on /top/ of sqlite3 and psycopg2 (2.0.12), but
when the only thing that changes is the dburi...

Before I even posted I resorted to strace. strace immediately
confirmed my suspicion - when using psycopg2 I don't see one big fat
INSERT with lots of binds, I see one INSERT per bind, and it's this
that is ultimately killing the performance. You can easily observe
this via strace: as I'm sure you know, the communication between the
test program and postgresql takes place across a socket (unix domain
or tcp/ip). For every single set of bind params, the result is
essentially one sendto (INSERT INTO ....) and rt_sigprocmask, a poll,
and then a recvfrom and rt_sigprocmask pair. Profiling at the C level
shows that sendto accounts for *35%* of the total runtime and recvfrom
a healthy 15%. It's this enormous overhead for every single bind param
that's killing the performance.

--
Jon

Adrian von Bidder

unread,
Nov 6, 2009, 2:41:27 AM11/6/09
to sqlal...@googlegroups.com
Heyho!

On Friday 06 November 2009 02.46:11 Jon Nelson wrote:
> ... was performing an individual


> INSERT for every single row.

Don't know sqlalchemy good enough, but for big bulk imports on the SQL side,
shouldn't "COPY" be used? Which is as far as I know pg-specific / non-SQL
standard.


cheers
-- vbi

--
<Lo-lan-do> モインさん?
<nobse> Lo-lan-do: Gesundheit.
-- #debian-devel

signature.asc

Jon Nelson

unread,
Nov 6, 2009, 7:54:47 AM11/6/09
to sqlal...@googlegroups.com
On Fri, Nov 6, 2009 at 1:41 AM, Adrian von Bidder <avbi...@fortytwo.ch> wrote:
> Heyho!
>
> On Friday 06 November 2009 02.46:11 Jon Nelson wrote:
>> ... was performing an individual
>> INSERT for every single row.
>
> Don't know sqlalchemy good enough, but for big bulk imports on the SQL side,
> shouldn't "COPY" be used?  Which is as far as I know pg-specific / non-SQL
> standard.

Of course I could use COPY, but I wanted to avoid that. Using the same
data, I did some quick tests (via psql, although I believe psycopg2
exposes the functionality as well) and got insert rates that were
really, really impressive (anywhere from 40K to 150K records/s, highly
variable but still very very fast). The biggest problem with COPY is
that the data file has to be formatted for entry and has to be
physically available to the server.

Ultimately, even though postgresql has so many advantages over sqlite,
the (seeming) lack of batch inserts (at least from python) is a big
hurt. Maybe I need to root around in postgresql's C API to see if
there is something I could do.


--
Jon

Mike Conley

unread,
Nov 6, 2009, 9:08:28 AM11/6/09
to sqlal...@googlegroups.com
I don't use postgresql, but it looks like you are trying to achieve this

INSERT INTO table (a,b,c) VALUES
   ('a1','b1','c1),
   ('a2','b2','c2),
      ... repeated many times ...

which is not directly supported by SQLAlchemy, psycopg2 or both.


Did you consider an approach like this?  There could be many variations on how to construct the data rows and join them to the sql, but the theme is build the desired sql and give it to SA to pass through.

-- not tested with a database --

cols = ('col1','col2','col3','col4','col5','col6')
data = (
    ('1-1','1-2','1-3','1-4','1-5','1-6'),
    ('2-1','2-2','2-3','2-4','2-5','2-6'),
    ('3-1','3-2','3-3','3-4','3-5','3-6'),
)

sql = "INSERT INTO MYTABLE (%s) VALUES" % (','.join(cols),)
for row in data:
    print str(row)
sql += "\n    "
sql += ",\n    ".join([str(x) for x in data])
sql += ' ;'

conn.execute(sql)


Adrian von Bidder

unread,
Nov 6, 2009, 8:40:44 AM11/6/09
to sqlal...@googlegroups.com
On Friday 06 November 2009 13.54:47 Jon Nelson wrote:
> The biggest problem with COPY is
> that the data file has to be formatted for entry and has to be
> physically available to the server.

I may be way off, but didn't recent versions add support for COPY from stdin
in psql? (which, I'd think, should have some corresponding interface in C)

So the data wouldn't need to be on a file on the server. The data would
still need to be massaged into the right format, yes.

cheers
-- vbi

--
featured link: Debian Bookmark Collection - http://bookmarks.debian.net/

signature.asc

Michael Bayer

unread,
Nov 6, 2009, 10:57:52 AM11/6/09
to sqlal...@googlegroups.com

On Nov 6, 2009, at 12:03 AM, Jon wrote:

>
> I gave that a try and did receive a mild speed boost - from ~9000
> inserts/s to 9500 +/- 200.

woohoo !

> Before I even posted I resorted to strace. strace immediately
> confirmed my suspicion - when using psycopg2 I don't see one big fat
> INSERT with lots of binds, I see one INSERT per bind, and it's this
> that is ultimately killing the performance. You can easily observe
> this via strace: as I'm sure you know, the communication between the
> test program and postgresql takes place across a socket (unix domain
> or tcp/ip). For every single set of bind params, the result is
> essentially one sendto (INSERT INTO ....) and rt_sigprocmask, a poll,
> and then a recvfrom and rt_sigprocmask pair. Profiling at the C level
> shows that sendto accounts for *35%* of the total runtime and recvfrom
> a healthy 15%. It's this enormous overhead for every single bind param
> that's killing the performance.

have you asked about this on the psycopg2 mailing list ? its at http://mail.python.org/mailman/listinfo/python-list
. Let me know if you do, because I'll get out the popcorn... :)

We also support the pg8000 DBAPI in 0.6. I doubt its doing something
differently here but feel free to connect with postgresql+pg8000://
and see what you get.


Jon Nelson

unread,
Nov 7, 2009, 12:53:14 PM11/7/09
to sqlal...@googlegroups.com
On Fri, Nov 6, 2009 at 9:57 AM, Michael Bayer <mik...@zzzcomputing.com> wrote:
>> Before I even posted I resorted to strace. strace immediately
>> confirmed my suspicion - when using psycopg2 I don't see one big fat
>> INSERT with lots of binds, I see one INSERT per bind, and it's this
>> that is ultimately killing the performance. You can easily observe
>> this via strace: as I'm sure you know, the communication between the
>> test program and postgresql takes place across a socket (unix domain
>> or tcp/ip). For every single set of bind params, the result is
>> essentially one sendto (INSERT INTO ....) and rt_sigprocmask, a poll,
>> and then a recvfrom and rt_sigprocmask pair.  Profiling at the C level
>> shows that sendto accounts for *35%* of the total runtime and recvfrom
>> a healthy 15%. It's this enormous overhead for every single bind param
>> that's killing the performance.
>
> have you asked about this on the psycopg2 mailing list ?   its at http://mail.python.org/mailman/listinfo/python-list
>  .   Let me know if you do, because I'll get out the popcorn... :)

That's the python list.
Anyway, I did some more testing. executemany performance is not any
better than looping over execute, because that's all that executemany
appears to do in any case.

However, I manually built a bit fat set of bind params (bypassing
sqlalchemy directly) and got a SUBSTANTIAL performance improvement.
Postgresql as of 8.2 supports /sets/ of bind params, it'd be nice if
pg8000 or psycopg2 (or both) supported that. Building 25000 bind
params by hand is not fun, but it got me to just shy of 50K
inserts/second.

> We also support the pg8000 DBAPI in 0.6.  I doubt its doing something
> differently here but feel free to connect with postgresql+pg8000://
> and see what you get.

I tried pg8000 but I got an error:

...

return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.DBAPIError: (TypeError) connect() takes at least 1
non-keyword argument (0 given) None None

--
Jon

Michael Bayer

unread,
Nov 7, 2009, 12:58:16 PM11/7/09
to sqlal...@googlegroups.com
On Nov 7, 2009, at 12:53 PM, Jon Nelson wrote:


have you asked about this on the psycopg2 mailing list ?   its at http://mail.python.org/mailman/listinfo/python-list
 .   Let me know if you do, because I'll get out the popcorn... :)

That's the python list.




I tried pg8000 but I got an error:

...

   return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.DBAPIError: (TypeError) connect() takes at least 1
non-keyword argument (0 given) None None

i can't reproduce that.   this is with the latest trunk:

from sqlalchemy import *

e = create_engine('postgresql+pg8000://scott:tiger@localhost/test')

print e.execute("select 1").fetchall()

produces:

[(1,)]


Jon Nelson

unread,
Nov 7, 2009, 1:30:08 PM11/7/09
to sqlal...@googlegroups.com

Apparently, pg8000 requires host, user and pass (or at least one of those).

Of course, then when I am connected, I get a traceback:

...
metadata.drop_all()
File "/usr/lib64/python2.6/site-packages/sqlalchemy/schema.py", line
1871, in drop_all
bind.drop(self, checkfirst=checkfirst, tables=tables)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1336, in drop
self._run_visitor(ddl.SchemaDropper, entity,
connection=connection, **kwargs)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1360, in _run_visitor
visitorcallable(self.dialect, conn, **kwargs).traverse(element)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/visitors.py",
line 86, in traverse
return traverse(obj, self.__traverse_options__, self._visitor_dict)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/visitors.py",
line 197, in traverse
return traverse_using(iterate(obj, opts), obj, visitors)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/visitors.py",
line 191, in traverse_using
meth(target)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/ddl.py",
line 89, in visit_metadata
collection = [t for t in reversed(sql_util.sort_tables(tables)) if
self._can_drop(t)]
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/ddl.py",
line 104, in _can_drop
return not self.checkfirst or
self.dialect.has_table(self.connection, table.name,
schema=table.schema)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/dialects/postgresql/base.py",
line 611, in has_table
type_=sqltypes.Unicode)]
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 991, in execute
return Connection.executors[c](self, object, multiparams, params)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1053, in _execute_clauseelement
return self.__execute_context(context)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1076, in __execute_context
self._cursor_execute(context.cursor, context.statement,
context.parameters[0], context=context)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1136, in _cursor_execute
self.dialect.do_execute(cursor, statement, parameters, context=context)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py",
line 207, in do_execute
cursor.execute(statement, parameters)
File "pg8000/dbapi.py", line 243, in _fn
return fn(self, *args, **kwargs)
File "pg8000/dbapi.py", line 312, in execute
self._execute(operation, args)
File "pg8000/dbapi.py", line 317, in _execute
self.cursor.execute(new_query, *new_args)
File "pg8000/interface.py", line 303, in execute
self._stmt = PreparedStatement(self.connection, query,
statement_name="", *[{"type": type(x), "value": x} for x in args])
File "pg8000/interface.py", line 108, in __init__
self._parse_row_desc = self.c.parse(self._statement_name, statement, types)
File "pg8000/protocol.py", line 918, in _fn
return fn(self, *args, **kwargs)
File "pg8000/protocol.py", line 1069, in parse
self._send(Parse(statement, qs, param_types))
File "pg8000/protocol.py", line 975, in _send
data = msg.serialize()
File "pg8000/protocol.py", line 121, in serialize
val = struct.pack("!i", len(val) + 4) + val
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8d in position
3: ordinal not in range(128)

--
Jon

Michael Bayer

unread,
Nov 7, 2009, 4:02:50 PM11/7/09
to sqlal...@googlegroups.com

On Nov 7, 2009, at 1:30 PM, Jon Nelson wrote:

> File "pg8000/protocol.py", line 121, in serialize
> val = struct.pack("!i", len(val) + 4) + val
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8d in position
> 3: ordinal not in range(128)

make sure you're on the latest tip of pg8000, which these days seems
to be at http://github.com/mfenniak/pg8000/tree/trunk . It also
adheres to the client encoding of your PG database, which you should
make sure is on utf-8.

But its not going to render an INSERT...VALUES with multiple
parameters in one big string, so if that's your goal you need to
generate that string yourself. I'm surprised that sqlite, per your
observation, parses an INSERT statement and re-renders it with
multiple VALUES clauses ? very surprising behavior.

Jon Nelson

unread,
Nov 7, 2009, 5:59:28 PM11/7/09
to sqlal...@googlegroups.com
On Sat, Nov 7, 2009 at 3:02 PM, Michael Bayer <mik...@zzzcomputing.com> wrote:
>
>
> On Nov 7, 2009, at 1:30 PM, Jon Nelson wrote:
>
>>  File "pg8000/protocol.py", line 121, in serialize
>>    val = struct.pack("!i", len(val) + 4) + val
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8d in position
>> 3: ordinal not in range(128)
>
> make sure you're on the latest tip of pg8000, which these days seems
> to be at http://github.com/mfenniak/pg8000/tree/trunk .  It also
> adheres to the client encoding of your PG database, which you should
> make sure is on utf-8.

Ah. I was running the latest /released/ version - I generally avoid
running 'tip/HEAD/whatever' except during testing. Since I don't
expect pg8000 to have any substantially different behavior, it's
probably not even worth the effort.

<snip/>

> I'm surprised that sqlite, per your
> observation, parses an INSERT statement and re-renders it with
> multiple VALUES clauses ?    very surprising behavior.

I'm not sure I said that - I certainly didn't intend that.

Ultimately, the IPC costs associated with each set of bind params (one
per row) just murders psycopg2 when compared to sqlite. There isn't
any sqlite RPC per-se, since it's always local. I can only assume
that sqlite defers locking the database until the start of a
transaction, and since sqlite isn't multi-writer aware the overhead of
doing so is minimal.

I wasn't comparing sqlite and postgresql per se - there isn't much of
a comparison in my mind once you start needing all of the features,
stability, and power that postgresql brings. However, I was
disappointed to see that psycopg2 is not making use of the (postgresql
8.2 and newer) multi-bind param INSERT stuff, as this ultimately
reduces the IPC overhead to a very small amount.

The cost of a single call to postgresql might be small, but when you
multiply it by hundreds of thousands or millions it suddenly becomes a
deciding factor in some situations.

--
Jon

Reply all
Reply to author
Forward
0 new messages