With everything exactly the same, I was able to achieve over 50,000
inserts/s to sqlite.
Now, I won't argue the relative merits of each database, but this is a
big problem for postgresql. I believe I have determined that the
psycopg2 module is to blame, and the substantial portion of the time
spent was being spent in IPC/RPC. Basically, every single insert in
this test is identical except for the values (same table and columns),
but psycopg2 (or possibly SQLAlchemy) was performing an individual
INSERT for every single row. I was *not* using the ORM.
The code was something like this:
row_values = build_a_bunch_of_dictionaries()
ins = table.insert()
t = conn.begin()
conn.execute(ins, row_values)
t.commit()
where row_values is (of course) a list of dictionaries.
What can be done here to improve the speed of bulk inserts? For
postgresql to get walloped by a factor of 5 in this area is a big
bummer.
--
Jon
On Friday 06 November 2009 02.46:11 Jon Nelson wrote:
> ... was performing an individual
> INSERT for every single row.
Don't know sqlalchemy good enough, but for big bulk imports on the SQL side,
shouldn't "COPY" be used? Which is as far as I know pg-specific / non-SQL
standard.
cheers
-- vbi
--
<Lo-lan-do> モインさん?
<nobse> Lo-lan-do: Gesundheit.
-- #debian-devel
Of course I could use COPY, but I wanted to avoid that. Using the same
data, I did some quick tests (via psql, although I believe psycopg2
exposes the functionality as well) and got insert rates that were
really, really impressive (anywhere from 40K to 150K records/s, highly
variable but still very very fast). The biggest problem with COPY is
that the data file has to be formatted for entry and has to be
physically available to the server.
Ultimately, even though postgresql has so many advantages over sqlite,
the (seeming) lack of batch inserts (at least from python) is a big
hurt. Maybe I need to root around in postgresql's C API to see if
there is something I could do.
--
Jon
I may be way off, but didn't recent versions add support for COPY from stdin
in psql? (which, I'd think, should have some corresponding interface in C)
So the data wouldn't need to be on a file on the server. The data would
still need to be massaged into the right format, yes.
cheers
-- vbi
--
featured link: Debian Bookmark Collection - http://bookmarks.debian.net/
>
> I gave that a try and did receive a mild speed boost - from ~9000
> inserts/s to 9500 +/- 200.
woohoo !
> Before I even posted I resorted to strace. strace immediately
> confirmed my suspicion - when using psycopg2 I don't see one big fat
> INSERT with lots of binds, I see one INSERT per bind, and it's this
> that is ultimately killing the performance. You can easily observe
> this via strace: as I'm sure you know, the communication between the
> test program and postgresql takes place across a socket (unix domain
> or tcp/ip). For every single set of bind params, the result is
> essentially one sendto (INSERT INTO ....) and rt_sigprocmask, a poll,
> and then a recvfrom and rt_sigprocmask pair. Profiling at the C level
> shows that sendto accounts for *35%* of the total runtime and recvfrom
> a healthy 15%. It's this enormous overhead for every single bind param
> that's killing the performance.
have you asked about this on the psycopg2 mailing list ? its at http://mail.python.org/mailman/listinfo/python-list
. Let me know if you do, because I'll get out the popcorn... :)
We also support the pg8000 DBAPI in 0.6. I doubt its doing something
differently here but feel free to connect with postgresql+pg8000://
and see what you get.
That's the python list.
Anyway, I did some more testing. executemany performance is not any
better than looping over execute, because that's all that executemany
appears to do in any case.
However, I manually built a bit fat set of bind params (bypassing
sqlalchemy directly) and got a SUBSTANTIAL performance improvement.
Postgresql as of 8.2 supports /sets/ of bind params, it'd be nice if
pg8000 or psycopg2 (or both) supported that. Building 25000 bind
params by hand is not fun, but it got me to just shy of 50K
inserts/second.
> We also support the pg8000 DBAPI in 0.6. I doubt its doing something
> differently here but feel free to connect with postgresql+pg8000://
> and see what you get.
I tried pg8000 but I got an error:
...
return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.DBAPIError: (TypeError) connect() takes at least 1
non-keyword argument (0 given) None None
--
Jon
have you asked about this on the psycopg2 mailing list ? its at http://mail.python.org/mailman/listinfo/python-list. Let me know if you do, because I'll get out the popcorn... :)
That's the python list.
I tried pg8000 but I got an error:
...
return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.DBAPIError: (TypeError) connect() takes at least 1
non-keyword argument (0 given) None None
Apparently, pg8000 requires host, user and pass (or at least one of those).
Of course, then when I am connected, I get a traceback:
...
metadata.drop_all()
File "/usr/lib64/python2.6/site-packages/sqlalchemy/schema.py", line
1871, in drop_all
bind.drop(self, checkfirst=checkfirst, tables=tables)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1336, in drop
self._run_visitor(ddl.SchemaDropper, entity,
connection=connection, **kwargs)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1360, in _run_visitor
visitorcallable(self.dialect, conn, **kwargs).traverse(element)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/visitors.py",
line 86, in traverse
return traverse(obj, self.__traverse_options__, self._visitor_dict)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/visitors.py",
line 197, in traverse
return traverse_using(iterate(obj, opts), obj, visitors)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/visitors.py",
line 191, in traverse_using
meth(target)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/ddl.py",
line 89, in visit_metadata
collection = [t for t in reversed(sql_util.sort_tables(tables)) if
self._can_drop(t)]
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/ddl.py",
line 104, in _can_drop
return not self.checkfirst or
self.dialect.has_table(self.connection, table.name,
schema=table.schema)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/dialects/postgresql/base.py",
line 611, in has_table
type_=sqltypes.Unicode)]
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 991, in execute
return Connection.executors[c](self, object, multiparams, params)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1053, in _execute_clauseelement
return self.__execute_context(context)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1076, in __execute_context
self._cursor_execute(context.cursor, context.statement,
context.parameters[0], context=context)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py",
line 1136, in _cursor_execute
self.dialect.do_execute(cursor, statement, parameters, context=context)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py",
line 207, in do_execute
cursor.execute(statement, parameters)
File "pg8000/dbapi.py", line 243, in _fn
return fn(self, *args, **kwargs)
File "pg8000/dbapi.py", line 312, in execute
self._execute(operation, args)
File "pg8000/dbapi.py", line 317, in _execute
self.cursor.execute(new_query, *new_args)
File "pg8000/interface.py", line 303, in execute
self._stmt = PreparedStatement(self.connection, query,
statement_name="", *[{"type": type(x), "value": x} for x in args])
File "pg8000/interface.py", line 108, in __init__
self._parse_row_desc = self.c.parse(self._statement_name, statement, types)
File "pg8000/protocol.py", line 918, in _fn
return fn(self, *args, **kwargs)
File "pg8000/protocol.py", line 1069, in parse
self._send(Parse(statement, qs, param_types))
File "pg8000/protocol.py", line 975, in _send
data = msg.serialize()
File "pg8000/protocol.py", line 121, in serialize
val = struct.pack("!i", len(val) + 4) + val
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8d in position
3: ordinal not in range(128)
--
Jon
> File "pg8000/protocol.py", line 121, in serialize
> val = struct.pack("!i", len(val) + 4) + val
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8d in position
> 3: ordinal not in range(128)
make sure you're on the latest tip of pg8000, which these days seems
to be at http://github.com/mfenniak/pg8000/tree/trunk . It also
adheres to the client encoding of your PG database, which you should
make sure is on utf-8.
But its not going to render an INSERT...VALUES with multiple
parameters in one big string, so if that's your goal you need to
generate that string yourself. I'm surprised that sqlite, per your
observation, parses an INSERT statement and re-renders it with
multiple VALUES clauses ? very surprising behavior.
Ah. I was running the latest /released/ version - I generally avoid
running 'tip/HEAD/whatever' except during testing. Since I don't
expect pg8000 to have any substantially different behavior, it's
probably not even worth the effort.
<snip/>
> I'm surprised that sqlite, per your
> observation, parses an INSERT statement and re-renders it with
> multiple VALUES clauses ? very surprising behavior.
I'm not sure I said that - I certainly didn't intend that.
Ultimately, the IPC costs associated with each set of bind params (one
per row) just murders psycopg2 when compared to sqlite. There isn't
any sqlite RPC per-se, since it's always local. I can only assume
that sqlite defers locking the database until the start of a
transaction, and since sqlite isn't multi-writer aware the overhead of
doing so is minimal.
I wasn't comparing sqlite and postgresql per se - there isn't much of
a comparison in my mind once you start needing all of the features,
stability, and power that postgresql brings. However, I was
disappointed to see that psycopg2 is not making use of the (postgresql
8.2 and newer) multi-bind param INSERT stuff, as this ultimately
reduces the IPC overhead to a very small amount.
The cost of a single call to postgresql might be small, but when you
multiply it by hundreds of thousands or millions it suddenly becomes a
deciding factor in some situations.
--
Jon