prepared statements


Rick Otten

Feb 6, 2014, 12:34:12 PM
to sqlal...@googlegroups.com
Hello,

I'm using SQLAlchemy 0.8.4 with PostgreSQL 9.3.  I have a situation where I need to run the same set of queries millions of times in one session.  I think it would be very helpful from a performance perspective if I could prepare my query ahead of time.  (The data under the queries is not changing significantly during a run.)

Is this possible?  Is SQLAlchemy already doing it behind the scenes for me magically?

I was imagining/hoping I'd find something like this:

   # prepare the query:
   myPreparedQuery = mySession.query(stuff).filter(parameter definitions).prepare()

   # run the query whenever I need it during my session:
   myPreparedQuery.parameters(stuff).fetchall()

Unfortunately query.prepare() appears to be for 2-phase commits, which I don't think are the same thing as a prepared statement...

I have not found much documentation, nor even many discussion threads on the topic, and what I have found has not been clear.  I apologize if I missed something somewhere and this is old-hat-FAQ material.

Suggestions?



Michael Bayer

Feb 6, 2014, 1:08:12 PM
to sqlal...@googlegroups.com

On Feb 6, 2014, at 12:34 PM, Rick Otten <rottenw...@gmail.com> wrote:

> Hello,
>
> I'm using SQLAlchemy 0.8.4 with PostgreSQL 9.3. I have a situation where I need to run the same set of queries millions of times in one session. I think it would be very helpful from a performance perspective if I could prepare my query ahead of time. (The data under the queries is not changing significantly during a run.)
>
> Is this possible? Is SQLAlchemy already doing it behind the scenes for me magically?

the Python DBAPI (see http://www.python.org/dev/peps/pep-0249/) doesn’t have an explicit “prepared statements” construct. what it does have is the “executemany()” command, which passes a statement and a list of parameter sets to the DBAPI, which can then make the decision to use prepared statements or whatever other means it prefers to optimize the approach.

SQLAlchemy documents the use of executemany() most fully in the Core tutorial:

http://docs.sqlalchemy.org/en/rel_0_9/core/tutorial.html#executing-multiple-statements
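
For illustration, a minimal sketch of that pattern with Core; the engine URL, table, and column names here are made up:

    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

    # hypothetical engine URL and table, purely for illustration
    engine = create_engine("postgresql+psycopg2://scott:tiger@localhost/test")
    metadata = MetaData()
    users = Table("users", metadata,
                  Column("id", Integer, primary_key=True),
                  Column("name", String(50)))
    metadata.create_all(engine)

    with engine.begin() as conn:
        # a list of parameter dicts makes SQLAlchemy call the DBAPI's
        # executemany(), which may batch or prepare as it sees fit
        conn.execute(users.insert(), [
            {"name": "alice"},
            {"name": "bob"},
            {"name": "carol"},
        ])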


> I was imagining/hoping I'd find something like this:
>
> # prepare the query:
> myPreparedQuery = mySession.query(stuff).filter(parameter definitions).prepare()
>
> # run the query whenever I need it during my session:
> myPreparedQuery.parameters(stuff).fetchall()

prepared statements don’t really apply much to SELECT statements; the performance gains from such are marginal, as you typically invoke a particular SELECT just once within a transaction, and prepared statements don’t necessarily carry across transaction or connection boundaries on backends.

There are various performance concerns related to SELECT queries, and solving them depends much on identifying where a particular performance issue resides. There’s overhead on the side of Python/SQLAlchemy formulating the string statement, there’s overhead on the side of passing it to the database to be parsed (which is by far the most infinitesimal part), there’s overhead on retrieving raw rows and columns, and there’s Python/SQLAlchemy overhead on marshaling those rows into Python objects, or particularly ORM objects, which are very expensive, relatively speaking.

For these areas of overhead there are different strategies, some more exotic than others, of minimizing overhead, and you’d want to dig into each one individually after doing profiling. For background on profiling, see http://stackoverflow.com/questions/1171166/how-can-i-profile-a-sqlalchemy-powered-application/1175677#1175677 .

I also have a writeup on the performance differences as one moves between core and ORM as well as between executemany() and non, which is here: http://stackoverflow.com/questions/11769366/why-is-sqlalchemy-insert-with-sqlite-25-times-slower-than-using-sqlite3-directly/11769768#11769768 .



Michael Bayer

Feb 6, 2014, 1:16:52 PM
to sqlal...@googlegroups.com

On Feb 6, 2014, at 12:34 PM, Rick Otten <rottenw...@gmail.com> wrote:

> Hello,
>
> I'm using SQLAlchemy 0.8.4 with PostgreSQL 9.3. I have a situation where I need to run the same set of queries millions of times in one session.

After sending that, I just read the words “same set of queries millions of times in one session”. That raises a red flag for me. If it were me and I had to pull millions of rows from a database I’d be looking for ways to SELECT lots of rows at once, in batches. Millions of individual queries with no option for batching suggests that each subsequent query somehow relies upon the results of the previous one, which would be a pretty exotic case in itself but I still might see if a more advanced feature like window functions can be used to help with that (a window function lets you write criteria against the rows that have already been returned thus far).

That said, if you truly need to run millions of queries, you might want to consider using psycopg2 directly, or at best you’d use only SQLAlchemy Core (no ORM) and probably make use of the compiled_cache feature (see http://docs.sqlalchemy.org/en/rel_0_9/core/connections.html?highlight=execution_options#sqlalchemy.engine.Connection.execution_options.params.compiled_cache). It’s going to be a time consuming operation in any case.
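
To make that concrete, here is a rough sketch of the compiled_cache option; the engine URL, table, and parameter names are hypothetical, it uses the 0.8/0.9-era Core syntax, and it only saves SQLAlchemy-side statement compilation, not the database’s plan:

    from sqlalchemy import (create_engine, MetaData, Table, Column, Integer,
                            String, select, bindparam)

    # hypothetical engine URL and table
    engine = create_engine("postgresql+psycopg2://scott:tiger@localhost/test")
    metadata = MetaData()
    things = Table("things", metadata,
                   Column("id", Integer, primary_key=True),
                   Column("payload", String(200)))

    cache = {}   # compiled forms get stored here, keyed on the statement
    conn = engine.connect().execution_options(compiled_cache=cache)

    # reuse the *same* statement object with a bound parameter so the
    # cache actually gets hit on every execution
    stmt = select([things]).where(things.c.id == bindparam("ident"))
    for ident in (1, 2, 3):
        rows = conn.execute(stmt, {"ident": ident}).fetchall()
    # this skips SQLAlchemy's re-compilation of the SQL string each time;
    # it does not cache the database's query plan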







Claudio Freire

Feb 6, 2014, 1:31:32 PM
to sqlal...@googlegroups.com
On Thu, Feb 6, 2014 at 3:08 PM, Michael Bayer <mik...@zzzcomputing.com> wrote:
>> I was imagining/hoping I'd find something like this:
>>
>> # prepare the query:
>> myPreparedQuery = mySession.query(stuff).filter(parameter definitions).prepare()
>>
>> # run the query whenever I need it during my session:
>> myPreparedQuery.parameters(stuff).fetchall()
>
> prepared statements don't really apply much to SELECT statements, the performance gains from such are marginal as you typically invoke a particular SELECT just once within a transaction, and prepared statements don't necessarily carry across transaction or connection boundaries on backends.

That's simply false. Because of the following:

> There are various performance concerns related to SELECT queries and solving them depends much on identifying where a particular performance issue resides. There's overhead on the side of Python/SQLAlchemy formulating the string statement, theres overhead on the side of passing it to the database to be parsed (which is by far the most infinitesimal part)

That part is most definitely not always so infinitesimally small. I
recall one case (granted, one among so many) of a complex query that
was very fast to execute. The query had dozens of joins, so it took
considerable planning time on the database side. Planning took around
300ms, where execution took only 25ms. So, having spotted the
bottleneck, I switched to using prepared statements (it was a query
that was run several times in the session, not millions of times, but
several) with different arguments. The benefit was considerable.

To do that (I was using SQLA 0.5) I had to compile the select object
and generate the query string. I used Text in the bindparams to
replace them with "$1", "$2", etc... (as postgres likes it), and then
built a "PREPARE" statement and an "EXECUTE" one. It was rather
simple, having SQLA generate the query string.

The result was a 10-fold increase in performance.

DBAPI doesn't have direct support for prepared statements, but hacking
them isn't that difficult. You just have to make sure you send the
PREPARE, EXECUTE and the DEALLOCATE (never forget the DEALLOCATE) on
the same connection. Which is easy, just ask SQLAlchemy's session for
its connection and use it.
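
For anyone wanting to try the same thing, a bare-bones sketch of that hack; the table, column, handle name, and engine URL are all hypothetical:

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://scott:tiger@localhost/test")
    conn = engine.connect()   # or session.connection() for the Session's connection

    # PREPARE once; postgres plans the statement a single time
    conn.execute(text(
        "PREPARE fetch_users (integer) AS "
        "SELECT id, name FROM users WHERE age > $1"))
    try:
        # EXECUTE as often as needed with different arguments
        for min_age in (18, 21, 65):
            rows = conn.execute(text("EXECUTE fetch_users(:min_age)"),
                                {"min_age": min_age}).fetchall()
    finally:
        # DEALLOCATE on the same connection (never forget the DEALLOCATE)
        conn.execute(text("DEALLOCATE fetch_users"))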

Jonathan Vanasco

Feb 6, 2014, 2:00:46 PM
to sqlal...@googlegroups.com
just wondering -- would it be possible to mimic this behavior using a custom view for this select (in postgresql) and then querying that? i think the query planner might only run on the creation.

Claudio Freire

Feb 6, 2014, 2:18:51 PM
to sqlal...@googlegroups.com
I think views don't cache the plan, they're handled as rules.

What you'd need is a function (pg's version of stored procedures).

Rick Otten

Feb 6, 2014, 2:29:44 PM
to sqlal...@googlegroups.com
Thanks for the feedback.  This is a very interesting and helpful set of answers!

I'm merging and deduping several large data sets.  I have a set of business rules to identify if a new incoming row matches an existing row before we decide if it really is a new row or not.  Each business rule is another check against the existing data set - which is too large to just pull back and examine in memory.  Whenever we update our business rules, or discover new data to merge, I need to go back and run through another de-duping, so this is a regularly occurring task.

I actually do run thousands of queries and updates (with flush()) in a single transaction at a time to minimize commit overhead.  executemany doesn't really work as a model for me because I need to check each row against the pending data as well as the already committed updates/inserts, and I need to branch my actions based on whether a business rule matched, and which one.

So the more milliseconds I can shave off the read queries (most of the action is at the database), the better.  It is possible that the overhead of managing prepared statement parameters exceeds what is saved by not having to reprocess the plan every time the queries run.  I was hoping it would be easy to try it and see.  I'm now very interested in some of the discussions on profiling the code and may pursue some of that.

I really like implementing my business rules in SQLAlchemy instead of psycopg.  It makes the code much cleaner and more readable.

I'm already also tuning the database, adding functional indexes, and that sort of thing. 

Thanks Claudio - I'll mull over the pros and cons of explicitly managing the connections to prepare the statements vs just being patient while the job runs vs gains I might achieve elsewhere.








Claudio Freire

Feb 6, 2014, 2:37:05 PM
to sqlal...@googlegroups.com
On Thu, Feb 6, 2014 at 4:29 PM, Rick Otten <rottenw...@gmail.com> wrote:
> Thanks Claudio - I'll mull over the pros and cons of explicitly managing the
> connections to prepare the statements vs just being patient while the job
> runs vs gains I might achieve elsewhere.

Remember, prepared statements will only help if planning time is a
considerable fraction of execution time. The queries in need of this
would be complex. Try to EXPLAIN ANALYZE the queries you repeatedly
execute to see whether preparing them is worth the hassle. I don't
remember which version of postgres introduced planning time in the
explain output, but if you don't have it, just take the difference
between the reported execution time and what you time from your end;
save for network latency, that should be equal to what you'd save by
preparing.
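
If it helps, here is one quick way to run that check from Python; the query and the engine URL are placeholders:

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://scott:tiger@localhost/test")
    conn = engine.connect()

    # EXPLAIN ANALYZE runs the query and returns one text column per plan line
    for (line,) in conn.execute(text(
            "EXPLAIN ANALYZE SELECT * FROM users WHERE age > 21")):
        print(line)
    # newer postgres versions report planning time and execution time as
    # separate lines; if planning dominates, preparing is worth a look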

Michael Bayer

Feb 6, 2014, 2:38:13 PM
to sqlal...@googlegroups.com

On Feb 6, 2014, at 1:31 PM, Claudio Freire <klauss...@gmail.com> wrote:

>
>> There are various performance concerns related to SELECT queries and solving them depends much on identifying where a particular performance issue resides. There's overhead on the side of Python/SQLAlchemy formulating the string statement, theres overhead on the side of passing it to the database to be parsed (which is by far the most infinitesimal part)
>
> That part is most definitely not always so infinitesimally small. I
> recall one case (granted, one among so many) of a complex query that
> was very fast to execute. The query had dozens of joins, so it took
> considerable planning time on the database side. Planning took around
> 300ms, where execution took only 25ms. So, having spotted the
> bottleneck, I switched to using prepared statements (it was a query
> that was run several times in the session, not millions of times, but
> several) with different arguments. The benefit was considerable.

ideally the database would know how to cache plans based on the string form of the statement. Oracle does this. It appears that PostgreSQL does not unless you specifically use a prepared statement.

>
> DBAPI doesn't have direct support for prepared statements, but hacking
> them isn't that difficult. You just have to make sure you send the
> PREPARE, EXECUTE and the DEALLOCATE (never forget the DEALLOCATE) on
> the same connection. Which is easy, just ask SQLAlchemy's session for
> its connection and use it.

psycopg2 is an extremely mature library and I find it curious that such a feature has not been added. They’ve posted their rationale here: http://initd.org/psycopg/articles/2012/10/01/prepared-statements-psycopg/

in any case, prepared statements aren’t part of DBAPI nor built in to psycopg2 yet so this is out of SQLAlchemy’s realm, thankfully. If you want to use the recipe at that link it would have to be hand-rolled.



Claudio Freire

Feb 6, 2014, 2:43:04 PM
to sqlal...@googlegroups.com
On Thu, Feb 6, 2014 at 4:38 PM, Michael Bayer <mik...@zzzcomputing.com> wrote:
> On Feb 6, 2014, at 1:31 PM, Claudio Freire <klauss...@gmail.com> wrote:
>
>>
>>> There are various performance concerns related to SELECT queries and solving them depends much on identifying where a particular performance issue resides. There's overhead on the side of Python/SQLAlchemy formulating the string statement, theres overhead on the side of passing it to the database to be parsed (which is by far the most infinitesimal part)
>>
>> That part is most definitely not always so infinitesimally small. I
>> recall one case (granted, one among so many) of a complex query that
>> was very fast to execute. The query had dozens of joins, so it took
>> considerable planning time on the database side. Planning took around
>> 300ms, where execution took only 25ms. So, having spotted the
>> bottleneck, I switched to using prepared statements (it was a query
>> that was run several times in the session, not millions of times, but
>> several) with different arguments. The benefit was considerable.
>
> ideally the database would know how to cache plans based on the string form of the statement. Oracle does this. It appears that Postrgresql does not unless you specifically use a prepared statement.

Well, it cannot. At least not simplistically, because postgres uses
the literals in the statement (the ones likely to change) to
specialize plan cost, and different values can produce different plans
(depending on stats). For instance, a query with a where clause
"deleted=false" that selects 99% of a table would do a sequential
scan, whereas "deleted=true" might use an index (if there was such an
index).

They thus don't cache plans unless specifically asked for (with
prepare), because those plans may very well be inferior (they don't
use value-specific knowledge).

Michael Bayer

Feb 6, 2014, 2:47:12 PM
to sqlal...@googlegroups.com
right, but from an API POV, it's more tedious since we have to track the handle on the outside.

This could even be implemented as a cursor event, in conjunction with a custom execution option and possibly using compiled_cache as well.



Claudio Freire

Feb 6, 2014, 2:58:13 PM
to sqlal...@googlegroups.com
That would be tricky. You'd have to maintain a tally of which
connection prepared which query (think the complications of
multi-engine setups), and make sure you deallocate the handle before
returning the connection to the pool. Not to mention coming up with
unique handles.

Michael Bayer

Feb 6, 2014, 3:13:06 PM
to sqlal...@googlegroups.com
you can use a per-connection cache using connection.info, and you can also manage the lifecycle of the connection and the cache of items with connection and/or pool events as well (e.g. if connection.rollback() clears out prepared statements). lots of functions already do things like this, like keeping track of XA handles and such. it wouldn’t really be hard at all actually, as long as the DBAPI itself behaves nicely when lots of handles are allocated.

the execution_options functionality would then be usable as a way to enable the feature on a per-connection or per-statement basis, like:

stmt = select([some_table]).execution_options(pg_prepared_statement=True)
conn.execute(stmt)

the string form of stmt could be linked to the generated handle so that any number of stmt objects whose string form matches the target would work.






Claudio Freire

Feb 6, 2014, 3:53:22 PM
to sqlal...@googlegroups.com
On Thu, Feb 6, 2014 at 5:13 PM, Michael Bayer <mik...@zzzcomputing.com> wrote:
>
> you can use a per-connection cache using connection.info, and you can also manage the lifecycle of the connection and the cache of items with connection and/or pool events as well (e.g. if connection.rollback() clears out prepared statements). lots of functions already do things like this, like keeping track of XA handles and such. it wouldn't really be hard at all actually, as long as the DBAPI itself behaves nicely when lots of handles are allocated.

I don't think the DBAPI cares, but the backend might. I know complex
query plans can sometimes take several MB to cache, so if you had many
of those, the cost in memory could be important.

But, since this has to be explicitly enabled, I don't think it should
be a problem.

Jonathan Vanasco

Feb 6, 2014, 8:50:16 PM
to sqlal...@googlegroups.com


On Thursday, February 6, 2014 2:18:51 PM UTC-5, Klauss wrote:
I think views don't cache the plan, they're handled as rules.

What you'd need is a function (pg's version of stored procedures).

I had time to look it up; this generally seems correct. Looking at some explain output, it seems like the plan for join conditions under the view is kept -- but you still get hit with a plan on the view itself.  i could be interpreting this wrong, and this likely has minimal use.

something that might be applicable to the original poster's usage is a parallel search.

we have a lightweight read-through cache working with sqlalchemy for 'simple' queries (1-2 kv pairs).  cache misses are aggregated and then tossed into a single select.

dumb example:

   select * from table where id = 1;
   select * from table where id = 2;
   select * from table where id = 3;
   select * from table where id = 4;
   select * from table where id = 5;

becomes

   select * from table where id in (1,2,3,4,5)

The performance improvement is shockingly significant; i think it's a mix of the planning, fetching and fewer roundtrips/connectivity.  if that's an option, i'd suggest trying it.
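
in sqlalchemy terms the batched lookup looks roughly like this (the table name, columns, and ids are made up, 0.8/0.9-era Core syntax):

    from sqlalchemy import (create_engine, MetaData, Table, Column, Integer,
                            String, select)

    # hypothetical engine URL and table
    engine = create_engine("postgresql+psycopg2://scott:tiger@localhost/test")
    metadata = MetaData()
    widgets = Table("widgets", metadata,
                    Column("id", Integer, primary_key=True),
                    Column("name", String(50)))

    missed_ids = [1, 2, 3, 4, 5]   # keys the read-through cache could not answer
    conn = engine.connect()
    rows = conn.execute(
        select([widgets]).where(widgets.c.id.in_(missed_ids))).fetchall()
    # one round trip and one plan instead of five separate queries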

