Compiling to Postgres VALUES clause with bindparam

1,091 views
Skip to first unread message

Martin Stein

unread,
Feb 7, 2019, 3:58:57 PM2/7/19
to sqlalchemy
We are using Postgres, the baked extension and have code along these lines:

    query = bakery(
       
lambda s: s.query(Item)
                   
.filter(Item.parent_id.op('IN')(bindparam('parent_ids')))


    result
= query(dbsession).params(parent_ids=tuple(parent_ids)).all()


The .op('IN') approach was used to get the bindparam working with an IN-clause. (See https://github.com/sqlalchemy/sqlalchemy/issues/3574)

With the above approach, the relevant part of the SQL query looks something like this:

WHERE items.parent_id IN %(parent_ids)s

and SQLAlchemy passes the list of parent_ids nicely as:
{
 
'parent_ids': ('abc', 'def', 'gjh')
}

For our case, Postgres executes the query significantly faster if we use the PG-specific VALUES syntax (see here for an example: https://stackoverflow.com/questions/40443409/postgresql-in-operator-performance-list-vs-subquery). Therefore we are trying to achieve two things:

1) Somehow get an SQL-output like this:

WHERE items.parent_id IN (VALUES ('abc'), ('def'), ('gjh'))


2) Make this work with bindparam, so we can pass the parent_ids as list/tuple into the baked query.

 and https://groups.google.com/d/topic/sqlalchemy/JW2oigBVpik/discussion , but I'm unable to get any of it to work.

How can I set this up so I can do the following (or something along those lines)?

.filter(Item.parent_id.op('IN')(values(bindparam('parent_ids'))))


Mike Bayer

unread,
Feb 7, 2019, 5:45:09 PM2/7/19
to sqlal...@googlegroups.com
On Thu, Feb 7, 2019 at 3:59 PM Martin Stein <mstei...@gmail.com> wrote:
>
> We are using Postgres, the baked extension and have code along these lines:
>
> query = bakery(
> lambda s: s.query(Item)
> .filter(Item.parent_id.op('IN')(bindparam('parent_ids')))
>
>
> result = query(dbsession).params(parent_ids=tuple(parent_ids)).all()
>
>
> The .op('IN') approach was used to get the bindparam working with an IN-clause. (See https://github.com/sqlalchemy/sqlalchemy/issues/3574)

note the mentioned issue is closed. IN supports an inline parameter now:

https://docs.sqlalchemy.org/en/rel_1_2/orm/extensions/baked.html#baked-in



>
> With the above approach, the relevant part of the SQL query looks something like this:
>
> WHERE items.parent_id IN %(parent_ids)s
>
> and SQLAlchemy passes the list of parent_ids nicely as:
> {
> 'parent_ids': ('abc', 'def', 'gjh')
> }
>
> For our case, Postgres executes the query significantly faster if we use the PG-specific VALUES syntax (see here for an example: https://stackoverflow.com/questions/40443409/postgresql-in-operator-performance-list-vs-subquery). Therefore we are trying to achieve two things:
>
> 1) Somehow get an SQL-output like this:
>
> WHERE items.parent_id IN (VALUES ('abc'), ('def'), ('gjh'))
>
>
> 2) Make this work with bindparam, so we can pass the parent_ids as list/tuple into the baked query.
>
> I have read https://stackoverflow.com/questions/18858291/values-clause-in-sqlalchemy,
> https://stackoverflow.com/questions/33767740/how-to-make-use-of-bindparam-in-a-custom-compiled-expression/33815597#33815597
> and https://groups.google.com/d/topic/sqlalchemy/JW2oigBVpik/discussion , but I'm unable to get any of it to work.
>
> How can I set this up so I can do the following (or something along those lines)?

well normally for VALUES I'd have you make a custom clause element.
However, you're looking to carry onto the EXPANDING thing as well,
which has to happen after the SQL has been rendered but before it is
sent to the database. So here's a way to use an event to make that
rewrite happen:

import re

from sqlalchemy import bindparam
from sqlalchemy import Column
from sqlalchemy import create_engine
from sqlalchemy import event
from sqlalchemy import Integer
from sqlalchemy import String
from sqlalchemy.ext import baked
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session

Base = declarative_base()


class A(Base):
__tablename__ = "a"

id = Column(Integer, primary_key=True)
data = Column(String)


e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
Base.metadata.create_all(e)

s = Session(e)

bakery = baked.bakery()

query = bakery(
lambda s: s.query(A).filter(A.id.in_(bindparam("ids", expanding=True)))
)

q = query(s).params(ids=[1, 2, 3])


@event.listens_for(e, "before_cursor_execute", retval=True)
def before_cursor_execute(
conn, cursor, statement, parameters, context, executemany
):
for bind in context.compiled.binds.values():
if bind.expanding:
paramnames = sorted(
p for p in parameters if p.startswith(bind.key)
)
reg = r"IN \(%s\)" % (
", ".join(r"%%\(%s\)s" % p for p in paramnames)
)
statement = re.sub(
reg,
"IN (VALUES %s)"
% ", ".join("(%%(%s)s)" % p for p in paramnames),
statement,
)
return statement, parameters


print(q)
q.all()


query at the end is:

SELECT a.id AS a_id, a.data AS a_data
FROM a
WHERE a.id IN (VALUES (%(ids_1)s), (%(ids_2)s), (%(ids_3)s))





>
> .filter(Item.parent_id.op('IN')(values(bindparam('parent_ids'))))
>
>
> --
> SQLAlchemy -
> The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
> ---
> You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+...@googlegroups.com.
> To post to this group, send email to sqlal...@googlegroups.com.
> Visit this group at https://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.

Martin Stein

unread,
Feb 8, 2019, 10:04:58 AM2/8/19
to sqlalchemy
Am Donnerstag, 7. Februar 2019 23:45:09 UTC+1 schrieb Mike Bayer:


note the mentioned issue is closed.  IN supports an inline parameter now:

https://docs.sqlalchemy.org/en/rel_1_2/orm/extensions/baked.html#baked-in


We had started using the .op('IN') approach around v1.1 and I hadn't realized that SQLAlchemy supports this now. Very good to know, thanks.

I have integrated your suggested approach with the 'before_cursor_execute' rewrite of the statement, and it works nicely, thanks again!

Simply out of curiosity and because I'm trying to understand the internals a bit better:

1) Theoretically, would another possible approach be to implement a sub-class of BindParameter (let's say PGValuesBindParameter) and - in combination with the .op('IN') approach - overwrite the bind-parameter behavior so it passes the values as one long string like this:
    .filter(Item.parent_id.op('IN')(PGValuesBindParameter('parent_ids')))

# Parameters that are passed into the statement:
{
   
'parent_ids': "(VALUES ('abc'), ('def'), ('gjh'))"
}


2) I've also considered using the following construct with the array-function from the postgres-dialect. It achieves the same performance improvement as the VALUES approach and doesn't require the tricky rewriting of the VALUES-part with the inner parentheses:

func.unnest(array(parent_ids))

but I couldn't get it to work with bindparams (tried also with expanding=True).

Which one would you suggest/do you think is the cleanest... the VALUES statement-rewrite, or one of those other approaches?

Mike Bayer

unread,
Feb 8, 2019, 12:03:20 PM2/8/19
to sqlal...@googlegroups.com
On Fri, Feb 8, 2019 at 10:05 AM Martin Stein <mstei...@gmail.com> wrote:
>
> Am Donnerstag, 7. Februar 2019 23:45:09 UTC+1 schrieb Mike Bayer:
>>
>>
>>
>> note the mentioned issue is closed. IN supports an inline parameter now:
>>
>> https://docs.sqlalchemy.org/en/rel_1_2/orm/extensions/baked.html#baked-in
>>
>
> We had started using the .op('IN') approach around v1.1 and I hadn't realized that SQLAlchemy supports this now. Very good to know, thanks.
>
> I have integrated your suggested approach with the 'before_cursor_execute' rewrite of the statement, and it works nicely, thanks again!
>
> Simply out of curiosity and because I'm trying to understand the internals a bit better:
>
> 1) Theoretically, would another possible approach be to implement a sub-class of BindParameter (let's say PGValuesBindParameter) and - in combination with the .op('IN') approach - overwrite the bind-parameter behavior so it passes the values as one long string like this:
> .filter(Item.parent_id.op('IN')(PGValuesBindParameter('parent_ids')))
>
> # Parameters that are passed into the statement:
> {
> 'parent_ids': "(VALUES ('abc'), ('def'), ('gjh'))"
> }

so the event recipe I gave you , you probably *want* to do this, so
that in the event, you not only look at bindparam.expanding, you also
do isinstance(param, PGValuesBindParameter).


>
> 2) I've also considered using the following construct with the array-function from the postgres-dialect. It achieves the same performance improvement as the VALUES approach and doesn't require the tricky rewriting of the VALUES-part with the inner parentheses:
>
> func.unnest(array(parent_ids))
>
> but I couldn't get it to work with bindparams (tried also with expanding=True).
>
> Which one would you suggest/do you think is the cleanest... the VALUES statement-rewrite, or one of those other approaches?

ideally the expanding IN system in SQLAlchemy would have a little bit
of expansion built into it so you could just get in there to add this
keyword.

the big issue here is that you need to use the baked system so any
writing of these parameters / arrays / whatever needs to happen
*after* SQLAlchemy creates the string. the current IN trick you are
doing is taking advantage of psycopg2's ability to pass a tuple as a
parameter to libpq. psycopg2 probably can't achieve this with
"UNNEST" because that's a SQL function, not something that libpq can
accept as a parameter.

So rewriting the statement after the fact is the best way to go.
the workings of it are complicated because we have to also take your
single parameter "foo" and add "foo_1", "foo_2", "foo_3", etc. to the
parameter dictionary. If you are then passing in a list of tuples,
then it has to break into a list of tuples. The logic is at
https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/engine/default.py#L748,
you can see it's pretty tedious, so it's probably best to keep
leveraging that logic.

Looking at the logic there, I can see we can leverage the existing
"render a tuple" logic so that we can remove the regexp thing if you
just pass the params as tuples. This can also be automated if you
rewrite the parameter dictionary using the before_execute() event, but
here's the more simple idea:

from sqlalchemy import Column
from sqlalchemy import create_engine
from sqlalchemy import Integer
from sqlalchemy import String
from sqlalchemy.ext import baked
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session
from sqlalchemy.sql.expression import BindParameter

class PGValuesParam(BindParameter):
pass

@compiles(PGValuesParam)
def _compile_pgvalues(element, compiler, **kw):
string = compiler.visit_bindparam(element, **kw)
return "(VALUES %s" % string[1:]


Base = declarative_base()


class A(Base):
__tablename__ = "a"

id = Column(Integer, primary_key=True)
data = Column(String)


e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
Base.metadata.create_all(e)

s = Session(e)

bakery = baked.bakery()

query = bakery(
lambda s: s.query(A).filter(A.id.in_(PGValuesParam("ids", expanding=True)))
)

q = query(s).params(ids=[(1,), (2,), (3,)])


print(q)
q.all()

Martin Stein

unread,
Feb 11, 2019, 3:37:27 PM2/11/19
to sqlalchemy
So I have replaced the regex-based rewrite with the PGValuesParam approach that we talked about. One issue with the regex-based rewrite was that in its current form it breaks for parameter-lists > 10 elements, because the inner sorted(..) call does string-based sorting:

'IN \\(%\\(parent_ids_1\\)s, %\\(parent_ids_10\\)s, %\\(parent_ids_11\\)s, ...'

whereas the actual sequence in the statement is:

... IN (%(parent_ids_1)s, %(parent_ids_2)s, %(parent_ids_3)s ...)

I'm aware that we could modify the sorting to account for that, but it starts to feel somewhat tricky. I like the PGValuesParam-idea a lot, because it is an explicit opt-in approach, that avoids the tricky regex. So I went with the following:

class PGValuesParam(BindParameter):


   
@staticmethod
   
def prepare(parameters):
       
return [(element,) for element in parameters]


Usage in the query then:

    .filter(Item.parent_id.in_(PGValuesParam('parent_ids', expanding=True)))


and passing of parameters as:

    query(session).params(parent_ids=PGValuesParam.prepare(parent_ids))

The transformation of each list element to a tuple is done in the prepare-staticmethod. That way, the rest of the team doesn't have to remember the details besides "Use the .prepare(...)-method when passing parameter-lists to PGValuesParam."

The cherry on top would be to do an automatic transformation of the list to the list-of-tuples for the PGValuesParam, but when I stepped into the before_execute-event and inspected the data, I couldn't find the PGValuesParam instance anywhere. I see the list of values as part of multiparams, but not PGValuesParam itself (which would be needed as the 'signal' for doing the transformation).

Anyway, while I would be interested in how to achieve this (it would make the whole approach very clean), I don't want to take up too much of your time. So, thank you very much for your help already, Mike. And I've been meaning to mention this: While I can feel myself getting somewhat frustrated with Python's lack of good static typing and gravitating towards .NET Core because of this, SQLAlchemy is the one library that singlehandedly keeps me in the Python world, because it's so amazing.

Mike Bayer

unread,
Feb 11, 2019, 4:40:52 PM2/11/19
to sqlal...@googlegroups.com
On Mon, Feb 11, 2019 at 3:37 PM Martin Stein <mstei...@gmail.com> wrote:
>
> So I have replaced the regex-based rewrite with the PGValuesParam approach that we talked about. One issue with the regex-based rewrite was that in its current form it breaks for parameter-lists > 10 elements, because the inner sorted(..) call does string-based sorting:
>
> 'IN \\(%\\(parent_ids_1\\)s, %\\(parent_ids_10\\)s, %\\(parent_ids_11\\)s, ...'
>
> whereas the actual sequence in the statement is:
>
> ... IN (%(parent_ids_1)s, %(parent_ids_2)s, %(parent_ids_3)s ...)
>
> I'm aware that we could modify the sorting to account for that, but it starts to feel somewhat tricky. I like the PGValuesParam-idea a lot, because it is an explicit opt-in approach, that avoids the tricky regex. So I went with the following:
>
> class PGValuesParam(BindParameter):
>
>
> @staticmethod
> def prepare(parameters):
> return [(element,) for element in parameters]
>
>
> Usage in the query then:
>
> .filter(Item.parent_id.in_(PGValuesParam('parent_ids', expanding=True)))
>
>
> and passing of parameters as:
>
> query(session).params(parent_ids=PGValuesParam.prepare(parent_ids))
>
> The transformation of each list element to a tuple is done in the prepare-staticmethod. That way, the rest of the team doesn't have to remember the details besides "Use the .prepare(...)-method when passing parameter-lists to PGValuesParam."
>
> The cherry on top would be to do an automatic transformation of the list to the list-of-tuples for the PGValuesParam, but when I stepped into the before_execute-event and inspected the data, I couldn't find the PGValuesParam instance anywhere. I see the list of values as part of multiparams, but not PGValuesParam itself (which would be needed as the 'signal' for doing the transformation).
>
> Anyway, while I would be interested in how to achieve this (it would make the whole approach very clean), I don't want to take up too much of your time. So, thank you very much for your help already, Mike. And I've been meaning to mention this: While I can feel myself getting somewhat frustrated with Python's lack of good static typing and gravitating towards .NET Core because of this, SQLAlchemy is the one library that singlehandedly keeps me in the Python world, because it's so amazing.

it's a tight spot to get the params but inside of before_execute()
before the string is generated you'd need to traverse the
"clauseelement" to search for PGValuesParam instances. The traversal
would be with this function:
https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/sql/visitors.py#L276

that is

def visit_bindparam(element):
if isinstance(element, PGValuesParam):
params[element.key] = < make tuples out of the values that are
in params >
traverse(clauseelement, {}, {"bindparam": visit_bindparam})

haven't tried it, I think "bindparam" should hit your PGValuesParam
objects correctly though

Jonathan Vanasco

unread,
Feb 12, 2019, 1:24:20 PM2/12/19
to sqlalchemy
This looks very useful to other pg users.  What about making this part of the PostgreSQL dialect somehow?  

Mike Bayer

unread,
Feb 12, 2019, 6:14:44 PM2/12/19
to sqlal...@googlegroups.com
On Tue, Feb 12, 2019 at 1:24 PM Jonathan Vanasco <jona...@findmeon.com> wrote:
>
> This looks very useful to other pg users. What about making this part of the PostgreSQL dialect somehow?

as you know I prefer making things possible to making things assumed
:) , because in the latter case, I have to keep attending to those
assumptions as they change. The whole "rewrite EXPANDING" part of
defaultdialect should be expandable. I think maybe an event hook in
that area might be helpful, not sure.

Jonathan Vanasco

unread,
Feb 13, 2019, 1:14:25 PM2/13/19
to sqlalchemy


On Tuesday, February 12, 2019 at 6:14:44 PM UTC-5, Mike Bayer wrote:
as you know I prefer making things possible to making things assumed
:) , because in the latter case, I have to keep attending to those
assumptions as they change.    The whole "rewrite EXPANDING" part of
defaultdialect should be expandable.  I think maybe an event hook in
that area might be helpful, not sure.

 
Of course.  There are a lot of ways this could be handled too - as a plugin, a recipe, etc... but wanted to start a dialog on it. 

I didn't know about the VALUES syntax performance difference, and It would be very useful to turn the VALUES syntax on/off on a per query option when working on optimizations.  

From a few tests, these all generate the same results, but have VERY different performance, plans and executions depending on the query:

    WHERE items.parent_id IN (1, 2, 3)
    WHERE items.parent_id IN (VALUES (1), (2), (3))
    WHERE items.parent_id = ANY (VALUES (1), (2), (3))
    WHERE items.parent_id = ANY (ARRAY[1, 2, 3])

While SqlAlchemy allows us to edit the queries to emit each of these, it would be incredibly useful to achieve all the forms through a single operator and toggling a kwarg to influence how the sql is generated.
Reply all
Reply to author
Forward
0 new messages