`func.similarity` performance

1,279 views
Skip to first unread message

Robert Smith

unread,
Mar 30, 2016, 9:50:07 PM3/30/16
to sqlalchemy
I'm using sqlalchemy 1.0.12 with postgreSQL 9.4.6 with the pg_trgm extension enabled. Basically, I'm performing a similarity-based query on a single column:

In [26]: str(session.query(Model).order_by(desc(func.similarity(Model.description, 'string'))).limit(100))
Out[26]: 'SELECT model.id AS model_id, model.description AS model_description, \nFROM model ORDER BY similarity(model.description, :similarity_1) DESC\n LIMIT :param_1'

This query works okay, but I'm worried that looking at the query that is being used, it doesn't seem to take advantage of some performance improvements that are available in this kind of situation. As a simple example, I noticed that people recommend not to use `similarity` and instead rely on `%` or alternatively, use `WHERE model.description  % 'string' AND similarity(model.description, 'string') > 0.5`. This is apparently related to the efficient use of a "gist" index. 

Another question: In the case of having an index for the trigram matches, does sqlalchemy pick up that information automatically when I'm using `automap_base()` to reflect my database?

Thanks

Mike Bayer

unread,
Mar 31, 2016, 9:43:11 AM3/31/16
to sqlal...@googlegroups.com


On 03/30/2016 09:50 PM, Robert Smith wrote:
> I'm using sqlalchemy 1.0.12 with postgreSQL 9.4.6 with the pg_trgm
> extension enabled. Basically, I'm performing a similarity-based query on
> a single column:
>
> In [26]:
> str(session.query(Model).order_by(desc(func.similarity(Model.description, 'string'))).limit(100))
> Out[26]: 'SELECT model.id AS model_id, model.description AS
> model_description, \nFROM model ORDER BY similarity(model.description,
> :similarity_1) DESC\n LIMIT :param_1'
>
> This query works okay, but I'm worried that looking at the query that is
> being used, it doesn't seem to take advantage of some performance
> improvements that are available in this kind of situation. As a simple
> example, I noticed that people recommend not to use `similarity` and
> instead rely on `%` or alternatively, use `WHERE model.description %
> 'string' AND similarity(model.description, 'string') > 0.5`. This is
> apparently related to the efficient use of a "gist" index.

this is more of a Postgresql optimization question so I dont have any
insight on that.

>
> Another question: In the case of having an index for the trigram
> matches, does sqlalchemy pick up that information automatically when I'm
> using `automap_base()` to reflect my database?

Table reflection will reflect table indexes, though this is not really
important in the case of automapping as indexes do not affect ORM mappings.



>
> Thanks
>
> --
> You received this message because you are subscribed to the Google
> Groups "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to sqlalchemy+...@googlegroups.com
> <mailto:sqlalchemy+...@googlegroups.com>.
> To post to this group, send email to sqlal...@googlegroups.com
> <mailto:sqlal...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.

Jonathan Vanasco

unread,
Mar 31, 2016, 3:43:50 PM3/31/16
to sqlalchemy


On Thursday, March 31, 2016 at 9:43:11 AM UTC-4, Mike Bayer wrote:
this is more of a Postgresql optimization question so I dont have any 
insight on that. 

FWIW, I generally handle these types of "optimize postgres" queries using a function to apply the filter - like this: 

    def filter_Table_field_a(query, value, optimize=False):
        """Table.field_a has a partial index on a substring of field_a
            it will only be consulted by the planner if used in a query."""
        query = query.filter(model.core.Table.field_a = value, )
        if optimize:
            query = query.filter(func.substr(model.core.Table.field_a, 0, 5)) == value[:4], )
        return query


    def filter_Table_field_b(query, value, optimize=False):
        """Table.field_b has a uniqueindex on a lower(field_b);
            it will only be consulted by the planner if used in a query."""
        query = query.filter(model.core.Table.field_b = value, )
        if optimize:
            query = query.filter(func.lower(model.core.Table.field_b)) == value.lower(), )
        return query

q = s.query(TableA)
q = filter_Table_field_b(q, 'hello', optimize=True)

I toggle the optimization, because some queries will perform better with it off.  

Robert Smith

unread,
Mar 31, 2016, 6:40:28 PM3/31/16
to sqlalchemy
Mike Bayer:

Thank you for your response. I wasn't really asking whether the optimization I described above (using `%` instead of `similarity`) is correct or not. Based on some resources (e.g. Super Fuzzy Searching on PostgreSQLFUZZY STRING MATCHING WITH TRIGRAM AND TRIGRAPHS), I think it is correct although I wouldn't mind being proven wrong. I was mainly asking, why sqlalchemy doesn't seem to take advantage of this improvement? 

Your second comment intrigued me: `though this is not really important in the case of automapping as indexes do not affect ORM mappings.`. I'm pretty new to sqlalchemy, so I think I don't understand what you mean. Can you describe this a bit more?

Regards

Robert Smith

unread,
Mar 31, 2016, 6:46:10 PM3/31/16
to sqlalchemy
Jonathan Vanasco:

Thank you. That's a good idea but in this case, I'm really wondering if sqlalchemy should use that small change to improve performance quite a bit in this type of queries.
Otherwise, I think I would need to rely on raw SQL to perform a similarity operation and I'm not sure if `session.execute` provides the same checks to prevent SQL injections (which is my main reason to use sqlalchemy in the first place). 

Regards

Mike Bayer

unread,
Mar 31, 2016, 8:15:41 PM3/31/16
to sqlal...@googlegroups.com


On 03/31/2016 06:40 PM, Robert Smith wrote:
> Mike Bayer:
>
> Thank you for your response. I wasn't really asking whether the
> optimization I described above (using `%` instead of `similarity`) is
> correct or not. Based on some resources (e.g. Super Fuzzy Searching on
> PostgreSQL
> <http://webcache.googleusercontent.com/search?q=cache:59hB5zYwRzcJ:bartlettpublishing.com/site/bartpub/blog/3/entry/350+&cd=9&hl=en&ct=clnk&gl=mx>,
> FUZZY STRING MATCHING WITH TRIGRAM AND TRIGRAPHS
> <http://www.postgresonline.com/journal/archives/169-Fuzzy-string-matching-with-Trigram-and-Trigraphs.html>),
> I think it is correct although I wouldn't mind being proven wrong. I was
> mainly asking, why sqlalchemy doesn't seem to take advantage of this
> improvement?

The Core keeps everyone sane by emitting to as much a degree as is
possible exactly the SQL you tell it to. I see a hardcoded number ".5"
in your optimization as well as a significant structural difference,
which I don't think it would be reasonable for Core to take it upon
itself to emit with no intent coming from the user.

More generally, func.XYZ() has no awareness of most functions except a
small handful just for the purposes of helping with return types.

Similarly, there are no attempts to dig into the expressions used within
functions and try to make decisions based on indexes associated with the
elements of those expressions; parsing SQL and applying indexes to
expressions is the job of the database itself and it wouldn't be
reasonable for SQLAlchemy to start attempting to duplicate all that
extremely rich, complex, and idiosyncratic logic that we get for free
from the database. If this optimization is truly applicable by default
I'd ask why doesn't Postgresql itself apply it automatically behind the
scenes; if the database engine is failing to take advantage of its own
optimizations which can be established as appropriate by default without
any additional intent from the user, that's a bug on their end.

>
> Your second comment intrigued me: `though this is not really important
> in the case of automapping as indexes do not affect ORM mappings.`. I'm
> pretty new to sqlalchemy, so I think I don't understand what you mean.
> Can you describe this a bit more?

An ORM mapping links the attributes in an object-oriented class to rows
in a database table. The purpose of this mapping is to generate
SELECT, INSERT, UPDATE, and DELETE statements to be emitted to the
database, and to retrieve result data from these statements in the form
of database rows. None of these operations have any requirement to
explicitly state the names of indexes nor are any indexes required for
any of those operations to function. So awareness of indexes on the
SQLAlchemy side for the purposes of emitting and retrieving data from
DQL or DML is unnecessary. The SQLAlchemy Index() construct is only
useful for DDL (e.g. CREATE INDEX and related statements), which is a
concern separate from object relational mapping.





>
> Regards
>
> On Thursday, March 31, 2016 at 7:43:11 AM UTC-6, Mike Bayer wrote:
>
>
>
> On 03/30/2016 09:50 PM, Robert Smith wrote:
> > I'm using sqlalchemy 1.0.12 with postgreSQL 9.4.6 with the pg_trgm
> > extension enabled. Basically, I'm performing a similarity-based
> query on
> > a single column:
> >
> > In [26]:
> >
> str(session.query(Model).order_by(desc(func.similarity(Model.description,
> 'string'))).limit(100))
> > Out[26]: 'SELECT model.id <http://model.id> AS model_id,
> model.description AS
> > model_description, \nFROM model ORDER BY
> similarity(model.description,
> > :similarity_1) DESC\n LIMIT :param_1'
> >
> > This query works okay, but I'm worried that looking at the query
> that is
> > being used, it doesn't seem to take advantage of some performance
> > improvements that are available in this kind of situation. As a
> simple
> > example, I noticed that people recommend not to use `similarity` and
> > instead rely on `%` or alternatively, use `WHERE
> model.description %
> > 'string' AND similarity(model.description, 'string') > 0.5`. This is
> > apparently related to the efficient use of a "gist" index.
>
> this is more of a Postgresql optimization question so I dont have any
> insight on that.
>
> >
> > Another question: In the case of having an index for the trigram
> > matches, does sqlalchemy pick up that information automatically
> when I'm
> > using `automap_base()` to reflect my database?
>
> Table reflection will reflect table indexes, though this is not really
> important in the case of automapping as indexes do not affect ORM
> mappings.
>
>
>

Jonathan Vanasco

unread,
Mar 31, 2016, 9:53:02 PM3/31/16
to sqlalchemy


On Thursday, March 31, 2016 at 6:46:10 PM UTC-4, Robert Smith wrote:
Thank you. That's a good idea but in this case, I'm really wondering if sqlalchemy should use that small change to improve performance quite a bit in this type of queries.

I've been in your place.
1. It sounds like a great idea right now... but after few months "magic" optimizations like this get in the way of your work.
2. The work required to make something like this function correctly is really large.
3. The postgres planner is hard to deal with. depending on how you configured the server, it might or might not use the hints you give it. 

you don't need to use raw sql.  if you want to, you can use bind parameters to handle sql escaping.

if you want to stay in the orm...

i think you could translate this:


    WHERE model.description  % 'string' AND similarity(model.description, 'string') > 0.5`.

into something like this

    session.query(Model)\
    .filter(Model.description.op('%')(string),
             func.similarity(Model.description, string) > 0.5
             )

I used the `column.op()` syntax because it's useful for specifying certain operators.  I think the func should work like that, but I haven't used it in a similar context in a long time.




Sergey V.

unread,
Mar 31, 2016, 11:46:34 PM3/31/16
to sqlalchemy
Robert, my understanding is that SQLAlchemy knows nothing about the Postgres's `similarity` function - sqlalchemy.func just magically generates the SQL output depending on which member you invoke. Try `func.magic_unicorns()`. So, there's not much to optimize here - it outputs what you give it. See this for details: http://docs.sqlalchemy.org/en/latest/core/sqlelement.html#sqlalchemy.sql.expression.func

It's up to you to build a query which uses the `%` syntax if you need it. SQLAlchemy's Columns have the `.op()` method for that (http://docs.sqlalchemy.org/en/rel_1_0/core/sqlelement.html#sqlalchemy.sql.expression.ColumnElement.op)

So this raw sql query: "... WHERE model.description  % 'string' AND similarity(model.description, 'string') > 0.5" becomes

      my_query.filter(Model.description.op('%')('string')).filter(func.similarity(Model.description, 'string') > 0.5)


Robert Smith

unread,
Apr 1, 2016, 1:24:47 PM4/1/16
to sqlalchemy
Mike Bayer:

Thank you for your explanation. Yes, I thought `func.XYZ` was more aware of its functions. In the case of `%` and indexes, I think there is an improvement if there is an index, but the query still works as usual even in the absence of an index. However, if the implementation is not trivial, it is better to look for alternatives.

Robert Smith

unread,
Apr 1, 2016, 1:31:36 PM4/1/16
to sqlalchemy
Jonathan and Sergey:

I tried to use the `op()` method but for some reason it fails in this particular query (session.query(Model).filter(Model.description.op('%')('string'))). I get the following error:

/home/user/anaconda/envs/gensim/lib/python3.5/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
    448
    449     def do_execute(self, cursor, statement, parameters, context=None):
--> 450         cursor.execute(statement, parameters)
    451
    452     def do_execute_no_params(self, cursor, statement, context=None):
TypeError: 'dict' object does not support indexing

i tested a similar query with `regexp` and that works correctly, so I'm not sure what is different in this case.

Regards

Jonathan Vanasco

unread,
Apr 1, 2016, 3:51:29 PM4/1/16
to sqlalchemy
if you can put together a self-contained repeatable example that triggers this, i'd be happy to take a look.
Reply all
Reply to author
Forward
0 new messages