SQLAlchemy does not properly create temporary table for subqueries in MariaDB

1,256 views
Skip to first unread message

Lukas Barth

unread,
Jan 15, 2018, 4:47:50 PM1/15/18
to sqlalchemy
Hi,

first things first: I put a complete notebook showing the problem at [0], I'll only post the excerpts I deem useful below. The exact same code, this time working, with an sqlite in-memory database is at [1]. I'm using SQLAlchemy version 1.2.1, python 3.5, mysqlclient version 1.3.12 and MariaDB version 10.1.29.

I have this persistent class:
class Test(Base):
    __tablename__ = "test"
    
    id = Column(Integer, primary_key=True)
    timestamp = Column(Integer)
    value = Column(Integer)

Resulting in this schema:
 
CREATE TABLE test (
   id INTEGER NOT NULL
,
   timestamp INTEGER
,
   value INTEGER
,
   PRIMARY KEY
(id)
)

What I want to achieve is to delete all objects from Test, which have a timestamp less than some value (five in my example code), but which are *not* the most recent entries for their respective value. For example: If there is a Test object (call it A) with timestamp = 4 and value = 1234, and all other (if there exist any at all…) Test objects with value = 1234 have a timestamp of less than 4, then I want A to *not* be deleted, even though its timestamp is less than 5.

I figured I go with two subqueries: The first subquery uses a group_by on value, and max() on timestamp. These are the objects to be protected from deletion. The second subquery retrieves the ids of the objects in subquery 1. Then I can issue a delete statement, filtering so that the ids of the objects to be deleted are not in subquery 2.

Here's the first subquery:
protected_items = session.query(Test.id, Test.value,
                                func.max(Test.timestamp))\
                         .group_by(Test.value).subquery(name='subquery1')

And the second one:
protected_items_ids = session.query(Test.id).join(
        protected_items, Test.id == protected_items.c.id)\
        .subquery(name='subquery2')

And finally, the deletion:
deleted_rows = session.query(Test) \
                      .filter((Test.timestamp < 5)) \
                      .filter(~Test.id.in_(
                          protected_items_ids)) \
                      .delete(synchronize_session=False)

This works great when using it with a sqlite database. However, it gives an OperationalError when using it with MariaDB. See the bottom of [0] for the full stack trace. This is the error message I get from the MariaDB server:

OperationalError: (_mysql_exceptions.OperationalError) 

(1093, "Table 'test' is specified twice, both as a target for 'DELETE' and as a separate source for data")

[SQL: 'DELETE FROM test WHERE test.timestamp < %s AND test.id NOT IN (SELECT test.id \nFROM test INNER JOIN (SELECT test.id AS id, test.value AS value, max(test.timestamp) AS max_1 \nFROM test GROUP BY test.value) AS subquery1 ON test.id = subquery1.id)'] [parameters: (5,)] (Background on this error at: http://sqlalche.me/e/e3q8)

Doing dome digging, I think one needs to specify a name for the innermost subquery (subquery 1), to make MariaDB create a temporary table for this. At least that's what [2] suggests. I would have assumed the inner subquery to be named ("… as subquery1"), since I specified "name = subquery1", but that doesn't seem to be 

>>> print(protected_items)
SELECT test.id, test.value, max(test.timestamp) AS max_1 
FROM test GROUP BY test.value

… I guess that's a bug? At least I would assume that this should not result in an OperationalError, especially since it works with sqlite?

Thanks for any help,

Lukas

Jonathan Vanasco

unread,
Jan 15, 2018, 5:28:30 PM1/15/18
to sqlalchemy
I can't speak to the internals of this being a bug or not, or how this should be done... but I think you could do a short-term (and cross platform) fix using an alias via `sqlalchemy.orm.aliased` for one (or more) of the inner subqueries.  That should result in a unique discriminator being generated for the table.

It's usually used for joining a table against itself, but I've used it to get more control over the generated sql like this a few times.  

http://docs.sqlalchemy.org/en/latest/orm/query.html?highlight=alias#sqlalchemy.orm.aliased
http://docs.sqlalchemy.org/en/latest/core/selectable.html#sqlalchemy.sql.expression.alias

something like this...

Test1 = sqlalchemy.orm.aliased(Test, name='test1')
protected_items_ids = session.query(Test1.id).join(
        protected_items, Test1.id == protected_items.c.id)\
        .subquery(name='subquery2')




Mike Bayer

unread,
Jan 15, 2018, 5:32:25 PM1/15/18
to sqlal...@googlegroups.com
doing a print() on the subquery object itself does not show it in
context, as the subquery / alias name only comes out if you select
FROM the object. If you look at the SQL shown in the error message,
it is rendering"AS subquery1", so that is not the problem here.

the query as written doesn't actually seem to do anything with the
subqueries, because you are putting this max(test.timestamp) in the
columns clause of an embedded subquery, but you aren't filtering on it
or doing anything at all with it, it's being thrown away. The query
I think you need is:

latest_items = session.query(
Test.value, func.max(Test.timestamp).label('latest_timestamp')
).group_by(Test.value).subquery(name='subquery1')

latest_item_ids = session.query(Test.id).filter(
Test.timestamp == latest_items.c.latest_timestamp).\
filter(Test.value == latest_items.c.value)

deleted_rows = session.query(Test) \
.filter((Test.timestamp < 5)) \
.filter(~Test.id.in_(latest_item_ids)) \
.delete(synchronize_session=False)



generating:

DELETE FROM test WHERE test.timestamp < %s AND test.id NOT IN (SELECT
test.id AS test_id
FROM (SELECT test.value AS value, max(test.timestamp) AS latest_timestamp
FROM test GROUP BY test.value) AS subquery1
WHERE test.timestamp = subquery1.latest_timestamp AND test.value =
subquery1.value)


which is accepted by the database.




>
> … I guess that's a bug? At least I would assume that this should not result
> in an OperationalError, especially since it works with sqlite?
>
> Thanks for any help,
>
> Lukas
>
> [0] https://tinloaf.de/~tinloaf/sqlalchemy/mariadb.html
> [1] https://tinloaf.de/~tinloaf/sqlalchemy/sqlite.html
> [2]
> https://stackoverflow.com/questions/5816840/delete-i-cant-specify-target-table
>
> --
> SQLAlchemy -
> The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full
> description.
> ---
> You received this message because you are subscribed to the Google Groups
> "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to sqlalchemy+...@googlegroups.com.
> To post to this group, send email to sqlal...@googlegroups.com.
> Visit this group at https://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.

Mike Bayer

unread,
Jan 15, 2018, 6:09:42 PM1/15/18
to sqlal...@googlegroups.com
that's not quite right, because I'm not selecting from Test. Adding
the JOIN back in, I've tried aliasing Test everywhere and it still
produces the error. MySQL is really bad at subqueries. give me a
minute to find a query that works here.

Mike Bayer

unread,
Jan 15, 2018, 6:14:14 PM1/15/18
to sqlal...@googlegroups.com
On Mon, Jan 15, 2018 at 6:09 PM, Mike Bayer <mik...@zzzcomputing.com> wrote:
> On Mon, Jan 15, 2018 at 5:32 PM, Mike Bayer <mik...@zzzcomputing.com> wrote:
>
> that's not quite right, because I'm not selecting from Test. Adding
> the JOIN back in, I've tried aliasing Test everywhere and it still
> produces the error. MySQL is really bad at subqueries. give me a
> minute to find a query that works here.

OK try this:

latest_items = session.query(
Test.value, func.max(Test.timestamp).label('latest_timestamp')
).group_by(Test.value).subquery(name='subquery1')


deleted_rows = (
session.query(Test)
.filter((Test.timestamp < 5))
.filter(
~exists().
where(Test.value == latest_items.c.value).
where(Test.timestamp == latest_items.c.latest_timestamp)
).delete(synchronize_session=False)
)


output:

DELETE FROM test WHERE test.timestamp < %s AND NOT (EXISTS (SELECT *
FROM (SELECT test.value AS value, max(test.timestamp) AS latest_timestamp
FROM test GROUP BY test.value) AS subquery1
WHERE test.value = subquery1.value AND test.timestamp =
subquery1.latest_timestamp))

Lukas Barth

unread,
Jan 17, 2018, 3:10:08 PM1/17/18
to sqlalchemy
Thanks a lot! So, if I get this correctly, the difference is that you assigned a label to the timestamp, and that you're doing "~exists()" instead of "~in()" in the delete clause, right?

I'll have to adapt this a bit to get it into my (obviously a bit more complicated) actual code, but thanks for the help so far!

Lukas
Reply all
Reply to author
Forward
0 new messages