Adding alias to join() makes a wildly different query

25 views
Skip to first unread message

Zsolt Ero

unread,
Nov 28, 2019, 4:31:42 PM11/28/19
to sqlalchemy
I have the following query:

min_values = (
        sa.select(
            [
                table.c.region_id,
                table.c.operator,
                table.c.day_str,
                sa.func.min(table.c.processed_time).label('min_processed_time'),
            ]
        )
        .group_by(table.c.region_id, table.c.operator, table.c.day_str)
        .order_by(table.c.region_id, table.c.operator, table.c.day_str)
    ).alias('mv')

    j = min_values.join(
        config_mds,
        sa.and_(
            min_values.c.region_id == config_mds.c.region_id,
            min_values.c.operator == config_mds.c.operator,
        ),
    ).alias('j')

    stmt = (
        sa.select([min_values.c.region_id, min_values.c.operator, min_values.c.day_str])
        .select_from(j)
        .where(min_values.c.min_processed_time < config_mds.c.last_reprocess_time)
    )


Which works well, as long as I do not add the .alias('j') part. If I add it, it generates a wildly different SQL code, which doesn't finish in reasonable time (I terminated it after a few minutes whereas the first one finishes in a few seconds.

Here is the generated SQL without alias('j'):

SELECT
  mv.region_id,
  mv.operator,
  mv.day_str
FROM
  (
    SELECT
      status_changes.region_id AS region_id,
      status_changes.operator AS OPERATOR,
      status_changes.day_str AS day_str,
      min(status_changes.processed_time) AS min_processed_time
    FROM
      status_changes
    GROUP BY
      status_changes.region_id,
      status_changes.operator,
      status_changes.day_str
    ORDER BY
      status_changes.region_id,
      status_changes.operator,
      status_changes.day_str
  ) AS mv
  JOIN config_mds ON mv.region_id = config_mds.region_id
  AND mv.operator = config_mds.operator
WHERE
  config_mds.last_reprocess_time > mv.min_processed_time


And here is with the added alias('j'), which doesn't terminates:

SELECT
  mv.region_id,
  mv.operator,
  mv.day_str
FROM
  (
    SELECT
      status_changes.region_id AS region_id,
      status_changes.operator AS OPERATOR,
      status_changes.day_str AS day_str,
      min(status_changes.processed_time) AS min_processed_time
    FROM
      status_changes
    GROUP BY
      status_changes.region_id,
      status_changes.operator,
      status_changes.day_str
    ORDER BY
      status_changes.region_id,
      status_changes.operator,
      status_changes.day_str
  ) AS mv,
  config_mds,
  (
    SELECT
      mv.region_id AS mv_region_id,
      mv.operator AS mv_operator,
      mv.day_str AS mv_day_str,
      mv.min_processed_time AS mv_min_processed_time,
      config_mds.region_id AS config_mds_region_id,
      config_mds.operator AS config_mds_operator,
      config_mds.disable_download AS config_mds_disable_download,
      config_mds.trips_url AS config_mds_trips_url,
      config_mds.status_changes_url AS config_mds_status_changes_url,
      config_mds.url_headers AS config_mds_url_headers,
      config_mds.login_data AS config_mds_login_data,
      config_mds.use_different_region AS config_mds_use_different_region,
      config_mds.comment AS config_mds_comment,
      config_mds.last_reprocess_time AS config_mds_last_reprocess_time,
      config_mds.token_cache AS config_mds_token_cache
    FROM
      (
        SELECT
          status_changes.region_id AS region_id,
          status_changes.operator AS OPERATOR,
          status_changes.day_str AS day_str,
          min(status_changes.processed_time) AS min_processed_time
        FROM
          status_changes
        GROUP BY
          status_changes.region_id,
          status_changes.operator,
          status_changes.day_str
        ORDER BY
          status_changes.region_id,
          status_changes.operator,
          status_changes.day_str
      ) AS mv
      JOIN config_mds ON mv.region_id = config_mds.region_id
      AND mv.operator = config_mds.operator
  ) AS j
WHERE
  config_mds.last_reprocess_time > mv.min_processed_time;



This behaviour confuses me a bit. Why would adding an alias make a difference to the generated query?



Mike Bayer

unread,
Nov 28, 2019, 10:50:56 PM11/28/19
to noreply-spamdigest via sqlalchemy
sure, the "j" is shorthand for calling a SELECT of the join, then aliasing it.   you can no longer refer to 'min_values" in the subsequent query because that means you want to add it to the FROM clause separately from the "j" and it will go poorly.
you have:

min_values = "(SELECT .... FROM ... ) AS mv"

select([min_values.c.x]) =  "SELECT mv.x FROM (SELECT ... FROM ..) AS mv"


then you say, "j = min_values.join(config_mds, ...).alias("j")"

this is the same as saying:

j = select([min_values.join(config_mds, ...)]).alias("j")

an "alias" is only against a SELECT or a table;  there is no "SELECT * FROM (x JOIN y) AS j" syntax.     Did a quick test and apparently Postgresql allows that syntax but it fails on MySQL.   SQLite seems to support it but that's a newer thing as SQLite didn't support nested JOINs at all some years ago.   Adding the "AS <name>" to the plain parenthesies like that It's not something I've seen in SQL before and SQLAlchemy has no concept of that right now.

so given that join().alias() is shorthand for select([join()]).alias(), this means:

select([j.c.x]) =   "SELECT j.x FROM (SELECT mv.*, config_mds.* FROM (SELECT ... FROM ..) AS mv JOIN config_mds ON ...) AS j"

In SQLAlchemy 1.4 there's been a big scale-back of how things like alias() and join() can be used in order to decrease this kind of confusion, and I was sort of hoping this particular use was also deprecated but apparently it isn't.   so https://github.com/sqlalchemy/sqlalchemy/issues/5010 is added and the join().alias() usage will emit a deprecation warning in 1.4 because it does a thing that by today's standards is surprising.

anyway, if you mix the join() and the join().alias() together, like:

select([min_values.c.x]).select_from(j)...

this is like:

"SELECT mv.x FROM (SELECT ... FROM ..) AS mv, (SELECT mv.*, config_mds.* FROM (SELECT ... FROM ..) AS mv JOIN config_mds ON ...) AS j"

that is, saying select_from(j) doesn't mean that's the only thing you are selecting from, it also has to select from min_values directly.

TLDR join().alias() is doing a large implicit thing based on older assumptions, when parenthesized "(a JOIN b)" clauses were not accepted by SQLite and possibly others and SQLAlchemy always used SELECT... in order to group joins.     you can group joins in parentheses now but you still can't give them an alias() without a SELECT implied.





--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper
 
 
To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+...@googlegroups.com.

Zsolt Ero

unread,
Nov 29, 2019, 8:03:00 AM11/29/19
to sqlalchemy
Thanks for the detailed explanation. I think I should explain why I
run into this problem, as it's in line with emitting a warning on this
use case.

1. I wrote the query without any kind of alias in SQLAlchemy.
2. Received an error from PostgreSQL "subquery in FROM must have an
alias", so I added it in both places "to be safe". One for the
subquery, one for the join.
3. After this the query worked - so I confusingly thought I did the
right thing - but became unusably slow.

Then I tried to debug step by step and arrived that the adding of
alias to join() made it into a very different query which happened to
be slow.

I think I'm not the only user who adds aliases "everywhere" once
receiving that alias is needed error. I thought that the j = ... is a
subquery as it's in a select_from() line.

So yes, in my case adding those warnings would be very helpful,
directing me in the right direction.


Zsolt
> You received this message because you are subscribed to a topic in the Google Groups "sqlalchemy" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/sqlalchemy/nPKcBGWWY3s/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to sqlalchemy+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sqlalchemy/f26cf2af-cb16-4488-b75b-baca3c457e00%40www.fastmail.com.

Mike Bayer

unread,
Nov 29, 2019, 10:28:49 AM11/29/19
to noreply-spamdigest via sqlalchemy


On Fri, Nov 29, 2019, at 8:02 AM, Zsolt Ero wrote:
Thanks for the detailed explanation. I think I should explain why I
run into this problem, as it's in line with emitting a warning on this
use case.

1. I wrote the query without any kind of alias in SQLAlchemy.
2. Received an error from PostgreSQL "subquery in FROM must have an
alias", so I added it in both places "to be safe". One for the
subquery, one for the join.
3. After this the query worked - so I confusingly thought I did the
right thing - but became unusably slow.


right because the FROM list thing really tends towards creating cartesian products because of the implicit FROM thing, there is also a recipe + upcoming extension that will warn when this situation occurs, that's in review at https://gerrit.sqlalchemy.org/#/c/sqlalchemy/sqlalchemy/+/1461/  .



Then I tried to debug step by step and arrived that the adding of
alias to join() made it into a very different query which happened to
be slow.

I think I'm not the only user who adds aliases "everywhere" once
receiving that alias is needed error. I thought that the j = ... is a
subquery as it's in a select_from() line.

I'm not sure that's necessarily true as usually you'd want to be looking at the SQL and matching up...the ORM is less prone to this kind of thing because you don't deal with join() as a separate object (having this work for Core select is also a thing I am pursuing).



So yes, in my case adding those warnings would be very helpful,
directing me in the right direction.

the only thing that worries me lately is that deprecationwarnings in Python don't emit by default now, so just curious is this in the context of a test suite that will enable deprecationwarning ?



Reply all
Reply to author
Forward
0 new messages