Filterable subqueries ...


Bernd Wechner

Feb 28, 2019, 7:27:07 AM
to Django developers (Contributions to Django itself)
I have a problem I've only been able to solve with raw SQL, which bugs me. I've posted on the Django users list (to stunning silence) and on Stack Overflow:


to comparable silence. I'm rather convinced this can't be done in Django without raw SQL, and that it is an integral part of window function utility, so I'd like to propose a native ORM-based solution to the problem.

To understand the context it will be necessary to read the post on Stack Overflow above; there seems little point in copying the text here.

The proposal though is simple enough. It is that django.db.models.Subquery support the methods that django.db.models.QuerySet does, specifically SQL constructing methods like annotate() and filter() - I'm sure there are more.

The idea is to make easily available a way of selecting from a subquery such that something akin to:

SQ = Subquery(model.objects.filter(...))  

produces SQL in the form:

SELECT ... FROM model WHERE ...

and now:

Q = SQ.filter(---)

would produce SQL in the form:

SELECT * FROM (SELECT ... FROM model WHERE ...) AS SQ WHERE ---

Essentially permitting us to filter on the results of a Subquery.

Again, this is crucial when using Window functions like LAG and LEAD over a filtered list of objects. The reasons for this are explained on Stack Overflow with samples of code and SQL.
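To illustrate why the wrapping SELECT matters for LAG and LEAD, here is a minimal sketch that runs the two SQL shapes with Python's built-in sqlite3 module rather than through the ORM. The `session` table, its columns and its rows are made up for illustration, and SQLite 3.25 or later is assumed for window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for this illustration only.
conn.execute("CREATE TABLE session (id INTEGER PRIMARY KEY, played_on TEXT)")
conn.executemany(
    "INSERT INTO session VALUES (?, ?)",
    [(1, "2019-01-01"), (2, "2019-01-08"), (3, "2019-01-15")],
)

# Filter in the same query: WHERE runs before the window function,
# so LAG only ever sees the single surviving row.
inner = conn.execute("""
    SELECT id, LAG(id) OVER (ORDER BY played_on) AS prev_id
    FROM session
    WHERE id = 2
""").fetchall()

# Filter on the wrapped subquery: LAG and LEAD are computed over the
# whole list first, and the outer WHERE then picks the row of interest.
wrapped = conn.execute("""
    SELECT * FROM (
        SELECT id,
               LAG(id)  OVER (ORDER BY played_on) AS prev_id,
               LEAD(id) OVER (ORDER BY played_on) AS next_id
        FROM session
    ) AS sq
    WHERE id = 2
""").fetchall()

print(inner)    # [(2, None)]
print(wrapped)  # [(2, 1, 3)]
```

The second query has the `SELECT * FROM (...) AS SQ WHERE ---` form that the proposed `SQ.filter(---)` would be expected to generate.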

Of course, in spite of the stunning silence on the Django users list and Stackoverflow, there's a chance we can already do this in Django in a way that does not involve Raw SQL that I have not found yet, in which case apologies, this really is a support question for the Django users list. But I have looked hard and asked hard and searched hard and experimented hard and delved into the Django code quite hard and I'm fairly convinced it's not possible at present ... without constructing Raw SQL.

Regards,

Bernd.




Josh Smeaton

Mar 12, 2019, 5:29:16 AM
With regard to the stunning silence you're witnessing, it's my guess that you haven't made yourself clear enough in a concise way. The stackoverflow post is large, and doesn't point out what's missing very clearly.

What is the SQL you want?

What is the SQL you're getting, and what is the queryset you're constructing?

I **think** what you're trying to get to is this:

SELECT * FROM ( 
   SELECT
       t.field,
       lag(t.other, 1) over ( .. ) as wlag
   FROM T t
   WHERE t.field = 1
) inn
WHERE wlag = 3;


That is, you want to be able to filter on annotations without copying the annotation into the WHERE clause by wrapping with an outer query that does the filtering. Is that correct?
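For what it's worth, the query above can be tried outside Django. The following is only a sketch using Python's built-in sqlite3 module (SQLite 3.25+ assumed for window functions); the table contents and the `ORDER BY t.id` window ordering are assumptions filled in for the elided `( .. )`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data; only the column names follow the SQL in this message.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, field INTEGER, other INTEGER)")
conn.executemany(
    "INSERT INTO t (field, other) VALUES (?, ?)",
    [(1, 3), (1, 5), (2, 7), (1, 9)],
)

# Referencing the window alias directly in WHERE is rejected by SQLite.
try:
    conn.execute("""
        SELECT t.field, LAG(t.other, 1) OVER (ORDER BY t.id) AS wlag
        FROM t
        WHERE wlag = 3
    """)
    direct_ok = True
except sqlite3.Error:
    direct_ok = False

# Wrapping the annotated query lets the outer WHERE filter on the alias.
rows = conn.execute("""
    SELECT * FROM (
        SELECT t.field, t.other, LAG(t.other, 1) OVER (ORDER BY t.id) AS wlag
        FROM t
        WHERE t.field = 1
    ) AS inn
    WHERE wlag = 3
""").fetchall()

print(direct_ok)  # False
print(rows)       # [(1, 5, 3)]
```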

Bernd Wechner

Mar 14, 2019, 8:18:59 PM
Josh,

Thanks very much for your contribution. Yes it is not concise, but I did hope it is clear. Alas there is a contest between a well presented, well researched, and complete question and a concise one. Sometimes the two don't reconcile themselves well I admit.

Your supposition is almost right though:
  1. It relates more to a problem that DISTINCT and INNER JOIN produce with window functions. If you want DISTINCT tuples on a JOINed pair of tables, that is fine, but the minute you add a window function, each tuple gains a unique value before DISTINCT is applied, so DISTINCT is rendered functionally useless.
  2. The only fix is to force DISTINCT to apply before the window functions are added (i.e. before annotating).
  3. Django seems not to have a way to do that, bar raw SQL. And I would argue it should.
To paraphrase your example and ignoring JOINS, in the interest of simplicity and brevity:

In this query, if multiple tuples that match the WHERE share the same value of field1, then rows in the output are duplicated:

   SELECT t.field1
   FROM T t
   WHERE t.field2 like '%x%'

In this query:

   SELECT DISTINCT t.field1
   FROM T t
   WHERE t.field2 like '%x%'

the duplication is avoided. But in this query:

   SELECT DISTINCT t.field1, row_number() over ( .. ) as wrownum
   FROM T t
   WHERE t.field2 like '%x%'

The duplication is maintained. That is because each row receives a unique value of wrownum, and DISTINCT is applied to the result! The way to fix this is:

    SELECT inn.field1, row_number() over ( .. ) as wrownum
    FROM (
        SELECT DISTINCT t.field1
        FROM T t
        WHERE t.field2 like '%x%'
    ) inn

And there seems no way in Django currently to do such a simple breakout SELECT wrapper with Subqueries that I can find.
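A small sketch of the effect, run with Python's built-in sqlite3 module rather than the ORM (SQLite 3.25+ assumed for window functions). The tables `t` and `r`, their rows, and the `ORDER BY` inside the window are all invented for illustration; `r` stands in for any related table joined through a to-many relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: r has a foreign key to t (a to-many relationship).
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, field1 TEXT)")
conn.execute("CREATE TABLE r (id INTEGER PRIMARY KEY, t_id INTEGER, field2 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany(
    "INSERT INTO r VALUES (?, ?, ?)",
    [(1, 1, "xa"), (2, 1, "xb"), (3, 2, "xc"), (4, 2, "none")],
)

# DISTINCT applied after the window function: every joined row already
# carries a unique wrownum, so the duplicate for t.id = 1 survives.
naive = sorted(conn.execute("""
    SELECT DISTINCT t.id, ROW_NUMBER() OVER (ORDER BY t.id) AS wrownum
    FROM t JOIN r ON r.t_id = t.id
    WHERE r.field2 LIKE '%x%'
""").fetchall())

# DISTINCT applied in a subquery first, window function in the outer SELECT.
wrapped = sorted(conn.execute("""
    SELECT inn.id, ROW_NUMBER() OVER (ORDER BY inn.id) AS wrownum
    FROM (
        SELECT DISTINCT t.id
        FROM t JOIN r ON r.t_id = t.id
        WHERE r.field2 LIKE '%x%'
    ) AS inn
""").fetchall())

print(naive)    # [(1, 1), (1, 2), (2, 3)]
print(wrapped)  # [(1, 1), (2, 2)]
```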

As an aside, if t.field1 is an object id, the above example actually causes no problems, but when you join to any table that T has a to-many relationship with, this problem emerges, which is where I found it. I get duplicate tuples with the same id because the LIKE matches more than one of the related objects. That complicates the queries a little and is well presented on Stack Overflow.

My concern is that it's not possible, and that it would be if we supported, perhaps, a new argument on filters, like Wrap, which if true wraps the SQL in a new SELECT. But I'm not 100% sure there isn't an existing Django ORM option; I am just at this point deeply suspicious there isn't one.

Your example below BTW is also a fine use case, and similarly not supported currently in the ORM I fear. Unless (and I hope) I am mistaken.

Kind regards,

Bernd.


charettes

Mar 14, 2019, 9:21:39 PM
Hey Bernd,

> And there seems no way in Django currently to do such a simple breakout SELECT wrapper with Subqueries that I can find.

Right, there's no public API to perform the pushdown but what you are looking for is really
similar to what QuerySet.aggregate() (through sql.query.Query.get_aggregation) does when
the query happens to use .distinct(). I think that removing the `contains_aggregate` checks
in .aggregate()[0] could actually make it work out of the box with an aggregate(F(), RowNumber)
and the check can probably be worked around by passing subclasses of F and RowNumber
that set `contains_aggregate = True` during `resolve_expression` when passed `is_summary=True`.

Note that there have been previous requests for this exact feature[1] and by playing around a bit
with Subquery and sql.Query.get_aggregation recently[2] I'm pretty confident most of the logic of
the latter could be reused to implement it.

Simon
