Filtering window expressions. Ticket #28333

342 views
Skip to first unread message

Mads Jensen

unread,
Apr 4, 2020, 5:11:01 AM4/4/20
to django-d...@googlegroups.com
Hey,

I originally submitted the window functions expressions to Django a couple of years ago for Django 2.0. In hindsight,
there are probably things that could be implemented a bit better although I'm not sure how I easily could have used the
SQLCompiler or QuerySet/Query classes directly, especially considering the fact that there are slight discrepencies in
how window expressions are supported between the different databases that Django officielly supports.

Anyway, it spawned a ticket #28333 [1] that I like to dedicate some time to implement since the feature appears to be wanted.

I have some thoughts about the process:

1. Include another flag: filtering_requires_subquery (should default to False, just filterable defaults to True)
2. Edit in QuerySet._filter_or_exclude to add somehow the following logic:

1. If the expression is not filterable, then we can introduce a subquery and filter directly there.
I surmise that this statement would be generic for SQL: if something is not filterable in a main query,
then it will be including it in another query as a subquery, and filtering that way?
2. Everything related to existing ordering etc. is a bit blurry.
3. Having to work with table aliases doesn't seem to be something that I'd want to worry about.
There's Query.rewrite_cols for the case when there's an annotation from a joined table,
Sum(F('author__awards')) is the example given in the comment in that function.

3. I think it's a general lacune that it's difficult to add a subquery to an existing query.
Although there's a boolean flag Query.subquery, it doesn't seem to make it easier to deal with filtering and other
things related to a subquery. I found this when searching for tickets that have "subquery" in the description:
https://code.djangoproject.com/ticket/20127

When are the table aliases used (T1, T2 etc.), and when are the actual table names used? I think it's when a table is auto-joined?

I'm not sure how much work that is required in SQLCompiler and Query, or if those two parts can stay untouched.

I tend to overthink things and sometimes make them more complex, so it's likely that there are things in the above that

Query.as_sql has this:

def as_sql(self, compiler, connection):
sql, params = self.get_compiler(connection=connection).as_sql()
if self.subquery:
sql = '(%s)' % sql
return sql, params

If someone is willing to discuss ideas about the process, please reach out to me. Devil lies in the details, and I suppose some
of the problem that subqueries are difficult in the ORM could be mended with work on this patch. However, I'd like to interact with
people who have worked more with the internals of the ORM than I did. One particular name Simon Charettes comes to mind,
since he also volunteered to mentor GSOC in the ORM area.

[1] https://code.djangoproject.com/ticket/28333
--
Med venlig hilsen / Kind regards,
Mads Jensen







charettes

unread,
Apr 4, 2020, 6:06:15 PM4/4/20
to Django developers (Contributions to Django itself)
Hey Mads,

I'll start by saying that subquery handling is a bit messy in the ORM. As you noticed there a few tickets and PRs about trying to make it better. I personally tried to get things better by pushing a lot of the current Subquery logic down Query a few months[0] ago but there's still a lot of room for improvements.

If I understand correctly filtering against window expression would require an automatic subquery wrap around which the ORM already does under certain circumstances but always on terminal operations such as aggregation[1].

I think that getting the semantic of how the wrapping should be performed while allowing following Queryset operations (filter, annotate, aggregate) to take place would be quite tricky and error prone so I suggest you explore the following alternative.

There's a ticket opened[2] to introduce a Queryset method to use the current query as a subquery which would allow to circumvent the implicit wrap around problem and require this method to be used to filter against window expression. To reuse #28333's example and assuming the method would be named subquery you could achieve what you're after with the following

    Window.objects.annotate(
        row=Window(expression=RowNumber())
    ).subquery().filter(row__gt=1)

which would result in SQL equivalent to

    SELECT * FROM (
        SELECT *, ROW_NUMBER()
        FROM window
    ) subquery WHERE subquery.row > 1

This subquery operation would address a few other cases that I know Claude[3] is interested in seeing addressed as well. Maybe you'd be able to pair on this problem given his interest in the problem and your past interactions with some of the ORM's internal.

From the little investgation I've done so far I think it shouldn't be too hard to implement once there's a way to add subquery to `Query.alias_map`[4]. From that point the _subquery_ operation should create a new `Query` instance and add the current one to `alias_map` as `base_table` (the first entry in `alias_map`) by making sure to augment the outer `Query.annotations` and select mask appropriately. It's not clear to me what should be done with `JOIN`s originating from `select_related` prior to calling the operation but I think it would be more appropriate to prune them from the inner query and graft them to the outer one since they'll have the same base table.

Here's a bit of peudo-code of what I tried to describe above

    def subquery(self):
        query = Query(self.model)
        alias = 'subquery'
        query.alias_map[alias] = BaseSubquery(self.query.clone(), alias)
        # XXX: Handle the select mask as well.
        query.annotations = {
            annotation: Ref(annotation, expr)
            for annotation, expr in self.query.annotations.items()
        }
        return QuerySet(self.model, query=query)

It's certainly missing quite a lot of things but it should get you started.

I'm happy to provide more guidance along the way if you decide to head in this direction.

Cheers,
Simon


[0] https://github.com/django/django/commit/35431298226165986ad07e91f9d3aca721ff38ec#diff-0edd853580d56db07e4020728d59e193
[1] https://github.com/django/django/blob/6fbce45b0376f0ec8f3bf244f4f1f8d62c201f58/django/db/models/sql/compiler.py#L1556-L1574
[2] https://code.djangoproject.com/ticket/24462
[3] https://code.djangoproject.com/ticket/30592

Claude Paroz

unread,
Apr 6, 2020, 2:31:11 AM4/6/20
to Django developers (Contributions to Django itself)
Le dimanche 5 avril 2020 00:06:15 UTC+2, charettes a écrit :
This subquery operation would address a few other cases that I know Claude[3] is interested in seeing addressed as well. Maybe you'd be able to pair on this problem given his interest in the problem and your past interactions with some of the ORM's internal.

Yes, I confirm my strong interest in solving that, particularly the "simple" use case of #30592. Unfortunately, my knowledge of the ORM internals is really poor.

Claude

Reply all
Reply to author
Forward
0 new messages