or-ing to QuerySets turns INNER into LEFT OUTER joins?

659 views
Skip to first unread message

Carsten Fuchs

unread,
Jul 15, 2013, 11:41:35 AM7/15/13
to Django users
Hi all,

we have two queries/QuerySets Q_a and Q_b, each of which use INNER joins
in the generated SQL when evaluated individually.

When I use a Q object to "OR" these two QuerySets, the INNER joins turn
into LEFT OUTER joins -- which in turn cause a huge performance drop
(several hundred times slower than with INNER joins).

Why is this, and can I do anything against it at the Django ORM level?


Details:

My environment:
Django 1.5.1 on Ubuntu 10.04 (Python 2.6.5) with Oracle database 10g


FirstDay = date(2013, 5, 1)
LastDay = date(2013, 5, 31)
SomeDep = ... # the department with ID == 1


Q_a = Staff.objects. \
filter(bereiche=SomeDep).distinct()

Q_b = Staff.objects. \
filter(erfasst__datum__gte=FirstDay,
erfasst__datum__lte=LastDay,
erfasst__bereich=SomeDep).distinct()

Q_a_or_b = Staff.objects. \
filter(Q(bereiche=SomeDep) |
Q(erfasst__datum__gte=FirstDay,
erfasst__datum__lte=LastDay,
erfasst__bereich=SomeDep)).distinct()


In the following output, to improve readability I used the "*" to
manually shorten the list of selected fields:


print Q_a.query

SELECT DISTINCT "STAFF".* FROM "STAFF"
INNER JOIN "STAFF_BEREICHE" ON ("STAFF"."ID" = "STAFF_BEREICHE"."STAFF_ID")
WHERE "STAFF_BEREICHE"."BEREICH_ID" = 1


print Q_b.query

SELECT DISTINCT "STAFF".* FROM "STAFF"
INNER JOIN "ERFASST" ON ("STAFF"."KEY" = "ERFASST"."KEY")
WHERE ("ERFASST"."DATUM" <= 2013-05-31 AND "ERFASST"."DATUM" >=
2013-05-01 AND "ERFASST"."BEREICH_ID" = 1 )


print Q_a_or_b.query

SELECT DISTINCT "STAFF".* FROM "STAFF"
LEFT OUTER JOIN "STAFF_BEREICHE" ON ("STAFF"."ID" =
"STAFF_BEREICHE"."STAFF_ID")
LEFT OUTER JOIN "ERFASST" ON ("STAFF"."KEY" = "ERFASST"."KEY")
WHERE ("STAFF_BEREICHE"."BEREICH_ID" = 1 OR ("ERFASST"."DATUM" <=
2013-05-31 AND "ERFASST"."DATUM" >= 2013-05-01 AND
"ERFASST"."BEREICH_ID" = 1 ))


In the last SQL statement, when I replace "LEFT OUTER JOIN" with "INNER
JOIN" and manually run it, the result seems to be the same, but the
query completes one hundred to several hundred times faster.

So, from the above observations, I was wondering if the use of "LEFT
OUTER JOIN" is what is supposed to happen for Q_a_or_b in the first
place? Is it the expected behavior?
And if so, can it be changed via the Django ORM?

Many thanks for your help!

Best regards,
Carsten



--
Dipl.-Inf. Carsten Fuchs

Carsten Fuchs Software
Industriegebiet 3, c/o Rofu, 55768 Hoppstädten-Weiersbach, Germany
Internet: http://www.cafu.de | E-Mail: in...@cafu.de

Cafu - the open-source game and graphics engine for multiplayer 3D action

Jon Dufresne

unread,
Jul 16, 2013, 10:02:01 AM7/16/13
to django...@googlegroups.com
I too would be interested to know if there is a solution. A LEFT OUTER JOIN can really hurt performance when an INNER JOIN are more appropriate,.


--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.
For more options, visit https://groups.google.com/groups/opt_out.



Carsten Fuchs

unread,
Jul 24, 2013, 12:39:10 PM7/24/13
to django...@googlegroups.com
Hi all,

Am 15.07.2013 17:41, schrieb Carsten Fuchs:
> we have two queries/QuerySets Q_a and Q_b, each of which use INNER joins
> in the generated SQL when evaluated individually.
>
> When I use a Q object to "OR" these two QuerySets, the INNER joins turn
> into LEFT OUTER joins -- which in turn cause a huge performance drop
> (several hundred times slower than with INNER joins).
>
> Why is this, and can I do anything against it at the Django ORM level?


Can someone provide some help or insights into this, please?

Best regards,
Carsten



--
Dipl.-Inf. Carsten Fuchs

Carsten Fuchs Software
Industriegebiet 3, c/o Rofu, 55768 Hoppst�dten-Weiersbach, Germany

Tom Evans

unread,
Jul 25, 2013, 6:17:13 AM7/25/13
to django...@googlegroups.com
On Wed, Jul 24, 2013 at 5:39 PM, Carsten Fuchs <carste...@cafu.de> wrote:
> Hi all,
>
> Am 15.07.2013 17:41, schrieb Carsten Fuchs:
>
>> we have two queries/QuerySets Q_a and Q_b, each of which use INNER joins
>> in the generated SQL when evaluated individually.
>>
>> When I use a Q object to "OR" these two QuerySets, the INNER joins turn
>> into LEFT OUTER joins -- which in turn cause a huge performance drop
>> (several hundred times slower than with INNER joins).
>>
>> Why is this, and can I do anything against it at the Django ORM level?
>
>
>
> Can someone provide some help or insights into this, please?
>

Isn't this to be expected? You've asked Django to OR the querysets.
This means that you are looking for tuples from STAFF where either the
join+conditions to ERFASST match or the join+conditions to
STAFF_BEREICHE match. If these joins were performed with an INNER
JOIN, then the query would only include tuples from STAFF that match
both joins and conditions.

Cheers

Tom

Carsten Fuchs

unread,
Jul 25, 2013, 11:42:05 AM7/25/13
to django...@googlegroups.com
Hi Tom,

thank you very much for your reply!

Am 25.07.2013 12:17, schrieb Tom Evans:
> Isn't this to be expected? You've asked Django to OR the querysets.
> This means that you are looking for tuples from STAFF where either the
> join+conditions to ERFASST match or the join+conditions to
> STAFF_BEREICHE match. If these joins were performed with an INNER
> JOIN, then the query would only include tuples from STAFF that match
> both joins and conditions.

Well, I must foreclose that I'm by no means an SQL expert, and rather
consider myself as a beginner -- but won't the joins augment the STAFF
table independently of each other?

That is, if I understand things correctly, the join types INNER vs. LEFT
OUTER are mostly about the handling of NULL values on the "other" side
of the relation, but as only non-NULL values are involved here, I'd
expect a query to provide the same augmented table no matter if INNER or
LEFT OUTER joins are used.

(And vice versa, as the corresponding Django foreign key fields are
defined as non-NULL, too, I thought that Django would use INNER joins in
the first place.)

The OR in the WHERE clause then acts "independently" on the resulting
augmented table.

When I try this experimentally with the queries Q_a and Q_b from my
original post, where Q_a and Q_b yield different result tuples, e.g.

A B C D E

for Q_a and

D E F G

for Q_b, then *both* the query Q_a_or_b with the original LEFT OUTER
joins, as well as an edited query where INNER join is used instead,
yield the same result tuple set

A B C D E F G


Or is my understanding of SQL basics that wrong?

Tom Evans

unread,
Jul 25, 2013, 1:00:06 PM7/25/13
to django...@googlegroups.com
On Thu, Jul 25, 2013 at 4:42 PM, Carsten Fuchs <carste...@cafu.de> wrote:
> Hi Tom,
>
> thank you very much for your reply!
>
> Am 25.07.2013 12:17, schrieb Tom Evans:
>
>> Isn't this to be expected? You've asked Django to OR the querysets.
>> This means that you are looking for tuples from STAFF where either the
>> join+conditions to ERFASST match or the join+conditions to
>> STAFF_BEREICHE match. If these joins were performed with an INNER
>> JOIN, then the query would only include tuples from STAFF that match
>> both joins and conditions.
>
>
> Well, I must foreclose that I'm by no means an SQL expert, and rather
> consider myself as a beginner -- but won't the joins augment the STAFF table
> independently of each other?
>
> That is, if I understand things correctly, the join types INNER vs. LEFT
> OUTER are mostly about the handling of NULL values on the "other" side of
> the relation, but as only non-NULL values are involved here, I'd expect a
> query to provide the same augmented table no matter if INNER or LEFT OUTER
> joins are used.

This query is all about joining tuples from the STAFF table to the
other two tables. If the joins to both ERFASST and STAFF_BEREICHE are
inner joins, then the only tuples from STAFF under consideration are
ones which have valid references to both tables. This is not the same
as OR'ing two independent queries.

Now, if your data obeys that criteria, then there would be no
difference between the two results, but the meaning of the queries are
very different. With inner joins, the query means "Find me staff that
have department 1, AND have logs* in the specified date range from
department 1", but with outer joins the query means "Find me staff
that have either department 1, or have logs from the specified date
range from department 1".

Put another way, if you AND'ed the Q objects instead, your query would
be the fast one you are looking for with inner joins throughout, but
you would not see users who do not have log entries, nor users with
log entries not in departments - but those conditions sound like an
impossible situation given your models.

As an example you can play around with, pick one user who has log
entries, and remove all their departments from STAFF_BEREICHE. This
user should then be found with the slow query with LEFT OUTER joins,
but not with the fast query with INNER joins - they are excluded due
to the INNER join to STAFF_BEREICHE fails for that tuple in STAFF, and
so it is not considered from that point onwards.

Cheers

Tom

* Erfasst is meaning logging here, right? Reaching the limits of my German :)

Carsten Fuchs

unread,
Jul 26, 2013, 12:32:18 PM7/26/13
to django...@googlegroups.com
Hi Tom,

thank you very much for your detailed reply, it helped at lot!

Am 25.07.2013 19:00, schrieb Tom Evans:
> [...]
> As an example you can play around with, pick one user who has log
> entries, and remove all their departments from STAFF_BEREICHE. This
> user should then be found with the slow query with LEFT OUTER joins,
> but not with the fast query with INNER joins [...]

Yes, I can reproduce this, and also for the opposite case, where a STAFF
member has no entries in the related ERFASST table.

And although I've clipped most of your text in the quote above for
brevity, your text really made me understand the background and
reasoning, too! :-D

Now with correctness established, this leaves me with the performance
problem. The formulation with the OR and the LEFT OUTER joins is indeed
what we semantically need (also see below, about Erfasst and logging ;) ).

The best that I could think of so far is to run the individual queries
individually, combine the results in Python code (the "OR" operation),
then (as we need a QuerySet at this place in the code for further
filtering) run another query with `.filter(id__in=ManuallyCombinedResults)`:


S_a = set(Staff.objects. \
filter(bereiche=SomeDep).values_list('id', flat=True))

S_b = set(Staff.objects. \
filter(erfasst__datum__gte=FirstDay,
erfasst__datum__lte=LastDay,
erfasst__bereich=SomeDep).values_list('id', flat=True))

Q_a_or_b = Staff.objects.filter(id__in=S_a | S_b).distinct()


This is still about 50 times better than the form with Q objects and
LEFT OUTER joins. (Do you see a way to optimize this even further (in
the Django ORM)?)

My initial assumption was that the OR in the all-in-one query would not
be much slower than the individual queries, and was really surprised to
learn otherwise...

> * Erfasst is meaning logging here, right? Reaching the limits of my German :)

Translated verbatim, "Erfasst" means "committed" or "recorded" -- which
obviously is a really stupid name for a database table. Alas, this is
the legacy that I got when I overtook the original PHP project, and
backwards-compatibility has so far kept me from updating the name.

What we really are recording here are "workdays". In my attempt to keep
the examples in my initial post both easy to understand and correct, I
only translated the table name for STAFF, but felt that I'd lose
consistency and thus correctness if things got more involved if I
translated ERFASST as well (especially with backwards relation names etc.)

So the above is really an attempt to find all STAFF members who are
assigned to a certain department regularly (via STAFF_BEREICHE), "OR"
those who have spent a workday in that department in the given time range.
(So new staff members who have not yet spent a single workday anywhere
would promptly be missing from any report and form with the
INNER-joins-only query...)

Tom, many thanks for your help!

Martin J. Laubach

unread,
Jul 26, 2013, 2:48:33 PM7/26/13
to django...@googlegroups.com

This is still about 50 times better than the form with Q objects and
LEFT OUTER joins.

  This is rather unusual. A join in the database should be quite a bit faster than doing things manually, except if the joined tables produce massive amounts of data. This suggests that you're missing useful indexes and/or statistics for your query.

        mjl

Carsten Fuchs

unread,
Jul 29, 2013, 9:09:28 AM7/29/13
to django...@googlegroups.com
Hi Martin,

Am 2013-07-26 20:48, schrieb Martin J. Laubach:
> This is rather unusual. A join in the database should be quite a bit faster than
> doing things manually, except if the joined tables produce massive amounts of data. This
> suggests that you're missing useful indexes and/or statistics for your query.


Could you please elaborate a bit?

As I mentioned earlier, I'm no SQL/database expert. When I run the queries individually,
they're all reasonably fast, even though all of them have several thousand rows, and the
ERFASST table even has several millions. So I concluded that the proper indexes were
already in place.

I've also talked to our DB admins, who ran the full query through some Oracle
auto-optimizer tool, but said that it only suggested to add more indexes that, when
tried out, did not seem to buy anything.

Anything else that I should look into or steps to take?

Sharath Gururaj

unread,
Feb 14, 2014, 4:28:21 PM2/14/14
to django...@googlegroups.com
Hi Carsten,

Did you find a fix for this problem?
I'm running into the same issue. I am manually doing a union in python to avoid the left outer join.
What I would like to know is: 
1. Is there a reason why mysql performance is so slow on left outer join? (even with indexes)
2. How do I force django to use a UNION ALL (which is performant) instead of left outer join


On Monday, July 29, 2013 9:09:28 AM UTC-4, Carsten Fuchs wrote:
Hi Martin,

Am 2013-07-26 20:48, schrieb Martin J. Laubach:
>    This is rather unusual. A join in the database should be quite a bit faster than
> doing things manually, except if the joined tables produce massive amounts of data. This
> suggests that you're missing useful indexes and/or statistics for your query.


Could you please elaborate a bit?

As I mentioned earlier, I'm no SQL/database expert. When I run the queries individually,
they're all reasonably fast, even though all of them have several thousand rows, and the
ERFASST table even has several millions. So I concluded that the proper indexes were
already in place.

I've also talked to our DB admins, who ran the full query through some Oracle
auto-optimizer tool, but said that it only suggested to add more indexes that, when
tried out, did not seem to buy anything.

Anything else that I should look into or steps to take?

Best regards,
Carsten



--
Dipl.-Inf. Carsten Fuchs

Carsten Fuchs Software
Industriegebiet 3, c/o Rofu, 55768 Hoppst�dten-Weiersbach, Germany

Carsten Fuchs

unread,
Feb 14, 2014, 5:15:04 PM2/14/14
to django...@googlegroups.com
Hi,

Am 14.02.2014 22:28, schrieb Sharath Gururaj:
> Did you find a fix for this problem?
> I'm running into the same issue. I am manually doing a union in python
> to avoid the left outer join.

Unfortunately, I've not found anything else, and ended up using what I
described in my earlier posts: computing the "OR" operation in Python,
then using another query like `... .filter(id__in=S_a | S_b)`.

> What I would like to know is:
> 1. Is there a reason why mysql performance is so slow on left outer
> join? (even with indexes)

I'm using Oracle, but I don't know the answer either way. I too would be
interested in learning more, though!

Best regards,
Carsten

Camilo Torres

unread,
Feb 15, 2014, 4:37:01 PM2/15/14
to django...@googlegroups.com
On Friday, February 14, 2014 4:58:21 PM UTC-4:30, sha wrote:
What I would like to know is: 
1. Is there a reason why mysql performance is so slow on left outer join? (even with indexes)


That is not a MySQL issue with outher joins, nor any other RDBMS issue with outher join. That depends so much in the way you wrote your query and the conditions for joining you are putting in; you can be doing a full table scan or cartesian product of two (or more) relations. The index are used whenever possible or whenever the RDBMS is able to use them.

Carsten's manual optimization seems better than that of the ORM for this case.
Reply all
Reply to author
Forward
0 new messages