Using EXISTS instead of IN for subqueries

1,452 views
Skip to first unread message

Anssi Kääriäinen

unread,
Mar 25, 2013, 6:40:50 AM3/25/13
to Django developers
I am very likely going to change the ORM to use EXISTS subqueries
instead of IN subqueries. I know this is a good idea on PostgreSQL but
I don't have enough experience of other databases to know if this is a
good idea or not.

There are two main reasons for doing this. First, exists should
perform better on some databases, and exists allows for filter
conditions other than single column equality on all databases. So,
EXISTS subqueries are needed in the ORM in any case, the question is
if they should be the only option.

The semantics of NOT IN are harder to optimize for the DB than NOT
EXISTS, and this can result in large performance differences. See for
example this post from pgsql-hacker mailing list:
http://www.postgresql.org/message-id/19913.13...@sss.pgh.pa.us

It is easy to construct cases where NOT IN results in runtime of days
and NOT EXISTS in runtime of seconds. Just have a large enough table
in the subquery and PostgreSQL will choke.

Quick testing indicates that Oracle and MySQL seem to perform about
the same for IN and EXISTS variants, and SQLite seems to be a bit
faster when using EXISTS over IN. The docs of MySQL suggests using
EXISTS: https://dev.mysql.com/doc/refman/5.5/en/subquery-optimization-with-exists.html
(see the part about "very useful optimization"). My experience of
using these databases is very limited, so I might be missing some
known problematic cases.

So, the question is if there are situations where performance of
EXISTS is a lot worse than IN?

It will be possible to have a
connection.features.prefers_exists_subqueries flag and use that to
decide if the query should be generated as IN or EXISTS subquery.
However, always using EXISTS is a lot simpler.

- Anssi

Tim Chase

unread,
Mar 25, 2013, 6:58:12 AM3/25/13
to django-d...@googlegroups.com, anssi.ka...@thl.fi
On 2013-03-25 03:40, Anssi Kääriäinen wrote:
> I am very likely going to change the ORM to use EXISTS subqueries
> instead of IN subqueries. I know this is a good idea on PostgreSQL
> but I don't have enough experience of other databases to know if
> this is a good idea or not.

I can only speak for testing IN-vs-EXISTS speed on MSSQLServer at
$OLD_JOB, but there it's usually about the same, occasionally with IN
winning out. However, the wins were marginal, and MSSQL is a 2nd-class
citizen in the Django world, so I'm +1 on using EXISTS instead of IN,
if the results are assured to be the same.

However, the query constuction to move the condition into the EXISTS
subclause might be a bit more complex.

-tkc


Simon Riggs

unread,
Mar 25, 2013, 7:23:07 AM3/25/13
to django-d...@googlegroups.com
On 25 March 2013 10:58, Tim Chase <django...@tim.thechases.com> wrote:
> On 2013-03-25 03:40, Anssi Kääriäinen wrote:
>> I am very likely going to change the ORM to use EXISTS subqueries
>> instead of IN subqueries. I know this is a good idea on PostgreSQL
>> but I don't have enough experience of other databases to know if
>> this is a good idea or not.
>
> I can only speak for testing IN-vs-EXISTS speed on MSSQLServer at
> $OLD_JOB, but there it's usually about the same, occasionally with IN
> winning out. However, the wins were marginal, and MSSQL is a 2nd-class
> citizen in the Django world, so I'm +1 on using EXISTS instead of IN,
> if the results are assured to be the same.

The results are definitely different because NOT IN has some quite
strange characteristics: if the subquery returns a NULL then the whole
result is "unknown". It is that weirdness that makes it hard to
optimize for, or at least, not-yet-optimized for in PostgreSQL.

In most cases it is the NOT EXISTS behaviour that people find natural
and normal anyway and that is the best mechanism to use.

> However, the query constuction to move the condition into the EXISTS
> subclause might be a bit more complex.


--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Anssi Kääriäinen

unread,
Mar 25, 2013, 8:37:00 AM3/25/13
to Django developers
On 25 maalis, 13:23, Simon Riggs <si...@2ndQuadrant.com> wrote:
> On 25 March 2013 10:58, Tim Chase <django.us...@tim.thechases.com> wrote:
>
> > On 2013-03-25 03:40, Anssi Kääriäinen wrote:
> >> I am very likely going to change the ORM to use EXISTS subqueries
> >> instead of IN subqueries. I know this is a good idea on PostgreSQL
> >> but I don't have enough experience of other databases to know if
> >> this is a good idea or not.
>
> > I can only speak for testing IN-vs-EXISTS speed on MSSQLServer at
> > $OLD_JOB, but there it's usually about the same, occasionally with IN
> > winning out. However, the wins were marginal, and MSSQL is a 2nd-class
> > citizen in the Django world, so I'm +1 on using EXISTS instead of IN,
> > if the results are assured to be the same.
>
> The results are definitely different because NOT IN has some quite
> strange characteristics: if the subquery returns a NULL then the whole
> result is "unknown". It is that weirdness that makes it hard to
> optimize for, or at least, not-yet-optimized for in PostgreSQL.
>
> In most cases it is the NOT EXISTS behaviour that people find natural
> and normal anyway and that is the best mechanism to use.

When doing an .exclude() that requires subquery Django automatically
generates the queries so that the inner query's select clause can't
contain nulls. For example:
>>> print D.objects.exclude(e__id__gte=0).query
SELECT `table_d`.`id`, `table_d`.`a`, `table_d`.`b` FROM `table_d`
WHERE NOT (`table_d`.`id` IN (SELECT U1.`d_id` FROM `table_e` U1 WHERE
(U1.`id` >= 0 AND U1.`d_id` IS NOT NULL)))

However it is possible to generate NOT IN query where the SQL
semantics are in effect when using __in lookup:
>>> print D.objects.exclude(id__in=E.objects.filter(id__gte=0).values_list('d_id')).query
SELECT `table_d`.`id`, `table_d`.`a`, `table_d`.`b` FROM `table_d`
WHERE NOT (`table_d`.`id` IN (SELECT U0.`d_id` FROM `table_e` U0 WHERE
U0.`id` >= 0 ))

The results of the latter case could change (assuming d_id can contain
null values).

I think that this could be considered a bug fix. Django's ORM doesn't
try to mimic SQL semantics, it tries to have Python semantics for the
query. So an exclude(__in) lookup should behave like Python's "value
not in list", not like SQL's NOT IN.

On the other hand having __in lookups that do EXISTS in SQL might be a
bit surprising. The way __in works is documented as generating SQL IN
lookup: https://docs.djangoproject.com/en/dev/ref/models/querysets/#in.

I feel pretty strongly that NOT EXISTS semantics are wanted. The NOT
IN semantics are likely there just because that is how the
implementation was originally done, not because there was any decision
to choose those semantics. Also, multicolumn NOT IN lookups aren't
supported on all databases (SQLite at least), so for that case NOT
EXISTS semantics is going to happen anyways.

- Anssi

Alex Gaynor

unread,
Mar 25, 2013, 9:26:18 AM3/25/13
to django-d...@googlegroups.com
I have no idea how EXISTS performs on MySQL, however I can say that IN + subqueries on MySQL are so atrocious that we outright banned that where I work, so I don't see how it could be worse :)

Alex



--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.





--
"I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero

Simon Riggs

unread,
Mar 25, 2013, 2:16:07 PM3/25/13
to django-d...@googlegroups.com
On 25 March 2013 12:37, Anssi Kääriäinen <anssi.ka...@thl.fi> wrote:

> I feel pretty strongly that NOT EXISTS semantics are wanted. The NOT
> IN semantics are likely there just because that is how the
> implementation was originally done, not because there was any decision
> to choose those semantics.

Most likely, yes, so it looks like a bug fix now not an optimization.

> Also, multicolumn NOT IN lookups aren't
> supported on all databases (SQLite at least), so for that case NOT
> EXISTS semantics is going to happen anyways.

Yes, I think that's the clincher.

Michael Manfre

unread,
Mar 26, 2013, 1:16:00 AM3/26/13
to django-d...@googlegroups.com, anssi.ka...@thl.fi, django...@tim.thechases.com


On Monday, March 25, 2013 6:58:12 AM UTC-4, Tim Chase wrote:
I can only speak for testing IN-vs-EXISTS speed on MSSQLServer at
$OLD_JOB, but there it's usually about the same, occasionally with IN
winning out.

In SQL 2008r2, the optimizer is usually smart enough to end up with the same execution plan for IN and EXISTS queries. Historically, EXISTS was usually the faster operation for SQL Server and if memory serves it had to deal with its ability to bail out of the EXISTS query sooner compared to the IN query.

MSSQL is a 2nd-class citizen in the Django world, so I'm +1 

Reasoning like that helps to keep it in its place.

Anssi,

Any chance of adding a new database feature to flip the behavior of __in to either IN or EXISTS? Sounds like this change of logical and documented behavior is being made specifically because of failings with Postgresql. The feature would also help satisfy the deprecation cycle normally used for changes to documented behaviors. Sub-queries are more likely to expose database specific issues with the SQL provided by Django (normally when used with aggregates or slicing). Adding the database feature might save every other backend from having to potentially jump through unnecessary hoops (mangling more SQL).

Regards,
Michael Manfre

Anssi Kääriäinen

unread,
Mar 26, 2013, 8:03:17 AM3/26/13
to Django developers
Yes, this is a possibility and I think I will take this path. This
adds a bit of complexity to the code but not much. There is already
support for both IN and EXISTS subqueries in the ORM (as of couple of
days ago). In addition I will change the __in lookup to exclude NULL
values from the inner query so that EXISTS and IN give the same
results.

By default PostgreSQL and SQLite will get EXISTS as preferred subquery
type. For other databases it will be easy to flip the feature flag if
it seems EXISTS is better than IN for the DB.

- Anssi

Tim Chase

unread,
Mar 26, 2013, 10:23:12 AM3/26/13
to Michael Manfre, django-d...@googlegroups.com, anssi.ka...@thl.fi, django...@tim.thechases.com
On 2013-03-25 22:16, Michael Manfre wrote:
> On Monday, March 25, 2013 6:58:12 AM UTC-4, Tim Chase wrote:
> > I can only speak for testing IN-vs-EXISTS speed on MSSQLServer at
> > $OLD_JOB, but there it's usually about the same, occasionally
> > with IN winning out.
>
> In SQL 2008r2, the optimizer is usually smart enough to end up with
> the same execution plan for IN and EXISTS queries. Historically,
> EXISTS was usually the faster operation for SQL Server and if
> memory serves it had to deal with its ability to bail out of the
> EXISTS query sooner compared to the IN query.

I'd have to go back and re-test 2008r2, as my testing was on 2005
(and earlier). But my testing directly contradicts your
"Historically..." bit, as I DISTINCTly (bad SQL pun intended) remember
being surprised precisely because of what you say: EXISTS should be
able to optimize and bail early. EXISTS also has some nice features
like the ability to do testing against multiple columns, i.e., you
can't do something like

select *
from tbl1
where (col1, col2) in (
select a, b
from tbl2
)

but that's a simple EXISTS query.

> > MSSQL is a 2nd-class citizen in the Django world, so I'm +1
>
> Reasoning like that helps to keep it in its place.

MSSQL's lack of certain core features is what does that.
OFFSET/LIMIT provided by pretty much every other DB vendor? Use
baroque hacks to get the same functionality. I seem to recall other
issues, though v2008 seems to have addressed many of them (min date =
1900-01-01 which was an issue when dealing with
historical/genealogical data; seems to be better in 2008)

Either way, if EXISTS in MSSQL is now faster than IN, it just is one
more tally in the "plus" column for why this might be a good idea
(modulo implementation complexities).

-tkc






Michael Manfre

unread,
Mar 26, 2013, 11:19:52 AM3/26/13
to Tim Chase, django-d...@googlegroups.com, anssi.ka...@thl.fi
On Tue, Mar 26, 2013 at 10:23 AM, Tim Chase <django...@tim.thechases.com> wrote:
> In SQL 2008r2, the optimizer is usually smart enough to end up with
> the same execution plan for IN and EXISTS queries. Historically,
> EXISTS was usually the faster operation for SQL Server and if
> memory serves it had to deal with its ability to bail out of the
> EXISTS query sooner compared to the IN query.

I'd have to go back and re-test 2008r2, as my testing was on 2005
(and earlier). But my testing directly contradicts your
"Historically..." bit, as I DISTINCTly (bad SQL pun intended) remember
being surprised precisely because of what you say: EXISTS should be
able to optimize and bail early.

My "Historically" comment is based upon memory of observed behaviors from many years ago and could even be an ingrained memory dating back to SQL Server 7. I'll concede that your testing is probably more accurate than my memory of observed SQL optimizer behaviors from upwards of a decade ago. The execution plan is a result of the specific query and schema, so it's possible that we're both correct and both wrong at the same time.
 
 EXISTS also has some nice features
like the ability to do testing against multiple columns, i.e., you
can't do something like

  select *
  from tbl1
  where (col1, col2) in (
   select a, b
   from tbl2
   )

but that's a simple EXISTS query.

Agreed, EXISTS is more flexible and at least to me, often easier to read, format, and maintain. Assuming this is implemented with a database feature, I'll most likely enable the EXISTS change for django-mssql. My main objections to a blanket change is to not have the specific behaviors of one database dictate how every other database must behave. Some one needs to be the voice of 3rd party database backends and it appears for the time being, I'm wearing that hat.
 
> > MSSQL is a 2nd-class citizen in the Django world, so I'm +1
>
> Reasoning like that helps to keep it in its place.

MSSQL's lack of certain core features is what does that.
OFFSET/LIMIT provided by pretty much every other DB vendor?  Use
baroque hacks to get the same functionality.  

The non "baroque hack" way of doing limit offset is with nested selects. Maybe someday the non-standard LIMIT/OFFSET keywords will get added to the standard (I truly hope this happens) so Oracle, MSSQL, DB2, and Informix could share SQL with postgres and mysql without needing to mangle it.

Regards,
Michael Manfre

Anssi Kääriäinen

unread,
Mar 26, 2013, 12:40:35 PM3/26/13
to django-d...@googlegroups.com, Tim Chase, anssi.ka...@thl.fi
On Tuesday, March 26, 2013 5:19:52 PM UTC+2, Michael Manfre wrote:


On Tue, Mar 26, 2013 at 10:23 AM, Tim Chase <django...@tim.thechases.com> wrote:
 EXISTS also has some nice features
like the ability to do testing against multiple columns, i.e., you
can't do something like

  select *
  from tbl1
  where (col1, col2) in (
   select a, b
   from tbl2
   )

but that's a simple EXISTS query.

Agreed, EXISTS is more flexible and at least to me, often easier to read, format, and maintain. Assuming this is implemented with a database feature, I'll most likely enable the EXISTS change for django-mssql. My main objections to a blanket change is to not have the specific behaviors of one database dictate how every other database must behave. Some one needs to be the voice of 3rd party database backends and it appears for the time being, I'm wearing that hat.

One of the main reasons for the change is that EXISTS allows for queries that are impossible with IN. In addition EXISTS semantics regarding NULLs is wanted. And, at least PostgreSQL performs better with NOT EXISTS than NOT IN. Granted, the better performance on PostgreSQL is perhaps the most important reason for me, but this change is not only about that.

The main reason for this thread was to find out if there are some databases where performance of EXISTS is worse than IN. The DB feature approach seems good because it allows investigating performance characteristics one DB at time.

 
> > MSSQL is a 2nd-class citizen in the Django world, so I'm +1
>
> Reasoning like that helps to keep it in its place.

MSSQL's lack of certain core features is what does that.
OFFSET/LIMIT provided by pretty much every other DB vendor?  Use
baroque hacks to get the same functionality.  

The non "baroque hack" way of doing limit offset is with nested selects. Maybe someday the non-standard LIMIT/OFFSET keywords will get added to the standard (I truly hope this happens) so Oracle, MSSQL, DB2, and Informix could share SQL with postgres and mysql without needing to mangle it.

Hmmh, this means Oracle, MSSQL, DB2 and Informix are doing more or less the same thing for limit/offset support? If so, then having a more generic approach to this problem than having a custom compiler per backend might be worth it...

BTW there is already something like LIMIT and OFFSET in SQL 2008 standard. The syntax is different than LIMIT/OFFSET, and supported only by some vendors... See https://en.wikipedia.org/wiki/Select_%28SQL%29#FETCH_FIRST_clause

 - Anssi

Michael Manfre

unread,
Mar 26, 2013, 3:54:00 PM3/26/13
to django-d...@googlegroups.com, Tim Chase, anssi.ka...@thl.fi
On Tue, Mar 26, 2013 at 12:40 PM, Anssi Kääriäinen <anssi.ka...@thl.fi> wrote:
Hmmh, this means Oracle, MSSQL, DB2 and Informix are doing more or less the same thing for limit/offset support? If so, then having a more generic approach to this problem than having a custom compiler per backend might be worth it...

I haven't looked at the DB2 or Informix code, but Oracle and MSSQL have a custom SQLCompiler primarily to handle constructing SQL to deal with limit/offset. A generic approach would be nice to have, but I can't imagine a generic way that would let me generate the "SELECT ... FROM (SELECT ROW_NUMBER() OVER (...)) WHERE ..." monstrosity with lots of column aliasing that I currently construct. I'm definitely interested in brainstorming what a generic approach might look like.
 
BTW there is already something like LIMIT and OFFSET in SQL 2008 standard. The syntax is different than LIMIT/OFFSET, and supported only by some vendors... See https://en.wikipedia.org/wiki/Select_%28SQL%29#FETCH_FIRST_clause

The SQL 2008 standard added FETCH FIRST, which covers the LIMIT part of the problem, but I didn't see anything in the standard to deal with OFFSET, which is usually the more painful part of the problem. Window functions were added in the 2003 standard, but they can have some issues with ordering and I don't think the standard requires a way of being able to filter by some sort of result row number.

Regards,
Michael Manfre

Tim Chase

unread,
Mar 26, 2013, 4:27:44 PM3/26/13
to Michael Manfre, django-d...@googlegroups.com, anssi.ka...@thl.fi
On 2013-03-26 15:54, Michael Manfre wrote:
> On Tue, Mar 26, 2013 at 12:40 PM, Anssi Kääriäinen
> deal with limit/offset. A generic approach would be nice to have,
> but I can't imagine a generic way that would let me generate the
> "SELECT ... FROM (SELECT ROW_NUMBER() OVER (...)) WHERE ..."
> monstrosity with lots of column aliasing that I currently
> construct.

The closest I've come is an ugly nested query using TOP (the TSQL
analog to LIMIT, but as mentioned earlier, there's no OFFSET
counterpart), and inverting the sort conditions:

-- want sorted by "a asc, b desc, c asc"
-- assuming LIMIT=10, OFFSET=20
select *
from (
select top 10 -- LIMIT
*
from (
select top 30 -- LIMIT+OFFSET=10+20
*
from tbl
order by a asc, b desc, c asc
) top_half
order by a desc, b asc, c desc -- note inversion
) reversed_top_half
order by a asc, b desc, c asc

It's been a while since I've done it, so it Works™, but (1) there's
the inevitable fence-posting error I'd have to verify, (2) it involves
sorting, reverse-sorting, then re-sorting (not exactly the speediest
operation), and (3) it's hideous. It doesn't seem to require the
column-aliasing you mention, and it is a fairly generic approach, but
I can't say I recommend it :-)

-tkc






Petite Abeille

unread,
Mar 26, 2013, 5:00:50 PM3/26/13
to django-d...@googlegroups.com

On Mar 26, 2013, at 4:19 PM, Michael Manfre <mma...@gmail.com> wrote:

> Maybe someday the non-standard LIMIT/OFFSET keywords will get added to the
> standard (I truly hope this happens) so Oracle, MSSQL, DB2, and Informix
> could share SQL with postgres and mysql without needing to mangle it.

FWIW, Oracle 12c is rumored to support OFFSET/FETCH FIRST:

• Top-N now using Row limiting clause eg. “OFFSET 10 ROWS FETCH FIRST 10 ROWS ONLY”. Similar to mySQL syntax.

http://www.oracle-base.com/blog/2012/10/06/oracle-openworld-2012-day-5/


Alex Gaynor

unread,
Mar 26, 2013, 5:03:02 PM3/26/13
to django-d...@googlegroups.com
For what it's worth, SQL2011 does define OFFSET, finally.

Alex


--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.


Petite Abeille

unread,
Mar 26, 2013, 5:47:34 PM3/26/13
to django-d...@googlegroups.com

On Mar 26, 2013, at 10:03 PM, Alex Gaynor <alex....@gmail.com> wrote:

> For what it's worth, SQL2011 does define OFFSET, finally.

Perhaps worthwhile mentioning as well :

"Do not try to implement a scrolling window using LIMIT and OFFSET. Doing so will become sluggish as the user scrolls down toward the bottom of the list."
-- Scrolling Cursor, What Not To Do
http://www.sqlite.org/cvstrac/wiki?p=ScrollingCursor

Just saying...

Michael Manfre

unread,
Mar 26, 2013, 7:49:10 PM3/26/13
to django-d...@googlegroups.com
On Tue, Mar 26, 2013 at 5:47 PM, Petite Abeille <petite....@gmail.com> wrote:

On Mar 26, 2013, at 10:03 PM, Alex Gaynor <alex....@gmail.com> wrote:

> For what it's worth, SQL2011 does define OFFSET, finally.

Just checked and it appears this has been implemented with SQL Server 2012. A quick check of the other database with Django backends seems to show that all of them have a beta or stable release with support for a simple limit/offset syntax, with many supporting the standard defined syntax. This might make the generic approach as simple as adding an DatabaseOperatoins.limit_offset_sql() abstraction.

Regards,
Michael Manfre

Stefano Crosta

unread,
Mar 27, 2013, 4:10:32 AM3/27/13
to django-d...@googlegroups.com
This is already getting slightly off-topic wrt the original discussion on EXISTS/IN, so I allowed myself to change the subject.

But this is another important matter I think, and I've already seen mention of server-side cursors in these threads. 

I have a question on SO http://stackoverflow.com/questions/14144408/memory-efficient-constant-and-speed-optimized-iteration-over-a-large-table-in about exactly that problem, it would be great to see server-side cursors and/or a smarter way to iterate make it into core. I hacked my procedure by I'm sure there are better ways.
I'm decently versed in python and have come to learn django internals, but I am zero on database side so I'm finding it difficult to propose a real contribution (the best effort I could put together is in the SO question). If nudged, I can try and help.

thanks!

Stefano

 

Aymeric Augustin

unread,
Mar 27, 2013, 4:21:51 AM3/27/13
to django-d...@googlegroups.com
On 27 mars 2013, at 09:10, Stefano Crosta <stefano...@gmail.com> wrote:

> it would be great to see server-side cursors and/or a smarter way to iterate make it into core.

FYI this topic is tracked here: https://code.djangoproject.com/ticket/16614

--
Aymeric.



Anssi Kääriäinen

unread,
Mar 27, 2013, 4:49:23 AM3/27/13
to Django developers
On 27 maalis, 10:10, Stefano Crosta <stefano.cro...@gmail.com> wrote:
> On Tuesday, March 26, 2013 10:47:34 PM UTC+1, Petite Abeille wrote:
>
> > On Mar 26, 2013, at 10:03 PM, Alex Gaynor <alex....@gmail.com<javascript:>>
> > wrote:
>
> > > For what it's worth, SQL2011 does define OFFSET, finally.
>
> > Perhaps worthwhile mentioning as well :
>
> > "Do not try to implement a scrolling window using LIMIT and OFFSET. Doing
> > so will become sluggish as the user scrolls down toward the bottom of the
> > list."
> > -- Scrolling Cursor, What Not To Do
> >http://www.sqlite.org/cvstrac/wiki?p=ScrollingCursor
>
> > Just saying...
>
> This is already getting slightly off-topic wrt the original discussion on
> EXISTS/IN, so I allowed myself to change the subject.
>
> But this is another important matter I think, and I've already seen mention
> of server-side cursors in these threads.

Server side cursor would be useful in many situations. However, it is
hard to implement server side cursors in a way the ORM could use them
automatically. See https://code.djangoproject.com/ticket/16614 for
details.

The usual problem with LIMIT/OFFSET is pagination and going deep into
pages list. Server side cursors would only help if you could somehow
keep the same cursor open between requests for different pages.

The solution to large resultset pagination is to sort by some index
you have, and then when you go to next page, you do this:

SomeModel.objects.filter(indexedcol__gt=prev_pages_last_obj.indexedcol).order_by('indexedcol')
[0:PAGE_SIZE]
This operation can use the index and is a very efficient way to
retrieve pages. The downside is that this only works for unique
indexed columns (or "unique enough" that duplicates do not matter in
practice). Also, you can't easily give page numbers for pagination,
you can only have links "first, previous, next, last". Still, if you
have a lot of objects then normal pagination of "count all, give links
by page numbers" simply does not scale. The count(*) itself can be too
expensive for large resultsets.

Having in-built support for this type of pagination is something I
would like to see in Django.

If you want to implement an automatically refreshed scrolling window
using AJAX and want the ability to sort by any column and do so
efficiently, then you will need to use server side cursors and have
some way to get the same cursor back for different AJAX requests. In
practice you would need some sort of connection pool where you could
store a connection for reuse, and ask the same connection back when
next request arrives. This seems complex to implement correctly and
doesn't seem like something that belongs into Django...

- Anssi

Stefano Crosta

unread,
Mar 27, 2013, 4:53:45 AM3/27/13
to django-d...@googlegroups.com
Thanks Aymeric, 
that's true, and I should have put that link too.
sadly "akaariai" who's the only one who seemed to understand a bit about it (and is a core dev) did not seem interested in bringing this any forward, so I thought I'd try to raise some interest again!

Anssi Kääriäinen

unread,
Mar 27, 2013, 5:36:02 AM3/27/13
to django-d...@googlegroups.com, Stefano Crosta
On 03/27/2013 10:53 AM, Stefano Crosta wrote:
> Thanks Aymeric,
> that's true, and I should have put that link too.
> sadly "akaariai" who's the only one who seemed to understand a bit
> about it (and is a core dev) did not seem interested in bringing this
> any forward, so I thought I'd try to raise some interest again!
The problem is that server side cursors will need dedicated API, and
there will be database specific problems when using server side cursors.
On SQLite changes done to rows are visible in the results, on other
databases not. On PostgreSQL you will need to use WITH HOLD cursors if
you want to use server side cursors outside transactions (that is, in
autocommit mode normal server side cursors do not work). When using WITH
HOLD cursors you must close the cursor explicitly or you will have
cursor leak... And on MySQL WITH HOLD cursors aren't available at all,
so you must be in transaction to use server side cursors.

Oracle seems to be the only core DB that will work without problems. In
fact, using .iterator() on Oracle already works without memory problems.

Maybe the dedicated API could be adding two new keywords to .iterator():
server_side_cursor, and with_hold. with_hold=True implies
server_side_cursor=True. If you use with_hold=True you are responsible
for closing the iterator, too. The behaviour of server_side_cursor and
with_hold is database specific - it will be impossible to abstract the
differences away.

- Anssi

Stefano Crosta

unread,
Mar 27, 2013, 5:56:06 AM3/27/13
to django-d...@googlegroups.com, Stefano Crosta
Thanks Anssi for both your replies. These db interactions definitely are really beyond the scope of my knowledge, unluckily.

I did try and implement an iterator such as the one that you describe in your other message, based on index column filtering - what I dislike is that it's an additional function getting a qs parameter rather then naturally sitting in the queryset and (thus?) I did not manage to make it very generic...  Eg. some querysets don't even contain the key column anymore.. (the outlines of these are in my SO question but ). I'm sure it can be done better, especially if it's part of the queryset rather then an external function that gets an evaluated qs. I still manage to use it both for pagination and for very complex functions that I can't run in the DB.
As you say, it would still be nice to see this kind of generic iterator in core, even known the limitations (heck, the standard iterator just exploses the process by memory usage, so it's not that much better :) ).

WITH HOLD does not look like a safe solution if you have to explicitly close it (would need a separate thread with timeout maybe?), but I also read some discussions about creating a connection pool and persistent connections, so maybe at some point all these issues will find a common base for a solution..

As usual, I'll eagerly keep reading these posts and try to play with code on my own hoping to be able to contribute at some point..

Stefano

Stephan Seyboth

unread,
Sep 6, 2017, 9:29:21 AM9/6/17
to Django developers (Contributions to Django itself), django...@tim.thechases.com, anssi.ka...@thl.fi
Sorry for resurrecting this ancient thread, but it was the last direct mention I could find about this topic.

Curious to hear what happened to Anssi's proposal to change the ORM to use EXISTS subqueries instead of IN subqueries. Looks like the ORM still uses IN as of Django 1.11.4.

One of our queries just started running over 1000x slower from one day to the next b/c of the issues Postgres has with IN.

Are there any suggested workarounds that don't require resorting to raw SQL?

Todor Velichkov

unread,
Sep 6, 2017, 1:46:27 PM9/6/17
to Django developers (Contributions to Django itself), django...@tim.thechases.com, anssi.ka...@thl.fi
How does this query look like? Not sure if this is gonna help, but you can take a look at Exists() subqueries
Reply all
Reply to author
Forward
0 new messages