Configurable safety options for high performance Django systems


Rick van Hattem

Nov 18, 2014, 7:58:00 AM
to django-d...@googlegroups.com
Hi guys,

As it is right now Django has the tendency to kill either your browser (if you're lucky) or the entire application server when confronted with a large database. For example, the admin always does counts for pagination and a count over a table with many rows (say, in the order of 100M) can really stall your app and database servers.

In the past I've simply patched Django to include a few "security" features for cases like these so someone can't accidentally kill the site by doing something silly, but I was wondering if it would be a good idea to include some of these configurable options in the core.

Examples:
 - If COUNT_LIMIT is set, then wrap counts in a subquery and limit the results: SELECT COUNT(*) FROM (SELECT pk FROM table LIMIT <COUNT_LIMIT>) AS limited (a rough sketch follows below)
 - When someone does "if queryset", automatically limit it to a default slice size so someone doesn't accidentally fetch all rows in the queryset.
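For illustration only, the count guard could look roughly like this as a QuerySet override; COUNT_LIMIT is a made-up setting name, not an existing Django feature:

from django.conf import settings
from django.db import models

class LimitedCountQuerySet(models.QuerySet):
    def count(self):
        limit = getattr(settings, 'COUNT_LIMIT', None)  # assumed setting
        if limit is None:
            return super(LimitedCountQuerySet, self).count()
        # Slicing first makes Django count over a LIMITed subquery, roughly
        # SELECT COUNT(*) FROM (SELECT pk FROM table LIMIT <COUNT_LIMIT>) subquery
        return self.order_by()[:limit].count()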

~wolph

Michael Manfre

Nov 18, 2014, 9:09:27 AM
to django-d...@googlegroups.com
Django is a generic framework. Those who use it to implement functionality are the ones who kill either the browser or their entire application. Applications must be written to meet functional and performance requirements of the project. When dealing with large tables, more effort is certainly needed when using the Admin. E.g. write custom ModelAdmins or admin filters that restrict the resultset size.

I don't think Django should automatically limit these queries.

Regards,
Michael Manfre


Rick van Hattem

Nov 18, 2014, 9:17:14 AM
to django-d...@googlegroups.com
The framework is indeed generic, which is why it should also take scenarios into account where there is a lot more data. There is obviously no one-size-fits-all solution, which is why I propose configurable options. The patches needed within the Django core are small and have very limited impact, but the debugging time they can save when working with larger querysets is enormous.

Believe me when I tell you that debugging why your entire Django instance is hanging at random times due to some erroneous query with an unfortunate filter is painful at best. Especially since Django won't ever generate any useful debugging output in those cases, it will simply kill your machine and die.

Aymeric Augustin

Nov 18, 2014, 9:32:14 AM
to django-d...@googlegroups.com
I've had to implement workarounds for such problems in many Django projects I created. For example, I wrote code to determine which models have too many instances for dumping them all into an HTML dropdown and to automatically add such models to raw_id_fields in related models.

The main difficulty I'm foreseeing is to provide a generic and useful set of restrictions. Even within specific projects, it was hard to come up with one-size-fits-all-models solutions. If we want to implement something in Django itself, the problem is much worse.

Having workarounds for 70% of the places where Django handles large tables ungracefully isn't much better than 0%. It creates a false feeling of security.

--
Aymeric.

Claude Paroz

Nov 18, 2014, 1:18:19 PM
to django-d...@googlegroups.com
On Tuesday, November 18, 2014 1:58:00 PM UTC+1, Rick van Hattem wrote:
Hi guys,

As it is right now Django has the tendency to kill either your browser (if you're lucky) or the entire application server when confronted with a large database. For example, the admin always does counts for pagination and a count over a table with many rows (say, in the order of 100M) can really stall your app and database servers.


https://docs.djangoproject.com/en/dev/ref/contrib/admin/#django.contrib.admin.ModelAdmin.show_full_result_count seems to go in your direction, at least. count(*) performance is implementation-specific, so once again, no one-size-fits-all.
Maybe the documentation could be expanded here and there.
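For reference, that option is set per ModelAdmin (at the time of writing it is only in the development version):

from django.contrib import admin

class EntityAdmin(admin.ModelAdmin):
    # Skip the extra unfiltered COUNT(*) used for the "(x total)" display
    # in the changelist.
    show_full_result_count = False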

Claude

Rick van Hattem

Nov 18, 2014, 1:28:13 PM
to django-d...@googlegroups.com
That certainly solves one part of the problem. After that I would still opt for an optional configurable default for slicing. Personally I prefer to raise an error when unsliced querysets are used since it's almost always harmful or at least dangerous behaviour.


Schmitt, Christian

Nov 18, 2014, 1:37:39 PM
to django-d...@googlegroups.com
Sorry, but I don't think this is suitable.
If somebody has 100M rows per table then they should probably think about sharding/replication anyway, so the ORM would still suffer regardless.
Currently my company has a few tables with a high row count as well, but since we never used the django-admin and managed to shard some of the data, we never ran into such problems.

Django is a head start for many projects / products, but once you have a certain amount of data you still need to program a few things yourself.

If your patches are enough, then live with them, but I don't see a reason for optimizing Django against big tables.




Rick van Hattem

Nov 18, 2014, 1:59:53 PM
to django-d...@googlegroups.com
The 100M is an example of a really big problem, but the problem also exists on a much smaller scale.

Even doing "if some_queryset" with 10,000 items can get pretty slow, just because it doesn't kill the server in all cases doesn't make it a good thing to ignore.

In all but the most basic Django projects I've seen problems like these. Sane defaults won't hurt anyone and solves issues for people with larger systems. And running forks of Django seems counter productive as well.

The "raw_id_field" is a nice solution for making the admin workable again but I've seen many real-life projects have huge performance problems because at some point the amount of linked items grew to the point of being problematic. Having a dropdown with 100 items is a lot but workable, with 1000 items it's a problem. Which is not a large amount of items by any measure.

Josh Smeaton

Nov 18, 2014, 7:42:59 PM
to django-d...@googlegroups.com, wo...@wol.ph
To me, "sane default" means django should not silently alter the query to provide a LIMIT when it is not asked for.

I have also run into situations where doing a .count() or iterating a full table has broken the application, or put too much pressure on the database. Specifically with django bindings to javascript datatables. But I still wouldn't want artificial limiting on such queries.

What *may* be useful, is to be able to apply queryset methods onto an already sliced queryset. That would allow users to implement queryset/manager methods that provide pre-sliced querysets to the rest of the code. The problem would be, what should happen in this case?

Model.objects.all()[0:10].filter(field=True)

Should the filter be logically/internally moved to before the limit? Or should the filter be applied to the result of the limit in an outer query? Traditionally, django applies mutations in succession, but this wouldn't be very useful for the majority of operations that would occur "after" the slicing. We *could* say all slicing is "saved" and applied at the end, but we'd definitely run into issues with users reporting that filtering isn't working as they expect - after the slice.

Michael Manfre

Nov 18, 2014, 9:12:38 PM
to django-d...@googlegroups.com
On Tue, Nov 18, 2014 at 7:42 PM, Josh Smeaton <josh.s...@gmail.com> wrote:
To me, "sane default" means django should not silently alter the query to provide a LIMIT when it is not asked for.

I have also run into situations where doing a .count() or iterating a full table has broken the application, or put too much pressure on the database. Specifically with django bindings to javascript datatables. But I still wouldn't want artificial limiting on such queries.

What *may* be useful, is to be able to apply queryset methods onto an already sliced queryset. That would allow users to implement queryset/manager methods that provide pre-sliced querysets to the rest of the code. The problem would be, what should happen in this case?

Model.objects.all()[0:10].filter(field=True)


I think we'd need a non-slicing syntax to allow chaining of limit/offset. E.g. Model.objects.all().set_slice(limit=10, offset=0).filter(field=True). Changing the existing slicing output would be a major backwards incompatible headache.
 
Should the filter be logically/internally moved to before the limit? Or should the filter be applied to the result of the limit in an outer query? Traditionally, django applies mutations in succession, but this wouldn't be very useful for the majority of operations that would occur "after" the slicing. We *could* say all slicing is "saved" and applied at the end, but we'd definitely run into issues with users reporting that filtering isn't working as they expect - after the slice.


Carl Meyer

Nov 18, 2014, 9:52:52 PM
to django-d...@googlegroups.com
Hi Rick,

On 11/18/2014 11:59 AM, Rick van Hattem wrote:
[snip]
> In all but the most basic Django projects I've seen problems like these.
> Sane defaults won't hurt anyone and solve issues for people with larger
> systems. And running forks of Django seems counterproductive as well.

As a side note -- I'm not sure why running a fork of Django should be
necessary. All the suggestions you've made so far in this thread could
be implemented in a custom QuerySet subclass, and custom QuerySet
classes can be hooked into your models via a custom Manager (which can
even be created for you by Manager.from_queryset). Create an abstract
base Model using this Manager, inherit all your models from it, and you
can get the behaviors you want throughout your system without forking
Django.
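
A minimal sketch of that approach (the class names and the 10,000 threshold are illustrative; nothing here is built into Django):

from django.db import models

class GuardedQuerySet(models.QuerySet):
    MAX_ROWS = 10000  # example threshold

    def __iter__(self):
        # Guard unsliced iteration by fetching at most MAX_ROWS + 1 rows.
        if self.query.high_mark is None:
            rows = list(self[:self.MAX_ROWS + 1])
            if len(rows) > self.MAX_ROWS:
                raise RuntimeError(
                    'Unsliced queryset returned more than %d rows' % self.MAX_ROWS)
            return iter(rows)
        return super(GuardedQuerySet, self).__iter__()

class GuardedModel(models.Model):
    objects = models.Manager.from_queryset(GuardedQuerySet)()

    class Meta:
        abstract = True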

Carl


Rick van Hattem

Nov 19, 2014, 7:40:58 AM
to django-d...@googlegroups.com
Hi Carl,

Truthfully, part of my reason for forking was that I was running an older version of Django which didn't have custom user models. In that case it's a bit more difficult to override the manager, and I've seen quite a few external projects (accidentally) do something like "User.objects.all()" and put an if or loop around it. Quite problematic with a few million users.

Regardless, it still requires forking either Django or all external projects (in the case of different models), so the problem is not solved by custom managers.

Rick van Hattem

Nov 19, 2014, 7:50:20 AM
to django-d...@googlegroups.com, wo...@wol.ph
Definitely agree on this, silently altering a query's limit is probably not the way to go. Raising an exception in case of no limit and lots of results could be useful.

For the sake of keeping the discussion useful:
- Let's say you have a table with 50,000 items, not an insanely large amount imho.
- Now someone does a non-restricted loop through "Model.objects.all()" which would return 50,000 items.
- In what case would this be desirable as opposed to limiting to 10,000 items and raising an error when the database actually returns 10,000 items.

Naturally this case would only apply if no slice is given, but if you're really processing over 10,000 items in a single loop you probably know how to slice the queryset when needed.
Perhaps something like this: raise QuerysetTooLarge('Your query returned over {LIMIT} results, if this is intentional please slice the queryset')

Not saying this should be a default straight away but having a default of 10,000 or even 100,000 should not hurt anyone and protects against killing a server which is always a positive result in my book.

Marco Paolini

Nov 19, 2014, 7:56:57 AM
to django-d...@googlegroups.com, wo...@wol.ph
2014-11-19 13:50 GMT+01:00 Rick van Hattem <wol...@gmail.com>:
Definitely agree on this, silently altering a query's limit is probably not the way to go. Raising an exception in case of no limit and lots of results could be useful.

For the sake of keeping the discussion useful:
- Let's say you have a table with 50,000 items, not an insanely large amount imho.
- Now someone does a non-restricted loop through "Model.objects.all()" which would return 50,000 items.
Even a restricted loop causes trouble (at least in Postgres).

if you do:

for i, m in enumerate(Model.objects.all()):
    if i > 4:
        break

then psycopg will fetch (and cache locally) the entire cursor result set. Of course the Django ORM will only fetch the first batch (100 items, I think) from the psycopg cursor, but that does not solve the underlying issue.

There are options to have server-side cursors in psycopg but django does not use them AFAIK





- In what case would this be desirable as opposed to limiting to 10,000 items and raising an error when the database actually returns 10,000 items.

Naturally this case would only apply if no slice is given, but if you're really processing over 10,000 items in a single loop you probably know how to slice the queryset when needed.
Perhaps something like this: raise QuerysetTooLarge('Your query returned over {LIMIT} results, if this is intentional please slice the queryset')

Not saying this should be a default straight away but having a default of 10,000 or even 100,000 should not hurt anyone and protects against killing a server which is always a positive result in my book.

On Wednesday, November 19, 2014 1:42:59 AM UTC+1, Josh Smeaton wrote:
To me, "sane default" means django should not silently alter the query to provide a LIMIT when it is not asked for.

I have also run into situations where doing a .count() or iterating a full table has broken the application, or put too much pressure on the database. Specifically with django bindings to javascript datatables. But I still wouldn't want artificial limiting on such queries.

What *may* be useful, is to be able to apply queryset methods onto an already sliced queryset. That would allow users to implement queryset/manager methods that provide pre-sliced querysets to the rest of the code. The problem would be, what should happen in this case?

Model.objects.all()[0:10].filter(field=True)

Should the filter be logically/internally moved to before the limit? Or should the filter be applied to the result of the limit in an outer query? Traditionally, django applies mutations in succession, but this wouldn't be very useful the for the majority of operations that would occur "after" the slicing. We *could* say all slicing is "saved" and applied at the end, but we'd definitely run into issues with users reporting that filtering isn't working as they expect - after the slice.


Marco Paolini

Nov 19, 2014, 7:59:35 AM
to django-d...@googlegroups.com, wo...@wol.ph
Also, the OFFSET + LIMIT pagination strategy of the Django paginator is sub-optimal as it has O(N) complexity: doing SELECT * FROM auth_user LIMIT 100 OFFSET 1000000 causes a 1,000,000-row table scan.

Schmitt, Christian

Nov 19, 2014, 8:49:54 AM
to django-d...@googlegroups.com, wo...@wol.ph
A sequence scan will only be made if you query non-indexed values.
So if you add a simple ORDER BY you will get an index scan, which is very fast.
The problem lies more with the database than with the ORM.
As already said: if you need to deal with that many queries you need to log your SQL statements and optimize them. Nothing to do with Django at all.
We ran into several performance issues in our applications as well, even at something like 10,000 or 100,000 entries per table, and it was quite obvious that the problem lay with the database/queries.

Also we learned that using the cache gives you a big performance win.
"The best query you could have, is the query you never need to make"
I don't think it's good to make Django "safer" when dealing with high-performance systems, since it will make the system more complex and Django is already really complex inside the core ORM.



Marco Paolini

Nov 19, 2014, 9:11:37 AM
to django-d...@googlegroups.com, wo...@wol.ph
2014-11-19 14:49 GMT+01:00 Schmitt, Christian <c.sc...@briefdomain.de>:
A sequence scan will only be made if you query non-indexed values.
So if you add a simple ORDER BY you will get an index scan, which is very fast.

Nope. a large OFFSET of N will read through N rows, regardless index coverage. see http://www.postgresql.org/docs/9.1/static/queries-limit.html
 
The problem lies more with the database than with the ORM.

The problem lies with the Django paginator, which builds OFFSET queries that are inefficient by design. There are alternatives such as ordering by primary key and doing SELECT * FROM auth_user WHERE id > latest_id_returned. That "latest_id_returned" has to be passed back and forth to the client to keep the pagination happening. Too bad you can't do "jump to page 15" like the Django admin does.
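
A rough sketch of that keyset ("seek") approach in ORM terms (User and the page size are just examples):

from django.contrib.auth.models import User

def user_page(last_seen_id=0, page_size=25):
    # WHERE id > last_seen_id ORDER BY id LIMIT page_size: constant cost per
    # page, but no "jump to page 15".
    page = list(User.objects.filter(id__gt=last_seen_id).order_by('id')[:page_size])
    next_cursor = page[-1].id if page else None
    return page, next_cursor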

 
As already said: if you need to deal with that many queries you need to log your SQL statements and optimize them. Nothing to do with Django at all.
We ran into several performance issues in our applications as well, even at something like 10,000 or 100,000 entries per table, and it was quite obvious that the problem lay with the database/queries.

Read the relevant thread I posted before. There are huge memory allocation issues even if you just iterate the first two items of a query that potentially returns a million rows: all the rows will be fetched and stored in memory by psycopg and the Django ORM.


Also we learned that using the cache gives you a big performance win.
"The best query you could have, is the query you never need to make"

...except when the cache is bigger than your entire VM. 

In this case if your dataset is, say, 10 TB, you will need roughly 10 TB of memory and 10 TB of data transferred between your Django app and your Postgres server. It is not a matter of optimization: it is mere survival.

 
I don't think it's good to make Django "safer" when dealing with high-performance systems, since it will make the system more complex and Django is already really complex inside the core ORM.
 
Let's first understand what's needed, then we can decide if it has a place inside the Django core.


Schmitt, Christian

Nov 20, 2014, 2:31:06 AM
to django-d...@googlegroups.com, wo...@wol.ph
Nope. a large OFFSET of N will read through N rows, regardless index coverage. see http://www.postgresql.org/docs/9.1/static/queries-limit.html

That's simply not true.
If you define an ORDER BY with a well-indexed query, the database will only do a bitmap scan.
This wiki isn't well explained. Take this:


The problem lies with the Django paginator, which builds OFFSET queries that are inefficient by design. There are alternatives such as ordering by primary key and doing SELECT * FROM auth_user WHERE id > latest_id_returned. That "latest_id_returned" has to be passed back and forth to the client to keep the pagination happening. Too bad you can't do "jump to page 15" like the Django admin does.

Paginator needs to use OFFSET, which is slower than your seek method, but SEEK has certain limitations. As you already said. Index and Order By to the Rescue, which makes OFFSET as fast as possible. WITH INDEX, ORDER BY and maybe some READ Slaves or even if you Shard your data the result is as good as the SEEK method.
 

Read the relevant thread I posted before. There are huge memory allocation issues even if you just iterate the first two items of a query that potentially returns a million rows: all the rows will be fetched and stored in memory by psycopg and the Django ORM.

That's only happening because Django will build something like "SELECT "bla", "bla", "bla" FROM table LIMIT 10 OFFSET 100;
There is no ORDER BY clause, which is of course really slow, even with only 1000 entries inside a table.
 

...except when the cache is bigger than your entire VM. 

In this case if your dataset is, say, 10 TB, you will need roughly 10 TB of memory and 10 TB of data transferred between your Django app and your Postgres server. It is not a matter of optimization: it is mere survival.

Okay, just as I said: if you deal with that amount of data you will certainly use read slaves and/or partition your data.

 
Let's first understand what's needed, then we can decide if it has a place inside the Django core.
 

The only thing that could be improved is that we should force an ORDER BY id clause here and there. The index on the automatic id is already there.

I don't know why we should enforce something just because somebody has a huge amount of data and can't deal with it. That wouldn't be a clean design. Most people rely on the "Page to" part of the Paginator; if we patch Django to use a seek method as a default it wouldn't be that good for a lot of developers.
Most people that have a performance problem could easily "fix" it themselves or throw more hardware at it.

Marco Paolini

Nov 20, 2014, 6:05:29 AM
to django-d...@googlegroups.com, wo...@wol.ph
2014-11-20 8:30 GMT+01:00 Schmitt, Christian <c.sc...@briefdomain.de>:
Nope. a large OFFSET of N will read through N rows, regardless index coverage. see http://www.postgresql.org/docs/9.1/static/queries-limit.html

That's simply not true.
If you define an ORDER BY with a well-indexed query, the database will only do a bitmap scan.
This wiki isn't well explained. Take this:


You are right: it is not a table scan but the point is OFFSET has *linear complexity* even with the ORDER BY and the proper index.

The "seek" (with a PK > seek_point, as I was suggesting)  is much better.


The problem lies with the Django paginator, which builds OFFSET queries that are inefficient by design. There are alternatives such as ordering by primary key and doing SELECT * FROM auth_user WHERE id > latest_id_returned. That "latest_id_returned" has to be passed back and forth to the client to keep the pagination happening. Too bad you can't do "jump to page 15" like the Django admin does.

Paginator needs to use OFFSET, which is slower than your seek method, but SEEK has certain limitations. As you already said. Index and Order By to the Rescue, which makes OFFSET as fast as possible. WITH INDEX, ORDER BY and maybe some READ Slaves or even if you Shard your data the result is as good as the SEEK method.

Putting sharding into the mix further complicates this discussion, but even in a sharded environment, having linear complexity in a single shard is to be avoided.
 
 

Read the relevant thread I posted before. There are huge memory allocation issues even if you just iterate the first two items of a query that potentially returns a million rows: all the rows will be fetched and stored in memory by psycopg and the Django ORM.

That's only happening because Django will build something like "SELECT "bla", "bla", "bla" FROM table LIMIT 10 OFFSET 100;
There is no ORDER BY clause, which is of course really slow, even with only 1000 entries inside a table.

No, it has nothing to do with the ORDER BY. It's just how psycopg works: the first time you fetch from a cursor, say you fetch the first row, it automatically fetches all the results from the server. The only solution is using a server-side cursor, which apparently Oracle uses by default.

 
 

...except when the cache is bigger than your entire VM. 

In this case if your dataset is, say, 10 TB, you will need roughly 10 TB of memory and 10 TB of data transferred between your Django app and your Postgres server. It is not a matter of optimization: it is mere survival.

Okay, just as I said: if you deal with that amount of data you will certainly use read slaves and/or partition your data.

Not always true: a single machine can happily handle a few terabytes.

The point here is that client-side cursors will eat up a lot of your memory in the *application tier* if you happen to do a simple:

for user in User.objects.all().iterator():
  break

even if you only have a few thousand users, you will see a big memory and network spike


 
Let's first understand what's needed, then we can decide if it has a place inside the Django core.
 

The only thing that could be improved is that we should force an ORDER BY id clause here and there. The index on the automatic id is already there.

That only solves the issue in the best-case scenario: when you fetch the first pages. The last-page scenario, even with ORDER BY and an index, still has O(N) complexity. That's written clearly in the article you posted.

 

I don't know why we should enforce something just because somebody has a huge amount of data and can't deal with it. That wouldn't be a clean design. Most people rely on the "Page to" part of the Paginator; if we patch Django to use a seek method as a default it wouldn't be that good for a lot of developers.
Most people that have a performance problem could easily "fix" it themselves or throw more hardware at it.

I am all against enforcing something: I only think it is nice to provide the options for not killing your application layer. These things are normally used by all Django developers that deal with medium-size applications.

Throwing iron at it is sub-optimal when you can just use 10 lines of code, as I normally do.




Rick van Hattem

Nov 20, 2014, 6:46:01 AM
to django-d...@googlegroups.com, wo...@wol.ph
On Thursday, November 20, 2014 8:31:06 AM UTC+1, Christian Schmitt wrote:
Nope. a large OFFSET of N will read through N rows, regardless index coverage. see http://www.postgresql.org/docs/9.1/static/queries-limit.html

That's simply not true.
If you define an ORDER BY with a well-indexed query, the database will only do a bitmap scan.
This wiki isn't well explained. Take this:


Sounds awesome in theory, here's a real world example:

# explain analyze select * from entity_entity order by id limit 10 offset 1000000;
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=44832.84..44833.28 rows=10 width=423) (actual time=8280.248..8280.382 rows=10 loops=1)
   ->  Index Scan using entity_entity_pkey_two on entity_entity  (cost=0.44..823410.92 rows=18366416 width=423) (actual time=0.232..6861.507 rows=1000010 loops=1)
 Total runtime: 8280.442 ms
(3 rows)

# explain analyze select * from entity_entity where id > 1000000 order by id limit 10;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..0.91 rows=10 width=423) (actual time=0.044..0.102 rows=10 loops=1)
   ->  Index Scan using entity_entity_pkey_two on entity_entity  (cost=0.44..823830.94 rows=17405140 width=423) (actual time=0.040..0.071 rows=10 loops=1)
         Index Cond: (id > 1000000)
 Total runtime: 0.152 ms
(4 rows)


And that's not even all the way in the table. That's just the beginning.


Read the relevant thread I posted before. There are huge memory allocation issues even if you just iterate the first two items of a query that potentially returns a million rows: all the rows will be fetched and stored in memory by psycopg and the Django ORM.

That's only happening because Django will build something like "SELECT "bla", "bla", "bla" FROM table LIMIT 10 OFFSET 100;
There is no ORDER BY clause, which is of course really slow, even with only 1000 entries inside a table.

Once again, great in theory. Doesn't always mean much in real world scenarios. Compare the following numbers to the ones above:


# explain analyze select * from entity_entity where id > 1000000 limit 10;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.67 rows=10 width=423) (actual time=0.012..0.070 rows=10 loops=1)
   ->  Seq Scan on entity_entity  (cost=0.00..1168998.20 rows=17405140 width=423) (actual time=0.008..0.038 rows=10 loops=1)
         Filter: (id > 1000000)
 Total runtime: 0.122 ms
(4 rows)

# explain analyze select * from entity_entity limit 10 offset 1000000;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=61148.68..61149.30 rows=10 width=423) (actual time=3820.305..3820.347 rows=10 loops=1)
   ->  Seq Scan on entity_entity  (cost=0.00..1123082.16 rows=18366416 width=423) (actual time=0.007..2410.809 rows=1000010 loops=1)
 Total runtime: 3820.403 ms
(3 rows)


I don't know why we should enforce something just because somebody has a huge amount of data and can't deal with it. That wouldn't be a clean design. Most people rely on the "Page to" part of the Paginator; if we patch Django to use a seek method as a default it wouldn't be that good for a lot of developers.
Most people that have a performance problem could easily "fix" it themselves or throw more hardware at it.

The point of this discussion is to make Django easier and safer to use in all scenarios. Even in scenarios with relatively small amounts of data (sub 100k rows) Django can already kill a small VPS easily. Having some (optional) safeties in place doesn't hurt and can help a lot of people.

It's not that large amounts of data are not something you can deal with, but I've seen many real-life projects and Django apps where people didn't take into consideration that some people have more than 100 rows. With 1M+ rows you can kill a lot, but many things slow down or die with even 10k rows.

Mattias Linnap

Nov 23, 2014, 5:49:14 AM
to django-d...@googlegroups.com, wo...@wol.ph
I thought I'd chime in since I run a Django app with fairly large tables (1M, 30M, 120M rows) on a fairly poor VPS (4GB ram, 2 cores, database and all services on one machine).

I just don't see a situation where "safety limits on ORM queries" would improve things. Changing behaviour based on the size of the queryset seems like a really bad idea. Now instead of gradual slowdown as the dataset grows, you start getting surprise exceptions, invalid data in results, or missing functionality once you hit a magic threshold.

There is no good way to set that threshold automatically, since people have very different expectations of what's "slow". For example, loading the second-to-last list page in the Django admin for my 1M row table takes a perfectly reasonable 2-3 seconds. That's too slow for the front page of a popular site, but even 30 seconds would be fine for an admin page that's visited maybe once a month.

I'm not saying that performance improvements or optimisation suggestions are a bad idea, but if they are possible, they should stay consistent. If using "if queryset:" is a bug, then it should show a warning and suggestion to use "if queryset.exists():" for both 100 and 100M rows. If there's an option to turn off full pagination in admin, then it should be a ModelAdmin option, not dependent on data size - the developer can enable it for big tables manually.
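
For completeness, the difference between the two in ORM terms (User is just an example model):

from django.contrib.auth.models import User

def has_users_cheaply():
    # exists() issues roughly SELECT (1) ... LIMIT 1 and caches nothing.
    return User.objects.exists()

def has_users_expensively():
    # bool() evaluates the queryset, fetching and caching every row first.
    return bool(User.objects.all())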

Behaviour changes based on the queryset size make it more difficult, not easier for developers - right now you can delay optimisation work until it really becomes too painful for users, whereas having strict limits would require doing that work as soon as a limit is reached.

Finally, I agree that it would be nice if Postgresql server-side cursors were better integrated with the ORM. It's possible to use them for raw queries already:

from django.db import connection, transaction
connection.ensure_connection()
with transaction.atomic():
    cursor = connection.connection.cursor(name='hello')
    cursor.execute('BIG QUERYSET SQL')
    for row in cursor: # fetches rows in cursor.itersize chunks
        pass

I use this in a few scripts that iterate over 10GB of table data. But a way to map the rows to Django objects would be nice.
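
One rough way to do that mapping, assuming the SQL selects exactly the model's concrete columns in field order (iter_instances is a made-up helper, not a Django API):

from django.db import connection, transaction

def iter_instances(model, sql, params=None):
    fields = [f.attname for f in model._meta.concrete_fields]
    connection.ensure_connection()
    with transaction.atomic():
        cursor = connection.connection.cursor(name='big_query')  # server-side cursor
        cursor.execute(sql, params or [])
        for row in cursor:  # fetched in cursor.itersize chunks
            yield model(**dict(zip(fields, row)))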

Rick van Hattem

Nov 23, 2014, 7:11:13 AM
to django-d...@googlegroups.com, wo...@wol.ph
Hi Mattias,

Can you comment on the example I've given where this should not cause any problems and should help pretty much everyone (but just in case, the setting could be optional)?

If/when an unsliced queryset were to reach a certain limit (say, 10,000, but configurable) the system would raise an error.

My arguments for why this would be a good idea and not a problem for anyone:

- Protects servers from going down due to memory constraints.
- Why would anyone need to run through a list of 10,000 items without any slicing/pagination/etc?
- If someone _does_ need to walk through more than 10,000 items, they'll know how to slice a queryset.
- You can write your own apps perfectly, but installing 3rd party apps can still kill the system without fixes like these.


So please, can anyone give a good argument as to why any sane person would have a problem with a huge default limit which will kill the performance of your site anyhow but isn't enough to kill the entire system?

Shai Berger

Nov 23, 2014, 7:28:04 AM
to django-d...@googlegroups.com
Hi Rick,

On Sunday 23 November 2014 14:11:13 Rick van Hattem wrote:
>
> So please, can anyone give a good argument as to why any sane person would
> have a problem with a huge default limit which will kill the performance of
> your site anyhow but isn't enough to kill the entire system?
>
...because Django is also used to implement batch processes, and you'd be very
surprised if your system, which works well enough for a year or so, suddenly
has its night-jobs starting to fail one day because some size-limit has been
reached.

The Django project has traditionally held the position, with regard to such
proposals, that we should avoid making such decisions for our users, but we
should allow them to make them by themselves; this can also lead to more
general solutions.

In this case, I think you could get support for allowing yourself query-
validation hooks -- say, allow installing a function that will be called by
the ORM when query results are received, before turning them into objects. I
think this lets you solve your problem outside of core (and then, if your
solution proves useful and popular, it may go into core as well). This could
also allow other interesting functionalities, I suspect, if it is allowed to
modify (and not just validate) the results.

I think (speaking only for myself, of course) that a proposal along these
lines has a better chance of gaining traction.

HTH,
Shai.

Florian Apolloner

Nov 23, 2014, 10:22:55 AM
to django-d...@googlegroups.com, wo...@wol.ph
Hi Rick,


On Sunday, November 23, 2014 1:11:13 PM UTC+1, Rick van Hattem wrote:
If/when an unsliced queryset were to reach a certain limit (say, 10,000, but configurable) the system would raise an error.

Django can't know if that would be the case without issuing an extra query -- and even then another transaction might commit a batch operation adding 10k items…
 
- Protects servers from going down due to memory constraints.

Not really, cause psycopg already fetched everything.

So please, can anyone give a good argument as to why any sane person would have a problem with a huge default limit which will kill the performance of your site anyhow but isn't enough to kill the entire system?

There are way easier ways to kill the performance of your site than fetching a few rows, an arbitrary limit on queries is not going to fix anything. Code applications defensively and don't use 3rd party stuff without introspecting it for problems first. While you are right that there are issues in theory, I think in practice they can be worked around easily and don't cause any real problem. Killing a VPS due to out of memory conditions can happen all the time -- be it the database or a service gone rogue or whatever -- but in the end I've rarely seen Django to be the cause of that.

Marco Paolini

Nov 23, 2014, 2:39:20 PM
to django-d...@googlegroups.com
That function will likely be called much too late.

The client-side postgresql cursor fetches *all query results* in one go when
reading from the cursor the first time.

So again, the issue with "big" resultsets is not the ORM generating django objects
but the very way django uses psycopg.

Why don't we just write a Model.manager.chunked() like proposed here?
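
One plausible shape for such a helper, written here as a standalone function rather than a manager method (illustrative only, not an existing API): iterate in pk-ordered batches so neither psycopg nor the ORM ever holds more than `size` rows at once.

def chunked(queryset, size=1000):
    last_pk = None
    ordered = queryset.order_by('pk')
    while True:
        batch = ordered.filter(pk__gt=last_pk) if last_pk is not None else ordered
        batch = list(batch[:size])
        if not batch:
            return
        for obj in batch:
            yield obj
        last_pk = batch[-1].pk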

Rick van Hattem

Nov 23, 2014, 4:08:00 PM
to Florian Apolloner, django-d...@googlegroups.com

Hi Florian,

On 23 Nov 2014 16:22, "Florian Apolloner" <f.apo...@gmail.com> wrote:
>
> Hi Rick,
>
>
> On Sunday, November 23, 2014 1:11:13 PM UTC+1, Rick van Hattem wrote:
>>
>> If/when an unsliced queryset were to reach a certain limit (say, 10,000, but configurable) the system would raise an error.
>
>
> Django can't know if that would be the case without issuing an extra query -- and even then another transaction might commit a batch operation adding 10k items…
>  

Actually, it can. If Django limits unsliced queries to 10001 items it can simply raise an error if 10001 items are returned.

>> - Protects servers from going down due to memory constraints.
>
>
> Not really, cause psycopg already fetched everything.

Not if Django limits it by default :)

>> So please, can anyone give a good argument as to why any sane person would have a problem with a huge default limit which will kill the performance of your site anyhow but isn't enough to kill the entire system?
>
>
> There are way easier ways to kill the performance of your site than fetching a few rows, an arbitrary limit on queries is not going to fix anything. Code applications defensively and don't use 3rd party stuff without introspecting it for problems first. While you are right that there are issues in theory, I think in practice they can be worked around easily and don't cause any real problem. Killing a VPS due to out of memory conditions can happen all the time -- be it the database or a service gone rogue or whatever -- but in the end I've rarely seen Django to be the cause of that.

In that case you are lucky. I've seen problems like these in over half a dozen different projects (which I didn't write myself, so I couldn't set them up properly).

Even the batch job argument has flaws imho: in the case of batch jobs it's still not a good idea to fetch everything in one big go. But even if it were, within Django a view is easily distinguishable from a batch job, so that doesn't have to be a problem.

But the hostility against the ideas alone makes me lose hope that anything will ever be merged in. It's too bad that people always assume full control over projects. If you're hired to fix problems you can't fix design flaws that easily.

Christophe Pettus

Nov 23, 2014, 4:13:55 PM
to django-d...@googlegroups.com

On Nov 23, 2014, at 1:07 PM, Rick van Hattem <wo...@wol.ph> wrote:

> > Not really, cause psycopg already fetched everything.
>
> Not if Django limits it by default :)

Unfortunately, that's not how it works. There are three things that take up memory as the result of a query result:

1. The Django objects. These are created in units of 100 at a time.
2. The psycopg2 Python objects from the result. These are already limited to a certain number (I believe to 100) at a time.
3. The results from libpq. These are not limited, and there is no way of limiting them without creating a named cursor, which is a significant change to how Django interacts with the database.

In short, without substantial, application-breaking changes, you can't limit the amount of memory a query returns unless you add a LIMIT clause to it. However, adding a LIMIT clause can often cause performance issues all by itself:

http://thebuild.com/blog/2014/11/18/when-limit-attacks/

There's no clean fix that wouldn't have significant effects on unsuspecting applications.
--
-- Christophe Pettus
x...@thebuild.com

Rick van Hattem

Nov 23, 2014, 4:53:26 PM
to django-d...@googlegroups.com

Very true, that's a fair point. That's why I'm opting for a configurable option. Patching this within Django has saved me in quite a few cases but it can have drawbacks.


Christophe Pettus

Nov 23, 2014, 4:57:48 PM
to django-d...@googlegroups.com

On Nov 23, 2014, at 1:53 PM, Rick van Hattem <wo...@wol.ph> wrote:

> Very true, that's a fair point. That's why I'm opting for a configurable option. Patching this within Django has saved me in quite a few cases but it can have drawbacks.

As a DB guy, I have to say that if an application is sending a query that expects to get 100 results back but gets 1,000,000 back, you have a bug that needs to be tracked down and fixed. Patching it by limiting the results is kind of a database version of "except: pass" to get rid of an inconvenient but mysterious exception.

Mattias Linnap

Nov 24, 2014, 3:36:32 AM
to django-d...@googlegroups.com, wo...@wol.ph
Since your use case seems to be "avoid blocking the server when one pageview uses unexpectedly many resources", and you don't want to fix or optimise the actual application code, why not set these limits at a level where it makes sense?

For example:
* http://uwsgi-docs.readthedocs.org/en/latest/Options.html#limit-as sets a memory allocation limit on the UWSGI worker processes, and terminates & restarts the worker if it uses more memory. This will protect against not just big querysets, but also big plain python lists, loading huge image or xml files, memory leaks in C extensions and any other memory-consuming issues.
* http://uwsgi-docs.readthedocs.org/en/latest/FAQ.html#what-is-harakiri-mode sets a timeout on request handling, and terminates & restarts the worker if it takes more than X seconds to serve the request. This protects against not just big querysets, but also any other slow code that ties up the worker, for example forgetting to set a timeout on calls to external APIs, or a plain old infinite loop.
* http://stackoverflow.com/questions/5421776/how-can-i-limit-database-query-time-during-web-requests You can set a PostgreSQL query timeout in seconds for the queries run by the web app user. This makes far more sense than a limit on the result size, as complex JOINs can be very resource-intensive but only return a few rows in the result (see the settings sketch after this list).
There should be similar configuration options for other app servers and databases as well.
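
For example, the PostgreSQL statement timeout can also be set per connection from Django's settings (the values and database name are illustrative; OPTIONS is passed through to psycopg2/libpq):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'OPTIONS': {
            # Cancel any statement running longer than 5 seconds.
            'options': '-c statement_timeout=5000',
        },
    }
}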

Django's ORM is just the wrong level in the software stack for these limits, since there are hundreds of other ways to kill the performance of a server, and the number of results in a queryset is a poor indicator of performance issues.

Rick van Hattem

Nov 24, 2014, 4:01:07 AM
to Mattias Linnap, django-d...@googlegroups.com

If that is an option then it's definitely a better location to set limits to prevent the server from going down.

It helps nothing when it comes to debugging though, which is the primary reason for patching the ORM.

And in addition to that, quite a few customers won't let you change the hosting setup just like that. Especially since uwsgi is one of the least used deployment options in my experience.

Rick van Hattem

Nov 24, 2014, 4:09:17 AM
to django-d...@googlegroups.com
Indeed, except it's not an "except: pass" but an "except: raise" which I'm proposing. Which makes a world of difference.


Christophe Pettus

Nov 24, 2014, 6:17:04 AM
to django-d...@googlegroups.com

On Nov 24, 2014, at 1:08 AM, Rick van Hattem <wo...@wol.ph> wrote:

> Indeed, except it's not an "except: pass" but an "except: raise" which I'm proposing. Which makes a world of difference.

Well, as previously noted, this option would introduce another round-trip into every database query if it's actually going to check (and, in most databases, would have to run inside a transaction at, at least, REPEATABLE READ isolation mode in order to provide a strong guarantee, so you can add three more statements to a lot of interactions right there). That seems to reduce the "high performance" part of the system.

What I'm getting at is: This is a bug in the application. It's not a misfeature in Django. If you have a bug in an application whose source code cannot be changed, that's a shame, but I don't think that Django can be expected to introduce configuration options to cover every possible scenario in which a previously-written Django application interacts badly with the database in a way in which, in theory, it could do extra work (slowing down non-buggy applications and introducing more code to QA) to patch the bad application. To me, this is about as logical as putting in DIVIDE_BY_ZERO_RETURNS_NONE with a default of False because someone wrote some legacy code that would work if we did that.

Rick van Hattem

Nov 24, 2014, 6:37:07 AM
to django-d...@googlegroups.com
Hi Christophe,

As I previously explained, there's no need for a roundtrip or different transactions levels.

If you fetch N+1 items you know if there are over N items in your list. So with that in mind you can simply do a query with a limit of N+1 to know if your query returns over N items.

No need for different isolation modes or extra statements, the only cost is that instead of a maximum of 10,000 items you will be fetching 10,001 items iff. there is no limit currently defined for the query.
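
A sketch of that N+1 check (QuerysetTooLarge and MAX_ROWS are made-up names, not Django APIs):

MAX_ROWS = 10000

class QuerysetTooLarge(Exception):
    pass

def fetch_guarded(queryset):
    rows = list(queryset[:MAX_ROWS + 1])  # one query with LIMIT MAX_ROWS + 1
    if len(rows) > MAX_ROWS:
        raise QuerysetTooLarge(
            'Query returned more than %d rows; slice it explicitly if this '
            'is intentional.' % MAX_ROWS)
    return rows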


As for this being a bug in the application instead of Django, I fully agree. But it can be very difficult to debug issues like these if at some random moment the entire server crashes.

 - Limiting the memory usage for the application doesn't help in debugging since it will just kill the Django app without a proper stacktrace.
 - In some cases you are unable to limit the resources on a server level and you'll just have to watch the entire machine die

The proposed solution is simply a safety net to prevent a complete server from dying completely, it doesn't even have to be a default option but it's one that I would like to have to make debugging easier and to protect servers from dying without any traceable cause as the resource starvation can kill the machine without ever showing anything useful in the logs.


Christophe Pettus

Nov 24, 2014, 1:40:00 PM
to django-d...@googlegroups.com

On Nov 24, 2014, at 3:36 AM, Rick van Hattem <wo...@wol.ph> wrote:
> If you fetch N+1 items you know if there are over N items in your list.

Let's stop there. Unfortunately, because of the way libpq works, just sending the query and checking the result set size won't solve your problem, except for an even smaller edge case.

Using the standard (non-named-cursor) protocol, when you get the first result back from libpq, *every result* is sent over the wire, and takes up space on the client. Thus, no limitation on the client side (number of Django objects created, number of Python objects created by psycopg2) will prevent an out-of-memory error. In my experience (and I've seen this problem a lot), the OOM occurs on the libpq results, not on the other parts.

Thus, the proposal only solves the relatively narrow case where the libpq result does *not* create an OOM, but creating the Django and psycopg2 objects does.

I'll note also that it's not that server that dies in this case; the particular thread doing the request gets an exception.

Rick van Hattem

Nov 24, 2014, 2:17:04 PM
to django-d...@googlegroups.com
It seems you are misunderstanding what I am trying to do here. The 10,000 (or whatever, that should be configurable) is a number large enough not to bother anyone but small enough not to trigger the OOM system. In that case it works perfectly.

Do note that the proposed solution is not a hypothetical one; it's a tried and proven one that has saved me a lot of headache in the past. It definitely helped a lot when I was trying 3rd party apps at Fashiolista.com, one of the larger Django-based websites.


Christophe Pettus

Nov 24, 2014, 3:32:51 PM
to django-d...@googlegroups.com

On Nov 24, 2014, at 11:16 AM, Rick van Hattem <wo...@wol.ph> wrote:

> It seems you are misunderstanding what I am trying to do here. The 10,000 (or whatever, that should be configurable) is a number large enough not to bother anyone but small enough not to trigger the OOM system.

There are really only four options that could be implemented at the framework level:

1. You add a LIMIT to every query. As noted, this can change the performance of a query, sometimes quite radically, often in a bad way. An OOM error is clear and explicit; a query suddenly running 100x slower is a lot stranger to track down. (Also, unless you decide that returning the entire LIMIT amount is an error, you can't really know if the query would have returned *more* than that.)

2. You don't add the LIMIT. At that point, you have no guard against excessive memory consumption.

3. You do two queries, one a count(*), one to get the result. This is a huge protocol change, and quite inefficient (and without wrapping it in a transaction at an elevated isolation mode, not reliable).

4. You use a named cursor. This is also a huge protocol change.

Those are your choices.

It's not that the *problem* is unusual; it's that being unable to fix the problem in the application is unusual. I've encountered this problem, sure, and what I did was fix the code; it took like two minutes. Indeed, I would *strongly* advise that if you are issuing queries that you expect to get 100 results back and are getting a memory-crushing result back, you fundamentally don't understand something about your data, and need to address that promptly. Running properly on large datasets is a big job; just a patch like this isn't going to solve all the issues.

In your particular case, where you have the relatively unusual situation that:

1. You have this problem, and,
2. You can't fix the code to solve this problem.

... having a local patch for Django is probably the right answer.

Michael Manfre

Nov 24, 2014, 3:53:56 PM
to django-d...@googlegroups.com
On Mon, Nov 24, 2014 at 3:32 PM, Christophe Pettus <x...@thebuild.com> wrote:
In your particular case, where you have the relatively unusual situation that:

1. You have this problem, and,
2. You can't fix the code to solve this problem.

... having a local patch for Django is probably the right answer.

Alternatively, the same could be accomplished with a custom database backend that enforces whatever limit value is desired. If I needed this behavior, I'd probably tweak the query.low_mark and query.high_mark values in SQLCompiler.as_sql before passing it along to the parent.
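
A rough sketch of that compiler tweak (wiring it into a custom backend's compiler module, plus the rest of the backend plumbing, is omitted; the cap is an example value):

from django.db.models.sql.compiler import SQLCompiler

MAX_ROWS = 10000

class CappedSQLCompiler(SQLCompiler):
    def as_sql(self, *args, **kwargs):
        # Only cap queries that have no explicit upper bound already.
        if self.query.high_mark is None:
            self.query.set_limits(high=MAX_ROWS)
        return super(CappedSQLCompiler, self).as_sql(*args, **kwargs)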

I think it's safe to say that this functionality is unlikely to find its way in to Django, but if you have questions about the custom database backend approach, feel free to contact me.

Regards,
Michael Manfre

Rick van Hattem

Nov 24, 2014, 4:05:47 PM
to django-d...@googlegroups.com
Thanks for the help, but writing the custom database backend won't be a problem; I've written one before :)

My goal was simply to move the Django project forward but it seems the problems I've encountered in the field are too uncommon for most other developers to care or understand.

Thank you all for the discussion.


Florian Apolloner

Nov 25, 2014, 3:06:53 AM
to django-d...@googlegroups.com, wo...@wol.ph


On Monday, November 24, 2014 10:05:47 PM UTC+1, Rick van Hattem wrote:
My goal was simply to move the Django project forward but it seems the problems I've encountered in the field are too uncommon for most other developers to care or understand.

Oh, I can assure you that we care and understand (and you are certainly not the only one with a somewhat big site), but in the end I think the arguments against it, as shown by Christophe, outweigh what you are proposing. That said, I still fail to see the issue: set a memory limit for your apps (man ulimit) and you get a nice MemoryError when libpq tries to overallocate -- including a nice traceback… Adding security guards for any ill-written application is not going to fly (especially if it's not as easy as adding a LIMIT, which has its own problems as pointed out), sorry.
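
The same cap can be set from inside the process as well (Unix only; the 1 GB value is an arbitrary example), which is handy when you can't touch the hosting setup:

import resource

ONE_GB = 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (ONE_GB, ONE_GB))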

Cheers,
Florian