[Django] #30207: Optimise paginator for tables with massive records


Django

Feb 24, 2019, 8:27:56 AM2/24/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
------------------------------------------------+------------------------
Reporter: M. Javidan Darugar | Owner: nobody
Type: Cleanup/optimization | Status: new
Component: Core (Other) | Version: master
Severity: Normal | Keywords:
Triage Stage: Unreviewed | Has patch: 0
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
------------------------------------------------+------------------------
I had a problem with the Paginator class when slicing a huge dataset: with
millions of records it can take 30 seconds to reach the last page. So I
came up with a solution to overcome this problem.

The problem is that, to slice the data, the current implementation makes
the database compute the whole result set before returning the
`bottom:top` slice. The generated query looks like
`SELECT ID, COL_1, ..., COL_N, ... WHERE ...`, which places a huge burden
on the database when slicing the data. To overcome this, we can instead
select only the primary keys, do the slicing on that list, and then fetch
the records whose pk is in the sliced list. A very simple but very
efficient solution. Using this approach I improved the performance of our
project significantly: from 30 seconds down to only 2-3 seconds for
8 million records.

In this ticket I propose the same approach to improve Django's Paginator
class.
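
For illustration only, here is a minimal sketch of the proposed two-step
slicing as a custom paginator. This is not the actual patch: the class
name is hypothetical and it assumes `object_list` is a QuerySet.

{{{
from django.core.paginator import Paginator


class PKSlicingPaginator(Paginator):
    """Hypothetical sketch: slice on primary keys, then fetch the rows."""

    def page(self, number):
        number = self.validate_number(number)
        bottom = (number - 1) * self.per_page
        top = bottom + self.per_page
        if top + self.orphans >= self.count:
            top = self.count
        # Step 1: slice only the primary keys (a much lighter query).
        pks = list(self.object_list.values_list('pk', flat=True)[bottom:top])
        # Step 2: fetch the full rows for just those primary keys.
        page_objects = self.object_list.filter(pk__in=pks)
        return self._get_page(page_objects, number, self)
}}}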

--
Ticket URL: <https://code.djangoproject.com/ticket/30207>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

Feb 24, 2019, 8:28:50 AM2/24/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
-------------------------------------+-------------------------------------
Reporter: M. Javidan Darugar | Owner: M. Javidan Darugar
Type: Cleanup/optimization | Status: assigned
Component: Core (Other) | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by M. Javidan Darugar):

* status: new => assigned
* owner: nobody => M. Javidan Darugar


--
Ticket URL: <https://code.djangoproject.com/ticket/30207#comment:1>

Django

Feb 24, 2019, 2:46:42 PM2/24/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
-------------------------------------+-------------------------------------
Reporter: M. Javidan Darugar | Owner: M. Javidan Darugar
Type: Cleanup/optimization | Status: assigned
Component: Core (Other) | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

Hello there,

Could you provide more details about which database backend is sped up by
this technique? I'm a bit surprised that two queries perform faster than
one in this case, because even if the first one allows the use of an
index-only scan, the tuples still have to be fetched in the end. The query
planner should be smart enough to chain the operations in a single query.
I suspect something else might be at play here, such as the query to
''count'' the number of results.

By the way, the issue title should be changed to reflect what the
optimization technique is, as right now it's pretty vague.
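
(For illustration, the two steps could also be expressed as one statement
by using the sliced pk-only queryset as an ''IN'' subquery; this is only a
hedged sketch and assumes a backend, such as PostgreSQL, that accepts
LIMIT/OFFSET inside an IN subquery.)

{{{
# Hypothetical single-query variant of the pk-slicing idea: the sliced
# pk-only queryset is compiled into an IN (...) subquery, so the planner
# sees a single statement instead of two round trips.
def page_rows(queryset, bottom, top):
    pk_slice = queryset.values('pk')[bottom:top]
    return queryset.filter(pk__in=pk_slice)
}}}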

--
Ticket URL: <https://code.djangoproject.com/ticket/30207#comment:2>

Django

Feb 25, 2019, 10:06:22 AM2/25/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
-------------------------------------+-------------------------------------
Reporter: M. Javidan Darugar | Owner: M. Javidan Darugar
Type: Cleanup/optimization | Status: assigned
Component: Core (Other) | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Nikolas):

1. I too use a paginator hack for the admin. It works on Postgres only,
and only when no filters are applied; something like this could be a good
built-in optimization:

{{{
from django.core.paginator import Paginator
from django.db import connection


class LargeTablePaginator(Paginator):
    """
    Warning: PostgreSQL-only hack.
    Overrides Paginator.count to return the planner's row estimate from
    pg_class instead of an exact count when the queryset is unfiltered.
    The estimate can be stale, so it is not suitable when the exact
    number of objects matters.
    """

    def _get_count(self):
        if getattr(self, '_count', None) is not None:
            return self._count

        query = self.object_list.query
        if not query.where:
            try:
                with connection.cursor() as cursor:
                    cursor.execute(
                        "SELECT reltuples FROM pg_class WHERE relname = %s",
                        [query.model._meta.db_table],
                    )
                    self._count = int(cursor.fetchone()[0])
            except Exception:
                self._count = super().count
        else:
            self._count = super().count

        return self._count

    count = property(_get_count)
}}}

This gives maximum speed on a table of any size.

2. If the query is ordered only by pk, it can be seriously optimized when
you know the maximum pk of the previous page and don't need to jump to an
arbitrary page: in that case you just need
`SELECT ... WHERE pk < last_max_id`. The pk index will be used and you get
maximum speed.
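
For illustration, a minimal sketch of that keyset approach, assuming a
queryset ordered by descending pk (the helper name and page size are
hypothetical):

{{{
# Hypothetical keyset ("seek") pagination: instead of using OFFSET,
# filter on the last pk seen on the previous page so the database can
# walk the pk index directly.
def next_page(queryset, last_max_id=None, page_size=25):
    qs = queryset.order_by('-pk')
    if last_max_id is not None:
        qs = qs.filter(pk__lt=last_max_id)
    return list(qs[:page_size])
}}}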

--
Ticket URL: <https://code.djangoproject.com/ticket/30207#comment:3>

Django

Mar 1, 2019, 5:33:35 PM3/1/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
-------------------------------------+-------------------------------------
Reporter: M. Javidan Darugar | Owner: M. Javidan Darugar
Type: Cleanup/optimization | Status: closed
Component: Core (Other) | Version: master
Severity: Normal | Resolution: needsinfo
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Tim Graham):

* status: assigned => closed
* resolution: => needsinfo


Comment:

The technical justification seems lacking both here and on the
[https://github.com/django/django/pull/11016 PR].

--
Ticket URL: <https://code.djangoproject.com/ticket/30207#comment:4>

Django

Jun 9, 2019, 4:01:46 AM6/9/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
-------------------------------------+-------------------------------------
Reporter: M. Javidan Darugar | Owner: M. Javidan Darugar
Type: Cleanup/optimization | Status: new
Component: Core (Other) | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by M. Javidan Darugar):

* status: closed => new
* resolution: needsinfo =>


Comment:

I am reopening this ticket because it is not solved yet. If I reply to
messages with some delay, it is because I have a full-time job and rarely
have time to work on a side project, but I still love to contribute to the
Django community.

Some of the comments on the GitHub branch were about the naming of
variables or function arguments, which suggests there is no standard for
naming variables, since everyone can suggest something new.

Another comment was largely irrelevant to what I proposed. If Postgres has
a new feature, that does not mean it is good for Django, because Django
has an ORM which should work with other databases as well.

--
Ticket URL: <https://code.djangoproject.com/ticket/30207#comment:5>

Django

Jun 9, 2019, 4:52:04 AM6/9/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
-------------------------------------+-------------------------------------
Reporter: M. Javidan Darugar | Owner: M. Javidan Darugar
Type: Cleanup/optimization | Status: new
Component: Core (Other) | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by M. Javidan Darugar):

What do you mean by index? A table index? For a table with this much data
the index itself can be very large; in our case the indexes used almost
18 GB of disk space. Indexing is a very useful method, but not always: it
depends on the problem and the situation.

Replying to [comment:3 Nikolas]:

--
Ticket URL: <https://code.djangoproject.com/ticket/30207#comment:6>

Django

Jun 9, 2019, 4:56:03 AM6/9/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
-------------------------------------+-------------------------------------
Reporter: M. Javidan Darugar | Owner: M. Javidan Darugar
Type: Cleanup/optimization | Status: new
Component: Core (Other) | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by M. Javidan Darugar):

Well, it is a bit surprising to me as well that many developers don't know
about this, because it is a very basic fact about how databases work;
maybe it is because most developers focus on the code base rather than on
raw queries.

Replying to [comment:2 Simon Charette]:


> Hello there,
>
> Could you provide more details about which database backend is sped up
> by this technique? I'm a bit surprised that two queries perform faster
> than one in this case, because even if the first one allows the use of
> an index-only scan, the tuples still have to be fetched in the end. The
> query planner should be smart enough to chain the operations in a single
> query. I suspect something else might be at play here, such as the query
> to ''count'' the number of results.
>
> By the way, the issue title should be changed to reflect what the
> optimization technique is, as right now it's pretty vague.

--
Ticket URL: <https://code.djangoproject.com/ticket/30207#comment:7>

Django

Jun 9, 2019, 6:21:34 AM6/9/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
-------------------------------------+-------------------------------------
Reporter: M. Javidan Darugar | Owner: M. Javidan Darugar
Type: Cleanup/optimization | Status: new
Component: Core (Other) | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Can Sarıgöl):

* cc: Can Sarıgöl (added)


--
Ticket URL: <https://code.djangoproject.com/ticket/30207#comment:8>

Django

Jun 10, 2019, 4:08:00 AM6/10/19
to django-...@googlegroups.com
#30207: Optimise paginator for tables with massive records
-------------------------------------+-------------------------------------
Reporter: M. Javidan Darugar | Owner: M. Javidan Darugar
Type: Cleanup/optimization | Status: closed
Component: Core (Other) | Version: master
Severity: Normal | Resolution: needsinfo
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Carlton Gibson):

* status: new => closed
* resolution: => needsinfo


Comment:

Hi. At this point, the issue tracker here is not the appropriate place for
this discussion.

Please see TicketClosingReasons/DontReopenTickets.

Rather, a message to the DevelopersMailingList explaining the issue and
the proposed change would be needed to progress.

A couple of points to make it easier:

* This kind of thing has come up previously: search here and the mailing
list history to find the discussions. Summarizing that history may help.
* Wrapping your suggestion in a custom paginator class and providing a
sample project, perhaps with some benchmarking data, so people can explore
your solution, would be useful.
* In general, the proficiency level of contributors to the ORM is very
high. They know how databases work. Assume that.

Thanks.

--
Ticket URL: <https://code.djangoproject.com/ticket/30207#comment:9>
