Benchmarking change to query.get() #15361

mmcnickle

Feb 24, 2011, 9:59:01 AM

to Django developers

Hi All,

Background to this post is available at http://code.djangoproject.com/ticket/15361

I've created a better benchmark in order to test where the change in
the above ticket causes a performance regression. These are the
results of those tests.

First of all, the results are based on query.get() on 20000 Book
objects, generated by the following code [1]

I ran the following 2 benchmarks, query_get [2] and query_get_multiple
[3], on MySQL and sqlite. query_get is a simple get() using a unique
indexed column and will return one object only. query_get_multiple is
a get() on a non-indexed column; it will return 1284 objects.

The benchmark results, as reported by djangobench [4] are as follows:

sqlite:

-----------------------------------
Running all benchmarks
Control: Django 1.3 beta 1 (in django-control)
Experiment: Django 1.3 beta 1 (in django-experiment)

Running 'query_get' benchmark ...
Min: 0.000000 -> 0.000000: incomparable (one result was zero)
Avg: 0.000745 -> 0.000979: 1.3141x slower
Significant (t=-5.900348)
Stddev: 0.00263 -> 0.00297: 1.1318x larger (N = 10000)

Running 'query_get_multiple' benchmark ...
Min: 0.020000 -> 0.000000: incomparable (one result was zero)
Avg: 0.029883 -> 0.001072: 27.8759x faster
Significant (t=482.259535)
Stddev: 0.00511 -> 0.00309: 1.6519x smaller (N = 10000)
-----------------------------------

mysql:

-----------------------------------
Running all benchmarks
Control: Django 1.3 beta 1 (in django-control)
Experiment: Django 1.3 beta 1 (in django-experiment)

Running 'query_get' benchmark ...
Min: 0.000000 -> 0.000000: incomparable (one result was zero)
Avg: 0.000810 -> 0.001039: 1.2827x slower
Significant (t=-5.591014)
Stddev: 0.00273 -> 0.00305: 1.1169x larger (N = 10000)

Running 'query_get_multiple' benchmark ...
Min: 0.020000 -> 0.000000: incomparable (one result was zero)
Avg: 0.028856 -> 0.001152: 25.0486x faster
Significant (t=429.695948)
Stddev: 0.00560 -> 0.00319: 1.7544x smaller (N = 10000)
-----------------------------------

As you can see, with 10000 trials, the speed differences are
significant (and repeatable) and are roughly as follows:

query.get() on a unique indexed column runs 1.3x slower
query.get() on a non-unique, non-indexed column runs 25-27x FASTER
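
For anyone checking the arithmetic, here is a minimal sketch that recomputes the factors from the "Avg" lines above. It assumes the reported factor is simply the ratio of the two averages, which matches the printed figures; the `describe` helper and the `results` dict are illustrative names, not part of djangobench.

```python
# Average times in seconds, copied from the "Avg" lines of the output above.
# Each value is (control, experiment).
results = {
    ("sqlite", "query_get"): (0.000745, 0.000979),
    ("sqlite", "query_get_multiple"): (0.029883, 0.001072),
    ("mysql", "query_get"): (0.000810, 0.001039),
    ("mysql", "query_get_multiple"): (0.028856, 0.001152),
}


def describe(control, experiment):
    """Return (ratio, direction), with ratio always >= 1.

    Assumption: the factor is the plain ratio of the two averages.
    """
    if experiment > control:
        return experiment / control, "slower"
    return control / experiment, "faster"


for (backend, name), (control, experiment) in results.items():
    ratio, direction = describe(control, experiment)
    print(f"{backend:6} {name:18} {ratio:.4f}x {direction}")
```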

I've done some very quick tests to see how these gains/losses scale
for various values of n objects:

The speedup for the non-indexed columns is exponential in n.
The slowdown for the indexed columns is roughly constant for all n.

So there you have it, we have a small regression in performance for
the most common use case, and a huge potential gain for the less used
(and some would argue, badly designed) query.

What do you think, is the gain worth the hit? Is it possible to have 2
different code paths depending on what column(s) the query is filtering
on?

-- Martin

P.S. Between each trial, djangobench will try to reload the
initial_data.json fixture, which for 20000 objects is very time
consuming. If you want to reproduce the results yourself, I'd suggest
creating a database with the objects already in it, instead of relying
on fixtures.
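
As a rough illustration of the pre-populated approach, here is a minimal sketch that fills an sqlite file once using only the standard library. The `book` table and its `isbn`/`title` columns are hypothetical stand-ins (the real Book model is in the script at [1]), and `build_database` is an illustrative name. In practice the table layout would have to match what Django creates for the benchmark's models.

```python
import sqlite3


def build_database(path, n=20000):
    """Create a hypothetical book table and insert n rows in one transaction."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS book ("
        "id INTEGER PRIMARY KEY, "
        "isbn TEXT NOT NULL UNIQUE, "
        "title TEXT NOT NULL)"
    )
    # Generate the rows lazily; the values here are placeholders.
    rows = ((f"isbn-{i:06d}", f"Book {i}") for i in range(n))
    with conn:  # a single commit when the block exits
        conn.executemany("INSERT INTO book (isbn, title) VALUES (?, ?)", rows)
    return conn
```

Calling `build_database("benchmark.db")` once and reusing the file avoids paying the load cost on every trial. Running it a second time against the same file would fail on the UNIQUE constraint, so delete the file first if you need to regenerate it.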

----
[1] object generation script -- http://pastebin.com/6JAJDA6f
[2] query_get benchmark -- http://pastebin.com/qZBdvSie
[3] query_get_multiple benchmark -- http://pastebin.com/iEYsfmd5
[4] djangobench project (Luke's fork) -- https://github.com/spookylukey/djangobench

Jacob Kaplan-Moss

Feb 24, 2011, 10:15:22 AM

to django-d...@googlegroups.com

On Thu, Feb 24, 2011 at 4:59 PM, mmcnickle <mmcn...@gmail.com> wrote:
> So there you have it, we have a small regression in performance for
> the most common use case, and a huge potential gain for the less used
> (and some would argue, badly designed) query.
>
> What do you think, is the gain worth the hit? Is it possible to have 2
> different code paths depending on what column(s) the query is filtering
> on?

Hm, I don't think it is. If get() is a performance concern, you should
have a unique index on the column. I think penalizing people "doing it
right" even a little bit is a bad idea.

I *do* think we should add a note to the documentation -- in get(),
and/or in the database optimization doc -- about being sure that
you're only using get() on unique columns for best performance.
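
To illustrate the difference a unique index makes, here is a small sketch using sqlite's EXPLAIN QUERY PLAN from the standard library. The `book` table, its columns, and the index name are hypothetical, and the exact plan wording varies between sqlite versions, so the snippet just prints whatever the database reports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: isbn gets a unique index, title gets none.
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, isbn TEXT, title TEXT)")
conn.execute("CREATE UNIQUE INDEX book_isbn_uniq ON book (isbn)")


def plan(sql):
    """Return the human-readable detail of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The description is the last column of each row.
    return "; ".join(row[-1] for row in rows)


indexed_plan = plan("SELECT * FROM book WHERE isbn = 'x'")
unindexed_plan = plan("SELECT * FROM book WHERE title = 'x'")

print("unique indexed column:", indexed_plan)
print("non-indexed column:   ", unindexed_plan)
```

The lookup on the indexed column should be reported as a search using `book_isbn_uniq`, while the lookup on the non-indexed column should be reported as a scan of the whole table.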

Jacob

Russell Keith-Magee

Feb 24, 2011, 6:40:24 PM

to django-d...@googlegroups.com

I concur. If you're retrieving by a column, you should be indexing on
that column. Optimizing Django's retrieval code for the bad design
case strikes me as equally bad design. I say document this and leave
the code as is.

Yours,
Russ Magee %-)

mmcnickle

Feb 25, 2011, 5:39:47 AM

to Django developers

On Feb 24, 11:40 pm, Russell Keith-Magee <russ...@keith-magee.com>
wrote:

> Optimizing Django's retrieval code for the bad design
> case strikes me as equally bad design. I say document this and leave
> the code as is.

Ok, I'll write a documentation patch for this for the optimisation
section.

Good job catching this slowdown, Luke. I would never have thought
of it.

-- Martin
