Benchmarking change to query.get() #15361
From: mmcnickle <mmcnic...@gmail.com>
Date: Thu, 24 Feb 2011 06:59:01 -0800 (PST)
Local: Thurs, Feb 24 2011 9:59 am
Subject: Benchmarking change to query.get() #15361
Hi All,

Background to this post is available at http://code.djangoproject.com/ticket/15361

I've created a better benchmark in order to test whether the change in
the above ticket causes a performance regression. These are the
results of those tests.

First of all, the results are based on query.get() on 20000 Book
objects, generated by the following code [1].

I ran the following 2 benchmarks, query_get [2] and query_get_multiple
[3], on MySQL and sqlite. query_get is a simple get() using a unique
indexed column and will return one object only. query_get_multiple is
a get() on a non-indexed column; its filter matches 1284 objects.

The benchmark results, as reported by djangobench [4] are as follows:

sqlite:

-----------------------------------
Running all benchmarks
Control: Django 1.3 beta 1 (in django-control)
Experiment: Django 1.3 beta 1 (in django-experiment)

Running 'query_get' benchmark ...
Min: 0.000000 -> 0.000000: incomparable (one result was zero)
Avg: 0.000745 -> 0.000979: 1.3141x slower
Significant (t=-5.900348)
Stddev: 0.00263 -> 0.00297: 1.1318x larger (N = 10000)

Running 'query_get_multiple' benchmark ...
Min: 0.020000 -> 0.000000: incomparable (one result was zero)
Avg: 0.029883 -> 0.001072: 27.8759x faster
Significant (t=482.259535)
Stddev: 0.00511 -> 0.00309: 1.6519x smaller (N = 10000)
-----------------------------------

mysql:

-----------------------------------
Running all benchmarks
Control: Django 1.3 beta 1 (in django-control)
Experiment: Django 1.3 beta 1 (in django-experiment)

Running 'query_get' benchmark ...
Min: 0.000000 -> 0.000000: incomparable (one result was zero)
Avg: 0.000810 -> 0.001039: 1.2827x slower
Significant (t=-5.591014)
Stddev: 0.00273 -> 0.00305: 1.1169x larger (N = 10000)

Running 'query_get_multiple' benchmark ...
Min: 0.020000 -> 0.000000: incomparable (one result was zero)
Avg: 0.028856 -> 0.001152: 25.0486x faster
Significant (t=429.695948)
Stddev: 0.00560 -> 0.00319: 1.7544x smaller (N = 10000)
-----------------------------------

As you can see, with 10000 trials, the speed differences are
significant (and repeatable) and are roughly as follows:

query.get() on a unique indexed column runs 1.3x slower
query.get() on a non-unique, non-indexed column runs 25-27x FASTER
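
The ratios quoted above can be re-derived from the averages in the djangobench output. The following is a small Python sketch that does so. The helper names `ratio` and `t_score` are made up for illustration, and the t statistic assumes a plain two-sample formula with equal sample sizes, which may not be exactly what djangobench computes. The inputs are the rounded figures printed above, so only approximate agreement with the reported t values should be expected.

```python
import math

N = 10000  # trials per benchmark, as reported in the output above

# (control avg, experiment avg, control stddev, experiment stddev), in seconds,
# copied from the djangobench output in this post.
RESULTS = {
    ("sqlite", "query_get"): (0.000745, 0.000979, 0.00263, 0.00297),
    ("sqlite", "query_get_multiple"): (0.029883, 0.001072, 0.00511, 0.00309),
    ("mysql", "query_get"): (0.000810, 0.001039, 0.00273, 0.00305),
    ("mysql", "query_get_multiple"): (0.028856, 0.001152, 0.00560, 0.00319),
}


def ratio(control_avg, experiment_avg):
    """Return (factor, direction) describing the experiment relative to the control."""
    if experiment_avg >= control_avg:
        return experiment_avg / control_avg, "slower"
    return control_avg / experiment_avg, "faster"


def t_score(control_avg, experiment_avg, control_sd, experiment_sd, n=N):
    """Two-sample t statistic for equal sample sizes (assumed formula, see lead-in)."""
    standard_error = math.sqrt((control_sd ** 2 + experiment_sd ** 2) / n)
    return (control_avg - experiment_avg) / standard_error


for (db, name), (c_avg, e_avg, c_sd, e_sd) in RESULTS.items():
    factor, direction = ratio(c_avg, e_avg)
    t = t_score(c_avg, e_avg, c_sd, e_sd)
    print(f"{db:6} {name:18} {factor:.4f}x {direction}, t ~ {t:.2f}")
```

A negative t here corresponds to the experiment being slower than the control, matching the sign convention in the output above.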

I've done some very quick tests to see how these gains/losses scale
for various values of n objects:

The speedup for the non-indexed columns grows exponentially with n.
The slowdown for the indexed columns is roughly constant for all n.

So there you have it, we have a small regression in performance for
the most common use case, and a huge potential gain for the less used
(and some would argue, badly designed) query.

What do you think, is the gain worth the hit? Is it possible to have 2
different code paths depending on what column(s) the query is filtering
on?

-- Martin

P.S. Between trials, djangobench will try to reload the
initial_data.json fixture, which for 20000 objects is very
time-consuming. If you want to reproduce the results yourself, I'd suggest
creating a database with the objects already in it, instead of relying
on fixtures.
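
One way to prepare such a pre-populated database, sketched with Python's standard sqlite3 module rather than Django itself. The `book` table, its columns, and the generated values are hypothetical stand-ins and not the schema from the generation script in [1]. In a real Django project the table would be created by Django and named after the app, so treat this only as an outline of the idea: insert all rows once, in a single transaction, and reuse the resulting file.

```python
import sqlite3


def build_prepopulated_db(path, n_books=20000):
    """Create a hypothetical 'book' table at `path` and fill it with n_books rows."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success, so all inserts land in one transaction
        conn.execute("DROP TABLE IF EXISTS book")
        conn.execute(
            "CREATE TABLE book ("
            "id INTEGER PRIMARY KEY, "
            "isbn TEXT UNIQUE NOT NULL, "
            "title TEXT NOT NULL)"
        )
        rows = ((f"isbn-{i:06d}", f"Book {i}") for i in range(n_books))
        conn.executemany("INSERT INTO book (isbn, title) VALUES (?, ?)", rows)
    return conn


# Demo against an in-memory database; pass a file path to keep the data around.
demo = build_prepopulated_db(":memory:")
print(demo.execute("SELECT COUNT(*) FROM book").fetchone()[0])
demo.close()
```

Passing a file path instead of ":memory:" leaves a database file on disk that can be copied into place before each run instead of reloading a fixture.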

----
[1] object generation script -- http://pastebin.com/6JAJDA6f
[2] query_get benchmark -- http://pastebin.com/qZBdvSie
[3] query_get_multiple benchmark -- http://pastebin.com/iEYsfmd5
[4] djangobench project (Luke's fork) -- https://github.com/spookylukey/djangobench


 
From: Jacob Kaplan-Moss <ja...@jacobian.org>
Date: Thu, 24 Feb 2011 17:15:22 +0200
Local: Thurs, Feb 24 2011 10:15 am
Subject: Re: Benchmarking change to query.get() #15361

On Thu, Feb 24, 2011 at 4:59 PM, mmcnickle <mmcnic...@gmail.com> wrote:
> So there you have it, we have a small regression in performance for
> the most common use case, and a huge potential gain for the less used
> (and some would argue, badly designed) query.

> What do you think, is the gain worth the hit? Is it possible to have 2
> different code paths depending on what column(s) the query is filtering
> on?

Hm, I don't think it is. If get() is a performance concern, you should
have a unique index on the column. I think penalizing people "doing it
right" even a little bit is a bad idea.

I *do* think we should add a note to the documentation -- in get(),
and/or in the database optimization doc -- about being sure that
you're only using get() on unique columns for best performance.
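
As a rough illustration of why the unique index matters, the sketch below uses Python's standard sqlite3 module rather than Django's ORM. The `book` table, the `isbn` and `publisher` columns, and the data are all hypothetical and not taken from the benchmark scripts. It asks SQLite for its query plan for a lookup on a unique column and for a filter on a non-indexed column; the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# isbn is declared UNIQUE, so SQLite builds an index for it automatically;
# publisher has no index at all.
conn.execute(
    "CREATE TABLE book (id INTEGER PRIMARY KEY, isbn TEXT UNIQUE, publisher TEXT)"
)
conn.executemany(
    "INSERT INTO book (isbn, publisher) VALUES (?, ?)",
    ((f"isbn-{i}", f"publisher-{i % 16}") for i in range(20000)),
)


def query_plan(sql, params=()):
    """Return SQLite's query plan for `sql` as one string (detail is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


unique_plan = query_plan("SELECT * FROM book WHERE isbn = ?", ("isbn-42",))
scan_plan = query_plan("SELECT * FROM book WHERE publisher = ?", ("publisher-3",))

print("unique column:     ", unique_plan)
print("non-indexed column:", scan_plan)
```

The lookup on the unique column is answered through the index, while the filter on the non-indexed column has to walk the whole table and, with this made-up data, matches many rows.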

Jacob


 
From: Russell Keith-Magee <russ...@keith-magee.com>
Date: Fri, 25 Feb 2011 07:40:24 +0800
Local: Thurs, Feb 24 2011 6:40 pm
Subject: Re: Benchmarking change to query.get() #15361

I concur. If you're retrieving by a column, you should be indexing on
that column. Optimizing Django's retrieval code for the bad design
case strikes me as equally bad design. I say document this and leave
the code as is.

Yours,
Russ Magee %-)


 
From: mmcnickle <mmcnic...@gmail.com>
Date: Fri, 25 Feb 2011 02:39:47 -0800 (PST)
Local: Fri, Feb 25 2011 5:39 am
Subject: Re: Benchmarking change to query.get() #15361
On Feb 24, 11:40 pm, Russell Keith-Magee <russ...@keith-magee.com>
wrote:

> Optimizing Django's retrieval code for the bad design
> case strikes me as equally bad design. I say document this and leave
> the code as is.

Ok, I'll write a documentation patch for this for the optimisation
section.

Good job catching this slowdown, Luke. I would never have thought
of it.

-- Martin


 