Select x random rows from DB

427 views
Skip to first unread message

galago

unread,
Feb 20, 2011, 5:19:14 PM2/20/11
to django...@googlegroups.com
What is the best way, to select X random rows from DB? I know that method: .all().order_by('?')[:X] is not good idea.
What methods do you use?

dele...@dudupay.com

unread,
Feb 20, 2011, 5:22:06 PM2/20/11
to django...@googlegroups.com
Knowing what is not a good idea means you know what the good idea is, logically.
So, Galago, what's the good idea???

Sent from my BlackBerry wireless device from MTN


From: galago <pro...@gmail.com>
Date: Sun, 20 Feb 2011 14:19:14 -0800 (PST)
Subject: Select x random rows from DB

What is the best way, to select X random rows from DB? I know that method: .all().order_by('?')[:X] is not good idea.
What methods do you use?

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django...@googlegroups.com.
To unsubscribe from this group, send email to django-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.

galago

unread,
Feb 20, 2011, 5:24:37 PM2/20/11
to django...@googlegroups.com
If i knew what is a good idea, I wouldn't aks here:D
I only know - that described method is DB killer :)

dele...@dudupay.com

unread,
Feb 20, 2011, 5:26:54 PM2/20/11
to django...@googlegroups.com
That would have been my approach if I were to do that. So, we are now two seeking the good way to get it done.
Hello People,
Galago and I have the same challenge, please help us.

Sent from my BlackBerry wireless device from MTN


From: galago <pro...@gmail.com>
Date: Sun, 20 Feb 2011 14:24:37 -0800 (PST)
Subject: Re: Select x random rows from DB

If i knew what is a good idea, I wouldn't aks here:D
I only know - that described method is DB killer :)

--

Cal Leeming [Simplicity Media Ltd]

unread,
Feb 20, 2011, 5:31:44 PM2/20/11
to django...@googlegroups.com, galago
Hi Galago,

There are several ways to do this, and it really depends on the size of your database, and what your storage policies are.

Here are some of the various methods we have used:
  • Use ID max, then random.randint().
    Only works if you *never* delete rows from the database, and instead have a field called 'is_deleted', which is set to 1 each time. Great if you have a database with more than 100 thousand rows.
    POC: from django.db.models import Max,Count,Q,F; total_items = model.aggregate(Max('id'))['id__max']; random.randint(0,total_items)

  • Use .count(), then random.randint()
    Count is slow as hell depending on how many rows you have. 

  • Use ORDER BY RAND() inside the SQL itself, although from what I remember, this wasn't great.
It really does just come down to how well your database has been designed, how many rows you have, and what you are attempting to achieve. I would suggest creating a script which benchmarks all these different methods against the data you have, and see which comes out top, that's what we do, the results are sometimes surprising :)

Hope this helps.

Cal


On Sun, Feb 20, 2011 at 10:19 PM, galago <pro...@gmail.com> wrote:
What is the best way, to select X random rows from DB? I know that method: .all().order_by('?')[:X] is not good idea.
What methods do you use?

--

Cal Leeming [Simplicity Media Ltd]

unread,
Feb 20, 2011, 5:35:16 PM2/20/11
to django...@googlegroups.com, dele...@dudupay.com
Also, this snippet isn't directly related, but it does show the methods we used to paginate through 50 million rows, with query times of below 0.5 seconds :) It uses the same method that was in my first suggestion (the ID max thing)


Cal

Cal Leeming [Simplicity Media Ltd]

unread,
Feb 20, 2011, 5:36:43 PM2/20/11
to django...@googlegroups.com, dele...@dudupay.com
Also, take strong notice of this part:

    EXAMPLE:
    >>> _t = time.time(); x = map(lambda x: x, Post.objects.filter(id__gte=400000, id__lt=400500).all()); print "Took %ss"%(time.time() - _t)
    Took 0.0467309951782s
    >>> _t = time.time(); _res = map(lambda x: x, Post.objects.all()[400000:400500]); print "Took %ss"%(time.time() - _t)
    Took 1.05785298347s
    >>>

Christophe Pettus

unread,
Feb 20, 2011, 7:30:56 PM2/20/11
to django...@googlegroups.com

On Feb 20, 2011, at 2:19 PM, galago wrote:

> What is the best way, to select X random rows from DB? I know that method: .all().order_by('?')[:X] is not good idea.

The best way is to push it onto the DB, using a raw query:

random_results = Table.objects.raw("SELECT * FROM table ORDER BY random() LIMIT X")

--
-- Christophe Pettus
x...@thebuild.com

Cal Leeming [Simplicity Media Ltd]

unread,
Feb 21, 2011, 9:33:32 AM2/21/11
to django...@googlegroups.com

Chris, please don't tell someone "this is the best way", when the solution is dependant on so many other factors (as stated in my original reply). Our answers strongly influence other peoples decisions, and "best practice" answers should only ever be given if you are 100% sure it is correct, and have personally experimented with every other possible method.

Please don't take offence to this email, I would expect someone to say the same to me if I had made a similar mistake.



On 21 Feb 2011 00:31, "Christophe Pettus" <x...@thebuild.com> wrote:


On Feb 20, 2011, at 2:19 PM, galago wrote:

> What is the best way, to select X random rows from DB...

The best way is to push it onto the DB, using a raw query:

       random_results = Table.objects.raw("SELECT * FROM table ORDER BY random() LIMIT X")

--
-- Christophe Pettus
  x...@thebuild.com


--
You received this message because you are subscribed to the Google Groups "Django users" group....



Christophe Pettus

unread,
Feb 21, 2011, 1:42:45 PM2/21/11
to django...@googlegroups.com

On Feb 21, 2011, at 10:34 AM, Eric Chamberlain wrote:
> If you have lots of rows, this query is really slow as the db must build a new table for the ORDER BY.

You can do better if you have a guaranteed ordinal on the rows; otherwise, it has to do a full table scan no matter what.

Eric Chamberlain

unread,
Feb 21, 2011, 1:34:30 PM2/21/11
to django...@googlegroups.com

On Feb 20, 2011, at 4:30 PM, Christophe Pettus wrote:

>
> On Feb 20, 2011, at 2:19 PM, galago wrote:
>
>> What is the best way, to select X random rows from DB? I know that method: .all().order_by('?')[:X] is not good idea.
>
> The best way is to push it onto the DB, using a raw query:
>
> random_results = Table.objects.raw("SELECT * FROM table ORDER BY random() LIMIT X")

If you have lots of rows, this query is really slow as the db must build a new table for the ORDER BY.

--
Eric Chamberlain, Founder
RF.com - http://RF.com/

galago

unread,
Feb 21, 2011, 4:10:40 PM2/21/11
to django...@googlegroups.com
I need to do this in my tagcloud.

Mikhail Korobov

unread,
Feb 21, 2011, 5:06:55 PM2/21/11
to Django users
This is the function for getting 1 item that works even if some rows
were deleted that works times faster than order_by('?') even for not-
so-big datasets at least on mysql:

def get_random_item(model, max_id=None):
if max_id is None:
max_id = model.objects.aggregate(Max('id')).values()[0]
min_id = math.ceil(max_id*random.random())
return model.objects.filter(id__gte=min_id)[0]

It assumes that almost all records are still in DB and ids are more or
less successive because otherwise the distribution won't be uniform.
If there are a lot of items in DB then it should be almost safe to
call this several times (and re-call if the same row is obtained).

Mikhail Korobov

unread,
Feb 21, 2011, 5:09:19 PM2/21/11
to Django users

Cal Leeming [Simplicity Media Ltd]

unread,
Feb 21, 2011, 5:13:31 PM2/21/11
to django...@googlegroups.com, Mikhail Korobov
Please see my previous post with the django snippet for more info on this method (as it uses this same principle).

--

Christophe Pettus

unread,
Feb 21, 2011, 7:54:05 PM2/21/11
to django...@googlegroups.com
If you know this is going to be an important part of the application, it might make sense to have a random primary key, such as a UUID.

Phlip

unread,
Feb 21, 2011, 11:28:06 PM2/21/11
to Django users
order_by('RAND()')

That might use the same seed each time.

To create, for example, a rotating home page with different content
each day, but the same after each page refresh, take today's date, MD5
hash it, and stick the hash between the () on RAND().

We used to use that with MySQL in a performance-sensitive environment,
and nobody complained about it...
Reply all
Reply to author
Forward
0 new messages