Querying random from collections with various sizes

Attila Bukor

Jul 10, 2014, 4:55:59 AM
to mongod...@googlegroups.com
Hey all,

Let's say I have a site with multiple sub-domains where Contents are
generated, tens to hundreds of thousands of them every day. Every
Content has an Image. The Images are stored on the FS, but the path,
metadata, the sub-domain they belong to (they're thematic), etc.
are in MongoDB, as are the Contents.

The problem is that when a Content is generated, we need to randomly
select an Image from the collection to be associated with the Content.
Multiple Contents can use the same Image. I know there is no ".random()"
implementation so far (#SERVER-533), and I understand it's far from trivial
to implement.

I've already looked into some possible solutions, of course, and even tried
the first solution in the cookbook[1], but, as is pretty much
expected from the algorithm, it's far from random. I tested it with 25
Images (I know that's practically nothing, but I can make a fairly precise
projection from it) and 25 Contents, and only 2 Images were used.
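
To illustrate the kind of selection the cookbook pattern performs, here is a minimal pure-Python sketch, not actual MongoDB code. It assumes every document carries a stored random value in [0, 1) and that a query returns the document with the smallest stored value that is >= a fresh random draw. The function names and the fallback to the last document are assumptions made for this sketch only.

```python
import bisect
import random
from collections import Counter


def pick_by_rand_field(sorted_rands, rng):
    """Return the index of the first stored value >= a fresh random draw."""
    r = rng.random()
    i = bisect.bisect_left(sorted_rands, r)
    # Assumption of this sketch: when no stored value is >= r,
    # fall back to the document with the largest stored value.
    return i if i < len(sorted_rands) else len(sorted_rands) - 1


def selection_counts(num_images, num_picks, seed=0):
    """Simulate num_picks selections over num_images stored random values."""
    rng = random.Random(seed)
    sorted_rands = sorted(rng.random() for _ in range(num_images))
    return Counter(
        pick_by_rand_field(sorted_rands, rng) for _ in range(num_picks)
    )


if __name__ == "__main__":
    counts = selection_counts(num_images=25, num_picks=25)
    print(len(counts), "distinct images used out of 25")
```

In this sketch, an image's chance of being picked equals the gap between its stored value and the previous one, so images whose stored values happen to sit close together are picked less often than images that follow a large gap.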

The skip-limit method seems far better in terms of distribution; however,
it's *really* slow. According to Leonid Beschastny's benchmark[2] with
5.5M documents in the collection, it's roughly 800 times slower than the
Cookbook #1 method. Of course, we usually won't work with such
large numbers (maybe for a select few sub-domains, but even that is unlikely).
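
For reference, the skip-limit approach could look roughly like the following. This is a sketch that assumes a PyMongo-style collection object exposing `count_documents()` and `find().skip().limit()`; the function name and the `rng` parameter are hypothetical.

```python
import random


def random_document_skip_limit(collection, query=None, rng=random):
    """Pick one matching document uniformly by skipping a random offset."""
    query = query or {}
    count = collection.count_documents(query)
    if count == 0:
        return None
    offset = rng.randrange(count)
    # The server still has to walk past `offset` documents before it can
    # return one, which is why this gets slow on large collections.
    cursor = collection.find(query).skip(offset).limit(1)
    return next(iter(cursor), None)
```

Every offset in `range(count)` is equally likely, so the distribution is uniform as long as the collection does not change between the count and the find.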

I figured that if I want to use the skip-limit method, which will probably
be the usual case, I need the count. If count > n, I will switch
to the second method in the Stack Overflow answer[2], where n is a number above
which the distribution of the second algorithm is fairly even and the
execution time of the first algorithm is significantly higher than that of the
second one.
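
The switching logic itself can be sketched as a small dispatcher. All names here are hypothetical: `count_fn`, `skip_limit_fn` and `rand_field_fn` stand in for whatever functions actually run the count and the two queries, and `threshold` is the magic number n.

```python
def make_hybrid_picker(count_fn, skip_limit_fn, rand_field_fn, threshold):
    """Build a picker that chooses a strategy based on collection size."""

    def pick():
        count = count_fn()
        if count == 0:
            return None
        if count > threshold:
            # Large collection: use the faster random-field lookup.
            return rand_field_fn()
        # Small collection: use skip-limit and reuse the count we already have.
        return skip_limit_fn(count)

    return pick
```

Passing the count into `skip_limit_fn` avoids counting twice, since the count is needed for the threshold check anyway.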

There are two questions regarding this:
- Is this a good solution and if not, what's a better one?
- What's that "n" magic number I should use?

[1] http://cookbook.mongodb.org/patterns/random-attribute/
[2] http://stackoverflow.com/a/13524742/898545

--
Regards,
r1pp3rj4ck

Will Berkeley

Jul 11, 2014, 11:19:36 AM
to mongod...@googlegroups.com, r1pp3...@w4it.eu
Hi Attila. Make sure you create the index on rand as it says in the cookbook. Otherwise, the procedure doesn't work. You also need to have a lot of images in order to get a good distribution. When I tested the cookbook procedure for 1000 images and 1 million contents, I had a nice, nearly-uniform distribution of images over contents.

I don't really understand the idea behind matching up a random image with a content, however. What is the point of doing this?

-Will

Attila Bukor

Jul 14, 2014, 5:43:28 AM
to mongod...@googlegroups.com
Hi Will,

Thanks for the tips. Of course, I didn't miss the index on the random
key. I actually managed to combine the two solutions by writing a
little benchmarking script and finding a magic number above which it
switches to the cookbook procedure.

Please disregard my random image situation, it's just a clearly bad
example :)

Cheers,
r1pp3rj4ck