Hey all,
Let's say I have a site with multiple sub-domains where Contents are
generated, tens to hundreds of thousands of them every day. Every
Content has an Image. The Images are stored on the FS, but their paths,
metadata, the sub-domain they belong to (they're thematic), etc. are
in MongoDB, as are the Contents.
The problem is that when a Content is generated, we need to randomly
select an Image from the collection to be associated with the Content.
Multiple Contents can use the same Image. I know there is no ".random()"
implementation so far (#SERVER-533), and I understand it's far from
trivial to implement.
I've already looked into some possible solutions, of course. I even
tried the first solution in the cookbook[1], but, as is pretty much
expected from the algorithm, it's far from random. I tested it with 25
Images (I know that's practically nothing, but I can make a fairly
precise projection from it) and 25 Contents, and only 2 Images were used.
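A minimal sketch of one reason the cookbook pattern is skewed on small
collections, assuming each document stores a uniform random key and a
lookup returns the document with the smallest key at or above a fresh
random number (falling back to the highest key when nothing is above
it). The function name and the fallback rule are assumptions made for
this illustration, and it does not try to reproduce the exact 2-out-of-25
result above. In this model, a document is chosen with probability equal
to the gap between its key and the previous key, so uneven gaps mean
uneven usage.

```python
import bisect
import random
from collections import Counter


def simulate_random_attribute(num_docs, num_picks, seed=None):
    """Simulate picks with the random-attribute pattern (hypothetical helper)."""
    rng = random.Random(seed)
    # Each document gets its random key once, at insert time.
    keys = sorted(rng.random() for _ in range(num_docs))
    hits = Counter()
    for _ in range(num_picks):
        r = rng.random()
        # Stand-in for "find one document with random >= r, lowest key first".
        i = bisect.bisect_left(keys, r)
        if i == num_docs:
            # Assumed fallback: nothing at or above r, so take the highest key.
            i = num_docs - 1
        hits[i] += 1
    return keys, hits


if __name__ == "__main__":
    keys, hits = simulate_random_attribute(num_docs=25, num_picks=25, seed=1)
    print(len(hits), "distinct documents used out of", len(keys))
```

Running it with different seeds gives a rough feel for how many distinct
documents get used when both the collection and the number of picks are
small.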
The skip-limit method seems far better in terms of distribution, but
it's *really* slow. According to Leonid Beschastny's benchmark[2] with
5.5M documents in the collection, it's roughly 800 times slower than the
Cookbook #1 example method. Of course, we usually won't work with such
large numbers (maybe for a select few sub-domains, but even that is
unlikely).
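For reference, a minimal sketch of the skip-limit method in Python,
assuming `images` behaves like a PyMongo collection (it only needs
`count_documents` and `find(...).skip(...).limit(...)`). The function
name and the `rng` parameter are hypothetical, added for illustration.

```python
import random


def random_image_skip_limit(images, query, rng=random):
    """Pick one matching document uniformly via count + skip + limit."""
    # `images` is assumed to be a PyMongo-style collection object.
    count = images.count_documents(query)
    if count == 0:
        return None
    # Every offset in [0, count) is equally likely, so every match is too.
    offset = rng.randrange(count)
    # The server still has to walk past `offset` documents before returning
    # one, which is where the cost on large collections comes from.
    cursor = images.find(query).skip(offset).limit(1)
    # Returns None if the collection shrank between the count and the find.
    return next(iter(cursor), None)
```

A call would look like
`random_image_skip_limit(db.images, {"subdomain": "cars"})`, where the
collection and field names are made up for the example.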
I figured that to use the skip-limit method I need the count anyway, and
skip-limit will probably be the method used most of the time. In case
count > n, I will switch to the second method in the Stack Overflow
answer[2], where n is a number above which the distribution of the
second algorithm is fairly even and the execution time of the first
algorithm (skip-limit) is significantly higher than that of the second
one.
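Roughly, the plan could be sketched like this in Python. This is only a
sketch under assumptions: `images` is a PyMongo-style collection, the
"second method" is taken to be a lookup on a stored random field, and
the `subdomain` and `random` field names, the `threshold` parameter
(the "n" above) and the function name are all hypothetical. No value
for the threshold is implied.

```python
import random


def pick_image(images, subdomain, threshold, rng=random):
    """Use skip-limit for small sets, a random-field lookup for large ones."""
    query = {"subdomain": subdomain}  # hypothetical field name
    count = images.count_documents(query)
    if count == 0:
        return None
    if count <= threshold:
        # Small set: even distribution, and the skip is still cheap.
        offset = rng.randrange(count)
        cursor = images.find(query).skip(offset).limit(1)
        return next(iter(cursor), None)
    # Large set: look up the nearest stored random key at or above r.
    r = rng.random()
    doc = images.find_one({**query, "random": {"$gte": r}},
                          sort=[("random", 1)])
    if doc is None:
        # Nothing at or above r: fall back to the nearest key below it.
        doc = images.find_one({**query, "random": {"$lte": r}},
                              sort=[("random", -1)])
    return doc
```

The count is fetched once and reused for both the decision and the
skip offset, so the hybrid costs no extra round trip compared to plain
skip-limit.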
There are two questions regarding this:
- Is this a good solution and if not, what's a better one?
- What's that "n" magic number I should use?
[1] http://cookbook.mongodb.org/patterns/random-attribute/
[2] http://stackoverflow.com/a/13524742/898545
--
Regards,
r1pp3rj4ck