Why? Because it has worked fine! It was also elegant and performant (caveat: I don't have a sharding concern).
> You will need to consider exactly what you're doing and why you're doing it and restructure appropriately.
Yep, that's what I'm trying to figure out. Perhaps if I describe a scenario below, someone can point out a standard approach or pattern that might fit? NB: I've translated this from my domain terminology to something more readily understandable.
Let's say the problem concerns matching and then rating/ranking a huge number of menu combinations of meat + veg dishes (say, for a fast food chain doing menu planning).
The starting point is two collections of data that come from external sources (so we can't change or restructure them): Meat and Veg.
Meat: a large collection of meat dishes with attributes like cost and nutritional detail. Each dish is also flagged for either lunch or dinner.
Veg: a large collection of veg dishes with attributes like cost and nutritional detail. Each dish is also flagged for either lunch or dinner.
What we want to produce (as input for all subsequent map/reduce processing) is a MeatVegCombo collection, which is the cross product of all 'lunch' meat dishes with all 'lunch' veg dishes, and likewise for 'dinner' courses.
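To make the target concrete, here is a minimal sketch in plain Python (not MongoDB code) of the collection I'm after. The sample documents and the field names `meal`, `cost`, `meat_id` and `veg_id` are made up for illustration; my real data has different attributes.

```python
from itertools import product

# Hypothetical sample documents; field names are assumptions for illustration.
meat = [
    {"_id": "m1", "name": "beef burger", "meal": "lunch", "cost": 4.0},
    {"_id": "m2", "name": "roast chicken", "meal": "dinner", "cost": 6.5},
    {"_id": "m3", "name": "ham roll", "meal": "lunch", "cost": 3.0},
]
veg = [
    {"_id": "v1", "name": "side salad", "meal": "lunch", "cost": 2.0},
    {"_id": "v2", "name": "steamed greens", "meal": "dinner", "cost": 2.5},
    {"_id": "v3", "name": "roast potatoes", "meal": "dinner", "cost": 3.0},
]


def build_combos(meat_docs, veg_docs):
    """Cross product of Meat and Veg, restricted to pairs with the same meal flag."""
    combos = []
    for m, v in product(meat_docs, veg_docs):
        if m["meal"] == v["meal"]:
            combos.append({
                "_id": f'{m["_id"]}:{v["_id"]}',  # composite key, one per pair
                "meal": m["meal"],
                "meat_id": m["_id"],
                "veg_id": v["_id"],
                "cost": m["cost"] + v["cost"],
            })
    return combos


combos = build_combos(meat, veg)
```

With this sample data, the two lunch meats pair only with the single lunch veg, and the one dinner meat pairs with both dinner veg dishes.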
At the moment (and now broken in 2.4), MeatVegCombo is manufactured with a map/reduce on Meat that emits a varying number of Meat/Veg combos for each Meat dish (to do so, it needs to look up the matching Veg options). This basically tricks map-reduce into doing a cross join of Meat and Veg on the matching lunch/dinner flag. (In fact, this was originally done as a script that did the naive iteration to generate the MeatVegCombo collection. It was redone as map-reduce to push it all into MongoDB, and to take advantage of the replace/reduce functionality to achieve 'upsert'-type behaviour.)
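Roughly, the shape of that job is as follows. This is an in-memory Python simulation, not the actual map/reduce code and not MongoDB's API: `map_meat`, `run_map_over_meat`, the `emit` callback and the field names are all hypothetical stand-ins. The dict keyed by (meat id, veg id) plays the role of the output collection, so writing to an existing key behaves like an upsert.

```python
# Hypothetical sample documents; field names are assumptions for illustration.
meat = [
    {"_id": "m1", "meal": "lunch", "cost": 4.0},
    {"_id": "m2", "meal": "dinner", "cost": 6.5},
]
veg = [
    {"_id": "v1", "meal": "lunch", "cost": 2.0},
    {"_id": "v2", "meal": "dinner", "cost": 2.5},
    {"_id": "v3", "meal": "dinner", "cost": 3.0},
]


def map_meat(meat_doc, veg_by_meal, emit):
    """Per-Meat 'map' step: emit one combo for each Veg dish with the same meal flag.

    The lookup into veg_by_meal stands in for the map function reaching into
    the Veg collection, which is the part that fakes the join.
    """
    for veg_doc in veg_by_meal.get(meat_doc["meal"], []):
        key = (meat_doc["_id"], veg_doc["_id"])
        emit(key, {"meal": meat_doc["meal"],
                   "cost": meat_doc["cost"] + veg_doc["cost"]})


def run_map_over_meat(meat_docs, veg_docs, output):
    """Run the map step over every Meat dish, writing results into `output`."""
    veg_by_meal = {}
    for v in veg_docs:
        veg_by_meal.setdefault(v["meal"], []).append(v)

    def emit(key, value):
        # Overwrite if the key exists, insert otherwise ('upsert'-type behaviour).
        output[key] = value

    for m in meat_docs:
        map_meat(m, veg_by_meal, emit)
    return output


# Start from an output that already holds a stale combo for (m1, v1).
combo_store = {("m1", "v1"): {"meal": "lunch", "cost": 0.0}}
run_map_over_meat(meat, veg, combo_store)
size_after_first_run = len(combo_store)
run_map_over_meat(meat, veg, combo_store)  # re-running adds no duplicates
```

The point of the sketch is that each Meat dish emits a different number of combos (one for `m1`, two for `m2` here), and that re-running the job refreshes existing combos instead of duplicating them.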
So, of course we could revert to the external scripting technique, but the question is really what is 'best practice' for this kind of requirement, and in particular how can we keep it as a workload for MongoDB and not for a client script. I suspect the core issue here is that I am trying to get MongoDB to fake a join, but I want a technique that works well for processing huge data volumes.
(a) normalise the data: well, Meat and Veg actually have no relation. The MeatVegCombo I want to manufacture is essentially a fully normalised and independent collection.
(b) multi-part M/R: seems to be a chicken-and-egg proposition. Neither Meat nor Veg knows how to emit the keys required for MeatVegCombo without reference to the other.
Hope that all makes sense. Any guidance would be appreciated.