Re: [mongodb-user] Finding Interesting People / Nested Map Reduce Mayhem?


Jan Riechers

Sep 12, 2012, 6:19:37 AM
to mongod...@googlegroups.com
On 11.09.2012 19:24, pctj101 wrote:
> I'm trying to analyze a dataset of People. I'd like to find interesting
> people.
>
> I don't need you guys to write the entire solution... I'd like to just
> ask a few questions:
>
> Here's the situation:
>
> For each person
> Find the average salary of my "peers" defined as:
> All people that are:
> Within +/- 5 years of age
> Within +/- 10 pounds
>
> Only include the Person in the "ultimate" result set:
> If that average salary of my "peers" is more than $X different than
> this person
> And include the "set" of peers that this person got compared to
>
>
[TRIMMED]

Hi,

I am not sure which language you are using to write your
application, but instead of a map/reduce you could also use the
"$lte" and "$gte" query operators to select ranges.

For example, in Python:
currentMoney = 100  # pounds
currentAge = 25  # years
{'money': {'$gte': currentMoney, '$lte': currentMoney + 10},
 'age': {'$gte': currentAge, '$lte': currentAge + 5}}

What this does: it looks for documents whose money and age values
are greater than or equal to the current values, but also less than or
equal to the current value +10 for money and +5 for age (your example
range values).

I think this will do the trick for you without even making use of
map/reduce, just by using the query magic.
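
A minimal sketch of the same idea for the +/- ranges in the original question, written as a plain Python helper that only builds the filter document. The field names "age" and "weight" and the helper name peer_query are assumptions for illustration; the resulting dict would be passed to whatever find call your driver provides.

```python
def peer_query(age, weight, age_range=5, weight_range=10):
    """Build a MongoDB filter document matching one person's peers.

    Field names "age" and "weight" are assumed, not taken from a real schema.
    """
    return {
        # Both bounds for a field go inside that field's sub-document.
        "age": {"$gte": age - age_range, "$lte": age + age_range},
        "weight": {"$gte": weight - weight_range, "$lte": weight + weight_range},
    }


query = peer_query(age=25, weight=100)
print(query)
```

Note that this filter would also match the person it was built from, so that document may need to be excluded separately.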

Jenna deBoisblanc

Sep 12, 2012, 2:35:34 PM
to mongodb-user
Hi,

You have an interesting problem to solve. Jan makes a good point:
using a query to filter out non-matching peers will significantly
improve MR performance and reduce its complexity. If possible, I would
recommend using the aggregation framework (available in MongoDB version
2.2) for this task. MR uses JavaScript, and since MongoDB uses
SpiderMonkey, a single-threaded JavaScript engine, MR is a slow,
blocking operation. As such, I would not recommend running multiple MR
commands back-to-back for each user. It is also possible to specify a
query in your aggregation command:
Agg framework: http://docs.mongodb.org/manual/applications/aggregation/

It may also be possible to do the calculation for all users with a
single aggregation command. If you are still searching for a solution,
could you provide a sample document to help us tailor our response to
your use case?

"Binning" or "bucketing" the users based on age and weight could
simplify the problem, and it might allow you to accomplish the
aggregation with a single command. An example of "binning":

{name: "Bob", age: 21, weight: 100}
{name: "Jenna", age: 23, weight: 108}

{name: "Gene", age: 24, weight: 120}
{name: "Susan", age: 22, weight: 127}

{name: "Tom", age: 25, weight: 101}
{name: "Ellen", age: 26, weight: 102}

Bin age by 5, weight by 10:
Bob and Jenna are peers, defined as users in ages 20 - 24, weight 100 - 109.
Gene and Susan are peers, defined as users in ages 20 - 24, weight 120 - 129.
Tom and Ellen are peers, defined as users in ages 25 - 29, weight 100 - 109.
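
A rough sketch of the binning above in plain Python, followed by what an equivalent aggregation pipeline might look like. The helper names, the "salary" field, and the pipeline itself are assumptions for illustration; the pipeline is written as a Python list of stage documents and has not been run against a server here.

```python
from collections import defaultdict


def bin_floor(value, width):
    """Round value down to the start of its bin, e.g. 23 -> 20 for width 5."""
    return value - value % width


def group_peers(people, age_width=5, weight_width=10):
    """Group names by (age bin start, weight bin start)."""
    bins = defaultdict(list)
    for person in people:
        key = (bin_floor(person["age"], age_width),
               bin_floor(person["weight"], weight_width))
        bins[key].append(person["name"])
    return dict(bins)


people = [
    {"name": "Bob", "age": 21, "weight": 100},
    {"name": "Jenna", "age": 23, "weight": 108},
    {"name": "Gene", "age": 24, "weight": 120},
    {"name": "Susan", "age": 22, "weight": 127},
    {"name": "Tom", "age": 25, "weight": 101},
    {"name": "Ellen", "age": 26, "weight": 102},
]
groups = group_peers(people)
print(groups)

# The same binning expressed as a single $group stage (assumed field names;
# "salary" does not appear in the sample documents above).
PIPELINE = [
    {"$group": {
        "_id": {
            "age_bin": {"$subtract": ["$age", {"$mod": ["$age", 5]}]},
            "weight_bin": {"$subtract": ["$weight", {"$mod": ["$weight", 10]}]},
        },
        "avg_salary": {"$avg": "$salary"},
        "peers": {"$push": "$name"},
    }},
]
```

The trade-off with fixed bins is that someone at the edge of a bin (say, age 24) is not compared with a 25-year-old, which a true +/- 5 window would include.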

Hope this helps! Let us know if you need any additional help.

pctj101

Sep 15, 2012, 6:52:26 AM
to mongod...@googlegroups.com
So okay... thanks for the tips so far.  In fact I am using the $gt/$lt operators and aggregation.

The problem is that I feel like I have to run this "query" for each row in my collection.

For Bob, map-reduce/aggregate/whatever { :age => {$gt => bobs.age - 5} .... }
For Jane, map-reduce/aggregate/whatever { :age => {$gt => janes.age - 5} .... }
For Andy, map-reduce/aggregate/whatever { :age => {$gt => andys.age - 5} .... }
... repeat this n-times ...
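
To pin down the expected result before choosing a query strategy, the per-person loop above can be sketched on an in-memory list in plain Python. The salary values, the threshold, and the choice to exclude a person from their own peer set are all assumptions for illustration.

```python
def peers_of(person, people, age_range=5, weight_range=10):
    """Return everyone else within the +/- age and weight windows."""
    return [
        other for other in people
        if other is not person  # assumption: a person is not their own peer
        and abs(other["age"] - person["age"]) <= age_range
        and abs(other["weight"] - person["weight"]) <= weight_range
    ]


def interesting_people(people, threshold):
    """Keep people whose salary differs from their peers' average by more than threshold."""
    results = []
    for person in people:
        peers = peers_of(person, people)
        if not peers:
            continue  # no peers, so no average to compare against
        peer_avg = sum(p["salary"] for p in peers) / len(peers)
        if abs(peer_avg - person["salary"]) > threshold:
            results.append({
                "person": person["name"],
                "peer_avg": peer_avg,
                "peers": [p["name"] for p in peers],
            })
    return results


# Hypothetical data; the salaries are made up for this sketch.
people = [
    {"name": "Bob", "age": 21, "weight": 100, "salary": 50000},
    {"name": "Jenna", "age": 23, "weight": 108, "salary": 52000},
    {"name": "Tom", "age": 25, "weight": 101, "salary": 90000},
    {"name": "Gene", "age": 40, "weight": 180, "salary": 60000},
]
result = interesting_people(people, threshold=20000)
print(result)
```

This is quadratic in the number of people, which is the same cost concern as issuing one query per row.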


Hey just as a quick question.. if I ran a "for" / "foreach" loop on a result set.. like:

var peoplez = db.people.find()
peoplez.forEach(function(person) { /* whatever */ })

Does that loop occur entirely on the mongodb process? or does mongocli have to fetch all the records back into the client side to run the loop?

If the loop runs entirely on the serverside without having to transfer the entire people collection to the client side, I suppose that would be an alternative solution?

Thanks!

Jeff


On Wednesday, September 12, 2012 2:24:15 AM UTC+10, pctj101 wrote:
I'm trying to analyze a dataset of People.  I'd like to find interesting people.

I don't need you guys to write the entire solution... I'd like to just ask a few questions:

Here's the situation:

For each person
  Find the average salary of my "peers" defined as: 
        All people that are:
            Within +/- 5 years of age
            Within +/- 10 pounds 
            

Only include the Person in the "ultimate" result set:
    If that average salary of my "peers" is more than $X different than this person
    And include the "set" of peers that this person got compared to


I can use map-reduce to get the average of a set (for just 1 person's peers).  No problem!

    * In this case, because the set of "peers" is different for each "Person", do I:
        1a) Write a more crazy map/reduce function that... 
            1b) Perhaps even a map/reduce function that calls another map/reduce function?
                M-R: (Key = Person, Value = ( M-R ( Key = Uhhhh?, Query: Peers ) ) <- bad pseudocode but.. general idea... loop in a loop
        
        1c) Run map-reduce N times, once for each person, each with a different "query" filter to pick my peers?

            1d) If running N times, do you see "a problem" with using a forEach loop on the mongodb server? (rather than having a Perl/PHP script run 10000 M-R calls, just have the serverside Javascript run the loop)


Just general thoughts are fine... If you see anything like "OMG don't do that!" or "Here's a great idea" please let me know :)

Thanks guys for your thoughts!
    
        

Jenna deBoisblanc

Sep 18, 2012, 12:57:28 PM
to mongod...@googlegroups.com
Hi Jeff,

Do you need the ranges to be +/- 5 years of the particular user, or can you "bin" the users into groups (see my previous comment for an example of binning)? Binning will drastically simplify the aggregation command.

>Does that loop occur entirely on the mongodb process? or does mongocli have to fetch all the records back into the client side to run the loop?

The client does not fetch all of the records at once; instead, the server returns documents in batches.  The cursor iterates through the batch client-side, and if additional records are required to meet the query, a getMore() command is issued to the server: