Re: any alternative to client side processing in place of what would be a RDMS join?

238 views
Skip to first unread message

William Zola

unread,
Oct 8, 2012, 9:17:09 PM10/8/12
to mongod...@googlegroups.com
Hi Brent!

For folks coming from the SQL world (such as myself) one of the hardest things to learn about MongoDB is the new style of schema design.  In the SQL world, everything goes into third normal form.  Folks come to think that there is a single right way to design their schema, because there typically is one.

There is a cognitive shift required from a relational mindset in order to properly design a MongoDB schema.  If you are still "thinking relationally", then you will miss most of the advantages of using MongoDB.

More importantly, if all you have is a hammer, everything looks like a nail.  I know that, for myself, it took me a long time to take off the "relational blinders".  While you still may have entities with one-to-one, many-to-one, or many-to-many relationships, it is no more necessary to force them into third normal form in the database than it is in your programming language.  Just as there are ways to represent those relationships within an object-oriented language such as Java or Ruby, there are equally effective ways to represent them in a document-oriented database, such as MongoDB.

Unlike the SQL world, in the MongoDB world, there is no one best schema design.  More accurately, in MongoDB schema design depends on how the application is going to access the data. 

Here are the key questions that you need to have answered in order to design a good schema for MongoDB:
 - How much data do you have?
 - What are your most common operations?  Will you be mostly inserting new data, updating existing data, or doing queries?
 - What are your most common queries?
 - What are your most common updates?
 - How many I/O operations do you expect per second?

Here's how these questions might play out if you are considering one-to-many object relationships.

In SQL you model a many-to-one relationship by creating a pair of master/detail tables with a primary key/foreign key relationship.  That's really the only way to do it.

In MongoDB, you have a number of choices: you can embed the data, you can create a linked relationship, you can duplicate and denormalize the data, or you can use a hybrid approach. The correct approach would depend on a lot of details about the use case of your application.

The key thing required to make this cognitive shift is to get enough exposure to MongoDB schemas to un-learn what you have learned.  Here are some references to get you started.

Here are some good general references on MongoDB schema design.

MongoDB presentations:
 - http://www.10gen.com/presentations/mongosf2011/schemabasics
 - http://www.10gen.com/presentations/mongosv-2011/schema-design-by-example
 - http://www.10gen.com/presentations/mongosf2011/schemascale
 - http://www.10gen.com/presentations/MongoNYC-2012/Building-a-MongoDB-Power-Chat-Server

Here is a book about MongoDB schema design that I think you would find useful:
 - http://www.manning.com/banker/ (MongoDB in Action)

Here are some sample schema designs:
 - http://docs.mongodb.org/manual/use-cases/
 - https://openshift.redhat.com/community/blogs/designing-mongodb-schemas-with-embedded-non-embedded-and-bucket-structures

Ultimately, the way that relational databases work is by splitting your data up among many tables & then joining them dynamically at query time.  MongoDB has two modes.  The more native mode involves pre-aggregating your data into logical chunks when you insert it, and then querying those pre-aggregated chunks.  MongoDB can also work relationally: you just have to do the joins on the client side in your code, rather than relying on the database server to do it for you. 

In my opinion, you often get cleaner code and faster performance by performing the join yourself: you know more about the semantics of the data and the data distribution than any RDBMS optimizer can ever know.

I hope you found this useful.  Let me know if you have further questions.

 -William





On Sunday, October 7, 2012 3:06:38 AM UTC-7, Brent Gracey wrote:
If a data set & use cases are focused on "relational" work - would that reduce the fit of Mongodb for the project? For example a "friendship" data set where questions like please get me the detailed info of users (held in person_info) who interacted with each other (held in person interaction) who bought product X ( held in product and purchase_history collections) at event Y. 

Would it be possible to use a map reduce function which operates over multiple tables?

I guess foursquare have some way of answering those sorts of questions efficiently with  Mongo!

(this is a shortened version of half of post https://groups.google.com/forum/?fromgroups=#!searchin/mongodb-user/guidance/mongodb-user/dDEynRftg8E/IluWzJcABwQJ - maybe the length of that post was too much!)

Regards,
Brent 

willia...@10gen.com

unread,
Oct 9, 2012, 9:52:15 AM10/9/12
to info-...@10gen.com, mongod...@googlegroups.com, brent...@gmail.com
Hi Brent!

On Tuesday, October 9, 2012 6:04:21 AM UTC-7, Brent Gracey wrote:
Hi - thank you very much for you input - I am trying to take of the blinders!

At the moment I am working on a new requirement which my data is not pre aggregted to answer - are the below my full set of options, or am I missing a trick or two?


Unfortunately, you haven't given me enough information to answer the question.  There's no one "right" generalized solution for a master/detail relationship in MongoDB.  

I'd need to know the details of your data model and the semantics of what you're trying to accomplish in order to make suggestions for your particular use case.  Are you doing targeted queries to satisfy a web page?  Or are you trying to do analytics to roll-up a large data set that's spread across two collections?  The tools in the toolbox and the set of options are highly dependent on which one you're doing.

Once I have more information I can give you a better answer.  

 -William 

William Zola

unread,
Oct 10, 2012, 5:32:35 PM10/10/12
to mongod...@googlegroups.com, info-...@10gen.com, brent...@gmail.com

Hi Brent!

Here's a quick overview of the tools in the toolbox for many-to-one relationships:

 - Embedding instead of referencing
 - Placing references to the detail document in the master document
 - Placing a reference to the master document in the detail document.  (AKA a "Parent Pointer" tree.)  This lets you get all of the details for a single master document in a single query.
 - Combining both master->detail and detail->master references
 - Denormalizing the data so that some of the needed information from the detail documents are present in both the master and detail documents.  (This works best if writes are infrequent.)

See here for more references on "link-vs-embed":
 - http://stackoverflow.com/questions/5373198/a-simple-mongodb-question-embed-or-reference
 - http://www.slideshare.net/kbanker/mongodb-schema-design-mongofr

If you have a "hard" real-time (as in 'up-to-the-second') requirement, then you have to do database queries at the time that the application requests them.  If it's OK for data to be stale, then you can run batch jobs to do the pre-aggregation, and then perform the reporting queries off of them.  (Think "Data Mart" and "scheduled ETL".)


MongoDB has a reasonably powerful MapReduce functionality: you can often use multiple mMapReduce jobs to join data from multiple collections.  While not suitable for hard real-time requirements, it can be useful for

See here for details:
 - http://stackoverflow.com/questions/3837394/mongodb-map-reduce-over-multiple-collections
 - http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/

For more example of what you can do with MapReduce, see here:
 - http://cookbook.mongodb.org/patterns/pivot/
 - http://cookbook.mongodb.org/index.html

You can also use the Hadoop adaptor to do MapReduce jobs.  The Hadoop "dependent jobs" feature can allow you to read from multiple collections in the context of a single Hadoop job.

See here for details:
 - http://www.10gen.com/presentations/webinar/mongodb-hadoop-taming-elephant-room
 - http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html

Finally, if your document counts are only in the thousands or tens of thousands, then I'd suggest staying with the simple strategy of doing joins in your client application.  The database would have to do these anyway: there's no extra work involved, just a matter of moving it off of the database server.


Let me know if you have further questions.

 -William



On Tuesday, October 9, 2012 1:50:37 PM UTC-7, Brent Gracey wrote:
Hi,

I would like to get my head around as many of the tools in the toolbox as possible, so I can make an informed choice of the best fit for my current requirements and understand what road map I should have in mind for as the data set grows. So if there is an approach that doesn't fit under one of the three methods I listed, it would be great if you could highlight that to me.

Some more details on my current work is its a report, so its is aggregating records, and it needs to be available in real time, so users can view a summary of the event as it progresses through the day. At the moment the record sets are in the 1000s on the two collections, and the relationship is 1 to M but a low M, so 1,2 detail records is the most common case.

Thanks again for your help

--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb

Brent Gracey

unread,
Oct 18, 2012, 6:19:54 PM10/18/12
to mongod...@googlegroups.com
Hi - thanks for the comprehensive reply - I'm still working through most of the links you have suggested. One thing I would like to clarify - for your comment "suggest staying with the simple strategy of doing joins in your client application.  The database would have to do these anyway:"

Yes the database application would have to join - but assuming it uses an index, isn't the database operation not much more efficient than 

for all x in array X
    Look up Y[x]

My assumptions for this are:

There is overhead in maintaining DB indexes - this has already been "expended effort" at the time of getting data
Getting all the data from the DB to the client application loses the index info
Repeating the effort of generating an index (not sure how many languages would support that - any Scala specific thoughts welcome) is duplicated effort
Not doing something to replicate index functionality will make the look up in Y much less efficient.

So I can't see how client side processing can be as efficient as getting the db to do it - and as one scales - you either get to a point of having to do a large data restructure or live with the extra overhead.

Main point of the post if for people to point out if any of my assumptions are incorrect.

Thanks

William Zola

unread,
Oct 29, 2012, 1:40:26 PM10/29/12
to mongod...@googlegroups.com

Hi Brent!

Sorry for taking so long to get back to you. 

The answer to your question is: the database is not going to be any more efficient than doing the queries client side -- IF you have designed your schema properly.

1) Let's begin by looking at how you'd do this in SQL:

CREATE TABLE dept (deptno NUMBER CONSTRAINT pk_dept PRIMARY KEY,
        dname VARCHAR2(30) );
CREATE TABLE emp (
        empid NUMBER CONSTRAINT pk_emp PRIMARY KEY,
        ename VARCHAR(20),
        deptno NUMER CONSTRAINT fk_deptno FOREIGN KEY REFERENCES dept(deptno) );

First off: this creates two indexes "behind your back": one on dept.deptno, one on emp.empid.

Second: consider how you would query these tables:

 SELECT empid, ename FROM emp, dept
    WHERE
        dept.dname = 'SALES'
    JOIN
        dept.deptno = emp.deptno
    ORDER BY ename;

Note that unless you index 'emp.deptno', this query is going to be slow (since it will otherwise perform a full table scan on 'emp').  Unless you index 'dept.dname', it will have to perform a full table scan on 'dept'.

You're now at no less than FOUR indexes.

This query is going to force the RDBMS to do the following:
  A) Read the 'dept.dname' index to find the location of the row(s) where "dname = 'SALES'"
  B) Fetch the matching row(s) from the 'dept' table, and save off the set of 'deptno' values
  C) Read the 'emp.deptno' index to find the locations of the matching rows in the 'emp' table
  D) Fetch the matching rows from the 'emp' table
  E) Sort the result set in-memory

2) Now, let's consider how to do this in MongoDB.  For legibility, I'm going to use integers for the _id field, but everything I will do will work with ObjectIDs as well:

> db.dept.save( {_id: 1, name: 'Sales'} );
> db.dept.save( {_id: 2, name: 'Marketing'} );
> db.dept.save( {_id: 3, name: 'Support'} );

> db.emp.save( { _id: 1, name: 'Ben', dept: 1 } );
> db.emp.save( { _id: 2, name: 'William', dept: 3 } );
> db.emp.save( { _id: 3, name: 'Jenna', dept: 3 } );
> db.emp.save( { _id: 4, name: 'Steven', dept: 3 } );

To make this efficient, I'll create a secondary index on the 'dept' field in the 'emp' collection, and the 'name' field in the 'dept' collection:

> db.emp.ensureIndex( dept: 1 );
> db.dept.ensureIndex( name: 1 );

By default, there are already indexes on "dept._id" and "emp._id".

To find all of the employees in the "Support" department, I need to do two queries:

    > result = db.dept.findOne({name: 'Support'}, {_id:1} );
    { "_id" : 3 }
    > desired_dept = result["_id"];
    3
    > db.emp.find( { dept: desired_dept }).sort({name:1}).pretty();
    { "_id" : 3, "name" : "Jenna", "dept" : 3 }
    { "_id" : 4, "name" : "Steven", "dept" : 3 }
    { "_id" : 2, "name" : "William", "dept" : 3 }
   
In terms of the actual work done, there are the same number of indexes involved, and the same number of operations involved as with the SQL example.  The only difference is that I had to build one portion of the query in the client side instead of on the server side.


Let me know if you have further questions.

 -William



Brent Gracey

unread,
Nov 13, 2012, 6:10:15 AM11/13/12
to mongod...@googlegroups.com
Hi William - thanks again for the detailed reply

I can see how doing this for a query that requires information for one record is efficient, I however need to do a look up for multiple records.

If I pull all the detail records in one query - then the client side "detail" look up will be inefficient as there is no "index" concept in the client side code.

If I do a query to the db for each detail record then I get 100s of db connections which is also a big overhead.

So my understanding is I would need to update my data structure to address this efficiently.

Sam Millman

unread,
Nov 13, 2012, 7:31:27 AM11/13/12
to mongod...@googlegroups.com
"If I do a query to the db for each detail record then I get 100s of db connections which is also a big overhead."

You only get one, and it should be a persistent one.


"If I pull all the detail records in one query - then the client side "detail" look up will be inefficient as there is no "index" concept in the client side code."

Indexes on JOINs are a questionable matter, I have seen many many times in SQL where a simple JOIN has resulted in the index not being used for some unknown reason even if that JOIN is specifically defined within the internals of MySQL (which is what I was using at the time) using pk and fk format, most specifically on non-standard ranges of data. You have also gotta consider that JOINs can become a pain to scale in certain situations.

You have got to consider that in MySQL the client side "detail" lookup for each record would be slow because of what it takes to make a SQL query and result set for every row instead of a large JOINed result set and most likely not because of the "index" concept.

MongoDB is quite different in how it queries, it is used to querying in this way.

There is one caveat you will get into soon however, which I think needs fixing, which is that MongoDB has no server-side sub selects, so if you wanna do an $in based on another query you must pull the data over the wire and operate on it client side even if you don't need those initial rows, which is a little painful.


--

Sam Millman

unread,
Nov 13, 2012, 7:32:08 AM11/13/12
to mongod...@googlegroups.com
"You have got to consider that in MySQL"

Sorry I mean any SQL tech there.

Brent Gracey

unread,
Nov 13, 2012, 8:39:19 AM11/13/12
to mongod...@googlegroups.com
Thanks for the reply,

I know very well that joins in SQL can have their own nuances, but feel that is a separate discussion. And yes, connection pooling will reduce the overhead of multiple db queries, but it is still an overhead.

My current situation is I am doing multiple detail queries - and the report is taking about 30 seconds to complete for  ~200 records, and my boss wants it to go faster, and I don't know how to that without telling him we need to re-architect our data structure.

I haven't got the exact breakdown of what is taking the time - so can't tell you how much "overhead" the multiple db connections is introducing, but I'll work on some logging to break down exactly where time is being spend in the report.

Thanks

Sam Millman

unread,
Nov 13, 2012, 9:22:40 AM11/13/12
to mongod...@googlegroups.com
At 200 records it should be giving 0 overhead.

You are right that the initial post was a little tl;dr. I definitely missed it.

So let's start fresh: if you turn on logging for MongoDB and check the log for slow queries do you see any?

Brent Gracey

unread,
Nov 14, 2012, 5:51:17 AM11/14/12
to mongod...@googlegroups.com
Hi - so I done some logging to work out what is taking the time

First, a big correction - I'm actually hitting the db 2000 times, not 200, I forgot that 1800 records don't actually do anything in this report, but I have to go look for a detailed record for them before I know that.

Logging from my application code

09:27:31.409 [qtp741338313-114 - /man/json/counts] DEBUG bootstrap.liftweb.RequestAnalyzer - Total request time on /json/counts: 0 ms
09:27:31.421 [qtp741338313-114 - /man/json/counts] DEBUG c.b.sojo_manager.api.RestAPI - Starting to generate report 
09:27:32.446 [qtp741338313-114 - /man/json/counts] DEBUG c.b.sojo_manager.api.RestAPI - Got all users from the database -> This is the first select that returns 2000 users
09:27:41.531 [qtp741338313-114 - /man/json/counts] DEBUG c.b.sojo_manager.api.RestAPI - Have looped through all users and checked which ones have devices with connections -> This is after looping through all 2000 and doing a detail lookup in mongo, which takes about 10 seconds. 
09:27:41.719 [qtp741338313-114 - /man/json/counts] DEBUG c.b.sojo_manager.api.RestAPI - Have created the from to mapping for the force diagram 
09:27:41.719 [qtp741338313-114 - /man/json/counts] DEBUG bootstrap.liftweb.RequestAnalyzer - Total request time on /json/counts: 10301 ms

I enabled  db.setProfilingLevel(2) - and after capturing commands for just this report

> db.system.profile.find().count()
2845

Then I ran a summary - to just group everything and give a total "db" run time - which comes in at 1794 milli sec, so I don't think the main part of the 10 seconds is db execution. I am working on refining the application logging to get a better view of what components of the query are taking the remaining 8 seconds.
 
db.system.profile.group( 
{ cond: {   } 
, key: {client: "127.0.0.1"} 
, initial: {count: 0, total_time:0}  
, reduce: function(doc, out){ out.count++; out.total_time+=doc.millis } 
, finalize: function(out){ out.avg_time = out.total_time / out.count } 
} );

OUTPUT

[
{
"client" : "127.0.0.1",
"count" : 2845,
"total_time" : 1794,
"avg_time" : 0.6305799648506151
}
]

Sam Millman

unread,
Nov 14, 2012, 9:57:42 AM11/14/12
to mongod...@googlegroups.com
1794ms still sounds quite slow for 2000, especially since that figure comes from the database; it, however, appears that the main bottleneck is the app. Let's see what you find there :)

William Zola

unread,
Nov 14, 2012, 2:24:26 PM11/14/12
to mongod...@googlegroups.com
Hi Brent!

I'm not sure I understand your issues here.  You're right that doing the lookup on the client side is expensive.  I don't understand your concerns about running multiple queries, though. 

Here is some sample Python code: it only uses one database connection, and a maximum of two database cursors:

def setup() :
    port = sys.argv[1]
    conn = Connection('localhost', int(port) )
    return conn.test

def query_dept_emp(db) :
    dept = db.dept
    emp = db.emp
    for ddoc in dept.find().sort("name",1) :
        print "Employees in", ddoc["name"]
        for edoc in emp.find( {"dept": ddoc["_id"] } ).sort("name",1) :
            print "  ", edoc["name"]

def main() :
    if len(sys.argv) != 2:
        print "usage: ", sys.argv[0], " <port number>"
        exit(1)

    db = setup()
    query_dept_emp(db)

if __name__ == "__main__":
    main()
 

As I said before: this is no different than what an RDBMS engine would do.  The only difference is that you're running the query on the client side, instead of having the server run it for you invisibly on the server side.

If you're running into a separate performance issue, then you'll need to diagnose that. 

It's not clear from your application logs what language you're using, or how you've structured your code.  If it doesn't look (roughly) like the sample above, then you're probably doing something inefficient.


Let me know if you have further questions.

 -William

Brent Gracey

unread,
Nov 16, 2012, 7:22:26 AM11/16/12
to mongod...@googlegroups.com
Hi,

Its a scala application  - and I'm a few layers of abstraction above the actual db connection. But it comes down to /mongo-java-driver-2.6.5.jar!/com/mongodb/DBTCPConnector.class and if this is the right documentation http://www.mongodb.org/display/DOCS/Java+Driver+Concurrency "The Mongo object maintains an internal pool of connections to the database (default pool size of 10)"

William Zola

unread,
Nov 26, 2012, 2:49:49 PM11/26/12
to mongod...@googlegroups.com

Hi Brent!

I don't know Scala at all, but I believe your concern to be unfounded.

The Scala driver, like the Java driver, does *not* use a new connection for each query.  If you've got a connection from the MongoConnection() method (or whatever your framework wraps around that structure), then you can have as many queries as you like multiplexed on that single connection.

In my code sample from before:


def query_dept_emp(db) :
    dept = db.dept
    emp = db.emp
    for ddoc in dept.find().sort("name",1) :
        print "Employees in", ddoc["name"]
        for edoc in emp.find( {"dept": ddoc["_id"] } ).sort("name",1) :
            print "  ", edoc["name"]


This code opens two cursors but only uses a single connection.  The same would be true of the equivalent code in Scala or Java.  The dept.find() opens a new cursor using the existing connection, and the emp.find() opens a different cursor using the same connection.  Queries to the two different cursors can be interleaved on a single connection without any problem.


Let me know if you have further questions.

 -William



Reply all
Reply to author
Forward
0 new messages