If a data set & use cases are focused on "relational" work - would that reduce the fit of Mongodb for the project? For example a "friendship" data set where questions like please get me the detailed info of users (held in person_info) who interacted with each other (held in person interaction) who bought product X ( held in product and purchase_history collections) at event Y.
Would it be possible to use a map reduce function which operates over multiple tables?
I guess foursquare have some way of answering those sorts of questions efficiently with Mongo!
For folks coming from the SQL world (such as myself) one of the hardest things to learn about MongoDB is the new style of schema design. In the SQL world, everything goes into third normal form. Folks come to think that there is a single right way to design their schema, because there typically is one.
There is a cognitive shift required from a relational mindset in order to properly design a MongoDB schema. If you are still "thinking relationally", then you will miss most of the advantages of using MongoDB.
More importantly, if all you have is a hammer, everything looks like a nail. I know that, for myself, it took me a long time to take off the "relational blinders". While you still may have entities with one-to-one, many-to-one, or many-to-many relationships, it is no more necessary to force them into third normal form in the database than it is in your programming language. Just as there are ways to represent those relationships within an object-oriented language such as Java or Ruby, there are equally effective ways to represent them in a document-oriented database, such as MongoDB.
Unlike the SQL world, in the MongoDB world, there is no one best schema design. More accurately, in MongoDB schema design depends on how the application is going to access the data.
Here are the key questions that you need to have answered in order to design a good schema for MongoDB: - How much data do you have? - What are your most common operations? Will you be mostly inserting new data, updating existing data, or doing queries? - What are your most common queries? - What are your most common updates? - How many I/O operations do you expect per second?
Here's how these questions might play out if you are considering one-to-many object relationships.
In SQL you model a many-to-one relationship by creating a pair of master/detail tables with a primary key/foreign key relationship. That's really the only way to do it.
In MongoDB, you have a number of choices: you can embed the data, you can create a linked relationship, you can duplicate and denormalize the data, or you can use a hybrid approach. The correct approach would depend on a lot of details about the use case of your application.
The key thing required to make this cognitive shift is to get enough exposure to MongoDB schemas to un-learn what you have learned. Here are some references to get you started.
Here are some good general references on MongoDB schema design.
Ultimately, the way that relational databases work is by splitting your data up among many tables & then joining them dynamically at query time. MongoDB has two modes. The more native mode involves pre-aggregating your data into logical chunks when you insert it, and then querying those pre-aggregated chunks. MongoDB can also work relationally: you just have to do the joins on the client side in your code, rather than relying on the database server to do it for you.
In my opinion, you often get cleaner code and faster performance by performing the join yourself: you know more about the semantics of the data and the data distribution than any RDBMS optimizer can ever know.
I hope you found this useful. Let me know if you have further questions.
On Sunday, October 7, 2012 3:06:38 AM UTC-7, Brent Gracey wrote:
> If a data set & use cases are focused on "relational" work - would that > reduce the fit of Mongodb for the project? For example a "friendship" data > set where questions like please get me the detailed info of users (held in > person_info) who interacted with each other (held in person interaction) > who bought product X ( held in product and purchase_history collections) at > event Y.
> Would it be possible to use a map reduce function which operates over > multiple tables?
> I guess foursquare have some way of answering those sorts of questions > efficiently with Mongo!
Hi - thank you very much for you input - I am trying to take of the blinders!
At the moment I am working on a new requirement which my data is not pre aggregted to answer - are the below my full set of options, or am I missing a trick or two?
1) Update schema to provide pre aggregration + Can lead to optimal Mongo performance - Requires db migration and code update to keep aggrations current with ongoing system usage
2) Answer requirement with client side processing - eg get Master and Detail table, then loop over Master in client code, and lookup required details + Low number of db connections - Perhaps this is just a knowledge gap for me, but I have to ensure I optimize handling larger data sets in my "web app" code (ie ensure find product details based on product ID isn't doing a full list scan on the detail data set) Sorting the data would be the first approach I will look into - any other suggestions welcome.
3) Mixture client and multiple queries - eg get the Master table and then do selects to get Details as required + Detail selects can use DB indexes to be efficient - Very high number of db connections
At the moment I'm using option 3, which is slower than what I would get in the same setup with a RDMS just joining the tables based on an index, but I'm hesitant to move to 2, as users being able to drive full tables selects isn't a model I'm used to working with, and 1 is a fairly large piece of work, which may have to be repeated when the next requirement come along
On Tuesday, October 9, 2012 2:17:09 AM UTC+1, William Zola wrote:
> Hi Brent!
> For folks coming from the SQL world (such as myself) one of the hardest > things to learn about MongoDB is the new style of schema design. In the > SQL world, everything goes into third normal form. Folks come to think > that there is a single right way to design their schema, because there > typically is one.
> There is a cognitive shift required from a relational mindset in order to > properly design a MongoDB schema. If you are still "thinking > relationally", then you will miss most of the advantages of using MongoDB.
> More importantly, if all you have is a hammer, everything looks like a > nail. I know that, for myself, it took me a long time to take off the > "relational blinders". While you still may have entities with one-to-one, > many-to-one, or many-to-many relationships, it is no more necessary to > force them into third normal form in the database than it is in your > programming language. Just as there are ways to represent those > relationships within an object-oriented language such as Java or Ruby, > there are equally effective ways to represent them in a document-oriented > database, such as MongoDB.
> Unlike the SQL world, in the MongoDB world, there is no one best schema > design. More accurately, in MongoDB schema design depends on how the > application is going to access the data.
> Here are the key questions that you need to have answered in order to > design a good schema for MongoDB: > - How much data do you have? > - What are your most common operations? Will you be mostly inserting new > data, updating existing data, or doing queries? > - What are your most common queries? > - What are your most common updates? > - How many I/O operations do you expect per second?
> Here's how these questions might play out if you are considering > one-to-many object relationships.
> In SQL you model a many-to-one relationship by creating a pair of > master/detail tables with a primary key/foreign key relationship. That's > really the only way to do it.
> In MongoDB, you have a number of choices: you can embed the data, you can > create a linked relationship, you can duplicate and denormalize the data, > or you can use a hybrid approach. The correct approach would depend on a > lot of details about the use case of your application.
> The key thing required to make this cognitive shift is to get enough > exposure to MongoDB schemas to un-learn what you have learned. Here are > some references to get you started.
> Here are some good general references on MongoDB schema design.
> Ultimately, the way that relational databases work is by splitting your > data up among many tables & then joining them dynamically at query time. > MongoDB has two modes. The more native mode involves pre-aggregating your > data into logical chunks when you insert it, and then querying those > pre-aggregated chunks. MongoDB can also work relationally: you just have > to do the joins on the client side in your code, rather than relying on the > database server to do it for you.
> In my opinion, you often get cleaner code and faster performance by > performing the join yourself: you know more about the semantics of the data > and the data distribution than any RDBMS optimizer can ever know.
> I hope you found this useful. Let me know if you have further questions.
> -William
> On Sunday, October 7, 2012 3:06:38 AM UTC-7, Brent Gracey wrote:
>> If a data set & use cases are focused on "relational" work - would that >> reduce the fit of Mongodb for the project? For example a "friendship" data >> set where questions like please get me the detailed info of users (held in >> person_info) who interacted with each other (held in person interaction) >> who bought product X ( held in product and purchase_history collections) at >> event Y.
>> Would it be possible to use a map reduce function which operates over >> multiple tables?
>> I guess foursquare have some way of answering those sorts of questions >> efficiently with Mongo!
On Tuesday, October 9, 2012 6:04:21 AM UTC-7, Brent Gracey wrote:
> Hi - thank you very much for you input - I am trying to take of the > blinders!
> At the moment I am working on a new requirement which my data is not pre > aggregted to answer - are the below my full set of options, or am I missing > a trick or two?
Unfortunately, you haven't given me enough information to answer the question. There's no one "right" generalized solution for a master/detail relationship in MongoDB.
I'd need to know the details of your data model and the semantics of what you're trying to accomplish in order to make suggestions for your particular use case. Are you doing targeted queries to satisfy a web page? Or are you trying to do analytics to roll-up a large data set that's spread across two collections? The tools in the toolbox and the set of options are highly dependent on which one you're doing.
Once I have more information I can give you a better answer.
I would like to get my head around as many of the tools in the toolbox as possible, so I can make an informed choice of the best fit for my current requirements and understand what road map I should have in mind for as the data set grows. So if there is an approach that doesn't fit under one of the three methods I listed, it would be great if you could highlight that to me.
Some more details on my current work is its a report, so its is aggregating records, and it needs to be available in real time, so users can view a summary of the event as it progresses through the day. At the moment the record sets are in the 1000s on the two collections, and the relationship is 1 to M but a low M, so 1,2 detail records is the most common case.
On Tuesday, October 9, 2012 2:52:28 PM UTC+1, William Zola wrote:
> Hi Brent!
> On Tuesday, October 9, 2012 6:04:21 AM UTC-7, Brent Gracey wrote:
>> Hi - thank you very much for you input - I am trying to take of the >> blinders!
>> At the moment I am working on a new requirement which my data is not pre >> aggregted to answer - are the below my full set of options, or am I missing >> a trick or two?
> Unfortunately, you haven't given me enough information to answer the > question. There's no one "right" generalized solution for a master/detail > relationship in MongoDB.
> I'd need to know the details of your data model and the semantics of what > you're trying to accomplish in order to make suggestions for your > particular use case. Are you doing targeted queries to satisfy a web page? > Or are you trying to do analytics to roll-up a large data set that's > spread across two collections? The tools in the toolbox and the set of > options are highly dependent on which one you're doing.
> Once I have more information I can give you a better answer.
> -William
> -- > You received this message because you are subscribed to the Google > Groups "mongodb-user" group. > To post to this group, send email to mongod...@googlegroups.com<javascript:> > To unsubscribe from this group, send email to > mongodb-user...@googlegroups.com <javascript:> > See also the IRC channel -- freenode.net#mongodb
Here's a quick overview of the tools in the toolbox for many-to-one relationships:
- Embedding instead of referencing
- Placing references to the detail document in the master document
- Placing a reference to the master document in the detail document. (AKA a "Parent Pointer" tree.) This lets you get all of the details for a single master document in a single query.
- Combining both master->detail and detail->master references
- Denormalizing the data so that some of the needed information from the detail documents are present in both the master and detail documents.
(This works best if writes are infrequent.)
If you have a "hard" real-time (as in 'up-to-the-second') requirement, then you have to do database queries at the time that the application requests them. If it's OK for data to be stale, then you can run batch jobs to do the pre-aggregation, and then perform the reporting queries off of them.
(Think "Data Mart" and "scheduled ETL".)
MongoDB has a reasonably powerful MapReduce functionality: you can often use multiple mMapReduce jobs to join data from multiple collections. While not suitable for hard real-time requirements, it can be useful for
You can also use the Hadoop adaptor to do MapReduce jobs. The Hadoop "dependent jobs" feature can allow you to read from multiple collections in the context of a single Hadoop job.
Finally, if your document counts are only in the thousands or tens of thousands, then I'd suggest staying with the simple strategy of doing joins in your client application. The database would have to do these anyway: there's no extra work involved, just a matter of moving it off of the database server.
On Tuesday, October 9, 2012 1:50:37 PM UTC-7, Brent Gracey wrote:
> Hi,
> I would like to get my head around as many of the tools in the toolbox as > possible, so I can make an informed choice of the best fit for my current > requirements and understand what road map I should have in mind for as the > data set grows. So if there is an approach that doesn't fit under one of > the three methods I listed, it would be great if you could highlight that > to me.
> Some more details on my current work is its a report, so its is > aggregating records, and it needs to be available in real time, so users > can view a summary of the event as it progresses through the day. At the > moment the record sets are in the 1000s on the two collections, and the > relationship is 1 to M but a low M, so 1,2 detail records is the most > common case.
> Thanks again for your help
> On Tuesday, October 9, 2012 2:52:28 PM UTC+1, William Zola wrote:
>> Hi Brent!
>> On Tuesday, October 9, 2012 6:04:21 AM UTC-7, Brent Gracey wrote:
>>> Hi - thank you very much for you input - I am trying to take of the >>> blinders!
>>> At the moment I am working on a new requirement which my data is not pre >>> aggregted to answer - are the below my full set of options, or am I missing >>> a trick or two?
>> Unfortunately, you haven't given me enough information to answer the >> question. There's no one "right" generalized solution for a master/detail >> relationship in MongoDB.
>> I'd need to know the details of your data model and the semantics of what >> you're trying to accomplish in order to make suggestions for your >> particular use case. Are you doing targeted queries to satisfy a web page? >> Or are you trying to do analytics to roll-up a large data set that's >> spread across two collections? The tools in the toolbox and the set of >> options are highly dependent on which one you're doing.
>> Once I have more information I can give you a better answer.
>> -William
>> -- >> You received this message because you are subscribed to the Google
>> Groups "mongodb-user" group.
>> To post to this group, send email to mongod...@googlegroups.com
>> To unsubscribe from this group, send email to
>> mongodb-user...@googlegroups.com
>> See also the IRC channel -- freenode.net#mongodb
Hi - thanks for the comprehensive reply - I'm still working through most of the links you have suggested. One thing I would like to clarify - for your comment "suggest staying with the simple strategy of doing joins in your client application. The database would have to do these anyway:"
Yes the database application would have to join - but assuming it uses an index, isn't the database operation not much more efficient than
for all x in array X Look up Y[x]
My assumptions for this are:
There is overhead in maintaining DB indexes - this has already been "expended effort" at the time of getting data Getting all the data from the DB to the client application loses the index info Repeating the effort of generating an index (not sure how many languages would support that - any Scala specific thoughts welcome) is duplicated effort Not doing something to replicate index functionality will make the look up in Y much less efficient.
So I can't see how client side processing can be as efficient as getting the db to do it - and as one scales - you either get to a point of having to do a large data restructure or live with the extra overhead.
Main point of the post if for people to point out if any of my assumptions are incorrect.
On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
> If a data set & use cases are focused on "relational" work - would that > reduce the fit of Mongodb for the project? For example a "friendship" data > set where questions like please get me the detailed info of users (held in > person_info) who interacted with each other (held in person interaction) > who bought product X ( held in product and purchase_history collections) at > event Y.
> Would it be possible to use a map reduce function which operates over > multiple tables?
> I guess foursquare have some way of answering those sorts of questions > efficiently with Mongo!
The answer to your question is: the database is not going to be any more efficient than doing the queries client side -- IF you have designed your schema properly.
1) Let's begin by looking at how you'd do this in SQL:
First off: this creates two indexes "behind your back": one on dept.deptno, one on emp.empid.
Second: consider how you would query these tables:
SELECT empid, ename FROM emp, dept WHERE dept.dname = 'SALES' JOIN dept.deptno = emp.deptno ORDER BY ename;
Note that unless you index 'emp.deptno', this query is going to be slow (since it will otherwise perform a full table scan on 'emp'). Unless you index 'dept.dname', it will have to perform a full table scan on 'dept'.
You're now at no less than FOUR indexes.
This query is going to force the RDBMS to do the following: A) Read the 'dept.dname' index to find the location of the row(s) where "dname = 'SALES'" B) Fetch the matching row(s) from the 'dept' table, and save off the set of 'deptno' values C) Read the 'emp.deptno' index to find the locations of the matching rows in the 'emp' table D) Fetch the matching rows from the 'emp' table E) Sort the result set in-memory
2) Now, let's consider how to do this in MongoDB. For legibility, I'm going to use integers for the _id field, but everything I will do will work with ObjectIDs as well:
In terms of the actual work done, there are the same number of indexes involved, and the same number of operations involved as with the SQL example. The only difference is that I had to build one portion of the query in the client side instead of on the server side.
On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
> Hi - thanks for the comprehensive reply - I'm still working through most > of the links you have suggested. One thing I would like to clarify - for > your comment "suggest staying with the simple strategy of doing joins in > your client application. The database would have to do these anyway:"
> Yes the database application would have to join - but assuming it uses an > index, isn't the database operation not much more efficient than
> for all x in array X > Look up Y[x]
> My assumptions for this are:
> There is overhead in maintaining DB indexes - this has already been > "expended effort" at the time of getting data > Getting all the data from the DB to the client application loses the index > info > Repeating the effort of generating an index (not sure how many languages > would support that - any Scala specific thoughts welcome) is duplicated > effort > Not doing something to replicate index functionality will make the look up > in Y much less efficient.
> So I can't see how client side processing can be as efficient as getting > the db to do it - and as one scales - you either get to a point of having > to do a large data restructure or live with the extra overhead.
> Main point of the post if for people to point out if any of my assumptions > are incorrect.
> Thanks
> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
>> If a data set & use cases are focused on "relational" work - would that >> reduce the fit of Mongodb for the project? For example a "friendship" data >> set where questions like please get me the detailed info of users (held in >> person_info) who interacted with each other (held in person interaction) >> who bought product X ( held in product and purchase_history collections) at >> event Y.
>> Would it be possible to use a map reduce function which operates over >> multiple tables?
>> I guess foursquare have some way of answering those sorts of questions >> efficiently with Mongo!
I can see how doing this for a query that requires information for one record is efficient, I however need to do a look up for multiple records.
If I pull all the detail records in one query - then the client side "detail" look up will be inefficient as there is no "index" concept in the client side code.
If I do a query to the db for each detail record then I get 100s of db connections which is also a big overhead.
So my understanding is I would need to update my data structure to address this efficiently.
On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
> Hi Brent!
> Sorry for taking so long to get back to you.
> The answer to your question is: the database is not going to be any more > efficient than doing the queries client side -- IF you have designed your > schema properly.
> 1) Let's begin by looking at how you'd do this in SQL:
> First off: this creates two indexes "behind your back": one on > dept.deptno, one on emp.empid.
> Second: consider how you would query these tables:
> SELECT empid, ename FROM emp, dept > WHERE > dept.dname = 'SALES' > JOIN > dept.deptno = emp.deptno > ORDER BY ename;
> Note that unless you index 'emp.deptno', this query is going to be slow > (since it will otherwise perform a full table scan on 'emp'). Unless you > index 'dept.dname', it will have to perform a full table scan on 'dept'.
> You're now at no less than FOUR indexes.
> This query is going to force the RDBMS to do the following: > A) Read the 'dept.dname' index to find the location of the row(s) where > "dname = 'SALES'" > B) Fetch the matching row(s) from the 'dept' table, and save off the set > of 'deptno' values > C) Read the 'emp.deptno' index to find the locations of the matching > rows in the 'emp' table > D) Fetch the matching rows from the 'emp' table > E) Sort the result set in-memory
> 2) Now, let's consider how to do this in MongoDB. For legibility, I'm > going to use integers for the _id field, but everything I will do will work > with ObjectIDs as well:
> In terms of the actual work done, there are the same number of indexes > involved, and the same number of operations involved as with the SQL > example. The only difference is that I had to build one portion of the > query in the client side instead of on the server side.
> Let me know if you have further questions.
> -William
> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
>> Hi - thanks for the comprehensive reply - I'm still working through most >> of the links you have suggested. One thing I would like to clarify - for >> your comment "suggest staying with the simple strategy of doing joins in >> your client application. The database would have to do these anyway:"
>> Yes the database application would have to join - but assuming it uses an >> index, isn't the database operation not much more efficient than
>> for all x in array X >> Look up Y[x]
>> My assumptions for this are:
>> There is overhead in maintaining DB indexes - this has already been >> "expended effort" at the time of getting data >> Getting all the data from the DB to the client application loses the >> index info >> Repeating the effort of generating an index (not sure how many languages >> would support that - any Scala specific thoughts welcome) is duplicated >> effort >> Not doing something to replicate index functionality will make the look >> up in Y much less efficient.
>> So I can't see how client side processing can be as efficient as getting >> the db to do it - and as one scales - you either get to a point of having >> to do a large data restructure or live with the extra overhead.
>> Main point of the post if for people to point out if any of my >> assumptions are incorrect.
>> Thanks
>> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
>>> If a data set & use cases are focused on "relational" work - would that >>> reduce the fit of Mongodb for the project? For example a "friendship" data >>> set where questions like please get me the detailed info of users (held in >>> person_info) who interacted with each other (held in person interaction) >>> who bought product X ( held in product and purchase_history collections) at >>> event Y.
>>> Would it be possible to use a map reduce function which operates over >>> multiple tables?
>>> I guess foursquare have some way of answering those sorts of questions >>> efficiently with Mongo!
"If I do a query to the db for each detail record then I get 100s of db
connections which is also a big overhead."
You only get one, and it should be a persistent one.
"If I pull all the detail records in one query - then the client side
"detail" look up will be inefficient as there is no "index" concept in the
client side code."
Indexes on JOINs are a questionable matter, I have seen many many times in
SQL where a simple JOIN has resulted in the index not being used for some
unknown reason even if that JOIN is specifically defined within the
internals of MySQL (which is what I was using at the time) using pk and fk
format, most specifically on non-standard ranges of data. You have also
gotta consider that JOINs can become a pain to scale in certain situations.
You have got to consider that in MySQL the client side "detail" lookup for
each record would be slow because of what it takes to make a SQL query and
result set for every row instead of a large JOINed result set and most
likely not because of the "index" concept.
MongoDB is quite different in how it queries, it is used to querying in
this way.
There is one caveat you will get into soon however, which I think needs
fixing, which is that MongoDB has no server-side sub selects, so if you
wanna do an $in based on another query you must pull the data over the wire
and operate on it client side even if you don't need those initial rows,
which is a little painful.
On 13 November 2012 11:10, Brent Gracey <brentgra...@gmail.com> wrote:
> Hi William - thanks again for the detailed reply
> I can see how doing this for a query that requires information for one
> record is efficient, I however need to do a look up for multiple records.
> If I pull all the detail records in one query - then the client side
> "detail" look up will be inefficient as there is no "index" concept in the
> client side code.
> If I do a query to the db for each detail record then I get 100s of db
> connections which is also a big overhead.
> So my understanding is I would need to update my data structure to address
> this efficiently.
> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
>> Hi Brent!
>> Sorry for taking so long to get back to you.
>> The answer to your question is: the database is not going to be any more
>> efficient than doing the queries client side -- IF you have designed your
>> schema properly.
>> 1) Let's begin by looking at how you'd do this in SQL:
>> First off: this creates two indexes "behind your back": one on
>> dept.deptno, one on emp.empid.
>> Second: consider how you would query these tables:
>> SELECT empid, ename FROM emp, dept
>> WHERE
>> dept.dname = 'SALES'
>> JOIN
>> dept.deptno = emp.deptno
>> ORDER BY ename;
>> Note that unless you index 'emp.deptno', this query is going to be slow
>> (since it will otherwise perform a full table scan on 'emp'). Unless you
>> index 'dept.dname', it will have to perform a full table scan on 'dept'.
>> You're now at no less than FOUR indexes.
>> This query is going to force the RDBMS to do the following:
>> A) Read the 'dept.dname' index to find the location of the row(s) where
>> "dname = 'SALES'"
>> B) Fetch the matching row(s) from the 'dept' table, and save off the
>> set of 'deptno' values
>> C) Read the 'emp.deptno' index to find the locations of the matching
>> rows in the 'emp' table
>> D) Fetch the matching rows from the 'emp' table
>> E) Sort the result set in-memory
>> 2) Now, let's consider how to do this in MongoDB. For legibility, I'm
>> going to use integers for the _id field, but everything I will do will work
>> with ObjectIDs as well:
>> To make this efficient, I'll create a secondary index on the 'dept' field
>> in the 'emp' collection, and the 'name' field in the 'dept' collection:
>> In terms of the actual work done, there are the same number of indexes
>> involved, and the same number of operations involved as with the SQL
>> example. The only difference is that I had to build one portion of the
>> query in the client side instead of on the server side.
>> Let me know if you have further questions.
>> -William
>> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
>>> Hi - thanks for the comprehensive reply - I'm still working through most
>>> of the links you have suggested. One thing I would like to clarify - for
>>> your comment "suggest staying with the simple strategy of doing joins in
>>> your client application. The database would have to do these anyway:"
>>> Yes the database application would have to join - but assuming it uses
>>> an index, isn't the database operation not much more efficient than
>>> for all x in array X
>>> Look up Y[x]
>>> My assumptions for this are:
>>> There is overhead in maintaining DB indexes - this has already been
>>> "expended effort" at the time of getting data
>>> Getting all the data from the DB to the client application loses the
>>> index info
>>> Repeating the effort of generating an index (not sure how many languages
>>> would support that - any Scala specific thoughts welcome) is duplicated
>>> effort
>>> Not doing something to replicate index functionality will make the look
>>> up in Y much less efficient.
>>> So I can't see how client side processing can be as efficient as getting
>>> the db to do it - and as one scales - you either get to a point of having
>>> to do a large data restructure or live with the extra overhead.
>>> Main point of the post if for people to point out if any of my
>>> assumptions are incorrect.
>>> Thanks
>>> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
>>>> If a data set & use cases are focused on "relational" work - would that
>>>> reduce the fit of Mongodb for the project? For example a "friendship" data
>>>> set where questions like please get me the detailed info of users (held in
>>>> person_info) who interacted with each other (held in person interaction)
>>>> who bought product X ( held in product and purchase_history collections) at
>>>> event Y.
>>>> Would it be possible to use a map reduce function which operates over
>>>> multiple tables?
>>>> I guess foursquare have some way of answering those sorts of questions
>>>> efficiently with Mongo!
>>> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user+unsubscribe@googlegroups.com
> See also the IRC channel -- freenode.net#mongodb
> "If I do a query to the db for each detail record then I get 100s of db
> connections which is also a big overhead."
> You only get one, and it should be a persistent one.
> "If I pull all the detail records in one query - then the client side
> "detail" look up will be inefficient as there is no "index" concept in the
> client side code."
> Indexes on JOINs are a questionable matter, I have seen many many times in
> SQL where a simple JOIN has resulted in the index not being used for some
> unknown reason even if that JOIN is specifically defined within the
> internals of MySQL (which is what I was using at the time) using pk and fk
> format, most specifically on non-standard ranges of data. You have also
> gotta consider that JOINs can become a pain to scale in certain situations.
> You have got to consider that in MySQL the client side "detail" lookup for
> each record would be slow because of what it takes to make a SQL query and
> result set for every row instead of a large JOINed result set and most
> likely not because of the "index" concept.
> MongoDB is quite different in how it queries, it is used to querying in
> this way.
> There is one caveat you will get into soon however, which I think needs
> fixing, which is that MongoDB has no server-side sub selects, so if you
> wanna do an $in based on another query you must pull the data over the wire
> and operate on it client side even if you don't need those initial rows,
> which is a little painful.
> On 13 November 2012 11:10, Brent Gracey <brentgra...@gmail.com> wrote:
>> Hi William - thanks again for the detailed reply
>> I can see how doing this for a query that requires information for one
>> record is efficient, I however need to do a look up for multiple records.
>> If I pull all the detail records in one query - then the client side
>> "detail" look up will be inefficient as there is no "index" concept in the
>> client side code.
>> If I do a query to the db for each detail record then I get 100s of db
>> connections which is also a big overhead.
>> So my understanding is I would need to update my data structure to
>> address this efficiently.
>> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
>>> Hi Brent!
>>> Sorry for taking so long to get back to you.
>>> The answer to your question is: the database is not going to be any more
>>> efficient than doing the queries client side -- IF you have designed your
>>> schema properly.
>>> 1) Let's begin by looking at how you'd do this in SQL:
>>> First off: this creates two indexes "behind your back": one on
>>> dept.deptno, one on emp.empid.
>>> Second: consider how you would query these tables:
>>> SELECT empid, ename FROM emp, dept
>>> WHERE
>>> dept.dname = 'SALES'
>>> JOIN
>>> dept.deptno = emp.deptno
>>> ORDER BY ename;
>>> Note that unless you index 'emp.deptno', this query is going to be slow
>>> (since it will otherwise perform a full table scan on 'emp'). Unless you
>>> index 'dept.dname', it will have to perform a full table scan on 'dept'.
>>> You're now at no less than FOUR indexes.
>>> This query is going to force the RDBMS to do the following:
>>> A) Read the 'dept.dname' index to find the location of the row(s)
>>> where "dname = 'SALES'"
>>> B) Fetch the matching row(s) from the 'dept' table, and save off the
>>> set of 'deptno' values
>>> C) Read the 'emp.deptno' index to find the locations of the matching
>>> rows in the 'emp' table
>>> D) Fetch the matching rows from the 'emp' table
>>> E) Sort the result set in-memory
>>> 2) Now, let's consider how to do this in MongoDB. For legibility, I'm
>>> going to use integers for the _id field, but everything I will do will work
>>> with ObjectIDs as well:
>>> To make this efficient, I'll create a secondary index on the 'dept'
>>> field in the 'emp' collection, and the 'name' field in the 'dept'
>>> collection:
>>> In terms of the actual work done, there are the same number of indexes
>>> involved, and the same number of operations involved as with the SQL
>>> example. The only difference is that I had to build one portion of the
>>> query in the client side instead of on the server side.
>>> Let me know if you have further questions.
>>> -William
>>> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
>>>> Hi - thanks for the comprehensive reply - I'm still working through
>>>> most of the links you have suggested. One thing I would like to clarify -
>>>> for your comment "suggest staying with the simple strategy of doing joins
>>>> in your client application. The database would have to do these anyway:"
>>>> Yes the database application would have to join - but assuming it uses
>>>> an index, isn't the database operation not much more efficient than
>>>> for all x in array X
>>>> Look up Y[x]
>>>> My assumptions for this are:
>>>> There is overhead in maintaining DB indexes - this has already been
>>>> "expended effort" at the time of getting data
>>>> Getting all the data from the DB to the client application loses the
>>>> index info
>>>> Repeating the effort of generating an index (not sure how many
>>>> languages would support that - any Scala specific thoughts welcome) is
>>>> duplicated effort
>>>> Not doing something to replicate index functionality will make the look
>>>> up in Y much less efficient.
>>>> So I can't see how client side processing can be as efficient as
>>>> getting the db to do it - and as one scales - you either get to a point of
>>>> having to do a large data restructure or live with the extra overhead.
>>>> Main point of the post if for people to point out if any of my
>>>> assumptions are incorrect.
>>>> Thanks
>>>> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
>>>>> If a data set & use cases are focused on "relational" work - would
>>>>> that reduce the fit of Mongodb for the project? For example a "friendship"
>>>>> data set where questions like please get me the detailed info of users
>>>>> (held in person_info) who interacted with each other (held in person
>>>>> interaction) who bought product X ( held in product and purchase_history
>>>>> collections) at event Y.
>>>>> Would it be possible to use a map reduce function which operates over
>>>>> multiple tables?
>>>>> I guess foursquare have some way of answering those sorts of questions
>>>>> efficiently with Mongo!
>>>> --
>> You received this message because you are subscribed to the Google
>> Groups "mongodb-user" group.
>> To post to this group, send email to mongodb-user@googlegroups.com
>> To unsubscribe from this group, send email to
>> mongodb-user+unsubscribe@googlegroups.com
>> See also the IRC channel -- freenode.net#mongodb
I know very well that joins in SQL can have their own nuances, but feel that is a separate discussion. And yes, connection pooling will reduce the overhead of multiple db queries, but it is still an overhead.
My current situation is I am doing multiple detail queries - and the report is taking about 30 seconds to complete for ~200 records, and my boss wants it to go faster, and I don't know how to that without telling him we need to re-architect our data structure.
I haven't got the exact breakdown of what is taking the time - so can't tell you how much "overhead" the multiple db connections is introducing, but I'll work on some logging to break down exactly where time is being spend in the report.
On Tuesday, November 13, 2012 12:32:19 PM UTC, Sammaye wrote:
> "You have got to consider that in MySQL"
> Sorry I mean any SQL tech there.
> On 13 November 2012 12:31, Sam Millman <sam.m...@gmail.com <javascript:>>wrote:
>> "If I do a query to the db for each detail record then I get 100s of db >> connections which is also a big overhead."
>> You only get one, and it should be a persistent one.
>> "If I pull all the detail records in one query - then the client side >> "detail" look up will be inefficient as there is no "index" concept in the >> client side code."
>> Indexes on JOINs are a questionable matter, I have seen many many times >> in SQL where a simple JOIN has resulted in the index not being used for >> some unknown reason even if that JOIN is specifically defined within the >> internals of MySQL (which is what I was using at the time) using pk and fk >> format, most specifically on non-standard ranges of data. You have also >> gotta consider that JOINs can become a pain to scale in certain situations.
>> You have got to consider that in MySQL the client side "detail" lookup >> for each record would be slow because of what it takes to make a SQL query >> and result set for every row instead of a large JOINed result set and most >> likely not because of the "index" concept.
>> MongoDB is quite different in how it queries, it is used to querying in >> this way.
>> There is one caveat you will get into soon however, which I think needs >> fixing, which is that MongoDB has no server-side sub selects, so if you >> wanna do an $in based on another query you must pull the data over the wire >> and operate on it client side even if you don't need those initial rows, >> which is a little painful.
>> On 13 November 2012 11:10, Brent Gracey <brent...@gmail.com <javascript:> >> > wrote:
>>> Hi William - thanks again for the detailed reply
>>> I can see how doing this for a query that requires information for one >>> record is efficient, I however need to do a look up for multiple records.
>>> If I pull all the detail records in one query - then the client side >>> "detail" look up will be inefficient as there is no "index" concept in the >>> client side code.
>>> If I do a query to the db for each detail record then I get 100s of db >>> connections which is also a big overhead.
>>> So my understanding is I would need to update my data structure to >>> address this efficiently.
>>> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
>>>> Hi Brent!
>>>> Sorry for taking so long to get back to you.
>>>> The answer to your question is: the database is not going to be any >>>> more efficient than doing the queries client side -- IF you have designed >>>> your schema properly.
>>>> 1) Let's begin by looking at how you'd do this in SQL:
>>>> First off: this creates two indexes "behind your back": one on >>>> dept.deptno, one on emp.empid.
>>>> Second: consider how you would query these tables:
>>>> SELECT empid, ename FROM emp, dept >>>> WHERE >>>> dept.dname = 'SALES' >>>> JOIN >>>> dept.deptno = emp.deptno >>>> ORDER BY ename;
>>>> Note that unless you index 'emp.deptno', this query is going to be slow >>>> (since it will otherwise perform a full table scan on 'emp'). Unless you >>>> index 'dept.dname', it will have to perform a full table scan on 'dept'.
>>>> You're now at no less than FOUR indexes.
>>>> This query is going to force the RDBMS to do the following: >>>> A) Read the 'dept.dname' index to find the location of the row(s) >>>> where "dname = 'SALES'" >>>> B) Fetch the matching row(s) from the 'dept' table, and save off the >>>> set of 'deptno' values >>>> C) Read the 'emp.deptno' index to find the locations of the matching >>>> rows in the 'emp' table >>>> D) Fetch the matching rows from the 'emp' table >>>> E) Sort the result set in-memory
>>>> 2) Now, let's consider how to do this in MongoDB. For legibility, I'm >>>> going to use integers for the _id field, but everything I will do will work >>>> with ObjectIDs as well:
>>>> To make this efficient, I'll create a secondary index on the 'dept' >>>> field in the 'emp' collection, and the 'name' field in the 'dept' >>>> collection:
>>>> In terms of the actual work done, there are the same number of indexes >>>> involved, and the same number of operations involved as with the SQL >>>> example. The only difference is that I had to build one portion of the >>>> query in the client side instead of on the server side.
>>>> Let me know if you have further questions.
>>>> -William
>>>> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
>>>>> Hi - thanks for the comprehensive reply - I'm still working through >>>>> most of the links you have suggested. One thing I would like to clarify - >>>>> for your comment "suggest staying with the simple strategy of doing joins >>>>> in your client application. The database would have to do these anyway:"
>>>>> Yes the database application would have to join - but assuming it uses >>>>> an index, isn't the database operation not much more efficient than
>>>>> for all x in array X >>>>> Look up Y[x]
>>>>> My assumptions for this are:
>>>>> There is overhead in maintaining DB indexes - this has already been >>>>> "expended effort" at the time of getting data >>>>> Getting all the data from the DB to the client application loses the >>>>> index info >>>>> Repeating the effort of generating an index (not sure how many >>>>> languages would support that - any Scala specific thoughts welcome) is >>>>> duplicated effort >>>>> Not doing something to replicate index functionality will make the >>>>> look up in Y much less efficient.
>>>>> So I can't see how client side processing can be as efficient as >>>>> getting the db to do it - and as one scales - you either get to a point of >>>>> having to do a large data restructure or live with the extra overhead.
>>>>> Main point of the post if for people to point out if any of my >>>>> assumptions are incorrect.
>>>>> Thanks
>>>>> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
>>>>>> If a data set & use cases are focused on "relational" work - would >>>>>> that reduce the fit of Mongodb for the project? For example a "friendship" >>>>>> data set where questions like please get me the detailed info of users >>>>>> (held in person_info) who interacted with each other (held in person >>>>>> interaction) who bought product X ( held in product and purchase_history >>>>>> collections) at event Y.
>>>>>> Would it be possible to use a map reduce function which operates over >>>>>> multiple tables?
>>>>>> I guess foursquare have some way of answering those sorts of >>>>>> questions efficiently with Mongo!
>>>>> -- >>> You received this message because you are subscribed to the Google >>> Groups "mongodb-user" group. >>> To post to this group, send email to mongod...@googlegroups.com<javascript:> >>> To unsubscribe from this group, send email to >>> mongodb-user...@googlegroups.com <javascript:> >>> See also the IRC channel -- freenode.net#mongodb
> I know very well that joins in SQL can have their own nuances, but feel
> that is a separate discussion. And yes, connection pooling will reduce the
> overhead of multiple db queries, but it is still an overhead.
> My current situation is I am doing multiple detail queries - and the
> report is taking about 30 seconds to complete for ~200 records, and my
> boss wants it to go faster, and I don't know how to that without telling
> him we need to re-architect our data structure.
> I haven't got the exact breakdown of what is taking the time - so can't
> tell you how much "overhead" the multiple db connections is introducing,
> but I'll work on some logging to break down exactly where time is being
> spend in the report.
> Thanks
> On Tuesday, November 13, 2012 12:32:19 PM UTC, Sammaye wrote:
>> "You have got to consider that in MySQL"
>> Sorry I mean any SQL tech there.
>> On 13 November 2012 12:31, Sam Millman <sam.m...@gmail.com> wrote:
>>> "If I do a query to the db for each detail record then I get 100s of db
>>> connections which is also a big overhead."
>>> You only get one, and it should be a persistent one.
>>> "If I pull all the detail records in one query - then the client side
>>> "detail" look up will be inefficient as there is no "index" concept in the
>>> client side code."
>>> Indexes on JOINs are a questionable matter, I have seen many many times
>>> in SQL where a simple JOIN has resulted in the index not being used for
>>> some unknown reason even if that JOIN is specifically defined within the
>>> internals of MySQL (which is what I was using at the time) using pk and fk
>>> format, most specifically on non-standard ranges of data. You have also
>>> gotta consider that JOINs can become a pain to scale in certain situations.
>>> You have got to consider that in MySQL the client side "detail" lookup
>>> for each record would be slow because of what it takes to make a SQL query
>>> and result set for every row instead of a large JOINed result set and most
>>> likely not because of the "index" concept.
>>> MongoDB is quite different in how it queries, it is used to querying in
>>> this way.
>>> There is one caveat you will get into soon however, which I think needs
>>> fixing, which is that MongoDB has no server-side sub selects, so if you
>>> wanna do an $in based on another query you must pull the data over the wire
>>> and operate on it client side even if you don't need those initial rows,
>>> which is a little painful.
>>> On 13 November 2012 11:10, Brent Gracey <brent...@gmail.com> wrote:
>>>> Hi William - thanks again for the detailed reply
>>>> I can see how doing this for a query that requires information for one
>>>> record is efficient, I however need to do a look up for multiple records.
>>>> If I pull all the detail records in one query - then the client side
>>>> "detail" look up will be inefficient as there is no "index" concept in the
>>>> client side code.
>>>> If I do a query to the db for each detail record then I get 100s of db
>>>> connections which is also a big overhead.
>>>> So my understanding is I would need to update my data structure to
>>>> address this efficiently.
>>>> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
>>>>> Hi Brent!
>>>>> Sorry for taking so long to get back to you.
>>>>> The answer to your question is: the database is not going to be any
>>>>> more efficient than doing the queries client side -- IF you have designed
>>>>> your schema properly.
>>>>> 1) Let's begin by looking at how you'd do this in SQL:
>>>>> First off: this creates two indexes "behind your back": one on
>>>>> dept.deptno, one on emp.empid.
>>>>> Second: consider how you would query these tables:
>>>>> SELECT empid, ename FROM emp, dept
>>>>> WHERE
>>>>> dept.dname = 'SALES'
>>>>> JOIN
>>>>> dept.deptno = emp.deptno
>>>>> ORDER BY ename;
>>>>> Note that unless you index 'emp.deptno', this query is going to be
>>>>> slow (since it will otherwise perform a full table scan on 'emp'). Unless
>>>>> you index 'dept.dname', it will have to perform a full table scan on 'dept'.
>>>>> You're now at no less than FOUR indexes.
>>>>> This query is going to force the RDBMS to do the following:
>>>>> A) Read the 'dept.dname' index to find the location of the row(s)
>>>>> where "dname = 'SALES'"
>>>>> B) Fetch the matching row(s) from the 'dept' table, and save off the
>>>>> set of 'deptno' values
>>>>> C) Read the 'emp.deptno' index to find the locations of the matching
>>>>> rows in the 'emp' table
>>>>> D) Fetch the matching rows from the 'emp' table
>>>>> E) Sort the result set in-memory
>>>>> 2) Now, let's consider how to do this in MongoDB. For legibility, I'm
>>>>> going to use integers for the _id field, but everything I will do will work
>>>>> with ObjectIDs as well:
>>>>> To make this efficient, I'll create a secondary index on the 'dept'
>>>>> field in the 'emp' collection, and the 'name' field in the 'dept'
>>>>> collection:
>>>>> In terms of the actual work done, there are the same number of indexes
>>>>> involved, and the same number of operations involved as with the SQL
>>>>> example. The only difference is that I had to build one portion of the
>>>>> query in the client side instead of on the server side.
>>>>> Let me know if you have further questions.
>>>>> -William
>>>>> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
>>>>>> Hi - thanks for the comprehensive reply - I'm still working through
>>>>>> most of the links you have suggested. One thing I would like to clarify -
>>>>>> for your comment "suggest staying with the simple strategy of doing joins
>>>>>> in your client application. The database would have to do these anyway:"
>>>>>> Yes the database application would have to join - but assuming it
>>>>>> uses an index, isn't the database operation not much more efficient than
>>>>>> for all x in array X
>>>>>> Look up Y[x]
>>>>>> My assumptions for this are:
>>>>>> There is overhead in maintaining DB indexes - this has already been
>>>>>> "expended effort" at the time of getting data
>>>>>> Getting all the data from the DB to the client application loses the
>>>>>> index info
>>>>>> Repeating the effort of generating an index (not sure how many
>>>>>> languages would support that - any Scala specific thoughts welcome) is
>>>>>> duplicated effort
>>>>>> Not doing something to replicate index functionality will make the
>>>>>> look up in Y much less efficient.
>>>>>> So I can't see how client side processing can be as efficient as
>>>>>> getting the db to do it - and as one scales - you either get to a point of
>>>>>> having to do a large data restructure or live with the extra overhead.
>>>>>> Main point of the post if for people to point out if any of my
>>>>>> assumptions are incorrect.
>>>>>> Thanks
>>>>>> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
>>>>>>> If a data set & use cases are focused on "relational" work - would
>>>>>>> that reduce the fit of Mongodb for the project? For example a "friendship"
>>>>>>> data set where questions like please get me the detailed info of users
>>>>>>> (held in person_info) who interacted with each other (held in person
>>>>>>> interaction) who bought product X ( held in product and purchase_history
>>>>>>> collections) at event Y.
>>>>>>> Would it be possible to use a map reduce function which operates
>>>>>>> over multiple tables?
>>>>>>> I guess foursquare have some way of answering those sorts of
>>>>>>> questions efficiently with Mongo!
>>>>>>> (this is a shortened version of half of post https://groups.google.* >>>>>>> *com**/forum/?fromgroups=#!**searchin/**mongodb-user/**
Hi - so I done some logging to work out what is taking the time
First, a big correction - I'm actually hitting the db 2000 times, not 200, I forgot that 1800 records don't actually do anything in this report, but I have to go look for a detailed record for them before I know that.
Logging from my application code
09:27:31.409 [qtp741338313-114 - /man/json/counts] DEBUG bootstrap.liftweb.RequestAnalyzer - Total request time on /json/counts: 0 ms 09:27:31.421 [qtp741338313-114 - /man/json/counts] DEBUG c.b.sojo_manager.api.RestAPI - Starting to generate report 09:27:32.446 [qtp741338313-114 - /man/json/counts] DEBUG c.b.sojo_manager.api.RestAPI - Got all users from the database -> This is the first select that returns 2000 users 09:27:41.531 [qtp741338313-114 - /man/json/counts] DEBUG c.b.sojo_manager.api.RestAPI - Have looped through all users and checked which ones have devices with connections -> This is after looping through all 2000 and doing a detail lookup in mongo, which takes about 10 seconds. 09:27:41.719 [qtp741338313-114 - /man/json/counts] DEBUG c.b.sojo_manager.api.RestAPI - Have created the from to mapping for the force diagram 09:27:41.719 [qtp741338313-114 - /man/json/counts] DEBUG bootstrap.liftweb.RequestAnalyzer - Total request time on /json/counts: 10301 ms
I enabled db.setProfilingLevel(2) - and after capturing commands for just this report
> db.system.profile.find().count()
2845
Then I ran a summary - to just group everything and give a total "db" run time - which comes in at 1794 milli sec, so I don't think the main part of the 10 seconds is db execution. I am working on refining the application logging to get a better view of what components of the query are taking the remaining 8 seconds.
On Tuesday, November 13, 2012 2:22:58 PM UTC, Sammaye wrote:
> At 200 records it should be giving 0 overhead.
> You are right that the initial post was a little tl;dr. I definitely > missed it.
> So let's start fresh: if you turn on logging for MongoDB and check the log > for slow queries do you see any?
> On 13 November 2012 13:39, Brent Gracey <brent...@gmail.com <javascript:>>wrote:
>> Thanks for the reply,
>> I know very well that joins in SQL can have their own nuances, but feel >> that is a separate discussion. And yes, connection pooling will reduce the >> overhead of multiple db queries, but it is still an overhead.
>> My current situation is I am doing multiple detail queries - and the >> report is taking about 30 seconds to complete for ~200 records, and my >> boss wants it to go faster, and I don't know how to that without telling >> him we need to re-architect our data structure.
>> I haven't got the exact breakdown of what is taking the time - so can't >> tell you how much "overhead" the multiple db connections is introducing, >> but I'll work on some logging to break down exactly where time is being >> spend in the report.
>> Thanks
>> On Tuesday, November 13, 2012 12:32:19 PM UTC, Sammaye wrote:
>>> "You have got to consider that in MySQL"
>>> Sorry I mean any SQL tech there.
>>> On 13 November 2012 12:31, Sam Millman <sam.m...@gmail.com> wrote:
>>>> "If I do a query to the db for each detail record then I get 100s of db >>>> connections which is also a big overhead."
>>>> You only get one, and it should be a persistent one.
>>>> "If I pull all the detail records in one query - then the client side >>>> "detail" look up will be inefficient as there is no "index" concept in the >>>> client side code."
>>>> Indexes on JOINs are a questionable matter, I have seen many many times >>>> in SQL where a simple JOIN has resulted in the index not being used for >>>> some unknown reason even if that JOIN is specifically defined within the >>>> internals of MySQL (which is what I was using at the time) using pk and fk >>>> format, most specifically on non-standard ranges of data. You have also >>>> gotta consider that JOINs can become a pain to scale in certain situations.
>>>> You have got to consider that in MySQL the client side "detail" lookup >>>> for each record would be slow because of what it takes to make a SQL query >>>> and result set for every row instead of a large JOINed result set and most >>>> likely not because of the "index" concept.
>>>> MongoDB is quite different in how it queries, it is used to querying in >>>> this way.
>>>> There is one caveat you will get into soon however, which I think needs >>>> fixing, which is that MongoDB has no server-side sub selects, so if you >>>> wanna do an $in based on another query you must pull the data over the wire >>>> and operate on it client side even if you don't need those initial rows, >>>> which is a little painful.
>>>> On 13 November 2012 11:10, Brent Gracey <brent...@gmail.com> wrote:
>>>>> Hi William - thanks again for the detailed reply
>>>>> I can see how doing this for a query that requires information for one >>>>> record is efficient, I however need to do a look up for multiple records.
>>>>> If I pull all the detail records in one query - then the client side >>>>> "detail" look up will be inefficient as there is no "index" concept in the >>>>> client side code.
>>>>> If I do a query to the db for each detail record then I get 100s of db >>>>> connections which is also a big overhead.
>>>>> So my understanding is I would need to update my data structure to >>>>> address this efficiently.
>>>>> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
>>>>>> Hi Brent!
>>>>>> Sorry for taking so long to get back to you.
>>>>>> The answer to your question is: the database is not going to be any >>>>>> more efficient than doing the queries client side -- IF you have designed >>>>>> your schema properly.
>>>>>> 1) Let's begin by looking at how you'd do this in SQL:
>>>>>> First off: this creates two indexes "behind your back": one on >>>>>> dept.deptno, one on emp.empid.
>>>>>> Second: consider how you would query these tables:
>>>>>> SELECT empid, ename FROM emp, dept >>>>>> WHERE >>>>>> dept.dname = 'SALES' >>>>>> JOIN >>>>>> dept.deptno = emp.deptno >>>>>> ORDER BY ename;
>>>>>> Note that unless you index 'emp.deptno', this query is going to be >>>>>> slow (since it will otherwise perform a full table scan on 'emp'). Unless >>>>>> you index 'dept.dname', it will have to perform a full table scan on 'dept'.
>>>>>> You're now at no less than FOUR indexes.
>>>>>> This query is going to force the RDBMS to do the following: >>>>>> A) Read the 'dept.dname' index to find the location of the row(s) >>>>>> where "dname = 'SALES'" >>>>>> B) Fetch the matching row(s) from the 'dept' table, and save off >>>>>> the set of 'deptno' values >>>>>> C) Read the 'emp.deptno' index to find the locations of the >>>>>> matching rows in the 'emp' table >>>>>> D) Fetch the matching rows from the 'emp' table >>>>>> E) Sort the result set in-memory
>>>>>> 2) Now, let's consider how to do this in MongoDB. For legibility, >>>>>> I'm going to use integers for the _id field, but everything I will do will >>>>>> work with ObjectIDs as well:
>>>>>> To make this efficient, I'll create a secondary index on the 'dept' >>>>>> field in the 'emp' collection, and the 'name' field in the 'dept' >>>>>> collection:
>>>>>> In terms of the actual work done, there are the same number of >>>>>> indexes involved, and the same number of operations involved as with the >>>>>> SQL example. The only difference is that I had to build one portion of the >>>>>> query in the client side instead of on the server side.
>>>>>> Let me know if you have further questions.
>>>>>> -William
>>>>>> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
>>>>>>> Hi - thanks for the comprehensive reply - I'm still working through >>>>>>> most of the links you have suggested. One thing I would like to clarify -
1794ms still sounds quite slow for 2000, especially since that figure comes
from the database; it, however, appears that the main bottleneck is the
app. Let's see what you find there :)
On 14 November 2012 10:51, Brent Gracey <brentgra...@gmail.com> wrote:
> Hi - so I done some logging to work out what is taking the time
> First, a big correction - I'm actually hitting the db 2000 times, not 200,
> I forgot that 1800 records don't actually do anything in this report, but I
> have to go look for a detailed record for them before I know that.
> Logging from my application code
> 09:27:31.409 [qtp741338313-114 - /man/json/counts] DEBUG
> bootstrap.liftweb.RequestAnalyzer - Total request time on /json/counts: 0 ms
> 09:27:31.421 [qtp741338313-114 - /man/json/counts] DEBUG
> c.b.sojo_manager.api.RestAPI - Starting to generate report
> 09:27:32.446 [qtp741338313-114 - /man/json/counts] DEBUG
> c.b.sojo_manager.api.RestAPI - Got all users from the database -> This is
> the first select that returns 2000 users
> 09:27:41.531 [qtp741338313-114 - /man/json/counts] DEBUG
> c.b.sojo_manager.api.RestAPI - Have looped through all users and checked
> which ones have devices with connections -> This is after looping through
> all 2000 and doing a detail lookup in mongo, which takes about 10 seconds.
> 09:27:41.719 [qtp741338313-114 - /man/json/counts] DEBUG
> c.b.sojo_manager.api.RestAPI - Have created the from to mapping for the
> force diagram
> 09:27:41.719 [qtp741338313-114 - /man/json/counts] DEBUG
> bootstrap.liftweb.RequestAnalyzer - Total request time on /json/counts:
> 10301 ms
> I enabled db.setProfilingLevel(2) - and after capturing commands for just
> this report
> > db.system.profile.find().count()
> 2845
> Then I ran a summary - to just group everything and give a total "db" run
> time - which comes in at 1794 milli sec, so I don't think the main part of
> the 10 seconds is db execution. I am working on refining the application
> logging to get a better view of what components of the query are taking the
> remaining 8 seconds.
> On Tuesday, November 13, 2012 2:22:58 PM UTC, Sammaye wrote:
>> At 200 records it should be giving 0 overhead.
>> You are right that the initial post was a little tl;dr. I definitely
>> missed it.
>> So let's start fresh: if you turn on logging for MongoDB and check the
>> log for slow queries do you see any?
>> On 13 November 2012 13:39, Brent Gracey <brent...@gmail.com> wrote:
>>> Thanks for the reply,
>>> I know very well that joins in SQL can have their own nuances, but feel
>>> that is a separate discussion. And yes, connection pooling will reduce the
>>> overhead of multiple db queries, but it is still an overhead.
>>> My current situation is I am doing multiple detail queries - and the
>>> report is taking about 30 seconds to complete for ~200 records, and my
>>> boss wants it to go faster, and I don't know how to that without telling
>>> him we need to re-architect our data structure.
>>> I haven't got the exact breakdown of what is taking the time - so can't
>>> tell you how much "overhead" the multiple db connections is introducing,
>>> but I'll work on some logging to break down exactly where time is being
>>> spend in the report.
>>> Thanks
>>> On Tuesday, November 13, 2012 12:32:19 PM UTC, Sammaye wrote:
>>>> "You have got to consider that in MySQL"
>>>> Sorry I mean any SQL tech there.
>>>> On 13 November 2012 12:31, Sam Millman <sam.m...@gmail.com> wrote:
>>>>> "If I do a query to the db for each detail record then I get 100s of
>>>>> db connections which is also a big overhead."
>>>>> You only get one, and it should be a persistent one.
>>>>> "If I pull all the detail records in one query - then the client side
>>>>> "detail" look up will be inefficient as there is no "index" concept in the
>>>>> client side code."
>>>>> Indexes on JOINs are a questionable matter, I have seen many many
>>>>> times in SQL where a simple JOIN has resulted in the index not being used
>>>>> for some unknown reason even if that JOIN is specifically defined within
>>>>> the internals of MySQL (which is what I was using at the time) using pk and
>>>>> fk format, most specifically on non-standard ranges of data. You have also
>>>>> gotta consider that JOINs can become a pain to scale in certain situations.
>>>>> You have got to consider that in MySQL the client side "detail" lookup
>>>>> for each record would be slow because of what it takes to make a SQL query
>>>>> and result set for every row instead of a large JOINed result set and most
>>>>> likely not because of the "index" concept.
>>>>> MongoDB is quite different in how it queries, it is used to querying
>>>>> in this way.
>>>>> There is one caveat you will get into soon however, which I think
>>>>> needs fixing, which is that MongoDB has no server-side sub selects, so if
>>>>> you wanna do an $in based on another query you must pull the data over the
>>>>> wire and operate on it client side even if you don't need those initial
>>>>> rows, which is a little painful.
>>>>> On 13 November 2012 11:10, Brent Gracey <brent...@gmail.com> wrote:
>>>>>> Hi William - thanks again for the detailed reply
>>>>>> I can see how doing this for a query that requires information for
>>>>>> one record is efficient, I however need to do a look up for multiple
>>>>>> records.
>>>>>> If I pull all the detail records in one query - then the client side
>>>>>> "detail" look up will be inefficient as there is no "index" concept in the
>>>>>> client side code.
>>>>>> If I do a query to the db for each detail record then I get 100s of
>>>>>> db connections which is also a big overhead.
>>>>>> So my understanding is I would need to update my data structure to
>>>>>> address this efficiently.
>>>>>> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
>>>>>>> Hi Brent!
>>>>>>> Sorry for taking so long to get back to you.
>>>>>>> The answer to your question is: the database is not going to be any
>>>>>>> more efficient than doing the queries client side -- IF you have designed
>>>>>>> your schema properly.
>>>>>>> 1) Let's begin by looking at how you'd do this in SQL:
>>>>>>> First off: this creates two indexes "behind your back": one on
>>>>>>> dept.deptno, one on emp.empid.
>>>>>>> Second: consider how you would query these tables:
>>>>>>> SELECT empid, ename FROM emp, dept
>>>>>>> WHERE
>>>>>>> dept.dname = 'SALES'
>>>>>>> JOIN
>>>>>>> dept.deptno = emp.deptno
>>>>>>> ORDER BY ename;
>>>>>>> Note that unless you index 'emp.deptno', this query is going to be
>>>>>>> slow (since it will otherwise perform a full table scan on 'emp'). Unless
>>>>>>> you index 'dept.dname', it will have to perform a full table scan on 'dept'.
>>>>>>> You're now at no less than FOUR indexes.
>>>>>>> This query is going to force the RDBMS to do the following:
>>>>>>> A) Read the 'dept.dname' index to find the location of the row(s)
>>>>>>> where "dname = 'SALES'"
>>>>>>> B) Fetch the matching row(s) from the 'dept' table, and save off
>>>>>>> the set of 'deptno' values
>>>>>>> C) Read the 'emp.deptno' index to find the locations of the
>>>>>>> matching rows in the 'emp' table
>>>>>>> D) Fetch the matching rows from the 'emp' table
>>>>>>> E) Sort the result set in-memory
>>>>>>> 2) Now, let's consider how to do this in MongoDB. For legibility,
>>>>>>> I'm going to use integers for the _id field, but everything I will do will
>>>>>>> work with ObjectIDs as well:
>>>>>>> To make this efficient, I'll create a secondary index on the 'dept'
>>>>>>> field in the 'emp' collection, and the 'name' field in the 'dept'
>>>>>>> collection:
I'm not sure I understand your issues here. You're right that doing the lookup on the client side is expensive. I don't understand your concerns about running multiple queries, though.
Here is some sample Python code: it only uses one database connection, and a maximum of two database cursors:
As I said before: this is no different than what an RDBMS engine would do. The only difference is that you're running the query on the client side, instead of having the server run it for you invisibly on the server side.
If you're running into a separate performance issue, then you'll need to diagnose that.
It's not clear from your application logs what language you're using, or how you've structured your code. If it doesn't look (roughly) like the sample above, then you're probably doing something inefficient.
On Tuesday, November 13, 2012 3:10:15 AM UTC-8, Brent Gracey wrote:
> Hi William - thanks again for the detailed reply
> I can see how doing this for a query that requires information for one > record is efficient, I however need to do a look up for multiple records.
> If I pull all the detail records in one query - then the client side > "detail" look up will be inefficient as there is no "index" concept in the > client side code.
> If I do a query to the db for each detail record then I get 100s of db > connections which is also a big overhead.
> So my understanding is I would need to update my data structure to address > this efficiently.
> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
>> Hi Brent!
>> Sorry for taking so long to get back to you.
>> The answer to your question is: the database is not going to be any more >> efficient than doing the queries client side -- IF you have designed your >> schema properly.
>> 1) Let's begin by looking at how you'd do this in SQL:
>> First off: this creates two indexes "behind your back": one on >> dept.deptno, one on emp.empid.
>> Second: consider how you would query these tables:
>> SELECT empid, ename FROM emp, dept >> WHERE >> dept.dname = 'SALES' >> JOIN >> dept.deptno = emp.deptno >> ORDER BY ename;
>> Note that unless you index 'emp.deptno', this query is going to be slow >> (since it will otherwise perform a full table scan on 'emp'). Unless you >> index 'dept.dname', it will have to perform a full table scan on 'dept'.
>> You're now at no less than FOUR indexes.
>> This query is going to force the RDBMS to do the following: >> A) Read the 'dept.dname' index to find the location of the row(s) where >> "dname = 'SALES'" >> B) Fetch the matching row(s) from the 'dept' table, and save off the >> set of 'deptno' values >> C) Read the 'emp.deptno' index to find the locations of the matching >> rows in the 'emp' table >> D) Fetch the matching rows from the 'emp' table >> E) Sort the result set in-memory
>> 2) Now, let's consider how to do this in MongoDB. For legibility, I'm >> going to use integers for the _id field, but everything I will do will work >> with ObjectIDs as well:
>> To make this efficient, I'll create a secondary index on the 'dept' field >> in the 'emp' collection, and the 'name' field in the 'dept' collection:
>> In terms of the actual work done, there are the same number of indexes >> involved, and the same number of operations involved as with the SQL >> example. The only difference is that I had to build one portion of the >> query in the client side instead of on the server side.
>> Let me know if you have further questions.
>> -William
>> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
>>> Hi - thanks for the comprehensive reply - I'm still working through most >>> of the links you have suggested. One thing I would like to clarify - for >>> your comment "suggest staying with the simple strategy of doing joins in >>> your client application. The database would have to do these anyway:"
>>> Yes the database application would have to join - but assuming it uses >>> an index, isn't the database operation not much more efficient than
>>> for all x in array X >>> Look up Y[x]
>>> My assumptions for this are:
>>> There is overhead in maintaining DB indexes - this has already been >>> "expended effort" at the time of getting data >>> Getting all the data from the DB to the client application loses the >>> index info >>> Repeating the effort of generating an index (not sure how many languages >>> would support that - any Scala specific thoughts welcome) is duplicated >>> effort >>> Not doing something to replicate index functionality will make the look >>> up in Y much less efficient.
>>> So I can't see how client side processing can be as efficient as getting >>> the db to do it - and as one scales - you either get to a point of having >>> to do a large data restructure or live with the extra overhead.
>>> Main point of the post if for people to point out if any of my >>> assumptions are incorrect.
>>> Thanks
>>> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
>>>> If a data set & use cases are focused on "relational" work - would that >>>> reduce the fit of Mongodb for the project? For example a "friendship" data >>>> set where questions like please get me the detailed info of users (held in >>>> person_info) who interacted with each other (held in person interaction) >>>> who bought product X ( held in product and purchase_history collections) at >>>> event Y.
>>>> Would it be possible to use a map reduce function which operates over >>>> multiple tables?
>>>> I guess foursquare have some way of answering those sorts of questions >>>> efficiently with Mongo!
Its a scala application - and I'm a few layers of abstraction above the actual db connection. But it comes down to /mongo-java-driver-2.6.5.jar!/com/mongodb/DBTCPConnector.class and if this is the right documentation http://www.mongodb.org/display/DOCS/Java+Driver+Concurrency "The Mongo object maintains an internal pool of connections to the database (default pool size of 10)"
On Wednesday, November 14, 2012 7:24:26 PM UTC, William Zola wrote:
> Hi Brent!
> I'm not sure I understand your issues here. You're right that doing the > lookup on the client side is expensive. I don't understand your concerns > about running multiple queries, though.
> Here is some sample Python code: it only uses one database connection, and > a maximum of two database cursors:
> As I said before: this is no different than what an RDBMS engine would > do. The only difference is that you're running the query on the client > side, instead of having the server run it for you invisibly on the server > side.
> If you're running into a separate performance issue, then you'll need to > diagnose that.
> It's not clear from your application logs what language you're using, or > how you've structured your code. If it doesn't look (roughly) like the > sample above, then you're probably doing something inefficient.
> Let me know if you have further questions.
> -William
> On Tuesday, November 13, 2012 3:10:15 AM UTC-8, Brent Gracey wrote:
>> Hi William - thanks again for the detailed reply
>> I can see how doing this for a query that requires information for one >> record is efficient, I however need to do a look up for multiple records.
>> If I pull all the detail records in one query - then the client side >> "detail" look up will be inefficient as there is no "index" concept in the >> client side code.
>> If I do a query to the db for each detail record then I get 100s of db >> connections which is also a big overhead.
>> So my understanding is I would need to update my data structure to >> address this efficiently.
>> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
>>> Hi Brent!
>>> Sorry for taking so long to get back to you.
>>> The answer to your question is: the database is not going to be any more >>> efficient than doing the queries client side -- IF you have designed your >>> schema properly.
>>> 1) Let's begin by looking at how you'd do this in SQL:
>>> First off: this creates two indexes "behind your back": one on >>> dept.deptno, one on emp.empid.
>>> Second: consider how you would query these tables:
>>> SELECT empid, ename FROM emp, dept >>> WHERE >>> dept.dname = 'SALES' >>> JOIN >>> dept.deptno = emp.deptno >>> ORDER BY ename;
>>> Note that unless you index 'emp.deptno', this query is going to be slow >>> (since it will otherwise perform a full table scan on 'emp'). Unless you >>> index 'dept.dname', it will have to perform a full table scan on 'dept'.
>>> You're now at no less than FOUR indexes.
>>> This query is going to force the RDBMS to do the following: >>> A) Read the 'dept.dname' index to find the location of the row(s) >>> where "dname = 'SALES'" >>> B) Fetch the matching row(s) from the 'dept' table, and save off the >>> set of 'deptno' values >>> C) Read the 'emp.deptno' index to find the locations of the matching >>> rows in the 'emp' table >>> D) Fetch the matching rows from the 'emp' table >>> E) Sort the result set in-memory
>>> 2) Now, let's consider how to do this in MongoDB. For legibility, I'm >>> going to use integers for the _id field, but everything I will do will work >>> with ObjectIDs as well:
>>> To make this efficient, I'll create a secondary index on the 'dept' >>> field in the 'emp' collection, and the 'name' field in the 'dept' >>> collection:
>>> In terms of the actual work done, there are the same number of indexes >>> involved, and the same number of operations involved as with the SQL >>> example. The only difference is that I had to build one portion of the >>> query in the client side instead of on the server side.
>>> Let me know if you have further questions.
>>> -William
>>> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
>>>> Hi - thanks for the comprehensive reply - I'm still working through >>>> most of the links you have suggested. One thing I would like to clarify - >>>> for your comment "suggest staying with the simple strategy of doing joins >>>> in your client application. The database would have to do these anyway:"
>>>> Yes the database application would have to join - but assuming it uses >>>> an index, isn't the database operation not much more efficient than
>>>> for all x in array X >>>> Look up Y[x]
>>>> My assumptions for this are:
>>>> There is overhead in maintaining DB indexes - this has already been >>>> "expended effort" at the time of getting data >>>> Getting all the data from the DB to the client application loses the >>>> index info >>>> Repeating the effort of generating an index (not sure how many >>>> languages would support that - any Scala specific thoughts welcome) is >>>> duplicated effort >>>> Not doing something to replicate index functionality will make the look >>>> up in Y much less efficient.
>>>> So I can't see how client side processing can be as efficient as >>>> getting the db to do it - and as one scales - you either get to a point of >>>> having to do a large data restructure or live with the extra overhead.
>>>> Main point of the post if for people to point out if any of my >>>> assumptions are incorrect.
>>>> Thanks
>>>> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
>>>>> If a data set & use cases are focused on "relational" work - would >>>>> that reduce the fit of Mongodb for the project? For example a "friendship" >>>>> data set where questions like please get me the detailed info of users >>>>> (held in person_info) who interacted with each other (held in person >>>>> interaction) who bought product X ( held in product and purchase_history >>>>> collections) at event Y.
>>>>> Would it be possible to use a map reduce function which operates over >>>>> multiple tables?
>>>>> I guess foursquare have some way of answering those sorts of questions >>>>> efficiently with Mongo!
I don't know Scala at all, but I believe your concern to be unfounded.
The Scala driver, like the Java driver, does *not* use a new connection for each query. If you've got a connection from the MongoConnection() method (or whatever your framework wraps around that structure), then you can have as many queries as you like multiplexed on that single connection.
In my code sample from before:
def query_dept_emp(db) :
> dept = db.dept > emp = db.emp > for ddoc in dept.find().sort("name",1) : > print "Employees in", ddoc["name"] > for edoc in emp.find( {"dept": ddoc["_id"] } ).sort("name",1) : > print " ", edoc["name"]
This code opens two cursors but only uses a single connection. The same would be true of the equivalent code in Scala or Java. The dept.find() opens a new cursor using the existing connection, and the emp.find() opens a different cursor using the same connection. Queries to the two different cursors can be interleaved on a single connection without any problem.
On Friday, November 16, 2012 4:22:26 AM UTC-8, Brent Gracey wrote:
> Hi,
> Its a scala application - and I'm a few layers of abstraction above the > actual db connection. But it comes down to > /mongo-java-driver-2.6.5.jar!/com/mongodb/DBTCPConnector.class and if this > is the right documentation > http://www.mongodb.org/display/DOCS/Java+Driver+Concurrency "The Mongo > object maintains an internal pool of connections to the database (default > pool size of 10)"
> On Wednesday, November 14, 2012 7:24:26 PM UTC, William Zola wrote:
>> Hi Brent!
>> I'm not sure I understand your issues here. You're right that doing the >> lookup on the client side is expensive. I don't understand your concerns >> about running multiple queries, though.
>> Here is some sample Python code: it only uses one database connection, >> and a maximum of two database cursors:
>> As I said before: this is no different than what an RDBMS engine would >> do. The only difference is that you're running the query on the client >> side, instead of having the server run it for you invisibly on the server >> side.
>> If you're running into a separate performance issue, then you'll need to >> diagnose that.
>> It's not clear from your application logs what language you're using, or >> how you've structured your code. If it doesn't look (roughly) like the >> sample above, then you're probably doing something inefficient.
>> Let me know if you have further questions.
>> -William
>> On Tuesday, November 13, 2012 3:10:15 AM UTC-8, Brent Gracey wrote:
>>> Hi William - thanks again for the detailed reply
>>> I can see how doing this for a query that requires information for one >>> record is efficient, I however need to do a look up for multiple records.
>>> If I pull all the detail records in one query - then the client side >>> "detail" look up will be inefficient as there is no "index" concept in the >>> client side code.
>>> If I do a query to the db for each detail record then I get 100s of db >>> connections which is also a big overhead.
>>> So my understanding is I would need to update my data structure to >>> address this efficiently.
>>> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:
>>>> Hi Brent!
>>>> Sorry for taking so long to get back to you.
>>>> The answer to your question is: the database is not going to be any >>>> more efficient than doing the queries client side -- IF you have designed >>>> your schema properly.
>>>> 1) Let's begin by looking at how you'd do this in SQL:
>>>> First off: this creates two indexes "behind your back": one on >>>> dept.deptno, one on emp.empid.
>>>> Second: consider how you would query these tables:
>>>> SELECT empid, ename FROM emp, dept >>>> WHERE >>>> dept.dname = 'SALES' >>>> JOIN >>>> dept.deptno = emp.deptno >>>> ORDER BY ename;
>>>> Note that unless you index 'emp.deptno', this query is going to be slow >>>> (since it will otherwise perform a full table scan on 'emp'). Unless you >>>> index 'dept.dname', it will have to perform a full table scan on 'dept'.
>>>> You're now at no less than FOUR indexes.
>>>> This query is going to force the RDBMS to do the following: >>>> A) Read the 'dept.dname' index to find the location of the row(s) >>>> where "dname = 'SALES'" >>>> B) Fetch the matching row(s) from the 'dept' table, and save off the >>>> set of 'deptno' values >>>> C) Read the 'emp.deptno' index to find the locations of the matching >>>> rows in the 'emp' table >>>> D) Fetch the matching rows from the 'emp' table >>>> E) Sort the result set in-memory
>>>> 2) Now, let's consider how to do this in MongoDB. For legibility, I'm >>>> going to use integers for the _id field, but everything I will do will work >>>> with ObjectIDs as well:
>>>> To make this efficient, I'll create a secondary index on the 'dept' >>>> field in the 'emp' collection, and the 'name' field in the 'dept' >>>> collection:
>>>> In terms of the actual work done, there are the same number of indexes >>>> involved, and the same number of operations involved as with the SQL >>>> example. The only difference is that I had to build one portion of the >>>> query in the client side instead of on the server side.
>>>> Let me know if you have further questions.
>>>> -William
>>>> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:
>>>>> Hi - thanks for the comprehensive reply - I'm still working through >>>>> most of the links you have suggested. One thing I would like to clarify - >>>>> for your comment "suggest staying with the simple strategy of doing joins >>>>> in your client application. The database would have to do these anyway:"
>>>>> Yes the database application would have to join - but assuming it uses >>>>> an index, isn't the database operation not much more efficient than
>>>>> for all x in array X >>>>> Look up Y[x]
>>>>> My assumptions for this are:
>>>>> There is overhead in maintaining DB indexes - this has already been >>>>> "expended effort" at the time of getting data >>>>> Getting all the data from the DB to the client application loses the >>>>> index info >>>>> Repeating the effort of generating an index (not sure how many >>>>> languages would support that - any Scala specific thoughts welcome) is >>>>> duplicated effort >>>>> Not doing something to replicate index functionality will make the >>>>> look up in Y much less efficient.
>>>>> So I can't see how client side processing can be as efficient as >>>>> getting the db to do it - and as one scales - you either get to a point of >>>>> having to do a large data restructure or live with the extra overhead.
>>>>> Main point of the post if for people to point out if any of my >>>>> assumptions are incorrect.
>>>>> Thanks
>>>>> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:
>>>>>> If a data set & use cases are focused on "relational" work - would >>>>>> that reduce the fit of Mongodb for the project? For example a "friendship" >>>>>> data set where questions like please get me the detailed info of users >>>>>> (held in person_info) who interacted with each other (held in person >>>>>> interaction) who bought product X ( held in product and purchase_history >>>>>> collections) at event Y.
>>>>>> Would it be possible to use a map reduce function which operates over >>>>>> multiple tables?
>>>>>> I guess foursquare have some way of answering those sorts of >>>>>> questions efficiently with Mongo!
>>>>>> (this is a shortened version of half of post