Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion any alternative to client side processing in place of what would be a RDMS join?
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Brent Gracey  
View profile  
 More options Nov 14 2012, 5:51 am
From: Brent Gracey <brentgra...@gmail.com>
Date: Wed, 14 Nov 2012 02:51:17 -0800 (PST)
Local: Wed, Nov 14 2012 5:51 am
Subject: Re: [mongodb-user] Re: any alternative to client side processing in place of what would be a RDMS join?

Hi - so I done some logging to work out what is taking the time

First, a big correction - I'm actually hitting the db 2000 times, not 200,
I forgot that 1800 records don't actually do anything in this report, but I
have to go look for a detailed record for them before I know that.

Logging from my application code

09:27:31.409 [qtp741338313-114 - /man/json/counts] DEBUG
bootstrap.liftweb.RequestAnalyzer - Total request time on /json/counts: 0 ms
09:27:31.421 [qtp741338313-114 - /man/json/counts] DEBUG
c.b.sojo_manager.api.RestAPI - Starting to generate report
09:27:32.446 [qtp741338313-114 - /man/json/counts] DEBUG
c.b.sojo_manager.api.RestAPI - Got all users from the database -> This is
the first select that returns 2000 users
09:27:41.531 [qtp741338313-114 - /man/json/counts] DEBUG
c.b.sojo_manager.api.RestAPI - Have looped through all users and checked
which ones have devices with connections -> This is after looping through
all 2000 and doing a detail lookup in mongo, which takes about 10 seconds.
09:27:41.719 [qtp741338313-114 - /man/json/counts] DEBUG
c.b.sojo_manager.api.RestAPI - Have created the from to mapping for the
force diagram
09:27:41.719 [qtp741338313-114 - /man/json/counts] DEBUG
bootstrap.liftweb.RequestAnalyzer - Total request time on /json/counts:
10301 ms

I enabled  db.setProfilingLevel(2) - and after capturing commands for just
this report

> db.system.profile.find().count()

2845

Then I ran a summary - to just group everything and give a total "db" run
time - which comes in at 1794 milli sec, so I don't think the main part of
the 10 seconds is db execution. I am working on refining the application
logging to get a better view of what components of the query are taking the
remaining 8 seconds.

db.system.profile.group(
{ cond: {   }
, key: {client: "127.0.0.1"}
, initial: {count: 0, total_time:0}  
, reduce: function(doc, out){ out.count++; out.total_time+=doc.millis }
, finalize: function(out){ out.avg_time = out.total_time / out.count }

} );

OUTPUT

[
{
"client" : "127.0.0.1",
"count" : 2845,
"total_time" : 1794,
"avg_time" : 0.6305799648506151

}

]
On Tuesday, November 13, 2012 2:22:58 PM UTC, Sammaye wrote:

> At 200 records it should be giving 0 overhead.

> You are right that the initial post was a little tl;dr. I definitely
> missed it.

> So let's start fresh: if you turn on logging for MongoDB and check the log
> for slow queries do you see any?

> On 13 November 2012 13:39, Brent Gracey <brent...@gmail.com <javascript:>>wrote:

>> Thanks for the reply,

>> I know very well that joins in SQL can have their own nuances, but feel
>> that is a separate discussion. And yes, connection pooling will reduce the
>> overhead of multiple db queries, but it is still an overhead.

>> My current situation is I am doing multiple detail queries - and the
>> report is taking about 30 seconds to complete for  ~200 records, and my
>> boss wants it to go faster, and I don't know how to that without telling
>> him we need to re-architect our data structure.

>> I haven't got the exact breakdown of what is taking the time - so can't
>> tell you how much "overhead" the multiple db connections is introducing,
>> but I'll work on some logging to break down exactly where time is being
>> spend in the report.

>> Thanks

>> On Tuesday, November 13, 2012 12:32:19 PM UTC, Sammaye wrote:

>>> "You have got to consider that in MySQL"

>>> Sorry I mean any SQL tech there.

>>> On 13 November 2012 12:31, Sam Millman <sam.m...@gmail.com> wrote:

>>>> "If I do a query to the db for each detail record then I get 100s of db
>>>> connections which is also a big overhead."

>>>> You only get one, and it should be a persistent one.

>>>> "If I pull all the detail records in one query - then the client side
>>>> "detail" look up will be inefficient as there is no "index" concept in the
>>>> client side code."

>>>> Indexes on JOINs are a questionable matter, I have seen many many times
>>>> in SQL where a simple JOIN has resulted in the index not being used for
>>>> some unknown reason even if that JOIN is specifically defined within the
>>>> internals of MySQL (which is what I was using at the time) using pk and fk
>>>> format, most specifically on non-standard ranges of data. You have also
>>>> gotta consider that JOINs can become a pain to scale in certain situations.

>>>> You have got to consider that in MySQL the client side "detail" lookup
>>>> for each record would be slow because of what it takes to make a SQL query
>>>> and result set for every row instead of a large JOINed result set and most
>>>> likely not because of the "index" concept.

>>>> MongoDB is quite different in how it queries, it is used to querying in
>>>> this way.

>>>> There is one caveat you will get into soon however, which I think needs
>>>> fixing, which is that MongoDB has no server-side sub selects, so if you
>>>> wanna do an $in based on another query you must pull the data over the wire
>>>> and operate on it client side even if you don't need those initial rows,
>>>> which is a little painful.

>>>> On 13 November 2012 11:10, Brent Gracey <brent...@gmail.com> wrote:

>>>>> Hi William - thanks again for the detailed reply

>>>>> I can see how doing this for a query that requires information for one
>>>>> record is efficient, I however need to do a look up for multiple records.

>>>>> If I pull all the detail records in one query - then the client side
>>>>> "detail" look up will be inefficient as there is no "index" concept in the
>>>>> client side code.

>>>>> If I do a query to the db for each detail record then I get 100s of db
>>>>> connections which is also a big overhead.

>>>>> So my understanding is I would need to update my data structure to
>>>>> address this efficiently.

>>>>> On Monday, October 29, 2012 5:40:26 PM UTC, William Zola wrote:

>>>>>> Hi Brent!

>>>>>> Sorry for taking so long to get back to you.  

>>>>>> The answer to your question is: the database is not going to be any
>>>>>> more efficient than doing the queries client side -- IF you have designed
>>>>>> your schema properly.

>>>>>> 1) Let's begin by looking at how you'd do this in SQL:

>>>>>> CREATE TABLE dept (deptno NUMBER CONSTRAINT pk_dept PRIMARY KEY,
>>>>>>         dname VARCHAR2(30) );
>>>>>> CREATE TABLE emp (
>>>>>>         empid NUMBER CONSTRAINT pk_emp PRIMARY KEY,
>>>>>>         ename VARCHAR(20),
>>>>>>         deptno NUMER CONSTRAINT fk_deptno FOREIGN KEY REFERENCES
>>>>>> dept(deptno) );

>>>>>> First off: this creates two indexes "behind your back": one on
>>>>>> dept.deptno, one on emp.empid.

>>>>>> Second: consider how you would query these tables:

>>>>>>  SELECT empid, ename FROM emp, dept
>>>>>>     WHERE
>>>>>>         dept.dname = 'SALES'
>>>>>>     JOIN
>>>>>>         dept.deptno = emp.deptno
>>>>>>     ORDER BY ename;

>>>>>> Note that unless you index 'emp.deptno', this query is going to be
>>>>>> slow (since it will otherwise perform a full table scan on 'emp').  Unless
>>>>>> you index 'dept.dname', it will have to perform a full table scan on 'dept'.

>>>>>> You're now at no less than FOUR indexes.

>>>>>> This query is going to force the RDBMS to do the following:
>>>>>>   A) Read the 'dept.dname' index to find the location of the row(s)
>>>>>> where "dname = 'SALES'"
>>>>>>   B) Fetch the matching row(s) from the 'dept' table, and save off
>>>>>> the set of 'deptno' values
>>>>>>   C) Read the 'emp.deptno' index to find the locations of the
>>>>>> matching rows in the 'emp' table
>>>>>>   D) Fetch the matching rows from the 'emp' table
>>>>>>   E) Sort the result set in-memory

>>>>>> 2) Now, let's consider how to do this in MongoDB.  For legibility,
>>>>>> I'm going to use integers for the _id field, but everything I will do will
>>>>>> work with ObjectIDs as well:

>>>>>> > db.dept.save( {_id: 1, name: 'Sales'} );
>>>>>> > db.dept.save( {_id: 2, name: 'Marketing'} );
>>>>>> > db.dept.save( {_id: 3, name: 'Support'} );

>>>>>> > db.emp.save( { _id: 1, name: 'Ben', dept: 1 } );
>>>>>> > db.emp.save( { _id: 2, name: 'William', dept: 3 } );
>>>>>> > db.emp.save( { _id: 3, name: 'Jenna', dept: 3 } );
>>>>>> > db.emp.save( { _id: 4, name: 'Steven', dept: 3 } );

>>>>>> To make this efficient, I'll create a secondary index on the 'dept'
>>>>>> field in the 'emp' collection, and the 'name' field in the 'dept'
>>>>>> collection:

>>>>>> > db.emp.ensureIndex( dept: 1 );
>>>>>> > db.dept.ensureIndex( name: 1 );

>>>>>> By default, there are already indexes on "dept._id" and "emp._id".

>>>>>> To find all of the employees in the "Support" department, I need to
>>>>>> do two queries:

>>>>>>     > result = db.dept.findOne({name: 'Support'}, {_id:1} );
>>>>>>     { "_id" : 3 }
>>>>>>     > desired_dept = result["_id"];
>>>>>>     3
>>>>>>     > db.emp.find( { dept: desired_dept }).sort({name:1}).pretty();
>>>>>>     { "_id" : 3, "name" : "Jenna", "dept" : 3 }
>>>>>>     { "_id" : 4, "name" : "Steven", "dept" : 3 }
>>>>>>     { "_id" : 2, "name" : "William", "dept" : 3 }

>>>>>> In terms of the actual work done, there are the same number of
>>>>>> indexes involved, and the same number of operations involved as with the
>>>>>> SQL example.  The only difference is that I had to build one portion of the
>>>>>> query in the client side instead of on the server side.

>>>>>> Let me know if you have further questions.

>>>>>>  -William

>>>>>> On Thursday, October 18, 2012 3:19:54 PM UTC-7, Brent Gracey wrote:

>>>>>>> Hi - thanks for the comprehensive reply - I'm still working through
>>>>>>> most of the links you have suggested. One thing I would like to clarify -
>>>>>>> for your comment "suggest staying with the simple strategy of doing joins
>>>>>>> in your client application.  The database would have to do these anyway:"

>>>>>>> Yes the database application would have to join - but assuming it
>>>>>>> uses an index, isn't the database operation not much more efficient than

>>>>>>> for all x in array X
>>>>>>>     Look up Y[x]

>>>>>>> My assumptions for this are:

>>>>>>> There is overhead in maintaining DB indexes - this has already been
>>>>>>> "expended effort" at the time of getting data
>>>>>>> Getting all the data from the DB to the client application loses the
>>>>>>> index info
>>>>>>> Repeating the effort of generating an index (not sure how many
>>>>>>> languages would support that - any Scala specific thoughts welcome) is
>>>>>>> duplicated effort
>>>>>>> Not doing something to replicate index functionality will make the
>>>>>>> look up in Y much less efficient.

>>>>>>> So I can't see how client side processing can be as efficient as
>>>>>>> getting the db to do it - and as one scales - you either get to a point of
>>>>>>> having to do a large data restructure or live with the extra overhead.

>>>>>>> Main point of the post if for people to point out if any of my
>>>>>>> assumptions are incorrect.

>>>>>>> Thanks

>>>>>>> On Sunday, October 7, 2012 11:06:38 AM UTC+1, Brent Gracey wrote:

>>>>>>>> If a data set & use cases are focused on "relational" work - would
>>>>>>>> that reduce the fit of Mongodb for the project? For example a "friendship"
>>>>>>>> data set where questions like please get me the detailed info of users
>>>>>>>> (held in person_info) who interacted with each other (held in person
>>>>>>>> interaction) who bought product X ( held in product and purchase_history
>>>>>>>> collections) at event Y.

>>>>>>>> Would it be possible to use a map reduce function which operates
>>>>>>>> over multiple tables?

>>>>>>>> I guess foursquare have some way of answering those sorts of
>>>>>>>> questions efficiently with  Mongo!

>>>>>>>> (this is a shortened version of half of post https://groups.google.
>>>>>>>> **com**/forum/?fromgroups=#!**searchin/**mongodb-user/**
>>>>>>>> guidance/mongodb-**user/**dDEynRftg8E/IluWzJcABwQJ<https://groups.google.com/forum/?fromgroups=#!searchin/mongodb-user/g...>
>>>>>>>>  **- maybe the length of that post was too much!)

>>>>>>>> Regards,
>>>>>>>> Brent

>>>>>>>  --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "mongodb-user" group.
>>>>> To post to this group, send email to mongod...@googlegroups.com

>>>>> To unsubscribe from this group, send email to
>>>>> mongodb-user...@**googlegroups.com

>>>>> See also the IRC channel -- freenode.net#mongodb

>>>  --
>> You received this message because you are subscribed to the Google
>> Groups "mongodb-user" group.
>> To post to this group, send email to mongod...@googlegroups.com<javascript:>
>> To unsubscribe from this group, send email to
>> mongodb-user...@googlegroups.com <javascript:>
>> See also the IRC channel -- freenode.net#mongodb


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.