MongoDB chunk migration - "Steps" detailed explanation (in Changelog)

JOE YU

Sep 9, 2015, 1:36:01 PM
to mongod...@googlegroups.com
Hello list,

I am analyzing a chunk migration performance problem in a MongoDB (v2.4.6) sharded cluster.
After checking the changelog collection of the config database, I found some useful information that can help me figure out which detailed step takes the longest.
First I checked the "moveChunk.from" entry and found the following result:
    "step1 of 6" : 0,
                "step2 of 6" : 196,
                "step3 of 6" : 18234,
                "step4 of 6" : 1551243,
                "step5 of 6" : 262,
                "step6 of 6" : 0
        }
It seems step4 (F4, which spans the recipient-side steps T1-T5) used a lot of time, so I then checked the corresponding "moveChunk.to" entry to figure out which of T1-T5 consumed that time. Here is the result:
    "step1 of 5" : 2,
                "step2 of 5" : 0,
                "step3 of 5" : 1448987,
                "step4 of 5" : 0,
                "step5 of 5" : 741
It seems step3 (T3) is the most time-consuming operation, so I tried to understand what "step3 of 5" (T3) is actually doing.
I have checked http://docs.mongodb.org/manual/core/sharding-chunk-migration/#chunk-migration but there is no detailed information about "step3 of 5".
I then found a similar post on SO: http://stackoverflow.com/questions/18103220/mongodb-chunk-migration-steps-in-changelog-collection , but it still does not have enough detail.
Could anyone give me a clue where I can find an accurate description of "step3 of 5" and the other "stepX of 6" / "stepY of 5" entries, or provide an answer here?
I have also checked the source code of MongoDB's chunk migration, and the corresponding code (from d_migrate.cpp) is below:
    // 3. initial bulk clone
    state = CLONE;
    while ( true ) {
        BSONObj res;
        if ( ! conn->runCommand( "admin" , BSON( "_migrateClone" << 1 ) , res ) ) { // gets array of objects to copy, in disk order
            state = FAIL;
            conn.done();
            return;
        }
        BSONObj arr = res["objects"].Obj();
        int thisTime = 0;
        BSONObjIterator i( arr );
        while ( i.more() ) {
            BSONObj o = i.next().Obj();
            {
                // write each cloned document locally, retrying on page faults
                PageFaultRetryableSection pgrs;
                while ( 1 ) {
                    try {
                        Lock::DBWrite lk( ns );
                        Helpers::upsert( ns, o, true );
                        break;
                    }
                    catch ( PageFaultException& e ) {
                        e.touch();
                    }
                }
            }
            thisTime++;
            numCloned++;
            clonedBytes += o.objsize();
            if ( secondaryThrottle ) {
                // wait for each insert to replicate to at least 2 nodes
                if ( ! waitForReplication( cc().getLastOp(), 2, 60 /* seconds to wait */ ) ) {
                    // wait timed out after 60 seconds; continue anyway
                }
            }
        }
        if ( thisTime == 0 )
            break; // _migrateClone returned no more documents, clone finished
    }
It seems T3 has three potentially time-consuming parts:
1) reading data from the F(rom) side (network I/O)
2) inserting into the T(o) side (disk I/O)
3) dealing with the "secondaryThrottle"-related replication wait
So in my case, does ["step3 of 5" : 1448987] mean a lot of disk I/O (or poor I/O)? (I did not observe heavy network I/O, and I have set secondaryThrottle=false; see the check below.)
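(secondaryThrottle is a balancer setting stored in the config database; this is roughly how I understand it can be verified in 2.4 - the _secondaryThrottle field name is my assumption:)

    use config
    db.settings.find({ _id: "balancer" })
    // assumed output when throttling is off: { "_id" : "balancer", "_secondaryThrottle" : false }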
Thanks,
Joe

Asya Kamsky

Sep 9, 2015, 6:29:45 PM
to mongodb-user
I'm not sure why the SO answer(s) are not adequate - of course the
source code is the canonical reference for where the time is being spent.

A couple of clarifications though: the time is recorded in
milliseconds. So when you say "consume crazy time" and show 1,551,243,
that's about 1,551 seconds, or roughly 26 minutes. Depending on how many
documents are in your chunk, and whether secondary throttle is on, that
could be pretty close to the expected amount of time to find and copy
all of those documents from the donor shard to the recipient shard.

Step 3 is not the "deal with secondary throttle" part; step 3 is the
actual copying of the documents: the reading on the "from" shard and
the writing on the "to" shard.
So if your from shard is very busy and doesn't have enough RAM for the
working set *plus* this chunk, then the reading will be very slow and
have to wait for lots of disk IO. If the "to" shard is very busy
then the writing will compete with all the writing that's already
"normally" going on on this shard.

I would look at the loads on the shards and see if you simply don't
have enough capacity for migration on top of your normal operations.
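For example (just a sketch, run on each shard's primary during the migration window; exact thresholds depend on your hardware, and mongostat against both shards will show much the same thing):

    // rising page faults => working set plus the chunk doesn't fit in RAM
    db.serverStatus().extra_info.page_faults
    // readers/writers queued behind the lock => the shard is already saturated
    db.serverStatus().globalLock.currentQueue
    // how long background flushes to disk are taking
    db.serverStatus().backgroundFlushing.average_ms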

Asya

Joe,Yu

Sep 14, 2015, 11:26:32 AM
to mongod...@googlegroups.com
Thanks Asya for your valuable information.

Does the total time for "step3 of 5" also include updating the indexes after each object is copied and written to the destination shard?

From MongoDB's source code (v2.4.5):

before the "3. initial bulk clone" step
there are

 "0. copy system.namespaces entry if collection doesn't already exist" (start from line 1621 of  d_migrate.cpp)
 "1. copy indexes" (start from line 1637 of d_migrate.cpp)

Within step "1. copy indexes", the destination shard clones the index structures from the source shard.

What I am trying to understand is: in step 3 ("3. initial bulk clone"), after each object is written to the destination shard, does that also trigger an index update (or re-create)? If so, I should consider the total time of step 3 as actually made up of three parts: 1) the time to copy data from the source shard, 2) the time to write to the destination shard, and 3) the time to update the indexes immediately after each object is written to the shard.

Thanks,

Joe

Asya Kamsky

Sep 15, 2015, 12:25:42 PM
to mongodb-user
> Does the total time for "step3 of 5" also include updating the indexes after each object is copied and written to the destination shard?

The index already exists before step 3 starts running, so inserting each record includes updating the appropriate indexes.

So the way you described it is accurate.
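
You can see that for yourself: while a chunk is migrating in (or right after), the recipient shard already lists the same indexes as the donor. Something like this on the recipient's primary (placeholder names):

    use mydb
    db.mycoll.getIndexes()   // should match the indexes on the donor shard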

Asya


Joe,Yu

Sep 16, 2015, 2:00:36 AM
to mongod...@googlegroups.com
Thank you very much Asya!


--
jOe