MongoDB sharding: migration is slow

389 views

Azat Khuzhin

May 3, 2012, 8:16:20 AM
to mongod...@googlegroups.com
Hi all

I have a collection with ~165M documents, and I started sharding it, but it is going very slowly.

I have the following configuration:
 First machine:
  mongod, mongos, mongod --configsvr
 Second machine:
  mongod

There are ~3120 chunks, and moveChunk takes about 350-400 secs per chunk (so moving all chunks would take ~10 days); the speed between the shards is about 70 MiB/s.
The shard key is "{key: 1, _id: 1}", where "key" is an md5 hash and "_id" is an ObjectId.
An index on "{key: 1, _id: 1}" exists.

I suspected this was because of concurrent reads/writes to the db (I use a 2.1 development build, which does not yet fully support per-db locking), but when I stopped reading from and writing to the db, the speed did not increase.

Why is it so slow? And can I manually run "sh.moveChunk()" to migrate in parallel?

step1 (copy indexes) - fast
step2 (delete any data already in range) took ~10 secs
step3 (initial bulk clone) took ~128 secs
step4 (do bulk of mods) - fast
step5 (wait for commit) took ~200 secs
total ~338 secs
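The ~10-day figure can be sanity-checked with quick arithmetic (a rough sketch; it assumes chunks migrate strictly one at a time, which is how the balancer behaved in this era):

```python
# Back-of-the-envelope total migration time, assuming sequential moves.
chunks = 3120
secs_low, secs_high = 350, 400   # observed per-chunk moveChunk time
secs_per_day = 86400

days_low = chunks * secs_low / secs_per_day
days_high = chunks * secs_high / secs_per_day
print(f"{days_low:.1f}-{days_high:.1f} days")  # → 12.6-14.4 days
```

That is the upper bound for moving every chunk; if the balancer only needs to move about half of them to even out two shards, the estimate halves to roughly 6-7 days, bracketing the ~10 days observed.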

I use a development version:
$ git describe
r2.1.0-2093-g31bdfd7

Azat Khuzhin

May 3, 2012, 9:37:45 AM
to mongod...@googlegroups.com
And on both servers there is a RAID10 array of 6 SAS HDDs.

Azat Khuzhin

May 3, 2012, 10:01:35 AM
to mongod...@googlegroups.com
BTW, both machines have 48 GiB of memory.

And one interesting detail:

"show dbs" shows that my db is 173.8232421875 GB

And "/proc/`pgrep mongod`/io" shows
read_bytes: 1014962442240 (945 GiB)
write_bytes: 568515821568 (529 GiB)

How can it have written 529 GiB if the whole db is only 173 GiB?

The mongod daemon has been running for about 2 days, no more.
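The write volume can be put in perspective with quick arithmetic (an illustrative sketch; treating the reported size as binary gigabytes is an assumption):

```python
# Compare bytes written by mongod (from /proc/<pid>/io) with the db size
# reported by "show dbs", over ~2 days of uptime.
db_size_bytes = 173.8232421875 * 2**30   # assuming the reported "GB" is GiB
written_bytes = 568_515_821_568          # write_bytes from /proc/<pid>/io

ratio = written_bytes / db_size_bytes
print(f"mongod wrote ~{ratio:.1f}x the db size in 2 days")  # → ~3.0x
```

A ~3x figure over two days is not outlandish once journaling (which writes data twice) and index maintenance are counted.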


Eliot Horowitz

May 3, 2012, 11:06:27 AM
to mongod...@googlegroups.com
How big are your documents?
Can you send iostat 0x 2 output as well.
> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mongodb-user/-/2X8IxhWvpyMJ.
>
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mongodb-user?hl=en.

Azat Khuzhin

May 3, 2012, 11:50:47 AM
to mongod...@googlegroups.com

What does 0x2 mean?

Azat Khuzhin.
Sent from phone.

Eliot Horowitz

May 3, 2012, 11:51:32 AM
to mongod...@googlegroups.com
iostat -x 2, sorry

Azat Khuzhin

May 3, 2012, 1:17:35 PM
to mongod...@googlegroups.com
"avgObjSize" : 680.34193653284
$ iostat -x 2 (from the first machine, where mongos runs)
Linux 2.6.32-5-amd64 (na4)      05/03/2012      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.59    0.00    0.20    1.21    0.00   98.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.07   228.55   65.51  121.88  1900.34  1633.66    37.72     4.66   24.88    5.70   35.19   0.63  11.79

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.36    0.00    2.27    4.94    0.00   85.43

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.50     0.00  570.50    9.50 13560.00   620.00    48.90     3.17    5.70    5.78    1.26   1.61  93.60

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.20    0.00    1.24    5.25    0.00   90.30

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    16.00 1349.00    9.50 14784.00  1768.00    24.37     5.24    3.83    3.84    3.16   0.70  95.60

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.59    0.00    1.02   13.50    0.00   82.89

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   448.50  580.00  885.00  7002.00  5990.00    17.74   133.02   89.86    8.90  142.92   0.68  99.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.56    0.00    0.99    6.45    0.00   90.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    90.50 1007.00  281.50 10314.00  3174.00    20.94    29.29   23.81    5.62   88.87   0.74  95.80

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.83    0.00    0.98    6.03    0.00   90.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.50    85.50 1188.00   57.50 12788.00  2178.00    24.03     9.63    7.74    4.18   81.15   0.77  95.80

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.39    0.00    1.09    6.25    0.00   90.27

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   280.00  935.00  266.50 10620.00  3594.00    23.66    41.56   34.59    5.73  135.84   0.81  97.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.26    0.00    1.14    4.96    0.00   90.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     9.00 1345.50   18.00 13276.00  2152.00    22.63     5.54    4.08    4.11    1.67   0.69  94.20

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.88    0.00    0.71    5.95    0.00   91.45

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   110.00 1261.00   50.50 13502.00  2302.00    24.10     7.75    5.91    4.01   53.23   0.74  96.80

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.21    0.00    1.19    6.19    0.00   89.40

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   130.50 1105.00  113.50 17318.00  2772.00    32.97    35.13   28.83    4.19  268.69   0.79  96.20

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.12    0.00    0.75    5.36    0.00   91.77

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   137.00 1649.50   67.00 15358.00  3774.00    22.29     6.87    4.00    2.91   30.90   0.54  93.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.33    0.00    1.02   10.39    0.00   85.27

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   938.50  334.00 1353.50  6394.00 10196.00    19.66   146.55   80.64   13.68   97.17   0.59 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.24    0.00    1.15   10.10    0.00   85.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   384.50  569.50  670.00 10012.00  5076.00    24.35   147.15  121.32    7.92  217.71   0.81 100.00
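For scale, the avgObjSize above and the default 64 MB chunk size give a rough per-chunk document count and clone rate (an approximation; chunks are split on size estimates and are rarely exactly full):

```python
avg_obj_size = 680.34193653284   # bytes, reported above
chunk_bytes = 64 * 2**20         # default 64 MB chunk size
step3_secs = 128                 # "initial bulk clone" time reported earlier

docs_per_chunk = chunk_bytes / avg_obj_size
clone_rate = docs_per_chunk / step3_secs
print(f"~{docs_per_chunk:.0f} docs per chunk, ~{clone_rate:.0f} docs/sec cloned")
```

Cloning on the order of a hundred thousand small documents per chunk, at a few hundred documents per second, points toward random-read cost rather than network bandwidth as the bottleneck.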
Azat Khuzhin

Azat Khuzhin

May 4, 2012, 4:18:49 AM
to mongod...@googlegroups.com
Avg time to moveChunk has grown to ~450 secs
--
Azat Khuzhin

Azat Khuzhin

May 5, 2012, 6:24:08 AM
to mongod...@googlegroups.com
I'm sorry for putting all the iostat output in the previous message.

I ran some tests on an Amazon m1.small instance, which has the following configuration:
1.7 GB memory
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
160 GB instance storage
64-bit platform

And it gave the following results:

No concurrent reads/writes in any test

On 40 million rows, sharded by "_id":
step3 ~ X, 7, 28, 7, 7 secs
step5 ~ 53, 97, 70, 43, 55 secs

iostat -x 2: http://pastebin.com/BHiEw5LP

On 1 million rows, sharded by "_id":
step3 ~  8  9  9 secs
step5 ~ 12 15 13 secs

Did not run iostat

On 1 million rows, sharded by "{key: 1, _id: 1}", where key is an md5 hash (i.e., a 32-char string):
step3 ~ 14  8  7 secs
step5 ~ 36 36 40 secs

Did not run iostat

And we can see here that moveChunk time increases if the key is "complicated" or we have a "large" (I don't really think that 40 million documents is huge) number of documents.
--
Azat Khuzhin

Greg Studer

May 9, 2012, 10:50:22 PM
to mongodb-user
> And we can see here that moveChunk time increases if the key is "complicated"
> or we have a "large" (I don't really think that 40 million documents is huge)
> number of documents

I suspect this is just to do with working set - with a larger number
of documents, the documents to move are more scattered on disk
(there's no guarantee they're ordered like the shard key index). With
a larger key size, the index will be larger (2x?) and so less likely
to be in memory. The filesystem will also have a significant impact -
what filesystem are you using?
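The key-size point can be roughed out by comparing the raw key bytes per index entry (a simplified approximation; real B-tree entries add per-entry overhead, which is why the on-disk growth is closer to the "2x?" above):

```python
# Raw BSON key material per index entry (ignores B-tree per-entry overhead).
objectid_bytes = 12                              # an ObjectId is 12 bytes
md5_hex_bytes = 32                               # "key" is a 32-char hex string
compound_bytes = md5_hex_bytes + objectid_bytes  # {key: 1, _id: 1}

ratio = compound_bytes / objectid_bytes
print(f"compound key holds ~{ratio:.1f}x the key bytes of _id alone")  # → ~3.7x
```

More key bytes per entry means fewer entries per index page, so a smaller fraction of the index stays resident in 1.7 GB of RAM.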


Azat Khuzhin

May 10, 2012, 2:54:17 AM
to mongod...@googlegroups.com
Filesystem: ext3 (RAID10 of 6 SAS HDDs), but I don't think the filesystem can have a significant impact, because we don't have a huge number of files or anything like that.

Anyway, if the speed is always like this with a huge number of documents, you can't use it in production.

Also, when I cancelled sharding, I ran "removeshard", which is also very slow: https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/R7XS1tDME8g
But when I stopped that process and migrated the data myself (using find()/insert()/remove()), it was much faster than this command (~20-50x, or more than 50x), so I guess that running a query to migrate all my data, instead of "moveChunk", would be ~20-50x faster (or more than 50x).
I understand that "moveChunk" does some other work, but not that much.

And can you explain to me why it writes so many bytes to disk? It seems like it rewrites the whole database every week or something like that.

Azat Khuzhin

Greg Studer

May 10, 2012, 1:28:55 PM
to mongodb-user
You should move to ext4 and/or xfs - there's definitely an impact, particularly when migrating data to new shards, which is almost always something you do because of data growth. It's in the Production Notes: www.mongodb.org/display/DOCS/Production+Notes.

Not sure what happened earlier with removeShard, but it seems like
your migrations got hung up on something else - hard to say though
without logs there.

> I understand that "moveChunk" does some other work, but not that much
In general, we try to be as low-impact as possible with moveChunk,
yielding whenever possible, which sacrifices speed.

> And can you explain to me why it writes so many bytes to disk? It seems
> like it rewrites the whole database every week or something like that.
Hard to say without knowing more about your traffic and schema. Multi-
key indices, frequent updates to indexed fields, journaling, etc, can
cause lots of repeated I/O. You can also change your disk flush
parameters if you'd like with syncdelay, if you have lots of updates
to the same data - this may also impact step5 of the migrate.


Azat Khuzhin

May 10, 2012, 1:44:16 PM
to mongod...@googlegroups.com

Azat Khuzhin.
Sent from phone.

On May 10, 2012 9:29 PM, "Greg Studer" <gr...@10gen.com> wrote:
>
> You should move to ext4 and/or xfs - there's definitely impact,
> particularly when migrating data to new shards, which almost always is
> something you do because of data growth.  It's in the
> www.mongodb.org/display/DOCS/Production+Notes.

I'll check in an hour what fs is on the Amazon EC2 instances.
Ext3 on my dedicated server.

>
> Not sure what happened earlier with removeShard, but it seems like
> your migrations got hung up on something else - hard to say though
> without logs there.
>
> > I understand that "moveChunk" does some other work, but not that much
> In general, we try to be as low-impact as possible with moveChunk,
> yielding whenever possible, which sacrifices speed.

Is it possible to add a config option for this?

>
> > And can you explain to me why it writes so many bytes to disk? It seems
> > like it rewrites the whole database every week or something like that.
> Hard to say without knowing more about your traffic and schema.  Multi-
> key indices, frequent updates to indexed fields, journaling, etc, can
> cause lots of repeated I/O.  You can also change your disk flush
> parameters if you'd like with syncdelay, if you have lots of updates

Many inserts. But thanks for the hint, I'll try it.

> to the same data - this may also impact step5 of the migrate.

I wrote before that I tried stopping all other queries, and it didn't help.

Azat Khuzhin

May 10, 2012, 2:03:59 PM
to mongod...@googlegroups.com
> I'll check in an hour what fs is on the Amazon EC2 instances.
> Ext3 on my dedicated server.
On Amazon EC2 it's ext3 too.

Yes, I know that ext4 has extents, so preallocation works faster with it.
But I remember that when I copied all the data to my (current) dedicated server from another one, the replication took ~24 hours.
So I don't think preallocation is what slows down the migration.

BTW, the chunk size is 64 MB, and if preallocation were the slowdown, it wouldn't slow down every moveChunk command, because the preallocation size is ~2 GB.
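That reasoning checks out numerically (a sketch; the ~2 GB figure assumes full-size mmapv1 data files once the database has grown):

```python
chunk_mb = 64         # default chunk size
prealloc_mb = 2048    # ~2 GB data-file preallocation

chunks_per_file = prealloc_mb // chunk_mb
print(f"at most one preallocation per ~{chunks_per_file} chunk migrations")  # → 32
```

So even if each preallocation were slow, it could only hit roughly one migration in thirty-two, not every moveChunk.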



--
Azat Khuzhin

Greg Studer

May 11, 2012, 5:38:42 PM
to mongodb-user
> Is it possible to add a config option for this?
Yeah, we're working on this, though ideally it would be as seamless as possible.

One way forward here would be to get a timed iostat for step 5 and correlate it with the log timing, to see definitively whether it's I/O or something else causing the problem. If it's not I/O, it could potentially be config server negotiation taking a long time for some reason - though we'd need the full logs of all involved shards and the config server to dig deeper.


Azat Khuzhin

May 12, 2012, 5:45:58 AM
to mongod...@googlegroups.com
I didn't have the opportunity to measure iostat.

But I can say that iowait wasn't greater than 1 during the 2 days the migration was running.
But when I run some MR job, it can go up to 15.

So I think the migration does not use all available resources the way the MR job in this example does.



--
Azat Khuzhin

Greg Studer

May 14, 2012, 10:45:29 PM
to mongodb-user
If you want to track this deeper, feel free to open a SUPPORT or SERVER ticket with logs from the migration periods - I think there's too much context here to handle via the newsgroup. We're happy to help you track this down further, but we'd need to start correlating the ops in the logs and seeing where the delay is actually detected, which is probably easier to do via a SUPPORT ticket.

Azat Khuzhin

May 23, 2012, 10:25:19 AM
to mongod...@googlegroups.com
I created a ticket in your Jira to do some tests with sharding.




--
Azat Khuzhin