Some sharding questions


Keith Branton

Dec 13, 2010, 12:06:17 AM
to mongod...@googlegroups.com
Hi,

My 3 GridFS boxes are 64-bit CentOS running MongoDB 1.6.3.

I have a shard set using several replica sets of 2 mongod processes, each having a dedicated 2TB hard disk. If I want to add some 3TB hard disks to the shard set, will sharding use all the available space? Is it smart enough to realize that the 3TB disks can hold an extra TB?

I want to set up a batch job to monitor disk utilization of my shard set (so I know when to add more disks). Can you suggest a good way to measure that?

Is there a way to do a repair using a different disk (if all 2TB are already allocated to mongo data)? Can I have a spare disk somewhere on my network and use that for repairs, or should I just drop the data files and rebuild using --fastsync?

Tuning: is there any advantage to pre-allocating an entire 2TB disk to mongo (thinking I could maybe prevent the pauses caused by allocation)? If so, is there an optimally efficient way to lay out the data files, or can I use a single file?

Sorry if these questions have already been answered in the docs and I didn't find them. 

Thanks in advance. 

Keith.

Andreas Jung

Dec 13, 2010, 12:13:37 AM
to mongod...@googlegroups.com

Keith Branton wrote:

>
> Tuning: Is there any advantage to pre-allocating an entire 2TB disk to
> mongo (thinking I could maybe prevent the pauses caused by allocating).

Use a filesystem like ext4 with "instant" pre-allocation (through
fallocate()).
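A quick way to confirm the data filesystem really does instant preallocation before pointing mongod at it (the /data path is a placeholder for wherever your dbpath is mounted):

```shell
# Verify that the filesystem honors fallocate() instantly.
# On ext4/XFS this returns immediately; on ext3 it falls back to
# writing zeros and takes minutes for a large file.
fallocate -l 2G /data/prealloc-test
ls -lh /data/prealloc-test    # shows a 2.0G file
rm /data/prealloc-test
```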

-aj


Alvin Richards

Dec 13, 2010, 3:31:32 AM
to mongodb-user
If you want to have one disk available for repairs, create the
file-system (e.g. ext4 or XFS) on it and mount it on the host where you
want to run the repair. You can then pass the "--repairpath path"
argument to mongod, setting the path to the disk you just mounted.
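A rough sketch of that setup (the device, mount point, and dbpath names here are made up; substitute your own):

```shell
# Format and mount a spare disk to hold the repair's temporary copy.
mkfs.xfs /dev/sdx1            # /dev/sdx1 is a placeholder device
mkdir -p /mnt/repair
mount /dev/sdx1 /mnt/repair

# Point mongod's repair scratch space at the spare disk.
mongod --dbpath /data/db --repair --repairpath /mnt/repair
```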

Disk utilization: use Munin, Nagios, Cacti or other system monitoring
or alerting tools. MongoDB plugins exist for the three tools I
mentioned.
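If the plugins don't cover it, a minimal df-based check can be dropped into cron (the /data/shard* pattern is a placeholder for your data mounts):

```shell
#!/bin/sh
# Print % used for each MongoDB data filesystem; warn past a threshold.
THRESHOLD=80
df -P /data/shard* | awk -v t="$THRESHOLD" 'NR > 1 {
    pct = substr($5, 1, length($5) - 1) + 0   # strip trailing "%"
    printf "%s: %d%% used\n", $6, pct
    if (pct > t) printf "WARNING: %s is above %d%%\n", $6, t
}'
```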

Andreas is correct; you need to use a file-system that implements the
POSIX file allocation methods (e.g. ext4 or XFS). That way you will
always get fast file allocation when MongoDB allocates the next file.

-Alvin


Keith Branton

Dec 13, 2010, 11:07:27 AM
to mongod...@googlegroups.com
Thanks Andreas and Alvin for your answers.

Looks like we need to reformat in ext4 (I'm pretty sure we are using ext3 at present). Here's another thought: would having one file per drive somehow be better than having 1000 files per drive (i.e. using 2GB chunks) from an efficiency/open-file-handles point of view, or would it be worse? I have 24 drive bays on these boxes, so when they get fully populated it seems like they would need 24,000 (or more if I start using 3TB drives) file handles just to access the database.
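On the file-handle worry, a quick sanity check (the arithmetic below is illustrative, not a mongod-reported figure):

```shell
# Soft limit on open files for processes started from this shell;
# with thousands of 2GB data files you may need to raise "nofile"
# in /etc/security/limits.conf.
ulimit -n

# Rough count of 2GB data files a fully populated 24-bay box of
# 2TB (~2000GB) disks could need:
echo $(( 24 * 2000 / 2 ))    # → 24000
```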

I did a search for monitoring tools for Munin, Nagios and Cacti (and also Ganglia, as that is what I'm currently running), but none of the ones I found appear to report the % of available storage currently in use, which is really what I'd need:

http://tag1consulting.com/blog/mongodb-cacti-graphs
https://github.com/quiiver/mongodb-ganglia
http://www.google.com/search?aq=f&sourceid=chrome&ie=UTF-8&q=nagios+m...
https://github.com/erh/mongo-munin

Am I missing any? I suppose if I do stick with the 2GB files then I could simply monitor the free space on the file systems, which would be close enough.

Still keen to get an answer to sharding with different disk sizes question.

Thanks. Keith.

Alvin Richards

Dec 13, 2010, 6:33:31 PM
to mongodb-user
The standard Munin plugin will show disk utilization, so if you map a
device and file-system for use by MongoDB then this is easy to
visualize.

Given the number of volumes you have, I would stripe them together into
a single file-system. That means that if one DB file gets "hot" you
have aggregated the IOPS of all the devices... a single disk cannot
sustain high read and write loads for any length of time.

The trouble with larger disks is that they can only sustain the same
number of IOPS as a smaller-sized disk. The net effect is that you get
fewer IOPS per GB of data... size your DB for your apps' workload and
the available IOPS, not just capacity...
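To put rough numbers on that (the ~100 IOPS figure is a typical 7200rpm-class assumption, not a measurement):

```shell
# IOPS per GB for a ~100 IOPS spindle at two capacities:
awk 'BEGIN {
    printf "2TB disk: %.3f IOPS/GB\n", 100 / 2000   # → 0.050
    printf "3TB disk: %.3f IOPS/GB\n", 100 / 3000   # → 0.033
}'
```

So the same workload spread over 3TB disks leaves roughly a third less I/O headroom per GB stored.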

-Alvin

Keith Branton

Dec 13, 2010, 11:21:51 PM
to mongod...@googlegroups.com
Alvin,

I suspect our Ganglia installation can be configured to do that too.

As you bring it up, my web site is a social network site, and I am using GridFS for photos. I anticipate having lots of photos but, like most social networks, also a fairly small working set - probably considerably less than 5% of the storage, with that percentage shrinking as volume and users increase. I am using 24-port Areca RAID cards, so arranging the drives in stripe sets is possible and would be fairly easy to do later if I start experiencing problems. In fact I originally planned 6 stripes of 4 drives each, but my sysadmin has serious reservations about rebuilding 8TB volumes when I have to replace a disk.

As I don't delete or update photos, I don't anticipate the write load for a single db getting that hot, and the more shards in the shard set, the more the write load should be distributed across many disks without the need for striping. I believe this is reasonably well designed for the workload of my app, but I've assumed only time will tell. If you see a major flaw in my reasoning, please let me know.

Any thoughts on the mixed volume size question?

Thanks for your input,

Keith.

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.


Eliot Horowitz

Dec 14, 2010, 2:13:04 AM
to mongod...@googlegroups.com
Balancing isn't terribly smart when it comes to disk right now and
generally assumes each shard has the same capabilities.
We'll be enhancing this over time, but for now if you can keep shards
consistent, that would be better.

Keith Branton

Dec 14, 2010, 9:58:12 AM
to mongod...@googlegroups.com
Thanks Eliot - is there a Jira issue I should vote for that covers this? I anticipate it being a big benefit to me (and to others doing projects like mine on a budget) to be able to buy progressively bigger disks as they become available and as they are required.

Eliot Horowitz

Dec 14, 2010, 10:03:43 AM
to mongod...@googlegroups.com
Not sure there is a case just for that - feel free to add an explicit one.

Keith Branton

Dec 14, 2010, 10:35:08 AM
to mongod...@googlegroups.com