MongoDB Journal on a separate disk?


Abhi

Jul 16, 2014, 3:03:20 AM
to mongod...@googlegroups.com
Hi,
I am trying to explore whether I can put the journal files on a separate disk from the data files, and how that would affect my MongoDB setup.
Can someone please highlight the benefits of doing this?  Is there any reason why I wouldn't want to separate my journal and data files?

Thanks,
Abhi

Stephen Steneker

Jul 16, 2014, 5:51:03 AM
to mongod...@googlegroups.com
On Wednesday, 16 July 2014 19:03:20 UTC+12, Abhi wrote:
I am trying to explore whether I can put the journal files on a separate disk from the data files, and how that would affect my MongoDB setup.
Can someone please highlight the benefits of doing this?  Is there any reason why I wouldn't want to separate my journal and data files?

Hi Abhi,

The main benefit of separating the journal volume from the data volume is to remove write contention (if you are seeing this as an issue and your deployment is I/O bound). The journal has different access patterns than the data files, so you could provision it on a separate, lower-IOPS volume if you are using a hosting service like AWS. For example, have a look at how the volumes on the MongoDB AMIs are configured: https://aws.amazon.com/marketplace/pp/B00CO7GSTY/.
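
As far as I know there is no mongod setting for the journal location, so the usual way to separate it is to mount the second volume and symlink the journal directory out of your dbpath. A rough sketch, with illustrative device names and paths (stop mongod first; the owning user/group varies by distro):

 $ sudo mkdir /journal
 $ sudo mount /dev/xvdf /journal              # dedicated journal volume
 $ sudo mv /data/db/journal /journal/journal  # relocate the existing journal
 $ sudo ln -s /journal/journal /data/db/journal
 $ sudo chown -R mongodb:mongodb /journal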

If you separate your journal and data volumes, a significant caveat is that you can no longer use a filesystem or EC2 snapshot to get a consistent backup, so this will likely change/complicate your backup strategy.

For example, compare the instructions at http://docs.mongodb.org/manual/tutorial/backup-with-filesystem-snapshots/. If you do not have the journal and data on the same volume, you will need to fsyncLock your database to quiesce writes: http://docs.mongodb.org/manual/tutorial/backup-with-filesystem-snapshots/#create-backups-on-instances-that-do-not-have-journaling-enabled.
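
The locked sequence looks roughly like this (a minimal sketch: db.fsyncLock()/db.fsyncUnlock() are the documented commands, the rest is illustrative):

 $ mongo --eval "db.fsyncLock()"     # flush pending writes and block new ones
 $ # ... snapshot both the data and journal volumes here ...
 $ mongo --eval "db.fsyncUnlock()"   # resume writes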

Unless you have a strong motivation for separating the journal volume, I would recommend keeping data + journal on the same volume, as the snapshot-based backup strategy is more straightforward and less operationally disruptive.

Regards,
Stephen

Abhi

Jul 16, 2014, 7:06:15 AM
to mongod...@googlegroups.com
Thanks for the reply, Stephen.
What would be the recommendation when using MongoDB with NFS? Should I keep the data files and journal files separate? Would it have any impact on performance?
In this post William points out that NFS should not be used with MongoDB owing to performance issues caused by the remapping of all data files 10 times per second. Will the performance benefit (if any) gained by separating out the journal files balance out the penalty due to the remapping of data files (IIUC, remapping is triggered by the journal flush)?


Thanks,
Abhi

Stephen Steneker

Jul 16, 2014, 5:31:58 PM
to mongod...@googlegroups.com
On Wednesday, 16 July 2014 23:06:15 UTC+12, Abhi wrote:
Thanks for the reply, Stephen.
What would be the recommendation when using MongoDB with NFS? Should I keep the data files and journal files separate? Would it have any impact on performance?
In this post William points out that NFS should not be used with MongoDB owing to performance issues caused by the remapping of all data files 10 times per second. Will the performance benefit (if any) gained by separating out the journal files balance out the penalty due to the remapping of data files (IIUC, remapping is triggered by the journal flush)?

Hi Abhi,

NFS does have significant performance/usage caveats, as detailed by William Zola and in our production notes: http://docs.mongodb.org/manual/administration/production-notes/

For NFS in particular, our recommendation is to separate the journal from the data volume (and, if possible, to put the journal on a local or iSCSI volume rather than NFS). There are also recommended NFS mount options in the production notes link above.
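
For reference, an /etc/fstab entry using the bg, nolock, and noatime options from the production notes would look something like this (the server name and paths are illustrative):

 filer:/export/mongodb  /data/db  nfs  bg,nolock,noatime  0 0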

Another general recommendation for reducing I/O contention is to keep your logs on a separate volume.
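
For example, along these lines (an illustrative 2.6-style mongod.conf, with each path on its own volume):

 dbpath = /data/db                          # data files
 logpath = /var/log/mongodb/mongod.log      # logs on their own volume
 logappend = true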

I wouldn't consider separating the journal a "balance" for the NFS performance penalty, but rather a best practice if you must use NFS (which is discouraged for all the reasons detailed by William Zola in the post you referenced).

Regards,
Stephen

Asya Kamsky

Jul 16, 2014, 7:17:47 PM
to mongodb-user

Just to emphasize what's already been said about NFS:

Don't use NFS for data files, and REALLY *never* put the journal on NFS.

Asya


Abhi

Jul 17, 2014, 3:14:51 AM
to mongod...@googlegroups.com
Thanks Stephen and Asya.

I understand the caveats associated with using NFS with MongoDB. I just want to understand a few more things about separating the journal and data files.
Say I have the data files on NFS and the journal on a local disk. Then:
1. How will it improve performance? As per my understanding (from reading William's post), journal operations will still cause remapping 10 times per second between the shared view and the private view, which will still cause a lot of data transfer over NFS, and thus (in my understanding) no performance improvement. I would appreciate it if you could provide some insight into this.

2. How will this affect my backup strategy? On NFS I can take timely snapshots, but I read that without the journal they may not be consistent. Is it worth sacrificing a simple backup strategy for an insignificant performance gain?

3. Are there any other caveats/issues associated with having the journal on separate storage from the data files? And why is it considered a best practice?

Thanks,
Abhi

Abhi

Jul 18, 2014, 9:23:42 AM
to mongod...@googlegroups.com
Humble ping!

Thanks,
Abhi

Asya Kamsky

Jul 18, 2014, 2:27:56 PM
to mongodb-user
Abhi,

Probably the best thing would be for you to benchmark the performance of your sample workload on different configurations and see what you get.

Frankly, I don't think you're giving enough weight to what William said ("I'd never ever put any data I cared about on an NFS filesystem -- no matter what database I was using.")

Is this critical data that you plan on storing in this DB, or is it data that can easily be recreated from another data source?

Asya




William Zola

Jul 19, 2014, 12:47:27 PM
to mongod...@googlegroups.com
Hi Abhi!

Number one: don't use NFS!  If you've got a filer, you can use either iSCSI or FCoE, and neither of them will have these problems.

Number two: don't use NFS in a way that's visible to MongoDB.  In an absolute pinch, you can mount a loopback filesystem whose backing file is stored on the NFS device:

 # create a 500 GB backing file on the NFS mount
 $ dd if=/dev/zero bs=1G count=500 of=/nfs/mongodb/mongofile.0
 # attach it as a loop device and put a local filesystem on it
 $ sudo losetup /dev/loop0 /nfs/mongodb/mongofile.0
 $ sudo mkfs.ext4 /dev/loop0
 # mount it where mongod expects its dbpath
 $ sudo mkdir /data/db
 $ sudo mount /dev/loop0 /data/db

Number three: the point of separating the journal and data directories is to maximize write throughput in an environment where disk I/O write capacity is the limiting factor.    

In general, when optimizing a system, it only makes sense to optimize where the bottleneck is.  If you add capacity someplace that's not the bottleneck, you won't improve system performance.  

There are some environments where disk write capacity is the bottleneck.  These are write-heavy environments, typically using spinning disk.  You can tell you're in one of those environments by looking at the output of 'iostat -xmt' -- '%util', 'await', and 'avgqu-sz' will be high, and 'wMB/s' will be at or near the rated capacity of the disk device.
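
For example, sampling every 5 seconds (the columns to watch are the ones named above):

 $ iostat -xmt 5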

If you're in such an environment, moving the journal files to a separate *physical* device -- NOT a separate logical device -- can help overall performance.  This works because you now have two physical devices performing your write operations.  

For example, if in your environment both the data and journal files are written to a single spinning disk rated at a maximum of 50 MBps for random I/O, that might limit the number of writes you can do per second.  If you then changed your environment so that the data files used the original hard disk but the journal files used a separate physical disk, you would have a maximum of 100 MBps of I/O capacity -- twice as much -- and disk I/O might no longer be a bottleneck.  (At the very least, it will bottleneck at a higher number of Mongo operations per second.)

In addition, moving the journal to a separate physical device helps because the type of I/O is different between the journal and the data files.  In MongoDB, the journal file I/O is entirely sequential, while the data file I/O is random.  On spinning disk, there is a large performance difference between these two types of I/O -- a high-end disk will top out at 50 MBps for random I/O, but can sustain 150 MBps for sequential I/O.

Putting the journal files on a separate device will not only off-load that I/O from the device handling the data files, it will also change the nature of the journal device's workload from mixed random-and-sequential to entirely sequential.  This in turn means that journal writes are faster, since they don't have to contend with the random writes to the data files.

Remember that this will only improve things if you have a write-heavy environment AND disk write speed is the bottleneck for your system!  It is incorrect to think of separate journal and data volumes as a "best practice".  Think of it more as a common optimization for a particular type of workload.  It will have *no* effect if disk write load is not your bottleneck.

Number four: if you have your journal files and data files on different devices, and you're using filesystem snapshots to back up your data, you must change your backup procedures.  Stephen has mentioned this to you already and has given you links: please re-read the third paragraph of his response above.

Number five: If you're concerned about performance, the most likely bottleneck is the filer itself.  MongoDB at scale can overload any filer or SAN I've ever run across.  The problem is worse if the filer/SAN is shared among multiple applications.

If you're looking for "best practice" for performance for MongoDB disk subsystems, the answer is simple:  SSD as DAS.  Always.  Every time.  

Let me know if you have any questions.  

 -William

Abhi

Jul 21, 2014, 7:00:43 AM
to mongod...@googlegroups.com

Thanks for the reply, William and Asya -- I really appreciate it.

William,

I want to clarify my understanding. In your earlier posts you said that MongoDB remaps its data files using mremap() 10 times per second, which causes a lot of data transfer over the network, hurting performance when using NFS. As I understand from http://www.kchodorow.com/blog/2012/10/04/how-mongodbs-journaling-works/, the journal flush causes the remapping between the shared view and the private view of the data files in RAM. IIUC from the earlier-mentioned posts, the data files will be transferred over the network again for this remapping because NFS is stateless. Am I correct? And this will happen no matter where the journal is. That is, even if the journal is on a local disk and the data files are on NFS, the same remapping issues will apply.

As I understand it, this problem is specific to NFS and will not occur with iSCSI or FCoE. Please comment if there is something incorrect in my understanding above.

I understand that NFS + MongoDB should be avoided. I am trying to convince my sysadmins that even with the data files on NFS and the journal on a local disk (SSD), the same problems exist and there won't be any significant performance gains unless we avoid NFS.

Confirmation of the above remapping issues would help. Are there any other issues I could run into if I separate the journal and data files onto separate physical disks? I couldn't find anything via Google search and would like to know any caveats before trying this out.

Once again thanks and really appreciate your patience in answering my queries.

Thanks,

Abhi

William Zola

Jul 21, 2014, 2:07:44 PM
to mongod...@googlegroups.com

Hi Abhi!

Thanks for your follow-up.  Let me address your questions in line.

> The journal flush causes the remapping between the shared view and the private view of the data files in RAM. IIUC from the earlier-mentioned posts, the data files will be transferred over the network again for this remapping because NFS is stateless. Am I correct?

I've been looking into this a bit more deeply.  What we know for a fact is that if you have data files stored on NFS and journaling enabled, you'll see massive data transfer rates, as if the entire set of data files were being shipped over the network multiple times.  We also know for a fact that if you disable journaling, this behavior stops.  (Don't try that one at home -- running without journaling is a recipe for data loss!)

However, nobody I can find has actually figured out *why* NFS has this behavior.  Our current working theory is that the double mapping of the data files (one mapping for the shared view and one for the private view) confuses the NFS client into thinking that two processes are accessing the same files.  On this theory, the NFS cache gets invalidated at each remapping, so the NFS client re-requests the files each time.

But to be clear -- nobody has diagnosed this definitively.  All we have is a theory.

> And this will happen no matter where the journal is. That is, even if the journal is on a local disk and the data files are on NFS, the same remapping issues will apply.

There's a piece of MongoDB folklore that says you can alleviate the NFS problems by putting the journal on local storage.  I cannot find anyone who has actually tested this.  To test it, you'd have to build an NFS-based system which *did* experience the problem (meaning it would have to be running a significant load -- see my previous comment in a different thread for details) and then move the journal files to local storage.  Depending on whether or not moving the journal files fixed the high network overhead, we'd know whether the folklore is correct.
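
If anyone does run that experiment, a rough way to observe the effect (the tool choice here is just a suggestion) is to watch network throughput and NFS client operation counts while under load:

 $ sar -n DEV 5   # per-interface network throughput, sampled every 5 seconds
 $ nfsstat -c     # cumulative NFS client operation counts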

My personal opinion -- absolutely unsupported by any evidence one way or another -- is that moving just the journal file would not help, for exactly the reason you state.   I look forward to someone someday doing the experiment, at which point we will all know.  


> As I understand it, this problem is specific to NFS and will not occur with iSCSI or FCoE. Please comment if there is something incorrect in my understanding above.

This is completely correct.  The reason is that NFS sits at a different layer of the stack: NFS implements a networked *filesystem*, while iSCSI and FCoE implement networked *storage*, with the filesystem running locally.

> that even with the data files on NFS and the journal on a local disk (SSD), the same problems exist and there won't be any significant performance gains unless we avoid NFS.

As discussed above, I believe this to be the case, but nobody I know of has properly tested this hypothesis.

I'd like to note that even if moving the journal to local disk fixed the "massive network traffic" problem, it wouldn't do *anything* about the other known problems with NFS storage -- entire data files disappearing, corruption within the data files themselves, etc.

> Are there any other issues I could run into if I separate the journal and data files onto separate physical disks?

The only issue with separating the journal and data files onto separate disks is that you can no longer take a consistent filesystem snapshot without using db.fsyncLock().

I appreciate that you're fighting a technical battle with your sysadmins.  One way you could win it is to run the tests yourself and let your sysadmins see the results.  If you're looking for something that can load up a cluster, Socialite [https://github.com/10gen-labs/socialite] can probably generate more load than you need.

Please let me know if you have further questions.  Good luck!

 -William 