MongoDB, XFS, and SSDs


Greg Banks

Sep 19, 2016, 7:27:43 PM
to mongodb-user

We have run into an issue with XFS's FITRIM ioctl implementation (see: https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_discard.c#L155), which is used by the fstrim command (see: https://github.com/karelzak/util-linux/blob/master/sys-utils/fstrim.c#L87). When running against local SSDs it is severely impacting IO in general and MongoDB specifically.

Essentially, XFS iterates over every allocation group and issues TRIMs for all free extents every time this ioctl is called. Coupled with the fact that Linux's interface to the TRIM command is synchronous and does not support a vectorized list of ranges (see: https://github.com/torvalds/linux/blob/3fc9d690936fb2e20e180710965ba2cc3a0881f8/block/blk-lib.c#L112), this leads to a large number of extraneous TRIM commands (each of which has been observed to be slow, see: http://oss.sgi.com/archives/xfs/2011-12/msg00311.html) being issued to the disk for ranges that both the filesystem and the disk already know to be free. In practice, we have seen IO disruptions of up to 2 minutes. I realize that the duration of these disruptions may be controller-dependent; unfortunately, when running on a platform like AWS, one does not have the luxury of choosing specific hardware.
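
For reference, here is a minimal sketch of the call fstrim makes ("/data" below is just a placeholder for wherever the data volume is mounted); the kernel side of this, xfs_ioc_trim, is what walks the allocation groups described above:

/* Minimal sketch of the call fstrim makes: open the mount point and ask
 * the filesystem to discard all of its free space via the FITRIM ioctl.
 * On XFS the kernel side (xfs_ioc_trim in xfs_discard.c) then walks every
 * allocation group and issues the TRIMs synchronously, which is where the
 * stalls come from. "/data" is a placeholder for the data volume. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>               /* FITRIM, struct fstrim_range */

int main(void)
{
    struct fstrim_range range = {
        .start  = 0,
        .len    = UINT64_MAX,       /* trim the whole filesystem */
        .minlen = 0,                /* free extents of any size */
    };

    int fd = open("/data", O_RDONLY);
    if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
        perror("FITRIM");
        return 1;
    }
    /* On return the kernel sets range.len to the number of bytes trimmed. */
    printf("trimmed %llu bytes\n", (unsigned long long)range.len);
    return 0;
}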

EXT4, on the other hand, tracks blocks that have been deleted since the previous FITRIM ioctl and targets subsequent TRIMs at only the appropriate block ranges (see: http://blog.taz.net.au/2012/01/07/fstrim-and-xfs/). In real-world tests this significantly reduces the impact of fstrim, to the point that it is unnoticeable to the database / application. We are currently switching back to EXT4 as a result.

Alternatively, we could mount the filesystem with the discard option (as AWS suggests here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html); however, our confidence that this would perform better is not high, given XFS developer comments on the subject (see: http://oss.sgi.com/archives/xfs/2014-08/msg00465.html):

It was introduced into XFS as a checkbox feature. We resisted as
long as we could, but too many people were shouting at us that we
needed realtime discard because ext4 and btrfs had it. Of course,
all those people shouting for it realised that we were right in that
it sucked the moment they tried to use it and found that performance
was woeful. Not to mention that SSD trim implementations were so bad
that they caused random data corruption by trimming the wrong
regions, drives would simply hang randomly and in a couple of cases
too many trims too fast would brick them...

So, yeah, it was implement because lots of people demanded it, not
because it was a good idea.

I am aware that MongoDB strongly recommends using XFS (see: https://docs.mongodb.com/manual/administration/production-notes/#kernel-and-file-systems) and that this is because EXT4 journaling could impact WiredTiger checkpointing under heavy write load (https://groups.google.com/forum/#!msg/mongodb-user/diGdooN_2Sw/4H7t5JTDcpAJ). Can anybody elaborate on this? Is this the only concern that drove the strong recommendation to go with XFS, and, in MongoDB's opinion, is it still valid given the performance issues with TRIM on Linux when running XFS on SSDs? We are currently running the MMAPv1 storage engine on MongoDB 2.6 and, as mentioned above, we have reverted to EXT4 without apparent consequence. Any more information would really help us weigh the pros and cons as we work toward WiredTiger.

Also, any more general recommendations for mitigating the disruption incurred by running fstrim would be more than welcome.

John Murphy

Oct 16, 2016, 7:03:22 PM
to mongodb-user

Hi Greg,

> I am aware that MongoDB strongly recommends using XFS (see: https://docs.mongodb.com/manual/administration/production-notes/#kernel-and-file-systems) and that this is because EXT4 journaling could impact WiredTiger checkpointing under heavy write load (https://groups.google.com/forum/#!msg/mongodb-user/diGdooN_2Sw/4H7t5JTDcpAJ). Can anybody elaborate on this?

The recommendation to use XFS relates to our investigation on SERVER-18314 and similar performance issues reported with EXT4 (periodic stalls during WiredTiger checkpoints). However, as noted on SERVER-26131: if you have tested your workload and server configuration with WiredTiger on EXT4 and see better results than with XFS, you may choose to deploy differently.

> We are currently running the MMAPv1 storage engine on MongoDB 2.6 and, as mentioned above, we have reverted to EXT4 without apparent consequence.

For MMAPv1 we currently recommend either EXT4 or XFS. As per the MongoDB production notes, XFS generally performs better with MongoDB (including MMAPv1).

The production notes include recommendations based on aggregate user experience, but factors such as workload and server resources may result in a different outcome for your deployment.

The considerations and stalls you’ve seen as a result of filesystem TRIM support are definitely interesting, but we haven’t had any other reports to correlate this with yet. Can you share some more information on your AWS instance types and storage configuration?

Many thanks,
John Murphy

Sicabol

Oct 17, 2016, 12:39:35 PM
to mongodb-user
Sorry if the question is irrelevant, but I've always installed my servers with EXT4 partitions. I now want to install a new one on Ubuntu 16.04 with two partitions, one of them dedicated to /home. The WiredTiger files will be on /home. Should only /home be on XFS, or should both partitions be?

Thanks!

John Murphy

Oct 26, 2016, 2:59:48 AM
to mongodb-user

Hi Sicabol,

The MongoDB Production Notes state that XFS is generally the preferred filesystem if you are using the WiredTiger storage engine.

For your other partition, the one not hosting the WiredTiger files, you may use whichever filesystem you like.

Regards,
John Murphy

Patrig "Sicabol" Droumaguet

Oct 26, 2016, 4:25:01 AM
to mongod...@googlegroups.com

Hi John,

Perfect, thanks!



MARK CALLAGHAN

Oct 26, 2016, 11:03:04 AM
to mongod...@googlegroups.com
I am sorry to learn about your troubles with XFS and discard. It just works for me, so I don't have much experience with other filesystems, or with AWS.

I do a lot of performance evaluation for MongoDB & MySQL and then use MySQL in production. I always use XFS, and always with the discard option, for a variety of SSDs. I use a Samsung 850 EVO at home, but won't name the vendors I use at work. I have no experience with fstrim and strongly prefer to run with the discard option. It (XFS + discard) has always worked well, except for:
1) One storage device was too limited in the number of file drops per second it could do. RocksDB likes to drop files quickly when doing a lot of compaction.
2) Linux needs better QoS throttling to keep too many writes (or trims) from starving other requests. We use slowrm as a workaround - https://dom.as/2016/05/02/on-removing-files/

For problem #2, steady-state performance is OK. But if you do rm -rf /big/data on a directory with too much data, then there might be lousy response time from the device for seconds or tens of seconds.
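
Roughly, the slowrm idea is to shrink a big file in chunks with pauses before finally unlinking it, so the free-space work (and any discards) trickles out instead of hitting the device all at once. A minimal sketch of that approach (not the actual tool; the chunk size and pause length are arbitrary):

/* Minimal sketch of the slowrm idea (not the actual tool): instead of
 * unlink()ing a huge file in one shot, shrink it from the end in fixed
 * chunks with a short pause in between, so the filesystem frees extents
 * (and the device sees discards) gradually rather than all at once. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define CHUNK    (256LL * 1024 * 1024)  /* shrink 256 MB per step (arbitrary) */
#define PAUSE_US 100000                 /* sleep 100 ms between steps (arbitrary) */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }

    for (off_t size = st.st_size; size > 0; ) {
        size = (size > CHUNK) ? size - CHUNK : 0;
        if (ftruncate(fd, size) < 0) {
            perror("ftruncate");
            return 1;
        }
        usleep(PAUSE_US);
    }

    close(fd);
    return unlink(argv[1]);             /* file is empty now; removal is cheap */
}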

AFAIK the ext-X family of filesystems still suffers from per-inode mutexes. Sadly this isn't widely known, but it should only be an issue for direct IO, and WT uses buffered IO.
https://www.facebook.com/notes/mysql-at-facebook/xfs-ext-and-per-inode-mutexes/10150210901610933/
https://news.ycombinator.com/item?id=2683862

I am not sure about ext-4, but ext-3 suffered a lot from stalls for log files -- write to end of file, eventually call fsync. I was able to switch from ext-3 to XFS many years ago and never had to re-evaluate ext-4. But ext-X hasn't been good to me in the past for IO-heavy database workloads.




