Flush to Disk decreases operation throughput

Francesco Rivola

Jul 9, 2018, 7:03:52 PM
to mongodb-user
Hi All,

We are running MongoDB 3.4.9 using WiredTiger Storage Engine. 
  • Windows Server 2016 in Azure
  • 64GB of RAM
  • Disk 
    • SSD Premium
    • 5000 IOPS
    • 200 MB/s
    • 1 TB size

A couple of weeks ago we started to have more traffic in our production application, and we also started to experience the following issue:
  • in mongostat we discovered that, right after the disk flush, we got a high peak in disk I/O, and for 10 or more seconds the stats showed 0 operations (0 insert, 0 query, 0 update, 0 delete, etc.)
After some reading and investigation we decided to try decreasing the WiredTiger cache from 31GB to 1GB. The result was impressive and the operation throughput issue has been almost completely resolved.
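
For reference, a minimal sketch of how we applied that change from the mongo shell (the permanent setting would instead go in mongod.conf as storage.wiredTiger.engineConfig.cacheSizeGB):

    // Runtime equivalent of storage.wiredTiger.engineConfig.cacheSizeGB;
    // takes effect without a restart.
    db.adminCommand({
      setParameter: 1,
      wiredTigerEngineRuntimeConfig: "cache_size=1G"
    })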

The dirty cache is now around 7-10%, while before it was constantly increasing up to 2.5-3% at the moment of the disk flush.

We would like to ask the following:
  • What could be the side effects of our change to the WiredTiger cache?
  • Do we have other alternatives to avoid the high disk I/O at flush time? For example:
    • Increase page eviction thread config min and max
    • Change the checkpoint interval to be lower than 60 seconds
  • Is this normal behavior, or could it be due to some specific application requirements? e.g. heavy write load vs. read load (I have attached a screenshot of mongostat, taken after our WiredTiger cache change to 1GB, to give an understanding of our load)

Please, let me know if you need more information or if you need further clarification.

Thank you so much.

Best Regards,
Francesco Rivola
MongoStat.png

Kevin Adistambha

Jul 19, 2018, 2:34:34 AM
to mongodb-user

Hi Francesco

It seems to me that the hardware can barely cope with the workload you’re asking it to do. It appears that reducing the size of the WiredTiger cache basically allows you to throttle your workload so that the disk can handle the load.

Specifically to answer your questions:

What could be the side effects of our change to the WiredTiger cache?

A smaller WiredTiger cache means that there are fewer documents and indexes that WiredTiger can handle within a particular time. The WiredTiger cache forms the working memory of WiredTiger, containing the uncompressed documents and indexes that comprise your working set.

Thus, a small WiredTiger cache will typically be detrimental to a mostly-read workload. This is because WiredTiger needs to load and uncompress documents not in its cache before it is able to process them. This means that you’ll be hitting disk a lot more often.

However, this depends on your use case. In a write-heavy workload, this usually has less impact, since your bottleneck would be how fast your disk can process the writes you’re telling it to do.
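
If you want to keep an eye on this over time, a quick sketch in the mongo shell that reads the cache statistics from db.serverStatus() (the exact statistic names may vary slightly between versions):

    var c = db.serverStatus().wiredTiger.cache;
    var max   = c["maximum bytes configured"];
    var used  = c["bytes currently in the cache"];
    var dirty = c["tracked dirty bytes in the cache"];
    // Percentages roughly comparable to the used%/dirty% columns in mongostat
    print("cache used: " + (100 * used / max).toFixed(1) + "%, " +
          "dirty: " + (100 * dirty / max).toFixed(1) + "%");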

Increase page eviction thread config min and max

Since it appears that your workload is disk-bound, increasing the number of eviction threads would have minimal impact. The additional threads would just be sitting idle waiting for disk and not being productive.
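
For completeness, if you do want to experiment with this, the eviction thread pool can be adjusted at runtime through the WiredTiger configuration string; a sketch, with the thread counts here being example values only:

    db.adminCommand({
      setParameter: 1,
      wiredTigerEngineRuntimeConfig: "eviction=(threads_min=4,threads_max=8)"
    })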

Change the checkpoint interval to be lower than 60 seconds

This may help to spread out the disk writes and allow WiredTiger to write less data but more often to disk (again, depending on your specific use case). I would encourage you to do your own testing regarding this parameter.
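
If you decide to test this, note that with WiredTiger the checkpoint interval follows the syncdelay setting (storage.syncPeriodSecs, default 60 seconds); a sketch of changing it at runtime, assuming syncdelay is adjustable via setParameter on your version and using 30 seconds purely as an example:

    db.adminCommand({ setParameter: 1, syncdelay: 30 })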

Is this normal behavior, or could it be due to some specific application requirements?

This is normal behaviour. As mentioned above, your disk does not appear to be fast enough for your use case. More concurrency (e.g. more clients) could make this worse. Having said that, one thing you can try is to separate your dbpath and your journal onto separate physical disks. This way, WiredTiger journal writes would not compete with data writes. Please see the Production Notes section for recommendations.

Finally, for production environments, it is strongly recommended to deploy a replica set with a minimum of 3 data-bearing nodes. Please see the Production Notes for more details.

Best regards
Kevin

Francesco Rivola

Jul 19, 2018, 6:51:02 AM
to mongodb-user
Hi Kevin,

Thank you very much for your response. Really appreciated.

This helps a lot to clarify the issue, and we will study how to approach your suggestions.
 
I have just a few more questions related to your answer:

"reducing the WiredTiger cache is throttling our workflow". Does this happen as the default 5% eviction_dirty_target is now reached, so the eviction thread starts to write to disk reducing the amount of work that need to be done in the checkpoint? Is it right? 
If it is right, could be tuning the eviction_dirty_target and trigger parameters another approach to minimize the checkpoint issue?

What could be the factor that limits the disk in the checkpoint scenario? The 200 MB/s throughput or the 5000 IOPS? Or both? I am asking in case we decide to vertically scale the hardware.

Thank you so much.

Best Regards,
Francesco Rivola

Kevin Adistambha

Jul 22, 2018, 10:31:43 PM
to mongodb-user

Hi Francesco

“Reducing the WiredTiger cache is throttling our workload.” Does this happen because the default 5% eviction_dirty_target is now reached, so the eviction threads start to write to disk, reducing the amount of work that needs to be done in the checkpoint? Is that right?

Correct. The full explanation of these tunables is described in the Cache and eviction tuning page, specifically under the heading Eviction tuning. By default (in MongoDB 3.4) this value is set to 5% of the WiredTiger cache size (see https://github.com/mongodb/mongo/blob/r3.4.9/src/third_party/wiredtiger/dist/api_data.py#L426).

I believe the mechanism at work when lowering the WiredTiger cache from 31GB to 1GB is that it allows the disk to keep up with writing dirty data. 5% of 31GB is ~1.5GB, while 5% of 1GB is ~50MB, a much smaller amount of data to write. In other words, when your cache size is 31GB, you write faster to memory but also wait longer for the writes to be flushed to disk, leading to “stalls”. When this value is lowered to 1GB, the flushes are smaller but more regular and faster, so the “stalls” are smoothed out over time. The overall time it takes to write the data should be about the same, since the workload appears to be disk-bound.

What could be the factor that limits the disk in the checkpoint scenario? The 200 MB/s throughput or the 5000 IOPS? Or both?

I think it’s both, since what you described so far sounds like the disk is struggling to fulfil the work required of it. Larger throughput and IOPS should generally give you better performance.

You may be able to find a setting for the eviction_dirty_target parameter that is optimal for your workload and your provisioned hardware. However, this is advanced tuning that is best reserved for when all other options are exhausted. Please make sure that you have recent backups before doing advanced maintenance on your deployment, and that you have tested the new parameters on test deployments before implementing them in production.
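
If you do get to that point, the dirty-cache thresholds can also be changed at runtime through the same mechanism as the other WiredTiger settings; a sketch, where the values are illustrative only (in MongoDB 3.4 they are percentages of the cache size):

    db.adminCommand({
      setParameter: 1,
      wiredTigerEngineRuntimeConfig: "eviction_dirty_target=2,eviction_dirty_trigger=10"
    })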

Best regards
Kevin

Francesco Rivola

Jul 23, 2018, 3:08:42 AM
to mongodb-user
Hi Kevin,

Thank you so much for your help and your detailed answers. I think everything is clear now :). As you suggested, we will address the issue by testing the possible solutions in our dev and staging environments.

Thank you again.

Best Regards,
Francesco Rivola

Francesco Rivola

Nov 6, 2018, 6:25:35 AM
to mongodb-user
Hi Kevin,

We have been playing with the WiredTiger cache settings (eviction tuning and cache size) in order to minimize the impact of the problem. However, we haven't found a setup that allows us to fully avoid the disk flush stall.

Finally, we are considering scaling the disk to one with more IOPS and MB/s.

We are running MongoDB 3.4.9 (standalone) using the WiredTiger Storage Engine.
  • Windows Server 2016 in Azure
  • 64GB of RAM
  • Disk
    • SSD Premium (Azure disk type name: P30)
    • 5000 IOPS
    • 200 MB/s
    • 1 TB size

We plan to resize the disk to P40, that is 7500 IOPS, 250 MB/s and 2 TB size; this is the next Azure disk available (https://docs.microsoft.com/en-us/azure/virtual-machines/windows/premium-storage#premium-storage-disk-limits).

Before performing this resize, we have tried to confirm that we are I/O bound. We have been monitoring the disk with PerfMon, and this has been the result (see the attached DiskBytesSec.png and DiskTransfersSec.png): those values are far below the promised limits.

With Iometer (iometer.org; see https://blogs.technet.microsoft.com/andrewc/2016/09/09/understanding-azure-virtual-machine-iops-throughput-and-disk-latency/) we have been stressing the disk, and we reached the disk's promised limits in IOPS and throughput.

Reviewing mongo, we found that our max dirty bytes in cache is around 500MB right before the flush (I have attached the result of db.serverStatus().wiredTiger right before the flush to disk, ServerStatusWiredTiger.txt). Finally, in mongostat we can see that the flush provokes a drop in performance of around 10 seconds.
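
For reference, the dirty figure above was read right before a checkpoint with a small check like this (a sketch; the statistic name comes from db.serverStatus().wiredTiger.cache):

    var c = db.serverStatus().wiredTiger.cache;
    // "tracked dirty bytes in the cache" peaked at roughly 500MB just before a flush
    print((c["tracked dirty bytes in the cache"] / 1024 / 1024).toFixed(0) + " MB dirty");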

See the attached image MongoStat.png (note: used% is not at 80% in the screenshot because we recently increased the WiredTiger cache back to the default and the memory is slowly growing towards 80%).

We feel we are missing something. With the provided data, do you still think the problem is the disk and that we are currently I/O bound? If we are I/O bound, why do we not see values close to the disk limit in PerfMon?

Finally, do you know somebody who could help us with this? We are thinking of a 1-hour remote meeting that of course would be paid. In that case, please contact me in private at francesco.rivola @ xepient.com.

Thank you in advance,

Best Regards,
Francesco Rivola
DiskBytesSec.png
DiskTransfersSec.png
MongoStat.png
ServerStatusWiredTiger.txt

Kevin Adistambha

Nov 7, 2018, 8:22:57 PM
to mongodb-user

Hi Francesco,

We feel we are missing something. With the provided data, do you still think the problem is the disk and that we are currently I/O bound?

There is a lot of information there, but the apparent performance cap that you’re seeing is curious. Is it possible that it’s being throttled in some way by Azure? The Throttling section in the Azure Premium Storage page says:

Throttling might occur, if your application IOPS or throughput exceeds the allocated limits for a premium storage disk. Throttling also might occur if your total disk traffic across all disks on the VM exceeds the disk bandwidth limit available for the VM.

So my understanding is that throttling can occur not only from disk limits, but also from VM limits. If my understanding is correct, this implies that a single disk is only part of the story.

Looking into the mongostat output you provided, the flush finished at 11:54:01 (in mongostat, flushes are recorded after they’re done) and the stall appeared at 11:54:11, a full 10 seconds after the flush. The stall lasted until 11:54:25. This seems curious to me, since if the disk were struggling to fulfil the flush, it should stall immediately, not 10 seconds later. This may imply that you’re being throttled.

Finally, do you know somebody who could help us with this? We are thinking of a 1-hour remote meeting that of course would be paid.

Since you’re already on Azure, have you considered looking at Atlas Pro? Atlas is configured to avoid these situations, and if it happens for some reason, there are avenues you can use to get help.

Best regards,
Kevin

Francesco Rivola

Nov 8, 2018, 12:23:18 PM
to mongodb-user
Hi Kevin,

First of all, thank you so much for your quick reply.

We are using an L8 Azure VM. Based on its documentation (https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-storage#ls-series) the machine should be able to handle 10K IOPS and 250 MB/s. We will open a ticket with Azure Microsoft Support to get help confirming whether the VM is throttling our disk operations.

BTW: the L-series VMs offer a 40K IOPS disk that is temporary (on reboot all data is lost) (see the top blue note on https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-storage). They suggest using this disk to store the database data for high throughput, and then having a replica set in place, so that even if the VM restarts, the high availability of the replica set guarantees no data loss.
• What do you think about this kind of setup? Would the fact that the disk is volatile be fine when a replica set is in place?
• What could be the impact of re-syncing a secondary from scratch in case one VM goes down?
• Backups: that disk does not support Azure disk snapshot backups, so I guess backups would have to be done by us using mongodump or similar tools. I have read in the MongoDB documentation that this is not the most recommended way to back up data. What are your thoughts on this?

Finally, we are already customers of MongoDB Atlas; our app is running on Atlas. But for business reasons we currently have this specific environment hosted outside Atlas.

Thank you so much again, as always your help in this group is awesome and really appreciated :).

Best Regards,
Francesco Rivola

Kevin Adistambha

Nov 20, 2018, 10:34:23 PM
to mongodb-user

Hi Francesco,

Sorry for the delay in responding. Have you had a chance to confirm if this is a throttling issue?

Regarding your questions:

What do you think about this kind of setup? Would the fact that the disk is volatile be fine when a replica set is in place?

I’m not sure I can recommend for or against this setup, since it depends on your use case and your goal. If volatile disk use is one of your primary concerns, I would suggest you take a look at MongoDB Enterprise Server, which contains the in-memory storage engine that is designed for this use case.

What could be the impact of re-syncing a secondary from scratch in case one VM goes down?

An initial sync will require a workload similar to a collection scan on all databases on the sync source (typically the primary). This could impact your primary, since it would require it to (eventually) load all documents into its cache, which is a change in workload that could potentially be disruptive for your queries. During this period, I would expect your typical queries to be slow, since MongoDB needs to juggle between servicing your queries and the initial sync. For some use cases, this might result in an unacceptable dip in performance.

Backups: that disk does not support Azure disk snapshot backups, so I guess backups would have to be done by us using mongodump or similar tools. I have read in the MongoDB documentation that this is not the most recommended way to back up data. What are your thoughts on this?

Backing up with mongodump and mongorestore is a perfectly acceptable method, as mentioned in the MongoDB Backup Methods page. However, this method requires a change in workload similar to an initial sync; that is, MongoDB would need to load the documents from disk into its cache to be able to dump them. Thus the tradeoff would be similar to your second question above. You might want to experiment with the methods outlined in that page to see which one best suits your needs, since there is no single correct answer to this question.

Best regards,
Kevin

Francesco Rivola

Nov 21, 2018, 9:33:13 AM
to mongodb-user
Hi Kevin,

Don't worry about the delay, your answer is welcome at any time :)

We haven't confirmed the throttling issue with Azure Support yet; we will do that shortly.

Finally, we have set up a replica set using the L8 disk and the disk flush problem is gone. So I guess the issue was, as you said, that we were I/O bound on that VM with that disk.

The new replica set is composed of two L8 VMs using the fast volatile disk, plus the old server with the old disk as a replica member with low priority. The good news is that the old server is able to keep in sync with the fast primary. We guess that the I/O bound condition was coming from the number of writes plus the number of page faults (reads).

Having a member with that persistent disk allows us to create backups from disk snapshots and use a snapshot to add new replica members, avoiding a full re-sync.
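
For reference, the low-priority member on the persistent disk was added roughly like this (a sketch; the hostname is a placeholder and the priority value is only an example):

    // A low priority keeps the slow-disk member from being elected primary
    // under normal conditions, while it still replicates and votes.
    rs.add({ host: "old-server.example.com:27017", priority: 0.5 })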

I have to say that converting the standalone to a replica set was a very smooth process with almost no downtime. Kudos to MongoDB :)

Thank you for your help, really appreciated.

Best Regards,
Francesco

Robert Cochran

Nov 21, 2018, 10:05:26 AM
to mongodb-user
Hi!

I have no special expertise in this area -- in fact I'm learning from both of you. I do want to add my vote for Kevin's advice that you should back up all your data and test any changes before making them in a production environment. I am a Tier 1 developer on IBM mainframes, and the administrators for one application that all the developers use -- it is not MongoDB -- didn't perform any backups of the data for that application. Then someone made some untested changes and put them into production. There was loss of data that could not be recovered. It impacts every developer in our enterprise. While I'm not discussing a MongoDB installation, this does underline the need for good data backups.

I hope you take the advice to back up very seriously and do that before making even tiny changes to your MongoDB infrastructure. It is important, and backups are worth the time and money invested.

Thanks

Bob

Francesco Rivola

Nov 21, 2018, 3:30:29 PM
to mongodb-user
Hi Robert,

Thank you for your advice.

I am with you and Kevin 100%. Indeed, backups are very important before performing any change or maintenance operation on the database infrastructure (no matter which database you are working on).

In fact, all the changes noted in this thread were tested first in our dev and staging environments, and before applying them in production we created a backup of our data (if you are on Azure, I recommend https://docs.microsoft.com/en-us/azure/backup/backup-introduction-to-azure-backup).