Hi Francesco
It seems to me that the hardware can barely cope with the workload you’re asking of it. It appears that reducing the size of the WiredTiger cache allows you to effectively throttle your workload so that the disk can handle the load.
Specifically to answer your questions:
What could be the side effect of our change in WiredTiger cache?
A smaller WiredTiger cache means that WiredTiger can keep fewer documents and indexes in memory at any given time. The WiredTiger cache is WiredTiger’s working memory: it holds the uncompressed documents and indexes that make up your working set.
Thus, a small WiredTiger cache will typically be detrimental to a mostly-read workload, because WiredTiger needs to load and decompress documents that are not in its cache before it can process them. This means you’ll be hitting the disk a lot more often.
However, this depends on your use case. In a write-heavy workload, a smaller cache usually has less impact, since your bottleneck would be how fast your disk can process the writes you’re asking it to perform.
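For reference, here is a minimal sketch of how the cache size is typically capped when starting mongod; the 1GB value and the paths are placeholders for illustration only, not a recommendation:

    # Example only: start mongod with a 1GB WiredTiger cache (value and paths are placeholders)
    mongod --dbpath /data/db --wiredTigerCacheSizeGB 1 --fork --logpath /var/log/mongodb/mongod.log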
Increase page eviction thread config min and max
Since it appears that your workload is disk-bound, increasing the number of eviction threads would have minimal impact. The additional threads would just be sitting idle waiting for disk and not being productive.
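That said, should you still want to test this in your dev environment, the eviction thread pool can be adjusted at runtime. A sketch, with placeholder values that are not a recommendation:

    # Example only: adjust the eviction thread pool at runtime via the mongo shell
    # (the threads_min/threads_max values here are placeholders)
    mongo admin --eval 'db.adminCommand({
      setParameter: 1,
      wiredTigerEngineRuntimeConfig: "eviction=(threads_min=4,threads_max=8)"
    })'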
Change the checkpoint interval to be lower than 60 seconds
This may help to spread out the disk writes, allowing WiredTiger to write smaller amounts of data to disk more often (again, depending on your specific use case). I would encourage you to do your own testing with this parameter.
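If you do test this, the checkpoint/flush interval is normally controlled through mongod’s syncdelay setting (60 seconds by default); purely as an illustration, with placeholder paths and value:

    # Example only: flush/checkpoint every 30 seconds instead of the default 60
    mongod --dbpath /data/db --syncdelay 30 --fork --logpath /var/log/mongodb/mongod.log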
Is this a normal behavior or could be due to some specific application requirements?
This is normal behaviour. As mentioned above, your disk does not appear to be fast enough for your use case. More concurrency (e.g. more clients) could make this worse. Having said that, one thing you can try is to put your dbpath and your journal on separate physical disks. This way, WiredTiger journal writes would not compete with data writes. Please see the Production Notes section for recommendations.
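As a rough sketch of what that separation could look like (the device name, mount point, and paths below are assumptions for illustration; stop mongod first and make sure you have a recent backup before moving anything):

    # Example only: move the WiredTiger journal to its own physical disk
    # /dev/sdc1 and /mnt/mongo-journal are placeholder names
    sudo systemctl stop mongod
    sudo mkdir -p /mnt/mongo-journal
    sudo mount /dev/sdc1 /mnt/mongo-journal
    sudo mv /data/db/journal /mnt/mongo-journal/journal
    sudo ln -s /mnt/mongo-journal/journal /data/db/journal
    sudo chown -R mongod:mongod /mnt/mongo-journal
    sudo systemctl start mongod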
Finally, for a production environment, it is strongly recommended to deploy a replica set with a minimum of 3 data-bearing nodes. Please see the Production Notes for more details.
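Purely as an illustration of the minimal shape of such a deployment (the set name, hostnames, and paths below are placeholders):

    # Example only: start each of the three data-bearing nodes with the same replica set name
    mongod --dbpath /data/db --replSet rs0 --fork --logpath /var/log/mongodb/mongod.log

    # Example only: on one node, initiate the set (hostnames are placeholders)
    mongo --eval 'rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "node1.example.net:27017" },
        { _id: 1, host: "node2.example.net:27017" },
        { _id: 2, host: "node3.example.net:27017" }
      ]
    })'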
Best regards
Kevin
Hi Francesco
“Reducing the WiredTiger cache is throttling our workload.” Does this happen because the default 5% eviction_dirty_target is now being reached, so the eviction threads start writing to disk, reducing the amount of work that needs to be done in the checkpoint? Is that right?
Correct. The full explanation of these tunables is described in the Cache and eviction tuning page, specifically under the heading Eviction tuning. By default (in MongoDB 3.4) this value is set to 5% of the WiredTiger cache size (see https://github.com/mongodb/mongo/blob/r3.4.9/src/third_party/wiredtiger/dist/api_data.py#L426).
I believe the mechanism at work here is that lowering the WiredTiger cache from 31GB to 1GB allows the disk to keep up with writing dirty data. 5% of 31GB is ~1.5GB, while 5% of 1GB is ~50MB, a much smaller amount of data to write. In other words, when your cache size is 31GB, you write faster to memory but also wait longer for the writes to be flushed to disk, leading to “stalls”. When this value is lowered to 1GB, the flushes are smaller, more regular, and faster, so the “stalls” are smoothed out over time. The overall time it takes to write the data should be about the same, since the workload appears to be disk-bound.
What could be the factor that limits the disk in the checkpoint scenario? The 200MiB/s throughput or the 5000 IOPS? Or both.
I think it’s both, since what you have described so far sounds like the disk is struggling to fulfil the work required of it. Larger throughput and higher IOPS should generally give you better performance.
You may be able to find a setting for the eviction_dirty_target parameter that is optimal for your workload and your provisioned hardware. However, this is advanced tuning that is best reserved for when all other options are exhausted. Please make sure that you have recent backups before doing advanced maintenance on your deployment, and that you have tested the new parameters on a test deployment before applying them in production.
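For completeness, a change like this is usually tried at runtime first so it can be rolled back easily by restarting mongod; the value below is purely an example, not a recommendation:

    # Example only: lower eviction_dirty_target at runtime (percentage of the cache; 2 is a placeholder)
    mongo admin --eval 'db.adminCommand({
      setParameter: 1,
      wiredTigerEngineRuntimeConfig: "eviction_dirty_target=2"
    })'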
Best regards
Kevin
Thank you so much for your help and your detailed answers. I think everything is clear now :). As you suggested, we will address the issue by testing the possible solutions in our dev and staging environments.
Thank you again.
Best Regards,
Francesco Rivola
Hi Francesco,
We feel we are missing some point. With the provided data, do you still think the problem is the disk and that we are currently I/O bound?
There is a lot of information there, but the apparent performance cap that you’re seeing is curious. Is it possible that it’s being throttled in some way by Azure? The Throttling section of the Azure Premium Storage page says:
Throttling might occur, if your application IOPS or throughput exceeds the allocated limits for a premium storage disk. Throttling also might occur if your total disk traffic across all disks on the VM exceeds the disk bandwidth limit available for the VM.
So my understanding is that throttling can occur not only from disk limits, but also from VM limits. If my understanding is correct, this implies that a single disk is only part of the story.
Looking at the mongostat output you provided, the flush finished at 11:54:01 (in mongostat, flushes are recorded after they’re done) and the stall appears at 11:54:11, a full 10 seconds after the flush. The stall lasted until 11:54:25. This seems curious to me: if the disk were struggling to fulfil the flush, it should stall immediately, not 10 seconds later. This may imply that you’re being throttled.
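One way to check this from the VM side is to watch the disks while a stall is happening; if the data disk shows plenty of idle capacity during the stall, throttling above the disk (e.g. at the VM level) becomes more likely. A sketch, assuming the sysstat tools are installed:

    # Example only: sample extended disk statistics every 2 seconds during a checkpoint/stall
    # Look at the row for your data disk and compare w/s, wMB/s and %util against your provisioned limits
    iostat -xm 2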
Finally, do you know somebody who could help us with this? We are thinking of a 1-hour remote meeting that, of course, will be paid.
Since you’re already on Azure, have you considered looking at Atlas Pro? Atlas is configured to avoid these situations, and if they happen for some reason, there are avenues you can use to get help.
Best regards,
Kevin
Hi Francesco,
Sorry for the delay in responding. Have you had a chance to confirm if this is a throttling issue?
Regarding your questions:
What do you think about this kind of setup? Given that the disk is volatile, would it be fine when a replica set is in place?
I’m not sure I can recommend for or against this setup, since it depends on your use case and your goals. If volatile disk use is one of your primary concerns, I would suggest you take a look at MongoDB Enterprise Server, which contains the in-memory storage engine designed for this use case.
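Purely as an illustration of what that looks like (available in the Enterprise build only; the size value and paths below are placeholders):

    # Example only: start an Enterprise mongod with the in-memory storage engine
    mongod --dbpath /data/db --storageEngine inMemory --inMemorySizeGB 8 --fork --logpath /var/log/mongodb/mongod.log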
What could be the impact of re-syncing a secondary from the beginning in case one VM goes down?
An initial sync requires a workload similar to a collection scan on all databases on the sync source (typically the primary). This could impact your primary, since it would require it to (eventually) load all documents into its cache, a change in workload that could potentially be disruptive for your queries. During this period, I would expect your typical queries to be slow, since MongoDB needs to juggle between servicing your queries and the initial sync. For some use cases, this might result in an unacceptable dip in performance.
Backups: that disk does not have Azure disk snapshot backups, so I guess backups should be done by us using mongodump or other similar tools. I have read in the MongoDB documentation that this is not the most recommended way to back up data. What are your thoughts on this?
Backing up with mongodump and mongorestore is a perfectly acceptable method, as mentioned in the MongoDB Backup Methods page. However, this method requires a change in workload similar to an initial sync; that is, MongoDB needs to load the documents from disk into its cache to be able to dump them. Thus the tradeoff is similar to your second question above. You might want to experiment with the methods outlined in that page to see which one best suits your needs, since there is no single correct answer to this question.
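As a rough sketch of that approach (hostnames, paths, and options below are examples only; --oplog applies when dumping from a replica set member and should be dropped for a standalone):

    # Example only: dump all databases from a secondary, capturing oplog entries for a consistent snapshot
    mongodump --host secondary.example.net --port 27017 --oplog --gzip --out /backups/mydump

    # Example only: restore the dump, replaying the captured oplog entries
    mongorestore --host target.example.net --port 27017 --oplogReplay --gzip /backups/mydump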
Best regards,
Kevin