Hi Yuri,
Thanks for your interest in WiredTiger and insightful questions. I’ll do my best to answer.
> I have a question about WT write activity and methods to influence it. I was watching system calls (using perf trace -p pid, which shows the name of each system call, its arguments, its return code, and its latency) made by WT under a workload that inserts 50K objects with fairly random keys into a single WT table (storage is a commodity SATA SSD). MongoDB is configured with syncDelay=20sec.
> I was seeing 2 types of pwrite calls. One type is happening continuously and that is expected - the journal writes.
Good - journal writes should happen as appends to the journal files.
> The other kind happens in bursts that last ~100ms and are 20 sec apart - that is obviously the syncing activity on WT index files terminating with fdatasync.
I’d say that this activity is the cost of creating a checkpoint in all the collections and indexes - not just index files.
> The surprising part is that the second type was predominantly for one 4K block, in a small percentage of cases for 2 blocks, and very rarely for 3 blocks.
I suspect what you are seeing here is writes to the WiredTiger metadata. The way a checkpoint works in WiredTiger is that each table (collection or index in MongoDB) has a checkpoint created. The process of creating a checkpoint in a table is to write all the dirty content from cache at a point in time, fsync the content of the data file and then to write and fsync an entry into the metadata for that table.
If your workload contains a lot of collections and indexes, this could explain why a single block (or small set of blocks) is being updated many times when creating a checkpoint.
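To make the per-table sequence described above concrete, here is a minimal sketch in Python. It is only a model of the steps as described in this thread (write dirty content, fsync the data file, then write and fsync a metadata entry), not WiredTiger code; the function name and the table names are made up for illustration.

```python
from collections import Counter


def checkpoint_io_steps(dirty_pages_per_table):
    """Model the I/O steps of one checkpoint, table by table (illustrative only)."""
    steps = []
    for table, dirty_pages in dirty_pages_per_table.items():
        # Write every dirty page of this table, then fsync its data file.
        steps.extend(("write", table) for _ in range(dirty_pages))
        steps.append(("fsync", table))
        # Record the new checkpoint for this table in the shared metadata.
        steps.append(("write", "metadata"))
        steps.append(("fsync", "metadata"))
    return steps


# Hypothetical workload: one collection and two indexes with a few dirty pages each.
tables = {"collection": 3, "index_a": 1, "index_b": 2}
steps = checkpoint_io_steps(tables)
writes_per_file = Counter(target for op, target in steps if op == "write")
print(writes_per_file["metadata"])  # one metadata write per table
```

In this model the metadata is written once per table, so the same small file is rewritten as many times as there are collections and indexes in the checkpoint.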
> All in all the checkpointing results in a burst of a very large number of small writes
I suspect the small writes are somewhat a consequence of your application's usage pattern and the frequent checkpoints that are configured. WiredTiger uses variable size pages - so if the content being flushed to satisfy the requirements of a checkpoint only requires 4k of space, that’s the page size we’ll use. Compression also comes into play when understanding how small the resulting writes are.
> with pretty random locations as far as I could tell. I.e. this is generally what's called random IO. My question is - is this expected by design?
> My thinking is that it makes very little sense to delay a large number of random IO write operations and perform them in a large burst, because this obviously impacts the latency of concurrent reads.
There are two reasons data is written back to the database files in WiredTiger:
1. In service of creating checkpoints.
2. In service of maintaining cache usage at the configured levels.
When creating a checkpoint WiredTiger has relatively little influence over what data is being written - it needs to write a certain set of data in order to satisfy the data consistency requirements of a storage engine. I believe it’s checkpoint write activity you are observing.
When writing content out to maintain the cache (eviction), WiredTiger has more control over which content is written, and we have algorithms in place to prefer writing content that won’t result in small random writes.
> Indeed, I was seeing MongoDB log slow read queries exactly at the time of these bursts. The reason why LSM engines delay syncing of dirty memory to disk is that by doing this they are able to replace lots of small writes with a single large write, resulting in sequential IO. This has less impact on concurrent IO activity.
I agree - one of the primary benefits of LSM trees is that they allow writes to be effectively appends, and data files become immutable once they are written. There are downsides to LSM trees for workloads that aren’t write heavy, including the fact that search operations need to look in multiple locations (mitigated by bloom filters), and that cursor traversal becomes more expensive because it requires coordination between the different segments of the tree.
> Given the btree nature of WT, perhaps the best strategy is to not delay dirty page syncing and perform it continuously, and only write the checkpoint file with a syncDelay.
> Is there a way to achieve this by manipulating WT configuration? I was looking at the cache eviction policy config. eviction_dirty_target looks like a relevant parameter. The manual says: "The eviction_dirty_target configuration value is the overall dirty byte target for eviction, expressed as a percentage of total cache size". But what I don't understand is the meaning of this sentence: "Note the eviction_dirty_target configuration value is ignored until eviction is triggered."
The eviction_dirty_target and eviction_dirty_trigger settings are a pair of configuration options. One (eviction_dirty_trigger) controls the maximum proportion of the cache that can be full of dirty data; the other (eviction_dirty_target) configures the minimum proportion of the cache that needs to be dirty before eviction starts working to reclaim that space. I admit that the naming of the options is confusing - they are named that way for historical reasons.
Adjusting that pair of configuration options will alter how much write activity your workload has due to eviction, but won’t change how much write activity is generated by checkpoints.
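One way to picture how the two thresholds interact is the small Python sketch below. It is a simplified model, not WiredTiger's implementation: the function name and the three state labels are invented for illustration, and the idea that application threads get pulled into eviction once the trigger is reached is an assumption taken from my reading of the WiredTiger documentation rather than from this thread.

```python
def eviction_mode(dirty_pct, target_pct, trigger_pct):
    """Classify dirty-cache pressure against the two thresholds (illustrative model)."""
    if target_pct > trigger_pct:
        raise ValueError("eviction_dirty_target must not exceed eviction_dirty_trigger")
    if dirty_pct < target_pct:
        return "idle"          # below the target: no dirty eviction needed
    if dirty_pct < trigger_pct:
        return "background"    # between target and trigger: eviction threads work
    return "application"       # at or above the trigger (assumed: app threads help)


# With target == trigger == 5 there is no band where only background eviction runs.
print(eviction_mode(4, 5, 5))   # idle
print(eviction_mode(5, 5, 5))   # application
# With target=1 and trigger=3 there is a background band between 1% and 3%.
print(eviction_mode(2, 1, 3))   # background
```

Under this model, setting the target equal to the trigger removes the range in which eviction threads can work ahead of the limit, so a lower target relative to the trigger is what gives background eviction room to run.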
> What is missing IMO is the explanation of how lazy this eviction activity is. I.e. say there is a single evicting thread in WT, eviction=(threads_max=1), and it takes X ms to achieve the target dirty ratio. Will this eviction thread sleep for some time Y ms before examining the current dirty size ratio and starting the next round of eviction? Or does the evictor try to achieve a lower target on the dirty size ratio and then wait until the dirty size grows back to the target size? I would think there need to be 2 config parameters - a low and a high water mark, so to speak. The low water mark would determine the target dirty size ratio that the evictor will achieve after each pass, and the high water mark would determine when the eviction process is triggered. In my case, the eviction activity lasts 100ms and is triggered every 20 sec - clearly this is not optimal if one cares about query latency @ 99%. It does not look like I can make syncing more frequent, and ideally continuous, by manipulating only eviction_dirty_target. On the other hand, increasing the frequency of checkpoints will increase write amplification, CPU usage, and locking, as creating a checkpoint has a substantial cost along these 3 dimensions, based on my understanding.
Thank you.
--
You received this message because you are subscribed to the Google Groups "wiredtiger-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wiredtiger-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
> My first question is why you are configuring the syncDelay at 20 seconds?
The reason for the reduced syncDelay is the observation that for write-intensive workloads, at the time of a checkpoint WT produces a massive amount of write IO, lasting ~90-100ms in our case. If at that time WT receives a read query that is not served from cache, the probability of this request taking longer to execute is much higher than when it is served outside of this 90-100ms window. In a simple experiment, I kept two windows open side by side: dstat output showing the amount of data mongod is writing in one window, and the mongod log showing slow queries (those that take >100ms to execute) in the other. I observed that the timing of slow queries coincides with checkpoint writing activity. After reducing syncDelay to 20 sec, and even 5 sec in some experiments, the duration of checkpoint activity was greatly reduced, the total amount of data written to disk during checkpoint activity was reduced proportionally to the syncDelay reduction, and, unsurprisingly, slow read queries were virtually gone.
Why am I worried about occasionally slow read queries? Because they break my 99-percentile read latency ....
Thanks for pointing out eviction_dirty_target and eviction_dirty_trigger. I realized that I had overlooked the second of them. So I tried configuring MongoDB to override the values of these parameters in the following way:

    storage:
      journal:
        enabled: true
        commitIntervalMs: 50
      dbPath: "{{{data_path}}}"
      syncPeriodSecs: 60
      directoryPerDB: false
      wiredTiger:
        engineConfig:
          configString: "eviction_dirty_trigger=5,eviction_dirty_target=5,eviction_checkpoint_target=5,eviction=(threads_min=1,threads_max=4)"
          cacheSizeGB: 1
          journalCompressor: snappy
          directoryForIndexes: true
        collectionConfig:
          blockCompressor: snappy
        indexConfig:
          prefixCompression: true

My goal was to set eviction_dirty_target = eviction_dirty_trigger to make eviction more or less continuous and hence reduce the burst effect that I'm describing above. (I understand your point about the benefits of deferring eviction in the hope that small writes may become coalesced into bigger ones, but my writes are pretty random and I'd rather have higher IO and write amplification than big bursts of writes.)
However, I did not observe much of a change: every 60 sec there was a big burst of writes similar in size to the one without my custom settings.
Is there a mistake on my part somewhere here?
--
I tried to change the eviction params to eviction_dirty_trigger=3,eviction_dirty_target=1,eviction_checkpoint_target=1,eviction=(threads_min=4,threads_max=4). While there was some visible eviction activity at times other than checkpointing, the burst of writes is still there.
The problem with achieving the effect that I want - spreading dirty page eviction activity evenly between checkpoint writes - by manipulating just the eviction_* params is that they are integers in WT. The smallest unit of increment is 1% of the heap size. If the heap is 10GB, 1% of it is 100MB, which is a large increment. During the test I'm using, my insert rate (at ) is 10MB/sec. This basically means there is no way to force dirty page eviction to happen more frequently than in smaller bursts every 10 sec.
Why can't these settings have decimal precision in WT? This would allow me to achieve higher precision.
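The arithmetic behind that 10 sec figure can be written out as a short Python sketch. The function name is made up for illustration, the sizes use decimal units (1 GB = 10^9 bytes), and the 0.1% step at the end is purely hypothetical: it only shows what a finer-grained setting would buy, not something the thread says WT supports.

```python
def eviction_step(cache_bytes, insert_rate_bytes_per_sec, step_pct=1.0):
    """Return (bytes per threshold step, seconds of inserts needed to fill it)."""
    step_bytes = cache_bytes * step_pct / 100
    return step_bytes, step_bytes / insert_rate_bytes_per_sec


GB = 10**9
MB = 10**6

# Numbers from the post: 10GB cache, 10MB/sec insert rate, 1% granularity.
step_bytes, seconds = eviction_step(10 * GB, 10 * MB)
print(step_bytes / MB, seconds)   # 100.0 MB per step, 10.0 sec between bursts

# Hypothetical 0.1% granularity for comparison.
fine_bytes, fine_seconds = eviction_step(10 * GB, 10 * MB, step_pct=0.1)
print(fine_bytes / MB, fine_seconds)
```

So with whole-percent settings the dirty data can only be drained in steps of about 100MB, which at this insert rate is one burst roughly every 10 sec.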
But more importantly, I feel the simplest and most natural approach to accomplish this would be to have a new mode of operation in WT: analyze how many dirty pages are produced between subsequent checkpoints and try to spread dirty page eviction evenly over this period, ignoring eviction_* altogether in this mode. I believe this mode would be popular among users who value predictable engine performance when looking at 99-percentile measurements. This is an intentional sacrifice of the benefits of deferred eviction. But if you think about it, these benefits are only relevant for achieving higher throughput. If the user can achieve the target throughput with continuous eviction - and with a hefty margin, which is the case with my workload - it seems a fairly reasonable line of thinking.
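As a rough illustration of the proposed mode, here is a minimal Python sketch of the pacing calculation. This is not an existing WiredTiger feature or API: the function name, the one-second tick, and the idea of a fixed per-tick write budget are all assumptions made for the example.

```python
def paced_flush_schedule(dirty_bytes_per_interval, checkpoint_interval_sec, tick_sec=1.0):
    """Split the dirty data expected between two checkpoints into equal per-tick budgets."""
    ticks = int(checkpoint_interval_sec / tick_sec)
    if ticks <= 0:
        raise ValueError("checkpoint interval must cover at least one tick")
    per_tick = dirty_bytes_per_interval / ticks
    return [per_tick] * ticks


MB = 10**6

# 10MB/sec of inserts over a 60 sec checkpoint interval is about 600MB of dirty data.
schedule = paced_flush_schedule(600 * MB, 60)
print(len(schedule), schedule[0] / MB)   # 60 ticks of 10.0 MB each
```

The point of the sketch is only the shape of the trade: the same 600MB gets written either way, but as sixty 10MB slices rather than one burst at checkpoint time.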
I am revisiting the insert benchmark for WiredTiger so this is relevant to me again.
On Thu, Oct 5, 2017 at 9:11 PM, alexander.gorrod via wiredtiger-users <wiredtig...@googlegroups.com> wrote:
Not trying to sell InnoDB migrations here, but it has innodb_adaptive_flushing, and supporting code, to do what Yuri has suggested. Alas, InnoDB was built to do fuzzy checkpoints, and based on Alex's comment above I don't know whether that would work here.
But for me the first problem I see for an in-memory workload is mutex contention from eviction code.