WiredTiger write patterns and cache eviction configuration


Yuri Finkelstein

Sep 29, 2017, 12:57:43 PM
to wiredtiger-users
Hello, 
I have a question about WT write activity and methods to influence it. I was watching system calls (using perf trace -p pid, which shows the name of each system call, its arguments, its return code, and its latency) made by WT under a workload that inserts 50K objects with fairly random keys into a single WT table (storage is a commodity SATA SSD). MongoDB is configured with syncDelay=20sec. I was seeing 2 types of pwrite calls. One type happens continuously and that is expected - the journal writes. The other kind happens in bursts that last ~100ms and are 20 sec apart - that is obviously the syncing activity on WT index files, terminating with fdatasync. The surprising part is that the second type was predominantly for one 4K block, in a small percentage of cases for 2 blocks, and very rarely for 3 blocks. All in all, the checkpointing results in a burst of a very large number of small writes with pretty random locations as far as I could tell, i.e. this is generally what's called random IO. My question is - is this expected by design? My thinking is that it makes very little sense to delay a large number of random IO write operations and perform them in a large burst, because obviously this impacts the latency of concurrent reads. Indeed, I was seeing MongoDB log slow read queries exactly at the time of these bursts.
The reason why LSM engines delay syncing of dirty memory to disk is because by doing this they are able to replace lots of small writes with a single large write resulting in sequential IO. This has less impact on concurrent IO activity. 

Given the btree nature of WT, perhaps the best strategy is to not delay dirty page syncing and perform it continuously, and only write the checkpoint file with a syncDelay.  Is there a way to achieve this by manipulating WT configuration? I was looking at cache eviction policy config.
eviction_dirty_target looks like a relevant parameter. The manual says: " The eviction_dirty_target configuration value is the overall dirty byte target for eviction, expressed as a percentage of total cache size". 
But what I don't understand is the meaning of this sentence: 

"Note the eviction_dirty_target configuration value is ignored until eviction is triggered."

What is missing IMO is an explanation of how lazy this eviction activity is. Say there is a single evicting thread in WT, eviction=(threads_max=1), and it takes X ms to achieve the target dirty ratio. Will this eviction thread sleep for some time Y ms before examining the current dirty size ratio and starting the next round of eviction? Or does the evictor try to achieve a lower target on dirty size ratio and then wait until the dirty size grows back to the target size? I would think there need to be 2 config parameters - a low and a high water mark, so to speak. The low water mark would determine the target dirty size ratio that the evictor will achieve after each pass, and the high water mark would determine when the eviction process is triggered. In my case, the eviction activity lasts 100ms and is triggered every 20 sec - clearly this is not optimal if one cares about query latency @ 99%. It does not look like I can make syncing more frequent and ideally continuous by manipulating only eviction_dirty_target. On the other hand, increasing the frequency of checkpoints will increase write amplification, CPU usage, and locking, as creation of a checkpoint has a substantial cost along these 3 dimensions based on my understanding.
 
Is there anything else that can be done about it?
Thank you,
Yuri 



alexande...@10gen.com

Oct 4, 2017, 12:59:23 AM
to wiredtiger-users

Hi Yuri,


Thanks for your interest in WiredTiger and insightful questions. I’ll do my best to answer.


On Saturday, September 30, 2017 at 2:57:43 AM UTC+10, Yuri Finkelstein wrote:
> I have a question about WT write activity and methods to influence it. I was watching system calls (using perf trace -p pid, which shows the name of each system call, its arguments, its return code, and its latency) made by WT under a workload that inserts 50K objects with fairly random keys into a single WT table (storage is a commodity SATA SSD). MongoDB is configured with syncDelay=20sec.

My first question is why you are configuring the syncDelay at 20 seconds? For MongoDB with the WiredTiger storage engine the only benefit of using a short syncDelay when journaling is also enabled is a reduction in the amount of time required to recover after a crash. The default value of 60 seconds is already fairly aggressive. The syncDelay setting configures how often MongoDB triggers a full checkpoint in WiredTiger - a checkpoint will flush all dirty data from cache to disk.
 
> I was seeing 2 types of pwrite calls. One type happens continuously and that is expected - the journal writes.

Good - journal writes should happen as appends to the journal files.

 
> The other kind happens in bursts that last ~100ms and are 20 sec apart - that is obviously the syncing activity on WT index files, terminating with fdatasync.


I’d say that this activity is the cost of creating a checkpoint in all the collections and indexes - not just index files.

 
> The surprising part is that the second type was predominantly for one 4K block, in a small percentage of cases for 2 blocks, and very rarely for 3 blocks.

I suspect what you are seeing here is writes to the WiredTiger metadata. The way a checkpoint works in WiredTiger is that each table (collection or index in MongoDB) has a checkpoint created. The process of creating a checkpoint in a table is to write all the dirty content from cache at a point in time, fsync the content of the data file, and then write and fsync an entry into the metadata for that table.


If your workload contains a lot of collections and indexes - this could explain why a single block (or small set of blocks) is being updated many times when creating a checkpoint.
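As a rough illustration of the per-table sequence described above, here is a toy Python model. It is not WiredTiger code: the function name, the step labels, and the table names are all hypothetical, and it only lists the order of I/O steps to show why the number of small metadata writes grows with the number of collections and indexes.

```python
from typing import Dict, List, Tuple


def checkpoint_io_plan(dirty_pages: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return the ordered I/O steps for a checkpoint, one group per table.

    Toy model only: `dirty_pages` maps a hypothetical table name to the
    number of dirty pages it holds at the checkpoint's point in time.
    """
    plan: List[Tuple[str, str]] = []
    for table, pages in dirty_pages.items():
        # 1. Write the dirty content captured at the checkpoint's point in time.
        plan.append(("write_data", f"{table}: {pages} dirty page(s)"))
        # 2. Flush the table's data file.
        plan.append(("fsync_data", table))
        # 3. Record the new checkpoint in the metadata, then flush that too.
        plan.append(("write_metadata", table))
        plan.append(("fsync_metadata", table))
    return plan


plan = checkpoint_io_plan({"collection-0": 12, "index-1": 3, "index-2": 1})
metadata_writes = sum(1 for op, _ in plan if op == "write_metadata")
print(metadata_writes)  # one small metadata write per table
```

In this model every table contributes one metadata write and one metadata fsync per checkpoint, regardless of how little data it had to flush.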


> All in all, the checkpointing results in a burst of a very large number of small writes

I suspect the small writes are somewhat a consequence of your application’s usage pattern and the frequent checkpoints that are configured. WiredTiger uses variable size pages - so if the content being flushed to satisfy the requirements of a checkpoint only requires 4k of space, that’s the page size we’ll use. Compression also comes into play when understanding how small the resulting writes are.

 
> with pretty random locations as far as I could tell, i.e. this is generally what's called random IO. My question is - is this expected by design?

I’d say it is. WiredTiger implements a copy-on-write design for its data files, and implements a block manager to reuse free space from redundant pages. Depending on your data access patterns, the result can be that content from the in-memory btree ends up mapped fairly randomly across the on-disk file.
 
> My thinking is that it makes very little sense to delay a large number of random IO write operations and perform them in a large burst, because obviously this impacts the latency of concurrent reads.

There are two reasons data is written back to the database files in WiredTiger:

  1. Is in service of creating checkpoints.

  2. Is in service of maintaining cache usage at the configured levels.


When creating a checkpoint WiredTiger has relatively little influence over what data is being written - it needs to write a certain set of data in order to satisfy the data consistency requirements of a storage engine. I believe it’s checkpoint write activity you are observing.


When writing content out to maintain the cache (eviction), WiredTiger has more control over which content is written, and we have algorithms in place to prefer writing content that won’t result in small random writes.

 
> Indeed, I was seeing MongoDB log slow read queries exactly at the time of these bursts.
> The reason why LSM engines delay syncing of dirty memory to disk is because by doing this they are able to replace lots of small writes with a single large write resulting in sequential IO. This has less impact on concurrent IO activity.


I agree - one of the primary benefits of LSM trees is that they allow writes to be effectively appends, and data files become immutable once they are written. There are downsides to LSM trees for workloads that aren’t write heavy, including the fact that search operations need to look in multiple locations (mitigated by bloom filters), and that cursor traversal becomes more expensive because it requires co-ordination between the different segments of the tree.

 

> Given the btree nature of WT, perhaps the best strategy is to not delay dirty page syncing and perform it continuously, and only write the checkpoint file with a syncDelay.

I’d postulate that you actually want something different - it would be better if most of the data write-back happens as part of eviction rather than in the service of checkpoints - because that gives WiredTiger the opportunity to group data in a more disk sensitive fashion.
 
> Is there a way to achieve this by manipulating WT configuration? I was looking at cache eviction policy config.
> eviction_dirty_target looks like a relevant parameter. The manual says: "The eviction_dirty_target configuration value is the overall dirty byte target for eviction, expressed as a percentage of total cache size".
> But what I don't understand is the meaning of this sentence:
>
> "Note the eviction_dirty_target configuration value is ignored until eviction is triggered."


The eviction_dirty_target and eviction_dirty_trigger are a pair of configuration options. One (eviction_dirty_trigger) controls the maximum proportion of the cache that can be full of dirty data, the other (eviction_dirty_target) configures the minimum proportion of the cache that needs to be dirty before eviction starts working to reclaim that space. I admit that the naming of the options is confusing - they are named that way for historical reasons.

Adjusting that pair of configuration options will alter how much write activity your workload has due to eviction, but won’t change how much write activity is generated by checkpoints.

 
> What is missing IMO is an explanation of how lazy this eviction activity is. Say there is a single evicting thread in WT, eviction=(threads_max=1), and it takes X ms to achieve the target dirty ratio. Will this eviction thread sleep for some time Y ms before examining the current dirty size ratio and starting the next round of eviction? Or does the evictor try to achieve a lower target on dirty size ratio and then wait until the dirty size grows back to the target size? I would think there need to be 2 config parameters - a low and a high water mark, so to speak. The low water mark would determine the target dirty size ratio that the evictor will achieve after each pass, and the high water mark would determine when the eviction process is triggered. In my case, the eviction activity lasts 100ms and is triggered every 20 sec - clearly this is not optimal if one cares about query latency @ 99%. It does not look like I can make syncing more frequent and ideally continuous by manipulating only eviction_dirty_target. On the other hand, increasing the frequency of checkpoints will increase write amplification, CPU usage, and locking, as creation of a checkpoint has a substantial cost along these 3 dimensions based on my understanding.

I believe I’ve answered this question above. One thing I’ll note is that configuring a single eviction thread isn’t recommended. The ideal configuration will have enough eviction worker threads to handle the eviction load for a workload. If the eviction load is beyond the capacity of the eviction worker threads then application threads will be used to maintain the content of the cache - which can lead to spikes in operation latency. For a high throughput workload on a machine with more than 32 cores I’d recommend configuring eviction=(threads_min=8,threads_max=8). For smaller machines I’d recommend configuring both values to 4 (which is the MongoDB default).
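The thread-count advice above can be written down as a small helper. This is only a sketch of that rule of thumb for a high throughput workload; the function name is made up for illustration, and the only real identifiers are the eviction, threads_min and threads_max settings quoted in the reply.

```python
def eviction_threads_config(cores: int) -> str:
    """Build the eviction thread setting suggested in the reply above.

    Assumes a high throughput workload; threads_min and threads_max are
    kept equal, as in both recommended configurations.
    """
    threads = 8 if cores > 32 else 4
    return f"eviction=(threads_min={threads},threads_max={threads})"


print(eviction_threads_config(16))
print(eviction_threads_config(64))
```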
 
- Alex

Yuri Finkelstein

Oct 4, 2017, 2:51:02 AM
to wiredtig...@googlegroups.com
Hello Alex,
thanks for your response.
>My first question is why you are configuring the syncDelay at 20 seconds?
The reason for the reduced syncDelay is the observation that for write-intensive workloads, at the time of a checkpoint WT produces a massive amount of write IO lasting ~90-100ms in our case. If at that time WT receives a read query that is not served from cache, the probability of this request taking a longer time to execute is much higher than when it is served outside of this 90-100ms window. In a simple experiment, I was keeping two windows open side by side: dstat output showing the amount of data mongod is writing in one window, and the mongod log showing slow queries (those that take >100ms to execute) in the other. I was observing that the timing of slow queries coincides with checkpoint writing activity. After reducing syncDelay to 20 sec, and even 5 sec in some experiments, the duration of checkpoint activity was greatly reduced, the total amount of data written to disk during checkpoint activity was reduced proportionally to the syncDelay reduction, and unsurprisingly, slow read queries were virtually gone.
Why am I worried about occasionally slow read queries? Because they break my 99-percentile read latency ....


Thanks for pointing me to eviction_dirty_target and eviction_dirty_trigger. I realized that I had overlooked the second of them. So, I tried to configure MongoDB and override the values of these parameters in the following way:
storage:
  journal:
    enabled: true
    commitIntervalMs: 50
  dbPath: "{{{data_path}}}"
  syncPeriodSecs: 60
  directoryPerDB: false
  wiredTiger:
    engineConfig:
      configString: "eviction_dirty_trigger=5,eviction_dirty_target=5,eviction_checkpoint_target=5,eviction=(threads_min=1,threads_max=4)"
      cacheSizeGB: 1
      journalCompressor: snappy
      directoryForIndexes: true
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: true
My goal was to set eviction_dirty_target = eviction_dirty_trigger to make eviction more or less continuous and hence reduce the burst effect that I'm describing above. (I understand your point about the benefits of deferring eviction in the hope that small writes may become coalesced into bigger ones, but my writes are pretty random and I'd rather have higher IO and write amplification than big bursts of writes.)
However, I did not observe much of a change: every 60 sec there was a big burst of writes similar in size to one without my custom settings.
Is there a mistake on my part somewhere here? 

Thank you. 
Yuri

--
You received this message because you are subscribed to the Google Groups "wiredtiger-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wiredtiger-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alex

Oct 4, 2017, 8:11:01 PM
to wiredtiger-users
Hi Yuri,


On Wednesday, October 4, 2017 at 5:51:02 PM UTC+11, Yuri Finkelstein wrote:
>My first question is why you are configuring the syncDelay at 20 seconds?
> The reason for the reduced syncDelay is the observation that for write-intensive workloads, at the time of a checkpoint WT produces a massive amount of write IO lasting ~90-100ms in our case. If at that time WT receives a read query that is not served from cache, the probability of this request taking a longer time to execute is much higher than when it is served outside of this 90-100ms window. In a simple experiment, I was keeping two windows open side by side: dstat output showing the amount of data mongod is writing in one window, and the mongod log showing slow queries (those that take >100ms to execute) in the other. I was observing that the timing of slow queries coincides with checkpoint writing activity. After reducing syncDelay to 20 sec, and even 5 sec in some experiments, the duration of checkpoint activity was greatly reduced, the total amount of data written to disk during checkpoint activity was reduced proportionally to the syncDelay reduction, and unsurprisingly, slow read queries were virtually gone.
> Why am I worried about occasionally slow read queries? Because they break my 99-percentile read latency ....

Thanks for the explanation. I understand what you are seeing. Unfortunately there is a tension when creating checkpoints between finishing them quickly - which is beneficial because checkpoints hold some resources pinned, and having the additional IO potentially slow other operations. The WiredTiger team has been putting a lot of resources into making checkpoints less invasive on both fronts, and I expect that work to continue in the future. 


> Thanks for pointing me to eviction_dirty_target and eviction_dirty_trigger. I realized that I had overlooked the second of them. So, I tried to configure MongoDB and override the values of these parameters in the following way:
> storage:
>   journal:
>     enabled: true
>     commitIntervalMs: 50
>   dbPath: "{{{data_path}}}"
>   syncPeriodSecs: 60
>   directoryPerDB: false
>   wiredTiger:
>     engineConfig:
>       configString: "eviction_dirty_trigger=5,eviction_dirty_target=5,eviction_checkpoint_target=5,eviction=(threads_min=1,threads_max=4)"
>       cacheSizeGB: 1
>       journalCompressor: snappy
>       directoryForIndexes: true
>     collectionConfig:
>       blockCompressor: snappy
>     indexConfig:
>       prefixCompression: true
> My goal was to set eviction_dirty_target = eviction_dirty_trigger to make eviction more or less continuous and hence reduce the burst effect that I'm describing above. (I understand your point about the benefits of deferring eviction in the hope that small writes may become coalesced into bigger ones, but my writes are pretty random and I'd rather have higher IO and write amplification than big bursts of writes.)
> However, I did not observe much of a change: every 60 sec there was a big burst of writes similar in size to one without my custom settings.
> Is there a mistake on my part somewhere here?

Setting eviction_dirty_target and eviction_dirty_trigger to the same value generally results in very poor performance. The eviction_dirty_trigger is actually a hard limit on how much dirty content is allowed in the cache (like I said before - it's poorly named for historical reasons). With the two values equal, the eviction process won't run at all until the cache reaches the hard limit. User operations will then be stalled until the cache usage is lowered, but eviction will only aim to lower it a very small amount because the target is equivalent to the trigger.

Stepping back to describe how the pair of options works: the current algorithm will start evicting content once cache usage is above eviction_dirty_target, and become increasingly aggressive about keeping dirty content low as usage approaches eviction_dirty_trigger. The settings I recommend using are:
eviction_dirty_target=1,eviction_dirty_trigger=20,eviction_checkpoint_target=1
Given that you are only configuring a 1GB cache, though, that will make the WiredTiger cache essentially ineffective for keeping updates, since it will aim to keep only 10MB of dirty content in cache. I also recommend keeping the two eviction=(threads_min=X,threads_max=X) values equal for now - the algorithm WiredTiger uses to dynamically adjust the number of eviction worker threads is still evolving and can cause brief performance interruptions.
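To make the 10MB figure concrete, the sketch below converts the percentage settings into byte thresholds for a given cache size. The helper name is hypothetical, treating 1GB as 2**30 bytes is an assumption, and the target < trigger check mirrors the advice above rather than any rule WiredTiger itself is known to enforce.

```python
GIB = 1024 ** 3  # assumption: cacheSizeGB is treated as GiB here


def dirty_thresholds(cache_bytes: int, target_pct: int, trigger_pct: int):
    """Convert dirty target/trigger percentages into byte counts."""
    # Equal values are what the reply above warns against, so reject them here.
    if not 0 < target_pct < trigger_pct <= 100:
        raise ValueError("expected 0 < target < trigger <= 100")
    target_bytes = cache_bytes * target_pct // 100
    trigger_bytes = cache_bytes * trigger_pct // 100
    return target_bytes, trigger_bytes


target, trigger = dirty_thresholds(1 * GIB, target_pct=1, trigger_pct=20)
print(f"target ~ {target / 2**20:.1f} MiB, trigger ~ {trigger / 2**20:.1f} MiB")
```

With a 1GB cache the target works out to roughly 10MB and the trigger to roughly 205MB, which is the range in which background eviction has room to work.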

If those settings don't help I recommend opening a JIRA ticket, where we can do analysis of the particular workload you are running and provide better/more specific advice.

- Alex

Yuri Finkelstein

Oct 5, 2017, 3:13:24 AM
to wiredtig...@googlegroups.com
Thank you Alex, really appreciate these details. I will try your suggestions right away. One last question on these params. I understand it's possible to set engineConfig string from within Mongo shell. Can these params be altered at run time without process restart? It seems that WT low level API does allow modifying them after DB is open and sessions created. 


alexande...@10gen.com

Oct 5, 2017, 6:32:44 AM
to wiredtiger-users
It is possible to reconfigure cache settings at runtime using:
db.adminCommand( { "setParameter": 1, "wiredTigerEngineRuntimeConfig": "eviction=(threads_min=4,threads_max=4)"})

You can replace the final configuration string with the settings you want to update. Having said that, adjusting eviction settings at runtime can be very disruptive to throughput - so I generally recommend not doing it.
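For anyone scripting this, here is a minimal Python sketch that assembles the same command document from individual settings. The helper name is hypothetical; only the setParameter and wiredTigerEngineRuntimeConfig keys and the setting names come from this thread. It builds the document without sending it anywhere.

```python
def runtime_config_command(**settings: str) -> dict:
    """Build the setParameter document shown above from key=value settings."""
    # Keyword arguments keep their order, so the string is built as written.
    config = ",".join(f"{key}={value}" for key, value in settings.items())
    return {"setParameter": 1, "wiredTigerEngineRuntimeConfig": config}


command = runtime_config_command(
    eviction_dirty_target="1",
    eviction_dirty_trigger="20",
    eviction="(threads_min=4,threads_max=4)",
)
print(command)
```

With a driver such as PyMongo, a document like this could be passed to the admin database's command method; that step is left out here because it needs a running server.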

Yuri Finkelstein

Oct 5, 2017, 12:53:37 PM
to wiredtig...@googlegroups.com
I tried to change the eviction params to
eviction_dirty_trigger=3,eviction_dirty_target=1,eviction_checkpoint_target=1,eviction=(threads_min=4,threads_max=4).
While there was some visible eviction activity at times other than checkpointing the burst of writes is still there. 

The problem with achieving the effect that I want - spreading dirty page eviction activity evenly between checkpoint writing intervals - by manipulating just the eviction_* params is that they are integers in WT. The smallest unit of increment is 1% of the heap size. If the heap is 10GB, 1% of it is 100MB, which is a large increment. During the test I'm using, my insert rate (at ) is 10MB/sec. This basically means there is no way to force dirty page eviction more frequently than in smaller bursts every 10 sec.
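The arithmetic behind the 10 sec figure, as a small Python sketch. The function name is made up for illustration, and decimal units (1GB = 10**9 bytes) are assumed to match the round numbers above.

```python
GB = 10 ** 9
MB = 10 ** 6


def seconds_to_fill_step(cache_bytes: int, step_pct: int, insert_bytes_per_sec: int) -> float:
    """Time for inserts to dirty one percentage step of the cache."""
    step_bytes = cache_bytes * step_pct / 100
    return step_bytes / insert_bytes_per_sec


# A 10GB cache, the smallest integer step (1%), and a 10MB/sec insert rate.
print(seconds_to_fill_step(10 * GB, 1, 10 * MB))
```

A finer-grained setting (a fraction of a percent, or an absolute byte count) would shrink the step and therefore the interval.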

Why can't these settings have decimal precision in WT? This would allow me to achieve higher precision.

But more importantly, I feel the simplest and most natural approach to accomplish this would be to have a new modus operandi in WT: analyze how many dirty pages are being produced between subsequent checkpoints and try to spread dirty page eviction evenly over this period - ignoring eviction_* altogether in this mode. I believe this mode will be popular among users who value predictable engine performance when looking at 99th percentile measurements. This is an intentional sacrifice of the benefits of deferred eviction. But if you think about it, these benefits are only relevant for achieving higher throughput. If the user can achieve the target throughput with continuous eviction - and with a hefty margin, which is the case with my workload - it seems a fairly reasonable line of thinking.

 

MARK CALLAGHAN

Oct 5, 2017, 1:21:13 PM
to wiredtig...@googlegroups.com
There has been a lot of work to fix the impact of buffered IO writeback in Linux. This was motivated by the suffering inflicted on buffered IO databases (mostly Postgres, but now WiredTiger, mmapv1 and RocksDB). I hope this eventually helps to reduce read stalls after write bursts.

https://lwn.net/Articles/682582/
--
Mark Callaghan
mdca...@gmail.com

alexande...@10gen.com

Oct 6, 2017, 12:11:24 AM
to wiredtiger-users
Hi Yuri,


On Friday, October 6, 2017 at 3:53:37 AM UTC+11, Yuri Finkelstein wrote:
> I tried to change the eviction params to
> eviction_dirty_trigger=3,eviction_dirty_target=1,eviction_checkpoint_target=1,eviction=(threads_min=4,threads_max=4).
> While there was some visible eviction activity at times other than checkpointing the burst of writes is still there.
>
> The problem with achieving the effect that I want - spreading dirty page eviction activity evenly between checkpoint writing intervals - by manipulating just the eviction_* params is that they are integers in WT. The smallest unit of increment is 1% of the heap size. If the heap is 10GB, 1% of it is 100MB, which is a large increment. During the test I'm using, my insert rate (at ) is 10MB/sec. This basically means there is no way to force dirty page eviction more frequently than in smaller bursts every 10 sec.
>
> Why can't these settings have decimal precision in WT? This would allow me to achieve higher precision.


That's a very reasonable feature request, I've opened a JIRA ticket to track it:

> But more importantly, I feel the simplest and most natural approach to accomplish this would be to have a new modus operandi in WT: analyze how many dirty pages are being produced between subsequent checkpoints and try to spread dirty page eviction evenly over this period - ignoring eviction_* altogether in this mode. I believe this mode will be popular among users who value predictable engine performance when looking at 99th percentile measurements. This is an intentional sacrifice of the benefits of deferred eviction. But if you think about it, these benefits are only relevant for achieving higher throughput. If the user can achieve the target throughput with continuous eviction - and with a hefty margin, which is the case with my workload - it seems a fairly reasonable line of thinking.

I'm not entirely following - it seems as though the behavior you want is what WiredTiger currently has when not creating checkpoints - which is why I thought creating fewer checkpoints would be beneficial for you - since it would lead to many fewer interruptions. I know that doesn't entirely resolve the issue, but creating a checkpoint once per hour instead of once per 20 seconds will result in 180x fewer checkpoints being created - and presumably 180x fewer long latency operations for your application.
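The 180x figure follows directly from the two intervals. A short Python check, using a hypothetical helper name:

```python
def checkpoints_per_hour(interval_secs: int) -> float:
    """Number of checkpoints created in one hour at a fixed interval."""
    return 3600 / interval_secs


# Once per 20 seconds versus once per hour (3600 seconds).
ratio = checkpoints_per_hour(20) / checkpoints_per_hour(3600)
print(ratio)
```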

The way a checkpoint is created in WiredTiger is to create a traversable tree in the on-disk table that represents a point-in-time view of the data. In order to achieve that it's necessary to write out a specific version of the data, and the corresponding internal nodes. The only way we can achieve that is by traversing the content in cache at a point in time, and writing out the correct version of the data. Dribbling that data out via eviction won't result in there being a consistent view of the data in the tables - which is what's necessary for a checkpoint.

Having said that, I can definitely see that having checkpoints complete more slowly would benefit your situation and have created: https://jira.mongodb.org/browse/WT-3633 to consider possible solutions.

Please add any context and workloads you have to those tickets - that will help ensure the final work addresses your needs.

- Alex

MARK CALLAGHAN

Jul 5, 2018, 7:32:30 PM
to wiredtig...@googlegroups.com
I am revisiting the insert benchmark for WiredTiger so this is relevant to me again. 
Not trying to sell InnoDB migrations here but it has innodb_adaptive_flushing, and supporting code, to do what Yuri has suggested. Alas, InnoDB was built to do fuzzy checkpoints and based on Alex's comment above I don't know whether that would work here.

https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_adaptive_flushing

But for me the first problem I see for an in-memory workload is mutex contention from eviction code.


--
Mark Callaghan
mdca...@gmail.com

Alex

Jul 6, 2018, 12:26:05 AM
to wiredtiger-users
Hi Mark/Yuri,

We have been continually working on reducing how disruptive checkpoint creation is in WiredTiger. In response to the request from Yuri above - we have implemented an API change in WT-3632 that allows for eviction values to be set as byte counts, as well as percentages. That gives the flexibility of configuring small values within very large caches.
  
> I am revisiting the insert benchmark for WiredTiger so this is relevant to me again.

Thanks! We always appreciate your efforts and feedback.

The original workload here had a low rate of updates - and wanted to avoid checkpoints interfering with query performance. We spent time attempting to craft a standalone WiredTiger workload that reproduced the symptoms in WT-3633 without success, so closed the ticket.

We have recently made a change that I believe will help address the issue - which was described in WT-4111. That change is only available in the latest development release of WiredTiger - but I expect it to become available in a MongoDB 4.0 release in the future.
 
On Thu, Oct 5, 2017 at 9:11 PM, alexander.gorrod via wiredtiger-users <wiredtig...@googlegroups.com> wrote:

> Not trying to sell InnoDB migrations here but it has innodb_adaptive_flushing, and supporting code, to do what Yuri has suggested. Alas, InnoDB was built to do fuzzy checkpoints and based on Alex's comment above I don't know whether that would work here.

WiredTiger does require strict checkpoints: our log file isn't idempotent, so it's necessary to have a consistent starting point when running recovery. We have also been more tightly coupling the MongoDB replication durability mechanism with WiredTiger durability in recent releases - which relies on having strict checkpoints too.

Having said that - I did read a bit about the adaptive flushing algorithm, which is based on getting more aggressive about flushing as usage approaches an upper bound, and I believe WiredTiger eviction is implemented similarly. WiredTiger exposes a pair of thresholds for both dirty and clean cache content. One is a low water mark, eviction_{dirty_}target: if cache usage is below those thresholds WiredTiger does not attempt to reclaim or flush cache. The other, the upper bound, is configured via a poorly named trigger: once cache usage gets to eviction_trigger, application threads are required to contribute to cache maintenance prior to completing their operation, while eviction_dirty_trigger is an upper bound on dirty cache usage. We have been tuning heuristics to more aggressively flush/free content from cache as usage approaches the upper bounds - and that's an ongoing focus.
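The threshold behavior described above can be sketched as a toy Python model. This is not WiredTiger's implementation: the function names and mode labels are invented, the percentages in the usage lines are only example values, and the linear ramp between target and trigger is an assumption standing in for heuristics that the thread only describes as getting more aggressive near the upper bound.

```python
def eviction_mode(usage_pct: float, target_pct: float, trigger_pct: float) -> str:
    """Classify cache usage against a target/trigger pair (toy model)."""
    if usage_pct < target_pct:
        return "idle"  # below the low water mark: nothing to reclaim
    if usage_pct < trigger_pct:
        return "background"  # server threads evict on their own
    return "application-assisted"  # application threads must help


def background_pressure(usage_pct: float, target_pct: float, trigger_pct: float) -> float:
    """0.0 at the target, 1.0 at the trigger. The linear ramp is an assumption."""
    span = trigger_pct - target_pct
    return min(1.0, max(0.0, (usage_pct - target_pct) / span))


# Example values only: target=80, trigger=95.
for usage in (50.0, 87.5, 95.0):
    print(usage, eviction_mode(usage, 80, 95), background_pressure(usage, 80, 95))
```

The same two functions apply to the dirty pair by passing dirty usage with eviction_dirty_target and eviction_dirty_trigger values.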

The goal with the WiredTiger cache management algorithm is that we have server threads maintaining the cache, and application threads only ever need to pay the cost of reading the pages needed for their operations (writes should not need to be flushed from the buffer cache). If an application has a highly concurrent I/O bound workload those server threads can't keep up, and the mechanism we use to throttle the workload is to use the application threads to contribute to cache management directly.

InnoDB uses another factor to control how aggressively it flushes, which is log write volume - that seems to be due to a choice to use circular log buffers. WiredTiger doesn't use circular log buffers - so it doesn't require the same mechanism.

> But for me the first problem I see for an in-memory workload is mutex contention from eviction code.

We understand this point of contention - finding and queuing candidate pages to evict can become a bottleneck in high throughput workloads. We've been working on reducing that constraint - and intend to keep doing so. If you've got data/workloads that show such bottlenecks we would appreciate if you can share.

Yuri Finkelstein

Jul 6, 2018, 1:50:34 AM
to wiredtig...@googlegroups.com
Hi Alex, 
thanks for implementing the features above. They are still needed. 
But what would be even better is if these settings were not exposed at all because they would be auto-managed by an adaptive eviction strategy - something along the lines of what Mark is describing in InnoDB. What WT has is not quite that. Once the settings are chosen - default ones or custom - WT uses background and then active request threads to achieve the eviction rate goal as you describe above. This is clear.

But unfortunately there is no one set of optimal settings for eviction trigger and target that works for all workloads. For a write-intensive workload where write latency is more important than read latency, the eviction trigger needs to be chosen to be a high percentage of the cache. For a read-intensive workload where read tail latency is more important than write latency, it's the other way around: the trigger needs to be small (we actually run with 1%: while it reduces throughput, the read latencies are most consistent in this case).

It's not a very exciting proposition to customize these settings per use case manually. I wonder if the WT team knows about this and has plans along the lines of adaptive eviction ideas.

Thanks!
Yuri



Yuri Finkelstein

Jul 6, 2018, 2:53:47 AM
to wiredtig...@googlegroups.com
Alex, 
I also updated https://jira.mongodb.org/browse/WT-3633 with description of how to reproduce it - sorry I did not see the question earlier.
Thanks again.

Alexander Gorrod

Jul 9, 2018, 2:34:36 AM
to wiredtig...@googlegroups.com
Thanks - I've reopened the ticket.
