Hi all,
I am working on scaling MySQL on top of OSv on large VMs (32 vcpus), and I wanted to share with you several observations and performance/scalability issues that I have encountered so far.
In general, the performance of MySQL on top of OSv out of the box is about 7x worse for read-only workloads and about 5x worse for read-write workloads, compared to running the same setup on Linux as the guest (exact details on the setup below). In most cases it appears that OSv permits very little concurrency; as a result, most of the MySQL server threads are blocked and the OSv vcpus are mostly idle.
More specifically, one of the first issues is the mempool. As MySQL does memory allocations (via std malloc) which frequently exceed the size of a page, these end up on the malloc_large() code path, which takes a lfmutex (WITH_LOCK) and effectively permits no concurrency. This hampers performance significantly (even on the free() path). It looks like MySQL would greatly benefit from a memory allocator underneath that also pools memory for large allocations and provides lock-free fast paths to that end.
Now, I have worked around the issue temporarily (by pooling memory on the MySQL side), in order to see whether this removes the bottleneck. Unfortunately, there are many other places in the code that prevent MySQL from scaling.
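The workaround is conceptually something like the following (a simplified sketch, not the actual MySQL-side code; the single 64 KB size class is made up):

#include <cstddef>
#include <cstdlib>
#include <vector>

// A per-thread cache of freed large blocks, so repeat allocations of the
// same size class never reach malloc_large() and its global lock.
struct large_block_pool {
    static constexpr std::size_t BLOCK_SIZE = 64 * 1024;
    std::vector<void*> free_blocks;       // touched only by the owning thread

    void* alloc() {
        if (!free_blocks.empty()) {       // fast path: no lock, no malloc_large()
            void* p = free_blocks.back();
            free_blocks.pop_back();
            return p;
        }
        return std::malloc(BLOCK_SIZE);   // slow path: global allocator
    }
    void release(void* p) { free_blocks.push_back(p); }

    ~large_block_pool() {
        for (void* p : free_blocks)
            std::free(p);
    }
};

// One pool per thread makes alloc() and release() lock-free, as long as a
// block is released by the same thread that allocated it.
thread_local large_block_pool tl_pool;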
The next issue is related to the pthread-based rwlocks. OSv implements those via std::lock_guard<mutex>, which in turn leverages the OSv lfmutex. This means that even read acquisitions (pthread_rwlock_rdlock()) need to block (as they have to acquire the lfmutex), and thus, again, the MySQL server threads cannot proceed concurrently.
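To illustrate, the effect is equivalent to the following naive pattern (a simplified sketch for illustration, not OSv's actual code): even pthread_rwlock_rdlock() must pass through the internal mutex.

#include <condition_variable>
#include <mutex>

class naive_rwlock {
    std::mutex mtx;                  // in OSv this role is played by the lfmutex
    std::condition_variable cv;
    int readers = 0;
    bool writer = false;
public:
    void rdlock() {
        std::unique_lock<std::mutex> lk(mtx);    // readers serialize here...
        cv.wait(lk, [this] { return !writer; });
        ++readers;
    }
    void rdunlock() {
        std::lock_guard<std::mutex> lk(mtx);     // ...and again on release
        if (--readers == 0)
            cv.notify_all();
    }
    void wrlock() {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [this] { return !writer && readers == 0; });
        writer = true;
    }
    void wrunlock() {
        std::lock_guard<std::mutex> lk(mtx);
        writer = false;
        cv.notify_all();
    }
};

Since the lfmutex puts contending threads to sleep, every one of these mtx acquisitions can trigger a context switch, even when only readers are present.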
Similar issues manifest after working around instances of the above problem. For example, concurrent calls to the VFS access() also end up taking a lock and putting the server threads to sleep (via the namei/dentry_lookup mutex lock).
In all cases that I have encountered, it was beneficial for performance to either avoid the code paths that involve acquiring the lfmutex where possible, or simply to switch to a spinlock-based locking scheme. Since this clearly runs contrary to the design decision not to use spinlocks in OSv (due to the lock-holder preemption issue), your input and suggestions would be most welcome here.
On 02/29/2016 05:47 PM, Huawei DBERC wrote:
> [...]
> In all cases that I have encountered, it was beneficial for performance to either avoid the code paths that involve acquiring the lfmutex where possible, or simply to switch to a spinlock-based locking scheme. Since this clearly runs contrary to the design decision not to use spinlocks in OSv (due to the lock-holder preemption issue), your input and suggestions would be most welcome here.
Probably the best approach is a specialized rwlock implementation that uses cmpxchg when there are no lock owners or only readers, and falls back to the mutex when it detects a writer (or when acquiring the rwlock for write).
> Best regards,
> Antonis
> Probably the best approach is a specialized rwlock implementation that uses cmpxchg when there are no lock owners or only readers, and falls back to the mutex when it detects a writer (or when acquiring the rwlock for write).

Yes, that would make good sense for the rwlock case (in fact, this is how I would expect any typical rwlock implementation to function internally).
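Something along these lines, I imagine (a rough sketch of the proposed cmpxchg fast path; all names and details are made up):

#include <atomic>
#include <mutex>
#include <thread>

class hybrid_rwlock {
    static constexpr unsigned WRITER = 1u << 31;  // writer-present bit
    std::atomic<unsigned> state{0};               // reader count + WRITER bit
    std::mutex fallback;                          // slow path; serializes writers
public:
    void rdlock() {
        unsigned s = state.load();
        // Fast path: no writer present, so just bump the reader count
        // with a cmpxchg; concurrent readers never sleep here.
        while (!(s & WRITER)) {
            if (state.compare_exchange_weak(s, s + 1))
                return;
        }
        // Slow path: a writer is present; fall back to the mutex.
        std::lock_guard<std::mutex> lk(fallback);
        state.fetch_add(1);
    }
    void rdunlock() { state.fetch_sub(1); }
    void wrlock() {
        fallback.lock();                 // excludes writers and slow-path readers
        state.fetch_or(WRITER);          // stop new fast-path readers
        while (state.load() != WRITER)   // wait for in-flight readers to drain
            std::this_thread::yield();
    }
    void wrunlock() {
        state.fetch_and(~WRITER);
        fallback.unlock();
    }
};

With this, concurrent read acquisitions cost one compare-exchange on a single word, and the mutex is only touched when a writer is involved.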
I would be highly interested, though, to hear your comments and suggestions regarding the lfmutex in general; it appears to really impact scalability in many other cases (as per my notes above). Would it be possible to have an OSv build option where the lfmutex becomes a spinlock? From a performance perspective this appears to make some sense for large VMs and highly concurrent workloads. Any other alternatives?
On Mon, Feb 29, 2016 at 5:47 PM, Huawei DBERC <huawei...@gmail.com> wrote:

> Hi all,
> I am working on scaling MySQL on top of OSv on large VMs (32 vcpus), and I wanted to share with you several observations and performance/scalability issues that I have encountered so far.

Hi Antonis, you've made some very interesting benchmarks and analysis of the bottlenecks, and thank you for that.

In the past, we've made a significant effort to optimize certain parts of OSv which had been bottlenecks in the applications we focused on. But it's definitely possible, even likely, that other code paths which were lightly used in those applications got "neglected": not enough attention was paid to their performance, they became a bottleneck in other applications, and they need to be optimized as well. I think you found a couple of good examples:
> More specifically, one of the first issues is the mempool. As MySQL does memory allocations (via std malloc) which frequently exceed the size of a page, these end up on the malloc_large() code path, which takes a lfmutex (WITH_LOCK) and effectively permits no concurrency. [...]

Indeed, we've paid a lot of attention to the performance of small allocations, which in all the applications we targeted were the primary form of allocation. In particular, for Java applications one never sees large malloc() calls, because the JVM heap, not malloc(), is responsible for most allocations. So we obviously missed the performance of these large allocations, and yes, we need to optimize them better. Small allocations, on the other hand, are needed everywhere (including inside OSv itself), so they were very important to optimize, and we already did.

As you noted, while doing this sort of optimization it is useful to have a benchmark. You noticed below that even improving large-allocation performance does not improve MySQL performance, because of other problems, so it might be worthwhile to address those other problems first and record the large-allocation performance issue for later (by creating a bug-tracker issue for it).
> The next issue is related to the pthread-based rwlocks. OSv implements those via std::lock_guard<mutex>, which in turn leverages the OSv lfmutex. [...]

This is a very true point, which we have noticed in the past and handled in various ways, mostly by avoiding rwlocks, but we should indeed find better and more permanent fixes for applications that do use rwlocks.
As you noted, our rwlock is a naive, almost "textbook", implementation of a reader-writer lock, using a mutex to protect reader and writer counters. Such an implementation has two serious problems:
1. Slowness: an *uncontended* rwlock lock/unlock is twice as slow as a mutex's lock/unlock. This is because locking an rwlock requires both locking and unlocking the internal mutex (and rwlock::unlock() does both again). This means that in a lightly contended use case, converting a mutex to a rwlock (in an attempt to allow more concurrency) actually slows the code down.

2. Spurious contention: concurrent read-lock attempts, which should not cause any contention, do cause spurious contention on the internal mutex: as you noticed, we need to take the mutex to verify that we can take the read lock, and if several CPUs do this at the same time, they block.

You are right in pointing out that a *sleeping* mutex makes problem #2 much worse. The issue is the fact that our mutex is unconditionally a *sleeping* one, i.e., whenever a mutex lock() attempt finds the mutex already locked, it immediately puts the calling thread to sleep and initiates a context switch. This is usually the right thing to do, but not when we know the mutex will only be locked for an extremely short duration (as is the case in rwlock); there it would make sense to spin a bit, retrying the lock, instead of going to sleep immediately. We actually made exactly the same observation in the past, and produced some patches to have the mutex do both spinning (for a short while) and sleeping (if spinning didn't help). I will resend those patches if you are interested in trying and discussing them.
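Conceptually, the shape of those patches is something like the following (just a sketch of the spin-then-sleep idea, not the actual patches; SPIN_LIMIT is an arbitrary tuning knob):

#include <immintrin.h>   // _mm_pause(); assumes x86 with GCC/Clang

// Layer a bounded spin phase over any existing sleeping mutex that
// provides lock()/try_lock()/unlock().
template <typename SleepingMutex>
class adaptive_mutex {
    SleepingMutex mtx;                       // e.g. the existing OSv mutex
    static constexpr int SPIN_LIMIT = 100;   // assumed tuning knob
public:
    void lock() {
        // Spin phase: retry briefly, on the assumption that the holder is
        // running on another CPU and will release the lock very soon.
        for (int i = 0; i < SPIN_LIMIT; ++i) {
            if (mtx.try_lock())
                return;
            _mm_pause();                     // CPU hint: we are spin-waiting
        }
        // Sleep phase: contention persists, so pay for the context switch.
        mtx.lock();
    }
    bool try_lock() { return mtx.try_lock(); }
    void unlock() { mtx.unlock(); }
};

Assuming our mutex exposes a try_lock() (I believe it does), the rwlock's internal mutex could be swapped for something like this without touching the rwlock logic itself.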
But while adding the option of spinning to the sleeping mutex makes problem #2 less bad, and probably makes your workload much better, it does *NOT* make it go away, and it does *NOT* make the rwlock scalable when the number of cores doing read-locks concurrently is very high. This is because all the cores still fight over the single lock, and this fight gets slower as the number of cores increases. This fix also doesn't make problem #1 go away (that rwlock is slower than a mutex when uncontended).
By the way, it would be worthwhile to see how Linux's pthread rwlock is optimized, since that is probably what MySQL was written to be optimal for. If I understand correctly, it uses a combination of a lock-free algorithm (an atomic compare-exchange operation) and a futex to do the actual sleeping when needed, but I never looked at the details, and it would probably be useful to do so.
In the long run (though I'm not sure it's necessary for your use case) it would be even better to have even more efficient rwlock implementations, especially for the common and important case of a "read-mostly" workload (reads are very common and very concurrent, writes are rarer).

One important replacement we already have for the rwlock in a read-mostly scenario is RCU, and in the past we changed several places in OSv to use RCU instead of rwlock, solving *both* problems #1 and #2 mentioned above. We did this, for example, in the network stack, which heavily used rwlocks for read-mostly use cases (e.g., the routing table), and changing these rwlocks to RCU did wonders for performance.

But I've also seen in the research literature several implementations of the traditional rwlock API (rather than the less familiar RCU) which are optimized for read-mostly workloads, and are faster and more scalable there. A really optimal implementation would be (unlike the naive implementation, even with a spinning mutex) scalable in the number of cores doing read-locks in parallel, e.g., by keeping a separate reader count per core (with clever tricks to compute a global count), so that readers on different cores can get their read access without disturbing each other at all; see the sketch below.

This would be a very interesting direction to take OSv in, and I'll be happy to continue discussing it with you. But as I said above, it might be that simply adding a short-spinning option to the mutex used in rwlock is "good enough" for your use case.
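To give a flavor of the per-core reader count idea (a sketch of the concept from the literature, not working OSv code; the slot-selection function below is a placeholder for a real per-CPU id):

#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>

class brlock {
    static constexpr int NSLOTS = 64;
    struct alignas(64) slot {                 // one cache line per slot
        std::atomic<int> readers{0};
    };
    slot slots[NSLOTS];
    std::atomic<bool> writer{false};
    std::mutex writer_mtx;                    // serializes writers

    // Placeholder: in a real implementation this would be the current CPU
    // id; hashing the thread id merely spreads threads across the slots.
    static std::size_t my_slot() {
        return std::hash<std::thread::id>{}(std::this_thread::get_id()) % NSLOTS;
    }
public:
    void rdlock() {
        slot& s = slots[my_slot()];
        for (;;) {
            s.readers.fetch_add(1);           // touches only our own line
            if (!writer.load())
                return;                       // common case: no writer, done
            s.readers.fetch_sub(1);           // writer active: back out, wait
            while (writer.load())
                std::this_thread::yield();
        }
    }
    void rdunlock() { slots[my_slot()].readers.fetch_sub(1); }
    void wrlock() {
        writer_mtx.lock();
        writer.store(true);                   // stop new fast-path readers
        for (auto& s : slots)                 // wait for readers to drain
            while (s.readers.load() > 0)
                std::this_thread::yield();
    }
    void wrunlock() {
        writer.store(false);
        writer_mtx.unlock();
    }
};

In the common case a read-lock touches only its own core's cache line; the price is that a writer must scan and wait on every slot, which is the right trade-off only for read-mostly data. (The sketch uses the default sequentially consistent atomics to keep the reader/writer handshake simple.)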
> Similar issues manifest after working around instances of the above problem. For example, concurrent calls to the VFS access() also end up taking a lock and putting the server threads to sleep (via the namei/dentry_lookup mutex lock).

Yes, our VFS code is indeed too lock-happy, taking locks too often and holding them too long (see for example https://github.com/cloudius-systems/osv/issues/451 and https://github.com/cloudius-systems/osv/issues/450). In some cases, however, I wonder whether it's worthwhile to focus on OSv performance here, or whether the application is the culprit. In your case, why would MySQL call access() so often? But if its use case does make sense, then this part of our VFS could definitely be optimized.
> In all cases that I have encountered, it was beneficial for performance to either avoid the code paths that involve acquiring the lfmutex where possible, or simply to switch to a spinlock-based locking scheme. [...]

As I explained above, your problem is not with the "lock-free" part of the OSv mutex, but with the fact that when this mutex sees contention it goes to sleep and does not spin (retry taking the lock), not even a bit. Since going to sleep has a non-negligible time overhead (that of a context switch), say N nanoseconds, it doesn't make sense to do it when the lock is expected to be held for much less than N nanoseconds. rwlock is exactly such a case, because we know the lock is held for an extremely short duration (only while the lock holder checks the reader count). So if, before going to sleep, the mutex retried taking the lock (spun) a few times, we would get the benefits of a spinlock without its problems (since after a few retries, the mutex reverts to sleeping).
As I said, we already had patches to try this, which were never committed but I can dig them up for you to try, if you're interested.
Nadav.
> You are right in pointing out that a *sleeping* mutex makes problem #2 much worse. [...] We actually made exactly the same observation in the past, and produced some patches to have the mutex do both spinning (for a short while) and sleeping (if spinning didn't help). I will resend those patches if you are interested in trying and discussing them.

That would be great; I would certainly like to have a look at the patches, try them out, and see how we could proceed from there.
> By the way, it would be worthwhile to see how Linux's pthread rwlock is optimized, since that is probably what MySQL was written to be optimal for. If I understand correctly, it uses a combination of a lock-free algorithm (an atomic compare-exchange operation) and a futex to do the actual sleeping when needed, [...]

Indeed, this is the case, at least with the glibc NPTL-based implementation, which tries to be optimistic in order to avoid switching into the kernel unless contention is present. Note, though, that this is also not an ideal implementation from a scalability standpoint, since the readers go to sleep (via futex wait) without even spinning a bit on the user-space integer.
Fully agree. It is also worth mentioning that even some mutex implementations (e.g., within the Linux kernel itself, but also in NPTL) incorporate some degree of adaptiveness by spinning optimistically, on the assumption that the lock holder is already running on another CPU and it may therefore be better to avoid putting the requesting thread to sleep.
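Coming back to the NPTL pattern mentioned above: boiled down to a plain mutex, it looks roughly like this (a toy sketch of the user-space fast path plus futex sleep, not glibc's actual rwlock code):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(std::atomic<int>* uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, nullptr, nullptr, 0);
}

class futex_mutex {
    std::atomic<int> word{0};   // 0 = free, 1 = locked
public:
    void lock() {
        int expected = 0;
        // User-space fast path: an uncontended acquisition is a single
        // compare-exchange, with no syscall at all.
        while (!word.compare_exchange_strong(expected, 1)) {
            // Contended: sleep in the kernel. FUTEX_WAIT returns
            // immediately if word is no longer 1, so a wake-up between
            // the failed cmpxchg and this call cannot be lost.
            futex(&word, FUTEX_WAIT, 1);
            expected = 0;
        }
    }
    void unlock() {
        word.store(0);
        futex(&word, FUTEX_WAKE, 1);   // wake at most one sleeper
    }
};

Note that, exactly as discussed, a contended locker here goes straight to FUTEX_WAIT without spinning on the user-space word first.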