Hi all,
I am working on scaling MySQL on top of OSv on large VMs (32 vcpus), and I wanted to share with you several observations and performance/scalability issues that I have encountered so far.
In general, the performance of MySQL on top of OSv out of the box is about 7x worse for read-only workloads and about 5x worse for read-write workloads, compared to running the same setup on Linux as the guest (exact details on the setup below). In most cases it appears that OSv permits very little concurrency; as a result, most of the MySQL server threads are blocked and the OSv vcpus are mostly idle.
More specifically, one of the first issues is the mempool. As MySQL does memory allocations (via std malloc) which frequently exceed the size of a page, these end up on the malloc_large() code path, which takes a lfmutex (WITH_LOCK) and effectively permits no concurrency. This hampers performance significantly (even on the free() path). It looks like MySQL would greatly benefit from a memory allocator underneath that also pools memory for large allocations and provides lock-free fast paths to that end.
Now, I have worked around the issue temporarily (by pooling memory on the MySQL side), in order to see whether this removes the bottleneck. Unfortunately, there are many other places in the code that prevent MySQL from scaling.
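The workaround is conceptually something like the following (a simplified sketch, not the actual MySQL-side code; the single 64 KB size class is made up):

#include <cstddef>
#include <cstdlib>
#include <vector>

// A per-thread cache of freed large blocks, so repeat allocations of the
// same size class never reach malloc_large() and its global lock.
struct large_block_pool {
    static constexpr std::size_t BLOCK_SIZE = 64 * 1024;
    std::vector<void*> free_blocks;       // touched only by the owning thread

    void* alloc() {
        if (!free_blocks.empty()) {       // fast path: no lock, no malloc_large()
            void* p = free_blocks.back();
            free_blocks.pop_back();
            return p;
        }
        return std::malloc(BLOCK_SIZE);   // slow path: global allocator
    }
    void release(void* p) { free_blocks.push_back(p); }

    ~large_block_pool() {
        for (void* p : free_blocks)
            std::free(p);
    }
};

// One pool per thread makes alloc() and release() lock-free, as long as a
// block is released by the same thread that allocated it.
thread_local large_block_pool tl_pool;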
The next issue is related to the pthread-based rwlocks. OSv implements those via std::lock_guard<mutex>, which in turn leverages the OSv lfmutex. This means that even read acquisitions (pthread_rwlock_rdlock()) need to block (as they have to acquire the lfmutex), and thus, again, the MySQL server threads cannot proceed concurrently.
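To illustrate, the effect is equivalent to the following naive pattern (a simplified sketch for illustration, not OSv's actual code): even pthread_rwlock_rdlock() must pass through the internal mutex.

#include <condition_variable>
#include <mutex>

class naive_rwlock {
    std::mutex mtx;                  // in OSv this role is played by the lfmutex
    std::condition_variable cv;
    int readers = 0;
    bool writer = false;
public:
    void rdlock() {
        std::unique_lock<std::mutex> lk(mtx);    // readers serialize here...
        cv.wait(lk, [this] { return !writer; });
        ++readers;
    }
    void rdunlock() {
        std::lock_guard<std::mutex> lk(mtx);     // ...and again on release
        if (--readers == 0)
            cv.notify_all();
    }
    void wrlock() {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [this] { return !writer && readers == 0; });
        writer = true;
    }
    void wrunlock() {
        std::lock_guard<std::mutex> lk(mtx);
        writer = false;
        cv.notify_all();
    }
};

Since the lfmutex puts contending threads to sleep, every one of these mtx acquisitions can trigger a context switch, even when only readers are present.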
Similar issues manifest after working around instances of the above problem. For example, concurrent calls to the VFS access() also end up taking a lock and putting the server threads to sleep (via the namei/dentry_lookup mutex lock).
In all cases that I have encountered, it was beneficial for performance to either avoid the code paths that involve acquiring the lfmutex where possible, or simply to switch to a spinlock-based locking scheme. Since this clearly runs contrary to the design decision not to use spinlocks in OSv (due to the lock-holder preemption issue), your input and suggestions would be most welcome here.
On 02/29/2016 05:47 PM, Huawei DBERC wrote:
> [...]
> In all cases that I have encountered, it was beneficial for performance to either avoid the code paths that involve acquiring the lfmutex where possible, or simply to switch to a spinlock-based locking scheme. Since this clearly runs contrary to the design decision not to use spinlocks in OSv (due to the lock-holder preemption issue), your input and suggestions would be most welcome here.
Probably the best approach is a specialized rwlock implementation that uses cmpxchg when there are no lock owners or only readers, and falls back to the mutex when it detects a writer (or when acquiring the rwlock for write).
> Best regards,
> Antonis
> Probably the best approach is a specialized rwlock implementation that uses cmpxchg when there are no lock owners or only readers, and falls back to the mutex when it detects a writer (or when acquiring the rwlock for write).

Yes, that would make good sense for the rwlock case (in fact, this is how I would expect any typical rwlock implementation to function internally).
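Something along these lines, I imagine (a rough sketch of the proposed cmpxchg fast path; all names and details are made up):

#include <atomic>
#include <mutex>
#include <thread>

class hybrid_rwlock {
    static constexpr unsigned WRITER = 1u << 31;  // writer-present bit
    std::atomic<unsigned> state{0};               // reader count + WRITER bit
    std::mutex fallback;                          // slow path; serializes writers
public:
    void rdlock() {
        unsigned s = state.load();
        // Fast path: no writer present, so just bump the reader count
        // with a cmpxchg; concurrent readers never sleep here.
        while (!(s & WRITER)) {
            if (state.compare_exchange_weak(s, s + 1))
                return;
        }
        // Slow path: a writer is present; fall back to the mutex.
        std::lock_guard<std::mutex> lk(fallback);
        state.fetch_add(1);
    }
    void rdunlock() { state.fetch_sub(1); }
    void wrlock() {
        fallback.lock();                 // excludes writers and slow-path readers
        state.fetch_or(WRITER);          // stop new fast-path readers
        while (state.load() != WRITER)   // wait for in-flight readers to drain
            std::this_thread::yield();
    }
    void wrunlock() {
        state.fetch_and(~WRITER);
        fallback.unlock();
    }
};

With this, concurrent read acquisitions cost one compare-exchange on a single word, and the mutex is only touched when a writer is involved.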
I would be highly interested, though, to hear your comments and suggestions regarding the lfmutex in general; it appears to really impact scalability in many other cases (as per my notes above). Would it be possible to have an OSv build option where the lfmutex becomes a spinlock? From a performance perspective this appears to make some sense for large VMs and highly concurrent workloads. Any other alternatives?
On Mon, Feb 29, 2016 at 5:47 PM, Huawei DBERC <huawei...@gmail.com> wrote:

> Hi all,
> I am working on scaling MySQL on top of OSv on large VMs (32 vcpus), and I wanted to share with you several observations and performance/scalability issues that I have encountered so far.

Hi Antonis, you've made some very interesting benchmarks and analysis of the bottlenecks, and thank you for that.

In the past, we've made a significant effort to optimize certain parts of OSv which had been bottlenecks in the applications we focused on. But it's definitely possible, even likely, that other code paths which were lightly used in those applications got "neglected": not enough attention was paid to their performance, they became a bottleneck in other applications, and they need to be optimized as well. I think you found a couple of good examples:
> More specifically, one of the first issues is the mempool. As MySQL does memory allocations (via std malloc) which frequently exceed the size of a page, these end up on the malloc_large() code path, which takes a lfmutex (WITH_LOCK) and effectively permits no concurrency. [...]

Indeed, we've paid a lot of attention to the performance of small allocations, which in all the applications we targeted were the primary form of allocation. In particular, for Java applications one never sees large malloc() calls, because the JVM heap, not malloc(), is responsible for most allocations. So we obviously missed the performance of these large allocations, and yes, we need to optimize them better. Small allocations, on the other hand, are needed everywhere (including inside OSv itself), so they were very important to optimize, and we already did.

As you noted, while doing this sort of optimization it is useful to have a benchmark. You noticed below that even improving large-allocation performance does not improve MySQL performance, because of other problems, so it might be worthwhile to address those other problems first and record the large-allocation performance issue for later (by creating a bug-tracker issue for it).
> The next issue is related to the pthread-based rwlocks. OSv implements those via std::lock_guard<mutex>, which in turn leverages the OSv lfmutex. [...]

This is a very true point, which we have noticed in the past and handled in various ways, mostly by avoiding rwlocks, but we should indeed find better and more permanent fixes for applications that do use rwlocks.
As you noted, our rwlock is a naive, almost "textbook", implementation of a reader-writer lock, using a mutex to protect reader and writer counters. Such an implementation has two serious problems:
1. Slowness: an *uncontended* rwlock lock/unlock is twice as slow as a mutex's lock/unlock. This is because locking an rwlock requires both locking and unlocking the internal mutex (and rwlock::unlock() does both again). This means that in a lightly contended use case, converting a mutex to a rwlock (in an attempt to allow more concurrency) actually slows the code down.

2. Spurious contention: concurrent read-lock attempts, which should not cause any contention, do cause spurious contention on the internal mutex: as you noticed, we need to take the mutex to verify that we can take the read lock, and if several CPUs do this at the same time, they block.

You are right in pointing out that a *sleeping* mutex makes problem #2 much worse. The issue is the fact that our mutex is unconditionally a *sleeping* one, i.e., whenever a mutex lock() attempt finds the mutex already locked, it immediately puts the calling thread to sleep and initiates a context switch. This is usually the right thing to do, but not when we know the mutex will only be locked for an extremely short duration (as is the case in rwlock); there it would make sense to spin a bit, retrying the lock, instead of going to sleep immediately. We actually made exactly the same observation in the past, and produced some patches to have the mutex do both spinning (for a short while) and sleeping (if spinning didn't help). I will resend those patches if you are interested in trying and discussing them.
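Conceptually, the shape of those patches is something like the following (just a sketch of the spin-then-sleep idea, not the actual patches; SPIN_LIMIT is an arbitrary tuning knob):

#include <immintrin.h>   // _mm_pause(); assumes x86 with GCC/Clang

// Layer a bounded spin phase over any existing sleeping mutex that
// provides lock()/try_lock()/unlock().
template <typename SleepingMutex>
class adaptive_mutex {
    SleepingMutex mtx;                       // e.g. the existing OSv mutex
    static constexpr int SPIN_LIMIT = 100;   // assumed tuning knob
public:
    void lock() {
        // Spin phase: retry briefly, on the assumption that the holder is
        // running on another CPU and will release the lock very soon.
        for (int i = 0; i < SPIN_LIMIT; ++i) {
            if (mtx.try_lock())
                return;
            _mm_pause();                     // CPU hint: we are spin-waiting
        }
        // Sleep phase: contention persists, so pay for the context switch.
        mtx.lock();
    }
    bool try_lock() { return mtx.try_lock(); }
    void unlock() { mtx.unlock(); }
};

Assuming our mutex exposes a try_lock() (I believe it does), the rwlock's internal mutex could be swapped for something like this without touching the rwlock logic itself.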
But while adding the option of spinning to the sleeping mutex makes problem #2 less bad, and probably makes your workload much better, it does *NOT* make it go away, and it does *NOT* make the rwlock scalable when the number of cores doing read-locks concurrently is very high. This is because all the cores still fight over the single lock, and this fight gets slower as the number of cores increases. This fix also doesn't make problem #1 go away (that rwlock is slower than a mutex when uncontended).
By the way, it would be worthwhile to see how Linux's pthread rwlock is optimized, since that is probably what MySQL was written to be optimal for. If I understand correctly, it uses a combination of a lock-free algorithm (an atomic compare-exchange operation) and a futex to do the actual sleeping when needed, but I never looked at the details, and it would probably be useful to do so.
In the long run (though I'm not sure it's necessary for your use case) it would be even better to have even more efficient rwlock implementations, especially for the common and important case of a "read-mostly" workload (reads are very common and very concurrent, writes are rarer).

One important replacement we already have for the rwlock in a read-mostly scenario is RCU, and in the past we changed several places in OSv to use RCU instead of rwlock, solving *both* problems #1 and #2 mentioned above. We did this, for example, in the network stack, which heavily used rwlocks for read-mostly use cases (e.g., the routing table), and changing these rwlocks to RCU did wonders for performance.

But I've also seen in the research literature several implementations of the traditional rwlock API (rather than the less familiar RCU) which are optimized for read-mostly workloads, and are faster and more scalable there. A really optimal implementation would be (unlike the naive implementation, even with a spinning mutex) scalable in the number of cores doing read-locks in parallel, e.g., by keeping a separate reader count per core (with clever tricks to compute a global count), so that readers on different cores can get their read access without disturbing each other at all; see the sketch below.

This would be a very interesting direction to take OSv in, and I'll be happy to continue discussing it with you. But as I said above, it might be that simply adding a short-spinning option to the mutex used in rwlock is "good enough" for your use case.
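To give a flavor of the per-core reader count idea (a sketch of the concept from the literature, not working OSv code; the slot-selection function below is a placeholder for a real per-CPU id):

#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>

class brlock {
    static constexpr int NSLOTS = 64;
    struct alignas(64) slot {                 // one cache line per slot
        std::atomic<int> readers{0};
    };
    slot slots[NSLOTS];
    std::atomic<bool> writer{false};
    std::mutex writer_mtx;                    // serializes writers

    // Placeholder: in a real implementation this would be the current CPU
    // id; hashing the thread id merely spreads threads across the slots.
    static std::size_t my_slot() {
        return std::hash<std::thread::id>{}(std::this_thread::get_id()) % NSLOTS;
    }
public:
    void rdlock() {
        slot& s = slots[my_slot()];
        for (;;) {
            s.readers.fetch_add(1);           // touches only our own line
            if (!writer.load())
                return;                       // common case: no writer, done
            s.readers.fetch_sub(1);           // writer active: back out, wait
            while (writer.load())
                std::this_thread::yield();
        }
    }
    void rdunlock() { slots[my_slot()].readers.fetch_sub(1); }
    void wrlock() {
        writer_mtx.lock();
        writer.store(true);                   // stop new fast-path readers
        for (auto& s : slots)                 // wait for readers to drain
            while (s.readers.load() > 0)
                std::this_thread::yield();
    }
    void wrunlock() {
        writer.store(false);
        writer_mtx.unlock();
    }
};

In the common case a read-lock touches only its own core's cache line; the price is that a writer must scan and wait on every slot, which is the right trade-off only for read-mostly data. (The sketch uses the default sequentially consistent atomics to keep the reader/writer handshake simple.)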
> Similar issues manifest after working around instances of the above problem. For example, concurrent calls to the VFS access() also end up taking a lock and putting the server threads to sleep (via the namei/dentry_lookup mutex lock).

Yes, our VFS code is indeed too lock-happy, taking locks too often and holding them too long (see for example https://github.com/cloudius-systems/osv/issues/451 and https://github.com/cloudius-systems/osv/issues/450). In some cases, however, I wonder whether it's worthwhile to focus on OSv performance here, or whether the application is the culprit. In your case, why would MySQL call access() so often? But if its use case does make sense, then this part of our VFS could definitely be optimized.
> In all cases that I have encountered, it was beneficial for performance to either avoid the code paths that involve acquiring the lfmutex where possible, or simply to switch to a spinlock-based locking scheme. [...]

As I explained above, your problem is not with the "lock-free" part of the OSv mutex, but with the fact that when this mutex sees contention it goes to sleep and does not spin (retry taking the lock), not even a bit. Since going to sleep has a non-negligible time overhead (that of a context switch), say N nanoseconds, it doesn't make sense to do it when the lock is expected to be held for much less than N nanoseconds. rwlock is exactly such a case, because we know the lock is held for an extremely short duration (only while the lock holder checks the reader count). So if, before going to sleep, the mutex retried taking the lock (spun) a few times, we would get the benefits of a spinlock without its problems (since after a few retries, the mutex reverts to sleeping).
As I said, we already had patches to try this, which were never committed but I can dig them up for you to try, if you're interested.
Nadav.
> You are right in pointing out that a *sleeping* mutex makes problem #2 much worse. [...] We actually made exactly the same observation in the past, and produced some patches to have the mutex do both spinning (for a short while) and sleeping (if spinning didn't help). I will resend those patches if you are interested in trying and discussing them.

That would be great; I would certainly like to have a look at the patches, try them out, and see how we could proceed from there.
> By the way, it would be worthwhile to see how Linux's pthread rwlock is optimized, since that is probably what MySQL was written to be optimal for. If I understand correctly, it uses a combination of a lock-free algorithm (an atomic compare-exchange operation) and a futex to do the actual sleeping when needed, [...]

Indeed, this is the case, at least with the glibc NPTL-based implementation, which tries to be optimistic in order to avoid switching into the kernel unless contention is present. Note, though, that this is also not an ideal implementation from a scalability standpoint, since the readers go to sleep (via futex wait) without even spinning a bit on the user-space integer.
Fully agree. It is also worth mentioning that even some mutex implementations (e.g., within the Linux kernel itself, but also in NPTL) incorporate some degree of adaptiveness by spinning optimistically, on the assumption that the lock holder is already running on another CPU and it may therefore be better to avoid putting the requesting thread to sleep.
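Coming back to the NPTL pattern mentioned above: boiled down to a plain mutex, it looks roughly like this (a toy sketch of the user-space fast path plus futex sleep, not glibc's actual rwlock code):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(std::atomic<int>* uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, nullptr, nullptr, 0);
}

class futex_mutex {
    std::atomic<int> word{0};   // 0 = free, 1 = locked
public:
    void lock() {
        int expected = 0;
        // User-space fast path: an uncontended acquisition is a single
        // compare-exchange, with no syscall at all.
        while (!word.compare_exchange_strong(expected, 1)) {
            // Contended: sleep in the kernel. FUTEX_WAIT returns
            // immediately if word is no longer 1, so a wake-up between
            // the failed cmpxchg and this call cannot be lost.
            futex(&word, FUTEX_WAIT, 1);
            expected = 0;
        }
    }
    void unlock() {
        word.store(0);
        futex(&word, FUTEX_WAKE, 1);   // wake at most one sleeper
    }
};

Note that, exactly as discussed, a contended locker here goes straight to FUTEX_WAIT without spinning on the user-space word first.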