I am writing to seek feedback on a new C++ library that I have been
developing (https://github.com/Sernior/fairness).
This library, named "*Fairness*," is still a work in progress.
Nonetheless, I believe that this is an opportune time to begin gathering
feedback.
*Fairness* is designed to provide fair synchronization primitives. While
the standard C++ library offers various tools for managing concurrency,
it does not provide mechanisms for handling fairness policies.
To introduce the most fundamental component of my library, the
"priority_mutex," I would like to share a simple example:
#include <thread>
#include <chrono>
#include <array>
#include <BS_thread_pool.hpp>
#include <boost/fairness.hpp>

static boost::fairness::priority_mutex<4> pm;

static void busy_wait_nano(int64_t nanoseconds){
    auto begin = std::chrono::steady_clock::now();
    while (std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - begin).count() < nanoseconds)
        continue;
}

static void thread_function_nano(int p, int preCriticalTime, int postCriticalTime, int criticalTime){
    busy_wait_nano(preCriticalTime);
    pm.lock(p);
    busy_wait_nano(criticalTime);
    pm.unlock();
    busy_wait_nano(postCriticalTime);
}

int main()
{
    std::array<int, 8> prios {0, 2, 2, 1, 1, 3, 3, 0};
    std::array<int, 8> preCT {2000, 1500, 2000, 3000, 1000, 500, 500, 2000};
    std::array<int, 8> postCT {5000, 3000, 2000, 2500, 1000, 1500, 1500, 4500};
    int criticalTime = 1000;

    BS::thread_pool pool(8);

    for (int i = 0; i < 8; ++i){
        pool.push_task(thread_function_nano, prios[i], preCT[i], postCT[i], criticalTime);
    }
    pool.wait_for_tasks();

    return 0;
}

// using BS thread pool https://github.com/bshoshany/thread-pool in this example.
In this example the priorities are not assigned randomly: I gave priority 0 (the highest) to thread 0, whose total time is 2000 + 1000 + 5000 ns, and to thread 7, whose total is 2000 + 1000 + 4500 ns, the two largest totals.
Priority mutexes are 0-indexed, much like a `std::array`. For instance, a `priority_mutex<4>` manages priorities ranging from 0 to 3, with 0 being the highest priority.
Currently, the library comprises priority mutexes, including recursive and shared versions, along with RAII wrappers such as `unique_lock` and `shared_lock`, all of which support priorities. There are, however, some elements still missing, including priority condition variables and timed versions of the mutexes.
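As a quick illustration of the wrappers, scoped locking looks roughly like this (a sketch; I am assuming here that the wrapper takes the priority as an extra constructor argument and forwards it to lock(), mirroring the std interface):

#include <boost/fairness.hpp>

static boost::fairness::priority_mutex<4> m;

void worker(int prio){
    // assumption: the RAII wrapper forwards prio to m.lock(prio)
    boost::fairness::unique_lock<boost::fairness::priority_mutex<4>> lk(m, prio);
    // ... critical section; unlocked automatically on scope exit ...
}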
The primary motivation behind creating this library is to enhance throughput in multi-threaded pipelines.
To illustrate the concept more simply, imagine a pipeline with N tasks, some of which access a protected resource.
In cases of contention, there exists an optimal order of resource accesses that minimizes overall execution time; the images at https://sernior.github.io/fairness/ show 2 different ways of ordering the critical sections, one of which is much better than the other.
This library aims to empower developers, who have profiled their software, to manipulate this order with minimal performance overhead and as simply as possible.
In terms of implementation, much of the development has revolved around benchmarking and testing. During this process, I've experimented with over 10 different versions of the priority mutex.
Ultimately, I achieved the best results by using a priority spinlock, which is also available to library users, to provide internal atomicity to the rest of the mutexes.
An essential consideration in the implementation is ensuring that the internal priority spinlock is 128 bytes away from the rest of the memory used by the mutex, especially the memory on which the atomic::wait is performed.
This separation helps avoid false sharing. The choice of 128 bytes aligns with processor architectures like AMD Zen 4 (the one I am testing on), as indicated in its optimization manual.
I can attest that without that 128-byte separation the PM_LockUnlock benchmark was 10X worse and PM_pipeline_benchmark_fast was worse than the std counterpart (see below for the benchmarks).
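The layout idea, in a minimal illustrative sketch (not the library's actual internals; the field names are made up):

#include <atomic>
#include <cstdint>

// Keep the internal spinlock word and the futex word far enough apart
// that spinning on one never invalidates the cache line holding the other.
// 128 bytes matches the destructive interference range on CPUs that
// prefetch cache lines in adjacent pairs, such as Zen 4.
struct padded_mutex_state {
    alignas(128) std::atomic<std::uint32_t> spinlock_word{0}; // internal priority spinlock state
    alignas(128) std::atomic<std::uint32_t> wait_word{0};     // the word atomic::wait operates on
};

static_assert(sizeof(padded_mutex_state) == 256, "each word sits in its own 128-byte block");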
Another aspect to note in the implementation is the use of 4 bytes for booleans when only a single byte is needed. This decision is related to the way atomic::wait is implemented: most standard library implementations of atomic::wait check whether the type's size is supported by the current platform's wait syscall.
For instance, Windows can wait on 1 byte, while Linux-kernel-based systems can only wait on 4 bytes (the futex word).
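Concretely, that means storing a flag like this (an illustrative snippet, not library code), so that on Linux the wait maps directly onto the 4-byte futex word instead of a fallback path:

#include <atomic>
#include <cstdint>

// A "boolean" widened to 32 bits so std::atomic::wait can go straight
// to futex(2) on Linux.
std::atomic<std::uint32_t> locked{0};

void wait_until_unlocked(){
    std::uint32_t v = locked.load(std::memory_order_acquire);
    while (v != 0){
        locked.wait(v, std::memory_order_acquire); // blocks while the value is still v
        v = locked.load(std::memory_order_acquire);
    }
}

void unlock_and_notify(){
    locked.store(0, std::memory_order_release);
    locked.notify_all(); // futex wake
}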
These are some benchmark results (the full tables are at the end of this message), measured on:
Linux 6.4.0-kali3-amd64
AMD Ryzen 5 7600X 6-Core Processor
Now the LockUnlock benchmarks were the most difficult to optimize.
Those are 8 threads simply doing:
m.lock();
m.unlock();
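Judging by the output format, these were run with Google Benchmark; a sketch of such a LockUnlock benchmark might look like the following (the per-thread priority choice here is my own illustrative assumption):

#include <benchmark/benchmark.h>
#include <boost/fairness.hpp>

static boost::fairness::priority_mutex<4> m;

static void PM_LockUnlock(benchmark::State& state){
    // give each of the 8 threads a priority derived from its index
    const int prio = state.thread_index() % 4;
    for (auto _ : state){
        m.lock(prio);
        m.unlock();
    }
}
BENCHMARK(PM_LockUnlock)->Threads(8);

BENCHMARK_MAIN();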
The pipeline benchmarks (of different lengths) are actual pipelines where each thread has different times before/during/after the critical section, and I try to set the priorities accordingly so as to finish as soon as possible.
You can see there is also a mutex called slim_priority_mutex, a priority_mutex implementation that can support up to 7 priorities (15 if Boost.Atomic is found and your hardware supports DWCAS). These are lightweight versions that take only 128 bits for up to 7 priorities and 256 bits past that, up to 15.
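Instantiation looks just like the plain priority_mutex (a sketch; the sizes in the comments restate the figures above as assumptions, not documented guarantees):

#include <boost/fairness.hpp>

// up to 7 priorities: state packs into 128 bits
static boost::fairness::slim_priority_mutex<7> slim7;

// up to 15 priorities: needs Boost.Atomic and hardware DWCAS, 256 bits
// static boost::fairness::slim_priority_mutex<15> slim15;

void use(){
    slim7.lock(0); // same lock(priority)/unlock() interface as priority_mutex
    slim7.unlock();
}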
I am also testing a custom wait implementation (https://github.com/Sernior/fairness/blob/main/include/boost/fairness/detail/wait_ops.hpp), which can be tried by defining BOOST_FAIRNESS_USE_EXPERIMENTAL_WAIT_NOTIFY.
At some point I wish to make this wait the default version the lib uses instead of the std one, since users of my lib could then tune it through my config macros (the number of fast spins, relaxed spins, and yield spins performed before the wait syscall).
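The general shape of such a configurable wait is roughly the following (a sketch of the common spin-then-wait pattern; the macro names and default counts here are made up, see wait_ops.hpp for the real ones):

#include <atomic>
#include <cstdint>
#include <thread>

#if defined(__x86_64__) || defined(_M_X64)
    #include <immintrin.h> // _mm_pause
#endif

// Illustrative tunables; the library's actual macro names and values differ.
#ifndef FAST_SPINS
#define FAST_SPINS 4
#endif
#ifndef RELAXED_SPINS
#define RELAXED_SPINS 32
#endif
#ifndef YIELD_SPINS
#define YIELD_SPINS 2
#endif

// Spin with increasing politeness before falling back to the blocking wait.
void wait_value(std::atomic<std::uint32_t>& a, std::uint32_t old){
    for (int i = 0; i < FAST_SPINS; ++i)            // 1) tight spins
        if (a.load(std::memory_order_relaxed) != old) return;
    for (int i = 0; i < RELAXED_SPINS; ++i){        // 2) spins with a pause hint
        if (a.load(std::memory_order_relaxed) != old) return;
#if defined(__x86_64__) || defined(_M_X64)
        _mm_pause();
#endif
    }
    for (int i = 0; i < YIELD_SPINS; ++i){          // 3) yield to the scheduler
        if (a.load(std::memory_order_relaxed) != old) return;
        std::this_thread::yield();
    }
    while (a.load(std::memory_order_relaxed) == old)
        a.wait(old, std::memory_order_relaxed);     // 4) blocking wait syscall
}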
As I said at the beginning, other than presenting my lib, this email has the purpose of obtaining feedback. I would like to see if there is anyone (other than me) interested in what I am doing and, if so, to note that I am still far from finished and I would love suggestions.
Thank you all for your attention.
Best regards,
Federico Abrignani.
--------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                     Time          CPU   Iterations
--------------------------------------------------------------------------------------------------------------------------
priority_mutex_benchmark::PM_LockUnlock/threads:8                                          1.64 ns      13.0 ns     54189280
standard_mutex_benchmark::STD_LockUnlock/threads:8                                        0.712 ns      5.54 ns    125137360
slim_priority_mutex_benchmark::SLM_PM_LockUnlock/threads:8                                 3.27 ns      26.0 ns     27112392
spinlock_priority_mutex_benchmark::SPNLC_PM_LockUnlock/threads:8                          0.825 ns      6.49 ns    111065720
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_LockUnlock/threads:8                  3.41 ns      27.2 ns     25413960
recursive_priority_mutex_benchmark::R_PM_LockUnlock/threads:8                              1.65 ns      13.1 ns     53737320
standard_recursive_mutex_benchmark::R_STD_LockUnlock/threads:8                            0.960 ns      7.67 ns    105165456
shared_priority_mutex_benchmark::PM_S_LockUnlock/threads:8                                 1.62 ns      12.9 ns     54708152
standard_shared_mutex_benchmark::STD_S_LockUnlock/threads:8                               0.965 ns      7.68 ns    106124968
shared_priority_mutex_benchmark::PM_S_SLockSUnlock/threads:8                               1.58 ns      12.6 ns     54070432
standard_shared_mutex_benchmark::STD_S_SLockSUnlock/threads:8                             0.721 ns      5.76 ns    117148936
____________________________________________________________________________________________
priority_mutex_benchmark::PM_pipeline_benchmark_long/threads:8                          9541052 ns  52513517 ns           16
standard_mutex_benchmark::STD_pipeline_benchmark_long/threads:8                         9774822 ns  51878916 ns           16
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_long/threads:8                 9380027 ns  55017352 ns           16
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_long/threads:8           9531843 ns  75511719 ns            8
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_pipeline_benchmark_long/threads:8  9532650 ns  75912929 ns            8
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_long/threads:8              9539370 ns  52512176 ns           16
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_long/threads:8             9775127 ns  51879528 ns           16
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_long/threads:8                 9531850 ns  76251304 ns           16
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_long/threads:8                9775146 ns  51880461 ns           16
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_long/threads:8                6484466 ns  51870805 ns           16
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_long/threads:8               6484449 ns  51869927 ns           16
____________________________________________________________________________________________
priority_mutex_benchmark::PM_pipeline_benchmark_gaming/threads:8                         949731 ns   5213265 ns          136
standard_mutex_benchmark::STD_pipeline_benchmark_gaming/threads:8                       1001777 ns   5193649 ns          136
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_gaming/threads:8                966555 ns   5557157 ns          128
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_gaming/threads:8          952582 ns   7604306 ns           96
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_pipeline_benchmark_gaming/threads:8 921612 ns   7355304 ns           88
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_gaming/threads:8             969296 ns   5291337 ns          136
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_gaming/threads:8           1001943 ns   5193937 ns          136
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_gaming/threads:8                921843 ns   7348795 ns           88
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_gaming/threads:8               999545 ns   5191391 ns          136
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_gaming/threads:8               648466 ns   5187308 ns          136
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_gaming/threads:8              648462 ns   5187299 ns          136
____________________________________________________________________________________________
priority_mutex_benchmark::PM_pipeline_benchmark_audio/threads:8                           96730 ns    525237 ns         1336
standard_mutex_benchmark::STD_pipeline_benchmark_audio/threads:8                         102634 ns    521351 ns         1344
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_audio/threads:8                  97671 ns    566508 ns         1264
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_audio/threads:8            91991 ns    734638 ns          936
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_pipeline_benchmark_audio/threads:8   97838 ns    780163 ns          928
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_audio/threads:8               96704 ns    525095 ns         1336
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_audio/threads:8             102726 ns    521373 ns         1344
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_audio/threads:8                  92130 ns    734847 ns          904
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_audio/threads:8                102640 ns    521309 ns         1344
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_audio/threads:8                 64865 ns    518881 ns         1352
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_audio/threads:8                64864 ns    518872 ns         1352
____________________________________________________________________________________________
priority_mutex_benchmark::PM_pipeline_benchmark_fast/threads:8                             1181 ns      7046 ns        95936
standard_mutex_benchmark::STD_pipeline_benchmark_fast/threads:8                            1537 ns      7263 ns        95040
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_fast/threads:8                    1218 ns      9528 ns        78904
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_fast/threads:8              1004 ns      8017 ns        86744
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_pipeline_benchmark_fast/threads:8        1028 ns      8192 ns        85040
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_fast/threads:8                 1177 ns      7084 ns        94584
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_fast/threads:8                1573 ns      7318 ns        96456
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_fast/threads:8                    1380 ns     10979 ns        59928
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_fast/threads:8                   1534 ns      7301 ns        97200
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_fast/threads:8                    665 ns      5318 ns       131400
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_fast/threads:8                   663 ns      5303 ns       131816
[snip]
> The primary motivation behind creating this library is to enhance
> throughput in multi-threaded pipelines.
> To illustrate the concept more simply, imagine a pipeline with N tasks,
> some of which access a protected resource.
> In cases of contention, there exists an optimal order of resource
> accesses that minimizes overall execution time; the images at
> https://sernior.github.io/fairness/ show 2 different ways of ordering
> the critical sections, one of which is much better than the other.
>
> This library aims to empower developers, who have profiled their
> software, to manipulate this order with minimal performance overhead and
> as simply as possible.
While I do think priority-based synchronization primitives would be a very useful addition to Boost libraries, I find this motivation puzzling. Priorities are introduced to avoid starvation of specific (high-priority) threads at the expense of the less important threads and likely of overall performance (because fairness is not free). The chart you show on the front page is nice, but it seems to me like a hand-tailored case that is not necessarily related to reality. Specifically, there is no guarantee that processing times for different threads are aligned as depicted on your chart. For example, one of the threads could have a very long processing time compared to other threads, which would negate any effect on the total processing time from reordering threads using priorities. What priorities actually provide is to minimize the processing time of the high-priority thread, pretty much regardless of the low-priority threads, barring data starvation, of course. So I think you should probably reconsider the motivation, or at least present some evidence that the library helps to achieve the stated goals in practice (and how it does that).
Yes, the charts I provide are a simple hand-tailored case that I am using to visually explain a concept that can happen in the real world.
> Specifically, there is
> no guarantee that processing times for different threads are aligned as
> depicted on your chart. For example, one of the threads could have a
> very long processing time compared to other threads, which would negate
> any effect on the total processing time from reordering threads using
> priorities. What priorities actually provide is to minimize the processing
> time of the high-priority thread, pretty much regardless of the
> low-priority threads, barring data starvation, of course. So I think
> you should probably reconsider the motivation, or at least present some
> evidence that the library helps to achieve the stated goals in practice
> (and how it does that).
>
Right, but that is also why I said "after profiling your software". There is no guarantee until a developer sees it happening.
Allow me to quote my own documentation:
"
These tools, if misused, have the potential to cause harm rather than benefit. Careful implementation and understanding are crucial to harness their benefits effectively.
"
What I meant by this sentence is exactly what you are saying: these tools should be used by people who have first profiled their software and found themselves in a situation where they thought: "If only I had a way to tell my threads that a specific task should be performed first".
But I share your concern, as it is also mine. I certainly need to perform more studies to find out exactly the boundaries of when using priority mutexes is better than not using them.
If you are interested, you can have a look at my pipeline benchmarks and tinker with them:
https://github.com/Sernior/fairness/blob/main/benchmarks/pm/priority_mutex_benchmark.hpp

std::array<int, 8> prios {0, 2, 2, 1, 1, 3, 3, 0};
std::array<int, 8> preCT {20, 15, 20, 30, 10, 5, 5, 20};
int CT = 10;
std::array<int, 8> postCT {50, 30, 20, 25, 10, 15, 15, 45};

These 3 arrays are the priorities and the times before and after the critical sections for the 8 threads; CT is the duration of the critical section.
But returning to your example: if one of the threads, let's call it *L*, has a very *L*ong processing time compared to the other threads, wouldn't you agree that it would be better if *L* had priority over the other threads, if you wanted to finish the tasks of all threads as fast as possible? For instance, with a 10 ns critical section, 500 ns of post-critical work for *L* and 50 ns for another thread, letting *L* lock first finishes everything in about 510 ns, while the reverse order takes about 520 ns.
Feel free to contact me by email directly if you want some explanations
on how to use my benchmarks!
Thanks for the feedback.
Federico Abrignani.