[boost] [Looking for Feedback] - Fairness lib


Federico Abrignani via Boost

Oct 21, 2023, 10:38:05 AM10/21/23
to bo...@lists.boost.org, Federico Abrignani
Dear Boost Community,


I am writing to seek feedback on a new C++ library that I have been
developing (https://github.com/Sernior/fairness).
This library, named "*Fairness*," is still a work in progress.
Nonetheless, I believe that this is an opportune time to begin gathering
feedback.

*Fairness* is designed to provide fair synchronization primitives. While
the standard C++ library offers various tools for managing concurrency,
it does not provide mechanisms for handling fairness policies.

To introduce the most fundamental component of my library, the
"priority_mutex," I would like to share a simple example:

#include <thread>
#include <chrono>
#include <algorithm>
#include <array>
#include <cstdint>
#include <BS_thread_pool.hpp>
#include <boost/fairness.hpp>

static boost::fairness::priority_mutex<4> pm;

static void busy_wait_nano(int64_t nanoseconds){
    auto begin = std::chrono::steady_clock::now();
    for (; std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - begin).count() < nanoseconds;)
        continue;
}

static void thread_function_nano(int p, int preCriticalTime, int postCriticalTime, int criticalTime){
    busy_wait_nano(preCriticalTime);
    pm.lock(p);
    busy_wait_nano(criticalTime);
    pm.unlock();
    busy_wait_nano(postCriticalTime);
}

int main()
{
    std::array<int, 8> prios {0, 2, 2, 1, 1, 3, 3, 0};
    std::array<int, 8> preCT {2000, 1500, 2000, 3000, 1000, 500, 500, 2000};
    std::array<int, 8> postCT {5000, 3000, 2000, 2500, 1000, 1500, 1500, 4500};
    int criticalTime = 1000;

    BS::thread_pool pool(8);

    for (int i = 0; i < 8; ++i){
        pool.push_task(thread_function_nano, prios[i], preCT[i], postCT[i], criticalTime);
    }
    pool.wait_for_tasks();

    return 0;
}

// using the BS thread pool (https://github.com/bshoshany/thread-pool) in this example.
In this example the priorities are not assigned randomly: I gave
priority 0 (the highest) to thread 0, whose total time is 2000 + 1000
+ 5000 ns, and to thread 7, whose total time is 2000 + 1000 + 4500 ns.
Priority mutexes are indexed much like a `std::array`, starting from 0.
For instance, a `priority_mutex<4>` will manage priorities ranging from
0 to 3, with 0 being the highest priority.

Currently, the library comprises priority mutexes, both recursive and
shared versions, along with RAII wrappers such as `unique_lock` and
`shared_lock`, all of which support priorities. There are, however, some
elements still missing, including priority condvars and timed versions
of the mutexes.
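
As a quick illustration, acquiring a priority mutex through the RAII
wrapper might look like the sketch below. This is a minimal sketch
assuming the wrapper takes the priority alongside the mutex, mirroring
the lock(p) call shown above; the exact signature may differ.

#include <boost/fairness.hpp>

static boost::fairness::priority_mutex<4> m;

void worker(int prio){
    // Assumed signature: the priority is passed next to the mutex.
    // The lock is released when lk leaves scope, even on exceptions.
    boost::fairness::unique_lock<boost::fairness::priority_mutex<4>> lk(m, prio);
    // ... critical section ...
}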

The primary motivation behind creating this library is to enhance
throughput in multi-threaded pipelines.
To illustrate the concept simply, imagine a pipeline with N tasks,
some of which access a protected resource.
In cases of contention, there exists an optimal order of resource
accesses that minimizes overall execution time; the images at
https://sernior.github.io/fairness/ show two different ways of
ordering the critical sections, one of which is much better than the other.

This library aims to empower developers who have profiled their
software to manipulate this order with minimal performance overhead and
as simply as possible.

In terms of implementation, much of the development has revolved around
benchmarking and testing. During this process, I've experimented with
over 10 different versions of the priority mutex.
Ultimately, I achieved the best results by using a priority spinlock,
which is also available to library users, to provide internal atomicity
to the rest of the mutexes.
An essential consideration in the implementation is ensuring that the
internal priority spinlock is 128 bytes away from the rest of the memory
used by the mutex, especially the memory that atomic::wait operates on.
This separation helps avoid false sharing. The choice of 128 bytes
aligns with processor architectures like AMD Zen 4 (the one I am testing
on), as indicated in its optimization manual.
I can attest that without that 128-byte separation the PM_LockUnlock
benchmark was 10x worse and the PM_pipeline_benchmark_fast was worse
than the std counterpart (see below for the benchmarks).
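
To make the layout concrete, here is a simplified sketch of that
separation; this is an illustration of the alignment trick only, not the
library's actual internal types.

#include <atomic>
#include <cstdint>

// Keep the internal spinlock state in its own 128-byte-aligned block so
// it never shares a destructive-interference range with the word that
// atomic::wait operates on.
struct alignas(128) internal_spinlock_block {
    std::atomic<uint32_t> spinlock_state{0};
};

struct priority_mutex_layout {
    internal_spinlock_block spinlock;            // occupies the first block
    alignas(128) std::atomic<uint32_t> waiters;  // the waited-on word, next block
};

static_assert(sizeof(internal_spinlock_block) == 128, "padded to a full block");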

Another aspect to note in the implementation is the use of 4 bytes for
booleans where only a single byte is needed. This decision is related to
the way atomic::wait is implemented. Most standard library
implementations of atomic::wait check whether the type's size is
supported by the current platform's wait syscall.
For instance, Windows can wait on 1 byte, while Linux-kernel-based
systems can only wait on 4 bytes.
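
A minimal sketch of why the 4-byte flag matters: with a 32-bit atomic,
atomic::wait can map directly onto the platform's wait syscall on both
systems instead of being emulated. Illustration only, not the library's code.

#include <atomic>
#include <cstdint>

// A 32-bit flag used as a boolean, sized to match the 4-byte futex word.
std::atomic<uint32_t> flag{0};

void lock(){
    while (flag.exchange(1, std::memory_order_acquire) == 1)
        flag.wait(1, std::memory_order_relaxed); // sleep while the value stays 1
}

void unlock(){
    flag.store(0, std::memory_order_release);
    flag.notify_one(); // wake one thread blocked in wait()
}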

These are some benchmark results:
6.4.0-kali3-amd64
AMD Ryzen 5 7600X 6-Core Processor

The LockUnlock benchmarks were the most difficult to optimize.
Those are 8 threads, each simply doing:
m.lock();
m.unlock();
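
For reference, the benchmark body is roughly the following. This is a
reconstruction using Google Benchmark (whose ->Threads(8) setting
produces the /threads:8 suffix in the results below), not the exact
source from the repository.

#include <benchmark/benchmark.h>
#include <boost/fairness.hpp>

static boost::fairness::priority_mutex<4> m;

// Each of the 8 benchmark threads runs this loop concurrently.
static void PM_LockUnlock(benchmark::State& state){
    for (auto _ : state){
        m.lock(0); // all threads use priority 0 in this sketch
        m.unlock();
    }
}
BENCHMARK(PM_LockUnlock)->Threads(8);
BENCHMARK_MAIN();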

The pipeline benchmarks (with different lengths) are actual pipelines
where each thread spends a different amount of time before/during/after
the critical section, and I try to set the priorities accordingly so
that everything finishes as soon as possible.

You can see there is also a mutex called slim_priority_mutex, a
priority_mutex implementation that can support up to 7 priorities (15 if
Boost.Atomic is found and your hardware supports DWCAS). These are
lightweight versions that take only 128 bits for up to 7 priorities and,
past that, 256 bits for up to 15.
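
Usage is intended to mirror priority_mutex; a quick sketch (assuming the
same lock(p)/unlock() interface carries over to the slim variant):

#include <boost/fairness.hpp>

// Up to 7 priorities keeps the whole mutex within 128 bits.
static boost::fairness::slim_priority_mutex<7> slim;

void task(int prio){
    slim.lock(prio); // assumed to match the priority_mutex interface
    // ... critical section ...
    slim.unlock();
}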

I am also testing a custom wait implementation
(https://github.com/Sernior/fairness/blob/main/include/boost/fairness/detail/wait_ops.hpp),
which can be tried by defining BOOST_FAIRNESS_USE_EXPERIMENTAL_WAIT_NOTIFY.
At some point I wish to make this wait the default version the lib uses
instead of the std one, since users of my lib could configure this
version through my configuration macros (the number of fast spins,
relaxed spins, and yield spins before the wait syscall).
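
The idea behind those macros is a staged back-off before falling through
to the syscall. A simplified sketch follows; the constants are
hypothetical stand-ins for the real configuration macros, whose names
live in wait_ops.hpp.

#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical stand-ins for the configurable spin counts.
constexpr int fast_spins    = 16; // tight polling, cheapest wake-up latency
constexpr int relaxed_spins = 16; // polling where a cpu-relax hint would go
constexpr int yield_spins   = 4;  // give up the time slice between polls

void wait_while_equal(std::atomic<uint32_t>& a, uint32_t old){
    for (int i = 0; i < fast_spins; ++i)
        if (a.load(std::memory_order_relaxed) != old) return;
    for (int i = 0; i < relaxed_spins; ++i){
        if (a.load(std::memory_order_relaxed) != old) return;
        // an architecture-specific pause instruction would be issued here
    }
    for (int i = 0; i < yield_spins; ++i){
        if (a.load(std::memory_order_relaxed) != old) return;
        std::this_thread::yield();
    }
    a.wait(old, std::memory_order_relaxed); // only now pay for the wait syscall
}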

As I said at the beginning, besides presenting my lib, the purpose of
this email is to obtain feedback. I would like to see if there is
anyone (other than me) interested in what I am doing and, if so, keep in
mind that I am still far from finished and I would love suggestions.

Thank you all for your attention.

Best regards,

Federico Abrignani.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Federico Abrignani via Boost

Oct 21, 2023, 11:42:24 AM10/21/23
to bo...@lists.boost.org, Federico Abrignani
Adding the benchmark results as text, since the image did not show in my mail.


---------------------------------------------------------------------------------------------
Benchmark                                                          Time         CPU  Iterations
---------------------------------------------------------------------------------------------
priority_mutex_benchmark::PM_LockUnlock/threads:8                  1.64 ns   13.0 ns   54189280
standard_mutex_benchmark::STD_LockUnlock/threads:8                0.712 ns   5.54 ns  125137360
slim_priority_mutex_benchmark::SLM_PM_LockUnlock/threads:8         3.27 ns   26.0 ns   27112392
spinlock_priority_mutex_benchmark::SPNLC_PM_LockUnlock/threads:8  0.825 ns   6.49 ns  111065720
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_LockUnlock/threads:8  3.41 ns  27.2 ns  25413960
recursive_priority_mutex_benchmark::R_PM_LockUnlock/threads:8      1.65 ns   13.1 ns   53737320
standard_recursive_mutex_benchmark::R_STD_LockUnlock/threads:8    0.960 ns   7.67 ns  105165456
shared_priority_mutex_benchmark::PM_S_LockUnlock/threads:8         1.62 ns   12.9 ns   54708152
standard_shared_mutex_benchmark::STD_S_LockUnlock/threads:8       0.965 ns   7.68 ns  106124968
shared_priority_mutex_benchmark::PM_S_SLockSUnlock/threads:8       1.58 ns   12.6 ns   54070432
standard_shared_mutex_benchmark::STD_S_SLockSUnlock/threads:8     0.721 ns   5.76 ns  117148936
____________________________________________________________________________________________

priority_mutex_benchmark::PM_pipeline_benchmark_long/threads:8                 9541052 ns  52513517 ns  16
standard_mutex_benchmark::STD_pipeline_benchmark_long/threads:8                9774822 ns  51878916 ns  16
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_long/threads:8        9380027 ns  55017352 ns  16
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_long/threads:8  9531843 ns  75511719 ns   8
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_pipeline_benchmark_long/threads:8  9532650 ns  75912929 ns  8
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_long/threads:8     9539370 ns  52512176 ns  16
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_long/threads:8    9775127 ns  51879528 ns  16
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_long/threads:8        9531850 ns  76251304 ns  16
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_long/threads:8       9775146 ns  51880461 ns  16
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_long/threads:8       6484466 ns  51870805 ns  16
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_long/threads:8      6484449 ns  51869927 ns  16
____________________________________________________________________________________________

priority_mutex_benchmark::PM_pipeline_benchmark_gaming/threads:8                 949731 ns  5213265 ns  136
standard_mutex_benchmark::STD_pipeline_benchmark_gaming/threads:8               1001777 ns  5193649 ns  136
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_gaming/threads:8        966555 ns  5557157 ns  128
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_gaming/threads:8  952582 ns  7604306 ns   96
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_pipeline_benchmark_gaming/threads:8  921612 ns  7355304 ns  88
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_gaming/threads:8     969296 ns  5291337 ns  136
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_gaming/threads:8   1001943 ns  5193937 ns  136
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_gaming/threads:8        921843 ns  7348795 ns   88
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_gaming/threads:8       999545 ns  5191391 ns  136
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_gaming/threads:8       648466 ns  5187308 ns  136
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_gaming/threads:8      648462 ns  5187299 ns  136
____________________________________________________________________________________________

priority_mutex_benchmark::PM_pipeline_benchmark_audio/threads:8                  96730 ns  525237 ns  1336
standard_mutex_benchmark::STD_pipeline_benchmark_audio/threads:8                102634 ns  521351 ns  1344
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_audio/threads:8         97671 ns  566508 ns  1264
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_audio/threads:8   91991 ns  734638 ns   936
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_pipeline_benchmark_audio/threads:8  97838 ns  780163 ns  928
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_audio/threads:8      96704 ns  525095 ns  1336
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_audio/threads:8    102726 ns  521373 ns  1344
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_audio/threads:8         92130 ns  734847 ns   904
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_audio/threads:8       102640 ns  521309 ns  1344
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_audio/threads:8        64865 ns  518881 ns  1352
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_audio/threads:8       64864 ns  518872 ns  1352
____________________________________________________________________________________________

priority_mutex_benchmark::PM_pipeline_benchmark_fast/threads:8                  1181 ns   7046 ns   95936
standard_mutex_benchmark::STD_pipeline_benchmark_fast/threads:8                 1537 ns   7263 ns   95040
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_fast/threads:8         1218 ns   9528 ns   78904
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_fast/threads:8   1004 ns   8017 ns   86744
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_pipeline_benchmark_fast/threads:8  1028 ns  8192 ns  85040
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_fast/threads:8      1177 ns   7084 ns   94584
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_fast/threads:8     1573 ns   7318 ns   96456
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_fast/threads:8         1380 ns  10979 ns   59928
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_fast/threads:8        1534 ns   7301 ns   97200
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_fast/threads:8         665 ns   5318 ns  131400
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_fast/threads:8        663 ns   5303 ns  131816

Andrey Semashev via Boost

Oct 21, 2023, 12:28:28 PM10/21/23
to bo...@lists.boost.org, Andrey Semashev
On 10/21/23 17:11, Federico Abrignani via Boost wrote:
> Dear Boost Community,
>
>
> I am writing to seek feedback on a new C++ library that I have been
> developing (https://github.com/Sernior/fairness).
> This library, named "*Fairness*," is still a work in progress.
> Nonetheless, I believe that this is an opportune time to begin gathering
> feedback.
>
> *Fairness* is designed to provide fair synchronization primitives. While
> the standard C++ library offers various tools for managing concurrency,
> it does not provide mechanisms for handling fairness policies.

[snip]

> The primary motivation behind creating this library is to enhance
> throughput in multi-threaded pipelines.
> To illustrate the concept more simply, imagine a pipeline with N tasks,
> some of which access a protected resource.
> In cases of contention, there exists an optimal order of resource
> accesses that minimizes overall execution time; the images at
> https://sernior.github.io/fairness/ show two different ways of
> ordering the critical sections, one of which is much better than the other.
>
> This library aims to empower developers who have profiled their
> software to manipulate this order with minimal performance overhead and
> as simply as possible.

While I do think priority-based synchronization primitives would be a
very useful addition to Boost libraries, I find this motivation
puzzling. Priorities are introduced to avoid starvation of specific
(high-priority) threads at the expense of the less important threads
and, likely, overall performance (because fairness is not free). The
chart you show on the front page is nice, but it seems to me like a
hand-tailored case that is not necessarily related to reality.
Specifically, there is no guarantee that processing times for different
threads are aligned as depicted on your chart. For example, one of the
threads could have a very long processing time compared to the other
threads, which would negate any effect on the total processing time from
reordering threads using priorities. What priorities actually provide is
to minimize the processing time of the high-priority thread, pretty much
regardless of the low-priority threads, barring data starvation, of
course. So I think you should probably reconsider the motivation, or at
least present some evidence that the library helps to achieve the stated
goals in practice (and how it does that).

Federico Abrignani via Boost

Oct 21, 2023, 3:49:44 PM10/21/23
to Andrey Semashev via Boost, Federico Abrignani

> Priorities are introduced to avoid starvation of specific
> (high-priority) threads at the expense of the less important threads and
> likely overall performance (because fairness is not free). The chart you
> show on the front page is nice, but it seems to me like a hand-tailored
> case that is not necessarily related to reality.


Yes, the charts I provide are a simple hand-tailored case that I am
using to visually explain a concept that may happen in the real world.

> Specifically, there is
> no guarantee that processing times for different threads are aligned as
> depicted on your chart. For example, one of the threads could have a
> very long processing time compared to the other threads, which would negate
> any effect on the total processing time from reordering threads using
> priorities. What priorities actually provide is to minimize the processing
> time of the high-priority thread, pretty much regardless of the
> low-priority threads, barring data starvation, of course. So I think
> you should probably reconsider the motivation, or at least present some
> evidence that the library helps to achieve the stated goals in practice
> (and how it does that).
>

Right, but that is also why I said "After profiling your software".
There is no guarantee until a developer sees it happening.
Allow me to quote my own documentation:
"
These tools, if misused, have the potential to cause harm rather than
benefit. Careful implementation and understanding are crucial to harness
their benefits effectively.
"

What I meant by that sentence is exactly what you are saying:
these tools should be used by people who have first profiled their
software and found themselves in the situation where they thought:
"If only I had a way to tell my threads that a specific task should be
performed first".

But I share your concern, as it is also mine. I certainly need to
perform more studies to find out exactly the boundaries of when using
priority mutexes is better than not using them.

If you are interested you can have a look at my pipeline benchmarks and
tinker with them!

https://github.com/Sernior/fairness/blob/main/benchmarks/pm/priority_mutex_benchmark.hpp

        std::array<int, 8> prios {0, 2, 2, 1, 1, 3, 3, 0};

        std::array<int, 8> preCT {20, 15, 20, 30, 10, 5, 5, 20};
        int CT = 10;
        std::array<int, 8> postCT {50, 30, 20, 25, 10, 15, 15, 45};

These 3 arrays are the priorities and the times before and after the
critical sections for the 8 threads; CT is the duration of the critical section.

But returning to your example: if one of the threads, let's call it *L*,
has a very *L*ong processing time compared to the other threads,
wouldn't you agree that it would be better if *L* had priority over the
other threads, if you wanted to finish the tasks of all threads as fast
as possible?


Feel free to contact me by email directly if you want some explanations
on how to use my benchmarks!

Thanks for the feedback.


Federico Abrignani.

Andrey Semashev via Boost

Oct 21, 2023, 7:52:33 PM10/21/23
to bo...@lists.boost.org, Andrey Semashev
On 10/21/23 21:03, Federico Abrignani via Boost wrote:
>
>> Specifically, there is
>> no guarantee that processing times for different threads are aligned as
>> depicted on your chart. For example, one of the threads could have a
>> very long processing time compared to the other threads, which would negate
>> any effect on the total processing time from reordering threads using
>> priorities. What priorities actually provide is to minimize the processing
>> time of the high-priority thread, pretty much regardless of the
>> low-priority threads, barring data starvation, of course. So I think
>> you should probably reconsider the motivation, or at least present some
>> evidence that the library helps to achieve the stated goals in practice
>> (and how it does that).
>
> But returning to your example: if one of the threads, let's call it *L*,
> has a very *L*ong processing time compared to the other threads,
> wouldn't you agree that it would be better if *L* had priority
> over the other threads, if you wanted to finish the tasks of all
> threads as fast as possible?

No, not necessarily. It is usually not a matter of long or short run
time, but rather of what the bottleneck is and what is required to finish fast.

To give an example, consider two threads. One produces data and puts
it in a queue; the other dequeues the data and, say, writes it to a
file or sends it over the network. Producing data is expensive, so the
first thread may take a lot of processing to produce a piece of the data.
Also, the queue has limited capacity and may overflow, in which case the
writer will overwrite previously written data. In order to avoid that,
you would typically prefer the second thread to have a higher priority,
so that the enqueued data is dispatched ASAP and the queue stays mostly
empty. Doing so may increase the runtime of the first thread, as it may
block on the queue for longer periods of time, especially if there are
multiple readers.
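
Expressed with the lock(p) interface from earlier in the thread, that
preference might look like the rough sketch below (a bounded queue where
the consumer side locks at the highest priority; illustrative only):

#include <cstddef>
#include <queue>
#include <boost/fairness.hpp>

static boost::fairness::priority_mutex<2> qm;
static std::queue<int> q;
constexpr std::size_t capacity = 1024;

bool produce(int item){
    qm.lock(1);                      // producer: lower priority
    bool ok = q.size() < capacity;   // bounded queue, reject on overflow
    if (ok) q.push(item);
    qm.unlock();
    return ok;
}

bool consume(int& out){
    qm.lock(0);                      // consumer: highest priority,
    bool ok = !q.empty();            // so enqueued data is drained ASAP
    if (ok){ out = q.front(); q.pop(); }
    qm.unlock();
    return ok;
}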

Surely, there is a wide variety of use cases, some of which are more
along the lines of what you're saying. My point, though, is that
priorities are not about minimizing overall run time, as your
documentation suggests. They are primarily about minimizing starvation
or minimizing latency of *specific* threads or tasks at the expense of
other threads and general performance. Whether that effect is used to
achieve correctness, reduce latency or improve throughput is
application-specific.

Federico Abrignani via Boost

Oct 21, 2023, 8:54:22 PM10/21/23
to Andrey Semashev via Boost, Federico Abrignani

> Surely, there is a wide variety of use cases, some of which are more
> along the lines of what you're saying. My point, though, is that
> priorities are not about minimizing overall run time, as your
> documentation suggests. They are primarily about minimizing starvation
> or minimizing latency of *specific* threads or tasks at the expense of
> other threads and general performance. Whether that effect is used to
> achieve correctness, reduce latency or improve throughput is
> application-specific.


I see your point. Yes, the reason I started this lib was loop-like
pipelines, and maybe I got too "settled" with that specific use case
while writing that piece of documentation.
The producer-consumer problem is surely not the use case I started
this library for, and who knows how many more use cases are out there.

But my point is: imagine you have a loop-like pipeline.

while (true){
    update();
    render();
}

This could be a video game or any other graphical application.
update() is performed by X threads, each one having a specific task.
In order to gain maximum FPS you want those threads to be able to finish
as fast as possible.
This is the use case I had in mind when I wrote that.
But I see your point now. I will remove that statement and instead
provide examples of different use cases.

Thanks for the feedback.

Federico Abrignani.
