xchg based thing...

Chris M. Thomasson

unread,

Nov 27, 2023, 1:15:07 AM11/27/23

to

Fwiw, here is a quick and dirty mock up of my atomic exchange based lifo
work queue thing. Not sure how to classify it quite yet. I am going to
use it as a work "queue" for my parts of my CPU based rendering engine,
no need for it wrt my shader work. However, I needed to do it. It's
passing Relacy tests. That's a good thing. Now I need to add in a slow
path so it can wait in the kernel during periods of no work. I think an
eventcount should work for it. Or perhaps I might integrate some state
in the pointer values for futexes, or whatever. I have not made up my
mind yet.

I will port my Relacy test units over to standard C++. It's not all that
hard. In fact, it's rather straightforward.

Speaking of pointer values, take note of CT_PENDING.

The fun part about this scheme is that it only uses atomic exchange for
its logic. It also has wait-free loopless push and flush:

void push(work* w)
{
// this completes the pending logic...
w->m_next.store(CT_PENDING, rl::mo_relaxed, $);
work* head = m_head.exchange(w, rl::mo_release, $);
w->m_next.store(head, rl::mo_release, $);
}

work* flush()
{
return m_head.exchange(nullptr, rl::mo_acquire, $);
}

Here is my current Relacy test unit:

https://github.com/dvyukov/relacy
_______________________________
#include <iostream>
#include <relacy/relacy.hpp>

#define CT_PRODUCERS_N 2
#define CT_CONSUMERS_N 3
#define CT_THREAD_N (CT_PRODUCERS_N + CT_CONSUMERS_N)

// Notes:
// Using spin backoff for waits as of now
// Need to slow-path it with with kernel waits
// Need to refine the per-node backoff

// humm...
#define CT_PENDING ((ct_xchg_lifo_ver_001::work*)(0xDEADBEEF))

struct ct_xchg_lifo_ver_001
{

// simulate some work...
struct work
{
rl::atomic<work*> m_next;
VAR_T(unsigned long) m_work;

work()
: m_next(CT_PENDING),
m_work(0)
{

}

~work()
{
RL_ASSERT(VAR(m_work) == 1);
}

// no data races on m_work!
void execute_work()
{
VAR(m_work) += 1;
}

// this is the pending node logic...
work* get_next()
{
work* next = m_next.load(rl::mo_consume, $);

while (next == CT_PENDING)
{
rl::backoff();
next = m_next.load(rl::mo_consume, $);
}

return next;
}
};

rl::atomic<work*> m_head;

ct_xchg_lifo_ver_001()
: m_head(nullptr)
{

}

~ct_xchg_lifo_ver_001()
{
RL_ASSERT(! m_head.load(rl::mo_relaxed, $));
}

void push(work* w)
{
// this completes the pending logic...
w->m_next.store(CT_PENDING, rl::mo_relaxed, $);
work* head = m_head.exchange(w, rl::mo_release, $);
w->m_next.store(head, rl::mo_release, $);
}

work* flush()
{
return m_head.exchange(nullptr, rl::mo_acquire, $);
}

void process(work* cur)
{
// process work nodes...

while (cur)
{
// do real work _before_ the pending logic
// this is highly beneficial.
cur->execute_work();

// get the next work node.
cur = cur->get_next();
}
}

// dump worked on nodes...
void destroy(work* cur)
{
while (cur)
{
work* next = cur->m_next.load(rl::mo_relaxed, $);

// no pending work shall be destroyed!
RL_ASSERT(next != CT_PENDING);

delete cur;
cur = next;
}
}
};

struct ct_relacy_test_fun : rl::test_suite<ct_relacy_test_fun, CT_THREAD_N>
{
ct_xchg_lifo_ver_001 m_work_lifo;

void before()
{

}

void after()
{
ct_xchg_lifo_ver_001::work* w = m_work_lifo.flush();
m_work_lifo.process(w);
m_work_lifo.destroy(w);
}

void producer(unsigned int tidx)
{
ct_xchg_lifo_ver_001::work* w = new ct_xchg_lifo_ver_001::work();
m_work_lifo.push(w);
}

void consumer(unsigned int tidx)
{
ct_xchg_lifo_ver_001::work* w = m_work_lifo.flush();
m_work_lifo.process(w);
m_work_lifo.destroy(w);
}

void thread(unsigned int tidx)
{
if (tidx < CT_PRODUCERS_N)
{
// producer threads
producer(tidx);
}

else
{
// consumer threads
consumer(tidx);
}
}
};

int
main()
{
std::cout << "Exchange LIFO Container Experiment ver:0.0.1\n";
std::cout << "Relacy Unit Test ver:0.0.1\n";
std::cout << "by: Chris M. Thomasson\n";
std::cout << "__________________________________\n" << std::endl;

{
rl::test_params p;

p.iteration_count = 400000;
//p.execution_depth_limit = 33333;
//p.search_type = rl::sched_bound;
//p.search_type = rl::fair_full_search_scheduler_type;
//p.search_type = rl::fair_context_bound_scheduler_type;

std::cout << "Executing Relacy Unit Test...\n";
std::cout << "__________________________________" << std::endl;

rl::simulate<ct_relacy_test_fun>(p);
}

return 0;
}
_______________________________

Chris M. Thomasson

unread,

Nov 27, 2023, 1:20:43 AM11/27/23

to

> [...]

>         // this is the pending node logic...
>         work* get_next()
>         {
>             work* next = m_next.load(rl::mo_consume, $);

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now, I think I might be able to get away with using a relaxed membar
here instead of consume. I think so because of the acquire release
relation ship between the push and flush functions. This is were Relacy
can come in handy. I will have some more time to work on it later on
tonight. I need to test that. I don't think a consume membar is needed
here. I think relaxed should work fine. Not exactly sure right now.

>
>             while (next == CT_PENDING)
>             {
>                 rl::backoff();
>                 next = m_next.load(rl::mo_consume, $);
>             }
>
>             return next;
>         }
>     };

[...]

Chris M. Thomasson

unread,

Nov 27, 2023, 1:25:59 AM11/27/23

to

On 11/26/2023 10:14 PM, Chris M. Thomasson wrote:

[...]

>     void push(work* w)
>     {
>         // this completes the pending logic...
>         w->m_next.store(CT_PENDING, rl::mo_relaxed, $);
>         work* head = m_head.exchange(w, rl::mo_release, $);

>         w->m_next.store(head, rl::mo_release, $);

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I also think I can use relaxed for that store! Humm... I just need to
test it. Tonight is going to be fun. :^)

> }
[...]

Chris M. Thomasson

unread,

Nov 27, 2023, 9:32:59 PM11/27/23

to

On 11/26/2023 10:14 PM, Chris M. Thomasson wrote:

> Fwiw, here is a quick and dirty mock up of my atomic exchange based lifo
> work queue thing. Not sure how to classify it quite yet. I am going to
> use it as a work "queue" for my parts of my CPU based rendering engine,
> no need for it wrt my shader work. However, I needed to do it. It's
> passing Relacy tests. That's a good thing. Now I need to add in a slow
> path so it can wait in the kernel during periods of no work. I think an
> eventcount should work for it. Or perhaps I might integrate some state
> in the pointer values for futexes, or whatever. I have not made up my
> mind yet.
>
> I will port my Relacy test units over to standard C++. It's not all that
> hard. In fact, it's rather straightforward.
>
> Speaking of pointer values, take note of CT_PENDING.
>
> The fun part about this scheme is that it only uses atomic exchange for
> its logic. It also has wait-free loopless push and flush:

[...]

Fwiw, here is just a little exploration into distributing it. I will
make a home for it on GitHub. Check it out, my simple first venture into
distributing the work across this experiment wrt n-threads. Take careful
notice of local work vs. foreign work.

I managed to streamline the memory barriers to the weakest possible. Got
rid of a release membar, and a consume membar. I don't think it can get
any weaker without being incorrect.

___________________________
#include <iostream>
#include <cstddef>
#include <relacy/relacy.hpp>

#define CT_THREAD_N 5
#define CT_WORK_N 3
#define CT_PRODUCER_WORK_N 5
#define CT_CONSUMER_WORK_N 2
#define CT_DISTRIBUTE_WORK_N CT_THREAD_N

// Notes:
// Using spin backoff for waits as of now
// Need to slow-path it with with kernel waits
// Need to refine the per-node backoff

// humm...
#define CT_PENDING ((ct_xchg_lifo_ver_001::work*)(0xDEADBEEF))

static unsigned long g_stats_local_work = 0;
static unsigned long g_stats_foreign_work = 0;

struct ct_xchg_lifo_ver_001
{

// simulate some work...
struct work
{
rl::atomic<work*> m_next;
VAR_T(unsigned long) m_work;

VAR_T(unsigned long) m_tidx;

work(unsigned long tidx)
: m_next(CT_PENDING),
m_work(0),
m_tidx(tidx)

{

}

~work()
{
RL_ASSERT(VAR(m_work) == 1);
}

// no data races on m_work!

void execute_work(unsigned long tidx)
{
VAR(m_work) += 1;

unsigned long local_tidx = VAR(m_tidx);

if (local_tidx != tidx)
{
g_stats_foreign_work += 1;
/*
std::cout << "foriegn work detected "
<< local_tidx
<< " -> "
<< tidx
<< std::endl;
*/
}

else
{
g_stats_local_work += 1;

/*
std::cout << "local work detected "
<< local_tidx
<< " -> "
<< tidx
<< std::endl;
*/

}
}

// this is the pending node logic...
work* get_next()
{

work* next = m_next.load(rl::mo_relaxed, $);

while (next == CT_PENDING)
{
rl::backoff();

next = m_next.load(rl::mo_relaxed, $);
}

return next;
}

static void process(work* cur, unsigned long tidx)
{
// must not be pending!
RL_ASSERT(cur != CT_PENDING);

// process work nodes...

while (cur)
{
// do real work _before_ the pending logic

// this is highly beneficial... Big time!
cur->execute_work(tidx);

// get the next work node.
cur = cur->get_next();
}
}

// dump worked on nodes...

static void destroy(work* cur, unsigned long tidx)
{

// no pending work shall be destroyed!

RL_ASSERT(cur != CT_PENDING);

while (cur)
{
work* next = cur->m_next.load(rl::mo_relaxed, $);

// no pending work shall be destroyed!
RL_ASSERT(next != CT_PENDING);

delete cur;
cur = next;
}
}
};

rl::atomic<work*> m_head;

ct_xchg_lifo_ver_001()
: m_head(nullptr)
{

}

~ct_xchg_lifo_ver_001()
{
RL_ASSERT(! m_head.load(rl::mo_relaxed, $));
}

void push(work* w)
{
// this completes the pending logic...
w->m_next.store(CT_PENDING, rl::mo_relaxed, $);
work* head = m_head.exchange(w, rl::mo_release, $);

w->m_next.store(head, rl::mo_relaxed, $);

}

work* flush()
{
return m_head.exchange(nullptr, rl::mo_acquire, $);
}

};

// A test of a simple way to distribute work
// Using the ct_xchg_lifo_ver_001 API
template<typename T, std::size_t T_N>
struct ct_distribute_ver_001
{
T m_table[T_N];
typedef typename T::work work;

void push(work* w, unsigned int tidx)
{
T& t = m_table[tidx % T_N];
t.push(w);
}

work* pop(unsigned int tidx)
{
for (std::size_t i = 0; i < T_N; ++i)
{
T& t = m_table[(i + tidx) % T_N];
work* w = t.flush();
if (w) return w;
}

return nullptr;
}
};

// Relacy Test Unit...

struct ct_relacy_test_fun
: rl::test_suite<ct_relacy_test_fun, CT_THREAD_N>
{

typedef ct_distribute_ver_001<
ct_xchg_lifo_ver_001,
CT_DISTRIBUTE_WORK_N
> distribute;

distribute m_work;

void thread(unsigned int tidx)
{
for (unsigned long wi = 0; wi < CT_WORK_N; ++wi)
{
for (unsigned long i = 0; i < CT_PRODUCER_WORK_N; ++i)
{
distribute::work* w = new distribute::work(tidx);
m_work.push(w, tidx);
}

for (unsigned long i = 0; i < CT_CONSUMER_WORK_N; ++i)
{
distribute::work* w = m_work.pop(tidx);
distribute::work::process(w, tidx);
distribute::work::destroy(w, tidx);
}
}
}
};

int
main()
{
std::cout << "Exchange LIFO Container Experiment ver:0.0.2\n";
std::cout << "Relacy Unit Test ver:0.0.2\n";

std::cout << "by: Chris M. Thomasson\n";
std::cout << "__________________________________\n" << std::endl;

{
rl::test_params p;

p.iteration_count = 1000000;

//p.execution_depth_limit = 33333;
//p.search_type = rl::sched_bound;
//p.search_type = rl::fair_full_search_scheduler_type;
//p.search_type = rl::fair_context_bound_scheduler_type;

std::cout << "Executing Relacy Unit Test...\n";
std::cout << "__________________________________" << std::endl;

rl::simulate<ct_relacy_test_fun>(p);

std::cout << "\nwork load "
<< g_stats_local_work
<< ", "
<< g_stats_foreign_work
<< std::endl;
}

return 0;
}
___________________________

Chris M. Thomasson

unread,

Nov 27, 2023, 9:34:33 PM11/27/23

to

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fwiw, this is an out of bound stat that I am using. If I were keeping
stats in a real C++ impl, I would use split counters.
[...]

Chris M. Thomasson

unread,

Nov 29, 2023, 6:30:00 PM11/29/23

to

Actually, I should refer to the stats as out-of-band state that is
outside of the Relacy unit test.

Chris M. Thomasson

unread,

Nov 30, 2023, 11:31:46 PM11/30/23

to

On 11/27/2023 6:32 PM, Chris M. Thomasson wrote:
> On 11/26/2023 10:14 PM, Chris M. Thomasson wrote:
>> Fwiw, here is a quick and dirty mock up of my atomic exchange based

>> lifo work queue thing. Not sure how to classify it quite yet. [...]

>         static void process(work* cur, unsigned long tidx)
>         {
>             // must not be pending!
>             RL_ASSERT(cur != CT_PENDING);
>
>             // process work nodes...
>
>             while (cur)
>             {
>                 // do real work _before_ the pending logic
>                 // this is highly beneficial... Big time!
>                 cur->execute_work(tidx);
>
>                 // get the next work node.
>                 cur = cur->get_next();
>             }
>         }

[...]
There are several ways to allow the next work to be passed to another
thread if the real work is compute intensive, so to speak.

Chris M. Thomasson

unread,

Dec 2, 2023, 4:26:26 PM12/2/23

to

Another fun aspect of this is that it can be distributed in many
different ways. Keep in mind that it only using atomic exchange...

Chris M. Thomasson

unread,

Dec 2, 2023, 4:34:53 PM12/2/23

to

On 11/26/2023 10:14 PM, Chris M. Thomasson wrote:

[...]

Another fun improvement is to allow push to insert a head and tail such
that it can insert many nodes at once. Still no CAS needed for this.
Also, flush can install a new list if it wants to. No CAS needed. ;^)

A goal of mine was to create a work system using atomic exchange only.

Chris M. Thomasson

unread,

Dec 2, 2023, 10:09:48 PM12/2/23

to

On 12/2/2023 1:34 PM, Chris M. Thomasson wrote:
> On 11/26/2023 10:14 PM, Chris M. Thomasson wrote:

[...]

I need to model a version of some real work in relacy, just for fun.
Imvvvho, an interesting question is how to gain an equipotential in 3d
space when there are an infinite number of them... One of my thoughts is
to gain three points in a normal field line, then use the surface normal
of the triangle for an equipotential. Seems like it works okay...

Chris M. Thomasson

unread,

Dec 2, 2023, 10:11:06 PM12/2/23

to

So, real work on a fine grain level wrt that type of work can consist of
three points and an epsilon. The surface normal would be returned. This
is fairly fine grain here...

Chris M. Thomasson

unread,

Dec 2, 2023, 10:12:06 PM12/2/23

to

Speaking of granularity, what bout the logic that derives the three
points via field line plotting? They can be in the same work node, so to
speak.

Chris M. Thomasson

unread,

Dec 3, 2023, 1:18:30 AM12/3/23

to

A thread checks the work system, gains some work and starts executing.
Humm... A real work item can tell it to compute n points of a field line
in a vector field. The work item might look like, just sketching this out:

<pseudo-code>
____________________
struct work_field_line_plot
{
// a dedicated portion of a canvas for
// this work item to plot on...
ct_canvas_section& m_canvas;

// say this work does not alter the
// vector field... so, its const
ct_vector_field const& m_field;

// line plotting data...
glm::vec4 m_origin;
glm::vec4 m_origin_radii;
float m_origin_angle;
float m_epsilon;
unsigned long m_step_n;

// get to work!
void execute()
{
// plot field lines on m_canvas...

float normal_step = 1.f / m_step_n;

// [...] Plotting logic...
}
};
____________________

Something along those lines. A work item has a constant reference to the
field and a non-constant reference to its portion of a canvas. Now,
should I give each work item its own independent canvas? I have did that
before and its nice because each work item can have its own rendering.
Combining all of them together makes the final image...

vs, each work item having its own place to plot in a single, shared
canvas? Humm... :^)

Chris M. Thomasson

unread,

Dec 3, 2023, 1:24:56 AM12/3/23

to

Afaict, it would be a very strange experiment to allow for the field
lines to be plotted using a carefully constructed race condition in the
sense that its 100% legit in C++, but, the field lines would become
corrupted in "interesting" ways, no longer isolated from one another...
The work items can race with each other while building their lines...
Something along the lines of:

https://groups.google.com/g/comp.lang.c++/c/7u_rLgQe86k/m/fYU9SnuAFQAJ

Just thinking out loud here... Sorry for flooding. ;^o

Chris M. Thomasson

unread,

Jan 15, 2024, 4:39:02 PM1/15/24

to

On 11/26/2023 10:14 PM, Chris M. Thomasson wrote:

> Fwiw, here is a quick and dirty mock up of my atomic exchange based lifo
> work queue thing. Not sure how to classify it quite yet. I am going to
> use it as a work "queue" for my parts of my CPU based rendering engine,
> no need for it wrt my shader work. However, I needed to do it. It's
> passing Relacy tests. That's a good thing. Now I need to add in a slow
> path so it can wait in the kernel during periods of no work. I think an
> eventcount should work for it. Or perhaps I might integrate some state
> in the pointer values for futexes, or whatever. I have not made up my
> mind yet.

[...]

Got some nice test targets for the experimental xchg based "queue" of
mine to render:

https://i.ibb.co/p2xgRh2/image.png

https://i.ibb.co/sCpZ70V/image.png

Chris M. Thomasson

unread,

Jan 27, 2024, 12:55:44 AM1/27/24

to

On 11/26/2023 10:14 PM, Chris M. Thomasson wrote:

> Fwiw, here is a quick and dirty mock up of my atomic exchange based lifo
> work queue thing. Not sure how to classify it quite yet. I am going to

[...]

Humm, should have some free time to do this... Might as well use it for
n-ary fields. 3d here:

https://i.ibb.co/Qd8QtGz/image.png

Chris M. Thomasson

unread,

Feb 1, 2024, 4:16:41 PM2/1/24

to

On 11/26/2023 10:14 PM, Chris M. Thomasson wrote:

> Fwiw, here is a quick and dirty mock up of my atomic exchange based lifo
> work queue thing. Not sure how to classify it quite yet. I am going to
> use it as a work "queue" for my parts of my CPU based rendering engine,
> no need for it wrt my shader work. However, I needed to do it. It's
> passing Relacy tests. That's a good thing. Now I need to add in a slow
> path so it can wait in the kernel during periods of no work. I think an
> eventcount should work for it. Or perhaps I might integrate some state
> in the pointer values for futexes, or whatever. I have not made up my
> mind yet.
>
> I will port my Relacy test units over to standard C++. It's not all that
> hard. In fact, it's rather straightforward.

[...]

Humm... Might be able to use it to quickly generate very complex models.
Reduce the loading screen time... This one is fairly complex and I can
get 60fps...

https://i.ibb.co/CBJqT8h/image.png

Chris M. Thomasson

unread,

Feb 1, 2024, 4:17:59 PM2/1/24

to

https://i.ibb.co/02MskNt/image.png