
using a special pointer value...


Chris M. Thomasson
Jun 28, 2020, 8:27:43 PM
The original message for this can be found here:

https://groups.google.com/forum/#!original/comp.arch/k1Qr520dcDk/4vl_jiVRBQAJ

within the following thread:

https://groups.google.com/d/topic/comp.arch/k1Qr520dcDk/discussion


I am wondering how bad this hack can really be in C++. Take careful note
of the SPECIAL macro... Here is the original text:


Fwiw, let's see if I can create some pseudo-code on the fly right here of
a _highly_ experimental atomic "mostly" lock-free stack I did a while
back... Pardon any typos... So, this stack should be "different" in that
it uses a loopless atomic exchange operation, well on the x86 anyway,
for the push and pop, or flush to be more accurate. LOCK XCHG, or rather
just XCHG: LOCK is implied for XCHG with a memory operand on x86. Atomic
exchange... Here we go!

Let's create a special address that comes in handy later...

Membars aside for a moment:

// hackish!
#define SPECIAL ((node*)0xDEADBEEF)


Let's create a node:

struct node
{
    node* next;
    // user defined payload
};


Okay. Let's create a stack:

struct stack
{
    node* head; // initialize with a nullptr
};


Fine. Let's create a function that adds a new node to this stack:

void stack_push(stack* self, node* n)
{
    n->next = SPECIAL;                     // mark as not-yet-linked
    node* previous = XCHG(&self->head, n); // atomically swap in the new head
    n->next = previous;                    // publish the real link
}


Okay! That's it for push. Nice and simple, basically wait-free, and it
gets LIFO order: okay for a stack. Now, let's allow one to remove all
items from this stack at once:


node* stack_flush(stack* self)
{
    return XCHG(&self->head, nullptr);
}


Perfect! A thread can grab all of the nodes at once. Pretty fast, and
wait-free. Now... What can we do with the list of nodes returned from
stack_flush, assuming it's not nullptr? Well... We can process it! Let's
define something called user_process_node. It's a function that takes a
pointer to a node as input, and does whatever a user wants with its user
defined payload. I do not even need to define it, just declare its
interface. So, user_process_node(node*) is it.
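
For concreteness, a minimal sketch of that interface (the body is just an
illustrative stand-in, not part of the design):

// user supplied; does whatever the user wants with the node's payload
void user_process_node(node* n)
{
    // e.g., read or consume n's user defined payload here
}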

Now, let's create a "driver" that eventually calls into
user_process_node, call it node_process. It takes a list of nodes and
processes each one of them... So, something like...

Humm, well, before I show the driver, perhaps I should show a very
important function first, call it node_next_wait. Perhaps the most
important one. It's going to turn people off for a moment, but its cost
can be sort of "amortized" across the user processing, in a sense.

node* node_next_wait(node* n)
{
    node* next = nullptr;

    // spin until the pusher replaces SPECIAL with the real link
    while ((next = n->next) == SPECIAL) backoff();

    return next;
}

Well, there ya go. This is blocking after all; however, it's still
interesting to me. It returns the next pointer. Okay, let's define the
driver function node_process:


void node_process(node* head)
{
    node* n = head;

    while (n)
    {
        // user processing, _before_ we wait...
        // should soak up some time...
        user_process_node(n);

        // wait for the next work
        n = node_next_wait(n);
    }
}



Okay, this can process a list of nodes using wait-free atomics for the
stack, plus some blocking, spinning on nodes; however, the user
processing should act as a sort of backoff anyway, minimizing actual
spins that do nothing.


Think of something like:

node* work = stack_flush(work_stack);

node_process(work);


Well, if work is not nullptr, then actual user work is going to be
performed.
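
Fwiw, here is a minimal runnable C++ rendering of the whole pattern, with
XCHG spelled as std::atomic exchange. The int payload, the memory
orderings, and the single-threaded driver are illustrative assumptions on
my part, and the SPECIAL hack is kept as-is, since its safety is exactly
what is in question:

#include <atomic>
#include <cstdio>
#include <thread>

// next must itself be atomic here: the popper reads it while the
// pusher writes it.
struct node
{
    std::atomic<node*> next;
    int value; // stand-in for the user defined payload
};

// hackish!
#define SPECIAL (reinterpret_cast<node*>(0xDEADBEEF))

struct stack
{
    std::atomic<node*> head{nullptr};
};

void stack_push(stack* self, node* n)
{
    n->next.store(SPECIAL, std::memory_order_relaxed);
    node* previous = self->head.exchange(n, std::memory_order_release);
    n->next.store(previous, std::memory_order_relaxed);
}

node* stack_flush(stack* self)
{
    return self->head.exchange(nullptr, std::memory_order_acquire);
}

node* node_next_wait(node* n)
{
    node* next = nullptr;
    while ((next = n->next.load(std::memory_order_relaxed)) == SPECIAL)
        std::this_thread::yield(); // backoff()
    return next;
}

int main()
{
    stack work_stack;

    node nodes[3] = {};
    for (int i = 0; i < 3; ++i)
    {
        nodes[i].value = i;
        stack_push(&work_stack, &nodes[i]);
    }

    // pops in LIFO order: 2, 1, 0
    for (node* n = stack_flush(&work_stack); n; n = node_next_wait(n))
        std::printf("work item %d\n", n->value); // user_process_node(n)

    return 0;
}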

[...]


I have this coded up in C++ somewhere. I need to find it.

James Kuyper
Jun 28, 2020, 9:09:24 PM
On 6/28/20 8:27 PM, Chris M. Thomasson wrote:
> The original message for this can be found here:
>
> https://groups.google.com/forum/#!original/comp.arch/k1Qr520dcDk/4vl_jiVRBQAJ
>
> within the following thread:
>
> https://groups.google.com/d/topic/comp.arch/k1Qr520dcDk/discussion
>
>
> I am wondering how bad this hack can really be in C++. Take careful note
> of the SPECIAL macro... Here is the original text:
>
>
> Fwiw, let's see if I can create some pseudo-code on the fly right here of
> a _highly_ experimental atomic "mostly" lock-free stack I did a while
> back... Pardon any typos... So, this stack should be "different" in that
> it uses a loopless atomic exchange operation, well on the x86 anyway,
> for the push and pop, or flush to be more accurate. LOCK XCHG, or rather
> just XCHG: LOCK is implied for XCHG with a memory operand on x86. Atomic
> exchange... Here we go!
>
> Let's create a special address that comes in handy later...
>
> Membars aside for a moment:
>
> // hackish!
> #define SPECIAL ((node*)0xDEADBEEF)

Your code does explicit type conversion using cast notation, which can
perform a wide variety of different kinds of conversions (8.4p4). This
particular conversion is covered by reinterpret_cast<>, and it would be
better if you explicitly used reinterpret_cast<node*>, which does a
better job than (node*) of drawing attention to the fact that what
you're doing is rather dangerous. reinterpret_cast<> is no less
dangerous than (node*), but Stroustrup invented the named casts in part
so it would be easier to notice and search for them.

"A value of integral type or enumeration type can be explicitly
converted to a pointer. A pointer converted to an integer of sufficient
size (if any such exists on the implementation) and back to the same
pointer type will have its original value; mappings between pointers and
integers are otherwise implementation-defined.
[ Note: Except as described in 6.7.4.3, the result of such a conversion
will not be a safely-derived pointer value. — end note ]" (8.2.10p5).

The implementation-defined value of SPECIAL might, in particular,
coincidentally point at an actual node object that contains data you
don't want ignored.

0xDEADBEEF doesn't match any of the options listed in 6.7.4.3p4.

A MUCH safer approach is to define a special node object, and to define
SPECIAL as a pointer to that object.

Chris M. Thomasson
Jun 28, 2020, 9:31:40 PM
If SPECIAL points at an actual node, the whole thing would bite the damn
dust: Big time! Ouch... ;^o


> 0xDEADBEAF doesn't match any of the options listed in 6.7.4.3p4.
>
> A MUCH safer approach is to define a special node object, and to define
> SPECIAL as a pointer to that object.

Agreed. Perhaps something simple like:
_______________________
#include <iostream>


struct node
{
    node* next;
};


// non-const, so that SPECIAL can be stored into a plain node*
static node g_special_wait_node = { nullptr };


#define SPECIAL (&g_special_wait_node)


int main()
{
    std::cout << SPECIAL << "\n";

    return 0;
}
_______________________

Humm...


Juha Nieminen
Jun 29, 2020, 2:19:16 AM
Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
> struct node
> {
> node* next;
> // user defined payload
> };

The idea of using a linked list for anything where maximum efficiency is
expected bothers me. Linked lists are quite inefficient in C++ (and C)
in particular, because "new" is inefficient, and allocating individual
nodes causes memory fragmentation and poor cache locality.

> node* previous = XCHG(&self->head, n);

I don't really know why you are using a hypothetical "XCHG" function
when the standard already defines one: std::atomic_exchange().
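
For reference, a quick sketch of how a hypothetical XCHG maps onto the
standard facility (the surrounding names here are illustrative):

#include <atomic>

struct node { node* next; };
static std::atomic<node*> g_head{nullptr};

void push_example(node* n)
{
    // free-function form; sequentially consistent by default
    node* previous = std::atomic_exchange(&g_head, n);

    // equivalent member form, with an explicit memory order:
    // node* previous = g_head.exchange(n, std::memory_order_release);

    n->next = previous; // (the SPECIAL dance from the original omitted)
}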

Chris M. Thomasson
Jun 29, 2020, 3:30:06 AM
On 6/28/2020 11:19 PM, Juha Nieminen wrote:
> Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
>> struct node
>> {
>> node* next;
>> // user defined payload
>> };
>
> The idea of using a linked list for anything where maximum efficiency is
> expected bothers me. Linked lists are quite inefficient in C++ (and C)
> in particular, because "new" is inefficient, and allocating individual
> nodes causes memory fragmentation and poor cache locality.

It can be a hybrid where each node contains an array.

Wrt a memory allocator for this, I can think of several "fairly
efficient" schemes. Fragmentation aside for a moment. Multiple LIFO free
lists, even per-thread ones, can be used and primed with nodes. So,
ideally, the calls to new would be minimized. When we finally have to
call into new because all of the free lists are empty, we can allocate
large blocks of nodes in one shot, not just a single one.
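
A rough sketch of that refill path, assuming a plain thread-local LIFO of
the pseudo-code's node type (the batch size and names are made up):

#include <cstddef>

// assumes the plain 'struct node { node* next; ... }' from earlier
thread_local node* t_free_list = nullptr;

node* node_alloc()
{
    if (!t_free_list)
    {
        // all free lists empty: one call to new yields a whole batch
        const std::size_t batch = 256; // illustrative
        node* block = new node[batch];
        for (std::size_t i = 0; i < batch; ++i)
        {
            block[i].next = t_free_list;
            t_free_list = &block[i];
        }
        // (this sketch never returns blocks to the system)
    }

    node* n = t_free_list;
    t_free_list = n->next;
    return n;
}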



>
>> node* previous = XCHG(&self->head, n);
>
> I don't really know why you are using a hypothetical "XCHG" function
> when the standard already defines one: std::atomic_exchange().
>

It was pseudo-code. std::atomic_exchange() works like a charm.

Chris M. Thomasson
Jun 29, 2020, 4:19:13 AM
On 6/28/2020 5:27 PM, Chris M. Thomasson wrote:
> The original message for this can be found here:
>
> https://groups.google.com/forum/#!original/comp.arch/k1Qr520dcDk/4vl_jiVRBQAJ
>
>
> within the following thread:
>
> https://groups.google.com/d/topic/comp.arch/k1Qr520dcDk/discussion
>
[...]

Here is a _highly_ _crude_ little implementation. It should compile
right up.
_____________________________
#include <iostream>
#include <atomic>
#include <thread>


struct user_payload
{
    int m_foo;

    user_payload(int foo) : m_foo(foo) {}

    user_payload(const user_payload& rhs) : m_foo(rhs.m_foo) {}

    void process()
    {
        std::cout << this << "->user_payload::process() " << m_foo << "\n";
        std::this_thread::yield();
    }
};


namespace ct
{

static struct node* node_wait_next(node*);

struct node
{
    std::atomic<node*> m_next;
    user_payload m_payload;


    node(user_payload const& payload) : m_next(nullptr),
        m_payload(payload)
    {
        std::cout << this << "->node::node()" << "\n";
    }

    ~node()
    {
        std::cout << this << "->node::~node()" << "\n";
    }


    static void process(node* head)
    {
        node* cur = head;

        while (cur)
        {
            // user processing first!
            cur->m_payload.process();

            // now, we see if we need to wait...
            node* next = node_wait_next(cur);

            delete cur;

            cur = next;
        }
    }
};


static node g_special_wait_node(0);


#define CT_SPECIAL (&ct::g_special_wait_node)


node* node_wait_next(node* n)
{
    node* next = nullptr;
    while ((next = n->m_next.load(std::memory_order_relaxed))
           == CT_SPECIAL) std::this_thread::yield();
    return next;
}

struct stack
{
    std::atomic<node*> m_head;

    stack() : m_head(nullptr) {}

    void push(node* n)
    {
        n->m_next.store(CT_SPECIAL, std::memory_order_relaxed);
        node* prev = m_head.exchange(n, std::memory_order_release);
        n->m_next.store(prev, std::memory_order_relaxed);
    }


    node* flush()
    {
        return m_head.exchange(nullptr, std::memory_order_acquire);
    }
};
}


int main()
{
    std::cout << "\nCT_SPECIAL = " << CT_SPECIAL << "\n\n";

    {
        ct::stack alist;

        alist.push(new ct::node(789));
        alist.push(new ct::node(456));
        alist.push(new ct::node(123));

        ct::node* work = alist.flush();

        ct::node::process(work);


        alist.push(new ct::node(321));
        alist.push(new ct::node(654));
        alist.push(new ct::node(987));

        work = alist.flush();

        ct::node::process(work);
    }

    return 0;
}
_____________________________


Can you get it to work? Fwiw, I get the following output:
_____________________________
0x6021c0->node::node()

CT_SPECIAL = 0x6021c0

0x17a0280->node::node()
0x17a02a0->node::node()
0x17a02c0->node::node()
0x17a02c8->user_payload::process() 123
0x17a02c0->node::~node()
0x17a02a8->user_payload::process() 456
0x17a02a0->node::~node()
0x17a0288->user_payload::process() 789
0x17a0280->node::~node()
0x17a0280->node::node()
0x17a02a0->node::node()
0x17a02c0->node::node()
0x17a02c8->user_payload::process() 987
0x17a02c0->node::~node()
0x17a02a8->user_payload::process() 654
0x17a02a0->node::~node()
0x17a0288->user_payload::process() 321
0x17a0280->node::~node()
0x6021c0->node::~node()
_____________________________

Juha Nieminen
Jun 29, 2020, 5:46:00 AM
Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
> Wrt to a memory allocator for this, I can think of several "fairly
> efficient" schemes. Fragmentation aside for a moment. Multiple lifo
> lists, even per-thread can be used and primed with nodes. So, ideally
> the calls to new would be minimized. When we finally have to call into
> new because all of the free lists are empty, we can allocate large
> blocks of nodes in one shot, not just a single one.

The problem with using a custom allocator is that now you also have to
make the allocator itself thread-safe. (The C library allocator, which
new, malloc() etc. use, is itself thread-safe, by necessity, which is
one of the many reasons why it's a bit slow.)

You mention that the nodes could be thread-local, in which case you
don't need thread-safety measures. Of course the problem now becomes
that the nodes cannot be shared between threads. But I suppose in some
situations that might be ok.

Chris M. Thomasson
Jun 29, 2020, 1:19:47 PM
Nodes can be shared between threads even though the free lists are
thread-local. If a thread frees a node that it did not create, it just
pushes the node onto the free list of the thread that allocated it. Each
thread basically has two free lists: one synchronization-free local
list, and another lock-free one used for remote frees.

This is efficient. Basically, a free operation consists of rounding the
pointer value down to a large block boundary. This gets us at the block
header. Then the freeing thread compares its id to the id in that
header. If they are the same, it can just add the node to its sync-free
list. If they differ, it uses an atomic RMW to add it to the remote
owner thread's list.

Something like, pseudo-code:

void node_free(node* n)
{
    header* h = round_down(n);

    if (h->id == this_thread->id)
    {
        // fast
        h->local.push(n);
    }
    else
    {
        // uses lock-free list, slower than local free
        h->remote.push(n);
    }
}
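
For illustration, round_down can be a simple mask if every block of nodes
(header included) is allocated on a fixed power-of-two boundary; the
original elides those details, so this is an assumption:

#include <cstdint>

// assumed: every block starts at a BLOCK_SIZE-aligned address,
// e.g. via an aligned allocation; the size itself is illustrative
static const std::uintptr_t BLOCK_SIZE = 64 * 1024;

header* round_down(node* n)
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(n);
    return reinterpret_cast<header*>(p & ~(BLOCK_SIZE - 1));
}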



See?

Chris M. Thomasson
Jun 29, 2020, 1:22:09 PM
If a thread tries to allocate from its sync-free list and it's empty...
It transfers all of the nodes from its remote list to the sync-free
list, and tries again. If both are empty, it allocates another large
batch of nodes.
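
In the same pseudo-code style, the allocation path might look like this
(the helper names are made up):

node* node_alloc(thread* self)
{
    // fast path: the sync-free local list
    node* n = self->local.pop();
    if (n) return n;

    // local list empty: grab everything remote threads freed back to us
    node* remote = self->remote.flush(); // atomic swap with nullptr
    if (remote)
    {
        self->local.push_all(remote);
        return self->local.pop();
    }

    // everything empty: allocate another large batch of nodes
    return self->local.prime_with_new_batch();
}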

Chris M. Thomasson
Jun 29, 2020, 3:17:53 PM
On 6/29/2020 10:19 AM, Chris M. Thomasson wrote:
[...]


A fun aspect is that the node link can be part of the user payload
itself, think of a union. The only caveat is that each node is at least
the size of a pointer. So, if the user allocates a single char, well,
the node is still the size of a pointer: wasted space. The only time
this pointer is used is when the node is in a deallocated state, or on
the remote free list of its owner thread. Therefore, if the user
allocates something that happens to be the size of a pointer or greater,
well, then there is no wasted space.
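
Concretely, the union trick might look like this (a sketch; the names
are mine, and the size comments assume a typical 64-bit platform):

#include <cstddef>

template<std::size_t N>
union slot
{
    slot* next;               // only meaningful while the slot is free
    unsigned char payload[N]; // user data lives here once allocated
};

// sizeof(slot<1>) is still sizeof(void*): a 1-byte payload wastes space.
// For N >= sizeof(void*), the embedded link costs nothing extra.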

> See?

Chris M. Thomasson
Jun 30, 2020, 6:34:25 PM
On 6/29/2020 1:19 AM, Chris M. Thomasson wrote:
> On 6/28/2020 5:27 PM, Chris M. Thomasson wrote:
>> The original message for this can be found here:
>>
>> https://groups.google.com/forum/#!original/comp.arch/k1Qr520dcDk/4vl_jiVRBQAJ
>>
>>
>> within the following thread:
[...]
>     struct stack
>     {
>         std::atomic<node*> m_head;
>
>         stack() : m_head(nullptr) {}
>
>         void push(node* n)
>         {
>             n->m_next.store(CT_SPECIAL, std::memory_order_relaxed);
>             node* prev = m_head.exchange(n, std::memory_order_release);
>             n->m_next.store(prev, std::memory_order_relaxed);
>         }
>
>
>         node* flush()
>         {
>             return  m_head.exchange(nullptr, std::memory_order_acquire);
>         }
>     };
[...]

Fwiw, there is a way to get rid of the acquire barrier in the
stack::flush function. The only difference is that some of the relaxed
loads would become memory_order_consume. This is fine on most
architectures because consume is a no-op wrt memory barriers, except on
DEC Alpha.
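
A sketch of that variant (keeping in mind that current compilers promote
std::memory_order_consume to acquire anyway, so this is mostly of
theoretical interest):

node* flush()
{
    // consume instead of acquire: every later read is data-dependent on
    // the returned pointer (head -> node -> m_next -> ...), which is the
    // dependency chain consume is meant to order. Free on most
    // architectures; DEC Alpha still needs a real barrier.
    return m_head.exchange(nullptr, std::memory_order_consume);
}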