
Poor Man's RCU...


Chris M. Thomasson

Feb 7, 2021, 11:20:55 PM
Well, here is a test program. When you get some free time, can you
please try to run it; give it a go?

This is a crude implementation, but it shows how reader threads can
iterate a lockfree stack while writer threads are mutating it. 100%
lockfree. Before I post my output, I want to see yours. The output you
get should be identical to my output. Will be adding a lot more context
here sometime tomorrow. Looking forward to your input, and showing me
your output. If you find any issues, please, I am all ears, or eyes wrt
reading the posts... ;^)

Here is the code, it should compile right up with C++11. My next version
will use C++17 and alignas.

https://pastebin.com/raw/nPVYXbWM
_____________________________________
// Chris M. Thomassons Poor Mans RCU... Example 123...


#include <iostream>
#include <atomic>
#include <thread>
#include <cstdlib>
#include <cstdint>
#include <climits>
#include <functional>


// Masks
static constexpr std::uint32_t ct_ref_mask = 0xFFFFFFF0U;
static constexpr std::uint32_t ct_ref_complete = 0x30U;
static constexpr std::uint32_t ct_ref_inc = 0x20U;
static constexpr std::uint32_t ct_proxy_mask = 0xFU;
static constexpr std::uint32_t ct_proxy_quiescent = 0x10U;


// Iteration settings
static constexpr unsigned long ct_reader_iters_n = 1000000;
static constexpr unsigned long ct_writer_iters_n = 100000;


// Thread counts
static constexpr unsigned long ct_reader_threads_n = 42;
static constexpr unsigned long ct_writer_threads_n = 7;


// Some debug/sanity check things...
// Need to make this conditional in compilation with some macros...
static std::atomic<std::uint32_t> g_debug_node_allocations(0);
static std::atomic<std::uint32_t> g_debug_node_deallocations(0);


// Need to align and pad data structures! To do...

struct ct_node
{
std::atomic<ct_node*> m_next;
ct_node* m_defer_next;

ct_node() : m_next(nullptr), m_defer_next(nullptr)
{
g_debug_node_allocations.fetch_add(1, std::memory_order_relaxed);
}

~ct_node()
{
g_debug_node_deallocations.fetch_add(1, std::memory_order_relaxed);
}
};


// The proxy collector itself... :^)
template<std::size_t T_defer_limit>
class ct_proxy
{
static void prv_destroy(ct_node* n)
{
while (n)
{
ct_node* next = n->m_defer_next;
delete n;
n = next;
}
}


public:
class collector
{
friend class ct_proxy;

private:
std::atomic<ct_node*> m_defer;
std::atomic<std::uint32_t> m_defer_count;
std::atomic<std::uint32_t> m_count;



public:
collector()
: m_defer(nullptr),
m_defer_count(0),
m_count(0)
{

}


~collector()
{
prv_destroy(m_defer.load(std::memory_order_relaxed));
}
};


private:
std::atomic<std::uint32_t> m_current;
std::atomic<bool> m_quiesce;
ct_node* m_defer;
collector m_collectors[2];


public:
ct_proxy()
: m_current(0),
m_quiesce(false),
m_defer(nullptr)
{

}

~ct_proxy()
{
prv_destroy(m_defer);
}


private:
void prv_quiesce_begin()
{
// Try to begin the quiescence process.
if (! m_quiesce.exchange(true, std::memory_order_acquire))
{
// advance the current collector and grab the old one.
std::uint32_t old =
m_current.load(std::memory_order_relaxed) & ct_proxy_mask;
old = m_current.exchange((old + 1) & 1,
std::memory_order_acq_rel);
collector& c = m_collectors[old & ct_proxy_mask];

// decode reference count.
std::uint32_t refs = old & ct_ref_mask;

// increment and generate an odd reference count.
std::uint32_t old_refs = c.m_count.fetch_add(refs +
ct_proxy_quiescent, std::memory_order_release);

if (old_refs == refs)
{
// odd reference count and drop-to-zero condition detected!
prv_quiesce_complete(c);
}
}
}


void prv_quiesce_complete(collector& c)
{
// the collector `c' is now in a quiescent state! :^)
std::atomic_thread_fence(std::memory_order_acquire);

// maintain the back link and obtain "fresh" objects from
// this collection.
ct_node* n = m_defer;
m_defer = c.m_defer.load(std::memory_order_relaxed);
c.m_defer.store(0, std::memory_order_relaxed);

// reset the reference count.
c.m_count.store(0, std::memory_order_relaxed);
c.m_defer_count.store(0, std::memory_order_relaxed);

// release the quiesce lock.
m_quiesce.store(false, std::memory_order_release);

// destroy nodes.
prv_destroy(n);
}


public:
collector& acquire()
{
// increment the master count _and_ obtain current collector.
std::uint32_t current =
m_current.fetch_add(ct_ref_inc, std::memory_order_acquire);

// decode the collector index.
return m_collectors[current & ct_proxy_mask];
}

void release(collector& c)
{
// decrement the collector.
std::uint32_t count =
c.m_count.fetch_sub(ct_ref_inc, std::memory_order_release);

// check for the completion of the quiescence process.
if ((count & ct_ref_mask) == ct_ref_complete)
{
// odd reference count and drop-to-zero condition detected!
prv_quiesce_complete(c);
}
}


collector& sync(collector& c)
{
// check if the `c' is in the middle of a quiescence process.
if (c.m_count.load(std::memory_order_relaxed) & ct_proxy_quiescent)
{
// drop `c' and get the next collector.
release(c);

return acquire();
}

return c;
}


void collect()
{
prv_quiesce_begin();
}


void collect(collector& c, ct_node* n)
{
if (! n) return;

// link node into the defer list.
ct_node* prev = c.m_defer.exchange(n, std::memory_order_relaxed);
n->m_defer_next = prev;

// bump the defer count and begin quiescence process if over
// the limit.
std::uint32_t count =
c.m_defer_count.fetch_add(1, std::memory_order_relaxed) + 1;

if (count >= (T_defer_limit / 2))
{
prv_quiesce_begin();
}
}
};



typedef ct_proxy<10> ct_proxy_collector;


// your basic lock-free stack...
// well, minus ABA counter and DWCAS of course! ;^)
class ct_stack
{
std::atomic<ct_node*> m_head;


public:
ct_stack() : m_head(nullptr)
{

}


public:
void push(ct_node* n)
{
ct_node* head = m_head.load(std::memory_order_relaxed);

do
{
n->m_next.store(head, std::memory_order_relaxed);
}
while (! m_head.compare_exchange_weak(
head,
n,
std::memory_order_release));
}


ct_node* flush()
{
return m_head.exchange(nullptr, std::memory_order_acquire);
}


ct_node* get_head()
{
return m_head.load(std::memory_order_acquire);
}


ct_node* pop()
{
ct_node* head = m_head.load(std::memory_order_acquire);
ct_node* xchg;

do
{
if (!head) return nullptr;

xchg = head->m_next.load(std::memory_order_relaxed);
}
while (! m_head.compare_exchange_weak(
head,
xchg,
std::memory_order_acquire));

return head;
}
};


// The shared state
struct ct_shared
{
ct_proxy<10> m_proxy_gc;
ct_stack m_stack;
};



// Reader threads
// Iterates through the lock free stack
void ct_thread_reader(ct_shared& shared)
{
// iterate the lockfree stack
for (unsigned long i = 0; i < ct_reader_iters_n; ++i)
{
ct_proxy_collector::collector& c = shared.m_proxy_gc.acquire();

ct_node* n = shared.m_stack.get_head();

while (n)
{
// need to add in some processing...
//std::this_thread::yield();

n = n->m_next.load(std::memory_order_relaxed);
}

shared.m_proxy_gc.release(c);
}
}



// Writer threads
// Mutates the lock free stack
void ct_thread_writer(ct_shared& shared)
{
for (unsigned long wloop = 0; wloop < 42; ++wloop)
{
for (unsigned long i = 0; i < ct_writer_iters_n; ++i)
{
shared.m_stack.push(new ct_node());
}

std::this_thread::yield();

ct_proxy_collector::collector& c = shared.m_proxy_gc.acquire();

for (unsigned long i = 0; i < ct_writer_iters_n; ++i)
{
shared.m_proxy_gc.collect(c, shared.m_stack.pop());
}

shared.m_proxy_gc.release(c);

std::this_thread::yield();

if ((wloop % 3) == 0)
{
shared.m_proxy_gc.collect();
}
}
}



int main()
{
std::cout << "Chris M. Thomassons Proxy Collector Port ver .0.0.1...\n";
std::cout << "_______________________________________\n\n";


{
ct_shared shared;

std::thread readers[ct_reader_threads_n];
std::thread writers[ct_writer_threads_n];

std::cout << "Booting threads...\n";

for (unsigned long i = 0; i < ct_writer_threads_n; ++i)
{
writers[i] = std::thread(ct_thread_writer, std::ref(shared));
}

for (unsigned long i = 0; i < ct_reader_threads_n; ++i)
{
readers[i] = std::thread(ct_thread_reader, std::ref(shared));
}

std::cout << "Threads running...\n";

for (unsigned long i = 0; i < ct_reader_threads_n; ++i)
{
readers[i].join();
}

for (unsigned long i = 0; i < ct_writer_threads_n; ++i)
{
writers[i].join();
}
}

std::cout << "Threads completed!\n\n";


// Sanity check!
{
std::uint32_t node_allocations =
g_debug_node_allocations.load(std::memory_order_relaxed);
std::uint32_t node_deallocations =
g_debug_node_deallocations.load(std::memory_order_relaxed);

std::cout << "node_allocations = " << node_allocations << "\n";
std::cout << "node_deallocations = " << node_deallocations << "\n";

if (node_allocations != node_deallocations)
{
std::cout << "OH SHIT! NODE LEAK!!! SHIT! = " <<
node_allocations - node_deallocations << "\n\n";
}

}

std::cout << "\n\nTest Completed!\n\n";

return 0;
}
_____________________________________

Well, can you even get it to compile on your end? What do you get?

Thanks everybody.

Manfred

Feb 8, 2021, 7:50:54 AM
On 2/8/2021 5:20 AM, Chris M. Thomasson wrote:
> Well, here is a test program. When you get some free time, can you
> please try to run it; give it a go?
>

$ c++ --version
c++ (GCC) 10.2.1 20201125 (Red Hat 10.2.1-9)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is
NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.


$ c++ -std=c++11 -Wall -O2 -lpthread rcu_chris.cc && ./a.out
Chris M. Thomassons Proxy Collector Port ver .0.0.1...
_______________________________________

Booting threads...
Threads running...
Threads completed!

node_allocations = 29400000
node_deallocations = 29400000


Test Completed!

Paavo Helde

Feb 8, 2021, 2:15:04 PM
08.02.2021 06:20 Chris M. Thomasson kirjutas:
> Well, here is a test program. When you get some free time, can you
> please try to run it; give it a go?
>
> This is a crude implementation, but it shows how reader threads can
> iterate a lockfree stack while writer threads are mutating it. 100%
> lockfree. Before I post my output, I want to see yours. The output you
> get should be identical to my output. Will be adding a lot more context
> here sometime tomorrow. Looking forward to your input, and showing me
> your output. If you find any issues, please, I am all ears, or eyes wrt
> reading the posts... ;^)

VS2017 x64 on Windows 10: no issues (8 physical cores)

Chris M. Thomassons Proxy Collector Port ver .0.0.1...
_______________________________________

Booting threads...
Threads running...
Threads completed!

node_allocations = 29400000
node_deallocations = 29400000


Test Completed!

-------------------------------------------------------------------------

gcc 7.4.0 on Linux: no issues (2 NUMA nodes, 24 physical cores)

Chris M. Thomassons Proxy Collector Port ver .0.0.1...

Chris M. Thomasson

Feb 8, 2021, 10:20:42 PM
Perfect output. That is the exact number I am looking for. Thank you so
much for taking the time to give it a go, Manfred. Now, it's time to move
this over to github. I am writing my response to another kind person,
Paavo, who also took the time and energy to compile _and_ run the sucker.

I am currently writing up some other use cases for my poor man's RCU.

Chris M. Thomasson

Feb 8, 2021, 10:35:58 PM
Wonderful! First off, thanks Paavo and Manfred. It's nice to see my code
being run on a diverse set of architectures and OSes. Okay, your output is
perfect: you got the numbers I am looking for. Time for me to move it to
github, and create other use cases.

Actually, this version 0.0.1 is a pretty hardcore test. Usually RCU does
not like heavy writer activity. Hence the reader-to-writer thread ratio
of 42:7. However, those seven writer threads are assaulting the poor
man's proxy collector. I need to create a benchmark test wrt a read/write
mutex vs. my proxy gc. Since a rwmutex is the perfect thing to test
against, well, I am currently coding one up. Wrt this type of benchmark,
we are going to measure the number of reads-per-second, per-thread, with
my poor man's RCU vs. a rwmutex. Since writers do not block readers, and
vice versa, and from past experience, well... The rwmutex does not really
stand a chance.

So, I will post a link to my new entry on github for my code. Using
pastebin is not ideal.

Thanks again for taking the time to run my code Paavo. I really do
appreciate it.

Chris M. Thomasson

Feb 9, 2021, 2:05:28 AM
On 2/7/2021 8:20 PM, Chris M. Thomasson wrote:
> Well, here is a test program. When you get some free time, can you
> please try to run it; give it a go?
>
> This is a crude implementation, but it shows how reader threads can
> iterate a lockfree stack while writer threads are mutating it. 100%
> lockfree. Before I post my output, I want to see yours. The output you
> get should be identical to my output. Will be adding a lot more context
> here sometime tomorrow. Looking forward to your input, and showing me
> your output. If you find any issues, please, I am all ears, or eyes wrt
> reading the posts... ;^)
>
> Here is the code, it should compile right up with C++11. My next version
> will use C++17 and alignas.
>
> https://pastebin.com/raw/nPVYXbWM
> _____________________________________
> // Chris M. Thomassons Poor Mans RCU... Example 123...
[...]
> private:
>     void prv_quiesce_begin()
>     {
>         // Try to begin the quiescence process.
>         if (! m_quiesce.exchange(true, std::memory_order_acquire))
>         {
>             // advance the current collector and grab the old one.
>             std::uint32_t old =
> m_current.load(std::memory_order_relaxed) & ct_proxy_mask;
>             old = m_current.exchange((old + 1) & 1,
> std::memory_order_acq_rel);
>             collector& c = m_collectors[old & ct_proxy_mask];
>
>             // decode reference count.
>             std::uint32_t refs = old & ct_ref_mask;
>
>             // increment and generate an odd reference count.
>             std::uint32_t old_refs = c.m_count.fetch_add(refs +
> ct_proxy_quiescent, std::memory_order_release);
>
>             if (old_refs == refs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There is a rather odd issue with this. It can sometimes miss collection
cycles. Everything works as far as allowing concurrent reads and writes
to the data structure, but it's still a big issue. Correcting it makes
the collector more efficient. It will be corrected in my next version.

Btw, Manfred and Paavo, here is the fix:

if (old_refs == 0 - refs)

It's an odd bug in a sense, because it works either way wrt allowing
lockfree reads and writes to a shared data structure, except the old
condition can hold up some collection cycles. The condition is hard to trip.


>             {
>                 // odd reference count and drop-to-zero condition
> detected!
>                 prv_quiesce_complete(c);
>             }
>         }
>     }
[...]

Chris M. Thomasson

Feb 9, 2021, 2:31:42 AM
On 2/7/2021 8:20 PM, Chris M. Thomasson wrote:
> Well, here is a test program. When you get some free time, can you
> please try to run it; give it a go?

I need to ask for one more favor, especially of Manfred and Paavo. Can
you please run this version 0.0.2? It corrects a reference counting
issue in version 0.0.1. Even though the first version works, it can
sometimes miss a collection cycle, and allow memory to grow when it does
not have to. Here is the new version, and it contains a lot more debug
stuff to show how your system is reacting to the algorithm. I am sorry
for missing that counting issue in version 0.0.1. If you can, please,
when you get some free time, run it again? I need to see your output.

https://pastebin.com/raw/CYZ78gVj

Here is my output from a run:

Chris M. Thomassons Proxy Collector Port ver .0.0.2...
_______________________________________

Booting threads...
Threads running...
Threads completed!

node_allocations = 92400000
node_deallocations = 92400000

dtor_collect = 4
release_collect = 120
quiesce_complete = 124
quiesce_begin = 124
quiesce_complete_nodes = 92200000


Test Completed!

----

quiesce_complete should always equal quiesce_begin, and node_allocations
should always equal node_deallocations. Also, quiesce_complete_nodes
should always be <= node_allocations.


This version performs much better.



Here is the code:
_______________________________
// Chris M. Thomassons Poor Mans RCU... Example 456...


#include <iostream>
#include <atomic>
#include <thread>
#include <cstdlib>
#include <cstdint>
#include <climits>
#include <functional>


// Masks
static constexpr std::uint32_t ct_ref_mask = 0xFFFFFFF0U;
static constexpr std::uint32_t ct_ref_complete = 0x30U;
static constexpr std::uint32_t ct_ref_inc = 0x20U;
static constexpr std::uint32_t ct_proxy_mask = 0xFU;
static constexpr std::uint32_t ct_proxy_quiescent = 0x10U;


// Iteration settings
static constexpr unsigned long ct_reader_iters_n = 2000000;
static constexpr unsigned long ct_writer_iters_n = 200000;


// Thread counts
static constexpr unsigned long ct_reader_threads_n = 53;
static constexpr unsigned long ct_writer_threads_n = 11;


// Some debug/sanity check things...
// Need to make this conditional in compilation with some macros...
static std::atomic<std::uint32_t> g_debug_node_allocations(0);
static std::atomic<std::uint32_t> g_debug_node_deallocations(0);
static std::atomic<std::uint32_t> g_debug_dtor_collect(0);
static std::atomic<std::uint32_t> g_debug_release_collect(0);
static std::atomic<std::uint32_t> g_debug_quiesce_begin(0);
static std::atomic<std::uint32_t> g_debug_quiesce_complete(0);
static std::atomic<std::uint32_t> g_debug_quiesce_complete_nodes(0);

// Need to align and pad data structures! To do...

struct ct_node
{
std::atomic<ct_node*> m_next;
ct_node* m_defer_next;

ct_node() : m_next(nullptr), m_defer_next(nullptr)
{
g_debug_node_allocations.fetch_add(1, std::memory_order_relaxed);
}

~ct_node()
{
g_debug_node_deallocations.fetch_add(1, std::memory_order_relaxed);
}
};


// The proxy collector itself... :^)
template<std::size_t T_defer_limit>
class ct_proxy
{
static std::uint32_t prv_destroy(ct_node* n)
{
std::uint32_t count = 0;

while (n)
{
ct_node* next = n->m_defer_next;
delete n;
count++;
n = next;
}

return count;
}


// [the collector class, data members, and constructor/destructor were
// collapsed in the archive here; presumably the same as version 0.0.1]


private:
void prv_quiesce_begin()
{
// Try to begin the quiescence process.
if (! m_quiesce.exchange(true, std::memory_order_acquire))
{
g_debug_quiesce_begin.fetch_add(1, std::memory_order_relaxed);

// advance the current collector and grab the old one.
std::uint32_t old =
m_current.load(std::memory_order_relaxed) & ct_proxy_mask;
old = m_current.exchange((old + 1) & 1,
std::memory_order_acq_rel);
collector& c = m_collectors[old & ct_proxy_mask];

// decode reference count.
std::uint32_t refs = old & ct_ref_mask;

// increment and generate an odd reference count.
std::uint32_t old_refs = c.m_count.fetch_add(refs +
ct_proxy_quiescent, std::memory_order_release);

if (old_refs == 0 - refs)
{
g_debug_dtor_collect.fetch_add(1,
std::memory_order_relaxed);

// odd reference count and drop-to-zero condition detected!
prv_quiesce_complete(c);
}
}
}


void prv_quiesce_complete(collector& c)
{
g_debug_quiesce_complete.fetch_add(1, std::memory_order_relaxed);

// the collector `c' is now in a quiescent state! :^)
std::atomic_thread_fence(std::memory_order_acquire);

// maintain the back link and obtain "fresh" objects from
// this collection.
ct_node* n = m_defer;
m_defer = c.m_defer.load(std::memory_order_relaxed);
c.m_defer.store(0, std::memory_order_relaxed);

// reset the reference count.
c.m_count.store(0, std::memory_order_relaxed);
c.m_defer_count.store(0, std::memory_order_relaxed);

// release the quiesce lock.
m_quiesce.store(false, std::memory_order_release);

// destroy nodes.
std::uint32_t count = prv_destroy(n);

g_debug_quiesce_complete_nodes.fetch_add(count,
std::memory_order_relaxed);
}


public:
collector& acquire()
{
// increment the master count _and_ obtain current collector.
std::uint32_t current =
m_current.fetch_add(ct_ref_inc, std::memory_order_acquire);

// decode the collector index.
return m_collectors[current & ct_proxy_mask];
}

void release(collector& c)
{
// decrement the collector.
std::uint32_t count =
c.m_count.fetch_sub(ct_ref_inc, std::memory_order_release);

// check for the completion of the quiescence process.
if ((count & ct_ref_mask) == ct_ref_complete)
{
// odd reference count and drop-to-zero condition detected!
g_debug_release_collect.fetch_add(1,
std::memory_order_relaxed);

prv_quiesce_complete(c);
}
}


// [sync(), the collect() members, the closing of ct_proxy, the
// ct_proxy_collector typedef, and the ct_stack class up through pop()
// were collapsed in the archive here; presumably the same as
// version 0.0.1]

ct_node* pop()
{
ct_node* head = m_head.load(std::memory_order_acquire);
ct_node* xchg;

do
{
if (! head) return nullptr;

xchg = head->m_next.load(std::memory_order_relaxed);
}
while (! m_head.compare_exchange_weak(
head,
xchg,
std::memory_order_acquire));

return head;
}
};


// The shared state
struct ct_shared
{
ct_proxy<10> m_proxy_gc;
ct_stack m_stack;
};



// Reader threads
// Iterates through the lock free stack
void ct_thread_reader(ct_shared& shared)
{
// iterate the lockfree stack
for (unsigned long i = 0; i < ct_reader_iters_n; ++i)
{
ct_proxy_collector::collector& c = shared.m_proxy_gc.acquire();

ct_node* n = shared.m_stack.get_head();

while (n)
{
// need to add in some processing...
// std::this_thread::yield();

n = n->m_next.load(std::memory_order_relaxed);
}

shared.m_proxy_gc.release(c);
}
}



// Writer threads
// Mutates the lock free stack
void ct_thread_writer(ct_shared& shared)
{
for (unsigned long wloop = 0; wloop < 42; ++wloop)
{
shared.m_proxy_gc.collect();

for (unsigned long i = 0; i < ct_writer_iters_n; ++i)
{
shared.m_stack.push(new ct_node());
}

//std::this_thread::yield();

ct_proxy_collector::collector& c = shared.m_proxy_gc.acquire();

for (unsigned long i = 0; i < ct_writer_iters_n; ++i)
{
shared.m_proxy_gc.collect(c, shared.m_stack.pop());
}

shared.m_proxy_gc.release(c);

for (unsigned long i = 0; i < ct_writer_iters_n / 2; ++i)
{
shared.m_proxy_gc.collect();
}

{
ct_proxy_collector::collector& c = shared.m_proxy_gc.acquire();

for (unsigned long i = 0; i < ct_writer_iters_n; ++i)
{
ct_node* n = shared.m_stack.pop();
if (! n) break;

shared.m_proxy_gc.collect(c, n);
}

shared.m_proxy_gc.release(c);
}

if ((wloop % 3) == 0)
{
shared.m_proxy_gc.collect();
}
}
}



int main()
{
std::cout << "Chris M. Thomassons Proxy Collector Port ver .0.0.2...\n";
std::cout << "_______________________________________\n\n";

{
ct_shared shared;

std::thread readers[ct_reader_threads_n];
std::thread writers[ct_writer_threads_n];

std::cout << "Booting threads...\n";

for (unsigned long i = 0; i < ct_writer_threads_n; ++i)
{
writers[i] = std::thread(ct_thread_writer, std::ref(shared));
}

for (unsigned long i = 0; i < ct_reader_threads_n; ++i)
{
readers[i] = std::thread(ct_thread_reader, std::ref(shared));
}

std::cout << "Threads running...\n";

for (unsigned long i = 0; i < ct_reader_threads_n; ++i)
{
readers[i].join();
}

for (unsigned long i = 0; i < ct_writer_threads_n; ++i)
{
writers[i].join();
}
}

std::cout << "Threads completed!\n\n";


// Sanity check!
{
std::uint32_t node_allocations =
g_debug_node_allocations.load(std::memory_order_relaxed);
std::uint32_t node_deallocations =
g_debug_node_deallocations.load(std::memory_order_relaxed);
std::uint32_t dtor_collect =
g_debug_dtor_collect.load(std::memory_order_relaxed);
std::uint32_t release_collect =
g_debug_release_collect.load(std::memory_order_relaxed);
std::uint32_t quiesce_complete =
g_debug_quiesce_complete.load(std::memory_order_relaxed);
std::uint32_t quiesce_begin =
g_debug_quiesce_begin.load(std::memory_order_relaxed);
std::uint32_t quiesce_complete_nodes =
g_debug_quiesce_complete_nodes.load(std::memory_order_relaxed);

std::cout << "node_allocations = " << node_allocations << "\n";
std::cout << "node_deallocations = " << node_deallocations <<
"\n\n";
std::cout << "dtor_collect = " << dtor_collect << "\n";
std::cout << "release_collect = " << release_collect << "\n";
std::cout << "quiesce_complete = " << quiesce_complete << "\n";
std::cout << "quiesce_begin = " << quiesce_begin << "\n";
std::cout << "quiesce_complete_nodes = " <<
quiesce_complete_nodes << "\n";

if (node_allocations != node_deallocations)
{
std::cout << "OH SHIT! NODE LEAK!!! SHIT! = " <<
node_allocations - node_deallocations << "\n\n";
}
}

std::cout << "\n\nTest Completed!\n\n";

return 0;
}
_______________________________

Paavo Helde

Feb 9, 2021, 8:03:29 AM
09.02.2021 09:31 Chris M. Thomasson kirjutas:
> On 2/7/2021 8:20 PM, Chris M. Thomasson wrote:
>> Well, here is a test program. When you get some free time, can you
>> please try to run it; give it a go?
>
> I need to ask for one more favor, especially to Manfred and Pavvo. Can
> you please run this version 0.0.2? It corrected a reference counting
> issue in version 0.0.1. Even though the first version works, it can
> sometimes miss a collection cycle, and allow memory to grow when it does
> not have to. Here is the new version and it contains a lot more debug
> stuff to show how your system is reacting to the algorihtm. I am sorry
> for missing that counting issue in version 0.0.1. If you can please,
> when you get free time, run it again? I need to see your output.
>

Win10 MSVC2017, Intel Xeon E-2286M, 8 phys cores

> time ../x64/Release/ConsoleTestVS2017.exe
Chris M. Thomassons Proxy Collector Port ver .0.0.2...
_______________________________________

Booting threads...
Threads running...
Threads completed!

node_allocations = 92400000
node_deallocations = 92400000

dtor_collect = 33
release_collect = 753
quiesce_complete = 786
quiesce_begin = 786
quiesce_complete_nodes = 92400000


Test Completed!


real 0m19.027s
user 0m0.015s
sys 0m0.000s

---------------------------------------------------------------------

gcc 7.4.0 on Linux: no issues (2 NUMA nodes, 24 physical cores)

> g++ -std=c++11 -O2 -Wall main.cpp -lpthread
> time ./a.out
Chris M. Thomassons Proxy Collector Port ver .0.0.2...
_______________________________________

Booting threads...
Threads running...
Threads completed!

node_allocations = 92400000
node_deallocations = 92400000

dtor_collect = 11
release_collect = 152
quiesce_complete = 163
quiesce_begin = 163
quiesce_complete_nodes = 92400000


Test Completed!


real 0m43.760s
user 34m31.243s
sys 0m3.295s

I did not measure the times yesterday, but it feels like the new version
is slower.

Manfred

Feb 9, 2021, 9:02:12 AM
On 2/9/2021 8:31 AM, Chris M. Thomasson wrote:
> On 2/7/2021 8:20 PM, Chris M. Thomasson wrote:
>> Well, here is a test program. When you get some free time, can you
>> please try to run it; give it a go?
>
> I need to ask for one more favor, especially to Manfred and Pavvo. Can
> you please run this version 0.0.2? It corrected a reference counting
> issue in version 0.0.1. Even though the first version works, it can
> sometimes miss a collection cycle, and allow memory to grow when it does
> not have to. Here is the new version and it contains a lot more debug
> stuff to show how your system is reacting to the algorihtm. I am sorry
> for missing that counting issue in version 0.0.1. If you can please,
> when you get free time, run it again? I need to see your output.
>
> https://pastebin.com/raw/CYZ78gVj

I'm posting the timed runs from both yesterday's and today's versions.
Yesterday's code runs faster; I have no idea if that's due to the extra
debugging.

BTW I forgot to mention that these are run on a VM with 8 cores. Later
on I'll run them on non-virtualized hardware.


$ c++ -std=c++11 -Wall -O2 -lpthread rcu_chris.cc && time ./a.out
Chris M. Thomassons Proxy Collector Port ver .0.0.1...
_______________________________________

Booting threads...
Threads running...
Threads completed!

node_allocations = 29400000
node_deallocations = 29400000


Test Completed!


real 0m6.890s
user 0m54.270s
sys 0m0.092s


$ c++ -std=c++11 -Wall -O2 -lpthread rcu_chris.0.0.2.cc && time ./a.out
Chris M. Thomassons Proxy Collector Port ver .0.0.2...
_______________________________________

Booting threads...
Threads running...
Threads completed!

node_allocations = 92400000
node_deallocations = 92400000

dtor_collect = 11
release_collect = 178
quiesce_complete = 189
quiesce_begin = 189
quiesce_complete_nodes = 92400000


Test Completed!


real 0m22.231s
user 2m56.383s
sys 0m0.249s

Manfred

Feb 9, 2021, 9:33:53 AM
On 2/9/21 3:01 PM, Manfred wrote:
> On 2/9/2021 8:31 AM, Chris M. Thomasson wrote:
>> On 2/7/2021 8:20 PM, Chris M. Thomasson wrote:
>>> Well, here is a test program. When you get some free time, can you
>>> please try to run it; give it a go?
>>
>> I need to ask for one more favor, especially to Manfred and Pavvo. Can
>> you please run this version 0.0.2? It corrected a reference counting
>> issue in version 0.0.1. Even though the first version works, it can
>> sometimes miss a collection cycle, and allow memory to grow when it
>> does not have to. Here is the new version and it contains a lot more
>> debug stuff to show how your system is reacting to the algorihtm. I am
>> sorry for missing that counting issue in version 0.0.1. If you can
>> please, when you get free time, run it again? I need to see your output.
>>
>> https://pastebin.com/raw/CYZ78gVj
>
> I'm posting the timed run from both yesterday's and today's versions.
> Yesterday's code runs faster, I have no idea if that's due to extra
> debugging.
>
> BTW I forgot to mention that these are run on a VM with 8 cores. Later
> on I'll run them on non-virtualized hardware.
>

Here they are, running on bare metal.

GCC 9.3.1

$ c++ -std=c++11 -Wall -O2 -lpthread rcu_chris.0.0.1.cc && time ./a.out
Chris M. Thomassons Proxy Collector Port ver .0.0.1...
_______________________________________

Booting threads...
Threads running...
Threads completed!

node_allocations = 29400000
node_deallocations = 29400000


Test Completed!


real 0m8.814s
user 2m19.340s
sys 0m0.242s

$ c++ -std=c++11 -Wall -O2 -lpthread rcu_chris.0.0.2.cc && time ./a.out
Chris M. Thomassons Proxy Collector Port ver .0.0.2...
_______________________________________

Booting threads...
Threads running...
Threads completed!

node_allocations = 92400000
node_deallocations = 92400000

dtor_collect = 11
release_collect = 171
quiesce_complete = 182
quiesce_begin = 182
quiesce_complete_nodes = 92400000


Test Completed!


real 0m24.975s
user 6m36.626s
sys 0m0.587s

Chris M. Thomasson

Feb 9, 2021, 3:33:46 PM
Thanks for running it again. Your output means everything worked as it
should. node_allocations == node_deallocations, and
quiesce_complete_nodes is <= node_allocations. Also, quiesce_complete ==
quiesce_begin.

Now, this version 0.0.2 should definitely be slower than 0.0.1 because
of several things. One is that it creates more threads and allocates a
lot more nodes.

Take a careful look at ver 0.0.1 wrt:
_________________________
// Iteration settings
static constexpr unsigned long ct_reader_iters_n = 1000000;
static constexpr unsigned long ct_writer_iters_n = 100000;


// Thread counts
static constexpr unsigned long ct_reader_threads_n = 42;
static constexpr unsigned long ct_writer_threads_n = 7;
_________________________


Vs. ver 0.0.2 wrt:
_________________________
// Iteration settings
static constexpr unsigned long ct_reader_iters_n = 2000000;
static constexpr unsigned long ct_writer_iters_n = 200000;


// Thread counts
static constexpr unsigned long ct_reader_threads_n = 53;
static constexpr unsigned long ct_writer_threads_n = 11;
_________________________



Then there is the debug stuff that will make it go slower as well...

Now, there is another interesting aspect. Ver 0.0.1 can sometimes miss
collection cycles, allowing the memory to grow. This means that calls to
delete are far less frequent than they need to be while the threads are
running. Hence the memory growth. So, it skips a lot of calls to delete,
which makes it run faster.

Ver 0.0.2 does not miss _any_ collection cycles. So, it will invoke
delete a lot more times when the threads are running.

In the properly working 0.0.2, this can be adjusted with the template
parameter std::size_t T_defer_limit in the ct_proxy class. Basically, it
waits to actually begin a quiescence process, prv_quiesce_begin(), until
the number of deferred nodes reaches half of T_defer_limit.

Don't ask me why I introduced T_defer_limit as a template parameter.

Uggg.

Chris M. Thomasson

unread,
Feb 9, 2021, 3:44:30 PM2/9/21
to
Perfect. version 0.0.2 should be a lot slower. Please read the response
I gave to Paavo. Since ver 0.0.2 creates more threads and doubles the
iterations of ver 0.0.1, adds more debug interference, well, it
basically has to run slower. The interesting part is that ver 0.0.2 does
not miss any collection cycles, so calls to delete are more frequent
while the threads are churning along. This can come down to the
performance of the underlying memory allocator, which gets hammered with
a lot of activity.

Humm... It would be fun to test raw new/delete vs a simple lock-free
pooling allocator in this proxy gc context.
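A simple lock-free pooling allocator of the kind mentioned could look roughly like the following Treiber-stack free list. This is only a sketch: the naive CAS-based pop below is ABA-prone under general concurrent use, and production versions add version tags or lean on an external reclamation scheme (such as the proxy collector being discussed):

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Sketch of a lock-free free-list ("pool") for fixed-size nodes,
// as a hypothetical alternative to raw new/delete.
// Requires size >= sizeof(slot) and a single uniform node size.
struct ct_pool
{
    struct slot { slot* next; };

    std::atomic<slot*> m_head{nullptr};

    void* allocate(std::size_t size)
    {
        // Try to pop a recycled node off the free list.
        slot* s = m_head.load(std::memory_order_acquire);
        while (s &&
               !m_head.compare_exchange_weak(s, s->next,
                                             std::memory_order_acquire))
        { /* retry with the updated head in s */ }

        if (s) return s;                // reuse a pooled node
        return ::operator new(size);    // pool empty: fall back to new
    }

    void deallocate(void* p)
    {
        // Push the node back onto the free list instead of deleting.
        slot* s = static_cast<slot*>(p);
        s->next = m_head.load(std::memory_order_relaxed);
        while (!m_head.compare_exchange_weak(s->next, s,
                                             std::memory_order_release))
        { /* retry with the updated head in s->next */ }
    }
};
```

Benchmarking this against raw new/delete inside the collector would show how much of the 0.0.2 slowdown is really allocator churn.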

Also, the template parameter std::size_t T_defer_limit in the ct_proxy
class can affect how many nodes are deferred before a collection cycle
is triggered.

Thanks again for giving it a go. Now, I think it's time for me to give it
a proper home over on GitHub.

:^)

Chris M. Thomasson

unread,
Feb 9, 2021, 6:04:40 PM2/9/21
to
On 2/8/2021 11:31 PM, Chris M. Thomasson wrote:
> On 2/7/2021 8:20 PM, Chris M. Thomasson wrote:
>> Well, here is a test program. When you get some free time, can you
>> please try to run it; give it a go?
[...]

I have been getting some requests to add in a std::cout.flush() after
some of the output in my program. So:

> int main()
> {
>     std::cout << "Chris M. Thomassons Proxy Collector Port ver
> .0.0.2...\n";
>     std::cout << "_______________________________________\n\n";

Add a std::cout.flush(); right here.

>
>     {
>         ct_shared shared;
>
>         std::thread readers[ct_reader_threads_n];
>         std::thread writers[ct_writer_threads_n];
>
>         std::cout << "Booting threads...\n";

std::cout.flush();


>
>         for (unsigned long i = 0; i < ct_writer_threads_n; ++i)
>         {
>             writers[i] = std::thread(ct_thread_writer, std::ref(shared));
>         }
>
>         for (unsigned long i = 0; i < ct_reader_threads_n; ++i)
>         {
>             readers[i] = std::thread(ct_thread_reader, std::ref(shared));
>         }
>
>         std::cout << "Threads running...\n";

std::cout.flush();

[...]

Some people are telling me that they cannot see the output until the
program is finished, hence adding std::cout.flush() helps here.

Öö Tiib

unread,
Feb 10, 2021, 3:46:55 PM2/10/21
to
On Wednesday, 10 February 2021 at 01:04:40 UTC+2, Chris M. Thomasson wrote:
> On 2/8/2021 11:31 PM, Chris M. Thomasson wrote:
> > On 2/7/2021 8:20 PM, Chris M. Thomasson wrote:
> >> Well, here is a test program. When you get some free time, can you
> >> please try to run it; give it a go?
> [...]
>
> I have been getting some requests to add in a std::cout.flush() after
> some of the output in my program. So:
> > int main()
> > {
> > std::cout << "Chris M. Thomassons Proxy Collector Port ver
> > .0.0.2...\n";
> > std::cout << "_______________________________________\n\n";
> Add a std::cout.flush(); right here.

Or add << std::endl instead of << "\n" wherever you want to flush
text streams.

Chris M. Thomasson

unread,
Feb 10, 2021, 8:35:47 PM2/10/21
to
Yup, that works as well. For some reason, I tend to avoid it.

std::cout << "abcd\n";
std::cout << "efgh\n";
std::cout << "hijk\n";
std::cout.flush();

vs. using std::endl everywhere:


std::cout << "abcd" << std::endl;
std::cout << "efgh" << std::endl;
std::cout << "hijk" << std::endl;

or even:

std::cout << "abcd\n";
std::cout << "efgh\n";
std::cout << "hijk" << std::endl;

Jorgen Grahn

unread,
Feb 13, 2021, 7:54:26 AM2/13/21
to
On Tue, 2021-02-09, Chris M. Thomasson wrote:
> On 2/8/2021 11:31 PM, Chris M. Thomasson wrote:
>> On 2/7/2021 8:20 PM, Chris M. Thomasson wrote:
>>> Well, here is a test program. When you get some free time, can you
>>> please try to run it; give it a go?
> [...]
>
> I have been getting some requests to add in a std::cout.flush() after
> some of the output in my program. So:
>
>> int main()
>> {
>>     std::cout << "Chris M. Thomassons Proxy Collector Port ver
>> .0.0.2...\n";
>>     std::cout << "_______________________________________\n\n";
>
> Add a std::cout.flush(); right here.
>
>>
>>     {
>>         ct_shared shared;
>>
>>         std::thread readers[ct_reader_threads_n];
>>         std::thread writers[ct_writer_threads_n];
>>
>>         std::cout << "Booting threads...\n";
>
> std::cout.flush();
...

> Some people are telling me that they cannot see the output until the
> program is finished, hence adding std::cout.flush() helps here.

Don't those people have a broken environment? If you write a full
line of text to std::cout, and std::cout is a terminal, surely that
line should be printed (leave the process) when operator<< returns?
I.e. std::cout should be in line buffered mode.

At least on Unix (where the whole output/error stream thing comes
from).

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Chris M. Thomasson

unread,
Feb 14, 2021, 1:39:55 AM2/14/21
to
I don't know if that's guaranteed.

Manfred

unread,
Feb 14, 2021, 12:02:03 PM2/14/21
to
As Öö Tiib pointed out, the difference between std::endl and '\n' in C++
is exactly that the former executes basic_ostream::flush(), the latter
doesn't.

From the example in https://en.cppreference.com/w/cpp/io/manip/endl :

> "With \n instead of endl, the output would be the same, but may not appear in real time."


Jorgen Grahn

unread,
Feb 14, 2021, 3:13:10 PM2/14/21
to
I think it is, but it would be nice to have it confirmed. I think I
can quote W R Stevens, but he only writes about Unix.

If the people with the problems e.g. ran the code in an IDE, that
would explain it. Or piped the output through less(1).

Scott Lurndal

unread,
Feb 14, 2021, 5:22:53 PM2/14/21
to
Jorgen Grahn <grahn...@snipabacken.se> writes:
>On Sun, 2021-02-14, Chris M. Thomasson wrote:
>> On 2/13/2021 4:54 AM, Jorgen Grahn wrote:
>>> On Tue, 2021-02-09, Chris M. Thomasson wrote:
>...
>>>> Some people are telling me that they cannot see the output until the
>>>> program is finished, hence adding std::cout.flush() helps here.
>>>
>>> Don't those people have a broken environment? If you write a full
>>> line of text to std::cout, and std::cout is a terminal, surely that
>>> line should be printed (leave the process) when operator<< returns?
>>> I.e. std::cout should be in line buffered mode.
>>>
>>> At least on Unix (where the whole output/error stream thing comes
>>> from).
>>
>> I don't know if that's guaranteed.
>
>I think it is, but it would be nice to have it confirmed. I think I
>can quote W R Stevens, but he only writes about Unix

POSIX requires stdout to be line buffered if and only if the underlying
file descriptor refers to a terminal, serial port, console or
pseudoterminal device (isatty() == true); stderr is required never to
be fully buffered.

Otherwise stdout will be fully buffered.

The application controls the buffering using setvbuf(3) or setbuf(3),
and it is often useful for an application to explicitly set the
buffering mode to line-buffered so that, when redirected to a file, the
output is available line by line to other tools like the tail(1) command.

Chris M. Thomasson

unread,
Feb 18, 2021, 6:17:40 AM2/18/21
to
> [...]
> _______________________________
>
>

Have not forgot about this. Just got caught up in some fractal work.

Chris M. Thomasson

unread,
Feb 28, 2021, 11:45:24 PM2/28/21
to
On 2/18/2021 3:17 AM, Chris M. Thomasson wrote:
> On 2/8/2021 11:31 PM, Chris M. Thomasson wrote:
>> On 2/7/2021 8:20 PM, Chris M. Thomasson wrote:
>>> Well, here is a test program. When you get some free time, can you
>>> please try to run it; give it a go?
>>
>> I need to ask for one more favor, especially to Manfred and Pavvo. Can
>> you please run this version 0.0.2? It corrected a reference counting
>> issue in version 0.0.1. Even though the first version works, it can
>> sometimes miss a collection cycle, and allow memory to grow when it
>> does not have to. Here is the new version and it contains a lot more
>> debug stuff to show how your system is reacting to the algorithm. I am
>> sorry for missing that counting issue in version 0.0.1. If you can
>> please, when you get free time, run it again? I need to see your output.
>>
>> https://pastebin.com/raw/CYZ78gVj
[...]

I have an idea about creating a more "distributed" poor mans RCU using
per thread sequence locks. Just need to code it up in Relacy.