The atomic store does not synchronize with the atomic load in this case.
See C++ 1.10/8:
Certain library calls synchronize with other library calls performed
by another thread. For example, an atomic store-release synchronizes
with a load-acquire that takes its value from the store (29.3). [
Note: Except in the specified cases, reading a later value does not
necessarily ensure visibility as described below. Such a requirement
would sometimes interfere with efficient implementation. — end note ]
The problem is that the store writes 3, but the load reads 4, so the load
does not take its value from that store.
To fix this, you need to ensure transitive synchronization: the thread that
stores 3 synchronizes with the thread that stores 4, which in turn
synchronizes with the load of 4. As a result, the store of 3 happens before
the load of 4, even though the two do not synchronize with each other
directly.
Something along the lines of using memory_order_acquire (rl::mo_acquire)
instead of the relaxed load here:
while( m_tail_2($).load( rl::mo_relaxed ) != tail )
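
The transitive chain can be sketched in plain standard C++ (not Relacy). This is only an illustration under assumptions: the names run_transitive_demo, payload and tail are hypothetical, and tail merely stands in for m_tail_2 from the question. Thread A publishes data and stores 3, thread B waits for 3 with an acquire load and then stores 4, and thread C waits for 4 with an acquire load before reading the data.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: returns the value thread C reads from the plain payload.
int run_transitive_demo() {
    int payload = 0;            // ordinary, non-atomic data published by thread A
    std::atomic<int> tail{2};   // stands in for m_tail_2 (assumption)
    int seen = -1;

    // Thread A: write the data, then store 3 with release semantics.
    std::thread a([&] {
        payload = 42;
        tail.store(3, std::memory_order_release);
    });

    // Thread B: the acquire load that reads 3 synchronizes with A's store.
    // B's release store of 4 is sequenced after that load.
    std::thread b([&] {
        while (tail.load(std::memory_order_acquire) != 3)
            std::this_thread::yield();
        tail.store(4, std::memory_order_release);
    });

    // Thread C: the acquire load that reads 4 synchronizes with B's store,
    // so A's write to payload happens before this read (A -> B -> C).
    std::thread c([&] {
        while (tail.load(std::memory_order_acquire) != 4)
            std::this_thread::yield();
        seen = payload;
    });

    a.join();
    b.join();
    c.join();
    assert(tail.load() == 4);
    return seen;
}
```

Calling run_transitive_demo() returns 42. If the waiting load in thread B were relaxed instead of acquire, the chain would be broken: C's load of 4 would still synchronize with B's store, but nothing would order A's write to payload before C's read, and that read would be a data race.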