
Refreshing cpu cache before atomic relaxed loads


itaj sherman

Jun 22, 2015, 3:30:04 PM

My question came up while implementing a spin-wait, where a thread needs
to wait for another thread to complete a piece of work so short that
locking/releasing a mutex might take longer than the work itself.
The problem, as I see it, is that the standard does not explicitly define a
clear way to refresh the cpu cache before re-loading an atomic variable.

In my example code below, atomic_oneway_flag::spin_wait_flag has 3 suggested
implementations. They are equivalent w.r.t memory ordering as seen by user
code. But they might not be equal in speed.

Specifically, it seems that in practice (correct me if I'm wrong),
implementation-2 of atomic_oneway_flag::spin_wait_flag below
is faster/better than implementation-1.
I.e. a load( ..., memory_order_acquire ) on every iteration, rather than a
relaxed load followed by a single fence( memory_order_acquire ) once the
flag is seen set.

Now, is it somehow implied by the standard that an acquire operation might
refresh the following loads faster than a relaxed operation?
I cannot see that it is.
Thus, I would expect to do something like implementation-3 below using an
operation that explicitly and specifically refreshes the cache for the next
relaxed load.

#include <atomic>

class atomic_oneway_flag
{

    //data
    private: std::atomic<bool> m;

    //ctors
    public: atomic_oneway_flag()
    :
        m(false)
    {
    }

    //methods
    public: void turn_on()
    {
        m.store( true, std::memory_order_release );
    }

    public: bool test()
    {
        bool x( m.load( std::memory_order_relaxed ) );
        if( x ) {
            /* pairs with the release store in turn_on() */
            std::atomic_thread_fence( std::memory_order_acquire );
        }
        return x;
    }

#if USE_IMPLEMENTATION() == 1

    public: void spin_wait_flag() //implementation 1
    {
        while( true ) {
            bool x( m.load( std::memory_order_relaxed ) );
            if( x ) {
                std::atomic_thread_fence( std::memory_order_acquire );
                return;
            }
        }
    }

#elif USE_IMPLEMENTATION() == 2

    public: void spin_wait_flag() //implementation 2
    {
        while( true ) {
            bool x( m.load( std::memory_order_acquire ) );
            /* if x is false acquire might cause */
            /* cpu to refresh faster for next load */
            if( x ) {
                return;
            }
        }
    }

#elif USE_IMPLEMENTATION() == 3

    public: void spin_wait_flag() //implementation 3
    {
        while( true ) {
            bool x( m.load( std::memory_order_relaxed ) );
            if( x ) {
                std::atomic_thread_fence( std::memory_order_acquire );
                return;
            } else {
                /* Some code that refreshes the cache for the */
                /* following relaxed load. */
                /* Supposedly std::atomic::load_memory_barrier(); */
                /* (hypothetical, no such function exists) */
            }
        }
    }

#else
#error "USE_IMPLEMENTATION() must be 1, 2 or 3"
#endif

};


//user code

atomic_oneway_flag flag;

//thread 1
... do some very short work
flag.turn_on();

//threads 2..N
flag.spin_wait_flag(); //while thread 1 does short work.
... do some work
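
For anyone who wants to run the usage pattern above, here is a minimal self-contained sketch. It assumes a C++11 compiler with std::thread available and follows implementation 1. The names oneway_flag_sketch and run_flag_demo, and the payload value 42, are made up for illustration and are not part of the code above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for atomic_oneway_flag, following implementation 1.
class oneway_flag_sketch
{
public:
    void turn_on()
    {
        m.store(true, std::memory_order_release);
    }

    void spin_wait_flag()
    {
        while (!m.load(std::memory_order_relaxed)) {
            // Busy-wait. Nothing here tells the CPU this is a spin loop.
        }
        // Pairs with the release store in turn_on().
        std::atomic_thread_fence(std::memory_order_acquire);
    }

private:
    std::atomic<bool> m{false};
};

// Hypothetical driver: returns how many waiting threads saw the payload
// that "thread 1" wrote before turning the flag on.
int run_flag_demo(int waiters)
{
    assert(waiters >= 0);

    oneway_flag_sketch flag;
    int payload = 0;              // plain, non-atomic data published via the flag
    std::atomic<int> seen{0};

    std::vector<std::thread> pool;
    for (int i = 0; i < waiters; ++i) {
        pool.emplace_back([&] {
            flag.spin_wait_flag();        // threads 2..N
            if (payload == 42) {
                seen.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }

    payload = 42;                 // thread 1: the "very short work"
    flag.turn_on();

    for (auto& t : pool) {
        t.join();
    }
    return seen.load();
}
```

Calling run_flag_demo(3) should return 3, since every waiter reads the payload only after the acquire fence that pairs with the release store.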

regards,
itaj


--
[ comp.std.c++ is moderated. To submit articles, try posting with your ]
[ newsreader. If that fails, use mailto:std-cpp...@vandevoorde.com ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]

via....@googlemail.com

Jun 25, 2015, 3:40:07 PM

On Monday, 22 June 2015 21:30:04 UTC+2, itaj sherman wrote:
> My question turned up while implementing spin-wait
> when a thread needs to wait for another thread to complete a short work,
> so short that locking/releasing a mutex might take longer than the work.
> The problem the way I see it is the standard does not explicitly define a
> clear way to refresh the cpu cache before re-loading an atomic variable.

The standard does not deal with CPU caches, or even CPUs for that matter.
The standard specifies an abstract machine, whose observable behaviour
every conforming implementation must emulate with respect to a well-formed
C++ program given to it. CPUs and caches are a detail internal to the
implementation; your C++ program cannot deal with that in a
standard-conforming, implementation-independent way. So the part of your
question regarding refreshing the cache is meaningless for this group.

With respect to "speed", the question is not very well formed. The speed of
what? Of every single iteration of the while loop? Is that even important?
Or the speed of the entire spin_wait_flag execution? Under what conditions?

Unless you formulate your speed measurement methodology, your question
cannot be answered. And if you do formulate your methodology, then you do
not need this group to answer your question, because you can use your
methodology to measure the speed, whatever it is.
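
As one possible way to make "speed" concrete, the sketch below measures the mean hand-off latency: the time from just before the release store until the spinning thread leaves its loop. This is only an assumed methodology, not one proposed in this thread. The names mean_handoff_ns, spin_relaxed_then_fence and spin_acquire are hypothetical, and the results depend heavily on hardware, scheduling and thread placement.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Candidate 1: relaxed loads in the loop, one acquire fence at the end.
inline void spin_relaxed_then_fence(std::atomic<bool>& f)
{
    while (!f.load(std::memory_order_relaxed)) {}
    std::atomic_thread_fence(std::memory_order_acquire);
}

// Candidate 2: an acquire load on every iteration.
inline void spin_acquire(std::atomic<bool>& f)
{
    while (!f.load(std::memory_order_acquire)) {}
}

// Hypothetical harness: mean time in nanoseconds between the moment the
// setter is about to store the flag and the moment the waiter has left
// its spin loop.
template <class SpinWait>
double mean_handoff_ns(SpinWait spin_wait, int rounds)
{
    using timer = std::chrono::steady_clock;
    assert(rounds > 0);
    double total_ns = 0.0;

    for (int r = 0; r < rounds; ++r) {
        std::atomic<bool> flag{false};
        std::atomic<bool> waiter_ready{false};
        timer::time_point released;
        timer::time_point observed;

        std::thread waiter([&] {
            waiter_ready.store(true, std::memory_order_release);
            spin_wait(flag);
            observed = timer::now();      // read by the main thread after join()
        });

        // Start timing only once the waiter thread is actually running.
        while (!waiter_ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        released = timer::now();
        flag.store(true, std::memory_order_release);
        waiter.join();

        total_ns +=
            std::chrono::duration<double, std::nano>(observed - released).count();
    }
    return total_ns / rounds;
}
```

For example, mean_handoff_ns(spin_acquire, 1000) and mean_handoff_ns(spin_relaxed_then_fence, 1000) can be compared on the target machine. A serious measurement would also pin threads to cores and look at the distribution, not only the mean.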

itaj sherman

Jun 26, 2015, 2:30:05 AM

On Monday, June 22, 2015 at 10:30:04 PM UTC+3, itaj sherman wrote:
> My question turned up while implementing spin-wait
> when a thread needs to wait for another thread to complete a short work,
> so short that locking/releasing a mutex might take longer than the work.

I guess it's important to add:
this assumes the waiting thread knows that the other thread is actually
doing that work right now, and that the work is shorter than a thread
context switch (and mutex/condvar operations).

> The problem the way I see it is the standard does not explicitly define a
> clear way to refresh the cpu cache before re-loading an atomic variable.
>
> In my example code below, atomic_oneway_flag::spin_wait_flag has 3 suggested
> implementations. They are equivalent w.r.t memory ordering as seen by user
> code. But they might not be equal in speed.
>
> Specifically, it seems that practically (correct me if I'm wrong),
> implementation-2 of atomic_oneway_flag::spin_wait_flag below
> is faster/better than implementation-1.
> I.e. load( ..., memory_order_acquire ) or fence( memory_order_acquire )
>
> Now, is it somehow implied by the standard that an acquire operation might
> refresh the following loads faster than a relaxed operation?
> I cannot see that it is.
> Thus, I would expect to do something like implementation-3 below using an
> operation that explicitly and specifically refreshes the cache for the next
> relaxed load.
>

It seems I may be asking about something like the x86 "pause" instruction,
as explained here:
http://www.quora.com/What-is-the-purpose-of-the-pause-instruction-in-the-x86-ISA

I've seen that this instruction is used in the implementation of
boost::atomics::detail::pause().

So is there anything like that in the standard?
Was it ever discussed?


> ....
>
> public: void spin_wait_flag() //implementation 3
> {
>     while( true ) {
>         bool x( m.load( std::memory_order_relaxed ) );
>         if( x ) {
>             std::atomic_thread_fence( std::memory_order_acquire );
>             return;
>         } else {
>             /* Some code that refreshes the cache for the */
>             /* following relaxed load. */
>             /* Supposedly std::atomic::load_memory_barrier(); */
>         }
>     }
> }
>

To fix implementation 3 with that:

    public: void spin_wait_flag() //implementation 3
    {
        while( true ) {
            bool x( m.load( std::memory_order_relaxed ) );
            if( x ) {
                std::atomic_thread_fence( std::memory_order_acquire );
                return;
            } else {
                /* Some code that refreshes the cache for the */
                /* following relaxed load. */
                boost::atomics::detail::pause();
            }
        }
    }

//code from boost_1_58_0\boost\atomic\detail\pause.hpp(30)

BOOST_FORCEINLINE void pause() BOOST_NOEXCEPT
{
#if defined(_MSC_VER) && (defined(_M_AMD64) || defined(_M_IX86))
_mm_pause();
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
__asm__ __volatile__("pause;");
#endif
}
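
A self-contained variant of the same idea, without the Boost dependency, might look like the sketch below. The names cpu_relax, spin_wait_with_hint and pause_demo are hypothetical, and the AArch64 "yield" branch is an addition that is not in the Boost code quoted above. Note that Intel documents "pause" as a hint that the code is in a spin-wait loop, not as a cache refresh, so whether it helps is something to measure on the target machine.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(_MSC_VER) && (defined(_M_AMD64) || defined(_M_IX86))
#include <immintrin.h>   // _mm_pause
#endif

// Hypothetical helper, not a standard facility: issues a spin-loop hint
// where one is known for the target, and does nothing elsewhere.
inline void cpu_relax() noexcept
{
#if defined(_MSC_VER) && (defined(_M_AMD64) || defined(_M_IX86))
    _mm_pause();
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("pause");
#elif defined(__GNUC__) && defined(__aarch64__)
    __asm__ __volatile__("yield");   // assumption: AArch64 counterpart of the hint
#endif
}

// Same shape as implementation 3: relaxed loads with a hint between them,
// then one acquire fence once the flag is seen set.
inline void spin_wait_with_hint(const std::atomic<bool>& flag)
{
    while (!flag.load(std::memory_order_relaxed)) {
        cpu_relax();
    }
    std::atomic_thread_fence(std::memory_order_acquire);
}

// Hypothetical demo: a producer publishes a value, the caller spins for it.
inline int pause_demo()
{
    std::atomic<bool> flag{false};
    int payload = 0;

    std::thread producer([&] {
        payload = 7;                                  // the short work
        flag.store(true, std::memory_order_release);
    });

    spin_wait_with_hint(flag);
    assert(flag.load(std::memory_order_relaxed));     // the flag is one-way
    int seen = payload;                               // ordered after the fence

    producer.join();
    return seen;
}
```

On targets where none of the branches apply, cpu_relax() compiles to nothing and the loop degrades to implementation 1.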

> ...
>
> regards,
> itaj