Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

asymmetric eventcount

25 views
Skip to first unread message

Dmitriy V'jukov

unread,
Apr 5, 2008, 12:39:23 PM4/5/08
to
I've figured out that it's possible to make eventcount asymmetric.
I.e. move all overheads from producer's fast-path to consumer's slow-
path.

Now consumer pattern looks like:

work_t* get_work()
{
work_t* work = 0;
while (! (work = try_get_work()) {
eventcount_cmp_t const cmp = eventcount_get();
if (work = try_get_work()) { break; }
eventcount_wait(cmp);
}
return work;
}

If we modify consumer pattern like this:

work_t* get_work()
{
work_t* work = 0;
while (! (work = try_get_work()) {
eventcount_cmp_t const cmp = eventcount_get();

/*-------------------------------------*/
system_wide_synchronize();
/*-------------------------------------*/

if (work = try_get_work()) { break; }
eventcount_wait(cmp);
}
return work;
}

Then producer is not required to execute #StoreLoad after submitting
work item and before signalling eventcount.

system_wide_synchronize() can be FlushProcessorWriteBuffers(),
rcu_synchronize(), signal to every thread, APC to every thread,
polling of thread/processor activity, IPI/x-calls, thread suspend/
resume, mprotect()/VirtualAlloc(), wait long enough, or something like
this:
http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/f8f78ff07df6df5f

Probably it can be worth doing. Because heavy
system_wide_synchronize() executed only when consumer has no work to
do and going to sleep, i.e. when system is underloaded. On the other
hand it allows to eliminate #StoreLoad from producer's fast-path, i.e.
increase throughput under heavy load.
In the environment with thread per processor,
system_wide_synchronize() will eat processor time only when otherwise
processor would be idle.

The whole thing makes sense only if producer doesn't execute
#StoreLoad after submitting work item anyway. I.e. if producer uses
something like spsc-queue, work-stealing deque etc. Or if platform has
atomic RMW operations stripped from memory barriers.
If producer uses x86 atomic RMW operation (InterlockedXXX) to submit
work item, then asymmetric eventcount is senseless.


Dmitriy V'jukov

Dmitriy V'jukov

unread,
Apr 5, 2008, 1:19:27 PM4/5/08
to
On 5 апр, 20:39, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> I've figured out that it's possible to make eventcount asymmetric.
> I.e. move all overheads from producer's fast-path to consumer's slow-
> path.


system_wide_synchronize() can be implemented in interesting and
effective way wrt asymmetric eventcount. Note that:
1. While waiting for system wide synchronization to complete consumer
can periodically check whether new work is arrived. And if new work is
arrived then consumer can just cancel or forgot about system wide
synchronization and begin to processing new work.
2. It's not very critical if consumer can miss new work with low
probability and untimely go to sleep. Because any new work item will
resume consumer. It's not critical like in asymmetric mutex.


So if we using 'APC/signal to every thread', or 'suspend/resume every
thread' etc, this can implemented this way:

work_t* get_work()
{
work_t* work = 0;
while (! (work = try_get_work()) {
eventcount_cmp_t const cmp = eventcount_get();

system_wide_synchronize_state state;
start_system_wide_synchronize(&state);
do
{
yield_thread_execution();


if (work = try_get_work()) { break; }

continue_system_wide_synchronize(&state);
}
while (is_system_wide_synchronize_completed(&state));

if (work) { break; }


if (work = try_get_work()) { break; }

eventcount_wait(cmp);
}
return work;
}

Here continue_system_wide_synchronize() can, for example, suspend/
resume one thread at a time. This will increase reactivity and
decrease overheads if new work is arrived.


Now we will take into consideration point 2:

work_t* get_work()
{
work_t* work = 0;
while (! (work = try_get_work()) {
eventcount_cmp_t const cmp = eventcount_get();

uint64_t const cycles_to_wait = 2000;
uint64_t start_time = _rdtsc(); // hardware cycle counter
do
{
yield_thread_execution(); // SwitchToThread()


if (work = try_get_work()) { break; }
}

while (_rdtsc() - start_time < cycles_to_wait);

if (work) { break; }


if (work = try_get_work()) { break; }

eventcount_wait(cmp);
}
return work;
}

It's simple, portable and effective way to implement
system_wide_synchronize() for asymmetric eventcount. And it doesn't
affect in any way other threads and other processors, so it's "O(1)".
Cool indeed.
It's a bit hazardous, but in this situation it's tolerably in most
cases.


Dmitriy V'jukov

Chris Thomasson

unread,
Apr 5, 2008, 11:54:43 PM4/5/08
to
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:5238f11f-3e62-461e...@s37g2000prg.googlegroups.com...

> On 5 апр, 20:39, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> > I've figured out that it's possible to make eventcount asymmetric.
> > I.e. move all overheads from producer's fast-path to consumer's slow-
> > path.


> system_wide_synchronize() can be implemented in interesting and
> effective way wrt asymmetric eventcount.

Eventcounts kick as$!!!

:^)

[...]

Its VERY nice that you did not have to actually modify the eventcount API,
or implementation, in any way in order to implement your scheme which is
designed to heavily reduce the pressure on the producers wrt the nasty
membar. As you well know, there are narrow situations in which producers
could "always" produce multiple items and only execute #StoreLoad _after_
production. Anyway, I think it would still be helpful in that specific
scenario because producer still must execute #StoreLoad. Basically, I think
it is definitely worth while to go ahead and try to reduce "any" overheads
associated with producers at the expense of a more expensive consumer
slow-path of course; very cool!

:^)

0 new messages