FlushProcessWriteBuffers() performance

Dmitriy Vyukov

Jan 27, 2010, 12:52:49 PM
to Scalable Synchronization Algorithms
It seems that Win32's FlushProcessWriteBuffers() function is not as
heavy as one may think. A year ago I did a simple synthetic benchmark
of FlushProcessWriteBuffers() on a dual-core P9500 machine. Here is an
excerpt from an old email:

----------------------------------------

Also, I've finally tested FlushProcessWriteBuffers() on a dual-core
machine with the following test:

#include <windows.h>
#include <process.h>
#include <intrin.h>
#include <stdio.h>

unsigned __stdcall thread(void* p)
{
    // Worker: spin on arithmetic, then print elapsed
    // cycles * 1000 / iterations.
    unsigned __int64 t1 = __rdtsc();
    volatile __int64 data = 0;
    unsigned __int64 const count = 1000000000;
    for (unsigned __int64 i = 0; i != count; ++i)
    {
        data *= data;
    }
    unsigned __int64 t2 = __rdtsc();
    printf("time=%u\n", (unsigned)((t2 - t1) * 1000 / count));
    return 0;
}

int main()
{
    // Pin the worker to CPU 1 and the main thread to CPU 0.
    HANDLE t = (HANDLE)_beginthreadex(0, 0, thread, 0,
        CREATE_SUSPENDED, 0);
    SetThreadAffinityMask(t, 2);
    SetThreadAffinityMask(GetCurrentThread(), 1);
    ResumeThread(t);
    unsigned __int64 tmin = 1000000000;
    unsigned __int64 tmax = 0;
    unsigned __int64 tsum = 0;
    unsigned __int64 tcount = 0;
    while (WAIT_TIMEOUT == WaitForSingleObject(t, 0))
    {
#ifdef DO_FLUSH
        // Measure the cost of each flush on the issuing processor.
        unsigned __int64 t1 = __rdtsc();
        FlushProcessWriteBuffers();
        unsigned __int64 t2 = __rdtsc() - t1;
        if (t2 < tmin)
            tmin = t2;
        if (t2 > tmax)
            tmax = t2;
        tsum += t2;
        tcount += 1;
#endif
    }
    printf("min=%u, max=%u, mean=%u, count=%u\n", (unsigned)tmin,
        (unsigned)tmax, tcount ? (unsigned)(tsum / tcount) : 0,
        (unsigned)tcount);
}

Without DO_FLUSH I get (from 3 runs):
time=15991
time=15825
time=15865

With DO_FLUSH:
time=28950
min=418, max=18783970, mean=1611, count=10725801

time=28663
min=418, max=31881715, mean=1599, count=10706885

time=29029
min=418, max=56535079, mean=1592, count=11036976

It's clearly seen that FlushProcessWriteBuffers() affects not only the
current processor but all other processors as well. The average
overhead on the host processor is ~1600 cycles, which is probably not
that much (though it's an open question what the cost will be on a
quad-core). The average overhead on the remote processor is ~1300
cycles (if my math is correct).
I hope that on a quad-core machine the OS issues the IPIs to all
processors in parallel, so the total overhead per
FlushProcessWriteBuffers() can be roughly estimated as 1500 cycles *
number_of_processors. Taking into account that a single cache miss on
a distributed system may take 300-1000 cycles, the mentioned per-epoch
overhead does not look so serious.

--
Dmitriy V'jukov

Chris M. Thomasson

Jan 27, 2010, 11:41:08 PM
to Scalable Synchronization Algorithms
On Jan 27, 9:52 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> It seems that Win32's FlushProcessWriteBuffers() function is not as
> heavy as one may think. A year ago I did a simple synthetic benchmark
> of FlushProcessWriteBuffers() on a dual-core P9500 machine. Here is
> an excerpt from an old email:
[...]

Yes; AFAICT active synchronization epoch detection via
`FlushProcessWriteBuffers()' works very well with asymmetric
synchronization patterns in general. I was mainly interested in how
the slow-path (e.g., writer in asymmetric read-write lock) responds to
periods of load. IIRC the email had something to do with:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/abb3622071b0d52f
(a hack for passive sync epoch detection in Windows)


BTW, could you perhaps provide a patch for Relacy that has
`FlushProcessWriteBuffers()' in the near future? I would be very
interested in modeling all of my algorithms that depend on automatic
sync epoch detection.
