
Cohort-Scheduling...


Chris M. Thomasson
Sep 25, 2008, 2:32:31 AM
Here is the idea:

http://research.microsoft.com/users/larus/Papers/usenix02_cohort.pdf

I am in the process of drafting a design for a new IOCP-based Windows server
framework. I have written one in the past, but it's outdated and should
really be rewritten from scratch. I am thinking about incorporating
cohort-scheduling into the framework internals in order to make the process
completely transparent to the user. To do this, I would need to process work
in batches and coalesce similar operations together. I have an idea on how
to do this. Here is the general outline... Below is the portion of the
per-io/socket/thread data structures that is relevant to the "scheduler".
For now I will perform cohort-scheduling on just socket sends and receives,
and demonstrate simple echo-server logic:


/* pseudo-code; all error handling omitted */


#define IO_RECV 0x1UL
#define IO_SEND 0x2UL
#define WORK_BATCH_SIZE 128

struct per_thread {
    struct per_io* issue_recvs;
    struct per_io* issue_sends;
    struct per_io* process_recvs;
    struct per_io* process_sends;
};

struct per_socket {
    SOCKET hsck;
};

struct per_io {
    OVERLAPPED ol;
    struct per_socket* sck;
    struct per_io* next;
    unsigned long flags;
    WSABUF buf;
    size_t buf_size;
};


static int
process_recv(
    struct per_thread* thread,
    struct per_io* io
) {
    /* we received data; queue an echo of it back to the peer */

    io->next = thread->issue_sends;
    thread->issue_sends = io;

    return 0;
}


static int
process_send(
    struct per_thread* thread,
    struct per_io* io
) {
    /* we sent data; queue another receive on this socket */

    io->buf.len = io->buf_size;
    io->next = thread->issue_recvs;
    thread->issue_recvs = io;

    return 0;
}

static int
issue_recv(
    struct per_thread* thread,
    struct per_io* io
) {
    /* actually issue the recv */

    io->flags |= IO_RECV;

    WSARecv(io->sck->hsck, &io->buf, ..., ..., ..., &io->ol, ...);

    return 0;
}

static int
issue_send(
    struct per_thread* thread,
    struct per_io* io
) {
    /* actually issue the send */

    io->flags |= IO_SEND;

    WSASend(io->sck->hsck, &io->buf, ..., ..., ..., &io->ol, ...);

    return 0;
}


/* here is the worker thread loop */
static void
worker_thread(void) {
    struct per_thread thread = { NULL };
    struct per_io* io;
    int batch_size;

    for (;;) {
        /* gather work */
        for (batch_size = 0; batch_size < WORK_BATCH_SIZE; ++batch_size) {
            DWORD bytes = 0;
            io = NULL;
            if (! GetQueuedCompletionStatus(..., &bytes, ...,
                                            (OVERLAPPED**)&io, 0)) {
                if (GetLastError() == WAIT_TIMEOUT) {
                    /* port is empty; process what we already have,
                       otherwise block for the next completion */
                    if (batch_size) { break; }
                    io = NULL;
                    bytes = 0;
                    GetQueuedCompletionStatus(..., &bytes, ...,
                                              (OVERLAPPED**)&io, INFINITE);
                } else {
                    /* handle error! */
                    --batch_size;
                    continue;
                }
            }

            /* batch the completion by operation type */
            if (io->flags & IO_SEND) {
                io->flags &= ~IO_SEND;
                io->next = thread.process_sends;
                thread.process_sends = io;
            } else {
                io->flags &= ~IO_RECV;
                io->next = thread.process_recvs;
                thread.process_recvs = io;
            }
        }


        /* process the sends */
        while (thread.process_sends) {
            io = thread.process_sends->next;
            process_send(&thread, thread.process_sends);
            thread.process_sends = io;
        }


        /* process the receives */
        while (thread.process_recvs) {
            io = thread.process_recvs->next;
            process_recv(&thread, thread.process_recvs);
            thread.process_recvs = io;
        }


        /* issue sends */
        while (thread.issue_sends) {
            io = thread.issue_sends->next;
            issue_send(&thread, thread.issue_sends);
            thread.issue_sends = io;
        }


        /* issue recvs */
        while (thread.issue_recvs) {
            io = thread.issue_recvs->next;
            issue_recv(&thread, thread.issue_recvs);
            thread.issue_recvs = io;
        }
    }
}

This allows me to process similar operations together in batches which will
give me the performance benefits described in the paper. What do you think
about my batching algorithm? Do you have any suggestions?

Thanks.

David Schwartz
Sep 25, 2008, 6:23:32 PM
On Sep 24, 11:32 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> This allows me to process similar operations together in batches which will
> give me the performance benefits described in the paper. What do you think
> about my batching algorithm? Do you have any suggestions?

I think you're stabbing yourself doing this. The whole point of GQCS
is that it schedules for you. What happens if you take a page fault
while you have all these completion events queued up for your
processing?

DS

Chris M. Thomasson
Sep 26, 2008, 6:48:25 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:464c13b9-8e8c-4a27...@b2g2000prf.googlegroups.com...

On Sep 24, 11:32 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > This allows me to process similar operations together in batches
> > which will give me the performance benefits described in the paper.
> > What do you think about my batching algorithm? Do you have any
> > suggestions?

> I think you're stabbing yourself doing this. The whole point of GQCS
> is that it schedules for you.

I think I can still benefit from the IOCP scheduling while taking advantage
of batching similar operations into cohorts... The algorithm is extremely
simple.


> What happens if you take a page fault
> while you have all these completion events queued up for your
> processing?

Humm... Well, that would require a wait, which could stall some operations
for a moment. However, I still think the possible benefits of
cohort-scheduling would outweigh some of the caveats. On page 3, figure 2,
the paper shows what happens when they assembled multiple calls to
`WriteFile' into a cohort and processed it all at once. IMHO, the numbers
are fairly impressive indeed. I think I could possibly reap the same
performance enhancement if I assemble cohorts of WSASend, WSARecv, AcceptEx,
ConnectEx, WriteFile, ReadFile, TransmitPackets, etc. calls and process them
all at once. I think it's a worthwhile experiment. Humm... Did you read the
paper? I think it has some merit...

David Schwartz
Sep 26, 2008, 7:12:52 PM
On Sep 26, 3:48 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Humm... Well, that would require a wait which could stall some operations
> for a moment.

Which defeats most of the entire point of IOCP -- avoiding ambush,
avoiding having 'too many' threads running, avoiding having 'too few'
threads running, and so on.

Now if the scheduler could do this, and when you call GQCS, it prefers
an event similar to the one this thread grabbed last, that would be
great. All upside, very little downside.

> However, I still think the possible benefits of
> cohort-scheduling would outweigh some of the caveats. On page 3 figure 2 in
> the paper it shows what happens when they assembled multiple calls to
> `WriteFile' into a cohort and processed it all at once. IMHO, the numbers
> are fairly impressive indeed. I think I could possible reap the same
> performance enhancement if I assemble cohorts of WSASend, WSARecv, AcceptEx,
> ConnectEx, WriteFile, ReadFile, TransmitPackets, ect. calls and process them
> all at once. I think its a worthwhile experiment. Humm... Did you read the
> paper? I think it has some merit...

I think it has some merit too. As an enhancement to the scheduler, it
seems to have significant promise. I think your proposed "naive"
implementation in user space is going to do more harm than good
though. I could be wrong.

It's possible a more sophisticated implementation that "preferred"
cohorts but didn't tie up an operation irrevocably to a thread that's
doing something else might work out well. I'd have to give it quite a
bit more thought though.

For example, you could have a thread empty the IOCP queue and move the
events into a per-thread job queue. Each thread works its own queue,
but if another thread has literally nothing to do, it does work
stealing. That way, if a thread does get stuck, at least no jobs will
get stuck if the server is otherwise idle.

This doesn't address the "too many threads wind up running" case. But
as long as you block on GQCS when your queue is empty, before building
another one, I think you're safe.
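
A rough sketch of what I mean, reusing your per_io type (the queue
layout, the CRITICAL_SECTION locking, and every name here is just my
illustration, not anything your framework actually has):


#include <windows.h>

#define MAX_WORKERS 8

struct job_queue {
    CRITICAL_SECTION lock; /* initialized once at pool startup */
    struct per_io* head;
};

static struct job_queue g_queue[MAX_WORKERS];

/* a worker pushes the completions it drained from the IOCP
   onto its own per-thread job queue */
static void job_push(unsigned self, struct per_io* io) {
    struct job_queue* q = &g_queue[self];
    EnterCriticalSection(&q->lock);
    io->next = q->head;
    q->head = io;
    LeaveCriticalSection(&q->lock);
}

/* pop from our own queue first; if it's empty, scan the siblings
   and steal, so a job never sits behind a stalled thread while
   the rest of the server is idle */
static struct per_io* job_pop_or_steal(unsigned self) {
    unsigned i;
    for (i = 0; i < MAX_WORKERS; ++i) {
        struct job_queue* q = &g_queue[(self + i) % MAX_WORKERS];
        struct per_io* io;
        EnterCriticalSection(&q->lock);
        io = q->head;
        if (io) { q->head = io->next; }
        LeaveCriticalSection(&q->lock);
        if (io) { return io; }
    }
    return NULL; /* nothing anywhere; go block on GQCS */
}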

DS

Chris M. Thomasson
Sep 26, 2008, 7:52:58 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:412a9da0-eef8-49f5...@w24g2000prd.googlegroups.com...

On Sep 26, 3:48 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > Humm... Well, that would require a wait, which could stall some
> > operations for a moment.

> Which defeats most of the entire point of IOCP -- avoiding ambush,
> avoiding having 'too many' threads running, avoiding having 'too few'
> threads running, and so on.

That's a valid point.


> Now if the scheduler could do this, and when you call GQCS, it prefers
> an event similar to the one this thread grabbed last, that would be
> great. All upside, very little downside.

Indeed!


> > However, I still think the possible benefits of cohort-scheduling
> > would outweigh some of the caveats. On page 3, figure 2, the paper
> > shows what happens when they assembled multiple calls to `WriteFile'
> > into a cohort and processed it all at once. IMHO, the numbers are
> > fairly impressive indeed. I think I could possibly reap the same
> > performance enhancement if I assemble cohorts of WSASend, WSARecv,
> > AcceptEx, ConnectEx, WriteFile, ReadFile, TransmitPackets, etc. calls
> > and process them all at once. I think it's a worthwhile experiment.
> > Humm... Did you read the paper? I think it has some merit...

> I think it has some merit too. As an enhancement to the scheduler, it
> seems to have significant promise. I think your proposed "naive"
> implementation in user space is going to do more harm than good
> though. I could be wrong.

Well, I guess I could make the behavior optional. AFAICT, it would not be
hard at all. The easiest way would be to simply set the batch size to 1. Or,
I could allow the user to set a flag in the global framework configuration
state that forces the server to use an IOCP worker thread routine that does
not perform batching... E.g.:


unsigned IOCP_Batch_Mode_Worker(void*) {
    [...];
}

unsigned IOCP_Normal_Mode_Worker(void*) {
    [...];
}

#define FLAG_BATCH_MODE 0x1

struct Framework_Config {
    static unsigned Flags;
};

unsigned Framework_Config::Flags = 0; /* set at startup */

void Thread_Pool_Create() {
    unsigned (*fp_worker) (void*) =
        (Framework_Config::Flags & FLAG_BATCH_MODE)
            ? IOCP_Batch_Mode_Worker
            : IOCP_Normal_Mode_Worker;
    for (int size = 0; size < cpu_count * 2; ++size) {
        _beginthreadex(..., ..., fp_worker, ..., ..., ...);
    }
}


Humm... I think simply setting the batch size to 1 would be so much easier!

;^)
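
For instance, here is roughly what the batching worker boils down to when
the batch size is 1 -- just the classic one-completion-at-a-time IOCP loop
(a sketch only; the `iocp' handle parameter is my assumption, and the cast
works because `ol' is the first member of per_io):


/* sketch: the degenerate non-batching worker */
static void normal_mode_worker(HANDLE iocp) {
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED* ol = NULL;
        if (GetQueuedCompletionStatus(iocp, &bytes, &key, &ol, INFINITE)) {
            struct per_io* io = (struct per_io*)ol;
            if (io->flags & IO_SEND) {
                io->flags &= ~IO_SEND;
                /* process the send and re-issue immediately; no cohorts */
            } else {
                io->flags &= ~IO_RECV;
                /* likewise for the receive */
            }
        }
        /* else: handle the error */
    }
}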


> It's possible a more sophisticated implementation that "preferred"
> cohorts but didn't tie up an operation irrevocably to a thread that's
> doing something else might work out well. I'd have to give it quite a bit
> more thought though.

For sure. I want to code this thing, but cannot do anything until I draft
out the internal model. I have some big decisions to make.


> For example, you could have a thread empty the IOCP queue and move the
> events into a per-thread job queue. Each thread works its own queue,
> but if another thread has literally nothing to do, it does work
> stealing. That way, if a thread does get stuck, at least no jobs will
> get stuck if the server is otherwise idle.

> This doesn't address the "too many threads wind up running" case. But
> as long as you block on GQCS when your queue is empty, before building
> another one, I think you're safe.

That's an interesting approach. Wow, I really need to think about this...
Humm...

Chris M. Thomasson
Oct 29, 2008, 4:21:48 PM
Humm. Well, it seems that I can just use a single function call to get a
batch of IO completions in one shot:


GetQueuedCompletionStatusEx


I think that is a _very_ nice addition to the IOCP API.
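
E.g., something along these lines (just a sketch; the `iocp' handle and the
per_io layout are carried over from my earlier pseudo-code):


/* drain up to WORK_BATCH_SIZE completions in one call (Vista+),
   then sort them into cohorts exactly as before */
OVERLAPPED_ENTRY entries[WORK_BATCH_SIZE];
ULONG count = 0;
ULONG i;

if (GetQueuedCompletionStatusEx(iocp, entries, WORK_BATCH_SIZE,
                                &count, INFINITE, FALSE)) {
    for (i = 0; i < count; ++i) {
        struct per_io* io = CONTAINING_RECORD(
            entries[i].lpOverlapped, struct per_io, ol);
        /* push onto process_sends / process_recvs as in the
           hand-rolled gather loop */
    }
}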

Chris M. Thomasson
Mar 25, 2009, 9:49:56 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:3aGCk.64037$PK.2...@newsfe04.iad...> [...]

The darn link is busted! Here is the paper from an alternative source:

http://ksuseer1.ist.psu.edu/viewdoc/summary;jsessionid=561FAA388F8ED54C7CEFB29BC7BE3F6A?doi=10.1.1.72.9809


;^|
