Concurrent scheduling

John Skaller2

Feb 16, 2020, 7:20:43 PM
to Gordon Childs, felix google
NOTE: posted to general Felix list.

Just so you can see, here is some of the problematic new code:

sync_sched::fstate_t sync_sched::frun()
{
dispatch:
  if (ft == 0) {
    spinguard dummy(active->lockneeded, &(active->active_lock));
    ft = active->pop_front();
  }
  if (ft == 0) {
    return blocked;
  }
  request = ft->run();  // run fthread to get request
  if (request == 0)     // euthanasia request
  {
    spinguard dummy(active->lockneeded, &(active->active_lock));
    ft = 0;
    goto dispatch;
  }

You can see there is a conditionally executed custom spinlock guarding access
to the scheduler’s active list. The lock is needed when multiple pthreads are
accessing the same list, but it is skipped when only one thread
is operating on the list.

I’m not sure why I did that, maybe for performance, and maybe just so the operation
is the same when not using the concurrent coroutine feature.
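
For reference, here is a minimal sketch of how such a conditional guard could look.
This is my own guess, not the actual RTL class: only the spinguard(bool, atomic_flag*)
shape is taken from the snippet above.

#include <atomic>

// Hypothetical sketch: a guard that only spins if locking is needed.
// If needed is false (single pthread case) it does nothing at all.
struct spinguard {
  bool const held;
  ::std::atomic_flag *const lock;
  spinguard(bool needed, ::std::atomic_flag *l) : held(needed), lock(l) {
    if (held)
      while (lock->test_and_set(::std::memory_order_acquire))
        ; // spin until we own the flag
  }
  ~spinguard() {
    if (held) lock->clear(::std::memory_order_release);
  }
};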

You can start concurrency with svc_spawn_process. If you do that in Felix code
via the spawn_process procedure, the svc is intercepted by sync_sched, which doesn’t
know what to do with it, so it returns a “delegated” state. The caller, usually
async_sched, then handles it and calls do_spawn_process().

================

If you look in async_sched, there’s lots of locking and crap happening.
The problem is not just coding, but design issues.

What do you do if there are async events pending?

With one pthread, you just hang on a semaphore, trying to dequeue
the fthread from the wait list. Actually it should spin checking for user
cancellation or something.

With multiple threads, all of them should be trying for the async event queue.
But wait! What happens when we’re out of synchronous work to process, but some
threads are still running? We don’t want to actually cancel the idle threads because
new work can be spawned by a running fibre, and any fibre can make
async requests. So until ALL the fibres have stopped AND there is no
event pending, we should keep all the pthreads active.

Getting that logic right is tricky! It’s even harder, when it is time to return
control to the user, to terminate all the concurrent threads except the
one that actually called the run function: that one has to return.
It has to be that one!
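
As a rough illustration only (not the actual Felix code, and the names are made up),
the shutdown test has to be a conjunction over everything that can still produce work:

#include <atomic>
#include <cstddef>

// Hypothetical sketch: a worker pthread may only stop when no fibre is
// runnable, no fibre is parked waiting on async I/O, and no other worker
// is running a fibre that could spawn more work.
struct pool_state_t {
  ::std::atomic<::std::size_t> active_fibres;  // fibres on scheduler lists
  ::std::atomic<::std::size_t> pending_async;  // fibres waiting on the demux
  ::std::atomic<::std::size_t> busy_workers;   // pthreads currently running a fibre
};

inline bool may_terminate(pool_state_t &s) {
  return s.active_fibres.load() == 0
      && s.pending_async.load() == 0
      && s.busy_workers.load() == 0;
}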

=============

Note that you can have a STACK of schedulers!!!!!

Any Felix procedure can call “run” which is a subroutine NOT a coroutine.
That’s the secret that took me 20 years to figure out!!!!

Now here’s the FUN part. When you call run from inside a *concurrent* process
running on a scheduler .. what happens to the OTHER threads????

You just spawned a new scheduler which itself can run concurrent threads,
but now, there are still threads running in the parent.

Now it’s not clear what happens with async I/O. There’s only one async I/O
queue .. one demux. I think. I’m confused!

Also, *synchronous* comms can now occur because threads in the parent
are still running. Now, it seems, a fibre on the child can MIGRATE to the
parent! After all, if there is a shared channel and the child does a write,
then a parent thread can do a read on that same channel.

In this case my spinlock, which is PER SCHEDULER, is going to fail
to protect the accesses because there are two schedulers involved.
Note that any running fibre, in any thread, on any scheduler, can do
schannel I/O on the same channel.

It’s worse! What happens if you start another pthread directly
with spawn_pthread? You’re not supposed to use schannel
I/O between pthreads because originally it wasn’t synchronised.

But wait! What if we use a global spinlock??
That is, put it on the GC object.

Now any pthread can safely do I/O and in fact we can get rid
of pchannels altogether!

The PROBLEM is that spinlocks should never protect complex
operations. You should use a standard mutex lock for that.
By “complex” operation I mean anything that can do a memory
allocation or wait for async I/O.

In particular, anything that can cause a context switch in the OS
***INCLUDING PREEMPTIONS*** should not use a spinlock.
In principle spinlocks should mask interrupts to prevent OS pre-emptions.
Otherwise the lock holder might be pre-empted, and the other threads
waiting on the lock will be logically stalled but physically eating
up CPU spinning.
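
To illustrate the distinction (my own example, not Felix code): a spinlock is fine
for a couple of memory accesses, and wrong for anything that can allocate, block,
or enter the OS.

#include <atomic>

struct node { node *next; };

::std::atomic_flag list_lock = ATOMIC_FLAG_INIT;
node *list_head = nullptr;

node *pop_one() {
  while (list_lock.test_and_set(::std::memory_order_acquire))
    ;                                  // spin: the holder should be quick
  node *n = list_head;
  if (n) list_head = n->next;          // OK: a few memory accesses
  // new node;                         // NOT OK: malloc may enter the OS
  // read(fd, buf, len);               // NOT OK: may block or be pre-empted
  list_lock.clear(::std::memory_order_release);
  return n;
}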

====

So there is still some heavy design work to do. The core problem
is that we really want to mask interrupts when a spinlock is held,
but the OS rarely provides a way to do that. Windows used to have
BEGIN/END_CRITICAL_SECTION which is what we want,
but what normally happens is you use an OS mutex, and that tells
the OS to try to avoid pre-emptions. But it doesn’t force interrupts
off, because the OS should pre-empt the lock holder if the lock
holder is taking too long, or ends up calling malloc() or doing something
else. In that case, all the threads waiting on the lock should be
pre-empted as well, since none of them can run. This means the OS
will run threads that allow the lock holder to continue instead
of hanging.

So we have a catch-22 situation. We should be using OS locks,
but they’re too slow. But spinlocks can never work correctly,
because the OS can pre-empt the lock holder, no matter what.

Spinlocks work based on statistics: if the actions of the lock
holder are fast .. such as just doing a couple of memory accesses,
the chances of being pre-empted are low. EXCEPT if the memory
is paged out to disk. Heck, if it isn’t in the L1 cache, we have problems.
And if it is, but another process is using the same memory,
which is ALWAYS the case because that’s why we’re protecting it,
then the hardware has to do cache-coherence messaging
to synchronise the caches.

Apple apparently, and Linux for sure I think, make mutex locks
spin with exponentially increasing delays, until they give up
and call the OS lock. This is actually the correct method:
just use an “accelerated” system mutex.
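
The idea looks roughly like this (a sketch only, not the Apple or Linux implementation):

#include <mutex>

// Spin a few times with exponentially increasing busy-wait delays, then
// give up and take the blocking OS lock so the waiter gets descheduled
// instead of spinning forever.
template<class Mutex>
void adaptive_lock(Mutex &m, int max_attempts = 10) {
  unsigned delay = 1;
  for (int i = 0; i < max_attempts; ++i) {
    if (m.try_lock()) return;            // acquired cheaply, no OS park
    for (volatile unsigned j = 0; j < delay; ++j)
      ;                                  // crude busy-wait backoff
    delay *= 2;                          // exponential increase
  }
  m.lock();                              // give up: let the OS deschedule us
}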

But tests on my Mac show this doesn’t work, in that it
kills performance.

By far the best solution is to use atomic operations, but they only
work for a single read or write or a special instruction.



John Skaller
ska...@internode.on.net





John Skaller2

Feb 19, 2020, 7:39:06 AM
to Gordon Childs, felix google


> On 19 Feb 2020, at 19:08, Gordon Childs <gordon...@gmail.com> wrote:
>
> On Mon, 17 Feb 2020 at 01:20, John Skaller2 <ska...@internode.on.net> wrote:
> now any pthreads can safely do I/O and in fact we can get rid
> of pchannels altogether!
>
> What are pchannels again?

Channels for pthreads to communicate on. They use locked monitor objects.
They’re heavyweight. They guarantee that neither pthread can progress until
the data is exchanged, so there is a transfer acknowledge sequence.
Since these are pthreads, the locking uses an OS mutex.


>
> I'm trying to understand this bit - so anything can suffer a pre-emption (I haven't been in a position to mask interrupts since the 90s), but something that can cause one should not use a spin lock -

It’s a contradiction. That’s the problem. One uses atomic instructions
if possible since these cannot be pre-empted.

The problem is that whilst system mutexes do the right thing, the cost is very high.
If you have two CPUs each running a Felix coroutine that need to communicate,
one would like to use an extremely lightweight method to protect the operation.
Spinlocks are extremely lightweight. The data transfer is extremely fast, so the spinlock
would normally only be held very briefly, and the waiting CPU would only spin for a few
cycles. Unless the lock holder is pre-empted.

Calling malloc inside a spinlock protected critical section is also a no-no because it
might call the OS.

> so the example would be running a felix thread that fails to acquire a mutex? Because then multithreaded felix would eat CPU without making any progress?

This is not a problem for OS locks because the OS just deschedules the pthread
waiting to acquire the lock. But with a spinlock, the OS doesn’t know a CPU is spinning.


> I'm a fan of single-threaded felix but it's crazy to leave all those cores unused.

Yes.

> I've been thinking about apple's GCD approach where you have a queue that acts like it has a certain thread affinity but is actually backed by a thread pool (1 per core?) yet work still happens serially while tasks bounce around threads. Maybe it's not relevant here because using just one won't get all the cores busy, but once you use a few then chances are good I suppose that you're getting good usage.

The problem with thread pools is that they don’t scale, must never be nested,
and impose heavy restrictions on what you can actually do without
risking a deadlock.

Basically you have to have some kind of messaging buffer; you can’t use locks.
A thread might be running but blocked waiting for a lock to be released,
when the holder of the lock is actually not running. So the pool deadlocks.
The problem is, when you’re blocked you want the OS to deschedule your
thread, but if it does, the thread pool doesn’t know. So you cannot use
locks: you have to send a message and deschedule from the pool.


> Can't you wrap a mutex in a spinlock?

No, because someone waiting on the spinlock will wait for a long time
if the thread holding the spinlock fails to acquire the mutex. In that
case the thread that failed to acquire the mutex is descheduled by the OS,
but the one waiting on the spinlock is not, so it spins a lot until it is finally
pre-empted.


>
> I'm confused about the felix scheduler in general. Shouldn't it be single threaded? How would you like felix to work with multiple cores?

The original scheduler was single threaded. Each pthread ran its own scheduler.

A coroutine running inside a scheduler can create a nested scheduler, but that
freezes the current scheduler when starting the new one, thus forming
a stack of schedulers.

But the problem is like this: consider a pipeline:

A |-> B |-> C |-> D |-> E

and assume all the coroutines B,C,D are pure (don’t have side effects),
assume A is also a pure source (eg serialising a list read as a parameter),
and E just does output.

In this case, all the coroutines can be run concurrently because their
only interactions with the outside world are on channels (except E
which just prints to the console).

So there’s no good reason not to have a pool of pthreads grabbing
active coroutines out of the scheduler queue.

All we need is the schannel I/O and changes to the scheduler list
to be critical sections.

One of the problems with the old design is that an schannel is a linked
list of waiting coroutines. And the list nodes have to be allocated whenever
you push a continuation onto the list.

So I changed the design. I put an extra slot in the fthread objects
so they can be linked into the schannel list without an allocation.

Similarly, the same slot can be used for the scheduler list.
So now, coroutine read/write operations don’t require any allocations.

struct RTL_EXTERN fthread_t // fthread abstraction
{
  con_t *cc;                   ///< current continuation
  fthread_t *next;             ///< link to next fthread, to be used in scheduler queue and schannels
  fthread_t();                 ///< dead thread, suitable for assignment
  fthread_t(con_t*);           ///< make thread from a continuation
  svc_req_t *run();            ///< run until dead or driver service request
  void kill();                 ///< kill by detaching the continuation
  svc_req_t *get_svc() const;  ///< get current service request of waiting thread
private: // uncopyable
  fthread_t(fthread_t const&) = delete;
  void operator=(fthread_t const&) = delete;
};

Notice the next pointer?

So now, operations can use spinlocks, because they no longer do malloc()s.
Also the operations are thousands of times faster, even in single threaded mode,
for the same reason: no allocations are required to do scheduling or schannel I/O.
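
To make that concrete, here is a hypothetical sketch (not the actual RTL code)
of an intrusive queue built on that next slot: push and pop are pure pointer
swaps, so they are safe and cheap to perform inside a spinlock-protected
critical section.

struct fthread_queue {
  fthread_t *head = nullptr;
  fthread_t *tail = nullptr;

  void push_back(fthread_t *f) {   // O(1), no allocation
    f->next = nullptr;
    if (tail) tail->next = f; else head = f;
    tail = f;
  }

  fthread_t *pop_front() {         // O(1), no deallocation
    fthread_t *f = head;
    if (f) {
      head = f->next;
      if (head == nullptr) tail = nullptr;
      f->next = nullptr;
    }
    return f;
  }
};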

I have a quad core. It runs 8 hyperthreads. I have got about 7x performance
with the new design, fully saturating all the cores. Of course it’s only of benefit
if the coroutines do heavy work. Otherwise cache coherence issues between
the cores actually lead to lower performance than a single thread.

At the moment the spinlock is in the scheduler queue:

struct RTL_EXTERN fthread_list {
  ….
  ::std::atomic_flag qisblocked;
  ...
  bool lockneeded;
  ::std::atomic_flag active_lock;


The “lockneeded” is set to false if there’s only one pthread accessing
the scheduler queue, in which case locking is skipped. The extra test
has a tiny impact on single threaded performance. The flag is set
to true when you say “spawn_process”. That creates another thread
to pull work off the same scheduler list as the caller. So now the coroutines
on the scheduler list run concurrently, maybe. Each spawn_process
adds a new pthread. The API is a hack, but I’m more concerned with
how to decide when it’s safe to run coroutines concurrently.

The problem is that shared memory access used to be safe but now
isn’t unless protected by critical sections. Which means I have to
provide some way for the user to create them.






John Skaller
ska...@internode.on.net




