> On 27 Nov 2018, at 06:14, Keean Schupke <
ke...@fry-it.com> wrote:
>
> Hi John,
>
>
> The lesson from JavaScript / Node.js / Go is that we should prefer asynchronous co-operative context switching in user space with a fixed hardware thread-pool size, to using synchronous code with hardware context switching. You seem to agree with this.
Yes. However, in Felix the programmer has a choice about how to do things.
As mentioned, there are problems with user space context switching coupled
with shared memory access. If you use pure coroutines that communicate
only over channels, the problem doesn’t arise. As in Go, the main synchronisation
method is to use channels. However, shared memory access must also be supported.
For coroutines, there is no issue: no locks are required because of the total
ordering guarantee. But this assurance does NOT extend to general
processes/threads/goroutines or whatever you want to call concurrent logical
threads of control. You need locks, transactional memory backed by lock free
techniques, or whatever.
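
To make the total ordering point concrete, here is a minimal sketch
(in C++20 coroutines, since that’s easy to post: it is NOT Felix’s actual
scheduler). Two fibres bump a shared counter with no lock and no atomics,
and it is safe purely because a single scheduler thread resumes them one
at a time:

#include <coroutine>
#include <deque>
#include <iostream>

struct Fibre {
    struct promise_type {
        Fibre get_return_object() {
            return {std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
    std::coroutine_handle<promise_type> h;
};

std::deque<std::coroutine_handle<>> ready;  // the scheduler's queue

struct Yield {  // cooperative suspension point: requeue and switch
    bool await_ready() { return false; }
    void await_suspend(std::coroutine_handle<> h) { ready.push_back(h); }
    void await_resume() {}
};

int counter = 0;  // shared store: never contended, so no lock

Fibre worker(const char* name) {
    for (int i = 0; i < 3; ++i) {
        ++counter;  // safe: total ordering guarantees exclusive access
        std::cout << name << " sees " << counter << "\n";
        co_await Yield{};
    }
}

int main() {
    Fibre a = worker("A"), b = worker("B");
    ready.push_back(a.h);
    ready.push_back(b.h);
    while (!ready.empty()) {  // ONE thread drains the queue
        auto h = ready.front();
        ready.pop_front();
        h.resume();  // runs until the fibre yields or finishes
    }
    a.h.destroy();
    b.h.destroy();
}

Start a second thread draining that queue and the ++counter becomes a
data race: that is exactly the upgrade problem discussed below.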
With OS pthreads you can use locks, condition variables etc for shared memory
access synchronisation, but you CANNOT freely do this when running logical
threads from a pool. If you have to use a lock, the lock has to be adapted for use
in the pool. As I explained, a hard lock waiting for data will lock up a physical
pool thread forever if the supplier of the data is still in the queue: the queue
can’t be emptied because the thread that would empty it is blocked. Deadlock.
Instead, to do a lock, the procedure being run has to be dumped out of the pool
and put away somewhere until the data is available. If the writer is then pulled
out of the queue and run, it too will lock up trying to send the data, so it has to
be dumped out of the pool as well. Then the user space multiplexing system has to
arrange for the transfer to be done and the locks released OUTSIDE the running
thread pool, before putting the two procedures back into the queue, ready to go.
One might arrange this by actually queueing another job for the pool that does
the transfer. The point is that the two user threads, reader and writer,
are not in the queue whilst they’re logically locked.
This is just an example: the point is that at least locking must now be
implemented in user space too. It’s not just a matter of a queue of jobs
which the physical threads pull off and run until they’re
complete. There ALSO has to be machinery to handle locking
OUTSIDE the worker threads, because it cannot be done inside
a worker thread.
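
As a sketch of what “dumping a fibre out of the pool” might look like
(hypothetical types, not the actual Felix runtime): a blocked endpoint is
parked on the channel itself, off the ready queue, and the matching
endpoint’s arrival re-queues both, so no worker thread ever sleeps in a lock:

#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <utility>

struct Fibre {
    std::function<void(Fibre*)> step;  // continuation to run when resumed
    int inbox = 0;                     // slot for a value read from a channel
};

struct Channel {
    std::mutex m;                               // guards the parked lists only
    std::deque<std::pair<Fibre*, int>> writers; // parked writers with their data
    std::deque<Fibre*> readers;                 // parked readers
};

std::deque<Fibre*> ready;  // the pool's job queue (real workers would need
                           // their own lock or lock-free queue here)

// Never blocks the caller: complete the rendezvous now, or park the reader.
void channel_read(Channel& c, Fibre* reader) {
    std::lock_guard<std::mutex> g(c.m);
    if (!c.writers.empty()) {
        auto [writer, v] = c.writers.front();
        c.writers.pop_front();
        reader->inbox = v;
        ready.push_back(reader);  // transfer done OUTSIDE any blocking wait:
        ready.push_back(writer);  // both endpoints are runnable again
    } else {
        c.readers.push_back(reader);  // dumped out of the pool
    }
}

void channel_write(Channel& c, Fibre* writer, int v) {
    std::lock_guard<std::mutex> g(c.m);
    if (!c.readers.empty()) {
        Fibre* reader = c.readers.front();
        c.readers.pop_front();
        reader->inbox = v;
        ready.push_back(reader);
        ready.push_back(writer);
    } else {
        c.writers.push_back({writer, v});
    }
}

int main() {  // single-threaded demo of the queue discipline
    Channel chan;
    Fibre r{[](Fibre* self) { std::cout << "got " << self->inbox << "\n"; }};
    Fibre w{[](Fibre*) { std::cout << "writer resumed\n"; }};
    channel_read(chan, &r);       // no writer yet: reader parks
    channel_write(chan, &w, 42);  // rendezvous: both re-queued
    while (!ready.empty()) {
        Fibre* f = ready.front();
        ready.pop_front();
        f->step(f);  // run the continuation
    }
}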
The key point here is that ordinary OS pre-emptive threads already
handle this. Locks encourage the thread to be descheduled. If the OS
is smart and the locks are OS locks, releasing a lock may encourage
the OS to reschedule the job.
The OS can do this kind of thing MORE efficiently than user space
handling because it is designed to do so.
The only reason user space threads are faster is that they’re applied
in contexts where the full set of possible options can’t happen,
so the scheduler can be more efficient. If they can happen, the OS
scheduler is going to do a better job.
> I understand you are explaining how Felix currently works. I am considering how it should work ideally to allow it to compete efficiently with languages like Go. The way I would do this would be to utilise the channel read writes as async points, effectively yielding the CPU, allowing the co-routines to share the processor efficiently.
It won’t work because channel I/O is not the only way coroutines communicate.
They can share memory as well. Currently this does not require locking.
Also, the current scheduler doesn’t require yielding the CPU or any kind of
synchronisation, so it is MUCH faster than what you propose; that is,
the thruput is much higher.
The problem is the current scheduler doesn’t allow concurrent execution.
That is, although the thruput is maximal, the real time performance is not,
because (obviously) only one CPU can be utilised.
At present, IF you can partition your job into separate threads,
meaning the coroutines in different threads do not communicate,
then you can get maximal thruput AND maximal real time performance.
Felix can and does provide a way to do that:
pfor i in 1 .. 1000 do …
will in fact run N concurrent loops of size 1000/N, each in a separate
pthread. It is up to the user to ensure the code in the body of the loop
is such that there is no interaction between iterations. The compiler
doesn’t check.
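
Roughly, and just as an illustration (in C++ rather than the code Felix
actually generates), the pfor above amounts to something like this:

#include <algorithm>
#include <thread>
#include <vector>

void body(int i) { /* the loop body: must not interact across iterations */ }

int main() {
    const int lo = 1, hi = 1000;
    const int N = std::max(1u, std::thread::hardware_concurrency());
    const int chunk = (hi - lo + 1 + N - 1) / N;  // ceiling division
    std::vector<std::thread> pool;
    for (int t = 0; t < N; ++t) {
        const int a = lo + t * chunk;
        const int b = std::min(hi, a + chunk - 1);
        if (a > hi) break;
        pool.emplace_back([a, b] {
            for (int i = a; i <= b; ++i) body(i);  // one chunk per pthread
        });
    }
    for (auto& th : pool) th.join();  // implicit barrier at the end of pfor
}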
Quite clearly this is a SPECIAL CASE.
So the problem is how to generalise the existing system to get better
real time performance, without sacrificing thruput.
The idea is to allow more than one thread to process a coroutine queue.
This will only work if the coroutines do not interact using shared memory.
As long as they use channels, it’s fine.
The code to implement this is almost complete. When you start a pthread
up it runs the coroutines without any locking. If you start additional
threads to service the queue, locking is switched on.
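
Schematically, the queue might look like this (an illustrative sketch,
not the actual Felix code). The flag must be set before the second worker
starts, so the transition itself is race-free:

#include <atomic>
#include <deque>
#include <mutex>

template <class Job>
struct SchedQueue {
    std::deque<Job> q;
    std::mutex m;
    std::atomic<bool> shared{false};  // true once a 2nd worker services q

    // Call from the sole servicing thread BEFORE spawning another worker.
    void enable_sharing() { shared.store(true); }

    void push(Job j) {
        if (!shared.load(std::memory_order_relaxed)) {  // fast path: no lock
            q.push_back(j);
            return;
        }
        std::lock_guard<std::mutex> g(m);
        q.push_back(j);
    }

    bool pop(Job& out) {
        if (!shared.load(std::memory_order_relaxed)) {  // fast path: no lock
            if (q.empty()) return false;
            out = q.front(); q.pop_front();
            return true;
        }
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        out = q.front(); q.pop_front();
        return true;
    }
};

int main() {
    SchedQueue<int> sq;   // one worker: no push/pop ever takes the mutex
    sq.push(1);
    int job;
    while (sq.pop(job)) { /* run fibre 'job' */ }
    sq.enable_sharing();  // from here on, every access takes the mutex
}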
This is actually the easy part. The hard bit is making idle worker threads
hang around until the termination condition is met.
Previously the termination condition was simply that the current coroutine
had suspended, the scheduler had to find another one to run,
and the queue was empty.
What happens then is that the scheduler looks to see if there are any
async tasks (socket I/O or timer alarms) pending. If so, it blocks
on the queue of completed async requests. (If the requests fail,
all hell breaks loose!) When a request is complete, the relevant fibre
is pulled off the completed queue and put onto the active queue,
and the scheduler can start running the fibre.
The async queue is managed with a condition variable.
The sync queue doesn’t require any locking because there is only one thread
running the scheduler AT THE MOMENT.
To allow multiple threads to run the scheduler AND fetch completed async
requests back to the queue required the code to be modified.
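
In outline, the blocking step could look like this (hypothetical types;
the actual Felix demux code is more involved). The worker sleeps on a
condition variable until either a completion arrives or nothing is
pending, which doubles as the termination test:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

struct Fibre {};  // stand-in for a suspended fibre

struct AsyncQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<Fibre*> completed;  // filled by the socket I/O / timer threads
    int pending = 0;               // outstanding async requests

    void complete(Fibre* f) {  // called by the async subsystem
        {
            std::lock_guard<std::mutex> g(m);
            completed.push_back(f);
            --pending;
        }
        cv.notify_one();
    }

    // Blocks until a completion arrives; returns nullptr when nothing is
    // pending at all, i.e. the termination condition is met.
    Fibre* wait_completed() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !completed.empty() || pending == 0; });
        if (completed.empty()) return nullptr;
        Fibre* f = completed.front();
        completed.pop_front();
        return f;  // caller puts f back on the active queue and runs it
    }
};

int main() {
    AsyncQueue aq;
    Fibre f;
    aq.pending = 1;                            // one async request in flight
    std::thread io([&] { aq.complete(&f); });  // pretend the I/O finished
    Fibre* done = aq.wait_completed();         // scheduler blocks here
    io.join();
    return done == &f ? 0 : 1;
}

With several workers the same condition variable works, but a worker that
wakes to find nothing pending must also know no other worker can still
create new work before it exits: that is the hard bit mentioned above.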
> Regarding synchronization and locks, we should not really need to worry about this. A co-routine effectively runs as a sequence of asynchronous callbacks, and _must_ be non-reentrant. So if a co-routine only updates local state, we don’t need any locks.
Yes, we have to lock the channel I/O, but the user doesn’t have to worry about that.
The thing is, coroutines DO update non-local state. THAT IS THE WHOLE POINT!
The whole point of coroutines is that they’re synchronous. It’s sequential programming.
Shared memory access is never contended, so no locks are required.
Do you see the problem? The idea is to UPGRADE the system to run SOME coroutines
concurrently: the ones that, as you correctly identify, only update local store.
Actually the condition is stronger: no event of the
coroutine may be visible externally. So for example printing to the console is externally
visible even though there is no non-local store access.
Actually it’s even worse: READING external store which could change can’t be
allowed either (in coroutines to be run concurrently).
If we can update Felix so AS AN OPTIMISATION it can safely run some coroutines
concurrently AND MAKE IT TRANSPARENT or nearly so, then we will crap
all over Go because we will have shared memory access WITHOUT LOCKING
AS WELL AS CONCURRENCY.
So the key problem is theoretical: given a set of coroutines how can you tell
which ones can be run concurrently? And if you can tell, how do you actually
do it?
The current model is simple: if ALL the coroutines can be run concurrently,
then use a thread pool, otherwise don’t.
However that model can be improved radically. I think the idea is to
examine the interaction graphs, and try to find an “isolated subgraph”
which can run concurrently.
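
For instance (just a sketch of the analysis, not anything Felix has):
treat coroutines as vertices and channels as edges, and compute connected
components with union-find; components share no channels, so each one
could get its own scheduler thread:

#include <iostream>
#include <numeric>
#include <vector>

struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {  // with path compression
        return parent[x] == x ? x : parent[x] = find(parent[x]);
    }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

int main() {
    // coroutines 0..4; channels connect (0,1) and (1,2); 3 and 4 are isolated
    UnionFind uf(5);
    uf.unite(0, 1);
    uf.unite(1, 2);
    for (int i = 0; i < 5; ++i)
        std::cout << "coroutine " << i << " -> component " << uf.find(i) << "\n";
    // {0,1,2} must share one scheduler; {3} and {4} can run concurrently
}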
But another idea is to provide concurrency WITH RESTRICTIONS.
One restriction would be that if you have two routines that make
matching I/O requests, they’re run PAIRWISE SERIALLY.
That is, one has to run until it suspends before the other can run,
so they are not allowed to run concurrently.
Exactly how to implement this I’m not at all sure at the moment.
Pairwise serialisation looks very promising because it provides a
LOCAL TOTAL ORDERING GUARANTEE if you get my drift.
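
One way I can imagine implementing it (a hypothetical scheme, untested):
give every channel a run lock, and make a worker hold the locks of all
channels a fibre is attached to while running it. The two endpoints of a
channel then never overlap, but fibres with disjoint channel sets still
run in parallel, which is exactly a local, not global, ordering:

#include <algorithm>
#include <functional>
#include <mutex>
#include <vector>

struct Channel {
    std::mutex run_lock;  // held by whichever endpoint is currently running
};

struct Fibre {
    std::vector<Channel*> chans;  // channels this fibre is attached to
    void resume() { /* run until the next suspension point */ }
};

void run_serialised(Fibre& f) {
    // acquire in a fixed (address) order so two workers can't deadlock
    std::vector<Channel*> cs = f.chans;
    std::sort(cs.begin(), cs.end(), std::less<Channel*>());
    for (Channel* c : cs) c->run_lock.lock();
    f.resume();  // no partner on any channel in cs can run concurrently
    for (auto it = cs.rbegin(); it != cs.rend(); ++it) (*it)->run_lock.unlock();
}

int main() {
    Channel c;
    Fibre a{{&c}}, b{{&c}};  // endpoints of the same channel
    run_serialised(a);       // a and b can never overlap on c, but a fibre
    run_serialised(b);       // attached to other channels could run meanwhile
}

Note the exclusion is not transitive: a fibre on channel c1 and a fibre
on channel c2 can overlap even if some third fibre uses both.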
The data analogue of this exists: under C, computers effectively run
segmented memory. Objects, including arrays, use contiguous store, so the store
inside an object is totally ordered by address; however, separate
(discrete) objects are unordered.
I don’t know what you call this ordering officially, but I call it
a local total order. Ordering is total locally, but not globally.
I suspect this is how to run coroutines with concurrency,
but I need to implement stuff and test it to actually see how it
works and how it performs. At present, I don’t know how to
implement local total ordering, although I can think of ways to do it
that might work.
There is clearly some notion of “affinity” here. If two coroutines
communicate on a channel, they have the same CPU affinity,
which ensures they’re executed in a totally ordered manner
as required.
However, how to organise what you call “work stealing” whilst
preserving affinity is not clear.
The “affinity” we require is NOT TRANSITIVE.
The usual CPU affinity of pthreads IS TRANSITIVE.
If T1 has affinity A and requires T2 to have the same affinity,
and if T2 requires T3 to have the same affinity, then T1 and T3
have the same affinity.
This is no good. It is a well known problem with the idea of affinity
in OS design. Too much affinity and everything runs on the same CPU
and you lose the concurrency.
The problem is obvious in the thread-pool model of running things.
Another way to look at it is that if A < B where < means total order,
and B < C, then by definition, A < C. Total ordering is transitive.
We cannot allow that. We have to weaken the ordering requirements.
But for lock free channel I/O we have to have pairwise ordering,
otherwise we have to use a lock.
My current implementation says: if there is exactly one thread
servicing the list, don’t use a lock. Otherwise use a lock.
And note this only applies to channel I/O, not shared memory access.
So here is an idea: we use SUBSTRUCTURAL LOGIC,
in particular SEPARATION LOGIC. We already know how to use
separation logic to analyse memory access: Facebook actually
has software that can do it, for C and Javascript.
So the idea is roughly: if we can *separate* two coroutines, then
they can run concurrently, otherwise they’re coupled and have to be
run serially.
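
That last sentence is essentially the parallel composition rule of
concurrent separation logic (O’Hearn), which for two commands with
disjoint footprints reads (glossing over the side conditions on variables):

    {P1} c1 {Q1}        {P2} c2 {Q2}
    ---------------------------------
    {P1 * P2} c1 || c2 {Q1 * Q2}

Read * as “and separately”: if the footprints of two coroutines can be
split like this, running them concurrently is sound; if not, they’re
coupled and have to be serialised.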
—
John Skaller
ska...@internode.on.net