Re: [felix-lang/felix] doc updates (#129)


John Skaller2

Nov 18, 2018, 9:50:05 AM
to felix google
I have no idea how to review a PR. I think for simple changes, just do them and commit
directly to the master. That way, I see each commit. There’s a mailing list that gets
all the commits.

felix-...@googlegroups.com

as well as build notices. If you commit stuff to a clone, I don’t see it.
I just get to review a whole bunch of changes at once .. if I can figure out
how to do that.

Note for me, with limited bandwidth email is always better than web activity.

The main rule is try not to break the build. If you’re doing an insignificant change,
discuss first on the mailing list, then just do it. I can always undo it :-)

Use a branch if the change is experimental, significant, contentious, etc.

Be careful with the compiler and library as follows: sometimes things look like
they’re wrong and need to be fixed, or some extension or change looks
simple, but there can be subtle reasons why things are the way they are.
I often get caught by this myself.


John Skaller
ska...@internode.on.net





John Skaller2

Nov 18, 2018, 10:17:04 AM
to felix google
There are a number of unsatisfactory things that are worth thinking about.

Abstract arrays are one nasty issue. I'm talking about varray and darray,
and also carray.

To explain the problem: in Felix, I have solved a major problem in C++,
by eliminating lvalues and references. The way the model works is very nice:

All values are immutable. However if you store a value in a variable,
then it's an object, and mutable. The way this works is simple: to modify
an object you have to use a pointer (in principle). So for example the
ONLY way to modify an object is with the procedure

proc storeat[T] ( p: &T, T);

For example

var x = 1;
storeat (&x, 2);

Syntactic sugar is provided by the parser:

&x <- 2;

Also, assignment is “in principle” syntactic sugar:

x = 2; // means &x <- 2;

You can also get a pointer if you create an object on the heap:

var p: &int = new 42;

[In fact there are read-only, write-only and read/write pointers, just to make things more interesting]

The key problem is products. In C++ you write

x.field = newvalue;

which relies on x being an lvalue, and then x.field is an lvalue too. This is a broken idea.
In Felix, given

struct X { field:int; }
var a = X 1;
&a.field <- 42;

is how you do it. The reason this works is that “field” is NOT a field. It's a function!
In fact, it is a projection function. It’s a first class function:

var projection = field of X;
println$ projection a;

Note that

a.field = field a;

because operator . is reverse application NOT field access.

Now here is the killer: projections are OVERLOADED for pointers as well as values.
Hence

&a.field <- 42;

works. Here field is the pointer projection of type

&X -> &int

instead of the value projection

X -> int

The machinery is beautiful. Everything *except* the storeat procedure is purely functional!
No lvalues, no references, pointer formation is a structural typing combinator.
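
To make the two overloads concrete, here is roughly what they look like translated
into C++ (a minimal sketch for illustration only, not the code Felix actually generates):

    #include <cstdio>

    struct X { int field; };

    // value projection: X -> int
    int field_prj(X a) { return a.field; }

    // pointer projection: &X -> &int, an overload on the argument type
    int *field_prj(X *pa) { return &pa->field; }

    int main() {
        X a{1};
        *field_prj(&a) = 42;                 // the Felix  &a.field <- 42;
        std::printf("%d\n", field_prj(a));   // the Felix  println$ a.field;  prints 42
    }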

Now here’s the problem. A varray is a library datatype which has a bounded extent
set at construction time, and can be dynamically extended up to that extent, or shrunk
to 0 length. It is an intrinsic of great importance because in Felix ALL objects are
varrays underneath: the GC treats all objects as varrays.

Now, the library varray is just a pointer to the actual array. So when you pass it,
it is “pass by reference”.

So the problem is: because it's a pointer already, it doesn't follow the pattern
of fixed length arrays. You can write

var a = varray (1,2,3,4);
println$ a. 0; // prints 1

using 0 as a projection index. But the library actually provides:

fun get[T] (a:varray[T], int n);
proc set[T] (a:varray[T], int n, T v);

No pointers. Because varray is already a pointer (abstracted).
There is also

fun stl_begin[T] (a:varray): +T;

which returns an incrementable C pointer that allows
increment, dereference, and storeat. But this is just ugly.

So in summary: I can’t figure out how to make the syntax for varray
work “like an array”. :-)

Get and set methods suck. So does having to call stl_begin.

The problem is “in theory”

a . 0

should be a POINTER to the first element, not the value of the first element.
But that would confuse the user. Wouldn't it? Well it's me that's confused.
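
For comparison, a rough C++ analogue of the choice being weighed, with hypothetical
names rather than the actual varray API: indexing yields a pointer to the element,
and get/set fall out of that projection.

    #include <cstdio>
    #include <vector>

    template<class T>
    struct varray_like {
        std::vector<T> data;                  // stands in for the bounded, GC-managed store
        T *proj(int n) { return &data[n]; }   // a . n  as a POINTER projection
        T get(int n) { return data[n]; }      // value access, derived from the projection
        void set(int n, T v) { data[n] = v; } // store, derived from the projection
    };

    int main() {
        varray_like<int> a{{1,2,3,4}};
        std::printf("%d\n", a.get(0));   // prints 1
        *a.proj(0) = 42;                 // the "pointer projection" style: &(a.0) <- 42
        std::printf("%d\n", a.get(0));   // prints 42
    }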




John Skaller
ska...@internode.on.net





John Skaller2

Nov 18, 2018, 7:18:53 PM
to felix google
So here are my thoughts so far.

First, instead of doing a lot of analysis to decide if it's safe to run coroutines
concurrently, I propose to just let the programmer shoot themselves in the foot.

In other words, just provide a “spawn_process” operation, and let the programmer
decide if it's safe or not.

Now, the implementation. This is HARD because generally I try to extend existing
code without breaking it. So a correct design here might be a complete rewrite!
That’s scary!

One idea is this: have a global lock which can be used to serialise everything.
We know in advance this is overkill. But it should allow easy implementation
and model testing, and allows more localised locking later.

What we can do is modify channel read/write, so that it's locked.
Here’s what a channel I/O op does:

1. When the program needs to do channel I/O it executes a special
instruction called a Supervisor Request. The actual request is stored
in the requesting procedure object in a fixed place. The SVC instruction
then simply returns control.

2. The scheduler picks up that the current continuation has suspended
with a request. If the request is channel I/O then:

a) if the channel has a matching continuation suspended, it is added
to the scheduler active list and the current continuation is resumed.

b) if the channel has no matching continuation, the current continuation
is added to the channel wait list, and the scheduler picks a new fibre
to run from the active list, and resumes it.

The sequence above is a critical section. So all we need to do is wrap
it in a lock/unlock and the system is serialised and thread safe.
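
As a sketch only (hypothetical types and names, not the actual Felix runtime code,
and ignoring the read/write direction of the match), the locked version of that
critical section might look like this:

    #include <deque>
    #include <mutex>

    struct fibre;   // a suspended continuation (hypothetical)

    struct channel { std::deque<fibre*> waiting; };

    std::mutex sched_lock;           // the single global lock discussed above
    std::deque<fibre*> active;       // the scheduler's active list

    // One channel-I/O service step. Returns the fibre to resume next;
    // nullptr means the caller should pick a new fibre from `active`.
    fibre *do_channel_io(channel &ch, fibre *current) {
        std::lock_guard<std::mutex> guard(sched_lock);   // the whole step is one critical section
        if (!ch.waiting.empty()) {
            active.push_back(ch.waiting.front());        // (a) the matching continuation becomes active
            ch.waiting.pop_front();
            return current;                              //     and the requester resumes
        }
        ch.waiting.push_back(current);                   // (b) park the requester on the channel
        return nullptr;                                  //     caller schedules something else
    }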

We also have to take care with ALL operations that play with the
scheduler queue (other than channel I/O).

It should be easy. This does NOT give us concurrency.
It also does NOT make it safe to migrate coroutines from one
pthread to another.

To get concurrency we need a set of threads that SHARE
a single scheduler queue. That’s about it! We just need to
add a spawn_process command which is like spawn_fthread
except it puts the spawned continuation on the process queue
instead of its own queue.



John Skaller
ska...@internode.on.net





John Skaller2

Nov 18, 2018, 9:36:13 PM
to felix google
Well .. I just added a mutex to the gc profile object, and lock it during synch ops,
including read/write.

But now, one test case fails with a segfault.
It doesn’t make any sense. Adding a lock cannot cause a fault. It might cause
a deadlock .. but a segfault? That’s impossible.

Unfortunately lldb is a heap of crap and doesn’t actually show the code that went
wrong. Apple’s debug handling is really bad.



John Skaller
ska...@internode.on.net





John Skaller2

Nov 19, 2018, 6:34:57 AM
to felix google


> On 19 Nov 2018, at 13:36, John Skaller2 <ska...@internode.on.net> wrote:
>
> Well .. I just added a mutex to the gc profile object, and lock it during synch ops,
> include read/write.
>
> But now, one test case fails with a segfault.
> It doesn’t make any sense. Adding a lock cannot cause a fault. It might cause
> a deadlock .. but a segfault? That’s impossible.

I’m right! I commented out the locking. It still segfaults.



John Skaller
ska...@internode.on.net





John Skaller2

Nov 19, 2018, 7:31:29 AM
to felix google


>
> I’m right! I commented out the locking. It still segfaults.

Ok this is complete and utter nonsense. I refetched the whole repo, rebuilt,
all OK.

I added a mutex to the gc profile, everything still OK.

I added a pointer to the mutex, without actually setting it or using it
and I get a segfault.

To confirm I commented it out. No segfault.

Either C++ is complete rubbish, or the build system is mixing up the binaries
and headers so they don’t match. Adding an unused pointer non-static
member to a class cannot do anything at all (the class isn’t copied either).


John Skaller
ska...@internode.on.net





John Skaller2

Nov 19, 2018, 7:55:38 AM
to felix google
Ok, it's the build command. Somehow the binaries and headers are getting out of sync.
Adding a field to one class in the header changes some binaries but not others.

Make rebuild works. Make rtl does not.

Hmm . ok ..

rtlbase:
# =========================================================
# rebuild rtl
# =========================================================
${LPATH}=build/release/host/lib/rtl flx_build_rtl --target-dir=build/release --target-bin=trial

The output binaries are going into build/release/trial. But the headers are going into
build/release/share. So the headers are upgraded but the binaries are not.
Or something like that … :-)


John Skaller
ska...@internode.on.net





John Skaller2

Nov 19, 2018, 1:54:53 PM
to felix google
OK! So the model is becoming clearer. Here are my current thoughts:

At the moment, when you spawn a procedure as a pthread, a fibre scheduler
is run by the thread. The procedure you gave is made into a fibre and put
on the scheduler list. Ignoring async I/O, when the scheduler list is empty
and the currently running fibre terminates, the pthread terminates.

A fibre can start a new scheduler. Schedulers are SUBROUTINES.
So when this happens the current fibre (and its scheduler) is suspended
while the nested scheduler runs. It resumes when that scheduler finishes.

So far so good. But there's an issue! It's been worrying me for a while.

When a fibre does I/O on an schannel, it is suspended and put onto
the channel until a matching request is made, and the scheduler then
grabs another fibre to run .. or terminates if there aren’t any.

This is all fine provided channels only connect fibres running
on the SAME scheduler. Now if you have a channel connected
to a parent (or grand parent) that’s fine. An unsatisfied request leaves
the fibre on the channel until its scheduler terminates and the parent
wakes up, at which point its active fibres can satisfy the I/O request.
This means the original fibre can MIGRATE down the scheduler stack
from a child to a parent. It's also possible for fibres to migrate UP the scheduler
stack from a parent to a child.

Both these migrations are sound because they preserve the total event ordering
coroutines guarantee.

HOWEVER if the channel crosses a *pthread* boundary then we're in trouble.

First, the total event ordering assurance is broken. Technically the assurance is
only within a single pthread.

Second, there could be a race on the channel which will corrupt the whole
system.

Because scheduling is dynamic, Felix cannot stop this.

The code I am working on fixes the race problem. There is now a global
lock which is used to mark critical sections. So reads and write on schannels
are now serialised, globally. There are more operations to serialise.

The ordering problem is bad though. So here's a new idea. Each fibre
carries a pointer to a particular scheduler queue. Whenever a running
fibre is suspended it goes on THAT queue as active. If an I/O request
is made, and it matches, the two concerned coroutines are put
back on their original queues.

This makes migration IMPOSSIBLE. It also means that a scheduler
can die because it has no active fibres left to run, but its queue persists,
and eventually a fibre it used to run goes back on the active queue .. even though
the scheduler is dead. So, the queues have to be garbage collected!

My brain hurts a bit thinking about this but it seems sound: a fibre on a dead
queue can't ever run. It also can't be deleted until the whole queue is deleted.
However fibres left hanging on otherwise unreachable channels are already
reaped by the collector, and so their link to the queue does not constitute
a reaching path from the root.
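
A minimal sketch of the queue-affinity idea in C++, with hypothetical names (a
shared_ptr stands in for the GC link, so a dead scheduler's queue persists for as
long as some fibre still points at it):

    #include <deque>
    #include <memory>

    struct fibre;

    // The queue a scheduler runs (hypothetical layout).
    struct sched_queue { std::deque<fibre*> active; };

    // Each fibre carries a link to its home queue.
    struct fibre {
        std::shared_ptr<sched_queue> home;
        // ... continuation state ...
    };

    // When a channel I/O request matches, each party goes back onto its own home queue,
    // which may belong to a different (possibly already dead) scheduler.
    void reschedule_pair(fibre *reader, fibre *writer) {
        reader->home->active.push_back(reader);
        writer->home->active.push_back(writer);
    }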

Now, with serialised I/O operations and "queue affinity" as described
above, something new and wonderful can happen: two coroutines
running on schedulers of DIFFERENT THREADS can now safely
communicate.

At present we have to use pchannels for that. These are monitors.
And they block the whole thread, not just the coroutine doing the pchannel I/O.

So now, there’s more. I have added a new command:

spawn_process

What this does at the moment is set a “mobility” flag to true.
You must ONLY call this procedure on pure coroutines.

The idea (not yet implemented) is that you can spawn a pthread
and GIVE it a queue instead of it making one. So you make
a queue, then you make several pthreads use that queue.

When you spawn process, the fibre is put on that queue,
not the queue of the scheduler whose fibre is making
the spawn_process call. So that fibre now has a pool
of pthreads grabbing fibres from the queue and running
them *concurrently*. The actual code running on these
pthreads is the same! The ONLY difference is that the queue
is shared by several schedulers. The big difference is that
when any pthread’s current fibre suspends and it needs a new
one, it grabs it from the shared queue, so all the fibres
with affinity for that queue can run concurrently.

I want to note that whilst these threads can be pre-empted
the FIBRES cannot be. Once a fibre is running on a particular
thread, it keeps running until it yields. Context switching WITHIN
a pthread is still entirely cooperative, even if the thread is swapped
out or pre-empted the coroutine can’t be swapped out for another
one on the same thread.

So the upshot is:

* currently non-migratory behaviour is preserved.
* there is an overhead from locking on every I/O request now
(and some others like spawning: access to queues has to be serialised)
* previously hard to predict migratory behaviour is now impossible
* specified fibres can run concurrently
* ALL fibres can communicate with schannels

The caveat with schannel communication is that for non-trivial data
like lists, we have to remember that with concurrency, the writer could
destroy the data before the reader has grabbed it. This wasn’t possible
before because once the reader has the data, it can’t be interrupted.
However, Felix has a hack that ensured the reader always goes first
after a transmission so the writer cannot mess up the data by going
first before the reader has it. However in any case the library
functions actually copy the data to be written onto the heap,
and then write the pointer.

What is actually needed to make this stuff safe is to require that
the data to be sent is uniquely typed and that it is MOVED
down the schannel. In other words having written the data,
the writer can no longer access it.
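
A rough C++ analogue of that move discipline, using unique ownership so the writer
loses access once it has written (a hypothetical channel type, not the Felix library):

    #include <memory>
    #include <utility>

    // Hypothetical one-slot channel; unique ownership models "uniquely typed" data.
    struct schannel_box { std::unique_ptr<int> slot; };

    // The writer MOVES the data down the channel: afterwards its own pointer is null,
    // so it can no longer destroy or mutate what the reader will receive.
    void write_move(schannel_box &ch, std::unique_ptr<int> &data) {
        ch.slot = std::move(data);
    }

    std::unique_ptr<int> read_move(schannel_box &ch) {
        return std::move(ch.slot);   // the reader takes sole ownership
    }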

So the mobile flag I just added is wrong. What is actually required
is simpler and saner: the fibre has to have a queue pointer,
and we need spawn_process to be the same as spawn_fibre
except spawn_fibre fixes the queue to be the same as the spawner,
whereas spawn_process requires passing it as an argument.

Ok so the model is getting there!




John Skaller
ska...@internode.on.net





John Skaller2

Nov 20, 2018, 10:08:25 PM
to felix google
>
> HOWEVER if the channel crosses a *pthread* boundary all we’re in trouble.
>
> First, the total event ordering assurance is broken. Technically the assurance is
> only with a single pthread.
>
> Second, there could be a race on the channel which will corrupt the whole
> system.
>
> Because scheduling is dynamic, Felix cannot stop this.


> The ordering problem is bad though. So here’s a new idea. Each fibre
> carries a pointer to a particular scheduler queue. Whenever a running
> fibre is suspended it goes on THAT queue as active. If an I/O request
> is made, and is matches, the two concerned coroutines are put
> back on their original queue.
>
> This makes migration IMPOSSIBLE. It also means that a scheduler
> can die because it has no active fibres left to run, but is queue persists,
> and eventually a fibre it used to run goes back on the active queue .. even though
> the scheduler is dead. So, the queues have to be garbage collected!
>
> My brain hurts a bit thinking about this but it seems sound: a fibre on a dead
> queue can’t ever run. It also can’t be deleted until the whole queue is deleted,
> However fibre left hanging on otherwise unreachable channels are already
> reaped by the collector, and so their link to the queue does not consititute
> a reaching path from the root.

Hum. No. Having one thread “put back” a coroutine on another thread's
active list, when that thread is about to terminate, is a race; it would be madness
to have the routine run sometimes and not others.

So either we use the migration model, or we get bad interthread schannel communication.

Migration actually *ensures* a total ordering of reader and writer, i.e. local
totality (precisely because the routines run on the same thread).

However it breaks the rule that the routines on a single thread *at any one time*
are totally ordered *for all time* globally.

So I need to think of use cases. And it's fun: consider a thread A with a routine
A1 that needs some data that a routine B1 on thread B has. Then a sync on
an schannel can move B1 over to A so the transfer is synchronous. Then B1
can sit until more data arrives from say B2 on a channel, which drags B1
back to thread B.

So B1 is acting like a ferry, shuttling between threads.

This can happen now (it isn’t safe because the I/O ops
aren’t thread safe).

Also at present, Felix is using a C++ ::std::list for the active fibres.
This requires rooting the elements which is costly. I think I will
change to an slist, which is already garbage collected. I just need
to then make the sync objects garbage collected as well.

If I do that then all i need to do is provide a spawn_process function
which is the same as spawn_pthread EXCEPT it takes an existing
active list as an argument instead of creating a new one.

Then, by serialising access to the active list .. we get concurrency.

It would be up to the user to ensure only pure coroutines got posted to
that active list. The PROBLEM then is maintaining that when
you have migratory routines.



John Skaller
ska...@internode.on.net





John Skaller2

Nov 21, 2018, 7:55:25 AM
to felix google
Another approach to this: what is the simplest change to the system that
introduces concurrency? What does it imply?

The answer is: the change is utterly trivial. Here is some code:

void async_sched::do_spawn_pthread()
{
  fthread_t *ftx = *(fthread_t**)ss.request->data;
  if (debug_driver)
    fprintf(stderr, "[prun: spawn_pthread] Spawn pthread %p\n", ftx);
  gcp->collector->add_root(ftx);
  std::list<fthread_t*> *pactive = new std::list<fthread_t*>;
  pactive->push_front(ftx);
  void *data = new async_sched(world,debug_driver, gcp, pactive);
  flx_detached_thread_t dummy;

This is the start of the routine that is used when the client says

spawn_pthread procedure;

The first line extracts the fthread of the procedure:

fthread_t *ftx = *(fthread_t**)ss.request->data;

Now we make that a GC root:

gcp->collector->add_root(ftx);

Now the critical line: we create a new active list for our new pthread:

std::list<fthread_t*> *pactive = new std::list<fthread_t*>;

This is the list the synchronous scheduler uses to handle fibres,
in particular, do schannel I/O operations.

Next we push our initial fibre onto that list:

pactive->push_front(ftx);

and finally we create a new asynchronous scheduler object:

void *data = new async_sched(world,debug_driver, gcp, pactive);

This object also contains the synchronous scheduler. By way of explanation, the
scheduler is divided into two pieces: the synchronous scheduler does synchronous
things like schannel I/O, but if it gets a request to do something asynchronous
like create a new pthread, it returns control to the async scheduler saying
it can't handle the request.

The split is important because the user can create a new synchronous scheduler
and run it. It's a subroutine. So you can have a stack of synchronous schedulers.
If an async level request is issued onto a user synchronous scheduler
all hell breaks loose because it is not owned by an asynchronous scheduler
to which the request can be delegated.

Now here’s the point: the simplest change is to add a routine:

do_spawn_process ..

which is identical EXCEPT that instead of creating a new active list,
it just uses the caller’s current one.

That’s it! The new pthread will share the active list with its creator.

There are a couple of possible variations: one is to just create N threads
at once instead of just one. The difference is in the user invocation,
instead of saying

spawn_process procedure;

the user says

spawn_pthreads 4 procedure;

which makes 4 threads.

Now here’s the issue
================

A scheduler dies when the active list is empty AND the currently running
fibre stops. It can stop due to for example an I/O request on a channel.

If this happens, and there are no more fibres on the active list,
then there is no way for the fibre suspended on the channel to send or receive any
data because there are no active fibres left to do it.

So we can safely KILL the pthread because it cannot progress.

There is an existing EXCEPTION: if an async I/O operation is pending,
such as a timed wait or socket I/O request, the thread BLOCKS waiting
for the request to be satisfied. That puts the fibre waiting for the event
to complete back on the active list.

That’s how Felix does async I/O.

The PROBLEM now is that, ignoring pending async I/O, when the active list
is empty, and a pthread’s running fibre goes inactive, the pthread will terminate.
The problem is that THE OTHER THREAD is still running a fibre which
could put new fibres on the shared active list.

The termination condition has to be that the active list is empty AND
NO PTHREAD SHARING THE ACTIVE LIST is still running a fibre.

It's not clear how to enforce this. (Obviously it can be done, with, say,
a counter of pthreads that share the active list and which are running,
when it drops to 0 ALL the pthreads terminate together).

But my basic design principle is that the right condition should
“drop out” of the data structures naturally, not by bolting on a hack.

Maybe it's not a hack, I don't really know. The model would say
N threads stay alive, even when only one has work left to do.
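
For reference, the counter version of the termination test might look like this,
as a sketch with hypothetical names:

    #include <deque>
    #include <mutex>

    struct fthread;

    // State shared by all pthreads that use one active list.
    struct shared_active_list {
        std::mutex lock;
        std::deque<fthread*> fibres;
        int running = 0;   // bumped when a sharing pthread picks up a fibre, dropped when it suspends
    };

    // The whole group may terminate only when the list is empty AND no sharing pthread
    // is still running a fibre that could repopulate it.
    bool group_may_terminate(shared_active_list &s) {
        std::lock_guard<std::mutex> g(s.lock);
        return s.fibres.empty() && s.running == 0;
    }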



John Skaller
ska...@internode.on.net





John Skaller2

Nov 21, 2018, 11:35:21 AM
to felix google
Ok, it's done. It crashes because there are no locks yet.

However check this out:

var x = 0;
noinline proc f (i:int) () {
++x;
}

spawn_pthread {
var start = time();
println$ "Startup thread";
for i in 0 ..<1000000 perform schedule_fthread (f i);
println$ "creation time: " + (time() - start).str;
start = time();
println$ "Scheduling done x="+x.str;
schedule_fthread {
println$ "final x =" + x.str;
println$ time() - start;
};
//spawn_process { println$ "EXTRA"; };
};

~/felix>flx tp
Startup thread
creation time: 34.3739
Scheduling done x=0
final x =1000000
0.813906

It takes 35 seconds to create 1M closures and post them onto the
scheduler list. schedule_fthread is the same as spawn_fthread except
spawn causes the spawnee to run immediately, whereas schedule
causes the current fibre to continue instead.

I have to fix that. 35 seconds is slow. I’m not sure what is slowing it yet,
the allocation or the posting. Probably the allocations.

But look at the time taken to *execute* all 1M coroutines!

It's under 1 second. So the context switching rate is in excess of
1M context switches per second. And I think I can make that
a LOT faster too.

Note, the 1M or so objects are not collected :-)

The idea was that spawn_process shares the same queue of active fibres.
So each pthread should run 500K fibres. So it should run a bit faster.
However locks will be required .. and it might well run slower :)


John Skaller
ska...@internode.on.net





John Skaller2

Nov 21, 2018, 6:25:21 PM
to felix google

> //spawn_process { println$ "EXTRA"; };

I did this but I don’t like it. I’m leaning towards:

spawn_pthread N { .. };

i.e. just throw in a count of how many threads you want to run the job.

The reason is this:

If the count is 1, the default, there is no locking and no special stuff to keep
the threads active.

If the count is >1 then accesses to the queue and schannel read/writes and a few
other things are protected by a mutex. The mutex can be specific to that thread set.

In addition, some artifice will be needed to keep all the threads working
until the active queue is empty and ALL the threads are out of work.

Also at present the user can run a synchronous scheduler:

run { … };

is a subroutine that runs a (stacked) scheduler and returns when there’s
no work to do. However this scheduler cannot spawn pthreads or do async I/O.

I need to look at that. Note in my test case:

spawn_pthread {
var start = time();
println$ "Startup thread";
for i in 0 ..<1000000 perform schedule_fthread (f i);
println$ "creation time: " + (time() - start).str;
start = time();
println$ "Scheduling done x="+x.str;
schedule_fthread {
println$ "final x =" + x.str;
println$ time() - start;
};
//spawn_process { println$ "EXTRA"; };
};

I had to use “schedule_fthread” to ensure the 1M fibres got stacked up
before trying to execute any of them, so that the extra pthread got a chance to
eat some of them off the queue. (It crashed but that's expected because I have
no locking yet). The test case isn't good, because at 1M jobs processed in 0.8
seconds it's hard to see how a second thread can do anything but degrade the performance.
The fibres need to do more work to see a speedup.

Anyhow, somehow spawning a pool of threads to run a fibre set seems more
symmetrical than my previous design and has the clear advantage I mentioned,
that for the special case of 1 we can omit the locking.

The other thing is that, with serialisation, for that set of fibres, the I/O events
are still totally ordered! Events *between* synchronisations are not, they
can occur “concurrently” or with “indeterminate ordering”.

Also with this design the issue of “migration” does not arise. More precisely,
the existing issue remains. What actually happens is that with the new design,
migration is fine between the specified thread set because there is actually
NO migration in terms of swapping to different scheduler queues .. since there
is, by specification, only one queue.

There’s a subtle point in all the above. We need to do this multi-thread spawning
INSIDE a run command. The reason is that if coroutines inside a run command
share memory with the parent, it's safe in the sense that mutations to the parent
data frame are safe because the parent is suspended. This means, for example,
a full pipeline can be run, where the transducers are pure, but the source
and sink are not.



John Skaller
ska...@internode.on.net





John Skaller2

Nov 24, 2018, 11:42:39 PM
to felix google
So I’m up to the nasty bit. I have to examine closely how asynchronous I/O works.

At present, when a pthread runs out of coroutines to run, it checks if there
are any pending events waiting for service. If there are, it blocks on the
queue containing completed async requests until some are available,
which it puts on the synchronous scheduler list, and jumps back to
servicing fibres. If there are no pending requests, the pthread terminates.

The behaviour is sound for one thread. A coroutine/fibre requesting async
handling (timer wait or socket I/O) makes the request which causes it
to be made into a root, and the request is sent to the demux thread.
(If there is no demux thread one is started).

Demux monitors events with epoll, kqueue, poll, select, or whatever.
When an event happens, the I/O transfer routine in the request is
run. If the request is not finished, we go back to waiting. Once the
request is done, demux thread puts the fibre on a ready list.
This is the list that the fibre scheduler pthread checks.

So now, we have a *single* list of active coroutines shared by
multiple pthreads. Getting this running is easy. Fetching from
the list is easy. Choosing what to put on this shared list is not
but I’m ignoring that for the moment :)

The HARD bit is termination. The problem to start with is that there
is one “ready” list for every pthread, and one demux thread
is started for each one. This is not what we want! We really only
want ONE demux thread, in fact one globally is enough,
or perhaps a small pool of them.

Now suppose I can fix that so there is only one async ready list.
Note that these lists are already thread safe and GC aware.

The termination condition for a pthread sharing an active list
is non-trivial. The condition is:

(a) No pthread is running a fibre
(b) the active list is empty
(c) there are no pending events

Condition (b) is easy to measure.
I can implement (a) easily I think, with a counter IN the active list.
I cannot implement (c) easily at the moment because there are multiple
ready queues. So this has to be fixed! The obvious solution is to put
the ready queue in the object containing the active list.

Now, if a thread finds the active list is empty, what does it do?
Suppose there are other pthreads running a coroutine.
Then, it has to suspend until either:

1. A new fibre gets put on the active list
2. The conditions (a), (b), and (c) are met (terminate)
3. There is a fibre on the ready list to schedule (go do it)

So we have to make the pthread suspend by using a condition variable
and make sure it is woken up on any state change.

I think the cost is minimal in the sense that there is no waiting or state change
whilst all pthreads are running stuff off the active list. So I think this is sane.
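
A sketch of that wait loop in C++, with hypothetical names; it assumes the caller
decrements the running count (via fibre_done) whenever its current fibre suspends,
and that every other state change also notifies the condition variable:

    #include <condition_variable>
    #include <deque>
    #include <mutex>

    struct fthread;

    // Shared state for a pthread group: the active list, the "ready" list of fibres
    // whose async requests completed, and the counters for conditions (a),(b),(c).
    struct shared_sched {
        std::mutex lock;
        std::condition_variable wake;   // notified on every state change
        std::deque<fthread*> active;
        std::deque<fthread*> ready;
        int running = 0;                // pthreads currently running a fibre
        int pending_async = 0;          // async requests not yet completed
    };

    // Called when this pthread's current fibre suspends or finishes.
    void fibre_done(shared_sched &s) {
        { std::lock_guard<std::mutex> g(s.lock); --s.running; }
        s.wake.notify_all();
    }

    // Called by a pthread that needs work: returns a fibre to run, or nullptr to terminate.
    fthread *get_work(shared_sched &s) {
        std::unique_lock<std::mutex> g(s.lock);
        for (;;) {
            while (!s.ready.empty()) {                     // 3. schedule completed async fibres
                s.active.push_back(s.ready.front());
                s.ready.pop_front();
            }
            if (!s.active.empty()) {                       // 1. a fibre is available to run
                fthread *f = s.active.front();
                s.active.pop_front();
                ++s.running;
                return f;
            }
            if (s.running == 0 && s.pending_async == 0) {  // 2. (a),(b),(c) all hold
                s.wake.notify_all();                       // wake the others so they terminate too
                return nullptr;
            }
            s.wake.wait(g);                                // sleep until some state change
        }
    }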

Making a SINGLE demux thread and wait list for a pthread group is tricky.
It should work. But it's really hard to tell when code assumes the 1-1 invariant
that is now being broken.


John Skaller
ska...@internode.on.net





Keean Schupke

Nov 25, 2018, 12:20:53 AM
to felix-l...@googlegroups.com
Hi John,


As the optimal number of threads to run depends on the CPU at runtime, the compiler cannot know how many threads are optimal.

This means the correct architecture is to build a thread pool at runtime with one thread per CPU core, (or two per core with hyper threading) and that is all the threads you should ever create.

Now ignoring thread locality (we should prefer to re-use the same core for a continuation of the same process to preserve memory locality in the caches) it's simply a matter of assigning the suspended computation to _any_ thread in the pool when the async IO completes.

If we think of memory locality we optimally want one queue of suspended computations per core (and hence per thread). This keeps things serialised.

However for optimal utilisation we want a single queue of suspended computations shared by all threads. This makes things as parallel as (efficiently) possible.

Perhaps a good compromise is to have one queue per thread in the pool, but allow idle threads to steal work from other queues. This minimises lock contention, but alleviates the case where one thread ends up doing all the work. (Because threads only try and access other work queues when idle, there is no lock contention if all cores have work and are running.)

There needs to be some heuristic that governs work stealing. It is worth stealing work if running it on another thread (core) is going to finish sooner than waiting for the current thread attached to that queue to get around to completing all the tasks in its queue up to that point. You could even steal work speculatively if power consumption is not important. To do this idle threads could 'borrow' the work from other queues without removing it from the original queue. You then can keep the first set of results from the race, and then terminate that work on the loser. This could potentially slow things down though as other queued work may not get done.

So the optimal strategy would involve somehow estimating the time to copy cached data to a different core, plus the time to do the work, and let an idle thread steal this work if the total estimated transfer+execution time were to be less than the total execution (sum) time for all work ahead of this in the current CPU queue.
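
A minimal sketch of that shape of work stealing in C++ (hypothetical names; the
transfer-versus-execution cost heuristic itself is elided):

    #include <deque>
    #include <mutex>
    #include <vector>

    struct task;

    // One work queue per hardware thread.
    struct worker_queue {
        std::mutex lock;
        std::deque<task*> q;
    };

    // An idle worker takes from its own queue first, and only then tries to steal
    // from the far end of another worker's queue to reduce contention.
    task *fetch(std::vector<worker_queue> &pool, size_t self) {
        {
            std::lock_guard<std::mutex> g(pool[self].lock);
            if (!pool[self].q.empty()) {
                task *t = pool[self].q.front();
                pool[self].q.pop_front();
                return t;
            }
        }
        for (size_t i = 0; i < pool.size(); ++i) {
            if (i == self) continue;
            std::lock_guard<std::mutex> g(pool[i].lock);
            if (!pool[i].q.empty()) {
                task *t = pool[i].q.back();
                pool[i].q.pop_back();
                return t;
            }
        }
        return nullptr;   // genuinely idle: no work anywhere right now
    }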


Cheers,
Keean.




John Skaller2

Nov 25, 2018, 2:11:40 PM
to felix google

>
> This means the correct architecture is to build a thread pool at runtime with one thread per CPU core, (or two per core with hyper threading) and that is all the threads you should ever create.

Yes, approximately *assuming* this is the only process running.

>
> Now ignoring thread locality (we should prefer to re-use the same core for a continuation of the same process to preserve memory locality in the caches) it’s simply a matter of assigning the suspended computation to _any_ thread in the pool when the async IO completes.

I wish it were that simple. The suspended computations are fibres (coroutines) not processes.
Coroutine systems guarantee total ordering. For example a coroutine can happily
modify the locals of another via a pointer without locking.

So at present, the user launches pthreads, and each one has a queue of
suspended coroutines.

Note these are *active* meaning ready to run. Now to do synchronous I/O,
suspensions are placed on a channel until a match occurs. That is,
if you do a read on channel X, then you’re suspended until there is a write
on channel X. At that time both suspensions are placed back on the active list.

In order to preserve total ordering, both have to go back on the same list.
Note that there is no locking on synchronous channel I/O.

So my design modification is: the user can add a worker to a thread.
At this time, channel I/O for that set of threads is serialised with a lock
and the total ordering guarantee is lost.

So *locally* there is a single queue and a set of threads running it,
the number is up to the programmer. At the moment.

The problem of load balancing is secondary at the moment!
The main problem is keeping the threads alive.

At present, when a thread runs out of fibres to run, it checks if there
are any async operations pending. If so it blocks on the async read
list, which is where fibres whose async requests have been satisfied are put.
When one is put there the thread grabs it and puts it on the active list,
then starts running it.

Otherwise the thread terminates.

With multiple threads .. this algorithm will not work.
I have to use a condition variable to suspend a thread when
it runs out of jobs, because other threads can add jobs back
to the active list. When that happens they have to signal the
thread waiting on the condition variable to check again
for work to do. All the threads die when there is no work
to do.

Also remember Felix uses a world stop garbage collector so ACTUAL
mutexes cannot be used as locks or a GC world stop request would
lead to a deadlock. So Felix uses spinlocks instead, actually it uses
timed waits on C++ locks so it can cycle around checking for world-stop.
The thread must call a special yield function on noticing the flag, which
suspends it on a condition variable but ALSO copies the stack pointer
to a thread control object. The GC needs to know each thread's machine
stack, so it can scan the stack for roots.
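
A rough sketch of what such a GC-aware acquire looks like, with hypothetical names
for the world-stop flag and the yield hook (not the actual Felix runtime code):

    #include <atomic>
    #include <chrono>
    #include <mutex>

    // Hypothetical hooks into the runtime: the world-stop flag and the GC yield point,
    // which records this thread's stack pointer before parking it.
    extern std::atomic<bool> world_stop_requested;
    extern void gc_yield();

    // Acquire a lock without ever blocking indefinitely, so a waiting mutator can
    // still cooperate with a world-stop request.
    void gc_aware_acquire(std::timed_mutex &m) {
        using namespace std::chrono_literals;
        while (!m.try_lock_for(1ms)) {          // timed wait instead of an unbounded block
            if (world_stop_requested.load())
                gc_yield();                     // service the world-stop, then keep trying
        }
    }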

So the algorithm is a tad tricky to get right. More precisely, contrary to
Stepanov, algorithms are always easy .. it's getting the right
data structure that's tricky.

NONE of this answers the question, when is it safe to run two coroutines
concurrently. The theoretical answer is: when the required total ordering
breach is unobservable.

In practice, you can tell some coroutines are PURE, which means,
all their interfacing is done with I/O channels. In that case, they’re
also processes (provided the channels I/O is serialised).

For example in a typical pipelines:

source |-> transducer1 |-> transducer2 |-> …. |-> sink

the transducers are pure. The source is also often pure.
However the sink may not be.


John Skaller
ska...@internode.on.net





Keean Schupke

Nov 25, 2018, 2:16:08 PM
to felix-l...@googlegroups.com
Hi John,

Yes I think one work-queue per thread will preserve total ordering.

How much useful parallelism will you get without work-stealing though (letting idle threads steal work from threads with queues of work to do)?

Keean.




John Skaller2

Nov 25, 2018, 5:01:05 PM
to felix google


> On 26 Nov 2018, at 06:15, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Hi John,
>
> Yes I think one work-queue per thread will preserve total ordering.

That's what I have now but ..

> How much useful parallelism will you get without work-stealing though (letting idle threads steal work from threads with queues of work to do)?

There are no idle threads. Excepting waiting for pending async requests, if a thread
runs out of work, it terminates.

The reason is, if a single thread running coroutines has no work at any point
in time, it will never have any work at any future point in time.

The reason is: when fibres communicate on channels, they block (on write)
or starve (on read) until the request is matched. Such a request cannot be
matched if there is no running fibre to do the matching request, or no
active fibre waiting to run. So once a thread runs out of active fibres
on the scheduler list it can safely terminate. Any fibre left blocked
or starving will remain so forever so the GC reaps them.

In fact, they're reaped the first time the GC runs after they block or starve,
assuming they're unreachable. Provided schannels are anonymous,
this is more or less guaranteed because fibres are anonymous: they have
no id or handle to name them.

So the problem isn’t stealing work. If we start 4 pthreads on one active list,
*assuming* the fibres on that list and all they spawn onto that list are pure
(or at least serialise impure operations), then the actual problem is keeping
the idle threads alive.

You might have 2 idle threads and two thread working. Eventually a working
thread may cause new fibres to appear on the active list so all the threads
can work again.

Obviously the way to do this is with a condition variable. The idle threads go
to sleep until a signal wakes them up to check if there’s some work.

The threads all die when there is no thread left to signal them.
[So the last thread that would go to sleep waiting for work actually
sets a euthanasia flag and then signals the others to wake up and
suicide, then suicides itself]


If I can get all this working, which is in progress, two questions remain:

(A) How to avoid the “tragedy of the commons” in which every man, woman,
child, and their dog, thinks it's a good idea to hire a thread pool.

(B) how to determine which *coroutines* are pure, or pure enough,
that they can be run concurrently.

In principle (B) seems hard. If you let any coroutines run concurrently
there is no universal ordering guarantee. What you get instead
is a pairwise synchronisation barrier: when coroutines X and Y
synchronise, they agree on the ordering of all the events X did before the sync;
Y agrees they were before the sync too. Ditto for after the sync.
So the sync is an ordering barrier, but it is only *pairwise*.

Pairwise ordering is enough for processes, this is Hoare’s CSP.
But processes are pure by definition. Coroutines/fibres are
not. We have pointers, we have shared state that can be modified.

So here’s the deal: what I am trying to do is take the strong total ordering
guarantee, and do two things:

1. As an optimisation ONLY, run some fibres concurrently,
IF and ONLY IF the ordering guarantee is EFFECTIVELY maintained.

What this means is that this is a temporal version of a GC collecting
unreachable memory. Now, instead, we’re running things concurrently
if, despite the loss of ordering guarantee, no one can notice the difference.

For example if you have a pipeline, and pure components acting as transducers,
then running the transducers concurrently is safe, because pure means there
are no side effects for an observer to detect a breach of ordering, and because
a transducer is a read - compute - write - redo it type of loop, the only events
are read and write in a loop, and only that coroutine, its supplier and its consumer
can notice the events. Its supplier only sees it reading what the supplier writes
and the consumer only sees it writing what the consumer reads and this
is somehow (quite obviously) transitive. The ordering assurance between an
I/O sync a long way up the pipe with one a long way down the pipe is lost,
but there is no observer that can notice this.


This is one thing that the functional programming model gets totally and completely
wrong. Computer Science is just that: science. It must deal ONLY with what is
observable. There is no notion of observation in FP so it is not a theory of
computing at all. I have previously pointed out that if you do add a notion
of observation to FP, you have to conclude a list is infinite, because you can
only observe the length of a tail, and that is a lower bound on the length
of the list only. There is no way to observe an upper bound! Because the
pointers in a singly linked list only point towards the end!


2. Weaken the guarrantee somehow, so the resulting system is still
useful but no longer totally ordered. In other words look for a “half way house”
between total order and total chaos.

The core question might be rephrased as this: with fibres, when a suspension
runs is indeterminate, but suspensions interleave. We just don’t know in what
order they run, but we do know that whatever the order, the resulting
synchronisations AND other events (such as console writes, or mutations,
or file system changes) are totally ordered.

Although this seems weak, it is possible to impose any order you want
using channels. So actually the system is maximally powerful: the indeterminacy
can be fully removed if you want and replaced by complete determinism.

Now we want to optimise it so we can do something concurrently.
Obviously CODE BETWEEN EVENTS can always be run concurrently
with ANY other code between events.

The thing is, what is an event? The answer is: it is something observable.
So if I say "the coroutine is pure" I am specifying there are no "other"
events than synchronisations which are observable.

But that is too strong. Sometimes, other events are fine, because those
that can observe them are limited. For example if one routine
can observe an event of another, but no one else can, then pairwise
total ordering is enough. We don’t need global total ordering.

Basically:

X —> E1 —> IO —> ….

Y —> IO —> O1 —> …

Here X does event E1, then IO. Y synchronises with X via IO and then
does observation O1 of event E1. If this is the ONLY observation of E1,
then the pairwise synchronisation of X and Y is quite enough to EXCLUDE
the above possibility. Pairwise sync means there is a R/W (producer/consumer)
barrier and so X and Y must agree the event E1 occurred before IO.

This is hard to understand but it is exactly what producer/consumer memory
barriers do now. They provide only LOCAL consistency. Locks on the other
hand provide global consistency. That’s equivalent to two cores both
doing a r/w barrier (instead of one doing r and one w).
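
In C++ memory-model terms the pairwise barrier is just a release/acquire pair,
for example (a minimal sketch):

    #include <atomic>

    // E1 is a plain write by X; the flag store/load is the pairwise synchronisation (the "IO").
    // The release/acquire pair orders E1 before O1 for X and Y only; no global order is implied.
    int payload = 0;                   // the event E1
    std::atomic<bool> flag{false};     // the synchronisation point

    void X() {
        payload = 42;                                     // E1 happens before the sync
        flag.store(true, std::memory_order_release);      // producer-side (write) barrier
    }

    void Y() {
        while (!flag.load(std::memory_order_acquire)) {}  // consumer-side (read) barrier
        int o1 = payload;                                 // O1 is guaranteed to see E1 == 42
        (void)o1;
    }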





John Skaller
ska...@internode.on.net





Keean Schupke

Nov 25, 2018, 8:03:09 PM
to felix-l...@googlegroups.com
I think you are confusing hardware threads with the software tasks as I will call them.

You only ever want a thread pool with one hardware thread per core. This is fixed and never changes. Threads in this pool can idle, but we never destroy them or add to them.

These hardware threads multiplex the asynchronous tasks by way of work queues, so each async callback sits in a work-queue and the next available hardware thread picks the next ready to run callback.

So on a 4 core hyperthreaded CPU there will always be exactly 8 hardware threads, there is no point in creating more than this.

If we start with a main program that initiates some IO in parallel this is determined by the programmatic parallelism, say we want to read in 100 files, we just launch 100 async requests in parallel, so there are 100 callbacks pushed into the waiting queues. When the completion event happens (IO interrupt) each callback is moved to the ready queue.

The 8 hardware threads for our CPU now can pick work from the ready queue.

The decision is whether to have one global ready queue or 8 hardware thread specific ready queues.

To guarantee callback serialisation we need all 100 callbacks to be run by the same hardware thread, this can be done using one wait/run queue per hardware thread. However doing this will limit our actual parallelism, in fact we would only have asynchronous concurrency and no parallelism in this example.

If we provide no primitives for launching threads (why would we, the number is fixed by the number of CPU cores), then how do we ever get any parallelism?

We could introduce a process/task concept with a 'fork' instruction like 'C', but does the user really care about this? It's an easy option for the language developer, but doesn't necessarily help the programmer.

The alternative is to impose some locality constraints on the asynchronous callbacks, for example disallowing access to external mutable state, so each callback exists as an actor or co-routine. In this case we can safely run all the callbacks in parallel potentially (limited by the actual number of hardware threads).

So an example might be a program to count the number of words in all the files in a programming project with 100 files. We create 100 parallel tasks, each counting words in a single file, these get multiplexed onto the 8 hardware threads in our system, and then a final 'collector' sums the results of the separate tasks together to give a final total. Each of the 100 counting tasks needs to be sequential, but they have no interdependence so we don't care about ordering between tasks, except the final sum task must also be sequential and this is the last callback from each individual counting program.

So coming back to the beginning, with one work queue per hardware thread all 101 tasks above execute on the same thread (and with the correct affinity settings the same CPU core) which is not what we want. We do not want to code anything about the number of hardware threads because this is a runtime optimisation, we don't know how many cores the CPU will have when we write the program. 

So we need work stealing so some of the 7 idle hardware threads can do something useful, otherwise everything runs asynchronously and concurrently on a single core.


Cheers,
Keean.














John Skaller2

Nov 26, 2018, 5:46:06 AM
to felix google


> On 26 Nov 2018, at 12:02, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> I think you are confusing hardware threads with the software tasks as I will call them.

No, I doubt that.

>
> You only ever want a thread pool with one hardware thread per core. This is fixed and never changes. Threads in this pool can idle, but we never destroy them or add to them.

I agree .. in principle. The problem is, there’s no way to get that.
Even if you can get it for a particular application .. other processes can be running.

In Felix at the moment, there is a thread pool which does (by default) what you say.
However its a software managed pool, and there are restrictions on what the jobs
submitted to it can and cannot do. Jobs running in pool cannot use long term locks
directly, for example. So one has to arrange that when a long term lock is required,
the job is pulled out of the pool.

Felix ALSO allows the user to run pthreads.

Is this optimal? No of course not. [See next reply too]



John Skaller
ska...@internode.on.net





John Skaller2

Nov 26, 2018, 6:06:49 AM
to felix google
>
> So coming back to the beginning, with one work queue per hardware thread all 101 tasks above execute on the same thread (and with the correct affinity settings the same CPU core) which is not what we want. We do not want to code anything about the number of hardware threads because this is a runtime optimisation, we don’t know how many cores the CPU will have when we write the program.

Yes, I agree in principle.

Go does this right. I think Haskell does too. Not sure in either case.

But this is not the issue here. The problem is how to elevate coroutines to processes
more or less transparently, so the ability to run things concurrently is an optimisation.

There are two related features of this problem:

(1) designing an abstract model
(2) implementing it

At present I’m just implementing it and making the client programmer choose
how to use it. If they use it wrong, it won’t work. I need to check the implementation
is performant and doesn’t have races etc etc.

Once there’s a way to actually make things run concurrently I can examine
how to control its use.

Felix already has some concurrency. You can spawn pthreads. Async I/O
is automatic already. There is a thread pool. And there are some special
statements:

concurrently (p1, p2, p3,p4);

does a fork/join on pthreads. There’s also

pfor i = 1 .. 100 do … done

which automatically splits the loop into N pieces and runs them
concurrently .. don’t use this if the loop bodies aren’t independent
or at least properly coupled.

But what i’m doing now is far more general: optimising the execution of
coroutines by using concurrency. As above, initially, the user will decide
how many threads to use, and the user is responsible for ensuring
the coroutines run by the pool they create don’t interact the wrong way.

I would prefer to do this automatically, by analysis. But the first step
is to make it happen at all. IMHO.

If you have a way to identify exactly how to manage it so only coroutines
that don’t interact the wrong way (eg using pointers to access shared state)
are put on that queue, i’d like to hear it.

Remember, the compiler at present doesn’t know what a coroutine is.
It knows functions, generators, and procedures. You can say a function
is pure. I don’t think it makes sense to say a procedure is pure.

All procedures are coroutines, its just that some do schannel I/O and
some don’t, some are launched as fibres and some are just called.
To run a fibre you just say

spawn_fthread p;

where p is a procedure of type 1->0, i.e. accepting a unit argument.
You can call the procedure too. Whether it's part of an existing
fibre or not depends on whether you call it or spawn it, i.e.
the distinction is dynamic, not static.



John Skaller
ska...@internode.on.net





Keean Schupke

Nov 26, 2018, 7:48:56 AM
to felix-l...@googlegroups.com
Hi John,

I must be missing something, because I just find your response to be confusing:

On Mon, 26 Nov 2018 at 11:06, John Skaller2 <ska...@internode.on.net> wrote:
>
> So coming back to the beginning, with one work queue per hardware thread all 101 tasks above execute on the same thread (and with the correct affinity settings the same CPU core) which is not what we want. We do not want to code anything about the number of hardware threads because this is a runtime optimisation, we don’t know how many cores the CPU will have when we write the program.

Yes, I agree in principle.

And in practice :-) 

Go does this right. I think Haskell does too. Not sure in either case.

Go yes, Haskell no.
 
But this is not the issue here. The problem is how to elevate coroutines to processes
more or less transparently, so the ability to run things concurrently is an optimisation.

Coroutines create asynchronous tasks on the work queue, just like external IO, so treat a channel read/write as something that causes a continuation to be saved to the current CPU work queue.

Felix already has some concurrency. You can spawn pthreads. Async I/O
is automatic already. There is a thread pool. And there are some special
statements:

The user should never spawn threads. The number of actual threads should be fixed by the hardware. This is why I think you are confusing hardware threads and async tasks.
 
        concurrently (p1, p2, p3,p4);

does a fork/join on pthreads. There’s also

I disagree, there should be exactly 8 threads running (remember in a multi process CPU, all cores are assigned to your process at the same time, so you either have 8 threads running on the CPU or zero). Using thread affinity you can assure that each thread runs on one core and does not move between cores.

So this should create "p1" "p2" "p3" and "p4" as new async tasks and put them in the current CPU core work queue, so 4 tasks multiplex over 1 hardware thread.

If we allow work-stealing, other cores can steal the work from this CPU's work-queue, providing it does not change the semantics of the result, and the overall performance would be better than leaving it in this queue (determined heuristically).

But what i’m doing now is far more general: optimising the execution of
coroutines by using concurrency. As above, initially, the user will decide
how many threads to use, and the user is responsible for ensuring
the coroutines run by the pool they create don’t interact the wrong way.

If the user decides on the number of threads, it should be a command line parameter of the runtime, not something fixed at compile time. My laptop has a different number of CPU cores to my desktop, and I don't want to be editing code and recompiling as I move between them.
 
I would prefer to do this automatically, by analysis. But the first step
is to make it happen at all. IMHO.

I agree, see above, but this should _never_ be something set in the program code.
 
If you have a way to idenify exactly how to manage it so only coroutines
that don’t interact the wrong way (eg using pointers to access shared state)
are put on that queue, i’d like to hear it.

The easy way is to make each co-routine only schedule its continuations into the same work queue, and to make access to anything outside the co-routine read only.

Remember, the compiler at present doesn’t know what a coroutine is.
It knows functions, generators, and procedures. You can say a function
is pure. I don’t think it makes sense to say a procedure is pure.

Procedures can be "regular", that is, their output only depends on their inputs.

All procedures are coroutines, its just that some do schannel I/O and
some don’t, some are launched as fibres and some are just called.
TO run a fibre you just say

        spawn_fthread p;

I think you should never be spawning stuff like this, use structured primitives like "concurrently" instead.

where p is a procedure of type 1->0, i.e. accepting a unit argument.
You can call the proceedure too. Whether its part of an existing
ffibre or not depends on whether you call it or spawn it, i.e.
the distinction is dynamic, not static.

Maybe I don't like "spawn_fthread" because it mixes terminology with hardware threads. I would prefer:

async p

OR

nowait p

From a language semantics point of view this all can be related to Promises in a language like JavaScript. Promises are approximately a Monad (JavaScript made a stupid choice about recursive unwrapping which technically makes JavaScript promises not a Monad, but we can consider promises to be Monadic). Originally JavaScript has:

p.then(x => q(x))

OR

all([p, q, r]).then([x, y, z] => s(x, y, z))

So here "all" does what concurrently does. In the newer syntax you can write these:

x = await p
q(x)

[x, y, z] = await [p, q, r]
s(x, y, z)

So the default is that every "async" statement automatically "forks", unless you explicitly ask to wait for a result.

If we ignore side-effects then we could simply say any result automatically creates an 'await'.

I think in many ways it is simpler to assume every statement automatically 'awaits', and you explicitly mark the statements you don't want to wait for.


Keean.

John Skaller2

unread,
Nov 26, 2018, 1:45:49 PM11/26/18
to felix google

> But this is not the issue here. The problem is how to elevate coroutines to processes
> more or less transparently, so the ability to run things concurrently is an optimisation.
>
> Coroutines create asynchronous tasks on the work queue, just like external IO, so treat a channel read/write as something that causes a continuation to be saved to the current CPU work queue.

Coroutines create synchronous tasks on the work queue. There’s nothing asynchronous
about them. That’s the whole point: coroutines are synchronous.

The order in which continuations in the active set run is indeterminate
in theory (but fully determinate in practice). The order of events
is total. It’s plain old sequential programming.

A channel read, or write, which is not matched, *removes* the continuation
from the working set and stores it on the channel. When it is matched,
a read with a write or a write with a read, both are put back into the
working set. The actual rules are: the reader starts running, the writer
is at the head of the active list but not running yet.
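A minimal C++ sketch of that scheduling rule (the names Fibre, Channel and Scheduler are illustrative, not the actual Felix runtime classes; the real code transfers a machine word through pointers held in the requests):

#include <deque>
#include <vector>

struct Fibre { void* transfer = nullptr; /* continuation state elided */ };

struct Channel {
    // Only one of these is ever non-empty: unmatched requests park here.
    std::vector<Fibre*> waiting_readers;
    std::vector<Fibre*> waiting_writers;
};

struct Scheduler {
    std::deque<Fibre*> active;   // the working set of runnable continuations
    Fibre* current = nullptr;

    void read(Channel& c, Fibre* reader) {
        if (c.waiting_writers.empty()) {          // unmatched: park the reader
            c.waiting_readers.push_back(reader);  // it leaves the working set
            current = nullptr;                    // scheduler picks another fibre
        } else {                                  // matched: the reader runs now
            Fibre* writer = c.waiting_writers.back();
            c.waiting_writers.pop_back();
            reader->transfer = writer->transfer;  // only a pointer is moved
            active.push_front(writer);            // writer goes to the queue head
            current = reader;
        }
    }
};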

So please note: the specification and implementation use a fully synchronous
algorithm to schedule fibres (at the moment). The theoretical model
specifies that when a thread selects a continuation to run it uses the
Set Theoretic Axiom of Choice. It picks a continuation at random.
The implementation is fully determinate. Given any fibrated program the
order in which code evaluates is completely predictable.
However you need to know implementation details to predict it.

Those details are: spawning a fibre causes the spawned fibre to run
immediately and the spawner goes on the head of the queue.
Channel I/O stops the first requestor running, and all subsequent
requests of the same kind (read or write), and they’re pushed
onto the channel FILO (like a stack; I’m using a singly linked list).
A matching (opposite) operation pulls the head of the channel
list off, pushes the writer on the scheduler active list,
and runs the reader. The implementation is 100% determinate
and synchronous. You only get asynchronous behaviour if you make
an asynchronous I/O request. There are only two: a timer wait and
a socket I/O request (including waiting for a connection).
In this case, the behaviour of the implementation is no longer deterministic.
The actual I/O transfer is done concurrently by a separate dedicated
pthread (running a procedure called “demux”). Once completed,
the waiting continuation is (eventually) put back in the synchronous
scheduler’s queue. So the fibre continues *synchronously*.
Determinism of the implementation is lost. But the logical behaviour
of the program remains fully totally ordered and synchronous.
There is a gain in performance but the abstract logic is not impacted.

Asynchronous I/O requests (sockets, timers) remove the
continuation from the working set. When the request is satisfied
it is put into a separate “ready” set. The scheduler runs the working set
until it is empty and only then pulls ready continuations into the working
set. If there are none to pull, the scheduler terminates.

The actual data transfer for socket I/O is asynchronous, that is,
it happens *concurrently* with the coroutine scheduling.
Literally it is done by a special thread.

Everything in the main thread is completely synchronous,
there is no concurrency.

The above is what happens NOW. The fact continuations are completely
synchronous allows you to do things like use a pointer to modify shared
state without any precautions: no locks are required.

this is true, even if one of the coroutines is waiting on an external
event.

Don’t confuse indeterminate selection of the next continuation to run
with asynchronous behaviour. An iterator running through a hash table
is fully synchronous, it finds all elements in the table, the order
in which it finds them, however, is indeterminate. The order of recovery
of the elements is total. (In reality of course the order is determinate,
it just depends on history and implementation details).

When you run two threads to pull continuations off the active set,
THEN the behaviour has an element of concurrency to it,
and is now partially asynchronous because now two continuations
can run concurrently, in parallel. There is an element of synchronicity
too: at the read write (and spawn) events, which are synchronisation
points.

So what has happened is “synchronisation points” have been elevated
from “indeterminate order” to “possibly concurrent order”.

In fact, now a channel I/O has exactly the WRONG effect, because the
reader and writer, which transfer only a POINTER, now are certain
to run on different pthreads if the working set was empty.

So the aim here is to improve the real time performance of the
existing fully synchronous logic, but leveraging the fact that
in the abstract, the user is not supposed to rely on the implementation
details which allow them to predict the order in which things run.
The difficulty is that shared memory operations will no longer work
as they did before. Before, you didn’t know if a pair of writes to
shared memory occur before or after some reads in another fibre,
but you did know either both writes happened, then both reads,
or both reads, then both writes.

Once two pthreads run stuff instead of one, the reads and writes could
interleave. So fibres that do this kind of thing cannot be run on
a scheduler that uses a pool of 2 or more pthreads to execute stuff.

The problem is how to decide if we can get away with running certain
coroutines with a pool greater than size 1 or not.

My current proposal is to ignore this, and make the user responsible
for deciding :-)

In detail the user simply says “spawn extra worker threads” on a particular
pthread’s scheduler.

Actually it’s more difficult than that, because synchronous schedulers
(which do synchronous channel I/O) can be stacked. That is, the user
can invoke a scheduler:

run procedure;

starts a new scheduler. It is a *subroutine call* so the current scheduler
is suspended until it completes. This nested scheduler cannot
spawn pthreads or do asynchronous I/O (the system aborts if you try).

>
> Felix already has some concurrency. You can spawn pthreads. Async I/O
> is automatic already. There is a thread pool. And there are some special
> statements:
>
> The user should never spawn threads. The number of actual threads should be fixed by the hardware.

Spawning a pthread just means creating an asynchronously executed process for which
the local events are totally ordered but not synchronised with events of other pthreads.

I think you’re confused not me: fibres are completely synchronous.
Pthreads are not.

As it happens in Felix, when you spawn a pthread, it is an OS pthread.
That is, the OS pre-emptively multiplexes them and distributes the workload
among the CPUs.

The only problem with this model is performance. I AGREE with you that
it would be better if the user level threads (NOT fibres) were multiplexed
by the Felix OS, not the system OS, by using a fixed pool of OS pthreads.

But it is non-trivial to do, because asynchronously (concurrently) executing
threads of control synchronise with locks, and there are two uses:

(a) mapping short sequences of instructions into a single atomic sequence
(b) waiting for an asynchronous event

The type of use of a lock for serialisation, namely (a), is fine in a thread pool.
The type of use of a lock (b) for a wait, cannot be permitted. You cannot
have a thread in the pool lock up. So for this use of a lock, you have to dump
the logical pthread out of running state so the worker thread (logical CPU)
can do something else.

The problem with doing this is that one ends up *emulating* what the
OS already does more efficiently. The primary advantage of the thread
pool is that it reduces pre-emptions. However it INCREASES the cost
of synchronisation.

> This is why I think you are confusing hardware threads and async tasks.
>
> concurrently (p1, p2, p3,p4);
>
> does a fork/join on pthreads. There’s also
>
> I disagree, there should be exactly 8 threads running (remember that in a multi-core CPU, all cores are assigned to your process at the same time, so you either have 8 threads running on the CPU or zero).

Never heard that one before. AFAIK, the OS can do whatever it likes.
In fact Meltdown and similar side-channel attacks occur because this isn’t so.

Linux certainly doesn’t have any processes at all. The kernel doesn’t
know what a process is. Linux only has threads. The only difference
between a process and a thread on Linux is that processes happen
not to share any mutable address space. Not all Unix do this,
but Linux always has. Linux emulates processes, they’re actually
pthreads. So I don’t see how an SMT processor like this can even
obey the rule. AFAIK it runs random threads on the available cores,
including threads from other processes. Affinity was only introduced
recently as part of the scheduler, primarily to ensure the same thread
continued on the same CPU (for cache reasons). Unfortunately affinity
has negative side effects. If you spawn a thread with the same affinity,
interleaving is improved (cache misses reduced) but concurrency (real
time performance) is degraded. If you use a different CPU you get better
RT performance at the cost of reduced thruput.


> Using thread affinity you can assure that each thread runs on one core and does not move between cores.

OSX has no such concept. It is Linux specific.

>
> So this should create “p1" "p2" "p3" and "p4" as new async tasks and put them in the current CPU core work queue, so 4 tasks multiplex over 1 hardware thread.

It doesn’t. It runs them on 4 cores. Under OSX anyhow. So achieves a real time
performance improvement (at the cost of some loss of total thruput).



John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 26, 2018, 2:14:59 PM11/26/18
to felix-l...@googlegroups.com
Hi John,


The lesson from JavaScript / Node.js / Go is that we should prefer asynchronous co-operative context switching in user space with a fixed hardware thread-pool size, to using synchronous code with hardware context switching. You seem to agree with this.


I understand you are explaining how Felix currently works. I am considering how it should work ideally to allow it to compete efficiently with languages like Go. The way I would do this would be to utilise the channel read writes as async points, effectively yielding the CPU, allowing the co-routines to share the processor efficiently.

Regarding synchronization and locks, we should not really need to worry about this. A co-routine effectively runs as a sequence of asynchronous callbacks, and _must_ be non-reentrant. So if a co-routine only updates local state, we don't need any locks.

If channels are asynchronous, then we don't need locks for channels because they work effectively by message passing.

The only place you would need a lock would be when you append a new message to a channel that is shared by multiple writers, or read from a channel shared by multiple readers, but I think you can use atomic-swap operations on pointers to make this lockless.
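A sketch of that atomic-swap idea in C++ (illustrative names, not the Felix runtime): a lock-free LIFO push of a waiting continuation onto a channel, in the style of a Treiber stack. Note that a matching lock-free pop needs ABA protection, which is elided here.

#include <atomic>

struct Node {
    Node* next = nullptr;
    void* continuation = nullptr;   // the parked fibre
};

struct Channel {
    std::atomic<Node*> head{nullptr};

    // Push a waiting continuation without taking a lock.
    void push(Node* n) {
        Node* old = head.load(std::memory_order_relaxed);
        do {
            n->next = old;                        // link to the current head
        } while (!head.compare_exchange_weak(     // retry if another writer won
                     old, n,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }
};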

So I don't see any need for locks to support fully asynchronous co-routines.


Cheers,
Keean.




John Skaller2

unread,
Nov 26, 2018, 8:55:53 PM11/26/18
to felix google


> On 27 Nov 2018, at 06:14, Keean Schupke <ke...@fry-it.com> wrote:
>
> Hi John,
>
>
> The lesson from JavaScript / Node.js / Go is that we should prefer asynchronous co-operative context switching in user space with a fixed hardware thread-pool size, to using synchronous code with hardware context switching. You seem to agree with this.

Yes. However in Felix the programmer has the choice how to do things.

As mentioned, there are problems with user space context switching coupled
with shared memory access. If you use pure coroutines that communicate
only with channels, the problem doesn’t arise. As in Go, the main synchronisation
method is to use channels. However shared memory access must also be supported.

For coroutines, there is no issue: no locks are required because of the total
ordering guarantee. But this assurance does NOT extend to general
processes/threads/goroutines or whatever you want to call concurrent logical
threads of control. You need locks, transactional memory backed by lock free
techniques, or whatever.

With OS pthreads you can use locks, condition variables etc for shared memory
access synchronisation, but you CAN NOT freely do this when running logical
threads from a pool. If you have to use a lock, the lock has to be adapted for use
in the pool. As I explained a hard lock waiting for data will lock up a physical
pool thread forever if the supplier of the data is in the queue, because the queue
can’t be emptied because the thread that would empty it is locked up: deadlock.

Instead, to do a lock, the procedure being run has to be dumped out of the pool
and put away somewhere until the data is available. If the writer is then pulled
out of the queue and run, it too will lock up trying to send the data, so it has to
be dumped out of the pool as well. Then the user multiplexing system has to
arrange that the transfer is done and the locks released OUTSIDE the running thread
pool before putting the two procedures back into the queue, ready to go.
One might arrange this by actually queueing another job for the pool that does
the transfer. The point being the two user threads, reader and writer,
are not in the queue whilst they’re logically locked.

This is just an example: the point is that at least locking must now be
implemented in user space too. It’s not just a matter of a queue of jobs
which the physical threads pull off the queue and run until they’re
complete. There has to ALSO be machinery to handle locking
OUTSIDE the worker threads because it cannot be done inside
a worker thread.

The key point here is that ordinary OS pre-emptive threads already
handle this. Locks encourage the thread to be descheduled. If the OS
is smart and the locks are OS locks, releasing a lock may encourage
the OS to reschedule the job.

The OS can do this kind of thing MORE efficiently than user space
handling because it is designed to do so.

The only reason user space threads are faster is that they’re applied
in contexts where the full set of possible options can’t happen.
so the scheduler can be more efficient. If they can happen, the OS
scheduler is going to do a better job.

> I understand you are explaining how Felix currently works. I am considering how it should work ideally to allow it to compete efficiently with languages like Go. The way I would do this would be to utilise the channel read writes as async points, effectively yielding the CPU, allowing the co-routines to share the processor efficiently.

It won’t work because channel I/O is not the only way coroutines communicate.
They can share memory as well. Currently this does not require locking.

Also, the current scheduler doesn’t require yielding CPU or any kind of
synchronisation so it is MUCH faster than what you propose, that is,
the thruput is much higher.

The problem is the current scheduler doesn’t allow concurrent execution.
That is, although the thruput is maximal, the real time performance is not,
because (obviously) only one CPU can be utilised.

At present IF you can partition your job into separate threads,
meaning the coroutines in each thread do not communicate,
then you can get maximal thruput AND maximal real time performance.

Felix can and does provide a way to do that

pfor i in 1 .. 1000 do …

will in fact run N concurrent loops of size 1000/N, each in a separate
pthread. It is up to the user to ensure the code in the body of the loop
is such that there is no interaction between iterations. The compiler
doesn’t check.
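A rough sketch of what such a construct amounts to, written as plain C++ fork/join (the partitioning details and the pfor name are illustrative, not Felix's actual code generation):

#include <algorithm>
#include <thread>
#include <vector>

// Split the range over nthreads workers, run the chunks, then join.
// No checks are made that iterations are independent; that is the
// user's responsibility, exactly as described above.
template<class Body>
void pfor(int first, int last, unsigned nthreads, Body body) {
    std::vector<std::thread> workers;
    int total = last - first + 1;
    int chunk = (total + (int)nthreads - 1) / (int)nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        int lo = first + (int)t * chunk;
        if (lo > last) break;
        int hi = std::min(last, lo + chunk - 1);
        workers.emplace_back([=] {
            for (int i = lo; i <= hi; ++i) body(i);
        });
    }
    for (auto& w : workers) w.join();
}

// e.g. pfor(1, 1000, 4, [](int i){ /* loop body */ });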

Quite clearly this is a SPECIAL CASE.

So the problem is how to generalise the existing system to get better
real time performance, without sacrificing thruput.

The idea is to allow more than one thread to process a coroutine queue.
This will only work if the coroutines do not interact using shared memory.
As long as they use channels, its fine.

The code to implement this is almost complete. When you start a pthread
up it runs the coroutines without any locking. If you start additional
threads to service the queue, locking is switched on.

This is actually the easy part. The hard bit is making idle worker threads
hang around until the termination condition is met.

Previously the termination condition was simply that the current coroutine
was suspended and the scheduler had to find another one, only,
the queue was empty.

What happens then is that the scheduler looks to see if there are any
async tasks (socket I/O or timer alarms) pending. If so, it blocks
on the queue of completed async requests. (If the requests fail
all hell breaks loose!). When a request is complete, the relevant fibre
is pulled off the ready queue and put onto the active queue,
and the scheduler can start running the fibre.

The async queue is managed with a condition variable.
The sync queue doesn’t require any locking because there is only one thread
running the scheduler AT THE MOMENT.

To allow multiple threads to run the scheduler AND fetch completed async
requests back to the queue required the code to be modified.
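A minimal sketch of that arrangement (illustrative names, not the actual Felix classes): the demux thread pushes completed requests, and the scheduler blocks on a condition variable only when its synchronous working set is empty.

#include <condition_variable>
#include <deque>
#include <mutex>

struct ReadyQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<void*> ready;          // completed async continuations

    void push(void* fibre) {          // called by the async I/O (demux) thread
        { std::lock_guard<std::mutex> g(m); ready.push_back(fibre); }
        cv.notify_one();
    }

    void* pop_blocking() {            // called by the scheduler when it is idle
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [&]{ return !ready.empty(); });
        void* f = ready.front();
        ready.pop_front();
        return f;                     // caller puts it back on the active queue
    }
};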


> Regarding synchronization and locks, we should not really need to worry about this. A co-routine effectively runs as a sequence of asynchronous callbacks, and _must_ be non-reentrant. So if a co-routine only updates local state, we don’t need any locks.

Yes, we have to lock the channel I/O, but the user doesn’t have to worry about that.

The thing is coroutines DO update non-local state. THAT IS THE WHOLE POINT!

The whole point of coroutines is that they’re synchronous. It’s sequential programming.
Shared memory access is never contended so no locks are required.

Do you see the problem? The idea is to UPGRADE the system to run SOME coroutines
concurrently: the ones that you correctly identify only update local store.

Actually the condition is stronger: the actual condition is that no event of the
coroutine is visible externally. So for example printing to the console is externally
visible even though there is no non-local store access.

Actually it’s even worse: READING external store which could change can’t be
allowed either (in coroutines to be run concurrently).

If we can update Felix so AS AN OPTIMISATION it can safely run some coroutines
concurrently AND MAKE IT TRANSPARENT or nearly so, then we will crap
all over Go because we will have shared memory access WITHOUT LOCKING
AS WELL AS CONCURRENCY.

So the key problem is theoretical: given a set of coroutines how can you tell
which ones can be run concurrently? And if you can tell, how do you actually
do it?

The current model is simple: if ALL the coroutines can be run concurrently,
then use a thread pool, otherwise don’t.

However that model can be improved radically. I think the idea is to
examine the interaction graphs, and try to find an “isolated subgraph”
which can run concurrently.

But another idea is to provide concurrency WITH RESTRICTIONS.
One restriction would be that if you have two routines that make
matching I/O requests they’re run PAIRWISE SERIALLY.
That is, one has to run until it suspends before the other,
so they are not allowed to run concurrently.

Exactly how to implement this I’m not at all sure at the moment.
Pairwise serialisation looks very promising because it provides a
LOCAL TOTAL ORDERING GUARANTEE if you get my drift.

The data analogue of this exists: computers run segmented memory
under C. Objects, including arrays, use contiguous store, so the store
inside an object is totally ordered by address, however separate
(discrete) objects are unordered.

I don’t know what you call this ordering officially, but I call it
a local total order. Ordering is total locally, but not globally.

I suspect this is how to run coroutines with concurrency,
but I need to implement stuff and test it to actually see how it
works and how it performs. At present, I don’t know how to
implement local total ordering although I can think of ways to do it
that might work.

There is clearly some notion of “affinity” here. If two coroutines
communicate on a channel, they have the same CPU affinity,
which ensures they’re executed in a totally ordered manner
as required.

However, how to organise what you call “work stealing” whilst
preserving affinity is not clear.

The “affinity” we required is NOT TRANSITIVE.

The usual CPU affinity of pthreads IS TRANSITIVE.
If T1 has affinity A and requires T2 to have the same affinity,
and if T2 requires T3 to have the same affinity, then T1 and T3
have the same affinity.

This is no good. It is a well known problem with the idea of affinity
in OS design. Too much affinity and everything runs on the same CPU
and you lose the concurrency.

The problem is obvious in the thread pool running things model.
Another way to look at it is that if A < B where < means total order,
and B < C, then by definition, A < C. Total ordering is transitive.

We cannot allow that. We have to weaken the ordering requirements.
But for lock free channel I/O we have to have pairwise ordering,
otherwise we have to use a lock.

My current implementation says: if there is exactly one thread
servicing the list, don’t use a lock. Otherwise use a lock.
And note this only applies to channel I/O, not shared memory access.

So here is an idea: we use SUBSTRUCTURAL LOGIC.

In particular SEPARATION LOGIC. We already know how to use
separation logic to analyse memory access: Facebook actually
has software that can do it. For C and Javascript.

So the idea is roughly: if we can *separate* two coroutines, then
they can run concurrently, otherwise they’re coupled and have to be
run serially.




John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 27, 2018, 2:29:19 AM11/27/18
to felix-l...@googlegroups.com
Shelby3 has done some nice analysis of modern CPUs with many cores like AMD Threadripper (32cores, 64hyperthreads) and the Intel competitor (24cores, 48hyperthreads) and shared memory performance just does not scale.

We really need to copy data so each core works on a local copy for mutable data, or a cached version for immutable data, to get the real performance gains from many-core processors.

When shared mutable state no longer offers a performance advantage, and is harder to reason about, why would you use it? Why complicate a language by even including it if it offers no advantages?


Cheers,
Keean.

blu...@fi.muni.cz

unread,
Nov 27, 2018, 4:56:26 AM11/27/18
to 'Keean Schupke' via Felix Language
Hi,

'Keean Schupke' wrote:
> When shared mutable state no longer offers a performance advantage, and is
> harder to reason about, why would you use it? Why complicate a language by
> even including it if it offers no advantages?

I think the point you are missing is that coroutines (at least in the
form used in Felix) are not a mechanism for parallelism. (By the way,
I think John is using the term "concurrency" wrongly.) Coroutines are
mainly about control-flow. The fact that they sometimes can be run in
parallel is just an added bonus.

If you have several coroutines working together to solve a given task
(on a single CPU), it can make sense to use shared memory for
communication.

Achim

Keean Schupke

unread,
Nov 27, 2018, 5:03:25 AM11/27/18
to felix-l...@googlegroups.com
> If you have several coroutines working together to solve a given task
(on a single CPU), it can make sense to use shared memory for
communication.

Actually it does not, in my opinion. If there are never two coroutines running genuinely in parallel then the memory is never really shared. What you have is "sequential ownership": there is only ever one co-routine accessing the memory in any one "critical-section", assuming we have collaborative multiplexing of concurrent tasks (so no pre-emption). In this case we can imagine that we transfer ownership of the shared buffer when we transfer control to another co-routine via a channel. This can be literally by passing a pointer and length (memory extent) over the channel. If the pointer to the buffer has a linear type, then we enforce that only one co-routine at a time has access, and we have a zero-shared-memory system with no performance loss. There is no need for shared memory, and in fact unless you have real parallelism the memory is not shared in any case.
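A sketch of that "sequential ownership" idea using C++ move-only types as a stand-in for linear types (Buffer and Channel are illustrative names; the channel is a plain queue precisely because the two coroutines never run at the same time in this model):

#include <memory>
#include <queue>
#include <vector>

struct Buffer { std::vector<char> data; };

struct Channel {
    std::queue<std::unique_ptr<Buffer>> q;

    // The buffer travels by move: after write() the sender's pointer is null,
    // so it can no longer touch the memory it just handed over.
    void write(std::unique_ptr<Buffer> b) { q.push(std::move(b)); }

    // Precondition (sketch only): the queue is non-empty.
    std::unique_ptr<Buffer> read() {
        auto b = std::move(q.front());
        q.pop();
        return b;   // the receiver now owns the buffer exclusively
    }
};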

Cheers,
Keean.


John Skaller2

unread,
Nov 27, 2018, 5:52:21 AM11/27/18
to felix google


> On 27 Nov 2018, at 18:29, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Shelby3 has done some nice analysis of modern CPUs with many cores like AMD Threadripper (32cores, 64hyperthreads) and the Intel competitor (24cores, 48hyperthreads) and shared memory performance just does not scale.

I believe that, but it’s still necessary because small scale performance matters too.

Using shared memory is equivalent to channel communications, only the CPU does the
required inter-cache synchronisation with the help of barriers.

[The problem is that the design of modern CPUs is rubbish but that’s another story]


> We really need to copy data so each core works on a local copy for mutable data, or a cached version for immutable data, to get the real performance gains from many-core processors.

Copying only works for small data. Immutable data and functional
programming don’t work either. Linear or uniqueness typing probably helps.

>
> When shared mutable state no longer offers a performance advantage, and is harder to reason about, why would you use it? Why complicate a language by even including it if it offers no advantages?

One wouldn’t. But it does.

There are two pictures here:

(1) coroutines, total order, shared memory no problem because no locking,
limited to one thread. Logically its sequential programming. No concurrency.
Easy to reason about.

(2) processes, no shared memory, message passing only,
can run any number of threads. Concurrency. Effectively impossible to
reason about.

I call these systems CSC (Communicating sequential coroutines) and CSP
(Communicating Sequential Processes).

If you don’t believe my claim that CSP is too complex to reason about, go
do some research into it. Read Hoare’s paper (not the original, the formal
academic one: it’s quite readable).

The number of semantic rules is HUGE.
There is an algorithm for proving a CSP system does not deadlock.
It takes ages to run and took years to develop. It proves the system
is impossible for humans. (Note: it doesn’t tell you anything about
your system *except* that it doesn’t deadlock). We all know,
the only way humans can program is with divide and conquer:
we have to have modularity. CSP has none of that (in itself).

So I seek a middle ground. Something that is not much more complex
than CSC, allows some concurrency, but can still be reasoned about.

We know from large scale data systems, running stuff in the “cloud” with
hundreds to thousands of processors, the shared memory model isn’t tenable.
But on the smaller scale, it is not only tenable, its the only way to obtain
performance and small scale semantics, at least with modern processors.
And there’s no way known to write scalable code for “the cloud” anyhow:
certain problems, sure.

=============

Shared memory is essential for coroutines. The reason is not obvious.
Consider ordinary functional procedural code which uses a pipeline
of coroutines. Suppose you’re implementing a fold. So the data comes down
the pipeline, gets transformed and folded, and you get a stream of
partial folds out.

But you don’t want a stream, but a single value. The way to do this is to
cross back from the co-processing model to the sub-processing model.

You start by creating a scheduler, then you put the pipeline stuff
into it, then you CALL it. Note: it’s a subroutine call. The subroutine returns
when there’s no work to do.

Ok, so how do you get the answer out? The problem is that pipeline
transducers are infinite loops. The source is not. The source is passed
a list of stuff to fold on construction, injects the elements into the pipeline,
and just drops dead at the end. Note: NO TERMINATOR.

So the sink, what does it do? The answer is, you pass it a pointer to the
variable in the caller to hold the final result. Each partial result is stored
there, each one is overwritten by the next one. When the pipeline
stops working, the scheduler returns, and the result is in your local variable.

The coroutines in the pipeline don’t share memory, but the caller of the
pipeline and the sink of the pipeline do (via a pointer).

It has taken 20 years to find this solution. Coroutines process streams,
and streams are always

(a) finite
(b) unbounded

Contrary to the WRONG CLAIMS of functional programming people,
lists are

(a) infinite
(b) bounded

List tails end, streams do not. However lists are infinite, only their
tails are finite. Streams on the other hand must be finite or operations
on them couldn’t produce (observable) results.

Note that this sharing is completely safe and requires no locking!
This is because the caller is suspended whilst the result variable is
modified. The variable is shared, but NOT at the same time.

Pipelines are the dual of monads. The point here is to run the
dual (pipeline) instead of the monad. We want both models to
co-exist.

Stream processing pipeline *transducers* can almost always be run
concurrently since they ONLY do message passing via channels.
But the source and sink are a bit different: together with the
“run” subroutine, they act as a bridge across the model boundary.

My point is the ONLY viable solution to general problems *necessarily*
involves an interaction of the two models. FP is spatial and atemporal,
coroutines are temporal and aspatial (if I can invent a word).
You need both space and time to solve problems. You HAVE to use
both models together.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 27, 2018, 6:05:25 AM11/27/18
to felix google


> (By the way,
> I think John is using the term “concurrency" wrongly.)

I use the term literally, NOT in the modern sense in which
parallelism and concurrency are distinguished, because the
distinction is vague and lacks mathematical precision.

Concurrent literally means “at the same time”.


> Coroutines are
> mainly about control-flow.

In *sequential programs*. That is, singly threaded code.
There are multiple *logical* threads of control (called fibres),
but only one physical thread. This means a single monotonic
clock is enough to order all events. The semantic rules don’t
specify the order deterministically, but whatever the order is,
it is total.

The ordering constraint is enough to ensure no locking
is required for synchronisation: you don’t need to serialise
things because they’re serial already. You do need to make
sure that in the middle of an “atomic” operation consisting
of multiple steps you don’t yield control.


> The fact that they sometimes can be run in
> parallel is just an added bonus.

Yes, however Keean is not entirely off the mark either.
I want to go past just “a lucky optimisation”. In the first
instance I need rules to know when the optimisation has
no impact on the observable operation.

However it would be nice then to widen the scope
of the optimisation by weakening the total ordering
requirement.

Ultimately, it would be nice if the structure, in extreme form,
just magically turned into processes.

In other words I want a bridge between coroutines and processes
which can be easily (if not transparently) moved.

In this sense Keean is correct to consider that extreme end point
as well.

But the key thing is to find a way to span the gap.
If we can do that, Go is dead. Because it can’t.
It doesn’t even collect stalled processes.
In theory, there is no alternative, you must collect them
because there’s no way to decide from inside the program
when to kill them. Go tries, it is a shambles. I know people that
write Go. They have to use “backchannels” to synchronise
and arrange termination. The design of Go is wrong.


John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 27, 2018, 6:06:01 AM11/27/18
to felix-l...@googlegroups.com
Lots of stuff there, we need to agree the basics before wasting time in the details.

If we pass a linear type over the channel between co-routines we are guaranteed that only one co-routine can access the memory at one time. There is no sharing.

How is this slower than using shared memory?

Keean.


John Skaller2

unread,
Nov 27, 2018, 10:50:13 AM11/27/18
to felix google


> On 27 Nov 2018, at 22:05, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Lots of stuff there, we need to agree the basics before wasting time in the details.
>
> If we pass a linear type over the channel between co-routines we are guaranteed that only one co-routine can access the memory at one time. There is no sharing.
>
> How is this slower than using shared memory?

It isn’t, but it isn’t sharing memory, it’s transferring ownership.

There’s at least one case, the one I already described which uses a nested scheduler,
where it can’t work because you cannot send data down a channel to the caller
of a nested scheduler from a coroutine running in the nested scheduler, because the caller
is suspended until the scheduler terminates.

Within a single scheduler it MAY be equivalent, except you will need to send extra data
down the channel.

Of course with coroutines you don’t have to bother doing this. Your argument is,
we can have concurrent processes by sending the shared data around along
channels, so we don’t need to share memory via copies of pointers.

Clearly the code structure would be different.

Apart from the nested scheduler issue, there’s another issue I think:
when you have a product type, and you share one component of it
with one routine and another component with another. That’s easy
if the routine is a coroutine and you use pointers for access.

I don’t know if the uniqueness typing system can handle that.



John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 27, 2018, 11:23:43 AM11/27/18
to felix-l...@googlegroups.com
On Tue, 27 Nov 2018 at 15:50, John Skaller2 <ska...@internode.on.net> wrote:
> How is this slower than using shared memory?

It isn’t, but it isn’t sharing memory, it’s transferring ownership.

Right, but the semantics are more precise, and should allow better optimisation.
 
There’s at least one case, the one I already described which uses a nested scheduler,
where it can’t work because you cannot send data down a channel to the caller
of a nested scheduler from a coroutine running in the nested scheduler, because the caller
is suspended until the scheduler terminates.

You write the data to the buffer (shared memory) and then send the ownership back when you resume the caller. I don't see a problem here: you are not writing the data into the channels, you are writing the data into a "shared" buffer exactly as you would now. The caller is suspended so it cannot see any of these changes, nor access the "shared" memory at the same time, so transferring ownership back to the caller when it resumes is not going to cause any change in behaviour, and because it is a tighter constraint it permits more optimisations in the compiler.

Within a single scheduler it MAY be equivalent, except you will need to send extra data
down the channel.

As I said you don't send any extra data, you just send a 'capability' which gives access to the memory buffer.
 
Of course with coroutines you don’t have to bother doing this. Your argument is,
we can have concurrent processes by sending the shared data around along
channels, so we don’t need to share memory via copies of pointers.

Not quite, my point is we can have mutable memory buffers that we can treat as local to each co-routine, and transfer the ownership of the buffers along the channels by using linear types. This way we are guaranteed only one co-routine can access each buffer at a time, and we therefore don't need locks on the buffers. You just need a way of sending an uncopyable ownership token over a channel.

Clearly the code structure would be different.

It would be similar, but you would want to keep all your mutable state in a data structure that you can transfer ownership of, rather than just mutating data in scope using ambient authority.
 
Apart from the nested scheduler issue, there’s another issue I think:
when you have a product type, and you share one component of it
with one routine and another component with another. That’s easy
if the routine is a coroutine and you use pointers for access.

It would be the same here, but you need to make sure you have an array of linear pointers, not a linear pointer to an array :-)
 
I don’t know if the uniqueness typing system can handle that.

I remember that uniqueness types are not a full implementation of linear typing? The critical bit is that when you write a pointer to a channel, you must not be allowed to re-use the pointer afterwards (the local copy of the pointer must be invalidated when the pointer is sent over a channel).


Cheers,
Keean.

John Skaller2

unread,
Nov 27, 2018, 4:22:31 PM11/27/18
to felix google


> On 28 Nov 2018, at 03:23, Keean Schupke <ke...@fry-it.com> wrote:
>
>
> On Tue, 27 Nov 2018 at 15:50, John Skaller2 <ska...@internode.on.net> wrote:
> > How is this slower than using shared memory?
>
> It isn’t, but it isn’t sharing memory, its transfering ownership.
>
> Right, but the semantics are more precise, and should allow better optimisation.

Yes, it seems good when appropriate. With stream processing I expect it would
be applicable a lot of the time.

One thing I wrestle with is that “rvalues” or “expressions” are intrinsically unique.

The other is the mentioned problem of “deep” vs “shallow” copy and move.
In typical FPL with immutable data, a theorem says a pointer to T is isomorphic to T.
Because of this, a stack is a list, and, you can use pointers (boxes), moves and
copies are equal, you get sharing, and, generally objects (contiguous) and
data structures (linked objects) are equivalent.

As soon as you introduce mutable store, the equivalences are lost.
However modifications are much faster because they don’t require rebuilding
a whole new data structure.

So now, what can linearity do? I’m still confused about linear and uniqueness
typing but I’m beginning to think what happens is that

(a) only pointers can be unique
(b) uniqueness propagates to the whole data structure

For example a pointer to a list node, if unique, implies any part of the list
can be modified, not just the head node. To say this more technically
you own EVERYTHING you can reach through the pointer.

Also suggested: weaker than R/W ownership: multi-read, single write.

> You write the data to the buffer (shared memory) and then send the ownership back when you resume the caller. I don’t see a problem here,

I will describe it again then: you cannot send ownership back when you resume the caller.
You cannot do anything in the writer when you resume the caller. The writer is stalled
waiting for more data to come in, data which will never come in. The writer doesn’t
know the stream has ended. It isn’t allowed to know, and it isn’t possible to know.
There is no bound on the stream, that is, there’s no end marker.
Streams are finite (they end) but they’re unbounded (you cannot detect the end if you’re processing it
with a coroutine).

The exhaustion of the stream is detected by the scheduler *indirectly*, and *as an optimisation*
in some sense: it is detected by observing there is no running coroutine and no coroutines
in the active list. The scheduler doesn’t know what the coroutines it was running were doing.

So the scheduler returns, resuming its caller in the usual way of a subroutine.
The sink object in the pipeline is unreachable. It’s dead. If the GC has run, it doesn’t exist.

This is the core logic of coroutines. It’s why Go is wrong and Felix is right.
It is the core problem I solved: how do you make an observation of the complete
result of a stream process?

The stream process cannot make a single valued terminal observation.
It can only make a single valued initial observation.

Functional list processing can only make terminal observations.
They cannot make initial observations (because they start with the tail
of a list, not the whole list).

There is a powerful diagram here: hard to draw but

list -> stream -> transducers -> sink -> list

It is isomorphic to a monadic chain of binds.


We start with a LIFT, lifting a list to a stream, and end with a DROP, dropping
the stream back to a (reversed) list. If the transducers are folding instead of mapping

A nested scheduler with a sink using a pointer to the caller state implements DROP.
LIFT of a purely functional list is the opposite, which turns out to be just closing the
source coroutine with a list parameter over the list argument.

I can show you the actual code:

chip source_from_list[T] (a:list[T])
  connector io
    pin out: %>T
{
  for y in a perform write (io.out,y);
}

chip function[D,C] (f:D->C)
  connector io
    pin inp: %<D
    pin out: %>C
{
  while true do
    var x = read io.inp;
    var y = f x;
    write (io.out, y);
  done
}

chip sink_to_list[T] (p: &list[T])
  connector io
    pin inp : %<T
{
  while true do
    var x = read (io.inp);
    p <- Cons (x,*p);
  done
}

and to use it to do a map:

~/felix>cat tmp.flx
var result = Empty[int];
fun square (x:int) => x * x;
run (source_from_list ([1,2,3,4]) |-> function square |-> sink_to_list &result);
println$ result;
~/felix>flx tmp.flx
list(16, 9, 4, 1)

What I’ve been trying to say isn’t just a theory. It’s real working code.
The above example shows plainly that you HAVE to use a sink with
a pointer to the store of the caller of the scheduler to transfer the
result of a stream process run in I-SPACE back to R-SPACE.

I is “imaginary space” which is temporal, it consists of events and has
limited local store. R-space is spatial, lots of memory, but only limited time.

R space is your usual functional space. Functions have limited time to return
a result. If they can’t they’re not functions. Functions have to terminate
to produce a terminal observation (encapsulated in the codomain type).

Coroutines are temporal, they have limited storage, but they run forever,
just as in R space data lives forever (persistent).

Do you see the pattern here? Coroutines don’t terminate. They’re
infinite loops. But the streams they process are finite, They end,
but they’re NOT BOUNDED. More precisely they have an initial bound
but no final one. Wheres in I-space we have inductive types that have
no initial bound, but a terminal one.

This is the ESSENTIAL SYMMETRY of computer programming.
Functional programming is impossible. All programs MUST be a mix
of I and R space. It’s the only way to get bounds at both ends,
in other words, observations. Computer science is SCIENCE;
it isn’t declarative. It isn’t a static model, it’s active. You run a simulation
and you make measurements.

An observation or measurement is a TRANSITION between I and R space.

in the code above the transition is done by

(a) preparing an empty list
(b) binding a pointer to it into the sink
(c) binding the initial list into the source
(d) calling the nested scheduler subroutine run

Steps (a)-(d) occur in R-space, the pipeline itself runs in I-space.

The transition cannot use channels. That’s the point. It uses parameter
passing on the LIFT transition; that is, a read pointer,
and a write pointer on the DROP.

Notice that the sink could save the list and write the whole list
every time, so a write-only pointer is enough. Similarly the
startup only requires a read-only pointer.

In the above process, the pointer transfers transfer ownership
*temporarily*. Uniqueness types or linear typing is not required here
because it’s assured by the fact the nested scheduler is a SUBROUTINE.

As a final note, in Felix the “program” you run is actually a coroutine.
So you actually start off in I-space not R-space.

In other words, Felix is a control flow language. Functional code,
or expressions, are nested in imperative statements.

The key thing here is that this is imperative AND functional programming
mixed. It’s the right way to do imperative programming because it’s
dual, or somehow adjoint, to the functional model.

The big problem is that the types that explain the events are NOT
isomorphic to finite state machines as inductive types with
equi-recursive types are. Hoare developed type systems for CSP,
and coroutines are a subset of CSP where “concurrent processes”
are replaced by “indeterminately ordered sequential coroutines”.

Type systems describing channel interactions are called session types.

The Felix model is a subset of CSP. It’s a *tractable* subset.

So now, I want to extend it.

FIRST by using concurrency as an optimisation. No change to any logic.

SECOND by weakening the total ordering constraint.

We do not want full CSP. We have a better method: we can switch between
R space and I space instead.

This is my innovation. I know how to switch.

I will also say: a unified model of the whole R+I space cannot be built
with a functional model. The core entities are routines and continuations.
Routines can be called but they do not return. They can, however,
do a tail call of a continuation passed as an argument: continuation
passing style.

It’s known how to model functional code with CPS. It’s also pretty obvious
how to model coroutines.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 27, 2018, 4:34:33 PM11/27/18
to felix google

> I don’t know if the uniqueness typing system can handle that.
>
> I remember that uniqueness types are not a full implementation of linear typing? The critical bit is that when you write a pointer to a channel, you must not be allowed to re-use the pointer afterwards (the local copy of the pointer must be invalidated when the pointer is sent over a channel.
>

Uniqueness types and linear types are categorical duals.
[Can’t find a reference at the moment]

Linear typing only applies to functions, uniqueness typing only applies to locations.

Rust uses uniqueness types; I believe the implementation is complete and sound.

Haskell is going to use linear types because it works best with Haskell’s key
optimisation: fusion. According to what I hear and SPJ says.


John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 27, 2018, 6:13:21 PM11/27/18
to felix-l...@googlegroups.com
As long as the buffer you are writing to in the sink has a linear/uniqueness type, then it should be okay. It doesn't matter whether it is passed to the function, procedure or co-routine via an argument or a channel.

Keean.


John Skaller2

unread,
Nov 27, 2018, 8:31:21 PM11/27/18
to felix google


> On 28 Nov 2018, at 10:13, Keean Schupke <ke...@fry-it.com> wrote:
>
> As long as the buffer you are writing to in the sink has a linear/uniqueness type, then it should be okay. It dosn’t matter whether it is passed to the function, procedure or co-routine via an argument or a channel.

I think you’re missing the issue. With coroutines as it stands, the DROP operation
I showed doesn’t need a uniqueness type because it’s storing the data atomically
and the owner of the store can’t access it until the coroutine is dead.

For channel I/O uniqueness may or may not be essential but it provides
an assurance the value is stable, so the reader can access it at any time.

This matters because the writer could be resumed() before the reader.
In Felix at the moment the reader is always resumed() first precisely
to give the reader a chance to do something with the value before
it can change.

That’s the current state. The ISSUE is when you introduce concurrency,
Then uniqueness isn’t enough. The transfer in the DROP in particular
has to be atomic.

That’s the problem. You cannot use locks in coroutines, and you don’t need them.
So the question is, if we elevate some coroutines to processes and they’re not
pure .. for example the DROP operation done by the sink using a pointer,
then how can we be sure the operation is atomic?

You see, at the moment, you can write a product one component at a time
with any delays in between because you can’t be pre-empted. Nothing can
intervene, so any sequence of operations (between synchronisation points such
as channel I/O) is automatically atomic.

Once we allow the code between sequence points of two routines to run
concurrently the interleaving of reads and writes on shared memory
is no longer assured to be serialised as it was before.

The point is that since this IS the actual case, we have to ensure
coroutines involved in potential races do not get run concurrently.

At present, just because a sink has a pointer to some store, doesn’t
mean the compiler knows where it points.

However, it may not be necessary. It would seem to be enough if we use
separation logic to prove that two coroutines are separated from each other.


John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 28, 2018, 4:44:27 AM11/28/18
to felix-l...@googlegroups.com
Thoughts inline:

On Wed, 28 Nov 2018 at 01:31, John Skaller2 <ska...@internode.on.net> wrote:


> On 28 Nov 2018, at 10:13, Keean Schupke <ke...@fry-it.com> wrote:
>
> As long as the buffer you are writing to in the sink has a linear/uniqueness type, then it should be okay. It dosn’t matter whether it is passed to the function, procedure or co-routine via an argument or a channel.

I think you’re missing the issue. With coroutines as it stands, the DROP operation
I showed doesn’t need a uniqueness type because it’s storing the data atomically
and the owner of the store can’t access it until the coroutine is dead.

This is no longer true with parallelism (more than one thread running this). The uniqueness type on the buffer ensures that no other function or coroutine has access to the buffer at the same time, hence it makes it safe to use this in a parallel context.
 
That’s the current state. The ISSUE is when you introduce concurrency,
Then uniqueness isn’t enough. The transfer in the DROP in particular
has to be atomic.

It is atomic because you should not allow pre-emption, so having entered the co-routine with a linear/unique type we are guaranteed nothing else can access that buffer until we transfer ownership of the buffer over a channel, or return the buffer from the function.
 
That’s the problem. You cannot use locks in coroutines, and you don’t need them.
So the question is, if we elevate some coroutines to processes and they’re not
pure .. for example the DROP operation done by the sink using a pointer,
then how can we be sure the operation is atomic?

No locks, because the type system have made the guarantees, we don't need locks.
 
You see, at the moment, you can write a product one component at a time
with any delays in between because you can’t be pre-empted. Nothing can
intervene, so any sequence of operations (between synchronisation points such
as channel I/O) is automatically atomic.

Pre-emption does not matter if nothing else can access the buffer in the critical region. In effect a linear/uniqueness type guarantees exclusive access, hence is the static equivalent of a lock. In theory we only ever need locks when we cannot statically prove linear access.
 
Once we allow the code between sequence points of two routines to run
concurrently the interleaving of reads and writes on shared memory
is no longer assured to be serialised as it was before.

But if the type is linear, none of that code is allowed to touch the buffer, so we don't care.


Keean.

John Skaller2

unread,
Nov 28, 2018, 7:20:59 AM11/28/18
to felix google


> On 28 Nov 2018, at 20:44, Keean Schupke <ke...@fry-it.com> wrote:
>
> Thoughts inline:
>
> On Wed, 28 Nov 2018 at 01:31, John Skaller2 <ska...@internode.on.net> wrote:
>
>
> > On 28 Nov 2018, at 10:13, Keean Schupke <ke...@fry-it.com> wrote:
> >
> > As long as the buffer you are writing to in the sink has a linear/uniqueness type, then it should be okay. It dosn’t matter whether it is passed to the function, procedure or co-routine via an argument or a channel.
>
> I think you’re missing the issue. With coroutines as it stands, the DROP operation
> I showed doesn’t need a uniqueness type because its storing the data atomically
> and the owner of the store can’t access it until the coroutines is dead.
>
> This is no longer true with parallelism (more than one thread running this). The uniqueness type on the buffer ensures that no other function or coroutine has access to the buffer at the same time, hence it makes it sage to use this in a parallel context.

See below. No it doesn’t. Uniqueness of several pieces doesn’t assure serialisation
of their use.

Shared memory accesses do not have to consist of a single piece
of data, they often consist of several operations. For example
it may involve a swap. Swaps are not intrinsically atomic.

>
> That’s the current state. The ISSUE is when you introduce concurrency,
> Then uniqueness isn’t enough. The transfer in the DROP in particular
> has to be atomic.
>
> It is atomic because you should not allow pre-emption, so having entered the co-routine with a linear/unique type we are guaranteed nothing else can access that buffer until we transfer ownership of the buffer over a channel, or return the buffer from the function.

You are assuming the buffer transfer is atomic. It may not be.
Even if a pointer store is atomic, we may not be transferring one buffer
via a pointer.

I understand your argument. It just isn’t sound because it is limited by
the assumption an operation is intrinsically atomic. That may be the case
for transferring a pointer. Whether it is unique or not is irrelevant to atomicity
of the operation.

You can look at it this way: if I had to transfer two pointers from a writer
to a reader, then the reader mustn’t start reading until both are transferred.
So the store of the two pointers is a critical section. If the reader and
writer execute concurrently, the store must be serialised.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 28, 2018, 7:21:57 AM11/28/18
to felix google
>
> But if the type is linear, none of that code is allowed to touch the buffer, so we don’t care.

Yeah we do: the *ownership* has to be transferred atomically.
Uniqueness is not enough.


John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 28, 2018, 7:52:07 AM11/28/18
to felix-l...@googlegroups.com
Hi John,


Actually linearity guarantees this, it's what "linear" actually means. We can draw a line between each access in time, and in the graph there is only ever one entry edge and one exit edge to each access.

So linearity does guarantee this from a type system perspective.

Of course to be sound your semantics have to match this, and this does require transfer of ownership to be atomic. There are many ways to ensure this, the simplest is just to make it a function of your scheduler.

Another is to use the x86 atomic operations, and using SIMD registers you can atomically swap up to 256bits these days.
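As an illustration (not Felix code), an atomic single-slot ownership handoff: the writer publishes a buffer pointer with one atomic exchange and the reader claims it by swapping a null pointer back in. Buffer and the slot name are placeholders.

#include <atomic>
#include <cassert>

struct Buffer { /* payload elided */ };

std::atomic<Buffer*> slot{nullptr};

void give(Buffer* b) {                        // writer side
    Buffer* prev = slot.exchange(b, std::memory_order_release);
    assert(prev == nullptr);                  // protocol: slot must have been empty
}

Buffer* take() {                              // reader side: nullptr if nothing yet
    return slot.exchange(nullptr, std::memory_order_acquire);
}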


Keean.


John Skaller2

unread,
Nov 28, 2018, 7:56:09 AM11/28/18
to felix google
The thing is you have to look at the actual code.

Schannel transfers have to be serialised in the presence of concurrency.
It has nothing to do with the data transfers: schannels transfer a single machine
address, which is probably atomic. There are no type issues, only one type can
be transferred.

The problem is any concurrently executing process that owns the channel
can read and write on it. There is no constraint for a single reader and
single writer.


void sync_sched::do_sread()
{
  muxguard dummy(active_lock);
  readreq_t * pr = (readreq_t*)request->data;
  schannel_t *chan = pr->chan;
  if(debug_driver)
    fprintf(stderr,"[sync: svc_read] Fibre %p Request to read on channel %p\n",ft,chan);
  if(chan==NULL) goto svc_read_none;
svc_read_next:
  {
    fthread_t *writer= chan->pop_writer();
    if(writer == 0) goto svc_read_none; // no writers
    if(writer->cc == 0) goto svc_read_next; // killed
    readreq_t * pw = (readreq_t*)writer->get_svc()->data;
    if(debug_driver)
      fprintf(stderr,"[sync: svc_read] Writer @%p=%p, read into %p\n",
        pw->variable,*(void**)pw->variable, pr->variable);
    if (pr->variable && pw->variable)
      *(void**)pr->variable = *(void**)pw->variable;
    if(debug_driver)
      fprintf(stderr,"[sync: svc_read] current fibre %p FED, fibre %p UNBLOCKED\n",ft, writer);

    // WE are the reader, stay current, push writer
    // onto active list
    active->push_front(writer);
    collector->add_root(writer);
    show_state();
    return;
  }

The code for a write operation is identical, except reader and writer roles are swapped.

You have to appreciate that an schannel is a pair of lists of continuations,
one list of readers, one list of writers, and an assurance that only one of the two
lists is populated.

The RAII lock at the start of the routine accepts a pointer to a mutex.
If the pointer is NULL, there’s no locking.

If the scheduler is run by one thread, there’s no locking; if there
are two threads, locking is enabled. It’s done dynamically.




John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 28, 2018, 7:59:39 AM11/28/18
to felix-l...@googlegroups.com
So that should be good enough, right? If a write to the channel is atomic (due to locking) and a read from the channel is atomic, then you guarantee linearity of any pointers written, provided the pointer is invalidated after the write.

Keean.


Keean Schupke

unread,
Nov 28, 2018, 8:06:59 AM11/28/18
to felix-l...@googlegroups.com
Another point to consider is that if the type-system forbids a certain operation, like accessing a pointer after it's written to a channel, then code that violates this will not compile. So we can guarantee that as long as you don't hand-code assembly language, you cannot break the rules, and hence do not need a runtime check. So although it may be possible to break the atomicity of the transfer of ownership, as long as the type system catches the violations and prevents them from compiling, the system is sound.

Keean.

John Skaller2

unread,
Nov 28, 2018, 8:20:49 AM11/28/18
to felix google


> On 28 Nov 2018, at 23:59, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> So that should be good enough right? If a write to the channel is atomic (due to locking) and a read from the channel is atomic, then you guarantee linearity of any pointers written in providing the pointer is invalidated after write.

The library write copies the data to be transferred onto the heap and passes the pointer.
Linear typing isn’t required; the pointer is forgotten by the library.

The problem is the copy is only shallow. Linearity doesn’t address this issue
because the theory is developed for the functional programming model,
where it doesn’t matter because you don’t copy anything anyhow due
to boxing and immutability.

Felix has uniqueness types, and it seems fine for some uses.
The enforcement is only first order (it cannot track pointer aliases).
It is also only an ownership transfer contract, it relies on the claim
of ownership by the sender being valid.

So that’s the first issue. But channel I/O is only one issue.

The shared memory issue is like this:

For coroutines running on a scheduler with a single thread, you can do this, roughly:

yield();
p1 <- v1;
dostuff();
p2 <- v2;
stuff();
yield();

The transfer of v1 and v2 is atomic. IF the two pointers point into the same
coroutine’s locals, then that coroutine will observe either that it has
both v1 AND v2, or neither.

Throw in concurrency and the guarantee of atomicity by serialisation
is lost unless the stores are guarded somehow.

There is no obvious way to do the guarding. The code above doesn’t
actually know where the two pointers point. So there is ONLY one
option here: disqualify this routine from running on a scheduler
that uses more than one thread. We also have to disqualify any
coroutines that own the store to which the pointers point.

Clearly in any application, if we needed to do something *knowing
we were writing processes not coroutines* we wouldn’t be using
unguarded shared memory in the first place.

The problem is, it is not enough to disallow pre-emptions for the
code to work; we have to disallow concurrency during the critical
section (between the two yields), but that is undefined for the
receivers of the values .. we cannot see them in the code above.

Pointers (and schannels) isolate code into modules.
Pointers are not very good when you have concurrency.
But they’re extremely good for serial programming.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 28, 2018, 8:34:20 AM11/28/18
to felix google


> On 29 Nov 2018, at 00:06, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Another point to consider is that if the type-system forbids a certain operation, like accessing a pointer after its written to a channel, then code that violates this will not compile. So we can guarantee that as long as you don't hand code assembly language, you cannot break the rules, and hence do not need a runtime check. So although it may be possible to break the atomicity of the transfer of ownership, as long as the type system catches the violations and prevents them from compiling the system is sound.

Yes. That isn’t the case in Felix though.

Uniqueness types provide a contract, not a global guarantee:

fun f (p: uniq (+char)) => // uniquely owned C array required

var x = c"Hello";
.. f x .. // TYPE ERROR

var y = box (c"Hello"); // box makes a value uniquely typed
.. f y .. // OK
y = box (c"World"); // OK

The last assignment is valid because the previous value has been moved out.

This level of the typing works. Note that the user has to CLAIM ownership
here with the box() operator. The system does not ensure this claim
is valid. All it ensures is that when you try to call a function with a uniq
parameter type, you give it a uniq value.
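Loosely analogous, in C++ terms (this is just an illustration of the same move-out contract, not Felix's mechanism): std::unique_ptr lets you reassign a variable after its previous value has been moved out, but a raw alias escapes the checking entirely.

#include <cstring>
#include <memory>

std::unique_ptr<char[]> box_str(const char* s) {  // analogous to box(): claim unique ownership
  auto p = std::make_unique<char[]>(std::strlen(s) + 1);
  std::strcpy(p.get(), s);
  return p;
}

void f(std::unique_ptr<char[]> p) { /* consumes the uniquely owned array */ }

int main() {
  auto y = box_str("Hello");
  f(std::move(y));            // OK: ownership moved out, like passing a uniq value
  y = box_str("World");       // OK: the previous value was moved out

  char* alias = y.get();      // but a raw alias is untracked: all bets are off
  (void)alias;
}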

If you take the address of a uniquely typed variable, all bets are off.
Felix cannot track loads and stores via a pointer. It’s a reasonable
compromise: there is a guarantee, namely that Felix validates argument-passing
contracts, but the guarantee is not global.

I think that is reasonable, however. What is assured is clear:
there is a requirement to follow protocol/usage rules, but
there is also enforcement.

However I do NOT like the operation much.

The problem is that “rvalues” are intrinsically unique.
All expressions are unique, no matter what their type.
Only variables can fail to be unique. But my type system
doesn’t work like that.

The right way is to assume everything is unique and annotate
variables that are shared. This is what Rust does (modulo
the copyability type class).

On the other hand, linear functions are not really that useful in a language
like Felix that is strongly control flow oriented.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 28, 2018, 4:34:49 PM11/28/18
to felix google
So the last commit actually should be enough to safely run suitable coroutines
concurrently, if used right. You can spawn a new thread to run on the SAME
scheduler queue by calling

spawn_process routine;

instead of

spawn_fthread routine;

which starts a new scheduler.

You cannot do async I/O at the moment. I will get to that in a second.
You cannot use shared memory, the coroutines have to be pure.
If you do side effects like writing to standard output, the order will be
indeterminate.

The code does not handle termination correctly. Assuming no async I/O,
one of the threads will exit if there is no work for it to do. This is wrong
because another thread could be working and create new work that
the thread could do. But it is good enough to do some measurements
by using

schedule_fthread proc

to push jobs onto the end of the queue so they don’t execute
whilst we’re pushing them (spawn_fthread runs the spawned fibre
immediately). This gives us time to spawn a process thread to add to
the pool of threads running the jobs.

I haven’t tested it yet :-)

However there is a serious problem and it is the result of an existing
serious problem that I have been glossing over :-)

The user can create a scheduler and call it with the run() procedure.
Run() is an ordinary subroutine, it returns when the nested scheduler
terminates. However you cannot do async operations on the nested
scheduler. You cannot do async I/O, and you cannot spawn new pthreads.

The reason is that the scheduler is split into two pieces. There is a subroutine
which does synchronous scheduling only: channel reads and writes,
spawning of fibres (fthreads).

If a request is made that the synchronous scheduler cannot do,
it suspends and returns the request to the asynchronous scheduler.

The problem is, this only works for the base scheduler of a pthread.
A nested scheduler is run as a subroutine and the returned delegation
request is not honoured; I think the program just aborts if you try it.

The problem is that when you do a run(), you are doing
it from within a single coroutine running on ONE of the threads
of the asynchronous scheduler. So the nested scheduler is again
single threaded. I have upgraded the code, so the nested scheduler’s
queue now contains an async read queue, but this is not enough.

The way async I/O works is that when a coroutine does an async
request the async scheduler checks if a demux thread is running,
and if not starts one. The demux thread is an event monitor using
epoll, kqueue, etc, so it can do socket I/O asynchronously.

In the upgraded model, there can still be a demux thread for
every pthread, even if two pthreads are sharing the job queue.
Both threads will put ready routines on the same ready queue;
this should be OK because it’s all using condition variables and locks
(the queue is multi-producer multi-consumer thread safe).
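For reference, a minimal sketch of that kind of multi-producer multi-consumer queue, using one mutex and one condition variable (an illustration only, not the actual RTL queue):

#include <condition_variable>
#include <deque>
#include <mutex>

template<class T>
class mpmc_queue {
  std::deque<T> q;
  std::mutex m;
  std::condition_variable cv;
public:
  // Any number of producers may call this concurrently.
  void push(T v) {
    { std::lock_guard<std::mutex> g(m); q.push_back(std::move(v)); }
    cv.notify_one();              // wake one waiting consumer
  }
  // Any number of consumers may call this; blocks until an item arrives.
  T pop() {
    std::unique_lock<std::mutex> g(m);
    cv.wait(g, [this]{ return !q.empty(); });
    T v = std::move(q.front());
    q.pop_front();
    return v;
  }
};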

So IF a nested scheduler could do async I/O, it would no longer
be assured to get the requestor back, EVEN THOUGH it is only being
run by one pthread.

The problem is not just that the whole design was for a single thread,
but that the nested scheduler could never handle async operations
in the first place, because I see no natural model of what to do with
async ops.

I can’t just upgrade run to use an async scheduler, because an
async scheduler object is assumed 1-1 with a pthread.



John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 28, 2018, 6:16:28 PM11/28/18
to felix-l...@googlegroups.com
More thoughts inline:

On Wed, 28 Nov 2018 at 13:20, John Skaller2 <ska...@internode.on.net> wrote:


> On 28 Nov 2018, at 23:59, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> So that should be good enough right? If a write to the channel is atomic (due to locking) and a read from the channel is atomic, then you guarantee linearity of any pointers written in providing the pointer is invalidated after write.

The library write copies the data to to be transfered onto the heap and passes the pointer.
Linear typing isn’t required, the pointer is forgotten by the library.

The problem is the copy is only shallow. Linearity doesn’t address this issue
because the theory is developed for the functional programming model,
where it doesn’t matter because you don’t copy anything anyhow due
to boxing and immutability.

Ownership needs to form a DAG, and all the pointers need to be owning. Then sending the root of the DAG through a channel is enough to transfer ownership of the whole structure. The issue is when you have non-owning pointers into that DAG to allow access. So when sending over a channel we have to make sure that no pointers are held after the channel write.
 
Felix has uniqueness types, and it seems fine for some uses.
The enforcement is only first order (it cannot track pointer aliases).
It is also only an ownership transfer contract, it relies on the claim
of ownership by the sender being valid.

Tracking aliases seems a useful approach.

So that’s the first issue.  But channel I/O is only one issue.

The shared memory issue is like this:

For coroutines running on a scheduler with a single thread, you can do this, roughly:

        yield();
        p1 <- v1;
        dostuff();
        p2 <- v2;
        stuff();
        yield();

The transfer of v1 and v2 is atomic. IF the two pointers point into the same
coroutines’s locals, then that coroutine will either observe they have
v1 AND v2 or neither.

You would not allow this. The pointers would point to a memory buffer with unique ownership, so they cannot also be 'locals' in another co-routine.

Instead you declare the buffer, which is used for "local" storage in co-routine A, and then passed using a channel to co-routine B. Local variables would be private to the co-routine.


Keean.

John Skaller2

unread,
Nov 29, 2018, 7:39:55 AM11/29/18
to felix google

>
> Ownership needs to form a DAG, and all the pointers need to be owning.

But that’s not how it is. A list with nodes:

struct node { next: &node; data: int };
typedef list = &node;

Then making

uniq list

has no impact on anything but the first pointer.

You’re right, all the pointers need to be uniq. But they aren’t.

I think this means “uniq” is wrong.
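In C++ terms the same point looks like this (std::unique_ptr standing in for uniq, purely as an illustration):

#include <memory>

struct node {
  node* next;   // interior link: a plain, aliasable pointer
  int   data;
};

// "uniq list" corresponds roughly to owning only the head pointer:
using list = std::unique_ptr<node>;

// Nothing stops other code holding node* aliases into the tail, so
// uniqueness of the head says nothing about ownership of the rest of the structure.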


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 29, 2018, 7:42:40 AM11/29/18
to felix google
>
> For coroutines running on a scheduler with a single thread, you can do this, roughly:
>
> yield();
> p1 <- v1;
> dostuff();
> p2 <- v2;
> stuff();
> yield();
>
> The transfer of v1 and v2 is atomic. IF the two pointers point into the same
> coroutines’s locals, then that coroutine will either observe they have
> v1 AND v2 or neither.
>
> You would not allow this.

But it is allowed. Coroutines are ordinary sequential programming.

Ok, so it’s not allowed in processes. I agree.

That’s why I originally said, if a set of coroutines is pure,
they can be elevated to processes. Pure means no effects
other than channel I/O.



John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 29, 2018, 7:51:53 AM11/29/18
to felix-l...@googlegroups.com
You could write this instead:

data <- yield(data)
p1 <- data.v1
dostuff()
p2 <- data.v2
stuff();
data <- yield(data)

We store the shared variables 'v1' and 'v2' in a buffer region 'data', and then exchange the ownership of this buffer with whatever we are yielding to.

Keean.


John Skaller2

unread,
Nov 29, 2018, 8:25:37 AM11/29/18
to felix google
So this isn’t very convincing but it demonstrably works:

//////////
fun thrid: 1 -> address ="pthread_self()"
  requires Posix_headers::pthread_h;

proc p (i:int) () {
  println$ #thrid,i;
}
var q = p;

proc many()
{
  for i in 1..20 perform schedule_fthread (q i);
  println$ "Spawned";
  spawn_process { println$ "New thread"; };
}

many();
/////////////


~/felix>flx tmp
Spawned
New thread
(70000ef1f000, 1)
(70000ef1f000, 3)
(70000ef1f000, 4)
(70000ef1f000, 5)
(70000ef1f000, 6)
(70000ef1f000, 7)
(7fff945b3380, 2)
(70000ef1f000, 8)
(70000ef1f000, 10)
(70000ef1f000, 11)
(7fff945b3380, 9)
(70000ef1f000, 12)
(7fff945b3380, 13)
(70000ef1f000, 14)
(7fff945b3380, 15)
(7fff945b3380, 17)
(70000ef1f000, 16)
(7fff945b3380, 18)
(70000ef1f000, 19)
(7fff945b3380, 20)


I tried with a longer run, and with more threads. It works, but there is a bug:
removing a root which is not a root, *sometimes*. Probably a race.

[gc] GC ERROR: REMOVE ROOT WHICH IS NOT ROOT

Occurs in different places in longer runs. Not surprised,
should be easy to fix.

There is no termination handling. It’s not required for this test case.





John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 29, 2018, 8:43:29 AM11/29/18
to felix google
So there’s a simple solution to the pointer problem:

In one coroutine:

critical_section {
p1 <- v;
};

In another coroutine, containing the store to which the pointer p1 above
points, you have to write:

var x : int;
critical_section { &x <- thevalue; };

I’ll think about critical sections for expressions later.

This is not implemented but the implementation is trivial: a GLOBAL lock
is set during the critical section.
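A minimal sketch of what that could look like at the C++ level (hypothetical names, not the actual implementation): one process-wide mutex that every critical section acquires.

#include <mutex>

// One GLOBAL lock shared by every critical section in the program.
static std::mutex g_critical_lock;

// critical_section { body; } would amount to something like:
template<class F>
void critical_section(F body) {
  std::lock_guard<std::mutex> g(g_critical_lock);  // serialises ALL critical sections
  body();                                          // body must not block or allocate
}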

This is INEFFICIENT because it serialises all critical sections, even if they’re
unrelated, and will cause delays if the lock is heavily contended.

Generally using locks is ONLY allowed if the code inside:

(a) cannot block
(b) does no allocations

Therefore critical sections are also unsafe.

However it works independently of how many threads are in use,
it works for single threaded processing as well as many.

It also works across pthread boundaries, and it works with
thread pools.

Provided the two conditions above are met.

So it is not ideal. But it is a way to allow some impure coroutines
to be elevated to processes.





John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 29, 2018, 4:18:06 PM11/29/18
to felix google
Well .. this is the most obvious proof of a race ever!

[sync] unrooting fthread 0x7fb3ff454cf0
[gc] Remove root 0x7fb3ff454cf0
(70000d50c000, 2288)
[sync] unrooting fthread 0x7fb3ff454ee0
[sync] unrooting fthread 0x7fb3ff454ee0
[gc] Remove root 0x7fb3ff454ee0
(70000d50c000, 2290)
[gc] GC ERROR: REMOVE ROOT 0x7fb3ff454ee0 WHICH IS NOT ROOT
[sync] unrooting fthread 0x7fb3ff4552c0
Shell terminated by signal SIGABRT

Somehow the same fthread is being unrooted twice.

The code doing this is immediately clear:

sync_sched::fstate_t sync_sched::frun()
{
  if (debug_driver)
    fprintf(stderr,"[sync] frun: entry ft=%p, active size=%zu\n", ft,active->size());
dispatch:
  if (ft == 0) {
    muxguard dummy(active_lock);
    ft = active->pop_front();
  }
  if (ft == 0) return blocked;
  request = ft->run(); // run fthread to get request
  if(request == 0) // euthenasia request
  {
    muxguard dummy(active_lock);
    if(debug_driver)
      fprintf(stderr,"[sync] unrooting fthread %p\n",ft);
    collector->remove_root(ft);
    ft = 0;
    goto dispatch;
  }


The variable “ft” is a non-static member of the sync scheduler.
There is one sync scheduler for each async scheduler and one
async scheduler per pthread. Only the list of active fthreads is
shared.

So no locking is required to use “ft”, the current pthread’s
running fibre.

So, two pthreads must be running the same fthread ..??



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 29, 2018, 5:52:56 PM11/29/18
to felix google


> On 30 Nov 2018, at 08:17, John Skaller2 <ska...@internode.on.net> wrote:
>
> Well .. this is the most obvious proof of a race ever!
>
> [sync] unrooting fthread 0x7fb3ff454cf0
> [gc] Remove root 0x7fb3ff454cf0
> (70000d50c000, 2288)
> [sync] unrooting fthread 0x7fb3ff454ee0
> [sync] unrooting fthread 0x7fb3ff454ee0
> [gc] Remove root 0x7fb3ff454ee0
> (70000d50c000, 2290)
> [gc] GC ERROR: REMOVE ROOT 0x7fb3ff454ee0 WHICH IS NOT ROOT
> [sync] unrooting fthread 0x7fb3ff4552c0
> Shell terminated by signal SIGABRT

Oh great. I love threads.

I added debugs to prove two threads try to unroot the same fthread. Check! Yes!

Ok, so I added debugs to try to find how two pthreads acquire the same fthread.
But I can’t see it on the console.

So I redirect I/O to a file.

Guess what? It works when I do that :-)

I also get this sometimes:
flx_run(38001,0x700009621000) malloc: *** error for object 0x7fd3504d82f0: pointer being freed was not allocated

I guess that has to be the GC freeing something that should have been a root.

Good thing my Mac terminal program iTerm2 has a log! Brilliant!!

(700005c96000, 751)
[sync] pthread 0x700005b90000 unrooting fthread 0x7fdbd845ded0
[sync] pthread 0x700005c13000 unrooting fthread 0x7fdbd845e0c0
[sync] pthread 0x700005d19000 fetching fthread 0x7fdbd845e490 *****************************
[sync] pthread 0x700005b0d000 fetching fthread 0x7fdbd845e490 ******************************
(700005d19000, 752)
[sync] pthread 0x700005c96000 unrooting fthread 0x7fdbd845e2b0
[sync] pthread 0x700005b90000 fetching fthread 0x7fdbd845e8a0
(700005b90000, 754)
[sync] pthread 0x700005c13000 fetching fthread 0x7fdbd845ea60
[sync] pthread 0x700005d19000 unrooting fthread 0x7fdbd845e490 *****************************
(700005c13000, 755)
[sync] pthread 0x7fff945b3380 fetching fthread 0x7fdbd845e9b0
[sync] pthread 0x700005c96000 fetching fthread 0x7fdbd845ee50
(7fff945b3380, 756)
[sync] pthread 0x700005b90000 unrooting fthread 0x7fdbd845e8a0
(700005c96000, 757)
[sync] pthread 0x700005c13000 unrooting fthread 0x7fdbd845ea60
[sync] pthread 0x700005d19000 fetching fthread 0x7fdbd845edc0
(700005d19000, 758)
[sync] pthread 0x700005a07000 fetching fthread 0x7fdbd845f180
(700005a07000, 759)
[sync] pthread 0x700005c96000 unrooting fthread 0x7fdbd845ee50
[sync] pthread 0x700005b90000 fetching fthread 0x7fdbd845f350
(700005b90000, 760)
[sync] pthread 0x700005c13000 fetching fthread 0x7fdbd845f4a0
(700005c13000, 761)
[sync] pthread 0x700005d19000 unrooting fthread 0x7fdbd845edc0
[sync] pthread 0x700005a8a000 fetching fthread 0x7fdbd845f630
[sync] pthread 0x700005c96000 fetching fthread 0x7fdbd845f9f0
(700005a8a000, 762)
(700005c96000, 763)
[sync] pthread 0x700005b90000 unrooting fthread 0x7fdbd845f350
[sync] pthread 0x700005c13000 unrooting fthread 0x7fdbd845f4a0
[sync] pthread 0x700005d19000 fetching fthread 0x7fdbd845fbe0
(700005d19000, 764)
[sync] pthread 0x700005b0d000 unrooting fthread 0x7fdbd845e490 *********************
[sync] pthread 0x700005c96000 unrooting fthread 0x7fdbd845f9f0
[sync] pthread 0x700005b90000 fetching fthread 0x7fdbd845fdd0
[sync] pthread 0x700005c13000 fetching fthread 0x7fdbd845ffc0
(700005b90000, 765)
(700005c13000, 766)
[sync] pthread 0x700005d19000 unrooting fthread 0x7fdbd845fbe0
[gc] GC ERROR: REMOVE ROOT 0x7fdbd845e490 WHICH IS NOT ROOT
[sync] pthread 0x700005b90000 unrooting fthread 0x7fdbd845fdd0
[sync] pthread 0x700005c13000 unrooting fthread 0x7fdbd845ffc0
Shell terminated by signal SIGABRT

So there it is.

dispatch:
  if (ft == 0) {
    muxguard dummy(active_lock);
    ft = active->pop_front();
    fprintf(stderr,"[sync] pthread %p fetching fthread %p\n",pthread_self(),ft);
  }

Assuming the lock is working .. this is impossible. .. checking .. argghh the lock isn’t set!

Aha. Yes it is, but NOT in the first thread. The pointer is in the scheduler.
Design fault!


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 29, 2018, 8:15:22 PM11/29/18
to felix google
Fixed!

So the current test runs .. I will try to figure out how to time it with 1 thread
and 4 threads .. see if it runs faster. However the code can’t do I/O
because that would dominate the context switching times.

What the code does at the moment is: provided there is no async I/O,
then when the queue is empty and a pthread finishes its current job,
then it just exits.

This is wrong because another pthread running a job might spawn
more fibres to run. So I have to make the pthread hang around
until ALL the pthreads are out of work, then they can all exit.

To make them hang around they will have to sit on a condition
variable. When they get out of the wait, they have to go back
and try to start working again. It’s OK if they end up waiting again.
To wake them up, we can use a signal when the queue is empty
and a new fibre gets pushed onto it.

When there is NO work left, a signal can wake them up
so they can terminate.

Sounds good but there are some issues.

The first issue is async I/O. We know when there is pending
I/O and we know when there are completed async jobs waiting
to be requeued. Currently an out of work thread checks
if there are pending jobs, and if so it hangs on the completed queue.
It will fetch any job there immediately or wait until there is one.

This only works if there is one pthread. If several block on the
queue, one may get released and the others left blocked
even though there are no longer any async jobs left.

So I’m thinking of a TRICK. When a pthread starts it
increments the async count. When it runs out of work,
it checks if the async count is 1.

If the count is 1, it decrements the count, signals, and terminates.

Otherwise, it decrements the count and hangs on the async queue.
This is fine even if there are no async jobs provided we can bump it out.

Whenever the active list is empty and we add a job, we signal.
This bumps threads out of the hang on the queue. They have to add
1 to the async count on the way out.

The logic of the trick is, from the point of view of a pthread,
if another pthread is running .. it is an async process with respect
to this thread. It just doesn’t queue an event on the async queue,
it queues them on the sync queue.

Which is interesting. The reason for TWO queues is that in single-threaded
mode, fetching from the sync queue without a lock is faster than
using the async queue, which HAS to use locks because the stuff
put onto it is put there by the demux thread.
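To pin down the logic, here is a rough C++ sketch of the counting trick described above (the names are hypothetical and this is not the actual Felix scheduler; the real fetch from the queue would also decrement the pending count):

#include <condition_variable>
#include <mutex>

struct pool_state {
  std::mutex m;
  std::condition_variable cv;
  int async_count = 0;   // threads that may still produce work
  int pending = 0;       // fibres sitting on the shared active list
};

// Called by a worker when the shared queue is empty.
// Returns true if the worker should terminate.
bool out_of_work(pool_state &s) {
  std::unique_lock<std::mutex> g(s.m);
  if (s.async_count == 1) {        // we are the last potential producer
    --s.async_count;
    s.cv.notify_all();             // bump the others out so they can exit too
    return true;
  }
  --s.async_count;                 // hang around: another thread may make work
  s.cv.wait(g, [&]{ return s.pending > 0 || s.async_count == 0; });
  if (s.async_count == 0 && s.pending == 0) return true;   // everyone is done
  ++s.async_count;                 // add 1 back on the way out, go fetch a job
  return false;
}

// Called by whoever pushes a fibre onto an empty active list.
void job_added(pool_state &s) {
  { std::lock_guard<std::mutex> g(s.m); ++s.pending; }
  s.cv.notify_one();
}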


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 29, 2018, 8:53:22 PM11/29/18
to felix google
OK so the first timing:


fun thrid: 1 -> address ="pthread_self()"
  requires Posix_headers::pthread_h;

proc p (i:int) () {
  var j = 1;
  println$ #thrid,i;
  for i in 1..10 do
    ++j;
  done
}
var q = p;

var start = time();
proc many()
{
  for i in 1..200000 perform schedule_fthread (q i);
  println$ "Spawned";
  start = time();
  schedule_fthread { println$ "Elapsed " + (time() - start).str; };
  spawn_process { println$ "New thread"; };
  spawn_process { println$ "New thread"; };
  spawn_process { println$ "New thread"; };
/*
  spawn_process { println$ "New thread"; };
  spawn_process { println$ "New thread"; };
  spawn_process { println$ "New thread"; };
  spawn_process { println$ "New thread"; };
*/
}

many();

With 1 thread: 23 seconds.
With 4 threads: 97 seconds.

The threads slow it down heaps.

Changing the numbers so the jobs take longer and there are fewer of them:

10,000 iterations per job, 200 jobs:

4 threads: 66 seconds.
1 thread: 7 seconds.

Checking the Mac activity monitor shows

(a) all 4 CPUs are used for the single threaded job. The Mac appears
to have anti-affinity.

(b) for the multi-thread job, around 60% of the time is spent in the OS.

If I run 3 threads, only 3 CPUs get loaded.

So here’s the thing:

It’s NOT the locks. This procedure runs a long time:

proc p (i:int) () {
  var j = 1;
  println$ #thrid,i;
  for i in 1..1000000 do
    ++j;
  done
}

so there is hardly any locking at all. It still uses 60% or more of the time in the OS.

My only conclusion: OSX totally sucks at multi-threading.

With many threads the maximum CPU use I can get is 125%.
The hyperthreads are never used.

With 1 thread, I get 100% CPU use. Note, this is 25% per real CPU.
ALL the CPUs are used.

Someone has to try this on Linux.



John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 30, 2018, 2:40:16 AM11/30/18
to felix-l...@googlegroups.com
The only info I can find on Mac OSX threading being slow is that the locking is slow due to not having user-space spin locks. Each lock call goes to the kernel, and then lock contention is why you see the time spent in kernel not user-space. Also Mac OSX seems to distribute the locks more evenly which doubles the contention compared to linux. Linux seems to let a thread keep the lock more often.

Maybe implement your own user-space spin locks, and only call the kernel if you are spinning for a long time (greater than the kernel call overhead)?


Keean.

John Skaller2

unread,
Nov 30, 2018, 3:02:20 AM11/30/18
to felix google


> On 30 Nov 2018, at 18:40, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> The only info I can find on Mac OSX threading being slow is that the locking is slow due to not having user-space spin locks.

But they do. OSX locks are renowned for high performance BECAUSE they start off
as user-space spinlocks, and only degrade to the OS locks after a certain number of spins.
AFAIK. Don’t quote me, I can’t remember where I read that.

> Each lock call goes to the kernel, and then lock contention is why you see the time spent in kernel not user-space. Also Mac OSX seems to distribute the locks more evenly which doubles the contention compared to linux. Linux seems to let a thread keep the lock more often.
>
> Maybe implement your own user-space spin locks, and only call the kernel if you are spinning for a long time (greater than the kernel call overhead)?

Yeah, I have actually. It calls the OS delay function instead. I can’t do a hard spin loop
in the Felix mutex though; then I’d be relying on pre-emptions to bump out
of the loop. The reason is this mutex is meant for general locking.

I could do another one specially for fast locks.

But locks aren’t the problem. Running the test with coroutines that run for a LONG time,
the only locking is fetching them from the queue. So the lock time is a fraction of a second.
The coroutines run for many seconds.

No difference. It isn’t the locks. Unless they’re held by mistake until the coroutine ends,
which I doubt. The code actually works, it’s just slow.

However I tried running my matrix multiply test using the Felix thread pool.
And that hangs completely. It used to work. Using top .. yes, the process is
running with 0% CPU usage. It used to flog the CPUs.

I did change something which makes “pfor” not work, the handling of
parameters. But the thread pool pfor procedure can be called directly.

The thing is, even the singly threaded version isn’t running.

The demo is here:

https://github.com/felix-lang/felix/blob/master/src/packages/pthreads.fdoc

Line 718.

So unless the compiler/library upgrades screwed something badly .. the problem
has to be in OSX.

Actually the Linux build is stalling on the Async I/O test. Which runs just fine
on OSX. .. hmmm … :-)

Something is badly screwed up. :-)



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 30, 2018, 3:02:31 AM11/30/18
to felix google
>
> With 1 thread: 23 seconds.
> With 4 threads: 97 seconds.
>
> The threads slow it down heaps.

Something weird. I have previously run a matrix multiply using pfor,
which uses the thread pool. That speeds up significantly using 4 threads.

It’s behaving as if the lock isn’t being released…

If the code were Felix, sure .. there’s no RAII. But it’s C++.

muxguard::muxguard (::std::mutex *p): m(p) { if (m)m->lock(); }
muxguard::~muxguard () { if (m)m->unlock(); }


Hmm .. ok .. OSX 10.13 is utter and complete rubbish.

Running the matrix multiply test now .. CPU load .. close to zero.
Stalled .. ok .. known working process. Not working.



John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Nov 30, 2018, 3:15:13 AM11/30/18
to felix-l...@googlegroups.com


On Fri, 30 Nov 2018, 08:02 John Skaller2 <ska...@internode.on.net> wrote:


> On 30 Nov 2018, at 18:40, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> The only info I can find on Mac OSX threading being slow is that the locking is slow due to not having user-space spin locks.

But they do. OSX locks are renowned for high performance BECAUSE they start of
as user space spinlocks, and only degrade to the OS locks after a certain number of spins.
AFAIK. Don’t quote me, I can’t remember where I read that.

I thought that was how Linux spinlocks work (you can see the code). MacOSX ones seem to be slower, so the speculation is they don't do this (to save battery power).

> Each lock call goes to the kernel, and then lock contention is why you see the time spent in kernel not user-space. Also Mac OSX seems to distribute the locks more evenly which doubles the contention compared to linux. Linux seems to let a thread keep the lock more often.
>
> Maybe implement your own user-space spin locks, and only call the kernel if you are spinning for a long time (greater than the kernel call overhead)?

Yeah, i have actually. It calls OS delay function instead. I can’t do a hard spin loop
in the Felix mutex though, then i’d be relying on pre-emptions to bump out
of the loop. The reason is this mutex is meant for general locking.

I could do another one specially for fast locks.

But locks aren’t the problem. Running the test with coroutines that run for a LONG time,
the only locking is fetching them from the queue. So the lock time is a fraction of a second.
The coroutines run for many seconds.

The fact the time is in kernel space suggests otherwise?

No difference. It isn’t the locks. Unless they’re held by mistake until the coroutine ends
which I doubt. The code actually works, its just slow.

However I tried running my matrix multiply test using the Felix thread pool.
And that hangs completely. It used to work. Using top .. yes, the process is
running with 0% CPU usage. It used to flog the CPUs.

I did change something which makes “pfor” not work, the handling of
parameters. But the thread pool pfor procedure can be called directly.

The thing is, even the singly threaded version isn’t running.

The demo is here:

https://github.com/felix-lang/felix/blob/master/src/packages/pthreads.fdoc

Line 718.

So unless the compiler/library upgrades screwed something badly .. the problem
has to be in OSX.

Actually the LInux build is stalling on the Async I/O test. Which runs just fine
on OSX. .. hmmm … :-)

Something is badly screwed up. :-)

This is always an option. :-)


Keean.

John Skaller2

unread,
Nov 30, 2018, 3:29:57 AM11/30/18
to felix google


> On 30 Nov 2018, at 19:02, John Skaller2 <ska...@internode.on.net> wrote:
>
>>
>> With 1 thread: 23 seconds.
>> With 4 threads: 97 seconds.
>>
>> The threads slow it down heaps.
>
> Something weird.

Argghh. It’s me. It has to be. Async-04 stalls forever; it’s a pthread
spawning test.

Starting to make sense.

Felix locks use C++ mutex with a timed wait internally.

The time is set to 1 second. If the lock isn’t released after
a second, it comes out of the wait, checks if the garbage collector
needs a world stop, then goes back into a timed wait.

Stuff would be extremely slow (but still work) if it got stuck for 1 second
per lock.

I have no idea how this relates to spawn_pthread failing.

However my big change was to make two functions:

spawn_pthread
spawn_process

which call a common function

impl_spawn_pthread

The main difference is that spawn_process uses the same coroutine
queue as its caller; spawn_pthread makes a fresh one.

However the queue is radically changed. Still .. the single-threaded
version works. And all the tests, including async-04, use single-threaded
coroutine queues (i.e. each thread has its own queue).

But something is screwed up. Apologies to OSX .. temporarily.
It has to be my fault .. there’s proof I’m an idiot :-)


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 30, 2018, 3:55:11 AM11/30/18
to felix google
[FLX_DEBUG_THREADS] Threads debug enabled
INITIALISING THREAD CONTROL OBJECT
Adding thread 0x7fff945b3380 base 0x7ffee38fa648, count=1
Removed thread 0x7fff945b3380, count=0
Adding thread 0x7fff945b3380 base 0x7ffee38fa6a8, count=1
Spawned Thread 0x70000f4c7000 start stack base = 0x70000f4c6e98, tc=0x7fe9094029
20
Thread registering itself
Adding thread 0x70000f4c7000 base 0x70000f4c6e98, count=2
Registered: Spawned Thread 0x70000f4c7000 stack base = 0x70000f4c6e98
ehd->spawner_lock = 0x7ffee38fa5f8
Thread 0x70000f4c7000 acquired mutex
Thread 0x70000f4c7000 notifying spawner it has registered itself
Thread 0x70000f4c7000 releasing mutex
Thread 0x70000f4c7000 yielding
[yield: thread=0x70000f4c7000]
Thread 0x70000f4c7000 running client code
Thread 0x70000f4c7000 unregistering
Removed thread 0x70000f4c7000, count=1
[yield: thread=0x7fff945b3380]
[yield: thread=0x7fff945b3380]
[yield: thread=0x7fff945b3380]
[yield: thread=0x7fff945b3380]
[yield: thread=0x7fff945b3380]
[yield: thread=0x7fff945b3380]
[yield: thread=0x7fff945b3380]
[yield: thread=0x7fff945b3380]
[yield: thread=0x7fff945b3380]

and so on …

loop non-finite .. :)



void thread_control_t::yield()
{
  //fprintf(stderr,"Thread control yield starts\n");
  ::std::unique_lock< ::std::mutex> m(stop_mutex);
  if(debug)
    fprintf(stderr,"[yield: thread=%p]\n", (void*)mythrid());
  //fprintf(stderr,"Unsafe stop check starts\n");
  unsafe_stop_check();
  //fprintf(stderr,"Unsafe stop check done\n");
}


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 30, 2018, 7:21:03 AM11/30/18
to felix google

> On 30 Nov 2018, at 19:55, John Skaller2 <ska...@internode.on.net> wrote:
>
> [FLX_DEBUG_THREADS] Threads debug enabled
> INITIALISING THREAD CONTROL OBJECT
> Adding thread 0x7fff945b3380 base 0x7ffee38fa648, count=1
> Removed thread 0x7fff945b3380, count=0

^^^^^^^^^^

> Adding thread 0x7fff945b3380 base 0x7ffee38fa6a8, count=1
> Spawned Thread 0x70000f4c7000 start stack base = 0x70000f4c6e98, tc=0x7fe9094029
> 20
> Thread registering itself
> Adding thread 0x70000f4c7000 base 0x70000f4c6e98, count=2
> Registered: Spawned Thread 0x70000f4c7000 stack base = 0x70000f4c6e98
> ehd->spawner_lock = 0x7ffee38fa5f8
> Thread 0x70000f4c7000 acquired mutex
> Thread 0x70000f4c7000 notifying spawner it has registered itself
> Thread 0x70000f4c7000 releasing mutex
> Thread 0x70000f4c7000 yielding
> [yield: thread=0x70000f4c7000]
> Thread 0x70000f4c7000 running client code
> Thread 0x70000f4c7000 unregistering
> Removed thread 0x70000f4c7000, count=1
> [yield: thread=0x7fff945b3380]
> [yield: thread=0x7fff945b3380]
> [yield: thread=0x7fff945b3380]
> [yield: thread=0x7fff945b3380]
> [yield: thread=0x7fff945b3380]
> [yield: thread=0x7fff945b3380]
> [yield: thread=0x7fff945b3380]
> [yield: thread=0x7fff945b3380]
> [yield: thread=0x7fff945b3380]

^^^^^^^^^^^^^

Hmm … a removed thread is doing this .. ??

BTW: it’s the mainline thread.


>
> und so weiter …
>
> loop non-finite .. :)
>
>
>
> void thread_control_t::yield()
> {
> //fprintf(stderr,"Thread control yield starts\n");
> ::std::unique_lock< ::std::mutex> m(stop_mutex);
> if(debug)
> fprintf(stderr,"[yield: thread=%p]\n", (void*)mythrid());
> //fprintf(stderr,"Unsafe stop check starts\n");
> unsafe_stop_check();
> //fprintf(stderr,"Unsafe stop check done\n");
> }
>
> —
> John Skaller
> ska...@internode.on.net
>
>
>
>
>


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 30, 2018, 8:05:57 AM11/30/18
to felix google
pthread_monitor: sleep: thread 0x7fff945b3380 calling std::this_thread::yield()
[Thread_control:yield: thread=0x7fff945b3380]
pthread_monitor: sleep: thread 0x7fff945b3380 calling std::this_thread::yield()
[Thread_control:yield: thread=0x7fff945b3380]
pthread_monitor: sleep: thread 0x7fff945b3380 calling std::this_thread::yield()
[Thread_control:yield: thread=0x7fff945b3380]



It’s in the monitor = pchannel.

It’s using C++ yield (used to use a nanosecond sleep).
This is a spinlock.

void*
monitor_t::dequeue()
{
  monitor_data_t *p = 0;

  // Swap NULL into the monitor until we get a non-NULL value back.
  while ( !(p = ::std::atomic_exchange_explicit (&data, p, DQFENCE))) sleep(tc,1);

  // grab the user data
  void *elt = p->user_data;

  // signal that we have the data
  p->flag.store(true);
  // the writer that was originally responsible for putting
  // the data we read into the monitor may now proceed
  return elt; // return data
}


It’s a lock-free implementation of a queue.
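The general technique looks roughly like this in standalone C++ (an illustration of the same exchange-and-flag rendezvous, not the Felix monitor itself):

#include <atomic>
#include <thread>

struct packet {
  void* user_data = nullptr;
  std::atomic<bool> flag{false};   // set by the reader when the data has been taken
};

std::atomic<packet*> slot{nullptr};  // the "monitor": holds at most one packet

void write_rendezvous(void* data) {
  packet p;
  p.user_data = data;
  packet* expected = nullptr;
  // Publish our packet; spin until the slot is free.
  while (!slot.compare_exchange_weak(expected, &p)) {
    expected = nullptr;
    std::this_thread::yield();
  }
  // Wait until a reader signals it has copied the data out.
  while (!p.flag.load()) std::this_thread::yield();
}

void* read_rendezvous() {
  packet* p = nullptr;
  // Swap NULL in until we get a non-NULL packet back.
  while (!(p = slot.exchange(nullptr))) std::this_thread::yield();
  void* elt = p->user_data;
  p->flag.store(true);             // the writer may now reclaim its packet
  return elt;
}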


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Nov 30, 2018, 9:54:42 PM11/30/18
to felix google


> On 1 Dec 2018, at 00:05, John Skaller2 <ska...@internode.on.net> wrote:
>
> pthread_monitor: sleep: thread 0x7fff945b3380 calling std::this_thread::yield()
> [Thread_control:yield: thread=0x7fff945b3380]
> pthread_monitor: sleep: thread 0x7fff945b3380 calling std::this_thread::yield()
> [Thread_control:yield: thread=0x7fff945b3380]
> pthread_monitor: sleep: thread 0x7fff945b3380 calling std::this_thread::yield()
> [Thread_control:yield: thread=0x7fff945b3380]

Arghh.. OK .. found it. Added the new fibre to the wrong queue.
Added it to the current pthread’s queue, not the target’s queue.

This is the same for spawn_process, but different for spawn_pthread.




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 1, 2018, 1:23:28 AM12/1/18
to felix google
At risk of confusion .. I am adding a garbage collected type called stl_dlist_t
to the RTL. Will add a Felix interface shortly to test it.

This is a doubly linked list of machine words, with a scanner
that can find the machine words. So the machine words can
be Felix pointers. The implementation is, of course, an STL list:

struct stl_dlist_t {
  ::std::list<void*> data;
};

No particular reason at the moment for the wrapper.

The REASON is that the fthread_list of active fthreads uses ::std::list
already, but, because that one isn’t garbage collected, I have to add_root()
every time a new fthread goes in the list.

It’s nontrivial to bypass this; the fthread_list type also has to be
garbage collected. A design problem at the moment: you cannot
make something a root unless Felix allocated it. If you do, then
the GC gets confused trying to scan it.

So I think I will just make the fthread_list collected, and have its owners
make it a root. This should improve the scheduler speed. Especially
in micro benchmarks that are run with lots of memory so the GC is never
actually run :)

One possible problem at the moment is that all the rooting and unrooting
has to be protected by a mutex. I want to get rid of the mutex and use
a hard spin lock (i.e. a lock-free lock), which avoids entering the OS
completely. Using hard spinlocks is only recommended for very short
sequences of machine instructions.

It’s unfortunate, but most OSes do not (for safety reasons) provide interrupt blockers.
A critical section that prevents pre-emptions is really required for a spin lock.
Otherwise, whilst waiting, the holder could be pre-empted out, leaving the spin
lock spinning a long time (until either the holder is rescheduled, or the spinning
thread is also pre-empted out).

The big advantage of a spin lock is almost zero overhead for uncontended locks.
And most of the time, if the lock period is very short, the cost of the other thread
spinning is low compared to entering the OS. I hope.
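A minimal spin-then-yield lock of that sort in C++ (an illustrative sketch, not the Felix RTL lock):

#include <atomic>
#include <thread>

class spinlock {
  std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
  void lock() {
    int spins = 0;
    // The uncontended case never enters the loop: near-zero overhead.
    while (locked.test_and_set(std::memory_order_acquire)) {
      if (++spins > 1000)
        std::this_thread::yield();  // holder probably pre-empted out; stop burning CPU
    }
  }
  void unlock() { locked.clear(std::memory_order_release); }
};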

Anyhow I have to minimise the time the scheduler takes to push and pop
threads from the list, and removing the rooting and unrooting, which use
an STL map, seems like a good way to do it.

It’s just a pity we ALSO have

(a) dlist, a Felix implementation of a doubly linked list
(b) stl_list, a Felix binding of STL list

already. But then programmers are paid to make messes … :-)




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 1, 2018, 5:04:36 AM12/1/18
to felix google
Example:

/////////////////////
include "std/control/sysdlist";
open SysDlist;

var x = sysdlist[int]();
push_front (x, new 42);
push_front (x, new 41);
var a1 = front x;
var a2 = unsafe_pop_front x;
var a3 = unsafe_pop_front x;

var v1 = *a1;
var v2 = *a2;
var v3 = *a3;

println$ v1, v2, v3;
///////////////

Specification:

///////////////////
class SysDlist {
  _gc_pointer type sysdlist[T] = "::flx::rtl::sysdlist_t*";

  ctor[T] sysdlist[T] : 1 = "new (*(ptf->gcp), ::flx::rtl::sysdlist_t_ptr_map, true) ::flx::rtl::sysdlist_t()";

  proc push_front[T] : sysdlist[T] * &T = "$1->data.push_front((void*)$2);";
  proc push_back[T]: sysdlist[T] * &T = "$1->data.push_back((void*)$2);";

  fun front[T]: sysdlist[T] -> &T = "(?1*)$1->data.front();";

  gen pop_front[T]: sysdlist[T] -> @T = """
    (?1*)(
      ([&] ()
      {
        if ($1->data.empty()) return (void*)nullptr;
        auto x = $1->data.front();
        $1->data.pop_front();
        return x;
      })
      ()
    )""";

  gen unsafe_pop_front[T](x:sysdlist[T]): &T => C_hack::cast[&T](pop_front x);

  fun empty[T]: sysdlist[T] -> bool = "$1->data.empty()";

  fun len[T]: sysdlist[T] -> size = "$1->data.size()";
}
///////////////////

Explanation:

If an abstract type such as sysdlist is a pointer to a garbage collected
object, then we have to tell the system it is. _gc_pointer does that.

The constructor uses the standard model: operator new,
passing as the placement arguments a reference to the GC profile
object, a pointer to the shape descriptor, and a flag saying whether
this allocation is permitted to trigger a collection or not.

The variable ptf is a pointer to the thread frame. I actually forgot,
we have to tell the compiler that the constructor has to be executed
in a context in which it is visible. The thread frame contains a pointer to
the collector profile object, which contains a pointer to the collector.

The type sysdlist_t and its shape sysdlist_t_ptr_map are defined
in the RTL (run time library).

The pop_front function is interesting. Unlike STL, pop_front
pops the list at the front AND returns the value that was stored there.

To do this, I have to define a function, and call it, because you cannot
nest statements in expressions .. UNLESS you use a lambda.

Lambdas are a PITA and crappily designed crud .. but for this purpose
it seems to work. The above construction is

let .. in ..

In other words we define a lambda, and then immediately apply it.
We just want to get statements inside an expression. Some better syntax
would be nice.

Because it returns NULL if the list is empty, the return type
is actually

@T

which is a possibly null pointer. You cannot dereference one of these,
you have to check it’s not null first. Hence, I cheated and did a
C_hack, because I’m lazy, to get the unsafe versions.

I’ve been meaning to steal Swift’s cheat method:

!p

dereferences a possibly null pointer, and barfs if it’s null. You can do this
already with other syntax. Note first:

typedef ptr[T] = &T; // alias
variant cptr[T] = | nullptr | Ptr of &T; // parser maps @T

ctor[T] ptr[T](px:cptr[T]) =>
let Ptr p = px in p
; // match failure if null

//$ Checked dereference of C pointer.
fun deref[T] (px:cptr[T])=> *(px.ptr);


Note the fun trick: by using a match, if the match fails
we get an exception for free.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 1, 2018, 7:24:42 AM12/1/18
to felix google
Cool. GC appears operational.

/////////
include "std/control/sysdlist";
open SysDlist;

for k in 1..200 do
  println$ "Set " + k.str;
  for j in 1..200 do
    var x = sysdlist[int]();
    for i in 1..1000 perform push_front (x, new i);
  done
done

println$ "Done";
////////

~/felix>FLX_REPORT_COLLECTIONS=1 FLX_FINALISE=1 FLX_FREE_FACTOR=2.0 FLX_MIN_MEM=1000 flx td
Deleting collector total time = 209.42105 seconds, gc time = 88.55566 = 42.29%

~/felix>FLX_REPORT_COLLECTIONS=1 FLX_FINALISE=1 flx td
Deleting collector total time = 198.37791 seconds, gc time = 88.90117 = 44.81%



John Skaller
ska...@internode.on.net




