Reworked RTL

18 views
Skip to first unread message

John Skaller2

unread,
Dec 1, 2018, 6:17:49 PM12/1/18
to felix google
So the RTL is now reworked to remove most of the add/remove root stuff.
The structure is changed a bit, unfortunately there are a few more heap
allocated objects.

This is partly due to a design fault. Felix “scan_by_offsets”
scanners use an array of offsets into an object to find pointers.
For objects the compiler synthesises, it knows the objects and
generates the offset tables. If you make a product, it is treated
as a primitive with a single offset table.

However the sysdlist type is an actual STL list of void*.
To garbage collect that, there is a special scanner. Its really
simple! It just iterates through the STL data type using iterators
and calls “register_pointer” on each element.

The problem with foreign data types like this is you cannot
include them in other data types. The system doesn’t recursively
expand product shape data at run time. So you have to use a pointer
(because that IS recursively expanded at run time, that’s what the GC does).

The current status of my upgraded RTL is: KAPUT. Borked. Its not working.
More precisely, programs run, but fibre, threads, and the other stuff actually
using the changed RTL code crashes, segfaults, deletes unallocated pointers,
etc etc.

Getting the GC to scan system objects is a bit tricky. The async scheduler
is NOT scanned, the sync scheduler, active list, and contained sysdlist are.
This means these things are no longer deleted, the GC does that now.
However since the async scheduler isn’t scanned, it has to use
add_root/remove_root on its contained pointers.

Also when making an actual async request, the fibre has to be made
a root, and when put back into the active list on completion it has
to be unrooted. Note that demux thread does async I/O (socket data transfers)
during garbage collection! Demux knows nothing about GC. It’s not a Felix thread,
it doesn’t stop.

The point of all this work is to get rid of the add_root/remove_root calls in the
scheduling because they slow it down. These calls had to be done inside a
critical section, i.e. whilst holding a lock. The locked code now just pushes
and pops a list which should be lightning fast. If not, I’ll replace the C++ STL
list with something faster. Dlists are trivial, we actually only need an slist
with a pointer to the last node. So that’s a future performance improvement.

Anyhow there’s a bug. The theory is you cannot get a bug until the GC runs.
In that case reaping a reachable pointer is possible, leading to a crash.
I also get free() or non-malloc() pointer which is harder to understand
unless its a double free.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 2, 2018, 8:03:54 AM12/2/18
to felix google
So everything works with lots or memory but flx_tangle fails with small memory.
Suggesting GC bug of some kind. That is, not a bug in the GC but in the
rooting of the shape tables.

concurrent_coroutines: segfault

the concurrent coroutine performance test runs,
the concurrent timing is rubbish (8 seconds).
The serial timing is 4 times faster (0.25 seconds) than it used to be.
Hmm.

This one is hard to believe: 100x100 matrix multiply with thread pool:
0.007 seconds. huh?

You know, the Sydney Harbour Bridge required inverting a 103^2 matrix,
it took HUNDREDS of engineers with slide rules months to solve it.

SO no .. it didn’t work :-)



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 2, 2018, 7:06:27 PM12/2/18
to felix google
So I think I found “the last bug”. The new machinery is committed.
Concurrency is still slow. However now, there’s no rooting and unrooting
of fthreads in the sync scheduler since the currently running fibre and
the active list is now garbage collected. Its not clear the rooting and unrooting
was thread safe anyhow (races could occur non-atomic updates to the
STL map used ..??? need to check). Rooting and unrooting is still required
interfacing with the async demux thread.

So now, the next step is to replace the mutex locks with spinlocks.
Technically this is “lock free” code. I’m already using a spinlock
on the async queue, however it has an OS yield() or C++ timed
wait (which has to be an OS operation I think).

For async we need a delayed spin, because we could be waiting
for a timer or socket, an arbitrarily long wait.

But for the sync scheduler the delay is just a push or pop
from an list, very fast, and should be finite time unless there
is an unfortunate pre-emption of the current holder of the lock.

So the lock will just be a compare_exchange on a boolean.
You swap TRUE with the lock variable and if you get TRUE
back the lock was already held, so you have to spin. The lock
variable is an atomic bool. The code is simple and is already
used for pchannels (and related code for condition variables
needed for bound queues).

It a one liner actually:

while ( (p = ::std::atomic_exchange_explicit(&data, p, NQFENCE))));

with a check that the barrier is right. Monitors currently say:

ENQUEUE (pchannel writer):

while ( (p = ::std::atomic_exchange_explicit(&data, p, NQFENCE))) sleep (tc,1);

DEQUEUE (pchannel reader):

while ( !(p = ::std::atomic_exchange_explicit (&data, p, DQFENCE))) sleep(tc,1);

This code is using a pointer not bool, with NULL test, and it sleeps in the loop,
the sleep code ALSO checks for a world stop. We don’t need any of that,
we can’t have a GC in the middle of our operation and our operation must make
progress and is fast.

The only possible modification is to count the iterations, and degrade to
doing an OS level sleep if the count is large enough. If we get the count
right, then the hard spin can only exceed the count if the lock holder
is pre-empted in which case we WANT to yield to encourage the OS to
reschedule it so it can release the lock.

The “right” way to do this is mask interrupts, so pre-emptions aren’t
possible. But most OS don’t let applications block pre-emptions.
[On the old x86 under MS-DOS you could mask IRQ but not NMI,
NMI being “non maskable interrupt”. So you could block a pre-emption
or signal in critical operations, but still allow the user to kill the process
if it locked up. Usually the user had another non-maskable operation
to do that .. turn off the power. :-)

So there’s a risk, in an “overthreaded” system, that a hard spin-lock
will waste a lot of CPU if a non-pre-empted thread loops waiting
for a pre-empted one. I’ll try it first, before the complication of counting
loops. At worst .. it will finally get my Mac Activity Monitor to push
CPU usage on one CPU to 100% :-)




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 2, 2018, 7:45:54 PM12/2/18
to felix google
Well .. StackOverlow to the rescue:

class SpinLock {
std::atomic_flag locked = ATOMIC_FLAG_INIT ;
public:
void lock() { while (locked.test_and_set(std::memory_order_acquire)) { ; }}
void unlock() {locked.clear(std::memory_order_release);}
};

Thanks to MikeMB.

I need a guard for this, I think muxguard will work with a small mod.

Some people suggest checking the variable first, before trying to do a lock.
If its locked, don’t bother with the test_and_set. Apparently spinning on the
check doesn’t cause any memory accesses (whereas the hard exchange above
does).

Anyhow tuning later ..



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 5, 2018, 9:17:50 AM12/5/18
to felix google
So I have now tested on Linux as well.

OSX: single thread, 6 seconds, 4 threads, 66 seconds.
Linux: single thread, 5 seconds, 4 threads, 13 seconds

OSX is absolutely useless at multi-threading. This is using a spinlock.

Multiply matrix test, uses thread pool:

OSX: about 20 seconds 1 and 8 threads. At least no degradation.
Linux: 24 seconds 1 thread, 32 seconds for 8. Degraded.

My conclusion: Intel chips are utterly useless at concurrency.
Both boxes run high end quad-core i7 processors, and late OSX and Linux versions.
Both are laptops, the Linux is running on an ASUS, OSX High Sierra on MacBook Pro.
Same memory and clock rates. The Mac has SSD for persistent store the ASUS is
using a Winchester.

IMHO the problem is caching. In core caches do not work with concurrency,
they actively prevent it being effective.

The matrix test divides the multiplication into independent jobs that read
the same memory but write to different parts of the result array. I forget if
I’m doing it row or column major (i,e if the writes interleave or work in partitioned
address spaces). With interleaving there would be cache overlapps forcing cache
synchronisation on every write. Must look into this :-) … ok change the order ..
no difference.

However the concurrent coroutine test is doing 20 long running jobs, each of which
just updates a local variable. so there’s no interference at all between the jobs.
And it runs slower.

Its still possible the fault is mine, in my code. But the two methods of using
concurrency are completely different, even though both use a queue of jobs.
The serialisation is different in both cases, and in both cases trying to use
more than one CPU to distribute the workload slows everything down.

The vastly superior result on Linux is clearly the result of the OS pre-emption
and scheduling policy compared to OSX (its the same code in both cases).
I also tried using gcc on the mac instead of clang: same result.

To be thorough i should test on Windows too. But I’m getting a bit discouraged I have to say.
I hoped to get better performance and speed it up even more with tuning, then try to
solve the hard problem of checking whether coroutines CAN be run safely at the same
time .. but there’s no point if the result is slower anyhow.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 5, 2018, 5:34:59 PM12/5/18
to felix google
OK so i got the real matrix mul demo working. The file is in:

build/release/share/demo/threadpoolex1.flx

Here’s the resulty for OSX:

Naive mul elapsed 2.70839 seconds
Naive Parallel mul elapsed 0.400742 seconds
Verified
Using thread pool's pforloop
Smart Parallel mul elapsed 2.39832 seconds
Wrong!
pfor mul elapsed 2.68518 seconds
Wrong!


So, the first case is a standard single threaded matrix multiplication A * trans B,
where trans means ’transposed”. This is done because then the inner product
calculation involves scanning two contiguous arrays.


That’s 2.7 seconds. Using the thread pool and hand partitioning the job
into 8 jobs, it takes 0.4 seconds .. in other words its about 7 times FASTER
using 8 threads.

Weirdly, using the manually invoked autopartitioning, it taks 2.3 seconds,
roughly the same as a single thread .. and gets the wrong answer.
The pfor loop is doing the same thing with a bit of sugar. Same bad result,
both time and correctness failed.

The base calculation in all cases is this:

fun inner_product (pr: &vec_t, pc: &vec_t) =
{
var sum = 0.0;
for (var k=0; k<N; ++k;)
perform sum = sum + *(pr.k) * *(pc.k);
return sum;
}

It calculates one element of the result matrix. Here is the full matrix mul:

// naive multiply
var start = #time;
begin
for i in 0..<N
for (var j=0; j<N; ++j;)
perform &r . i . j <- inner_product (&a.i, &b.j);
s = r;
end
var fin = #time;
println$ “Naive mul elapsed " + (fin - start).str + " seconds";


To use the thread pool:

ThreadPool::start 8;

and now I define:

noinline proc inner_products_job (var i:int) () {
for (var j=0; j<N; ++j;)
perform &r . i . j <- inner_product (&a.i, &b.j);
}

This does calculates a whole result row (or is it a column?) in one job.
to use the pool:

start = #time;
begin
for i in 0..<N
call ThreadPool::queue_job$ inner_products_job (i);
ThreadPool::join;
end
fin = #time;
println$ "Naive Parallel mul elapsed " + (fin - start).str + " seconds";
verify;

Its fast. 7 times faster this way.

To use the auto partitioning we need this:

noinline proc inner_products_proc (var i:int)
{
for (var j=0; j<N; ++j;)
perform &r . i . j <- inner_product (&a.i, &b.j);
}

notice the subtle difference, the “job” has an extra () argument.

// smart parallel multiply
clear &r;
start = #time;
begin
println$ "Using thread pool's pforloop";
ThreadPool::pforloop 0 (N - 1) inner_products_proc;
end
fin = #time;
println$ "Smart Parallel mul elapsed " + (fin - start).str + " seconds";
verify;

And that’s slow and fails. But it is SUPPOSED to do EXACTLY the same thing
as manually partitioning the jobs and queuing them. The syntactic sugar version is:


// smart parallel multiply with syntax
clear &r;
start = #time;
begin
pfor i in 0 upto (N - 1) do
for (var j=0; j<N; ++j;)
perform &r . i . j <- inner_product (&a.i, &b.j);
done
end
fin = #time;
println$ "pfor mul elapsed " + (fin - start).str + " seconds";
verify;

This is the complete code without the external _job or _proc procedure,
this is what you’re supposed to write, the other procedures were to check
stuff worked .. good thing because it used to work and is now broken.

The critical thing in this code is that the naive and parallel versions
are the same, you just use “pfor” instead of “for” and it automatically
uses the thread pool. Don’t do this if the loop iterations must be sequential!

So now i have a bug to chase .. it appears the pforloop isn’t working.
Its not only wrong .. it is slow.

That would explain the “mul.flx” failure. The manual job queuing works and shows
that the multi-core operation can actually be fast.




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 5, 2018, 6:06:43 PM12/5/18
to felix google


> On 6 Dec 2018, at 09:34, John Skaller2 <ska...@internode.on.net> wrote:
>
> OK so i got the real matrix mul demo working. The file is in:
>
> build/release/share/demo/threadpoolex1.flx
>
> Here’s the resulty for OSX:
>
> Naive mul elapsed 2.70839 seconds
> Naive Parallel mul elapsed 0.400742 seconds
> Verified
> Using thread pool's pforloop
> Smart Parallel mul elapsed 2.39832 seconds
> Wrong!
> pfor mul elapsed 2.68518 seconds
> Wrong!

GRRR!

Now I’m confused. I added debugs to see what went wrong:

Naive mul elapsed 2.73642 seconds
Naive Parallel mul elapsed 0.402332 seconds
Verified
Using thread pool's pforloop
Pfor segment 0,999
QUEUE JOB: Counter = 0, sfirst=0, slast=110
QUEUE JOB: Counter = 1, sfirst=111, slast=221
forloop 0,110
QUEUE JOB: Counter = 2, sfirst=222, slast=332
forloop 111,221
QUEUE JOB: Counter = 3, sfirst=333, slast=443
forloop 222,332
QUEUE JOB: Counter = 4, sfirst=444, slast=554
QUEUE JOB: Counter = 5, sfirst=555, slast=665
forloop 333,443
forloop 444,554
QUEUE JOB: Counter = 6, sfirst=666, slast=776
forloop 555,665
QUEUE JOB: Counter = 7, sfirst=777, slast=887
forloop 666,776
UNQUEUED JOB: Counter = 8, sfirst=888, slast=999
forloop 888,999
forloop 777,887
Smart Parallel mul elapsed 0.414602 seconds
Verified
Pfor segment 0,999
QUEUE JOB: Counter = 0, sfirst=0, slast=110
QUEUE JOB: Counter = 1, sfirst=111, slast=221
forloop 0,110
QUEUE JOB: Counter = 2, sfirst=222, slast=332
forloop 111,221
QUEUE JOB: Counter = 3, sfirst=333, slast=443
forloop 222,332
QUEUE JOB: Counter = 4, sfirst=444, slast=554
forloop 333,443
QUEUE JOB: Counter = 5, sfirst=555, slast=665
forloop 444,554
QUEUE JOB: Counter = 6, sfirst=666, slast=776
forloop 555,665
QUEUE JOB: Counter = 7, sfirst=777, slast=887
UNQUEUED JOB: Counter = 8, sfirst=888, slast=999
forloop 888,999
forloop 666,776
forloop 777,887
pfor mul elapsed 0.448273 seconds
Verified

now .. it doesn’t go wrong, it works .. and the smart and sugared versions
both run the same speed as the manually partitioned ones.

So fiddling with it:

This FAILS .. but WORKS if I uncomment the first debug print …

proc pfor_segment (first:int, last:int) (lbody: int * int -> 1 -> 0)
{
//println$ "Pfor segment " + first.str + "," last.str;
var N = last - first + 1;
var nt = nthreads + 1;
if pforrunning.load == 0 and N >= nthreads and nthreads > 0 do
pforrunning.store 1;
for var counter in 0 upto nt - 2 do
var sfirst = first + (N * counter) / nt;
var slast = first + (N * (counter + 1)) / nt - 1;
//println$ "QUEUE JOB: Counter = " + counter.str + ", sfirst=" + sfirst.str + ", slast=" + slast.str;
ThreadPool::queue_job$ lbody (sfirst, slast);
done
sfirst = first + (N * (nt - 1)) / nt;
slast = last;
//println$ "UNQUEUED JOB: Counter = " + counter.str + ", sfirst=" + sfirst.str + ", slast=" + slast.str;
lbody (sfirst, slast) ();
join;
pforrunning.store 0;
else
// Run serially
lbody (first, last) ();
done
}

GRRRR .. :-)

Why? Well … one theory is that printing from multiple threads requires the underlying C
library to use a mutex so the barrier that introduces is changing the ordering either at
the CPU level or the C++ compiler level. It could also impact Felix inlining optimisation.

I’ll go for a Felix issue. So the theory is, in a loop, there’s a problem with closures
that capture loop variables. They capture the value of the variable at the time
it is inspected in the closure NOT the time the closure is formed. If you want to capture
a value from a loop, you have to use a double closure. The first function copies the
value into its stack frame and returns a closure over the nested function, which then
binds by address (reference) to the captured value. So each of these closures is binding
not to the loop variable, but the copy inside the closure enclosing it. So there’s one of these
enclosing closures for each loop iteration. The inner closure is delayed, and uses access
by reference to the outer closure, however the outer closure is spent. It executed at the
same time the loop iterated. It copied the control variable and then returned the inner closure.

If you get this wrong, you get unexpected results. It seems weird but it is NOT a bug.
Closure formation captures variables by their address and it HAS to work like that.
Otherwise this would fail:

var x = 1;
proc printx () { println$ x; }
var p = printx;
p;
x = 2;
p;

You expect 1 and 2 and that can only happen if the closure p captures
the variable x by address (reference) not by value.

This is a weak version of a standard accessor method, a function
that returns results depending on the current value of state it is bound to.
If it captured by value, then the result would be the same every call.

In any case, this is all a theory .. that one debug statement is required to make
it work. Some optimisation is being blocked by its presence.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 5, 2018, 8:28:18 PM12/5/18
to felix google

>
> If you get this wrong, you get unexpected results.

Almost certainly this is the issue. I put debugs in the C++ directly,
bypassing Felix. The output is sorted, to get rid of the asynchronous behaviour of threads.
It starts:

Inner products proc, i = 111
Inner products proc, i = 112
Inner products proc, i = 113
Inner products proc, i = 114
Inner products proc, i = 115


Hmm .. what happend to i = 0 thru 110?
And since I run this process twice, once manually and once with sugar,
why isn’t everything duplicated?

Inner products proc, i = 217
Inner products proc, i = 218
Inner products proc, i = 219
Inner products proc, i = 220
Inner products proc, i = 221
Inner products proc, i = 222
Inner products proc, i = 222
Inner products proc, i = 223
Inner products proc, i = 223

Oh hey .. the duplicates have cut in now … notice, 221 is a break point.
There should be 1000 inner products calculated (i=0 to 1000) for each of the two runs.

Inner products proc, i = 679
Inner products proc, i = 679
Inner products proc, i = 679
Inner products proc, i = 679
Inner products proc, i = 679
Inner products proc, i = 680
Inner products proc, i = 680
Inner products proc, i = 680
Inner products proc, i = 680
Inner products proc, i = 680
Inner products proc, i = 680
Inner products proc, i = 681
Inner products proc, i = 681
Inner products proc, i = 681

Hey what? 6 iterations …

Inner products proc, i = 997
Inner products proc, i = 998
Inner products proc, i = 998
Inner products proc, i = 998
Inner products proc, i = 998
Inner products proc, i = 998
Inner products proc, i = 998
Inner products proc, i = 998
Inner products proc, i = 998
Inner products proc, i = 998
Inner products proc, i = 999
Inner products proc, i = 999
Inner products proc, i = 999
Inner products proc, i = 999
Inner products proc, i = 999
Inner products proc, i = 999
Inner products proc, i = 999
Inner products proc, i = 999
Inner products proc, i = 999

9 iterations …

So its clear .. its exponential. Large the number, the more times its repeated.

So .. the *initial* value of the segment a job handles is correct, but the *final* value
is the same for all of them.

I will note the algoritm is tricky. The thread pool has 8 threads but the thread that
dispatched the jobs ALSO runs a segment. So there are actually 9 jobs running concurrently.

And 9 iterations of the final inner product calculation.

So each segment runs once, starting with i = 111 which is 1000/9. It should start at
0 and go up to 111. It actually goes up to 999. The next segment is 222 … 999.
I forget the maths but this is( 1 + 2 + 3 + .. 9 ) * 111 iterations instead of 1000.
111 * 9 = 999. The partitioning of the jobs is not exact because 1000 isnt divisible by 9.

I have verry little doub now, its a closure capture issue. The right stuff is being
run, but the limits are wrong. It works for the manual case because the
procedure has an extra () argument, so the arguments are capture by the
first closing, leaving the final unit closure as the job.




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 5, 2018, 9:05:48 PM12/5/18
to felix google


> On 6 Dec 2018, at 12:28, John Skaller2 <ska...@internode.on.net> wrote:
>
>
>>
>> If you get this wrong, you get unexpected results.
>
> Almost certainly this is the issue.

Ok its fixed. This is close to the shortest fix in history.

I changed “inline” to “noinline” on one procedure. :-)

I think this is probably enough because it appeared to be enough.

In Felix “inline” has semantics, its not a hint. Noinline forces a closure
to be formed, which uses eager evaluation. In particular arguments are
copied to parameters.

If you inline a procedure, the parameter can be *replaced* by the argument,
which is lazy evaluation.

The difference is that if that procedure spawns a closure, it will bind
to the parameter by reference. So in the inline case it binds to the
current value of the argument, whicih is the loop control variable,
and it will be the last value of the loop control variables if you save
up the closure and execute it after the loop.

But if you immediately execute a closure which copied the loop
control variable as an argument to a parameter and then spawn
a closure it binds to the parameter which is the value of the loop
control variable at the time the outer closure was run. The outer
closure frame .. there’s a new one for every loop iteration,
*because* it is no inline. If it were inline, there would only be one
parameter variable in the loop body (due to inlining).

This sucks. The semantics are bad. But forcing a copy defeats
the critical inlining optimisation which works just fine in non-looping
cases. It also works fine if you use a tail recursion instead of a loop,
and let the *compiler* decide if it can do a self-tail-call optimisation:
the compiler only does this if it doesn’t change the variable capture
semantics.

There is a statement that does this:

for … a plain loop
rfor .. a loop using tail recursion
pfor .. a loop using concurrency

Of course, rfor is expensive when it doesn’t optimise, you get a fresh heap
allocated stack frame for every iteration.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 6, 2018, 8:15:38 AM12/6/18
to felix google
OK! So I found the problem in my coroutine test:

The for loop. This for loop

for i in 1..1000 perform ++j;

for some reason is heap allocating two objects every allocation.
It’s related to the fact it parses it as a call to an iterator over a slice,
an iterator is a generator, so it shouldn’t spawn heap but something was.

Anyhow I changed it to the low level C like loop:

for(k=0; k<1000; ++k;) perform ++j;

with 200 jobs .. and it was so fast I couldn’t measure it. I had to change it
to 100000 iterations and change the number of jobs from 200

..

to 2 MILLION

That took around 30 seconds (for the processing only not the scheduling of the jobs).
WIth 1 thread. With 8 threads, 7 seconds. With 4 threads I got around 5 seconds.

Any which way that’s about 300 THOUSAND context switches per second.

I dunno if Go has a hope in hell of competing with that :-)

And it isn’t even tuned yet.

BTW: the run used so much memory it did a garbage collection (single threaded)
in the middle:

[flx_gc:gc_profile_t] Threshhold 1000000000 would be exceeded, collecting
[flx_gc:gc_profile_t] actually_collect
actually collected 21739101 objects, still allocated: 3 roots, 10869568 objects, 318841476 bytes

So the collector scanned a million objects in the single threaded version as well.


This is more like it!


John Skaller
ska...@internode.on.net





Brandon Barker

unread,
Dec 6, 2018, 8:52:41 AM12/6/18
to Felix Language
That's exciting!  And sounds like you may have found an interesting bug in the generator as well.

Once it is tuned a bit, it will be nice to have a side-by-side comparison and benchmark of Go and Felix code to show off.

John Skaller2

unread,
Dec 6, 2018, 10:39:42 AM12/6/18
to felix google


> On 7 Dec 2018, at 00:52, Brandon Barker <brandon...@gmail.com> wrote:
>
> That's exciting! And sounds like you may have found an interesting bug in the generator as well.

It’s not a bug, slice iterators in general form are intrinsically slow.

In this case it involves constants and it should be optimised, because loops
are important and the slice syntax is the simplest and preferred syntax.

I think but I’m not sure the problem is I tried to do the optimisation in the parser.
The difficulty is that then

for i in 1..10 ….

is ambiguous since it matches BOTH

for name in slice

and

for name in integer .. integer

because

integer .. integer

is a slice. I forget how the parser resolves this. The solution is to do the optimisation
AFTER parsing.

For some time I’ve been removing optimisations to simplify the compiler and
ensure correctness. My belief is that we need a system to *quantify* performance
gains and *validate* correctness, similar to the current suite of regression tests.

However I don’t know how to design such a tool. How do you “pass” a test that
is “faster” than a model case? Specify a threshhold of performance increase??



>
> Once it is tuned a bit, it will be nice to have a side-by-side comparison and benchmark of Go and Felix code to show off.

i cant write Go. However the real problem is to build suite of tests that covers a
reasonable set of combinations. The existing test is specifically designed to

(a) have almost zero lock contention and
(b) work around the fact that correct thread termination isn’t implemented!

Point (b) is totally non-trivial.

What happens at the moment is that a thread will “hang” on the Async read queue
if there are any jobs pending. So if there’s one job pending ALL the threads hang.

So the job completes and becomes ready and put in the queue, and ONE thread
can retrieve it and run it. The others .. they’re already hanging, they don’t participate
in the processing of the job, and they will hang forever if no more async requests
are made.

That doesn’t happen in the test because all the jobs to run are scheduled *before*
spawning the threads, and, there’s no async requests, so each thread terminates
when there is no work to do. None of the coroutines spawns more work either.
So its totally unrealistic: I’m only checking that the context switching time is
improved by concurrency.

It is. Whew!

Ok, so what to do? Now the thing is, at present you can say

run procedure;

and that runs procedure on a fresh scheduler. However that’s a sync only
scheduler, it cannot do async requests or spawn processes or pthreads.
Also since that subroutine call is made by ONE thread, the scheduler
is only run by a single thread, even if the calling scheduler is multi-threaded.

Suppose it could do I/O and spawn pthreads .. exactly what would that mean?
When you start a pthread the first async requests starts a demux thread to
monitor async events. But the nested scheduler is not running a new thread.

The rule for design is it must be beautiful. Because if it is beautiful it is correct.
If it is ugly .. its bugged.

I can’t see a beautiful picture here yet. Can’t choose between alternative ideas.

However one breakthrough: there is a lock on each queue to protect
context switches. The same lock can logically mark user defined critical
sections such as storing data in one coroutine by another via a pointer.
That fully serialises access to the shared store AND context switches,
in other words it can be views as just adding a new “user defined
synchronisation point”.

the NEAT thing is

(a) the lock is already there
(b) existing events (channel i/o etc) are serialised by it
(c) the new even is serialised by it as well
(d) it is LOCAL to the coroutine queue (doesn’t interfere with
any other coroutines running on separate threads)

What is more if the coroutine queue is running single threaded
at the moment the lock is ignored. So we do that with the shared
memory operations too. No need for a lock.

So the performance and semantics are *preserved* provided
we use critical sections around shared memory access.
Completely solving the problem of how to handle impure coroutines.

The solution is the traditional one: use a lock. The advantage is
that its a system provided lock, there’s only one per set of concurrent
coroutines, so you don’t need to define your own. Just writing say

atomic { … }

around the critical section is enough. So this one is, actually,
beautiful. So it must be the right solution. If you run extra processes
you have to manually launch tham, so you know you also have to use
the critical sections. So I don’t yet have to worry about deciding how
to handle impurity.







John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 7, 2018, 9:13:09 AM12/7/18
to felix google
Thread termination logic:

Shayne Fletcher

unread,
Dec 7, 2018, 10:09:35 AM12/7/18
to felix-l...@googlegroups.com
Nice!

--
You received this message because you are subscribed to the Google Groups "Felix Language" group.
To unsubscribe from this group and stop receiving emails from it, send an email to felix-languag...@googlegroups.com.
To post to this group, send email to felix-l...@googlegroups.com.
Visit this group at https://groups.google.com/group/felix-language.
For more options, visit https://groups.google.com/d/optout.
ThreadTermination.jpg

John Skaller2

unread,
Dec 7, 2018, 9:48:03 PM12/7/18
to felix google


> On 8 Dec 2018, at 02:09, Shayne Fletcher <shayne.fl...@gmail.com> wrote:
>
> Nice!
>

But not so simple. At a reasonably high level, Felix has two subroutines:

run_until_blocked
run_until_complete

The second one keeps going until there is no more work to do (for this thread).
That includes no pending async requests.

The first one runs until there is no work that can be done *immediately*.


Why the dichotomy? Well, as you know, Felix is all about control inversion.
You write a thread .. Felix translates it into callbacks. (The callback is
a continuations resume() method).

Now, running to completion is what you want your mainline program to do.
However Felix is designed to be *embedded* in C, as well as being able to
embed C into Felix.

In particular, it is common to have a C program, such as an X-Windows
application using Xlib, that is unfortunately forced into being written as
callbacks driven by Xlib’s framework. Almost all these kinds of frameworks
are loops that call client callbacks with messages, the framework captures
the events and sends the message.

So we need to be able to run Felix for a while, and when there’s nothing
*immediate* to do, suspend it and return control to the master framework’s
event handling loop. When other stuff is idle, it will call Felix, provided
the programmer registers a suitable callback.

So Felix own framework itself hase to be turned into a callback.
The machinery is a bit messy, there are actually 4 callbacks,
three for setup and one to do the work.

Note that if you spawn a new pthread inside Felix THAT thread runs
all the time. Only the external event loop’s callback has to suspend
and resume.

In addition, whilst suspended, the thread cannot respond to a world_stop
request, so it has to actually tell the collector it is stopped, and un-tell (sic)
the collector when it resumes.

Furthermore, the client C/C++ code into which Felix is embedded needs to be
able to communicate with Felix code. For this reason there’s a special
service call “multi-write” which writes data to *multiple* channels at once.
These will be connected to coroutines that are suspended when the
write is done: the write is done from *outside* Felix. It reschedules the
readers so next time Felix is resumed, there’s work to do that wasn’t
there before. Its a kind of async I/O. (multi-write is available inside
Felix as well, its very useful for a clock telling lots of agents in a game
that its time for the next action).

So now, the thread termination condition is more difficult: not to implement,
yes, that could be tricky .. the difficulty now is much worse: its a specification
problem. Not even a design issue: the question is:

If multiple threads are servicing a queue .. what does it
actually MEAN to run until blocked? Should the other threads keep running?

The actual impact is that when querying the async read queue, run until
complete blocks until there is something to get out of the queue,
assuming there is a pending job (otherwise we’re done). However when
running run until blocked, it calls a different instruction on the queue
that always returns, whether or not there’s something to fetch.
If something got fatched .. processing continues, otherwise we’re
blocked and we have to yield to the master framework.

The logic was designed assuming one pthread. Note if we terminate
all the threads .. we have to decide how after resuming following
a suspension how to restart them.

When you spawn a pthread normally it uses run until complete.
Its only the main thread that uses run until blocked.
I have to check but, the driver is the thing responsible for
controlling this. The driver could run until blocked, in which
case it would just loop around and resume immediately.
So the flag to say, run until complete is an optimisation.




John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Dec 8, 2018, 3:23:21 AM12/8/18
to felix google
Hi John,

If you are using a fixed sized hardware thread pool (sized for the number of CPU cores) then threads never terminate.

If you are considering the suspended async tasks, you don't care about termination because they consume no resources, except preventing the GC of resources they may reference if they ever resume. There is not really a point at which we can say an asynchronous task will never resume automatically, as this is determined by semantics outside the scope of the programmer. As such it is impossible to automatically do this, but if the programmer is sure an asynchronous task is no longer needed they should be able to 'cancel' it which means it's result is never going to be observed, and any referenced resources can be freed.

Keean.


John Skaller2

unread,
Dec 8, 2018, 9:01:18 AM12/8/18
to felix google


> On 8 Dec 2018, at 19:23, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Hi John,
>
> If you are using a fixed sized hardware thread pool (sized for the number of CPU cores) then threads never terminate.

Sure they do. Felix threads are detached. The process terminates when
all the threads do.

At present if you use a thread pool, its implicitly created, but unfortunately
you have to manually stop it or the process won’t terminate.

However the thread pool used by the coroutine scheduler to elevate them
to microprocesses is unrelated to the thread pool.

The design at the moment is that te thread pool defaults to 1 thread per CPU
I think, or to 8. I can’t remember. But you can set it to whatever you want.
There’s a system thread pool, but you can create one if you want.

For the coroutines, at the moment, the client has to spawn the extra threads.

In other words, Felix doesn’t use a fixed set of threads, one per CPU.
Its up to the programmer to make choices. I may change that after
experiments and measurements.


>
> If you are considering the suspended async tasks, you don't care about termination because they consume no resources, except preventing the GC of resources they may reference if they ever resume. There is not really a point at which we can say an asynchronous task will never resume automatically, as this is determined by semantics outside the scope of the programmer. As such it is impossible to automatically do this, but if the programmer is sure an asynchronous task is no longer needed they should be able to ‘cancel' it which means it's result is never going to be observed, and any referenced resources can be freed.

I’m not sure if you can cancel an async task at the moment.
Worse, if you do socket I/O it will not return until the job
is done or it gets an error. There’s no timeout. That needs to be fixed.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 8, 2018, 9:20:06 AM12/8/18
to felix google
Ok, I think I have figured out the some of the semantics for top level threads.

There are three adjectives for Felix threads:

(a) mainline thread
(b) primary thread
(c) secondary thread

The mainline thread is a primary thread. It is the thread that was running when
your program starts.

The non-mainline primary threads are those made by spawn_pthread.
The secondary threads are those made by spawn_process.

Every thread has an async scheduler and sync scheduler. However
secondary threads shared fibre list with one primary thread.

Now, we need secondary threads to die when there’s no work to do,
and none coming, however primary threads must return.

So, the mainline thread can be run two ways:

(a) blocking mode
(b) returning mode

In blocking mode, if there are pending async jobs, and no active threads,
a thread hangs on the async queue whereas in returning model,
it returns. This means it gets suspended and Felix yields to an external
event loop. On re-entry, the thread is resumed.

If there are active jobs but no work in the job queue, the mainline
thread must return. However no secondary thread may return
until there are no running threads, and no pending async.
Instead they pauses and retry, until this condition is met.

So things to note: by these rules, the mainline thread ONLY
may return, but leave secondary threads waiting on async.
So actually, the mainline thread can return even when there
are running jobs. Its not at all clear why the mainline thread
wouldn’t return at ANY time if there are other threads to carry on
the work.

To preseve the existing semantics, the mainline thread shouldn’t
return until there are no active threads. At this point, if there
is no pending async, all threads return, if there is pending async,
the secondary thread wait. This design has merit too, but the previous
one allows the event loop to run more quickly.

There is still a problem. If the mainline thread returns, the secondary
threads can still run out of all work and terminate.

This sounds OK but it isn’t because in the external event loop,
a multi-write can put jobs back into the queue.

Note that for nested schedulers at present there’s a new queue,
and a new sync scheduler but no new async scheduler.
At present a nested job cannot spawn pthreads, processes,
or async jobs.



John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Dec 8, 2018, 12:31:31 PM12/8/18
to felix google
Hi John,

I would simplify that, and have the runtime entry code fire up a thread pool of 1 thread per core (overridable by a command line parameter) and have that pool exist all the time the program is running. In that way the programmer doesn't have to think about hardware threads at all. The programmer simply schedules asynchronous tasks, which either can be in parallel or are sequential, and should not have to worry whether there are one, eight, or sixty-four hardware threads (there would be 64 on an AMD Threadripper 2990 for example).

Keean.


John Skaller2

unread,
Dec 8, 2018, 5:15:27 PM12/8/18
to felix google


> On 9 Dec 2018, at 04:31, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Hi John,
>
> I would simplify that, and have the runtime entry code fire up a thread pool of 1 thread per core (overridable by a command line parameter) and have that pool exist all the time the program is running. In that way the programmer doesn't have to think about hardware threads at all. The programmer simply schedules asynchronous tasks, which either can be in parallel or are sequential, and should not have to worry whether there are one, eight, or sixty-four hardware threads (there would be 64 on an AMD Threadripper 2990 for example).


Three issues with that.

First as noted previously, Felix as a systems programming language should not
pre-empt decisions of the developer using it. Its a toolbox. If you want to do
imperative programming you can. If you want to do functional programming you can.
If you want to program with multiple pthreads, you can.

Instead, there *is* a system thread pool, defaulting to a number of threads
related to the number of CPUs. But its not fixed for a good reason: the Felix
program might not be the only process running.

Instead, the system should be *capable* of roughly what you describe,
but not enforce that design.

Second, there are heavy constraints on synchronisation when using a thread pool.
A job in a pool is cannot be allowed to block. So instead, every synchronisation
device has to be replaced by an equivalent one that works in a pool. As noted
previously, in the extreme form of this, we’re effectively rewriting the operating
system. Maybe I can do it better than the real OS and maybe not :-)

Third, Felix is designed to be embedded in C, as well as embedding C.
More particularly, it is designed to be embedded in an external event loop.
This is absolutely necessary in some contexts, for example, you might
want to do Windows graphics programming inside an event management
library, or X windows within Xlib. Or use SDL2 under iOS and Android.

To make this work there are two considerations. First, Felix mainline
has to suspend regularly, because its running as an “idle-task” in the
external event loop.

Secondly it has to synchronise with the event loop. If you’re doing
crap like OpenGL, for example, and you want to do graphics on
the current display, it HAS to be done from the main thread.
Similarly in Windows, the message queues are “per thread”.
X is thread safe and better designed in some ways but if you’re
writing SDL2 code, you have to make it work on all platforms.
SO even though you can let non-mainline threads keep running
whilst the main thread has yielded to the event looop, there
are constraints on what they can do: no graphics, for example.

In other words as an application language, you cannot just run
“the program” in a thread pool. In fact Felix doesn’t generate
programs. It generates libraries.

So the bottom line is, even if your design is optimal for a dedicated server
program, that is only one use case. The hard thing is to make that
use case work fast but still support the others.





John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 9, 2018, 7:08:12 PM12/9/18
to felix google
OK so for top level (non-nested) scheduler:

There are 4 kinds of Felix theads (note: there are other threads such as demux event
polling threads and also in parallel mode, the GC can run extra threads).

1. mainline

This is the Felix thread created out of the standard driver’s mainline thread,
usually the one the OS started the process with. It constructs the Felix
environment, runs until there is nothing to do, cleans up, and returns
control to the OS. The mainline thread is not usually permitted to terminate
until all other threads have terminated (because it would destroy their
environment, including the garbage collector!)

2. embedded

This is not currently used. It is the kind of thread the client specifies
when embedding Felix in an event loop. It differs from the mainline
thread in that it the associated Felix thread can suspend and resume.
When it suspends the Felix thread object terminates but the environment
is preserved. Whilst suspended, the physical thread is running the
external event loop. Other Felix threads can continue to run .

3. pthread

The kind of thread made by spawn_pthread. It gets its own scheduler
and scheduler list. Async I/O constructs a dediced demux thread.
It operates the same as a mainline thread except that termination
does not tear down the whole Felix environment, it just dies.


4. process

The kind of thread mad by spawn_process. It gets its own scheduler
but SHARES the scheduler list with its caller. The sharing is transitive
and accumulative. This is how you run multiple coroutine concurrently.

TERMINATION RULES

The general rule is set of related Felix threads must terminate together.
Threads are related if, and only if, they share the same coroutine queue.
In a set of related threads all but one must be process threads, thesse are SLAVES.
The other thread is the MASTER.

To make this work is tricky. So I am trying this rule:
if there is

(a) no work being done
(b) no synchronous work to do
(c) no pending async jobs

then all process threads terminate and the other thread returns.
If its a pthread it also terminates but the mainline thread has
to do cleanup first.

Its quite tricky to meausre when there is no work being done and
no synchronous work to do efficiently.

When there are pending async jobs then

1. IF the MASTER thread is not embedded,
it BLOCKS waiting to fetch one when it completes

2. IF a thread is embedded or process
if work could turn up then
it spins with a delay waiting for work
otherwise it terminates

Work is possible if:

(a) a thread is workikng
(b) the sync queue is populated
(c) there are async jobs pending
(d) the MASTER is embedded

The really hard thing is deciding wnen a thread is working.
The rule is simple: a thread is working if, and ony if, the “ft” variable of the
sync scheduler is non-null. When a fibre finishes and it becomes NULL
the sync scheduler tries to fetch another fibre to run. The thread is
considered to continue to be working whilst it tries to do this.

So basically all that’s required is to start the count at zero,
and increment it when ft is set to non-NULL, and decrement it
when it becomes NULL.

The hard but is the optimisation. If it is NULL only transiently
we don’t want to decarement and then increment the count.
We only want the count to be decremented when a scheduler
finds there is NO work to do other than maybe async work.

If the working count is zero AND the asyc pending count is zero,
tha set of related threads is finished IF and ONLY IF it is
a mainline or pthread set.

Embedded .. I don’t know for sure yet.
In this case threads should just keep running until explicitly
killed. The reason is, the client external event loop is
allowed to *inject* work into Felix from outside. There’s a
special operation for this. As a motivating example,
a bunch of fibres is running a game, shooting each other,
making noises and drawing. Then they all complete,
and the embedded system suspends and yields to
the event loop. Whan 4 ms has elapsed, the master
framework injects a clock pulse with a multi-write that
wakes up all the fibres, and then resumes running Felix.

4 ms is 25 FPS, about the right frame rate for a low end game.

Anyhow the difficulty is I have functions like “get fibre from queue”
that can return either a fibre or NULL if there aren’t any.
So I can’t really tell with an explicit test, in each case.
SImilarly routines exit. The problem is, this stuff happens
concurrently.

The current code has a race.- A thread says its not working
and delays a bit, then goes back to trying to work, where it
says its working. All the process threads do this. So most
of the time they’re spinning saying, “I’m not working” and
occasionally, even when there’s no work to dom they’re
saying “I’m working” before they actually find some work.
Looking for a job is a job in itself :-)

The problem is, we need some random luck for all
the threads to be saying they’re not working, including
the master .. before any of them can terminate.
And of course they can’t terminate when they’re sleeping.
And they’re working when they’re not sleeping … :-)



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 10, 2018, 12:05:53 AM12/10/18
to felix google
Well finally .. all tests pass. And:

///
~/felix>flx concurrent_coroutines.flx
Spawned
Spawned
New thread
New thread
New thread
New thread
New thread
New thread
Elapsed 7.49796
Elapsed 0.685071
~/felix>flx mul
Starting
Single thread Done 21.6451
Thread pool Done 3.21698
~/felix>flx build/release/share/demo/threadpoolex1.flx
Naive mul elapsed 2.72407 seconds
Naive Parallel mul elapsed 0.398374 seconds
Verified
Using thread pool's pforloop
Smart Parallel mul elapsed 0.41521 seconds
Verified
pfor mul elapsed 0.4002 seconds
Verified
//////


i have not checked the logic for processes (concurrent coroutines).
I am NOT sure that the atomic updates actually include barriers.
I’m not sure there are no races.

I am sure nested schedulers still only allow synchronous ops.

But the numbers look good, finally. Concurrent coroutines, 8 times faster than single threaded.
Thread pool based multiply: mul, anout 7 times faster with multiple threads.
Using system thread pool, same: around 7 times faster.

So .. its time to commit. :-)


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 10, 2018, 1:49:07 PM12/10/18
to felix google


> On 10 Dec 2018, at 16:05, John Skaller2 <ska...@internode.on.net> wrote:
>
> Well finally .. all tests pass. And:
>
> ///
> ~/felix>flx concurrent_coroutines.flx
> Elapsed 7.49796
> Elapsed 0.685071

So I checked the results, got a bug, got different timings .. its not clear what’s
happening. The problem is the timing and result found depend on exactly
when the result printer runs which is not determinate with the test.
It would be good if I could “run” the tests but “run” doesn’t allow spawning processes
at the moment.

I tried a second version in which the helper processes are spawned FIRST
not afterwards. This test ALSO included the time to create an ftread, which uses
malloc.

This version of the test got the right results.

But the SHOCKER was the timing. Ithought it would be slower because malloc()
was included in the timing. Here’s the result:

Single thread: Elapsed 0.000173

Huh?

With helper threads:

Elapsed 2.3e-05

That’s SO FAST the measurement is effectively zero.

Hum. The problem appears to be that STL list is SLOW. In the concurrent
version of the test, the fibres get eaten up by the helper threads faster than
they can be created. So the STL list is always almost empty, and malloc()
is simply recycling the memory from STL list. (The Felix objects are
garbage collected and the collector is not running).

I need a more robust test. With the collector running, the numbers don’t
change but they must be lying because the collector reports program
time of 5 seconds (44% in the GC which is pretty good).



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 15, 2018, 4:01:03 PM12/15/18
to felix google
Well … this code:


proc atest() {
println$ "A test";
spawn_pthread { println$ "I'm a pthread!"; sleep 2.0; println$ "Pthread awakes"; };
sleep 1.0;
println$ "Hey, I woke up";
}


println$ "Running async test";
async_run atest;
println$ "After running async test";

println$ "Running async test";
async_run atest;
println$ “After running async test";

WORKS! Instead of “run” you say “async_run”. This is a MAJOR upgrade.
I haven’t tried it with concurrent coroutines yet. There’s a still a problem
with them (performance sucks).

There is a bug: the async scheduler is currenly leaked. In the RTL it
is ref counted. I could fix that with an explicit delete, but I think I’ll
just fix the RTL to garbage collect it.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 19, 2018, 8:39:08 AM12/19/18
to felix google

So its not yet working but I am close to a design. So here’s what happens:

1. In a main program, the environment is set up, the mainline thread executes,
it returns, and it tears down the environment. Process complete. The thread
must suspend but remain registered when async events are pending.

2. In an event loop, the thread exits without cleanup and suspends
as soon as there is no work to do, leaving pending async events.
The driver must resume the thread repeatedly until there are
no pending async events, in this case the thread exists
and cleans up.

3. A thread can spawn another thread with spawn_pthread.
That thread has its own scheduler and async queue.
It exits only when there is no work and no pending events.
It doesn’t cleanup (only the main thread is allowed to do that).

4. The main thread must wait until all other threads are completed
before cleanup.

5. Any thread can create and run a nested scheduler.
Doing this is a subroutine call. The thread cannot return until
there is no work to do on that scheduler.

6. Any thread can create a helper process thread.
It runs with the same async scheduler and queue
as its creator. Processes cannot return until there
is no work to do and no async pending.T

7. The thread that creates them also cannot return until all the
child processes it created have returned. In effect,
the master thread must join all te slaves.

8 Note that if a slave process creates another one it
is owned by the mastrer not the slave so the creator slave
does not wait for it, the master does.

9. See again (5), slaves can run nested schedulers too.




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 20, 2018, 10:15:05 PM12/20/18
to felix google
~/felix>flx concurrent_coroutines; flx concurrent_coroutines2; flx concurrent_coroutines3
Schedule before launch
Done Serial kk = 200, elapsed=12.4906
Done Concurrent kk = 200, elapsed=3.78415
Schedule after launch
Done Serial kk = 200, elapsed=12.4911
Done Concurrent kk = 200, elapsed=4.57675
Spawn after launch
Done Serial kk = 200, elapsed=12.5106
Done Concurrent kk = 200, elapsed=5.71336


Each test spawns 6 helper processes, so for my corei7 with hyperthreading,
that’s 7 threads. Each coroutine is eating CPU by calculating ack(3,40),
where ack is Ackerman’s function. 200 runs are done.

Schedule uses spawn_fthread which puts the jobs at the END of the queue.

Spawn uses spawn_fibre which starts running the job immediately,
pushing *itself* onto the head of the queue.

Exactly why the results differ is beyond me! In these tests, almost all the time
is spent doing Ackerman’s function, so the context switching time is
irrelevant. The GC time will be insignificant.

Note that if you spawn_fibre first then launch the helper processes
it should be the same as for serial (or slower).


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 22, 2018, 5:59:40 AM12/22/18
to felix google
So my latest test said SIGBUS.

///////////
fun ack(x:int,y:int):int =>
if x == 0 then y + 1
elif y == 0 then ack(x - 1, 1)
else ack(x - 1, ack(x, y - 1))
endif
;

fun slow (x:int) {
println$ "Slow " + x.str;
assert ack(3,10) > 0;
return 1;
}

proc countem (y:int) {
println$ "Got one";
}

var start = time();
async_run {
//spawn_process { println$ "Helper 1"; };
spawn_process { println$ "Helper 2"; };
#(source_from_list ([1,2,3,4,5,6,7,8]) |->
function slow |->
function slow |->
function slow |->
function slow |->
function slow |->
function slow |->
function slow |->
function slow |->
procedure countem);
};
var elapsed = time() - start;

println$ “Elapsed " + elapsed.str;
//////////////

Hmm .. lldb couldn’t give any information except “invalid thread”.

OK, so try static linkage. Well …

rocess 14354 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x700008401ff8)
frame #0: 0x0000000100004af8 concurrent_pipeline`flxusr::concurrent_pipeline::ack(_vI67614_x=1, _vI67615_y=<unavailable>) at concurrent_pipeline.cpp:539 [opt]
536 _ml67602_L69376:;
537 /*match case 2:any*/
538 /*parallel assignment*/
-> 539 _vI67615_y = ack(_vI67614_x,_vI67615_y - 1 ); //init
540 _vI67614_x = _vI67614_x - 1 ; //init
541 goto start_69378_L69378;
542 }


Hang on. That’s Ackerman’s function. HEAVILY RECURSIVE.

OMG you utterly LAME STUPID CRAP OPERATING SYSTEM.

Its a stack overflow.

there is NO EXCUSE AT ALL FOR A STACK OVERFLOW ON A 64 BIT MACHINE.

32 bit, sure. not much address space. But on a 64 bit machine all stacks can
be F’ING INFINITE.

I changed Ack(3,11) to Ack(3,10) and sure enough it works.

however there’s no speedup and it doesn’t seem to hammer the CPUs ..

It should! When a reader and writer meet, the writer is started and the reader
goes on the queue .. one of the spawned processes should start running it.
Its not happening…. hmm ..




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 22, 2018, 6:50:43 AM12/22/18
to felix google


> On 22 Dec 2018, at 21:59, John Skaller2 <ska...@internode.on.net> wrote:
>
>
> there is NO EXCUSE AT ALL FOR A STACK OVERFLOW ON A 64 BIT MACHINE.
>
> 32 bit, sure. not much address space. But on a 64 bit machine all stacks can
> be F’ING INFINITE.
>
> I changed Ack(3,11) to Ack(3,10) and sure enough it works.

I fixed the stack size on Posix .. temporarily secondary threads get 100Meg.
Now using Ack(3,12):

Serial Elapsed 67.3182
Concurrent Elapsed 22.7808

That’s better!


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 29, 2018, 3:50:49 PM12/29/18
to felix google

>
> I fixed the stack size on Posix .. temporarily secondary threads get 100Meg.
> Now using Ack(3,12):
>
> Serial Elapsed 67.3182
> Concurrent Elapsed 22.7808
>
> That’s better!

With faster schannels:

Serial Elapsed 55.546
Concurrent Elapsed 18.6067

:-)



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 30, 2018, 4:34:22 PM12/30/18
to felix google
ok so my next trick is to look at the cost of service calls.

at present you write

_svc variabe

where variable contains the service call. The address of the variable is put into
the current continuation which then returns control to the scheduler.

The type of a service call is a variant with the standard representation,
a _uctor_ with an integer tag for the service call number and a pointer
to the argument. Here’s read and write on an schannel:

/*7*/ | svc_sread of _schannel * &address
/*8*/ | svc_swrite of _schannel * &address

The second argument is a pointer to the location containing the address
value to be read or the place to put the address value being written.

The argument is heap allocated according to the rule for variant formation.

It works like this:

noinline proc svc(svc_x:svc_req_t) {
var svc_y=svc_x;
_svc svc_y;
}

proc read[T] (chan:schannel[T], loc: &&T) {
svc$ svc_sread$ C_hack::cast[_schannel] chan, C_hack::reinterpret[&root::address] (loc);
}


proc write[T] (chan:schannel[T], v:T) {
var ps = C_hack::cast[root::address]$ new v;
svc$ svc_swrite$ C_hack::cast[_schannel] chan, &ps;
}

The write operation make a copy of the argument to be written on the heap,
since schannels can only read and write addresses.

It shouldn’t need to do that if:

(a) the value fits into an address
(b) its a unique pointer already

So basically at present doing a write requires 2 allocations and doing a read requires 1.
Using the new allocation free schannel operations and queue service I got this
with a null test sending 50K integers along a pipeline length 50.

~/felix>FLX_MIN_MEM=1000 flx concurrent_pipeline3
List length = 49152
Serial Elapsed 10.288
Concurrent Elapsed 37.12

If I watch the mac CPU HIstory monitor running this the concurrent version CPU usage bars are
filled with red. The monitor uses green for user and red for system use. Only 4 CPUs work.
There’s no red in the serial version. The red has to be thread pre-emptions related to
contention of locks on allocation.

The GC uses a system mutex to serialise allocations. Allocation use malloc which itself
is thread safe and probebly uses a system mutex for serialisation as well (although
modern mallocs have thread local storage to allocate from, using that pool there would
be no serialisation required).

Note that in Felix GC locks are acquired even if the program is entirely serial.
There is a serial GC, in fact the GC is a serial object which is made thread
safe in a derived class with wrapper methods that do the locking.

Unfortunatly wrapping malloc in a spinlock is not acceptable because
malloc is an unknown. On the other hand the management data structure
is a JudyLArray with guarranteed bounded performance so a spinlock
for it would be appropriate on access .. but of course insertions still
have to use malloc.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 30, 2018, 4:57:07 PM12/30/18
to felix google

>
> ~/felix>FLX_MIN_MEM=1000 flx concurrent_pipeline3
> List length = 49152
> Serial Elapsed 10.288
> Concurrent Elapsed 37.12

Deleting collector total time = 54.79076 seconds, gc time = 123.09077 = 224.66%

Interesting :-)

Without GC:

~/felix>FLX_MIN_MEM=2000 FLX_REPORT_COLLECTIONS=1 flx concurrent_pipeline3
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
List length = 49152
Serial Elapsed 10.3565
Concurrent Elapsed 23.7794
Deleting collector total time = 34.93044 seconds, gc time = 0.00000 = 0.00%


The concurrent version is still slower.

Note that in this test I *EXPECT* the concurrent version to be slower.
This is because its a null test. The routines in the pipeline do nothing
except read and write an int. No calculations. So ALL the time is spent
in the system. So most of the time is going to be spent contending locks
and doing allocations.

I added a forced collection between the serial and concurrent tests:

~/felix>FLX_MIN_MEM=2000 FLX_REPORT_COLLECTIONS=1 flx concurrent_pipeline3
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
List length = 49152
Serial Elapsed 10.18
[Gc::collect] Program requests collection
[flx_gc:gc_profile_t] actually_collect
actually collected 14893302 objects, still allocated: 2 roots, 49896 objects, 816276 bytes
[Gc::collect] Collector collected 14893302 objects
Concurrent Elapsed 15.9588
Deleting collector total time = 34.12926 seconds, gc time = 7.60181 = 22.27%


So 15 million objects are being created. In this test, the service call object only
need to be created once, and I can fudge that in the test code: at present the service
call is inside the procedure’s read/write loop. But the addresses of the data to be read/written
are the same each iteration so the service call object can be lifted out of the loop (the call itself
can’t be of course).

Also I can bypass the “new” operation on the data to write with some hackery.

It actually quite good above: only 50% overhead for the concurrent version.
however the overall time of 10 seconds is absurd, that’s at least 1000 times
too slow. We’re only using 50K values, and a pipeline of length 50.
It should only take a few microseconds to send a value down the pipe.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Dec 30, 2018, 10:50:41 PM12/30/18
to felix google


> On 31 Dec 2018, at 08:57, John Skaller2 <ska...@internode.on.net> wrote:
>
>
>>
>> ~/felix>FLX_MIN_MEM=1000 flx concurrent_pipeline3
>> List length = 49152
>> Serial Elapsed 10.288
>> Concurrent Elapsed 37.12
>
> Deleting collector total time = 54.79076 seconds, gc time = 123.09077 = 224.66%
>
> Interesting :-)
>
> Without GC:
>
> ~/felix>FLX_MIN_MEM=2000 FLX_REPORT_COLLECTIONS=1 flx concurrent_pipeline3
> [FLX_REPORT_COLLECTIONS] Collection report enabled
> [FLX_REPORT_GCSTATS] GC statistics report enabled
> [FLX_REPORT_COLLECTIONS] Collection report enabled
> [FLX_REPORT_GCSTATS] GC statistics report enabled
> List length = 49152
> Serial Elapsed 10.3565
> Concurrent Elapsed 23.7794
> Deleting collector total time = 34.93044 seconds, gc time = 0.00000 = 0.00%


As promised I hacked it to be functionally equivalenty but do no allocations.
Actually the service request object constructors do an allocation but it is
only done once, not inside the loop.

The hack transfers the integers directly instead of a pointer to a heap allocated
value. So we can see how fast the RTL machinery is.

Are you ready???

~/felix>FLX_MIN_MEM=1000 FLX_REPORT_COLLECTIONS=1 flx concurrent_pipeline3
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
List length = 49152
Serial Elapsed 0.249465
[Gc::collect] Program requests collection
[flx_gc:gc_profile_t] actually_collect
actually collected 295204 objects, still allocated: 2 roots, 49746 objects, 813200 bytes
[Gc::collect] Collector collected 295204 objects
Concurrent Elapsed 0.248457
Deleting collector total time = 1.58544 seconds, gc time = 0.16338 = 10.30%


The serial code is 50 times faster, the concurrent code 100 times faster,
and, faster than the serial code too. So its not the 1000 times I hoped for ..
but its still pretty awesome.

The code is below. This kind of transfer (zero allocations) is good
for ANY pointer and any integer small enough to fit in a pointer.

Observe the svc_sread/svc_swrite type constructors that make service
request objects and the _svc primitive that does the actual operation.
The Felix compiler translates _svc into a service call by setting
the pointer to the request into the continuation object, saving
the program counter, and returning. The scheduler does the request
then calls the resume() method which jumps to the point after
the request and continues.

It’s easy to overload read/write so they transfer pointers directly.
Its not clear how to avoid repeadedly allocating the request object tho.
The user shouldn’t need to see that.

//////////////////////////////
proc makem (var lst:list[address]) (r: (out: %>address)) () {
var x : address;
var wreq = svc_swrite (C_hack::cast[_schannel]r.out, &x);
proc dowrite() { _svc wreq; }
next:>
match lst with
| Empty => ;
| head ! tail =>
x = C_hack::cast[address]head;
lst = tail;
//println$ "elt=" + x.str;
dowrite();
goto next;
endmatch;
}



proc slow (seq:int) (r: (inp: %<address, out: %>address)) () {
var x : address;
var rreq = svc_sread (C_hack::cast[_schannel]r.inp, &x);
var wreq = svc_swrite (C_hack::cast[_schannel]r.out, &x);
proc doread() { _svc rreq; }
proc dowrite() { _svc wreq; }
while true do
doread();
//println$ "rout " + seq.str + " read " + x.str;
// assert(C_hack::cast[uintptr]x == C_hack::cast[uintptr]1);
dowrite();
done
}

proc countem (r: (inp: %<address)) () {
var x : address;
var rreq = svc_sread (C_hack::cast[_schannel]r.inp, &x);
proc doread() { _svc rreq; }
while true do
doread();
//println$ "result=" + x.str;
done
}

var lst = ([C_hack::cast[address]5,C_hack::cast[address]2,C_hack::cast[address]3]);
for(var i = 1; i<15; ++i;) lst = lst + lst;
println$ "List length = " + lst.len.str;

proc check(concurrent:bool)
{
async_run {
if concurrent do
spawn_process {; };
spawn_process {; };
spawn_process {; };
spawn_process {; };
spawn_process {; };
spawn_process {; };
spawn_process {; };
done
/*for(i = 1; i<1000; ++i;)*/ {
//if i%100 == 0 perform
//println$ "i=" + i.str;
#(makem lst |->
slow 1 |->
slow 2 |->
slow 3 |->
slow 4 |->
slow 5 |->
slow 6 |->
slow 7 |->
slow 8 |->
slow 9 |->
slow 10 |->
slow 11 |->
slow 12 |->
slow 13 |->
slow 14 |->
slow 15 |->
slow 16 |->
slow 17 |->
slow 18 |->
slow 19 |->
slow 20 |->
slow 21 |->
slow 22 |->
slow 23 |->
slow 24 |->
slow 25 |->
slow 26 |->
slow 27 |->
slow 28 |->
slow 29 |->
slow 30 |->
slow 31 |->
slow 32 |->
slow 33 |->
slow 34 |->
slow 35 |->
slow 36 |->
slow 37 |->
slow 38 |->
slow 39 |->
slow 40 |->
slow 41 |->
slow 42 |->
slow 43 |->
slow 44 |->
slow 45 |->
slow 46 |->
slow 47 |->
slow 48 |->
countem);
};
};
}

var start = time();
check(false);
var elapsed = time() - start;
println$ "Serial Elapsed " + elapsed.str;
collect();

start = time();
check(true);
elapsed = time() - start;
println$ "Concurrent Elapsed " + elapsed.str;






John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 1, 2019, 6:59:50 PM1/1/19
to felix google
So two changes required now. First:


> proc slow (seq:int) (r: (inp: %<address, out: %>address)) () {
> var x : address;
> var rreq = svc_sread (C_hack::cast[_schannel]r.inp, &x);
> var wreq = svc_swrite (C_hack::cast[_schannel]r.out, &x);
> proc doread() { _svc rreq; }
> proc dowrite() { _svc wreq; }
> while true do
> doread();
> //println$ "rout " + seq.str + " read " + x.str;
> // assert(C_hack::cast[uintptr]x == C_hack::cast[uintptr]1);
> dowrite();
> done
> }


I need to change the service call representation. I think I’ll use a C union.
Each request object start with a tag value and then fields for its data.
So a request would look like

var req = read_req (svc_read_tag, thechannel, &resultvar);
_svc req;


The second change is to transfer pointers directly. Technically,
they should be uniquely typed.

Now one hopes if we put the new request inside a loop, the C++ compiler
will notice it is invariant and lift it out. I wouldn’t trust the compiler much though.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 4, 2019, 11:55:25 AM1/4/19
to felix google
So I’m doing the change so svc requests now use a C discriminated union instead
of a Felix one. In other words a union. This avoids heap allocation of the request data
which the Felix variant required.

I am thinking about whether I should add this to Felix as a core type.
That is, an expanded variant.

The Felix library still heap allocates a copy of the object to be transfered.
This can be avoided by specialising for small objects, primarily pointers.
Technically they should be uniq.

HOWEVER it occurs to me that be adding an extra variable to the request
of type

void (void *src, void *dst)

we can call that function to do the copying. That is, we can say take
a C++ copy or move constructor, wrap it in a C function, and pass that.
The system then invokes it. The current copying is done by

*r = *w;

where void **r, void **w are pointers to the slots containing the pointer
to be copied and where it has to go.

Hmmm…


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 10, 2019, 12:50:16 PM1/10/19
to felix google
Visiting again *user* level locking issues.

First, with the standard multi-core scheduling model, contention is very unlikely.
Roughly speaking the probability is:

fCon = nFib * (nThread - 1) * fLock

where

nFib is the total number of fibres running
nThread is the total number of threads running the fibres
fFlock is the fraction of time spent by a fibre holding a lock

Note that spinlocks are seirously BAD on a uniprocessor because a contender
will run UNTIL prempted. Correct operation on a uniprocessor really requires
the lock holder to mask IRQ interrupts to prevent pre-emptions. This would be
nice on a multi-processor too. AFAIK only Windows provides real critical sections.

For a large number of fibres run by a small number of threads, with locks only used
occasionally, and spanning only small amounts of time, the possibility of contention is
very small, and the delay given that the lock is only held briefly also small PROVIDED
a spin lock is used. The worst case is that a lock holder is pre-empted and descheduled
by the OS, whilst there is a condender which is not. The probability of THAT happening
is approximately zero if he total number of threads is less than the number of CPU cores,
but goes up rapidly as the number of threads exceeds the number of cores. note that
pre-empting a condender than becomes highly desirables.

System mutex are better when there are a lot more threads than CPUs, for the
above reason, especially if nFib is small and fLock is large. I would note also that
a spinlock using test-and-set with sequential consistency does impose a significant
burden on the processor cache management protocol. Acquire/release is probably
good enoug but there’s still an overhead synchronising caches.

Second: what can we do in the scope of a spinlock? Basic memory access
should be OK. It is much better if muitiple accesses hit the same cache line.
The worst case with distributed access is a pre-emption based on triggering
a virtual memory paging request. Then we have a spinlock waitiing for the disk!

Allocations are problematic. Standard Felix allocation is strictly forbidden because
the world-stop GC could be deadlocked if a lock holder triggers a world stop,l
then a contended will never notice and spin forever. For this purpose Felix must provide
a GC aware mutex which uses a timeout and spins around a check on the world-stop
flag (suspending as required if it is set).

Felix also provides a flag in Felix heap allocations which prevents a collection
being triggered. The scheduler and most of the RTL code must ensure this flag
is false, inhibiting GC, however Felix code currently has no way to set this flag.

In any case a Felix allocation invokes the GC allocatior which is protected
by a system mutex. It is not a good idea to nest locks. This is a serious difficulty
with locking. Systems usually provide a recursive mutex to supprt this, however
spinlocks cannot fo this directly (because there is no way to test of the flag
of a test-and-set is akready set except by actually setting it, in which case
we’d need to save the result so we can decide whether to clear the lock or not,
which is impossible without also protecting the result .. :-)

Even if we’re just calling malloc(), then, since malloc() is thread safe,
it may be setting a mutext itself. Modern mallocs should be using thread local
storage and spinlock protected shared free lists to minimise the need to actually
call the OS for more memory, but we can’t really know.

SO: we need to give the user a spinlock, but Felix cannot easily ensure it is
only used in a safe way, nor, if it used in an efficient way.

Third, locks work best if localised, meaning, associated closely with the
memory region accessed in the lock scope. Note this also means the
lock memory itself should be physically close to the memory being
protected if possible, preferably in the same cache line.

The simplest lock is a single global lock. In Felix this would be store
in the GC object which is universally accessible. Using it allows cross
thread coroutine migration and arbitrary memory access serialisation.
The chance of contention, however, is raised when there are many clients.
OTOH that also increases the chance the lock itself becomes pinning
in a loaded cache line.

Please recall, however, point one in this article. In a typical heavily concurrent
system, chance of contention is vanishingly small and chance of pre-emption
of a lock holder also very small. This is NOT true in ordinary code using
system threads, it is a key advantage of user space scheduling.

However, there’s a good case for a lock on the scheduler queues.
Normally there’s only one queue, however each spawned pthreaqd
creates a new queue. Helper threads share the parent’s queue.
Such locks will be more efficient, but only really useful if there
are a LOT of pthreads spawned. They’re also dangerous, because
they can’t protect inter-pthread communication.

Its also appropriate to provide a user lock constrictor so only
those routine actually sharing some memory share the lock,
minimising the chance of a useless serialisation a global lock
would lead to, when serialisaing unrelated accesses.

Its also worth considering, that with only a small number of lock
users, we can use a heavier duty lock, and protect allocations
and other things, making these locks conditionally safer and
more capable, provcided they’re correctly shared and used.
A good way to share locks would be to actually send them
on a synchronous channel!

Thus, there are quite a few different locking options, depending on:

(a) what the locks protect

(b) how the locks are shared

The locks are: spinlock, hard system mutex, spinning mutex

A spinning mutex is a system mutex with a time out, which regularly
checks for the world stop flag in the spin loop.

A more efficient but heavier spinning lock uses a condition variable
with a time out instead. Thia kind of lock has to register itself with
the GC, so itr can be signalled to bump it out of the wait state.
Note that the lock must be released to suspend on the GC, and
reaquired before spinning back to the conditional wait on the
condition, which is actually a boolean flag representing the actual lock.
[So we have a lock protecting access to a lock .. :]

In summary: there are some tricky considerations choosing which locks
to give the Felix programmer.

WORSE: we have to pick some syntax. It is tempting to do:

atomic { … };

or even just

atomic statement;

This would be a global spinlock. UNSAFE because the controlled code MUST
NOT ALLOCATE with the Felix GC, and its hard for the programmer to know
when that is. Variants, for example, do heap allocations.

however, since the lock is on the GC, we could use the lock state to turn
off GC triggering on allocation. The catch is, C++ doesn’t allow the state
of a test-and-set flag to be examined. Only two operations are supported,
test-and-set and clear. The flag CAN be examined by testing it but that
also sets it. If the lock was clear, we just set it, so we can now release
it and do the allocation with triggering, otherwise it was set, and
we prevent the triggering. Unfortunately this is not sound because
after deciding that its safe to allow triggering another thread can
acquire the lock, and then another thread contend for it, meanwhile
our allocation trigger a world-stop.






John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 10, 2019, 1:08:45 PM1/10/19
to felix google
Here is a Felix mutex: its VERY COOL!

chip mutex
connector io
pin inp: %<unit
pin out: %>unit
{
var u = ();
while true do write (io.out, u); read (io.inp, &u); done
}

To acquire the mutex, simply read the other end of the output channel,
do some work, then release the mutex by writing to the input channel.

Felix will automatically suspend your routine until it acquires the lock.
Once acquired, the mutex will wait until you release it until granting
it to another routine. Its trivially simple proof channel I/O is the supreme
control construction. Everything can be implemented with channels.

Note that this lock is extremely efficient!
What’s more there are no issues protecting things like allocations.

Note: a biridirectional channel would be better! However the chip DSL
doesn’t support this I dont think.




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 11, 2019, 6:54:00 PM1/11/19
to felix google
At present, Felix can easily wrap any C or C++ code in a library and make it available.

However there’s no way to construct a *derived class* from a library defined C++ class.
This needs to be fixed.

Apart from syntax issues, the problem is to allow a Felix function like thing to act
as an override of a virtual methods in a base class. There are secondary problems
like accessing protected members of the base. Remember accessing public methods
of the base is NOT a semantic problem, that’s already possible:

type base = “base*”;
fun get: base -> int = “$1->get()”;

No problem. Now for a derived class defined in Felix we need to just generate
a suitable name and derivation. We need syntax to define the method interfaces,
that isn’t too hard either. But how do we get *FELIX* functions and procedures
to work as methods of the derived class?

This means, specifically, that a C++ programmer can actually use the derived
class in C++ without really knowing that they’re calling a Felix function.

Lets look at functions. For the general case it’s pretty easy. In the constructor,
we accept a pointer to the Felix function. In the derived class we make a
non-static member to hold this pointer. The constructor puts the Felix function
in that slot. Then we generate a wrapper like:

int get() {
return felix_get->clone()->apply(argument);
}

To make this work correctly with the GC, we also need to make a shape for
the derived class in which the offset of pointer to the function is put into
the shape so the GC can trace it. This also means if the user heap
allocates the derived class, they have to use the *Felix* operator new,
and supply this shape pointer and a pointer to the GC. The C++ programmer
must also NOT delete the object since the GC is managing it.

With suitable syntax to tell the compiler what to do, I think so far
this is doable. Just messy.

Procedurres are a bit harder. Ordinary ones can be handled the
way the Felix compiler itself handles them, with a dummy driver
loop. However a *coroutine* is harder. To make that work, we have
to instantiate a real scheduler. That should work because async schedulers
are “first class subroutines”.

Its at least a month’s work.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 12, 2019, 1:10:28 AM1/12/19
to felix google
just deleted and refetched Felix .. and lost all the diagrams ;( IDIOT!
didn’t push the commits did I … ;(


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 15, 2019, 8:15:57 AM1/15/19
to felix google
Aha, a major breakthrough!

Leaving out exponential types, we make types with primitives, sums +, and products *
plus recursion. All such types are regular expressions. And therefore they ALL
describe heterogenous arrays, that is, sequences.

It may seem weird that a binary tree is an array but it is. If all three operators
(including recursion) are expanded in place you literally get a heterogenous array.
Just thing recursive descent.

Now clearly, this kind of type system, used in functional programming,
is extremely weak. in fact its ridiculously weak. We shall immediately
generalise it to a context free grammar with enough constraints for
sanity and the ability to parse stings of that language.

All types are STILL arrays. I have on idea why using a context free grammar
hasn’t caught on. Notice context *dependent* grammars are in use, its
called dependent typing.

Now here’s the breakthrough. In Felix coroutines sends stuff up and down
channels. There’s no reason you cant send an int, and then a float,
and then a string, and then go back and repeat.

And the type is OBVIOUSLY

(int * float * string) ^*

where ^* means Kleene closure. Whcih is nothing more than tail recursion.

Now here’s the actual breakthrough. TWO families of operators

+,++
*,**
rec, trec

The left operator is a SPATIAL DATA TYPE CONSTRUCTOR.
The rigtht operator is a TEMPORAL DATA TYPE CONSTRUCTOR

All types are sequences. However now a type can use EITHER or BOTH
of spatial and temporal type constructors.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 15, 2019, 8:36:44 AM1/15/19
to felix google
I have had a lot of trouble getting flx to batch compile C++.
So after reverts, resets, and other stuff, i have decided to rewrite parts of “flx”
so I can actually understand what its doing :-)

The first step is command line parsing. The rules are more of a mess than I’d like.

~/felix>FLX_NEWPARSER=1 flx --debug -c --static -Ix -I y -O1 fred.cpp hello.flx pppp hhh
New parser, options =list((--debug, ), (-c, ), (--static, ), (--I, x), (--I, y), (-O1, ))
files =list('fred.cpp')
primary =hello.flx
args =list(‘pppp', 'hhh')


So basically we have options, filenames, a *primary* filename, and then program
arguments. The primary filename, if found, ends the Felix command and the rest of
the words of the command line are arguments to be passed to the generated executable
when it runs (if the command runs an executable).

You can also use — to make the end of the arguments to flx and the start of
arguments to the program.

The standard form of an option is —key or —key=value. However there
are some short forms which use a single prefix - character. These are
a mess.

Short forms like -c have no arguments. Short forms like -o, -od, -ox have a followup
argument. The is the next word. Some short forms like -I, -L, and -l can either have
followup arguments OR hardup arguments:

-Ifred
-I fred

are equivalent. Some short forms only allow hardup arguments:

-fcompileroption

The parsing of long options is application independent but unfortunately
short forms can’t be parsed with knowing the application.

Some applications also allow combined short forms, for example

tar -zxvf tarball.tgz

This is really ugly! Its hard to write a general parse rule for that.
Felix “flx” currenly doesn’t allow combined short forms.

In principle the parser must convert short forms to long ones, handling
the hardup/followup issues, so the parser output is more regular.

Primary filenames are a filename without an extension, or the extensions
.exe, .flx, .fdoc. These are special because they’re counted but the
rest of the words are program arguments.

Exactly which options, primaries, etc are allowed/supported isn’t entirely clear.
You might say, if -c++ is specified, it implies the last *.cpp file is the primary.
Or you might say there is none.

The idea is to process the result of the parse further, extracting more precise
representations. Felix currently stuff all the data into a single control record
which drives the processing but its a mess and it doesn’t work so well
when you have a regex driven batch.






John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 15, 2019, 7:42:41 PM1/15/19
to felix google
Let me deal with spatial types first. Given a grammar, organised in this fashion:

nt1 =
| sym1-1-1 sym1-1-2 …
| sym1-2-1 sym-1-2-2 …

| epsilon



where the epsilon term is optional, we specify each non-terminal is a TYPE,
and each production is assigned a TYPE CONSTRUCTOR whose arguments
are the nonterminals in the production. We can then rewrite the grammar using
the type constructor as the first terminal, followed by the nonterminals.
This stage2 grammar is the optimised representation of the first one.

Any particular string generated by the grammar is layed out in memory
by simply writing the constructor term in the first memory slot, followed by
the arguments. The result is universally a heterogenous array.

The original grammar can be layed out in memory too, however interpreting it
requires more work to be done at run time. Because there are alternatives,
there is always a run time component, however the optimised version is
minimal, requiring only a switch.

The optimised layout is *identical* to the universal representation of variants:
a tag specifying the constructor, followed by the data. The layout specified is
pointer free. However, one can always throw in pointers at particular
places determined by a model. Typically in an FPL that uses boxing that
is *everywhere*. For other models, it might be that you only use a pointer
at a recursion point but not for primitives or products.

We will come back to this, but suffice it to say if the data is considered immutable
using pointers instead of the actual objects, provided it is done in conformance
to a computable model, leads to an isomorphic representation which permits
less memory use by sharing, at the expense of pointer chasing at run time,
and of course the immutability constraint. The advantage of the heterogenous
array is that copying is fast and the model permits mutation.

I want to again reiterate that the stage 1 and stage 2 representations are logically
equivalent. Its just that if you use the actual terminals in a prodiuction, instead of
a synthesised terminal representing the type constructor, you have to actually
do some heavier parsing at run time.

I weant to summarise: the types we have here are a MAJOR extension over traditional
inductive types because now we allow a context free grammar to specifiy the types,
not merely a regular expression. Traditional types *contrary* to popular belief,
do NOT support generat recursion. Mu binders ONLY support tail recursion:
the mu variable is ALWAYS a leaf. In other words, primitive recursion.

Grammars allow general recursion. Not that in the Chomsky heirarchy,
context *dependent* grammars correspond to dependent typing.
Its interesting that we have absurdly weak regular grammars of conventional
types systems with dependent typing thrown in. This is wrong. The correct
method is to go to context free grammars first, THEN add dependent typing.

The second MAJOR breakthrough is to consider the cofunctional programming.
A coffunction is a process which runs in an infinite loop, reading data from
an input channel and writing to an output channel. Cofunctions are strictly stronger
for a given type than functions because they can preserve state whereas functions
cannot. Pipelines of cofunctions are isomorphic to function chains after lifting
to the state monad because the state monad adds the extra state.

Now the point is, instead of thinking about an heterogenous array stored in memory
in a single object, at consecutive addresses spatially, we now think of sending
the same data down a channel. The SAME type that describes the memory layout
spatially ALSO describes the sequence of values coming down the channel,
and so the same machinery of disptaching off discriminant tags can be used
to process it, just control inverted.

In other words, spatial and temporal types use THE SAME types as each other.

But now comes the fun. Consider a type system with BOTH spatial and temporal
type constructors. In particular, for any given type, consider the spatial representation,
and consider that instead of sending each memory slow down a channel in order,
whjich is the corresponding temporal type, consider instead CHUNKING.

In other words we can send whole sub-bjects down the channel.

The data types are distinct but isomorphic. In particular the it is a software engineering
choice given a type how to rewrite it with a MIX of spatial and temporal constructors.
Obviously that choice determines how the code processing the data is written.





John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 19, 2019, 10:52:19 AM1/19/19
to felix google
I am changing the type of

new x

from

&T

where T is the type of x, to

uniq (&T)

Its not 100% clear this is right, because there’s a confusion between deep and shallow types.
But I’m doing it anyhow :-)

Some other expressions like array_alloc should change as well.

Now, this will break stuff:

var x = 1; // x is an int
var px = &x; // px is a &int
px = new x; // WOOPS

The problem is, px has type &int and new x now has type uniq (&int). The latter is a subtype
of the former, and passing an argument or type A to a function or procedure with a parameter
type P is allowed if A is a subtype of P, but not for an assignment.

So I’m changing the rules to allow assigning a subtype of a variable type to that variable.
It works the same way, by coerciing the initialiser. I will have to change initialistion too:

var x : P = e; // where typeof e is A

Assignment of a polymorphic variant subtype was already allowed.

===========

The deep vs shallow issue is really distubring. So too is the failure of the unification
algorithm to generalise as I expected:

| BTYP_ptr (`W,t1,[]), BTYP_ptr (`W,BTYP_uniq t2,[])
| BTYP_ptr (`W,t1,[]), BTYP_ptr (`RW,BTYP_uniq t2,[])
(*
| BTYP_rref t1, BTYP_rref (BTYP_uniq t2)
| BTYP_rref t1, BTYP_pointer (BTYP_uniq t2)
*)

The first pair of rules is essential for storeat operation to work as expected.
Without these rules, storeat operation with a non-uniq argument cannot be pass
a uniq pointer for the purpose of storing a value there.

What’s alarming is that the second pair of rules causes an ambiguity so they’re comment out.

More generally, uniquness is clearly a property of a location, not a value.
So a uniq type doesn’t really make sense. That is, we want uniq pointers ONLY.
All “values” in the sense of “rvalues” in C/C++ are obviously unique.

But there’s worse to come with deep copy. With a unique writable pointer to a list node,
we can modify the node in a function, but does permission extend to the tail of the list?
[It doesn’t] Yet it does with a list described with a variant where the pointer is implicit.
In that case we own the whole list.

An example of this: List::rev_map is tail recursive. List::map is defined by calling
rev_map and then reversing the list *in place*. Its safe to do this because I know
rev_map returns a mapped copy, that is, I called so I own the resulting list.
Note this would NOT be true if map is optimised away if given the identity function!

So map and rev_map both *should* return uniq lists. In particular there should
be TWO versions of “rev”: te one that gets a uniq list can reverse it in place,
the one that doesn’t have to make a new reversed list.

Thats all find because the pointer to the next node in a functional list is hidden.

The question is whether a C style list of struct nodes with next pointers could
work the same way, that is, would the type system do the right thing.
I think it would correctly only allow the head node to be modified UNLESS
all the pointers in the list were uniq .. in which case sharing would be impossible.
If the pointers were all read-only then sharing would be possible.
Both cases would be referentially transparent, that is the type system would
ensure transparency. In the case of RW pointers we have sharing,
but we lose transparency because any client can modify any part
of the list that can be reached: this is NOT an error automatically, you may
want to modify a shared object .. but it should only be allowed in a
procedure (not a function) because functions are supposed to provide transparency.

In general I have trouble with

uniq (uniq T)

i.e. if you say

box (box (&x))

or now

box (new 1)

If uniq is to be combinatorial this has to be allowed.

My feeling is that uniq T is wrong. In the literature, we start with substructural
logic, and ADD sharing. But in Felix, we have sharing and add unique.
This means we have to ask about the propagation of uniq (deep/shallow thing).
Instead of asking about propagation of sharing.

“Something is rotten in the state of Denmark”



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 20, 2019, 7:08:34 AM1/20/19
to felix google


> On 20 Jan 2019, at 02:52, John Skaller2 <ska...@internode.on.net> wrote:
>
> I am changing the type of
>
> new x
>
> from
>
> &T
>
> where T is the type of x, to
>
> uniq (&T)

SO I recently committed a bogus version of all this. Finally I got it right.
I know its right now because I get errors:

Here’s the first one:

////////////////
Once error: Using uninitialised or already used once variable
(70618:->node)
Variable node defined at
/Users/skaller/felix/src/packages/lists.fdoc: line 1092, cols 5 to 66
1091: var oldlast = dl*.last;
1092: var node = new (data=v, next=nullptr[dnode_t], prev=oldlast);
**************************************************************
1093: dl.last <- Ptr node;


In instruction (_urv70558<70558>varname :>> Wptr(cptr,[])) <- (Ptr<70357>struct (node<70355>varname :>> RWptr(record(data:(int),next:(cptr),prev:(cptr)),[])));
Detected at/Users/skaller/felix/src/packages/pointers.fdoc: line 32, cols 39 to 54
31: // ordinary pointers
32: proc storeat[T] ( p: &>T, v: T) = { _storeat (p,v); }
****************
33:
////////////////

Of course it worked before, because the type wasn’t uniq. I can obviously fix it by

var node = unbox (new ….);

Of course without the change I could have written

var uniquething = box (new thing);

Another option is to bring back once variables:

var x = new thing; // discards uniq
once x = new thing; // requires and keeps uniq

Another error:

/////////////////////
[flx_bind/flx_lookup.ml:4856: E207] [bind_expression'] Dereference non pointer, type uniq((RWptr(Y,[])))
In /Users/skaller/felix/build/release/test/regress/rt/serialise_01.flx: line 77, cols 10 to 16
76: var c = new Y ( X(1,2), "hello world");
77: println$ c*.str;
*******
/////////////////////

I dont understand this one because THIS code works:
//////////////////////////
var x = 42;
var px : uniq &int = new x;
println$ (*px).str;;
////////////////////////

but this fails:

println$ px*.str;

This is supposed to be a shortcut to (*px).str, i.e. “just syntax”.

So “new” isnt used that much. It was a test case for more generally
making everything uniq that actually is. The difficulty that has surfaced is that
it propagates to variables which are shared.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 21, 2019, 8:03:24 AM1/21/19
to felix google
I have fixed

px*.str

to mean precisely the same as

(*px).str

I also added

px->str

with exactly the same meaning. This is the same as the familiar C syntax.
Couldn’t do this before because this operatorf is left associative, but the
function arrow

A-> B-> C

is right associative. The expression arrow has “factor” precedence, the same as
operator ., *. and &. do.

Please note

px -> str

means the same as

str (*px)

also, because operator . is just reverse application.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 21, 2019, 5:12:54 PM1/21/19
to felix google
So I am gradually working on actually using uniq types for performance,
and to see how they actually impact code. In this example:

//////////////////////////
var x = box ([1,2,3]);
var y = box ([4,5,6]);
var z = rev (splice (x,y));
println$ z;
///////////////////////

First, I observe that the list constructors *should* return uniq types but don’t.
So I have to uniq-ify them with the box operator.

The next line is the important one. I *safely* splice these lists together and
*safely* reverse them in place. Perhaps I should use a different function
name that rev.

Rev returns a uniq list, always. If you call it with a uniq list argument,
the reverse is done in place. Otherwise the list is reversed with a simple
tail recursive algorithm.

This change broke some code:

In /Users/skaller/felix/src/packages/grammars.fdoc: line 221, cols 24 to 52
220: | Empty => out = (head,(#`Epsilon :>> prod_t)) ! out;
221: | _ => out = (head,(`Seq (rev newseq) :>> prod_t)) ! out;
*****************************
222: endmatch;


The polymorphic variant constructor `Seq taking a uniq list argument cannot
be coerced to a type containing a `Seq constructor taking a shared list argument.

This is a bit of a can of worms. I can obviously make it work by unboxing
the result. Variants are tagged pointers and can be copied about freely,
so they shouldn’t ever have uniq arguments. However the arguments are
expected to be immutable.

Hmmm.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 22, 2019, 5:56:13 AM1/22/19
to felix google
So .. I’ve converted List rev and map to return uniq. I pretty much has to unbox
every map. Annoying.

Also, variant constructors accepting a list don’t accept a uniq. I think that has
to be fixed. However for polymorphic variants its not clear.

Constructors are generally considered purely functional and have immutable
arguments, so probably the “uniq” should just be thrown out, even for
polymorphic variants. In Felix variants are typically represented by a tagged pointer
and the pointer is to a freshly heap allocated copy of the argument .. in other
words they’re already unique (just not “uniq”ly typed).




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 22, 2019, 5:59:15 AM1/22/19
to felix google
Note variant constructors are specifically injection functions.
And thus as functions a parameter of type T should accept
an argument of type uniq T.

First class injections probably already do (i.e. closures over constructors).


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 22, 2019, 7:24:31 AM1/22/19
to felix google

>
> Note variant constructors are specifically injection functions.
> And thus as functions a parameter of type T should accept
> an argument of type uniq T.
>
> First class injections probably already do (i.e. closures over constructors).
>

Ah,, I have mis-diagnosed the problem. Variant constructors DO accept
uniq arguments when defined with non-unique ones.

What doesn’t work is pattern matching:

This works:

/////////////
match unbox (rev ([5,6,7])) with
| Cons (t,h) => println$ "h=" + h.str + ", t=" + t.str;
| #Empty => println$ "Empty";
endmatch;
////////////

Ok so try this:

////////////
match (rev ([5,6,7])) with
| Cons (t,h) => println$ "h=" + h.str + ", t=" + t.str;
| #Empty => println$ "Empty";
endmatch;
///////////

the Cons actually works. I can prove that by:

///////////
match (rev ([5,6,7])) with
| Cons (t,h) => println$ "h=" + h.str + ", t=" + t.str;
| _ => println$ “Empty";
endmatch;
///////////

That works. But this doesn’t:

/////////
match (rev ([5,6,7])) with
| Snoc (t,h) => println$ "h=" + h.str + ", t=" + t.str;
| #Empty => println$ "Empty";
endmatch;
/////////

Flx_lookup:inner_bind_expression: SimpleNameNotFound binding expression (andthen (match_ctor Snoc(ul_mv_68619<68619>), _lam_68622 of (unit)))
Error at:
/Users/skaller/felix/ul.flx: line 26, cols 3 to 13
25: match (rev ([5,6,7])) with
26: | Snoc (t,h) => println$ "h=" + h.str + ", t=" + t.str;
***********
27: | #Empty => println$ "Empty";

[Flx_lookup.bind_expression] Inner bind expression failed binding (andthen (match_ctor Snoc(ul_mv_68619<68619>), _lam_68622 of (unit)))
SIMPLE NAME _match_ctor_Snoc NOT FOUND ERROR
In /Users/skaller/felix/ul.flx: line 26, cols 3 to 7
25: match (rev ([5,6,7])) with
26: | Snoc (t,h) => println$ "h=" + h.str + ", t=" + t.str;
*****
27: | #Empty => println$ "Empty";


So what’s going on? This is what’s going on:

variant list[T] = | Empty | Snoc of list[T] * T;
fun _match_ctor_Cons[T] : list[T] -> bool = "!!$1";
inline fun _ctor_arg_Cons[T]: list[T] -> T * list[T] =
"reinterpret<#0>(flx::list::snoc2cons<?1>($1))"
requires snoc2cons_h
;


You see, a pattern match says:

(1) try to see if the pattern matches first,
(2) if so, extract the pattern variables

If the user supplies a function

_match_ctor_CTOR : varianttype -> bool

then CTOR can be matched. If the match variable has type

uniq varianttype

is works as well because a function accepting varianttype also
accepts any subtype, including uniq varianttype. The argument extraction
is done by _ctor_arg_CTOR: variant_type -> argtype. So again
the user function should work.

This should work also for the *original* constructors but doesn’t seem to.
So Snoc and Empty fail, but Cons works because its a user defined pattern match.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 25, 2019, 10:43:30 PM1/25/19
to felix google
I am tired of trying to get Appveyor to build Felix. The stupdi system is too slow
for a standard build, forcing all sort of rubbish like saving the built version
on Github .. something contrary to the purpose of Github ..just to save time
on the build. Similarly, there’s a win64 binary version of Ocaml on GitHub.

In general, CI servers are all a load of rubbish. YAML is a heap of drivel.
Ci servers need

(a) a real control language
(b) transient storage so they can work like a real build system
(c) much better interactive interfaces
(d) inter-project synchronisation

Felix is a moderately large, moderately complicated system.
CI servers as they are at the moment are good for builing small
libraries written in C and not much else.

Felix builds on my core i7 Windows box in about 30 minutes.
It times out on Appveyor. Felix builds on my Mac in about 20 minutes.
It times out on Travis sometimes.

Appveyor doesn’t bomb a project if the log gets too long but
Travis does. This is ridiculously moronic crap.

~/felix>ls -R build/release/* | wc
5926 5624 94821

The log limit on Travis is 1000 lines. Felix built, as shown,
consists of around 6000 files.

In principle Felix can be built in pieces, but there are dependencies
and CI servers have no sane way to manage or refer to them.
Instead they use the brain dead idea of just building everything
from scratch every time.

As it is, I want to add a LOT more tests to the build.
And I also want to start adding performance tests which necessarily
must take enough time to get reasonable meaures of the time take:
IMHO about 1 second is necessary. I’m adding uniqness support to
some existing libraries, starting with list (on top of the test case
libraries ustring and ucstring) and it’s necessary to check the
performance upgrades this enables are actually happening.

There are also a significant number of optimisations in the compiler
missing or commented out. Again, there’s not much point to an optimisation
if you cannot check it is working. In some cases this requires historical
data to at least prevent performance regressions. This is hard to organise.
Tests change. New ones are added. The right way to do this is to
run the current tests on OLD versions of Felix as well as the current
one to show the improvements (if any :)

Uniqueness typing should allow Felix to clobber both Ocaml and Haskell
on a significant class of operations. Haskell is going to get linear typing
at some point.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 26, 2019, 10:17:51 AM1/26/19
to felix google
Its tricky to dream these up. I hope the answer is correct:

/////////////////////////////
// List performance test
var y = ([1,2,3,4]);
for (var j=0; j<14; ++j;) perform y = y + y;
println$ "Length " + y.len.str;
collect();

begin
var x = y;
var start = time();
for (var i = 1; i < 16 ; ++i;) perform
x = rev x;
var elapsed = time() - start;
println$ "Len = " + x.len.str + ", time = " + elapsed.str;
collect();
end
begin
var x = box y;
var start = time();
for (var i = 1; i < 16 ; ++i;) perform
x = rev x;
var elapsed = time() - start;
println$ "Len = " + x.len.str + ", time = " + elapsed.str;
collect();
end
///////////////////////

So we make a list of 64K elements, then reverse it 16 times, and call the GC.
Each rev should copy the list.

Then we make the list unique, and repeat.
It should reverse the list in place.

Here are the numbers:

~/felix>flx lst
Length 65536
Len = 65536, time = 3.91266
Len = 65536, time = 0.0106959


Here’s the reason its 400 times faster:

~/felix>FLX_REPORT_COLLECTIONS=1 flx lst
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
Length 65536
[Gc::collect] Program requests collection
[flx_gc:gc_profile_t] actually_collect
actually collected 262136 objects, still allocated: 2 roots, 65539 objects, 1048808 bytes
[Gc::collect] Collector collected 262136 objects

^^^^^^^^^^^ after initial list construction

vvvvvvvvvv start first test timer now

[flx_gc:gc_profile_t] Threshhold 10000000 would be exceeded, collecting
[flx_gc:gc_profile_t] actually_collect
actually collected 345258 objects, still allocated: 2 roots, 139868 objects, 2238096 bytes
[flx_gc:gc_profile_t] New Threshhold 10000000
[flx_gc:gc_profile_t] Threshhold 10000000 would be exceeded, collecting
[flx_gc:gc_profile_t] actually_collect
actually collected 308095 objects, still allocated: 2 roots, 195612 objects, 3130000 bytes
[flx_gc:gc_profile_t] New Threshhold 10000000
[flx_gc:gc_profile_t] Threshhold 10000000 would be exceeded, collecting
[flx_gc:gc_profile_t] actually_collect
actually collected 345760 objects, still allocated: 2 roots, 171883 objects, 2750336 bytes
[flx_gc:gc_profile_t] New Threshhold 10000000
[flx_gc:gc_profile_t] Threshhold 10000000 would be exceeded, collecting
[flx_gc:gc_profile_t] actually_collect
actually collected 357624 objects, still allocated: 2 roots, 154087 objects, 2465600 bytes
[flx_gc:gc_profile_t] New Threshhold 10000000
[flx_gc:gc_profile_t] Threshhold 10000000 would be exceeded, collecting
[flx_gc:gc_profile_t] actually_collect
actually collected 366522 objects, still allocated: 2 roots, 140740 objects, 2252048 bytes
[flx_gc:gc_profile_t] New Threshhold 10000000
[flx_gc:gc_profile_t] Threshhold 10000000 would be exceeded, collecting
[flx_gc:gc_profile_t] actually_collect
actually collected 307658 objects, still allocated: 2 roots, 196267 objects, 3140488 bytes
[flx_gc:gc_profile_t] New Threshhold 10000000
[flx_gc:gc_profile_t] Threshhold 10000000 would be exceeded, collecting
[flx_gc:gc_profile_t] actually_collect
actually collected 345432 objects, still allocated: 2 roots, 172374 objects, 2758192 bytes
[flx_gc:gc_profile_t] New Threshhold 10000000
[flx_gc:gc_profile_t] Threshhold 10000000 would be exceeded, collecting
[flx_gc:gc_profile_t] actually_collect
actually collected 348674 objects, still allocated: 2 roots, 163159 objects, 2610752 bytes
[flx_gc:gc_profile_t] New Threshhold 10000000
Len = 65536, time = 4.0285

^^^^^^^^^^^^^^^ first test finished

vvvvvvvvvvvv final collection
[Gc::collect] Program requests collection
[flx_gc:gc_profile_t] actually_collect
actually collected 289631 objects, still allocated: 2 roots, 131077 objects, 2097432 bytes
[Gc::collect] Collector collected 289631 objects
Len = 65536, time = 0.0105109

^^^^^^^^^^^ second test finished

vvvvvvvvvvv final collection

[Gc::collect] Program requests collection
[flx_gc:gc_profile_t] actually_collect
actually collected 131071 objects, still allocated: 2 roots, 131078 objects, 2097456 bytes
[Gc::collect] Collector collected 131071 objects
Deleting collector total time = 4.89595 seconds, gc time = 2.29151 = 46.80%

Now again with more memory allocated:

~/felix>FLX_REPORT_COLLECTIONS=1 FLX_MIN_MEM=1000 flx lst
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
[FLX_REPORT_COLLECTIONS] Collection report enabled
[FLX_REPORT_GCSTATS] GC statistics report enabled
Length 65536
[Gc::collect] Program requests collection
[flx_gc:gc_profile_t] actually_collect
actually collected 262135 objects, still allocated: 2 roots, 65540 objects, 1048832 bytes
[Gc::collect] Collector collected 262135 objects
Len = 65536, time = 2.00956
[Gc::collect] Program requests collection
[flx_gc:gc_profile_t] actually_collect
actually collected 3014654 objects, still allocated: 2 roots, 131078 objects, 2097456 bytes
[Gc::collect] Collector collected 3014654 objects
Len = 65536, time = 0.0100911
[Gc::collect] Program requests collection
[flx_gc:gc_profile_t] actually_collect
actually collected 131072 objects, still allocated: 2 roots, 131078 objects, 2097456 bytes
[Gc::collect] Collector collected 131072 objects
Deleting collector total time = 4.31303 seconds, gc time = 1.88991 = 43.82%

in summary:

~/felix>FLX_MIN_MEM=1000 flx lst
Length 65536
Len = 65536, time = 1.99968
Len = 65536, time = 0.00941896

So, now the collection is not counted in the first test, but allocations are.
Its twice as fast. But the in place reversal does no allocations.
So now its only 200 times faster.

Notice the ONLY change in the code is that the candidate list x is made unique
in the second test.

I first tried to make a concatenation test. But its hard to do because uniq things can only be
used once. Interestingly my “copy()” routine does nothing for a uniq list, but copies a
non-unqiue one. I need a routine

dup: list -> uniq list * uniq list

SO you can do:

var x = box ([1,2,3,4]);
def x,var y = dup x;

Now we have two (distinct) uniq copies of the original x.
Hmmm.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Jan 27, 2019, 9:41:38 AM1/27/19
to felix google
Here’s an append test:

//////////////////
var elapsed1 : double;

begin
var x = Empty[int];
var y = ([1]);
var start = time();
for (var i = 1; i < 2000; ++i;) perform
x = x + y;
elapsed1 = time() - start;
println$ x.len.str;
eprintln$ "Standard append " + elapsed1.str;
collect();
end

var elapsed2 : double;
begin
var x = box Empty[int];
var start = time();
for (var i = 1; i < 2000 ; ++i;) do
var y = box ([1]);
x = x + y;
done
elapsed2 = time() - start;
eprintln$ "Unique append " + elapsed2.str;
println$ x.len.str;
collect();
end

println$ elapsed2 < elapsed1 / 4.0;
/////////////////


So, the standard operation has to reverse the list into another list,
reverse the tail list onto that, then reverse the result, discarding
the original list AND the reversed list. However the same tail can
be used for each iteration.

The unique version has to create a fresh tail every time, but then
it just splices the head list onto the tail.

So the uniq version should be a bit faster, since the tail is only one element.
I had to set the counter to only 2000. Here’s why:

Standard append 7.82298
Unique append 0.00738215

The unique version a THOUSAND times faster. With large initial memory:

Standard append 3.92569
Unique append 0.00534487

note neither test includes the terminal GC time.
I do a collection after the first test to give the second one a fair go.
Even with no collection in the first test we still have:

Deleting collector total time = 7.20084 seconds, gc time = 3.23785 = 44.96%

So roughly .. HALF the time is spent allocating memory and the other HALF
the time collecting it in the first test.

Uniq looks useful …

C++ move semantics should enable similar performance gains for suitable
data structures, but probably don’t becase the optimisations do not play
well with reference counting.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Feb 2, 2019, 9:48:21 PM2/2/19
to felix google
At the moment, Felix removes all axioms during monomorphisation.

For reductions: if all the symbol instances in the reduction have been
instantiated by monomorphisation, then the reduction is too,
and the instance retained.

For example consider:

reduce rev2 (x:list[string]): rev (rev x) => x;

This reduction rule is monomorphic, because there are no type
variables. However list and rev are polymorphic. The instantiation
which might be retained is:

reduce rev2_string (x:list_string):
rev_string (rev_string x)) => x

PROVIDED the two instantiations:

rev[string] -> rev_string
list[string] -> list_string

were performed during the main monomorphisation pass.
The condition can cause a loss, although not in this case.
In this case if either of the symbols rev_string, list_string
doesn’t exist in the monomorphised symbol table, the
LHS pattern of the reduction can never match any expression,
so throwing it out is good because it improves performance.

However more generally, if all the symbols in the LHS exist,
there can be a match, but if there is a non-existent symbol in
the RHS, the reduction cannot be applied anyhow. For example
in the stupid case:

x => rev (rev x)

if list_string exists, the LHS can match but the RHS can’t be generated
if rev_string doesn’t exist.

==========

The situation is very bad. All reductions are check prior to monomorphisation
but this is close to pointless. Most reduction patterns don’t occur in
hand written code: the programmer hand optimises them away.

Reduction rules are useful because optimisations like inlining
can create patterns that do match. Previously Felix did everything
polymorphically, but it was too hard to get right so I monomorphised
before optimisation which kills the utility of polymorphic reductions.

There’s a good reason NOT to generate all the polymorphic reductions
that might match. Consider list map fusion:

map f (map g x) => map (f \circ g) x

If we have list x with element type T, g codomain U,
and f codomain V, and we have 10 types of lists,
we need 10 * 10 * 10 instances to cover all possible
fusions. We can reduce that number by checking there
is a map from T to U or U to V: of course for the reduction
to actually be applied these have to be coupled.

Still I don’t know how to do that efficiently, it feels like
a 10*10 loop or worse to decide what to instantiate.

In theory, we instantiate polymorphic reductions over ALL
possible types but this makes matching impossibly slow,
and of course we’d probably run out of storage before we
even got there. That’s what matching with polymorphic
reduction is used, because the match is just unification.
The problem is the old non-type symbols are gone.

One way to fix this would be to keep an INVERTED instantiation map,
that is, one that goes from the new monormorphic symbol back to the
old polymorphic symbol. We can then take an expression and
*reverse* the monomorphisation mapping to get the original
polymorphic expression, and then match that against the reduction
LHS using unification.

The critical observation is that unification does not depend on the
symbol table. So the fact the old polymorphic symbol have been
deleted doesn’t matter for the purpose of matching!

Now if we find a match, we have to monorphise the RHS of the reduction
before replacing the expression. That can fail if it involves a symbol
instance that wasn’t already monomorphised. In that case,
we can simply retain the original expression.

So I think there is a way to get polymorphic reductions to actually work.

======================================

The other way to do this is to require the user to MANUALLY instantiate
the reduction. That is, introduce a statement:

instantiate revrev[string];

This is much simpler to do because it generates the instantiations during
monomorphisation in the usual way, and consequently if the LHS of
a reduction rule matches, the RHS terms will all exist. The only trick
here is to ensure LHS and RHS symbols are considered “used” when
garbage collecting the symbol table.

The problem with this method is that the programmer has to KNOW they
want certain monomorphic reductions applied, it won’t happen by “magic”.

On the other hand, this machinery has a major advantage: it strictly
limits the actual number of matches that need to be done.

This is important because

(a) matching is done on EVERY expression
(b) INCLUDING every subexpression of that expression
(c) for EVERY reduction LHS
(d) at MULTIPLE points in the optimisation process

Of course .. the user could just write a monomorphic reduction rule in the first place!



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Feb 16, 2019, 8:41:36 AM2/16/19
to felix google
Most of the Felix compiler could be written in Felix. But there’s one part i have been
struggling with: replacing Dypgen, the parser.

Bison actually can do GLR, although its mechanim is very weak. Unlike Dypgen,
which relies on user actions being purely functional, Bison doesn’t actually
execute the user actions on speculative branches, it saves them up until,
I think, there’s only one thread left. Not sure but its .. well rubbish.

The other problem with Bison is it expects to process tokens, and, because
it doesn’t actually execute speculative branches, the tokeniser has to be fixed.
Dypgen can use tokens but you don’t have to write a tokeniser, its scannerless.

I’ve been trying to figure out how that works. Dypgen takes a grammar and
produces a LALR1 kernel. It then uses that to parse. When it hits a conflict,
be it shift/reduce or reduce/reduce it splits processing into two logical threads,
one handling each alternative.

What had me baffled was this: because Dypgen is scannerless,
if one production being followed has say a token specified by regex1,
and the other by regex2, what does it do? GLR by specification advances
all threads simultaneously, in other words given a token, all the threads
consume it, before any proceed to the next token. In other words, they
proceed in lock-step. But how can they do that if regex1 consumes more
characters than regex2?

Dypgen only tries the regex of each branch in the current position,
not all of them.

I now know how it works I think :-)

At each point each alternative has a set of possible next tokens.
All of them are tried. But only the LONGEST MATCHES is kept.
Any branch using a shorter match is rejected, as if there
were NO matches.

How exactly does this work? Well I think what happens is this:
given a grammar, every regexp (or string literal) is assigned
a unique token number. We put each one in an array at that
index. So at any point, we have a set of integers for each alternative.
We apply all the corresponding regex, keeping only the ones
with the longest matches. We can then reject all the branches,
or, we can continue with one or more.

The set of tokens for each branch is called the FIRST set,
it is calculated during calculation of the LALR1 kernel.

All the alternatives that proceed consume the SAME string.
Even if they call it a distinct token. And that’s the secret
I was missing.

Trivial example:

number = float | integer
float = r"9+(.9+)"
integer = r”9+"
sum = float “+” float
sum = integer “+" integer

99 + 99.9

This has to be float + float. However both float and integer match
the first digits so both branches are taken, the second one being
rejected on seeing 99.9 because that can’t be an integer even
though it matches the first two digits, float is a longer match.

There are FOUR terminals here: float, int, and + and + again.
Each and every regex or string in the grammar is considered
a *distinct* terminal, even if its the same as another one.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Feb 22, 2019, 2:03:46 PM2/22/19
to felix google
Hmm .. just had an idea how to make labels and gotos safer.

The problem is that in Felix, LABEL is a first class type. In particular,
you can pass labels around and store them in variables.
And you can goto a label stored in a variable:

proc f() {
lab:>
var there = lab;
// blah
goto there; // indirect goto
}

You can pass a variable too:

proc g(there: LABEL) { .. goto there; }

Note that a label is NOT just a code address.
A label value ALSO includes the continuation object for which it
is taken, similarly to a function closure.

FYI: it took a while to figure out what I did. A continuation
consists of the this pointer to a procedure object, and that’s all.
However procedure objects are derived from C++ class con_t
which contains a PC field which is a code location (usually
an integer which the resume() method can switch(){} to).

A label is a jump address, which is a continuation plus a program
counter. Since the continuatiuon already contains a program counter,
a jump address just overrides it. So “in principle” label objects,
being jump addresses, and continuations, are the same: they’re
both continuations in the abstract. The distinction is artificial,
and a consequence of adding dynamic labels after already
implementing continuations. Sorry about that! [In other words
in the continuation component of a jump address the continuations
PC is ignored]

The point, however, is that the continuation pointer is garbage collected
which means it can never dangle. If you can construct a label value,
you can always SAFELY jump to it from anywhere at any time.

HOWEVER it if the continuation frame has already returned,
the return address will have been zeroed out by stack unwinding
on return and so if you jump to a label in a dead continuation,
it will run up to the return statement, and then terminate
the fibre. This is safe but it is also nonsense. :-)

Anyhow, the problem I want to address here is that LABEL values
can jump to ANY label.

Now in the old days in Fortran and C, you could pass ANY value
to a function. You could pass an integer where a float was required.
There was no type checking (due to separate compilation).
That got fixed by adding type checking.

LABEL values have the same problem so it should help to solve
it with the same idea: allow a type annotation.

proc f() {
type T,U;
lab[T]:>
var x = lab;
somehwere[U]:>

goto x;
}

we don’t know where x is, so we don’t know where the goto x goes.
But we know it CANNOT be somewhere, because x has label type
annotation T, and somewhere has annotation U.

Note that in Felix, type can *escape* abstractions. For example
in a class a type definition labelled as private only means the name
of the type, not the type itself, is private. A value of that type can still
be passed outside the class and recaptured with the typeof() operator.
Similarly, types local to functions and procedures can escape, even if
their names are hidden by functional abstraction.

Therefore, as specified, typed labels do not provide much enforcement.
They do not constrain the set of allowed jumps, rather they constrain
the set of disallowed jumps. In other words we’d like to have a finite
set of possible jump targets but instead we get a finite set of targets
we cannot be jumping to.

Hmmm.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Feb 24, 2019, 8:42:51 PM2/24/19
to felix google
I’m editing the reference manual and i’m really unhappy with loops.

The C style loop is good. Its fast and correct.

The iterator based loops are correct but really slow because iterators
tend to spawn new generator frames for some reason. I added some
optimisations for slices like 0..10 and 0..<9 but realised I don’t know
the right semantics for loops. The basic loops

for i in 0 upto 10

is a mess because it tries to handle both the maximum range (min to max)
and also the empty range and it isn’t clear what the exit value of i should be.

Ideally we would be able to write something like:

for i in 0 upto 10 ..
for i upto 20 …

and be sure that is equivalent to

for i in 0 upto 20


but its hard to see how to do that! The loops takes pains to ensure i cannot
go off the end, so i has the value of the final iteration. But then the followup
loop is hard to write (the one shown will fail because it would start with
i instead of i + 1).

Its not clear to me what actually happens if the range is empty.
In theory, i should not be changed in that case.

Note that the final value of i doesn’t matter for iterator loops because
the control variable is local to the loop. The basic loops, however,
do not define scopes.




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 3, 2019, 9:06:42 AM3/3/19
to felix google
I’ve been wanting to do this for a while (to relieve the boredom of doing documentation and
chasing bugs).

Consider a class and instance like

class Tord[T] { virtual fun <: T * T -> bool; }
instance Tord[int] { fun <: int * int -> bool = “$1<$2”; }
println$ Tord[int]::< (1,2); // true

The problem is, we might want to order ints in the other direction.
I propose this:

class Revint = instance Tord[int] {fun <: int * int -> int = “$1>$2”; }

This defines a (partial) specialisation, as does any instance, however
it gives it a name. The result is a class. Of course, the class can
be polymorphic:

class RevOrd[T] = instance Tord[T] {fun <:(x:T,y:T) => Tord[T]::<(y,x); }

Notice RevOrd’s < function is no longer virtual!

Basically, we take the original class and specialise the type variables,
and then specialise the some of the virtual methods too. Non-virtual
methods can’t be specialised.

The construction is basically a functor, where the parameters are the
type variables of the domain class, and its virtual method, and the arguments
are the same as for an instance, which replace the type variables and
virtual methods. Its exactly the same as an instance except instances
are found implicitly: for any types there is only one instance, which fixes
the virtuals to those in the instance definition.

The constuction above does the same operation as ordinary instantiation,
except that the result has a name.

In effect, the class Tord is turned into a what Ocaml calls a module functor.
Note virtual types get mapped too, but my brain hurts thinking about them.




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 7, 2019, 7:07:14 AM3/7/19
to felix google
I’ve been trying to document uniqueness types, and have had loads of problems.
Basically .. they don’t work. I mean, in Felix, as I have designed it.

The problem basically comes down to a polymorphic function

fun f[T] (x:T) => x,x;

and a unique value:

var y = box whatever;

and the application:

f y

now causes a fault, duplicating the uniq value y. The fault can be avoided with
a specialisation:

fun f[T[ (x: uniq T) => let z = unbox x in z,z;

In specific cases there’s no problem:

fun g(x:int) => x,x;
g (box 1) // fine!

because uniq int is a subtype of int. the problem is a parameter of type T
doesn’t mean “any non-unique type” but “any type”.

This issue doesn’t occur in C++ with rvalue references because roughly
rvalues are automatically unique. If you make a function in C++ with an
rvalue binder, you have to make one with an lvalue binder too (either
a ref, const ref, or pass by value or whatever). But if you make a
non-rvalue binder function, pass by value or const ref, it accepts rvalues too.
Even for templates, a T parameter bound to an rvalue does not become
a T&&.

The core problem is that (a) all values are always uniq anyhow.
Uniquness as a special thing ONLY applies to variables.
Why? Because variables are, pretty much, the ONLY way to make
something non-unique. Of course references and pointers
do too but its all the same bag really: storage location involved.

In Felix there’s a problem is you have a function returning a uniq thing
and you write:

var x = function () …

then unexpectedly x is unique. Whether it is unique or not changes the
meaning of the code.

We just can’t have this.



John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Mar 7, 2019, 7:18:12 AM3/7/19
to felix google
Hi John,

I have thought for a long time that rvalues and lvalues are confusing, but they serve a purpose in C/C++.

One simple way to resolve this is to make all variables immutable, and only allow mutation of "objects" that is things pointed to. This makes sense because to be mutable, and to have an identity (as in aliasing) something must have an address.

So if you want a mutable counter you would need to do:

var *x = 0
(*x)++

Using a 'C' like notation. You then have pointer to linear(unique) mutable, pointer to immutable and pointer to aliased mutable as different types.

You could reduce the restriction on variables from immutability to linearity, and this would still work I think.


Keean.





--
You received this message because you are subscribed to the Google Groups "Felix Language" group.
To unsubscribe from this group and stop receiving emails from it, send an email to felix-languag...@googlegroups.com.
To post to this group, send email to felix-l...@googlegroups.com.
Visit this group at https://groups.google.com/group/felix-language.
For more options, visit https://groups.google.com/d/optout.

John Skaller2

unread,
Mar 7, 2019, 10:33:49 AM3/7/19
to felix google


> On 7 Mar 2019, at 23:18, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Hi John,
>
> I have thought for a long time that rvalues and lvalues are confusing, but they serve a purpose in C/C++.
>
> One simple way to resolve this is to make all variables immutable, and only allow mutation of "objects" that is things pointed to. This makes sense because to be mutable, and to have an identity (as in aliasing) something must have an address.
>
> So if you want a mutable counter you would need to do:
>
> var *x = 0
> (*x)++

Actually this IS what Felix does logically, though not with that notation.

In Felix you store a value in a variable using the procedure

_storeat ( &x, v);

which is notated more conveniently by

&x <- v;

or by

x = v;

So the interpretation is that “x is an immutable address” but it is spelled “&x”.
When you write “x” in an expression you really mean “*&x”.


> Using a 'C' like notation. You then have pointer to linear(unique) mutable, pointer to immutable and pointer to aliased mutable as different types.

However Felix does not do that. If you write:

var x = box 42;

then x has the type uniq int, and &x has the type &(uniq int). That’s actually a read-write pointer.
The notation &<x has type &<int which is a read-only pointer (and &>x has type &>int which
is a write only pointer).

Now, this is NOT quite the same as what you said. But I think probably what you said
is correct.

I think the difference is that in Felix you can have a read, read-write, and write-only
pointer to the same variable, and the variable can be of type uniq T, or some other T.

If I understand correctly, Rust variables are immutable unless you say

“mut”

Lots I don’t like about Rust but the way it does unique typing does work.
I actually don’t want full enforcement because of the compilcations.
The idea is that uniqueness is just a device to *allow* the programmer to represent
the two sides of an exclusive ownership contract, that is, a contract where
a functions says “I want to own that location exclusively so i can modify it”
which is currently written

fun f (x: uniq list[int]) => …

for example a uniquely owned list, and the caller says

f (box x)

to specify that the list x is exclusively owned and the ownership is being transfered.

In C++ expressions are (usually) rvalues, and you can use ::std::move(x) to specify
a variable x is to be moved. However I don’t think T&& is actually a type in C++,
that is, if you call a template

template<class T>
void f(T x)

with an rvalue “int” than T is stil int. You have to write T &&x to capture ownership.

Anyhow the problem in Felix is that the “basic” language is like C++ or Algol in that
by default variables are addressable and the pointers can be shared and anyone with
a pointer can read or write. This is true at the moment EVEN IF the type is uniq.
Only the actual variables are tracked by the control flow analysis.

Actually I’m quite happy with how it works for monomorphic types.
in particular a unique value can be passed to a function accepting a non-unique
type, for example

fun f(x:int) => …
f (box 42)

works fine because uniq T is a subtype of T. The problem is with

fun f[T] (x:T) =>

sets T to “uniq int”.

Roughly I think Felix has the same bug as C++ “const”. In other words, it should
be impossible to decide T is uniq anything. The polymorphic function above
should specify that x is not unique. If you want uniq, you should write it:

fun f[T[ (x: uniq T) => …

But I can’t do that because “uniq” is a type combinator (constructor).

Your solution has to be correct: “uniq” should be a property of a pointer,
not the value pointed at. The same as “const”.


> You could reduce the restriction on variables from immutability to linearity, and this would still work I think.

The problem is how to “do it right” without breaking too much code. I changed operator new in Felix
to return uniq. Don’t use it much so no drama. But in principle with the current model ALL
constructors should return uniq.

Felix already has

val x = 1;

Note “val” not “var”. Vals are not immutable, they’re “single assignment”.
But its close enough:

for(i in 1..20) do
val x = 1; // OK, x is reset each iteration
….

At some stage I also had

once x = 1;

to mean x had to be used exactly once. But that doesn’t work.

Felix handles *components* of product types that are unique.
For example

var x = (42, box 42);

this works fine. The first component is non-unique, the second uniq.
Felix tracks the uniq second component by pretending its a variable.
But it can only do that *because* values can be unique.

Sub-components of “variable as address” are also addresses.
Felix does that by overloading projections so there are both
value projections (fetch component) and pointer projections.





John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Mar 7, 2019, 10:52:28 AM3/7/19
to felix google
Hi John,

I wonder, 'val' would be 'variable' and 'var' would be a pointer to a mutable 'object' with automatic dereferencing. Val's because they are immutable would never need to be unique (linear). If you make a 'var' unique it would be a property on the pointer because every access (read or write) is a dereference. 

The key point is that something mutable must have an address, so therefore it semantically behaves like a pointer to a mutable object, not the mutable object itself. We can see this if we think about object identity and passing by reference.

I guess this is going to break anything with type annotations, but you could provide an automatic type upgrade with a new language version release that rewrites all the current types into the new equivalent, as Apple do. I think this would leave the value level syntax alone?

Keean.



John Skaller2

unread,
Mar 7, 2019, 10:56:41 AM3/7/19
to felix google
Just to be clear: the *problem* with variables being pointers, so they can be

* pointer to movable objects
* pointer to shared objects
* pointer to immutable object

is products. If a variable is a tuple or record or struct type, that the whole product
has one of the above properties. Whereas with uniq type, its component by component.
The value typing splits the variable into sub-variables with distinct types.
This can’t work if the pointer type has the unique/shared/immutable property instead
of the value type.

Indeed, as I understand it, Rust has no products. It has functions with multiple parameters.
Each function has is also a specific kind of product functor. This is a disaster.
Products are fundamental.


John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Mar 7, 2019, 11:03:32 AM3/7/19
to felix google
I don't think that's the case, it depends on whether you have deep immutability or not.

Marking pointers is always shallow. So the uniqueness would only apply to the object directly pointed to.

As a product must contain 'objects' not values, then that means the pointers in the product are immutable (you cannot swap objects in and out of the product) but you can mutate them, unless that pointer in turn is immutable/unique etc.

With values there is no such thing as a pointer or a reference, a value must be exactly that, so a product of values is itself an immutable value.

In theory there is no reason why you can't have in-place immutable values in an object, and whether you can change those values depends on the mutability of the object containing them, but of course the values them selves are immutable.

I may not have explained very well, does that make sense?

Keean.



John Skaller2

unread,
Mar 7, 2019, 11:10:05 AM3/7/19
to felix google


> On 8 Mar 2019, at 02:52, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Hi John,
>
> I wonder, 'val' would be 'variable' and 'var' would be a pointer to a mutable 'object' with automatic dereferencing. Val's because they are immutable would never need to be unique (linear). If you make a 'var' unique it would be a property on the pointer because every access (read or write) is a dereference.
>
> The key point is that something mutable must have an address, so therefore it semantically behaves like a pointer to a mutable object, not the mutable object itself. We can see this if we think about object identity and passing by reference.
>
> I guess this is going to break anything with type annotations, but you could provide an automatic type upgrade with a new language version release that rewrites all the current types into the new equivalent, as Apple do. I think this would leave the value level syntax alone?


Well Felix itself, and me, are the only users. And I could always invent a new language.

I certainly agree with the idea mutable means addressable, at least underneath.
In Felix vals cannot be addressed. They can change, but they’re not really mutable.
Its more like, in a loop, the val is “reset* to a new value, rather than modified.

in fact Felix has a loop:

rfor i in ….

The ‘rfor” loop is actually a tail recursive procedure, so any vals in the loop body
really are immutable, that is, they’re like “let” variables in Ocaml. The “reset”
is actually assigning to a different address in a different frame because of the
recursion, unless the compiler self-tail-rec optimises the loop.

Sigh. Its actually an accident that I discovered vals are not immutable,
they were meant to be.

The key property of vals is lazy evaluation. That is, the initialiser of a val
is allowed to replace any use of the val. Whereas vars are guarranteed
to be eagerly evaluated, i.e. when control flows through them.

the problem in any programming language is most “things” have multiple
factors each with multiple options, but you cannot give the programmer
all the combinations or their brains would explode. So you have to pick
the most useful ones. So in Felix we have “val” and “var”.

There is also a “const” but it is a C binding .. to an expression which need not
be const. Again, it was meant to be const:

const x = “42”;
const y = “::std::time()”; // woops, not const at all :-)

“cexpr” would be a better name for it.




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 7, 2019, 11:25:29 AM3/7/19
to felix google


> On 8 Mar 2019, at 03:03, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> I don't think that's the case, it depends on whether you have deep immutability or not.
>
> Marking pointers is always shallow. So the uniqueness would only apply to the object directly pointed to.
>
> As a product must contain 'objects' not values, then that means the pointers in the product are immutable (you cannot swap objects in and out of the product) but you can mutate them, unless that pointer in turn is immutable/unique etc.


But products are values. For example

(10, 20)

is a coordinate on a screen. Its a single value.



>
> With values there is no such thing as a pointer or a reference, a value must be exactly that, so a product of values is itself an immutable value.

In Felix, it works really well: the type above is int * int. Its a value. Its immutable.
However if you put it in a variable its mutable:

var z = (10,20);
&z.1 = 21; // z is now 10,21


That’s because the projection 1 is overloaded so it applies to values AND pointers.

Now consider:

var z = (10, box 20);

Here the type of the value is

int * uniq int

If you do this:

var u = z;

its fine! And now you can do this:

var one = z.0;

That’s fine! But this is an ERROR:

var two = z.0;

and Felix catches it. The point is, its NOT the variable that’s uniq or not,
and that’s why the above works. you can do this:

&z.0 <- 99;

and that’s NOT an error. The component was dead because we moved its
value to “one”. So now you can assign it again.

Felix keeps track of the state of every component of every variable
but it ignores address taking because that’s too hard (except
for the LHS of &x <- v which it knows is an assignement).

The point is, values are “immutable” because they don’t HAVE an address.
But if you store a value into a variable OR if you construct a value
on thje heap, you get an address, and then you can (potentially) modify
them, unless the type system disallows it.

uniq is not handled by the type checker, its a completely separate control
flow analysis done long after type checking, overload resolution,
and monomorphsation, in fact it is done after inlining and optrimisations
as well (which means the error diagnostic are useless .. but that’s another issue :)




John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Mar 7, 2019, 11:28:16 AM3/7/19
to felix google
My simple system is this (you may notice it works pretty much like arrays)

Write an immutable value like this:

let x = 3

We write an object like this:

let x = [3]
print x[0]

You can mutate the contents (change the 3 for another value).

We write a 'value' product like this:

let x = (3, 4, [5])

So 3 and 4 are constant, you can swap the contents of the third element for a different value, but not change the identity of the object (the address pointer is not changeable)

You write an object product like this:

let x = [3, 4, (5, 6)]

So you can mutate 3 and 4, but not 5 or 6, but you can swap out the pair (5, 6).

You only need one declaration:

let x = 3
let y = [4]

All the information is carried in the type of the variable.

Only drawback is that you must de-reference an object to get the value:

print(x, y[0])

Perhaps there should be a short-handed for the first element of an object? 

My view is that if you get this right, it should be simple, but that's one of the hardest things in language design to get right. Lvalues/rvalues are a mess. The above is my best effort so far.

Keean.


Keean Schupke

unread,
Mar 7, 2019, 11:40:16 AM3/7/19
to felix google
Regarding:

>   &z.0  <- 99;

Surely this should be an error because 'z' is a value, swapping an object for the value 10 should be illegal as it is not boxed.

If we think of it like this:

A value is a sequence of bytes, like an int32 is 4 bytes. Let's say we have the product of two integers, that's a sequence of 8 bytes. If this is immutable is may not have an address, it could be in a register, there is any no way to address part of it and change its contents.

If we have a pointer to those 8 bytes that says they are immutable there should be no way to change any of those bytes.

If instead we have a product of two integers and an object, then on 64 bit system the object will be an 8byte pointer. The product would be 16 bytes and the whole thing would be immutable, you cannot change the pointer in the product to point to another object, but we can mutate that object (if this pointer is not also immutable).

I am not sure what Felix is doing differently that would allow the number 10 to be replaced in that product, does it store all values externally from the product?

Keean.


John Skaller2

unread,
Mar 7, 2019, 12:03:04 PM3/7/19
to felix google


> On 8 Mar 2019, at 03:28, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> My simple system is this (you may notice it works pretty much like arrays)

Yes, I like it, and its pretty similar to Felix actually.

>
> Write an immutable value like this:
>
> let x = 3
>
> We write an object like this:
>
> let x = [3]
> print x[0]
>


In Felix roughly you would use a pointer:

val x = 3;
val x = new 3;
print *x;

For the same 3 statements.



> You can mutate the contents (change the 3 for another value).
>
> We write a 'value' product like this:
>
> let x = (3, 4, [5])

val x = (3,4,new 5)

The type is int * int * &int


> Only drawback is that you must de-reference an object to get the value:
>
> print(x, y[0])
>
> Perhaps there should be a short-handed for the first element of an object?


In Felix, for the x above, to get the third component it would be

*(x.2)

or

x*.2

and to assign to it

x.2 <- 42

No short hand for first component. It looks like our solutions are isomorphic.
However Felix has vars as well. With a var ALL the components are assignable.
The translation is simple, so vars don’t add anything:

var v = 3,4, new 5;
val x = &v;

however none of this handles move/unique .. :-)

We have mutable and immutable and a way to get from immutable to mutable.
Also when you use a var in an expression .. it is now a value and immutable.


> My view is that if you get this right, it should be simple, but that's one of the hardest things in language design to get right. Lvalues/rvalues are a mess. The above is my best effort so far.

Yes, its nice, and AFAICS the same as Felix except for notation. The problem is how to extend it
to handle moves.

BTW: Barry Jay once said to me, that the way a language handles variables
*characterises* the language.

In Ocaml, all variables are immutable. but records can have mutable fields.
References ML style are derived from that:

type ‘a ref = { mutable contents: ‘a }

i.e. a reference is just a record containing a mutable field. However, Ocaml BOXES
(almost) all values, and they’re immutable. So “ref” is immutable, it is of course
a pointer due to boxing. Clever. In the core language, mutable fields are the only
departure from purely functional semantics.

The problem with “move” semantics is that languages don’t have the syntax which allows
the variable whose contents are moved out to “go out of scope”. In fact, its worse:
if you have no product functor, the only thing you can do with functions is compose
them linearly, and, all values are then unique and thus mutable. Its actually products
that allow “expressions” in the form of trees, and variables exist ONLY to allow
sharing. A variable that only gets used once isn’t necessary, just replace the one
use with its initialiser.

So making “uniq” variables is in some sense oxymoronic.




John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Mar 7, 2019, 12:19:45 PM3/7/19
to felix google
I am not sure I see the problem. If a unqiue pointer is stored in a product, you can only mutate the object as much as you like, but you can only copy/move the pointer itself once. The type system would have to enforce that nothing ever tries to copy/move the pointer again.

With a mutable product you could swap the pointer with a different pointer as an additional operation not possible with an immutable product.

Keean.



John Skaller2

unread,
Mar 7, 2019, 12:20:51 PM3/7/19
to felix google


> On 8 Mar 2019, at 03:40, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Regarding:
>
> > &z.0 <- 99;
>
> Surely this should be an error because 'z' is a value, swapping an object for the value 10 should be illegal as it is not boxed.


In the example, z is a var:


var z = …

This means, the name of the variable is spelled &z, and plain z actually means the variable
dereferenced.

The notation is hard to explain. The variable is an address. Its a sop to C programmers
that the spelling ”z” means what is stored at the address whilst “&z” actually means
the address itself.

The “&” there isn’t an operator. Its part of the name. Of course its called
“the addres of operator” but you can only use it with a variable.

I know its confusing. Your notation reflects the reality better but mine is closer
to conventions used in C and Algol.

In Ocaml, if x is a reference than

!x

is the contents, and

x := v

is how you assign to it.

> I am not sure what Felix is doing differently that would allow the number 10 to be replaced in that product, does it store all values externally from the product?

Felix does exactly the obvious thing. When you write:

var z = (10,20);

then its two 32 bit values on an x86_64 at consecutive addresses and both are mutable.
If you write:

val v = (10,20);

then v is just a name for the pair (10,20). There may or may not be addressable store.
Typically there will be store somewhere, but it could be both values are in registers.
v can act like a C macro.

If you say

new 42

then what you get is a pointer to heap allocated store containing 42. So there
are TWO ways to get mutable store: using a named variable or using the heap.
Actually there are other ways, using library functions, but “new” is a system intrinsic.

Hope that makes sense. There’s no magic. No lvalues, or, perhaps more
precisely only variables are lvalues, there are no lvalue expressions. ****


**** well actually Felix does allow

&*&x —> &x

but its really an optimisation.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 7, 2019, 12:40:41 PM3/7/19
to felix google


> On 8 Mar 2019, at 04:19, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> I am not sure I see the problem. If a unqiue pointer is stored in a product, you can only mutate the object as much as you like, but you can only copy/move the pointer itself once.

The problem is that the representation is:

var x = (10, box 42)


which has type int * uniq int.

The second component isn’t a pointer. We’re not storing a uniq pointer into a product.
Its the value stored there that is unique. “box” doesn’t mean boxing as in FPLs.
Box means, you got the nice teddy bear and gift wrapped it before giving it to
you kid as a present. The kid can’t play with the teddy bear without unboxing it.
If the box is on the shelf, you can take it off the shelf and move it somewhere else.
Then the place on the shelf where the box was is now vacant.

So the idea of a box is that there are no separate pieces, its “abstracted” in some sense.
But the box is still a value.

The box in x is stored at address

&x.1

which at the moment has type

&(uniq int)

Note the & means pointer in Felix type notation. The point is, we’re not storing
pointers in objects here. What we have is a variable which is a shelf,
and there are two things on the shelf: an int, and gift wrapped teddy bear int.

The “uniq” type is the wrapping. Its part of the value.

Felix keeps track of the individual shelves of the bookcase.
It knows when you move an value out of one of the shelves.
I.e. it tracks the components of a product separately AS WELL as tracking
the whole product (recusively to any depth).

It HAS to do that because of this:

fun f(x:int, y; uniq int)
fun g(param: int * uniq int)

these two functions have the SAME domain type. So does this:

fun h: int * uniq int =
| x,y => …

All three functions accept one argument because, all functions accept
one argument. The first form simply unpacks the argument into separate
variables for you.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 7, 2019, 1:05:49 PM3/7/19
to felix google

Remember the original problem: this is safe:

fun f (x:int) => x,x;
f (box 42)

The argument “box 42” has type uniq int, which can’t be duplicated.
But uniq in is a subtype of int, so the compiler inserts an unboxing
coercion and f gets an ordinary int, not a uniq one.

But this is NOT safe:

fun f[T] (x:T) => x,x;
f (box 42)

because now T = uniq int

In other words, the type system is unsound.

Now, if we only have say

!T — shared pointer to T
@T — unique pointer to T

the problem goes away because there is no “either shared or uniq pointer to T”.
This is your suggestion I think.

That fixes the problem because you cannot dereference a T, only a !T or a @T.

But it introduces a new problem: you can no longer have a product containing
a uniq values. It can certainly contain a uniq pointer or a shared pointer
but that’s not the same thing at all. In Felix products are expanded.
We need the first component to be shared and the second one uniq,
so the address of the whole product is an ordinary shared address,
the address of the first component is also shared, but the address
of the second one is unique.

That happens now. For example given

var x = (10, box 42);
var z = x;

that works. The whole product is copyable once. But now,
the first component can be copied again, the second can’t be:

var one = x.0; // fine
var two = x.1; // ERROR!

On the other hand:

var x = (10,box 42);
var two = x.1; // fine
var z= x; // ERROR

because copying the whole product now isn’t allowed, because the
second component has already been moved out.

All this works now in Felix.

Felix trackis every component of a product separately, It knows when you
move the whole container product it is just a sequence of moves of the
components.


John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Mar 7, 2019, 1:55:31 PM3/7/19
to felix google
Unique int is not a subtype of int though is it? Liskov's principle states that you can use the subtype anywhere you can use the type (I hope I have this the right way around). I can't use a unique int as the LHS of an assignment, but I can use an int, so it is not a subtype.

What about the other way around, that doesn't work either because we can't use a unique int on the RHS in the same place as an interview, because we cannot move/copy the pointer more than once.

So I think the subtyping rules are wrong?


Keean.

John Skaller2

unread,
Mar 7, 2019, 8:45:44 PM3/7/19
to felix google


> On 8 Mar 2019, at 05:55, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> Unique int is not a subtype of int though is it?

Yes, it is.


> Liskov's principle states that you can use the subtype anywhere you can use the type (I hope I have this the right way around). I can't use a unique int as the LHS of an assignment, but I can use an int, so it is not a subtype.

Liskov’s principle is incorrect. It’s language dependent. The principle only applies
to passing arguments by value. In any case in Felix

x = y;

means

storeat (&>x, y);

where &> means “write only pointer”. Since is a write only pointer is basically
a “set” method, it is contravariant in its domain. So write only pointers are
contravariant.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 7, 2019, 8:56:23 PM3/7/19
to felix google


> On 8 Mar 2019, at 05:05, John Skaller2 <ska...@internode.on.net> wrote:
>
>
> Remember the original problem: this is safe:
>
> fun f (x:int) => x,x;
> f (box 42)
>
> The argument “box 42” has type uniq int, which can’t be duplicated.
> But uniq in is a subtype of int, so the compiler inserts an unboxing
> coercion and f gets an ordinary int, not a uniq one.
>
> But this is NOT safe:
>
> fun f[T] (x:T) => x,x;
> f (box 42)
>
> because now T = uniq int
>
> In other words, the type system is unsound.

So, the answer is obvious.

I am doing the uniqueness checking too late. This is the C++ template error.
I have to do uniqueness analysis BEFORE monomorphisation.
Which is extremely nasty.

The point is that since

T = uniq U

is a valid substitution thie function:

fun f[T] (x:T) => x,x;

has to be judged a type error. In the theory, and as Keean is always saying,
you have to start with a substructural type system and add shares into it after,
not the other way around.

So if T is a substructural type, this is correct:

fun f[T] (x: share T) => x,x;

In other words “share T” is not a type combinator. More precisely

share: substructural types —> ordinary types

is the kind of “share”. So

share (share int)

is impossible because the kind of its argument has to be a substructural type,
not an ordinary type. IN Felix

uniq: TYPE -> TYPE

and that’s the design fault.




John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Mar 8, 2019, 2:49:56 AM3/8/19
to felix google
That makes sense. I still don't agree about the subtyping because unique should be a property of the container. A value cannot be 'unique' because it's immutable.

let t = Uniq 3
y = t
z = t // error 't' is empty
t = 4
t = 5 // error if linear type 't' already full, okay if affine type.

Note: linear types must be used exactly once, affine types can be used zero or one times.

I don't like this syntax, because uniqueness is a property of the pointer (or container) and not the value. It does not make sense on the RHS. Better would be:

let t : uniq int = 3
t = 4

It's clearer what is going on. Even better in my opinion would be:

let t : uniq [int] = [3]
t[0] = 4

As now the container is explicit, we could make unique an operator again to get rid of the type annotation:

let t = uniq [3]
t[0] = 4

This is probably my preferred option for clarity.


Cheers,
Keean.


John Skaller2

unread,
Mar 8, 2019, 10:33:49 AM3/8/19
to felix google


> On 8 Mar 2019, at 18:49, 'Keean Schupke' via Felix Language <felix-l...@googlegroups.com> wrote:
>
> That makes sense. I still don't agree about the subtyping because unique should be a property of the container. A value cannot be 'unique' because it's immutable.


I don’t disagree, I was just telling what the Felix implementation does.
In functional programming there are no containers. Even a list is a value.
Although a unique list doesn’t make sense, it does make sense
for the representation, which clearly does consist of objects.

It’s the representation that matters, uniqueness is only of any interest
because it supports optimisations.


John Skaller
ska...@internode.on.net





Keean Schupke

unread,
Mar 8, 2019, 10:59:57 AM3/8/19
to felix google
If something is a "functional" language variables should be immutable, and hence there would be no linear types. 

We can introduce mutable state to a functional language using the state monad. This works in a similar way to my examples, the variable is still immutable and contains a reference type:

f = do
   x <- newRef 13

x has type "Ref Int" not int. newRef would be in the state monad having type "a -> State (Ref a)"

The problem with the state monad is that monads don't compose (with other state spaces). We can solve this with "Freer Monads" as in "Eff" which provides a row-polymorphic effects monad.

So I think you end up at a similar place for n functional languages too.


Keean.


John Skaller2

unread,
Mar 8, 2019, 11:08:15 AM3/8/19
to felix google
So this is the situiation.

First, Felix started off as a C++/Algol style language with a simplifying rule:
all types must be first class (regular/copyable etc etc).

Of course, uniquness typing breaks the rule.

What I’m still puzzling over is that it works well for monomorphic types.
This is the definition in the library:

https://github.com/felix-lang/felix/blob/master/src/packages/unique.fdoc

There is also support in the unification algorithm which makes Uniq T
a subtype of T, and *special weird hacks* for pointers.
The uniqueness checking is done late by control flow analysis.


With this, we can do this:

https://github.com/felix-lang/felix/blob/master/src/packages/ucstring.fdoc

The synopsis is:



ctor : string -> ucstr
ctor : +char -> ucstr (unsafe)
proc delete : ucstr
fun len : ucstr -> size
fun set : ucstr * int * char -> ucstr
fun reserve : ucstr * size -> ucstr
fun append : ucstr * ucstr -> ucstr
fun append : ucstr * &ucstr -> ucstr doesn't consume second arg
fun + : ucstr * ucstr -> ucstr
fun + : ucstr * &ucstr -> ucstr doesn't consume second arg
proc += : &ucstr * &ucstr -> ucstr modifies first arg, doesn't consume second
fun erase : ucstr -> slice[int] -> ucstr
fun insert : ucstr -> int * ucstr -> ucstr inserts second arg into first at pos
fun dup : ucstr -> ucstr * ucstr destructive dup
fun dup : &ucstr -> ucstr * ucstr nondestructive dup


Now what makes this work is


// abstract representation
private type _ucstr = new +char;

// make it uniq
typedef ucstr = uniq _ucstr;


The ucstr is a standard C string, i.e. a NTBS, wrapped with uniq,
however the fact it is a NTBS is hidden, the user cannot get at
the char pointer: the type +char is a C char array, but the new operator
on types provides abstraction: ONLY the code inside the UniqueCString
class knows its a char pointer.

So, a ucstr is a unique VALUE. However the *representation* is a pointer.

The key point is some operations like appending can use mutators due
to the uniqueness. The ustr is ALWAYS unique, its impossible to share
the pointer (except by cheating).

This type was an experiment, its not an efficient string representation.
But the point is IT WORKS.

So Keean is both right an wrong. Unique does make sense for values.
The class above proves it. But also, it doesn’t make sense except that
the representation is a pointer, and uniqueness is used to allows
mutations on the representation which are referentailly transparent
IN OTHER WORDS the public type acts like a value in that the
mutations are hidden.

The core problem is that for *arbitrary* types a uniq type combinator
is nonsense. But its the only way to actually track which pieces
of a structured value like a product move instead of copy.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 9, 2019, 6:42:00 AM3/9/19
to felix google
A possible solution is to simply disallow the replacement of a type variable with
a uniq type.

That fixes

fun dup[T] (x:T) => x,x;

So if you want to call a poly fun on a uniq type you have to unbox it first.
if you don’t like that, you can do a specialisation:

fun dup[T] (x: uniq T) => x.copy, x;

This has no impact on monormophic code. Also, this would still work:

fun f(x:T) => ….
fun g(y: uniq T) => f x;

because uniq T is a subtype of T. What that means is that
the unboxing operation will be done automatically with a coercion.

The HARD bit: products.

BTW: at present, pattern matches cause an issue. Variants can’t
have uniq types as arguments. This is not because of logic but
because I haven’t figure it out. :-) Basically variant arguments
are immutable *unlike products* where you can write to a
product that is stored in a variable or on the heap.

Perhaps more interesting .. pattern variables are, I think,
automatically uniq, because they’re not allowed to recur
in patterns, i.e. patterns are intrinsically linear. But match
*handlers* can use a pattern variable more than once.
Hmmm.




John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 9, 2019, 6:49:09 PM3/9/19
to felix google


> On 9 Mar 2019, at 22:41, John Skaller2 <ska...@internode.on.net> wrote:
>
> A possible solution is to simply disallow the replacement of a type variable with
> a uniq type.
>
> That fixes
>
> fun dup[T] (x:T) => x,x;

So actually .. the rules I propose are already implemented.


fun f[T](var x:T) => x,x;

begin
var y = box 42;
var z = f y;
println$ z;
end

This works. But remove the var, and it fails. I checked the generated
code. The coercion is being inserted. But the function is inlined and
the val lazily evaluated:

z = unbox y, unbox y;

which of course uses y twice. Actually this is safe but only in this case
because there’s no mutation.

flow: normal= processing for y<69465> := (box<69347>primfun[int] 42);
1@3 apply((prj0:RWptr((int^2),[]) -> RWptr(int,[])), &z<69466>ref) <- (y<69465>varname :>> int);
1@3 ()
flow: normal= processing for apply((prj0:RWptr((int^2),[]) -> RWptr(int,[])), &z<69466>ref) <- (y<69465>varname :>> int);
Once error: Using uninitialised or already used once variable
(69475:->y)
Variable y defined at
/Users/skaller/felix/z.flx: line 4, cols 3 to 18
3: begin
4: var y = box 42;
****************
5: var z = f y;


In instruction apply((prj1:RWptr((int^2),[]) -> RWptr(int,[])), &z<69466>ref) <- (y<69465>varname :>> int);


Notice, the code reads:


&z.0 <- y.int
&z.1 <- y.int

so there are two uses of y.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 9, 2019, 7:23:47 PM3/9/19
to felix google
Here’s one I do not understand:

This passes:

fun erase (var x:ucstr) (sl:slice[int]) : ucstr =>

This fails:

fun erase (var x:ustr) (sl:slice[int]) : ustr =>


Its exactly the same string type in both cases.

// abstract representation
private type _ucstr = new +char;

// make it uniq
typedef ucstr = uniq _ucstr;

// abstract representation
private type _ustr = new +char;

// make it uniq
typedef ustr = uniq _ustr;


The second class is supposed to become a counted string but so far
I have just copied the code and changed the type name.

0@427 return erase'2<18206>closure;
0@427 (69789:->x)
flow: function return
Once error: Once variables unused!
(69789:->x)

The compiler is actually correct. Its a HOF, so the code is equivalent to

fun erase (x) => return (fun (sl) => …);

In other words, x is in fact not used. There’s no control flow using it.
The use is inside the returned closure. The flow analyser cannot handle this
case at the moment. There is an implicit address taken in the closure to
the frame of erase containing x and thus indirectly a pointer to x itself.

The question is .. why does this *correctly* fail .. when the previous identical
function does not. Everything works if I comment out the counter string
class.

The actual function is ALSO problematic:

fun erase (var x:ustr) (sl:slice[int]) : ustr =>
match sl with
| Slice_all => set (x,0,char "")
| Slice_from idx => set (x,idx, char "")
| Slice_from_counted (first,len) => strmov x (first,first+len)
| Slice_to_incl incl => strmov x (0,incl)
| Slice_to_excl excl => strmov x (0, excl - 1)
| Slice_range_incl (first, last) => strmov x (first, last+1)
| Slice_range_excl (first, last) => strmov x (first, last)
| Slice_one pos => strmov x (pos, pos+1)
;

Here x is correctly used in all branches. However Felix doesn’t
know that the matching is exhaustive and adds an MATCH_FAILURE
branch at the end .. which doesn’t use x. You can prevent this I think
with a final wildcard branch .. which also wouldn’t use x unless you wrote

| _ -> C_hack::ignore(x)

However this isn’t the issue. Note that with a HOF there REALLY is a problem.

var closure = erase x;
closure (1..20);
closure (1..10);

The two calls cause x to be used twice.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 10, 2019, 9:36:51 AM3/10/19
to felix google


> On 10 Mar 2019, at 10:49, John Skaller2 <ska...@internode.on.net> wrote:
>
>
>
>> On 9 Mar 2019, at 22:41, John Skaller2 <ska...@internode.on.net> wrote:
>>
>> A possible solution is to simply disallow the replacement of a type variable with
>> a uniq type.
>>
>> That fixes
>>
>> fun dup[T] (x:T) => x,x;
>
> So actually .. the rules I propose are already implemented.


>
> fun f[T](var x:T) => x,x;
>
> begin
> var y = box 42;
> var z = f y;
> println$ z;
> end
>
> This works. But remove the var, and it fails. I checked the generated
> code. The coercion is being inserted. But the function is inlined and
> the val lazily evaluated:
>
> z = unbox y, unbox y;
>
> which of course uses y twice.

Fixed. I was checking the argument type for uniquness (not just the whole
type but all its components as well eg int * uniq int is not uniq but contains
a uniq type).

But this is wrong:

coerce[uniq int -> int] (box 42)

has type int. I needed to check the types of all sub-expressions.

Fixed. The above works now without the var. Uniq part forces
eager evaluation. There may be other cases i haven’t found.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 15, 2019, 10:57:51 AM3/15/19
to felix google
Felix now chains together nominal type coercions:

//////
struct XX { a : int; };
struct YY { b: double; }
struct ZZ { c: string; }


// XX -> YY
supertype YY (x:XX) => YY (x.a.double + 0.76);

// YY -> ZZ
supertype ZZ (x:YY) => ZZ (x.b.str + "!!");

proc showYY (x:YY) { println$ "YY.b = " + x.b.str; }
proc showZZ (x: ZZ) { println$ "ZZ.c = " + x.c; }

var xx = XX (23);
showYY xx;

showZZ xx;
//////////

Here XX is a subtype of YY which is a subtype of ZZ.
To pass the XX value xx to showZZ, which has parameter of type XX,
we have to coerce XX to YY and then that to ZZ.

If there is more than one coercion chain, Felix picks one of the
shortest ones. The program must ensure all coercion chains
are semantically equivalent.

The purpose is that, if you say decide

tiny < short < int < long < vlong

then a function accepting a long will accept a tiny, short,
int, or long. If you want you can add extra coercions
to bridge multi-step coercions for performance reasons but
the semantics must be the same.

Overloading in Felix at the moment DOES NOT CARE ABOUT THE LENGTH
OF THE COERCIONS CHAINS. For example, if you have an argument
of type short, and functions of type int and long, it will pick the function taking
the int, but not because it is closer, but because int is a subtype of long,
and so is more specialised. It picked the shortest chain.

But consider A is a subtype of B is a subtype of C is a subtype of D,
and A is also a subtype of X, then if you have an X and a D function
and call with an A argument it is ambiguous because neither of X and D
is a subtype of the other. It won’t pick X, just because less coercions
are required to get there.

Not that still, at the moment, only *monomorphic* nominal types can be
handled. Polymorphic nominal types should be trivial provided
they have the same number of type variables in the same order eg

supertype P[T] (x:B[T]) => …

HOWEVER consider the question: is B[int] a subtype of P[long]??

Before we can answer that we have to know if B[int] is a subtype of B[long],
in other words, B is *covariant*. Here is a counter-example:

struct B[T] { f:T -> string; }

In this case B is contravariant because -> is contravariant in its first argument,
the function domain (and covariant in the second). And, if that were instead:

struct B[T] { f: T -> T; }

then B is invariant, since it has to be both co- and contra-variant. Most P/L’s that
handle variance require the compiler to calculate it if the data structure is visible,
or specify it if it is abstracted, eg Ocaml.

At least to start I think this is too hard, so we will have to insist in invariance.
In this case, we could actual just write the supertype rule without any
type paraeters at all. The existing algorithms would then “just work”.


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 16, 2019, 1:29:04 AM3/16/19
to felix google
So, this works fine:



supertype vlong: long = "(long long)$1";
supertype long: int = "(long)$1";
supertype int : short= "(int)$1";
supertype short : tiny = "(short)$1";

proc f(x: vlong, y:vlong) { println$ "vlong " + (x+y).str; }
proc f(x: long, y: long) { println$ "long " + (FloatAddgrp[long]::add(x,y)).str; }
//proc f(x: int, y:int) { println$ "int " + (FloatAddgrp[int]::add(x,y)).str; }

f (1s,2);

and, it works fine if I uncomment the last entry. And this works too:

class A {
proc f(x: vlong, y:vlong) { println$ "vlong " + (x+y).str; }
proc f(x: long, y: long) { println$ "long " + (FloatAddgrp[long]::add(x,y)).str; }
proc f(x: int, y:int) { println$ "int " + (FloatAddgrp[int]::add(x,y)).str; }
}

open A;

f (1s,2);

This works but:

class X[T] {
fun ad: T * T -> T = "$1+$2";
}
open X[tiny];
open X[short];
open X[int];
//open X[long];
//open X[vlong];

println$ ad(1,2s);

It FAILS is I uncomment the last two comments.

[resolve_overload] Ambiguous call: Not expecting equal signatures due to same function
fun ad<68616>:<T68612>^2
but distinct type variable arguments
1: long returns long
2: vlong returns vlong
try using explicit type arguments!
See: /Users/skaller/felix/b.flx: line 45, cols 10 to 18
44:
45: println$ ad(1,2s);
*********
46:

See: /Users/skaller/felix/b.flx: line 37, cols 3 to 32
36: class X[T] {
37: fun ad: T * T -> T = "$1+$2";
******************************
38: }

The diagnostic is saying int * short unifies with both signatures

long * long
vlong * vlong

which is correct, it does. It did in the class A case too.
However in the class A case the functions were distinct
even though they had the same name, and overload resolution
selected the most specialised one correctly.

In the class X case, its the *same* function with two different
specialisations exported, with the same signatures, but the
algorithm is NOT choosing the most specialised one, it’s barfing instead.

Its testing using Flx_unify.compare_sigs. I had trouble with this before.
It uses unification including subtyping. however, I upgraded the diagnostic:

[resolve_overload] Ambiguous call: Not expecting equal signatures due to same function
fun ad<68616>:<T68612>^2
but distinct type variable arguments
1: long sig <T68612>^2 returns long
2: vlong sig <T68612>^2 returns vlong
try using explicit type arguments!

and now it’s clear: the signatures being compared are polymorphic, that is,
the bindings of the type variables haven’t been substituted into the signatures.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 16, 2019, 5:00:23 AM3/16/19
to felix google
>
> [resolve_overload] Ambiguous call: Not expecting equal signatures due to same function
> fun ad<68616>:<T68612>^2
> but distinct type variable arguments
> 1: long sig <T68612>^2 returns long
> 2: vlong sig <T68612>^2 returns vlong
> try using explicit type arguments!
>
> and now it’s clear: the signatures being compared are polymorphic, that is,
> the bindings of the type variables haven’t been substituted into the signatures.
>

I know exactly why this happens: the type from the view, long or vlong above,
isn’t replacing the type variable T68612 from the class, which it should.

The open X[long] and open X[vlong] etc create entries in the private name space.
ALL such entries are views. For example a plain old function

class A { fun f[T] :T => .. }

has a view in the class A name map like

for all U. f [U]

In other words ALL namespace entries are specialisations, it just that some
are identiy specialisation (like the one above).

When we do overloading its the view type variables we first solve for,
THEN we can substitute them away in the view types which are then
used to specialise the base function’s types, eliminating the base
function type variables entirely, as well as the view type variables.
There may be type variables left, if so these come from the calling
context. These are existential, that is, the result is monomorphic
(since the context type variables are considered abstract types,
not variables).

The view type variables can be set

(a) explicitly like f[int]
(b) by overload resolution

Problem is the algorithm is extremely messy. Here were have the base type
variables set by the open statement. The algo will fix that later but
it is failing too early. If I do the subtitution earlier something else breaks :-)



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 16, 2019, 1:51:12 PM3/16/19
to felix google


> On 16 Mar 2019, at 20:00, John Skaller2 <ska...@internode.on.net> wrote:
>
>>
>> [resolve_overload] Ambiguous call: Not expecting equal signatures due to same function
>> fun ad<68616>:<T68612>^2
>> but distinct type variable arguments
>> 1: long sig <T68612>^2 returns long
>> 2: vlong sig <T68612>^2 returns vlong
>> try using explicit type arguments!
>>
>> and now it’s clear: the signatures being compared are polymorphic, that is,
>> the bindings of the type variables haven’t been substituted into the signatures.
>>
>
> I know exactly why this happens: the type from the view, long or vlong above,
> isn’t replacing the type variable T68612 from the class, which it should.

Ok so fixed that and now there are some other problems.

The first one was that we got equal signatures with type variable substitutions:

fcomplex

typematch float with
| float => fcomplex
| double => dcomplex
| ldouble=> lcomplex
endmatch

Of course the second expression SHOULD have been reduced to the first one.
Just a missing beta-reduce somewhere …

I commented out all the complex number stuff to bypass this because it occurs
in the library and prevents running any test cases. After that i get this
kind of error in some test cases:

Flx_lookup:inner_bind_expression: Client Error binding expression (== (b, b))
Error at:
/Users/skaller/felix/build/release/test/regress/rt/tuple-02.flx: line 36, cols 10 to 16
35: b:= 1,2;
36: println$ b == b;
*******
37:

[Flx_lookup.bind_expression] Inner bind expression failed binding (== (b, b))
CLIENT ERROR
[flx_bind/flx_overload.ml:1305: E251] Too many candidates match in overloading == with argument types (int^2)^2
Of the matching candidates, the following are most specialised ones are incomparable
Eq[t: TYPE where _typeop(_staticbool_and,(TRUE, TRUE),BOOL)]::==<6153><6153> sig (<T3090> * <T3091>)^2
Eq[t: TYPE where _typeop(_staticbool_and,(TRUE, TRUE),BOOL)]::==<6153><6153> sig (<T3802>^<T3803>)^2
Perhaps you need to define a function more specialised than all these?
In /Users/skaller/felix/build/release/test/regress/rt/tuple-02.flx: line 36, cols 10 to 16
35: b:= 1,2;
36: println$ b == b;
*******
37:


Now the thing is .. on this occasion THE COMPILER IS RIGHT:

case 1: T3090 = int, T3091=2
case 2: T3802 = int, T3803 = int

These substitutions do indeed lead to (int^2)^2 == (int * int)^2

Now the thing is .. i never actually could figure out how the tuple comparison code worked at all.
Now i know .. it never did :-)

Here is the tuple comparison code:

// Class Eq: Equality
instance [T,U with Eq[T], Eq[U]] Eq[T ** U] {
fun == : (T ** U) * (T ** U) -> bool =
| (ah ,, at) , (bh ,, bt) => ah == bh and at == bt;
;
}

instance[t,u with Eq[t],Eq[u]] Eq[t*u] {
fun == : (t * u) * (t * u) -> bool =
| (x1,y1),(x2,y2) => x1==x2 and y1 == y2
;
}

instance[t with Eq[t]] Eq[t*t] {
fun == : (t * t) * (t * t) -> bool =
| (x1,y1),(x2,y2) => x1==x2 and y1 == y2
;
}

and the array code:

//$ Equality and Inequality.
instance[T,N with Eq[T]] Eq[array[T, N]] {
fun == (xs:array[T,N],ys:array[T,N]) = {
val n = xs.len;
// assert n == ys.len;
if n == 0uz do
return true;
else
for var i:size in 0uz upto n - 1uz
if not (xs.i == ys.i) return false;
done
return true;
}
}


It’s quite clear, the last tuple comparison is dealing with an array of length 2.
It is more specialised than the instance before dealling with a pair.
It is also more specialised than the array code, dealing with length N.

HOWEVER these are instances. They have NOTHING to do with overloading.
They’re used during monomorphisation when finding type class instances.
What matter is the signatures opened.

For arrays:

open[T,N] Eq[array[T,N]];

For tuples:

open [T,U with Tord[T], Tord[U]] Tord[T ** U];
open [T,U with Tord[T], Tord[U]] Tord[T * U];

Since Tord inherits Eq, this opens Eq as well.

The first case for tuples doesn’t apply, so there are two cases:

Eq[T^N]
Eq[T * U]

and that is what is causing the ambiguity.

IMHO .. it really IS ambigfuous.
I note the array code ALSO says this:

open[T,N] Tord[array[T,N]];

Well actually .. it should be

open [T,N with Tord[T]] Tord[T^N];

since, to order an array lexicographically requires ordering the individual
elements (as in the tuple case). Stuff in Felix sometimes mysteriously works. ..
maybe i just never compared to arrays for order.


So .. welll hey, lets fix it by adding:

open [T with Tord[T]] Tord[T * T];


WOW! That worked!


John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 17, 2019, 5:59:27 AM3/17/19
to felix google
So now it seems to work, and now:

supertype vlong: long = "(long long)$1";
supertype long: int = "(long)$1";
supertype int : short= "(int)$1";
supertype short : tiny = "(short)$1";

is in the library. This means mixed mode *signed* arithmetic now works
using integral promotion. Note, Felix still ties to add tiny’s together
so its not quite the same as C (but the calculation is done in C anyhow).

It’s not clear if unsigned integers are subtypes.

More precisely, unsigned integers are nonsense in the first place.
They only exist because of limitations on early computers when C was designed.

For floats, the rule is counter intuitive:

ldouble < double < float

The biggest float with the biggest range and most precision is a subtype of smaller floats.
The reason is: if you have an iteration with a convergence critreria using floats,
then you can safely use a double, throwing away precision .. and get the same result.
It doesn’t work the other way.

All this really leads to a difficulty question: what the heck IS a subtype anyhow??

You are not going to like the answer!! The correct answer is: its any dang thing you please,
provided the subtypes chosen are consistent. [The technical term is “coherent”].

One coherence rule is transitivity.

So now consider: could we have float < double < ldouble?

I think probably, yes. The integer rules is based on the “isa” notion,
that is, the denotational semantics of small integers say the values are
a subset of bigger ones, so they’re subtypes.

On the other hand, a record with extra fields over another is a subtype,
so the coercion throws away information. So we have two quite different
rules, one throwing away information, and one based on preserving it
via an embedding.

Rules have consequences. For example

&<int is a subtype of &<long

because read-only pointers are covariant and int is a subtype of long.

Is this sustainable?????

Certainly for Abstract pointers it’s no problem, but for machine pointers,
I suspect it cant work. For abstract pointers, you just fetch the int
and promote it to long. There’s no way to actually implement that
with machine pointers, because static casts just change the type,
which is *always* wrong unless the types are layout compatible,
which int and long are not. The pointers ARE, but the deref on
a non-matching pointer is undefined.

So Felix says machine pointers are invariant (all kinds).



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 24, 2019, 7:26:49 PM3/24/19
to felix google


> On 17 Mar 2019, at 20:59, John Skaller2 <ska...@internode.on.net> wrote:
>
> So now it seems to work, and now:
>
> supertype vlong: long = "(long long)$1";
> supertype long: int = "(long)$1";
> supertype int : short= "(int)$1";
> supertype short : tiny = "(short)$1”;


> So now consider: could we have float < double < ldouble?

But the real question is: should we have ldouble < double < float.


Because that is the “correct” subtyping rule. The C rule is actually wrong.

First, lets review the integer rule: given functions f(T * T) where T is integral,
a mixed call like f (int, long) selects the smallest type T such that T * T matches.
That’s long * long. So the calculation is carried out with the LEAST precision possible
because subtypes have less precision that their supertypes.

Now consider f (T * T) with the correct subtyping rule for floats. Here, the subtyping
coercion *demotes*, that is, throws away precision. A call goes to largest type T
such that T * T matches, so that the calculation is carried out with the MOST
possible precision. This isn’t what you think:

f ( 1.0, 1.0L ) // doule, ldouble

is carried out at double precision, by demoting the ldouble to double. In C, that would
be done at ldouble precision. So basically if you have a high speed single precision
float calculation and feed in a double value, it is demoted to single precision.

The justification is that, promotion introduces precision that doesn’t exist in the
data and this can lead to false results. Typically we have numerical algorithms
which are repeated to refine a result until a convergence condition is met.
A loop that converges in float domain may run forever in double domain because you’re
trying to get precision that simply doesn’t exist.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 25, 2019, 11:54:39 PM3/25/19
to felix google
Felix has too many confusing integer types. I’m tutorialing compact linear types.

My first observation is a long standing one: with compact linear types
the exact unsigned integers uint8,16,32,64 are not required,
the types 2^(8^n) for n=1,2,3 will do instead, i.e. uint8 and 2^8
have the same representation anyhow, and since uint8 is abstract,
we could just define maths on 2^n instead.

However more nastier is the fact Felix allows compact linear types
to be coerced to int. But not other integer types. But cl_t is uint64_t
in the RTL.

I’m thinking to do this:

uintptr, size are synonyms
intptr, ssize, ptrdiff are synonyms

address is distinct but the same size as both

Actually it should be called baddress because derefs return bytes.
Which is yet another integer type. :-)

We also should have waddress which returns waddresses, i.e. an aligned
pointer to a machine word.

These assumptions are safe on all modern processors. The distinction between
pointers and machine word sizes was necessary on the x80286 family
due to segmentation, on MS DOS and with Windows 3.1. AFAIK, all modern
processors are either 32 bit linear addressing or 64 bit linear addressing.
The x86_64 is actually a biatch because int is 32 bits even though pointers
and registers are 64 bits.

Compact linear types also have a problem: they’re universally 64 bits (even on
32 bit platforms). But we should use shorter lengths if we can for space efficiency,
in particular bool is 64 bits on a 64 bit platform in Felix, but only a single byte for
most C++ implementations, forcing me to introduce cbool as well. For values
it makes no difference but for pointers it does.

Felix is designed for desktops, laptops and servers .. not microcontrollers,
but it should work on most modern phones and tablets, and game consoles,
since these generally have either x86 or ARM processors which are standard
linear addressing machines. Code pointers don’t matter because Felix doesn’t
have them anyhow.





John Skaller
ska...@internode.on.net





John Skaller2

unread,
Mar 27, 2019, 7:53:10 AM3/27/19
to felix google
This now works:

/////////////
typedef point = (x:int, y:int);
typedef coloured_point = (x:int, y:int, colour: int);
typedef elevated_point = (x:int, y:int, z:int);

fun step_right[T] (a : (x:int, y:int | r2:T)) => //NOTE label r2 here
//let r = (a without x y) in
(x=a.x+1, y=a.y | a.r2)
;

var cp : coloured_point = (x=0, y=0, colour=42);
var ep : elevated_point = (x=0, y=0, z=100);

println$ cp.step_right._strr;
println$ ep.step_right._strr;
////////////

The change is that you can now name the type variable in the polyrecord
parameter to get all the fields (as an opaque value) other than the ones
listed so that a.r2 above is equivalent to the commented out variable r.

The extension ONLY works for fetching values, that is, you cannot
assign to or take the address of a.r2 even if the parameter ‘a’ is
declared var. The reason is that after monomorphisation, the fields
actually in r2 need not be contiguous, they will be sorted along
with the fields x and y. Although assignment could be implemented,
it is hard and not worth the effort. What is implemented is just syntactic
sugar for something you could already do.

Some bugs may be found. The implementation required adding the identifier
to the bound type, however it is actually an alias and not part of the type.
So code doing unification or type comparisons has to ignore it.
The field is stripped out during monomorphisation, indeed, polyrecords
are eliminated during monomorphisation.

The primary hassle is that the polyrecord constructor function is used to eliminate
polyrecords, by shifting known fields to the left of the vertical bar | but it cannot
do this if the RHS of the | has a label.



John Skaller
ska...@internode.on.net





John Skaller2

unread,
Apr 8, 2019, 12:28:49 AM4/8/19
to felix google
I just turned on the unsigned 128 and 256 bit ints.
I have split the library up a bit, separating float, complex, int, quaternion, and random.

I want to try adding 128 and 256 bit signed ints .. using Felix to do it, instead of C++.
Addition is trivial .. the operations are identical (signed 2’s complement and unsigned
use exactly the same machine instruction to do addition).

Multiplication can be done by testing the sign, multiplying the absolute values,
and then negating if necessary. Signed ops can give the wrong answer if
the positive values are too big or the negatives too small. The nasty cases
are things like

1 * MINVAL
2 * (half-minal)

The result should be MINVAL. MINVAL is a 1 bit followed by all 0s. The two’s
complement is the complement + 1. The complement is a 0 bit followed by all 1’s.
Add 1, and we get a 1 bit followed by all 0’s again. Multiple by 1, value unchanged.
Now 2’s complement again, and we get the right answer.

So I should prove a theorm that

s1 * s2 = sgn(s1) * sgn(s2) * abs(s1) * abs (s2)

IF

abs(s1) x abs(s2) is representable


The * ops in the first line are computer operations, the x in the second line
is the mathematical multiplication.

The idea is roughly that if the absolute values considered as unsigned values
are representable as a *signed* value, then the formula produces the
correct result. The NASTY case is when the result is MINVAL, which is
representable, but there is no corresponding representable absolute
value.

Division is harder. We have to use the same trick: do the division of absolute
values, then adjust based on the sign. C rounds the dividend towards zero.
The remainder has to be adjusted accordingly.

The funny thing is exhaustive testing is possible by applying the algorithm
generically to unsigned char. The number of cases is small enough to check
every case. If it works for 8 bits, it has to work for any number of bits 8 or above.





John Skaller
ska...@internode.on.net





John Skaller2

unread,
Apr 10, 2019, 12:48:27 AM4/10/19
to felix google
So this has exposed some interesting issues.

The subtype coercion registryis expressed with abstract types if necessary.
When the abstract types are downgraded to their representations, there’s
a problem: the representation need not be a nominal type. The registry
can only handly monomorphic nominal type subtyping. So the downgrade
can fail.

But there’s worse. Coercions are only implemented late, after downgrading.
If the downgrade succeeds, we end up coercing the representation types
instead of the abstract types. This will give the wrong result if the downgrade
is not structure preserving. And it isn’t, in the case of

type int128 = new uint128;
type int256 = new uint256;
supertype int128: int64 = “uint128_t($1)”;
supertype int256: int128 = “uint256_t($1)”;

The problem is that the promotion of unsigned integers just adds
zero bits at the high end, but to promote signed integers,
we have to do sign extension. The problem is the second case,
because int128 downgrades to uint128, and thus the promotion
is then not sign extending.

The problem is hard to solve because the coercion solver sees
only downgraded types, i.e. unsigned integers for sizes 128 and 256,
and the coercion rule for them is wrong for signed types.

To be explicit .. the coercion is represented as

coerce (x, t)

and is not replaced by the function which would do the right job,
but one that does the wrong job, because there really is a coecion
for the unsigned types as well and its different.

The solution is to remove the coercions *before* removing abstract
types. Also care is needed because C++ is going to see the unsigned
types here, so we have to make sure we don’t just use a cast as
I did (we actually have to call a sign-extension promotion function
on the unsigned int).

I have no idea what happens if I expand the coercions first,
it might break something else.

HOWEVER I’m not getting the wrong result! I’m getting a CRASH:

flxg: frontend ........................... : 0m33.9
flxg: bind ............................... : 0m0.1
[flx_opt]; Polymorphic Uniqueness Verification .............0s
[flx_opt]; Pre-monomorphisation user reductions ............0s
[flx_opt]; Finding roots ...................................0s
[flx_opt]; Monomorphising ..................................0s
[flx_opt]; Verifying typeclass elimination .................0s
[flx_opt]; Simplify requirements ...........................0s
[flx_opt]; Downgrading abstract types to representations ...0s
[flx_opt]; Verifying abstract type elimination .............0s
[flx_opt]; Removing unused symbols .........................0s
[flx_opt]; Uncurrying curried function .....................0s
[flx_opt]; Generating wrappers (new) .......................0s
[flx_opt]; Set SVC funs inline ............................0s
[flx_opt]; Inlining ........................................0s
[flx_opt]; Post-monomorphisation user reductions ...........0s
[flx_opt]; Remove unused symbols ...........................0s
[flx_opt]; Expanding Coercions (new) .......................0s
[flx_opt]; Stripping Lambdas (new) .........................0s
[flx_opt]; Eliminate dead code .............................0s
[flx_opt]; Do stack call optimisation ......................0s
[flx_opt]; Mark heap closures ..............................0s
[flx_opt]; Very Late Uniquness Verification ................0s
[flx_opt]; optimisation pass complete ......................0s
flxg: optimse ............................ : 0m0.1
flxg: lower .............................. : 0m0.0
flxg: codegen ............................ : 0m0.0
55
-55
110
false
true
-3025
Calculating dividend
Shell terminated by signal SIGILL

and I know why: there’s an infinite loop due to infinite tail recursion.
Hmmm.


John Skaller
ska...@internode.on.net





Reply all
Reply to author
Forward
0 new messages