The Felix microtasking kernel is running reasonably well in itself.
It uses spinlocks on channel I/O and on the scheduler queue shared between
concurrent schedulers, and performs no allocations in these operations.
However, USER code in resume() methods (and elsewhere) can cause
two dangerous operations:
1. OS mutex locks
2. OS calls in general
The first case arises from use of operator new to create GC objects.
The underlying malloc() poses a threat, although modern mallocs should
be using thread local store and other techniques accessing buffered
shared memory to minimise system calls. Worse though, GC allocations
require several updates to Judy tables which, although efficient, can still
call malloc().
Also, GC collections cause huge delays if the GC is used heavily.
The Judy problem is hard. Judy actually allows you to use a different
allocator, but it is a global variable and cannot be passed to Judy functions.
Because Judy uses fixed node sizes it is perfect for a custom allocator.
Also the nodes are small and if we expect Judy to use a small fraction of the
total program allocations, it makes sense for Judy to use a specialised allocator
that does not ever return memory to the OS. Thus on a long running program,
the amortised time lost due to allocations will be zero.
Even better, temporary Judy arrays used in GC might use an arena allocator
which is simply reset to empty when the array is cleared.
However, we also need faster allocations for both system and user objects.
Ignoring the problems of coding an allocator, we want the basic operation
to be wait-free as much as possible; that is, to require either no synchronisation,
by using thread local store (using Felix thread objects to hold the data NOT
system thread local storage!), or, if shared, only spinlocks.
The PROBLEM is that we do have to call malloc and free to get memory.
One way around this, sort of, is to use a server thread. The server holds say
10 empty pages and the allocators call the server to request a page,
using spinlocks to do the fetch or return an unused page. Using lists
this is extremely fast.
So the problem is reduced to the server calling malloc and free,
but it is in a separate thread, so PROVIDED it keeps the
list of available pages non-empty whenever a client calls
for a page, no spinlock will be held up by memory requests.
Note spinlocks can ALWAYS be held up by pre-emptions that
de-schedule the holder of the lock, forcing the lock requestor
to spin waiting on the OS to re-schedule the holder so it can
finish its task.
This can ONLY be avoided by assuring the computer as a whole
has each user task locked to its own core. Ideally, pre-emptions
would be disabled. In practice this just means finding the right
number of threads for best performance; on an 8 core that could
be 4, 6, 8, or even 20.
The memory server should use “machine learning” to adjust itself
so it is able to efficiently service requests. A simple algorithm could
suffice, with environment variables allowing the initial parameters to
be set.
One coding problem here: some applications want to be single threaded.
So the server will have to work in that context too; service calls will
be subroutine calls. In this case locking isn't needed anywhere, but it may
be hard to ensure locks are not used unless needed.
Presently the kernel uses a trick with the scheduler: to run coroutines
as processes (i.e. concurrently) you have to call “spawn_process”
instead of “spawn_fthread” at least once from a running coroutine.
This dynamically enables spinlocks.
Note also the algorithm for holding onto and releasing the pthreads
used to service calls is unprincipled. I basically make all the threads
hang around until ALL threads have stopped and there are no async
requests.
A flexible scheme would allow a user specified allocator. In particular
the system might use a special allocator. In general, this leads to a problem
with the GC in that it now has to find the corresponding deallocator.
An obvious solution is to use yet another Judy array to track all non-standard
allocators per object.
However, for bulk page allocators, we could instead check if the pointer to the
object is interior to a bulk allocated page and if so call the required deallocator,
otherwise call the standard deallocator (which calls free() at the moment).
A more constrained solution puts the allocator pointer into the type object.
When the type object (shape_t) is in the library we can statically fix the
allocator. When the flxg compiler generates the shapes, it must choose the
allocator with user guidance, statically. When the type object is created at
run time (which we don’t do at the moment) the allocator can be passed
to the constructor.
But this method fixes the allocator based on the type of the object.
The best solution is probably to do BOTH. In other words, we maintain
a Judy array for “per object exceptions” to the allocator specified
in the type object.
We note this slows down all deallocations, whilst the overhead of
recording a special allocation is acceptable because .. well .. it's
special.
So .. there’s some work to consider. At present I want to write a
hard specification for the kernel: data structures and API.
—
John Skaller
ska...@internode.on.net