Rambling, content-free update #0

William ML Leslie

Dec 10, 2022, 1:42:08 AM
to cap-talk
Greetings everyone,

I thought I might record some of the things I've been doing over the last six months, not because I have something exciting to announce, but more to let you into the process of trying to bootstrap a fully-featured operating system on top of a capability system.  There's no real takeaway other than "wleslie clearly doesn't know what he is doing", and I can't promise I'll be half as clear or interesting as listening to Shap think about this stuff out loud.  Consider yourself warned (:

The first half of the year had been something of a write-off for unrelated reasons, but by the end of June I had resolved to spend a few hours each week working on problems that would make me happy, without having to spend 10-12 hours a week on them.  The idea was not merely to explore, but to spend most of the time implementing, seeing which things would work and when I'd need to go back to the drawing board.  One of the clearly-defined goals I had was to be able to run MontE on top of Coyotos.  I think it would make a great language in which to write the secretary and sample user shell.

The secretary - just for those not familiar with KeyKOS terminology - is the process which represents the user's presence on the system.  It holds your capabilities so that you can use them when you return, whether that's via local or remote log-in.  It may be an analogue to the desktop or home directory.  In a persistent system, it also holds your running programs, as well as a capability to the set of global resources.  I'm not sure if I want to stick with the "system is a business" nomenclature for Coyotos, but Glyph did rightly point out that Anthropomorphism is a Powerful Warrior in the fight against complexity.

Can we run rpython?

The MontE runtime, Typhon, is written in rpython.  rpython doesn't really do cross compilation, unless the target is ARM.  The reason is a bit silly: it expects to be able to run target executables to determine the size of common C types, and there's no easy way to just provide that information.  So, I patched the module rpython.rtyper.tool.rfficache to return the sizes supported by our cross toolchain.  I also implemented rpython.translator.platform.coyotos to select the correct compiler and correct link flags.

When trying to do anything unusual with rpython that requires an executable - that is, anything that you can't do in an untranslated test - you can try building the tinylang runtime.  Tinylang is used for testing the impact of JIT changes, and the repo includes test programs such as fib.

[wleslie@pop-os:~/src/pypy/rpython/jit/tl]
 14:02:08 $ ~/pypy2.7-v7.3.9/bin/pypy ../../bin/rpython -O2 --platform coyotos targettiny1

This would fail with some missing symbols.  I already knew about a few of these, as I'd inspected the symbols that a native build of typhon expects - things like read, write, isatty, and lseek.  I wrote a quick implementation of these that would work with newlib and consume the Coyotos iostream API, since the lack of open() in the list of symbols for tinylang indicated that these were just for interacting with stdio.  I rebuilt with my newly fleshed-out C library, only to find that it was now complaining about pthread.
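
For flavour, the shims look roughly like this - iostream_write() and STDOUT_CAP are invented stand-ins for the generated CapIDL stub and the capability holding the stream, so treat this as a sketch of the shape rather than the real API:

    #include <errno.h>
    #include <sys/types.h>

    /* Hedged sketch: forward stdio writes to the Coyotos iostream.
     * Only fds 0-2 exist at this stage, so everything else is EBADF. */
    ssize_t write(int fd, const void *buf, size_t len) {
        size_t done = 0;
        if (fd < 1 || fd > 2) { errno = EBADF; return -1; }
        if (iostream_write(STDOUT_CAP, buf, len, &done) != 0) {
            errno = EIO;
            return -1;
        }
        return (ssize_t)done;
    }

    int isatty(int fd) {
        /* the stdio descriptors are the only ones we back, and they
         * are a terminal */
        return fd >= 0 && fd <= 2;
    }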

I think rpython only uses pthread to ensure that it has sufficient stack space for running app-level code.  Nevertheless, it was clear that we were going to need pthread.  The symbols used by tinylang didn't seem scary either - thread create and detach; attr init, destroy, setscope and setstacksize; and condition variable support: init, destroy, signal, wait, and timedwait.  I got to work on a fairly straightforward implementation of these.  Lacking drivers for timer support, some of these would have to just pretend to do the right thing, but all in all, implementing pthread posed only a few challenges.

One is - how does a thread know its own identity?  We don't have a supervisor that can answer that question for said thread, and I don't want to introduce one.  Ideally, we'll want to implement functions like pthread_self without a system call.  The way that many systems (such as GNU) do this is to allocate a register to point at the thread's TLS area.  This technique was specified in the brilliant ia64 architecture manual.  Unlike ia64, however, x86 does not have a large set of general purpose registers from which one can be claimed for this purpose; instead we use the segment registers: GS on 32-bit x86 and FS on x86_64.
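
Concretely, the syscall-free identity lookup comes out something like this on x86_64, assuming the common convention where the first word of the thread control block is a self-pointer reachable through FS (the TCB layout here is only illustrative):

    /* Thread identity without a system call: FS points at the TCB, and
     * the TCB's first word points back at itself, so %fs:0 is our TCB. */
    typedef struct tcb {
        struct tcb *self;    /* %fs:0 - self-pointer */
        unsigned long tid;   /* our thread identifier */
        void *stack_base;    /* stack bounds, TLS slots, etc. follow */
    } tcb_t;

    static inline tcb_t *current_tcb(void) {
        tcb_t *t;
        __asm__ ("mov %%fs:0, %0" : "=r"(t));
        return t;
    }

    pthread_t pthread_self(void) {
        return (pthread_t)current_tcb();
    }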

Note that Coyotos has sensible treatment of segment registers.  It initialises them to zero, it saves and restores them on context switch, and it's possible to set them using the Process cap.  This means that a new thread can know its identity the moment it starts executing.  So, to start a new thread, you can allocate a Process, set its registers (with setFixRegs), then set its address space and pc and start it.
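
In sketch form, thread creation then looks something like the following.  setFixRegs is the operation named above, but the C names, the fixregs layout, and the other Process-cap calls here are invented stand-ins:

    /* Allocate-and-start for a new thread.  The new process learns its
     * identity from the FS base we hand it, with no supervisor involved. */
    int thread_start(cap_t process_cap, tcb_t *tcb,
                     void *stack_top, void (*entry)(void)) {
        fixregs_t regs = {0};
        regs.rip = (uintptr_t)entry;      /* initial program counter */
        regs.rsp = (uintptr_t)stack_top;  /* fresh stack */
        regs.fs_base = (uintptr_t)tcb;    /* identity: FS points at the TCB */

        process_setFixRegs(process_cap, &regs);
        process_setAddressSpace(process_cap, MY_SPACE_CAP);
        process_resume(process_cap);      /* starts executing at entry */
        return 0;
    }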

However, I started to run into a bit of an ordering problem.  Typically, libpthread depends on libc.  In particular, it needs to be able to allocate memory for things, so it makes sense for it to use malloc.  However, malloc includes structures that should not be modified concurrently, so it typically makes use of locks.  For locks to be usable from libc, they would need to live in the Coyotos runtime (libc-coyotos), if not directly within libc, and a typical lock implementation needs a thread's identifier in order to record the owner of the lock.  I eventually resolved to have identification and locks within libc-coyotos, and to use these from libpthread.

Synchronisation

So, there are a few ways to implement locks in Coyotos.  One way would be to ask one process to arbitrate.  I don't like this solution, as it introduces an unnecessary bottleneck between cores as well as a strange space overhead.  Imagine if your options for locks required either allocating space for registers and a stack for each lock, or having an external process manually track which addresses are locks and who is waiting on them.

Instead, my gut said that we could implement an Abstract Queued Synchroniser.  This would be a lock-free queue that processes could add themselves to in order to be notified when the lock was released.  When a thread was in the process of releasing a lock, it would set the lock to be owned by the next thread in the queue, and then send it a notification to wake it up.  Many systems do something similar - I learned about this from Java, but GNU also does this.
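
A minimal sketch of such a queued lock in C11 atomics - the shape of the idea rather than Java's AbstractQueuedSynchronizer itself, with park() and wake_up() standing in for the wait primitives discussed below:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct qnode;
    extern void park(void);                /* block until notified */
    extern void wake_up(struct qnode *n);  /* notify the node's thread */

    struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool locked;
    };

    struct qlock { _Atomic(struct qnode *) tail; };

    void qlock_acquire(struct qlock *l, struct qnode *me) {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        struct qnode *prev = atomic_exchange(&l->tail, me);
        if (prev) {                         /* lock is held: join the queue */
            atomic_store(&prev->next, me);
            while (atomic_load(&me->locked))
                park();                     /* re-check after every wakeup */
        }
    }

    void qlock_release(struct qlock *l, struct qnode *me) {
        struct qnode *next = atomic_load(&me->next);
        if (!next) {
            struct qnode *expect = me;
            if (atomic_compare_exchange_strong(&l->tail, &expect, NULL))
                return;                     /* nobody waiting */
            while (!(next = atomic_load(&me->next)))
                ;                           /* a waiter is mid-enqueue */
        }
        atomic_store(&next->locked, false); /* hand ownership to next */
        wake_up(next);
    }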

The question is, how can a thread wait for a notification without a race condition?  Consider:  One thread is trying to enqueue while another is trying to release.

1. <thread 1>: add_to_wait_queue(q, self);
2. <thread 2>: next = next_waiting(q);
3. <thread 2>: wake_up(next);
4. <thread 1>: park(self);

There are three obvious ways we could implement park/wake_up without introducing another process.

The first is to use send/receive.  The idea is that park is implemented as a closed wait on an endpoint created for this purpose.  Waking a thread would involve sending a message via that endpoint.  But should that send be blocking or nonblocking?  If the send is blocking, we need to be aware that the thread may be in some sort of faulted condition.  Perhaps it is handling a signal that doesn't remove it from the wait.  Maybe the thread is in the process of cleanup.  But if the send is nonblocking, then it could be missed by the target thread: the wake_up at step 3 would not be received by the park at step 4.

The second technique is to use notifications.  Notifications have the advantage that they are level-triggered, so even if a process is not ready to receive a notification when it is posted, it will still see it the next time it checks.  A drawback to notifications is that there is a small, fixed number of them per process.  You can't tell which endpoint a notification came in on; you basically have to claim one for this purpose, or for any other purpose you might need to be asynchronously notified of.  They are a bit like IRQs in hardware.

Another problem with notifications is that in order to receive them, you need to be in an open wait.  This means you could receive any number of calls while waiting for the notification, and with each one you would need to store the arguments and reply cap somewhere and get back to waiting.  I added a new flag to the invocation control word to enable waiting for notifications during a closed wait, for just this purpose.
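
With that flag, park() and wake_up() come out looking roughly like this - IPW0_CLOSED_WAIT, IPW0_RECV_NOTIFY, and the helper functions are invented names for illustration, not the real Coyotos header definitions:

    /* park: a closed wait on our own endpoint id that also accepts
     * notifications; returns when notification 0 is delivered. */
    static void park(void) {
        inv_params_t inv = {0};
        inv.ipw0 = IPW0_CLOSED_WAIT | IPW0_RECV_NOTIFY;
        inv.epid = current_tcb()->park_epid;
        do_invocation(&inv);
    }

    /* wake_up: post level-triggered notification 0 at the target thread.
     * Delivery sticks even if the target has not reached its wait yet,
     * which is what closes the race in the interleaving above. */
    static void wake_up(struct qnode *n) {
        notify(qnode_tcb(n)->process_cap, /*notification=*/0);
    }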

The third option is to repeatedly yield() and check if the lock is owned by you (or by nobody).  I don't like the thought of a disk full of active objects, where each thread is waking up regularly just to check if there's mail.  It could be done, though.

There are more esoteric options - such as calling setAddressSpaceAndPC to remove a process from a wait - but I think it's clear at this point that this synchronous invocation model, which might work well for systems you build entirely yourself, does not apply well to larger systems.  I made the modification to IPW0 as described, to use notifications for this purpose for now.

I think I have more to say about Endpoints - they are strange.  Firstly, they don't have any real state besides the protected payload match; secondly, a thread in a closed wait does not wait on an endpoint but instead on an endpoint id.  What this means is that Coyotos IPC is effectively a CSP-style bidirectional synchronisation event.  There's no place in the endpoint to do any sort of handoff, and in fact a thread does not even need a capability to the endpoint it's waiting on.  Endpoints exist more-or-less purely as a way to do cheap revocation.

Atomics

I also want to say, just quickly, that the documentation on atomics (like cmpxchg) is super confusing.  The documentation for libatomic states that the deprecated old-style atomic functions are implemented using the new ones, but of the functions libatomic actually uses, none are present in our toolchain build.  I modified the kernel-level lock code to work with the threading model described above, and these seemed to work.
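
For reference, the new-style GCC builtins look like this; for word-sized objects they lower to an inline cmpxchg on x86 with no libatomic involvement:

    static long lock_word;

    static int try_lock(void) {
        long expected = 0;
        /* compare-and-swap 0 -> 1; acquire on success, relaxed on failure */
        return __atomic_compare_exchange_n(&lock_word, &expected, 1L,
                                           /*weak=*/0,
                                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
    }

    static void unlock(void) {
        __atomic_store_n(&lock_word, 0L, __ATOMIC_RELEASE);
    }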

TTY, IRQHelper

I took a quick digression into the TTY project in Coyotos.  Although it presents a very clean interface, much of the implementation is missing.  I started to implement the keyboard driver it would need, using gnumach and osdev as references.

Rather than flesh TTY out - especially since I really wanted to get to building a graphical UI - I realised it would be easier to cheat and port the HURD terminal by Brinkmann and Olivetti.  It requires a keyboard driver and the VGA driver (which is pretty trivial) but it does provide a fairly full-featured text-mode terminal, and uses xkb to support different layouts and options.

GNU Libc

It was around this time - frustrated by libpthread and the lock implementation, and thinking about just doing the HURD port so I could get something running - that I realised I should just switch to glibc, which has a much more sensible implementation of the AQS I'd written by hand.  It was always the intention to switch to glibc; in fact, there was a glibc release sitting in the original ccs-xenv repo from a decade ago.

I'd spent some time last year thumbing through glibc to figure out what I'd need to do to port it, and didn't come away too scared.  I started porting some of the mach functionality and made a list of functionality we would need to implement - mmap, madvise, usleep, wait, abort, kill, etc. - and implemented stubs that would call into a process supervisor.  We'll need a supervisor to support running unixy processes, especially to support fork.  I played a little with usleep and RTC support, though I need to check existing RTC code to find out how they translate into unix time (do I need to worry about leap seconds?).  I started playing with mmap.
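
The stubs are as thin as you'd expect - supervisor_call() and SUP_KILL are invented names for whatever the supervisor protocol ends up being:

    #include <signal.h>
    #include <sys/types.h>

    /* Forward a POSIX call to the process supervisor, which owns the
     * pid-to-process-cap mapping for unixy processes. */
    int kill(pid_t pid, int sig) {
        return supervisor_call(SUP_KILL, (long)pid, (long)sig);
    }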

MMAP

There were two pieces of code in the Coyotos repo that tinker with GPTs: the fault handler in ElfSpace, which populates a GPT from a backing space, and the functionality in libcoyimage:CoySpace for inserting special pages into an elf image.  I have to admit that Guarded Page Tables somewhat scare me.

The basic idea would be to have an abstract process map, of the sort you can see with the `pmap` utility, that holds mapped file objects; these could then be used to populate pages as they are accessed by the application.

There are a few different categories of mmap call.  The first is anonymous mappings, which don't have a file at all; these are to be populated with empty pages from the spacebank, so that case is relatively easy.  For files, however, it's not clear how much we trust the filesystem with our pages.  We want only to assume that the filesystem will reply to requests on each file in a timely manner, and we'll guarantee this at a different level in future.
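
The top-level dispatch, in sketch form - vm_region, region_reserve(), and the populate callbacks are invented names for pieces of the abstract process map above:

    #include <errno.h>
    #include <sys/mman.h>

    void *mmap(void *addr, size_t len, int prot, int flags,
               int fd, off_t off) {
        struct vm_region *r = region_reserve(addr, len, prot, flags);
        if (!r) { errno = ENOMEM; return MAP_FAILED; }

        if (flags & MAP_ANONYMOUS) {
            /* easy case: lazily fill with zeroed spacebank pages */
            r->populate = populate_from_spacebank;
        } else {
            /* file-backed: who owns the GPTs depends on the sharing mode */
            r->file = file_handle_for_fd(fd);
            r->file_offset = off;
            r->populate = (flags & MAP_SHARED) ? populate_shared
                                               : populate_private;
        }
        return r->start;
    }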

The second category is MAP_PRIVATE, which effectively copies the file into the address space.  The file does not get to know about writes into the local copy.  One way to implement this is to allocate a GPT and populate it with writeable pages, but then mark the cap to the GPT as opaque before handing it to the file to populate.  Once the file has populated the pages, the process can move these into a different GPT.  It can keep the original GPT around for further requests to this file.  Also, since the file may be deleted or modified before it is completely paged in, the file handle is given a spacebank to use for keeping the original copy of the data for mapping into this process.

There is also MAP_SHARED, which allows the process to see writes to the file as they happen, and allows the process to write to the file.  This is implemented by providing the file with a spacebank which it can use to allocate pages and GPTs.  Effectively, the file handle "owns" the first level map, and it marks the GPT caps returned to the process as opaque.  This allows it to detect writes, by marking the page as read-only when it is flushed to storage, and then marking it as writeable when the process indicates a fault in the mapped region.

So the interface for mapped files requires the ability to request populating pages, to request superpages (single-level GPTs) for MAP_SHARED, to flush shared mappings to disk, and to handle memory access faults, at a minimum.
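
Rendered as C prototypes, that interface might look like the following; every name here is invented for illustration:

    typedef struct mapped_file mapped_file_t;

    /* populate pages for [off, off+len) into the opaque GPT we handed over */
    int mf_populate(mapped_file_t *f, uint64_t off, size_t len);

    /* hand back a single-level GPT (superpage) for a MAP_SHARED region */
    int mf_superpage(mapped_file_t *f, uint64_t off, cap_t *gpt_out);

    /* flush dirty shared pages to storage, marking them read-only */
    int mf_flush(mapped_file_t *f, uint64_t off, size_t len);

    /* handle an access fault, e.g. re-enable writes after a flush */
    int mf_fault(mapped_file_t *f, uint64_t addr, int access);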

Before implementing any of these mappable files, I decided to start with functions for working with GPTs.  For example, performing a scan over the address space in a way that enables filling it out with freshly allocated GPTs.
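
For example, the descend-and-allocate walk might look like this.  It assumes 16 slots per GPT and ignores guards entirely, and all the helper names are invented:

    #define GPT_SLOT_BITS 4   /* 16 slots per GPT */

    /* Walk from the root toward an address, allocating fresh GPTs for
     * empty interior slots; returns the GPT that maps at leaf granularity.
     * l2v here is the log2 of the span of each slot in the current GPT. */
    cap_t gpt_path_ensure(cap_t root, uintptr_t addr,
                          int top_l2v, int leaf_l2v) {
        cap_t gpt = root;
        for (int l2v = top_l2v; l2v > leaf_l2v; l2v -= GPT_SLOT_BITS) {
            unsigned slot = (addr >> l2v) & ((1u << GPT_SLOT_BITS) - 1);
            cap_t child = gpt_get_slot(gpt, slot);
            if (cap_is_null(child)) {
                child = spacebank_alloc_gpt(MY_BANK, l2v - GPT_SLOT_BITS);
                gpt_set_slot(gpt, slot, child);
            }
            gpt = child;
        }
        return gpt;
    }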

Ummmm

This is approximately where I'm at.  I'd spent some time researching last year, thumbing through all sorts of things, and discovered that most of what I'd written about porting the HURD back in 2020 was wrong.  The main points I learned:

- There's no need to port MIG.  There are few enough interfaces between Mach and HURD that rewriting them directly in CapIDL was trivial.  The trick then is to replace the includes and translate between coyotos-style function names and MIG-style function names.

- There are no asynchronous message sends in GNU.  The closest thing is that there is a 6-deep message buffer in pfinet, and realistically I will avoid porting pfinet anyway as it contains GPLv2-Only code.

- Replacing mach_send is mostly easy.  Most of the time this is done in order to specify a timeout.  This case can be easily supported by a process supervisor, and I needed to stub a lot of calls in order to provide guaranteed delivery while supporting signals anyway.

- On the other hand, reference counting does not carry across so easily.  If all of the capabilities that unixy processes interact with are "system capabilities" - that is, capabilities whose use the supervisor tracks - then safe destruction can be performed, but this requires the supervisor to act as a membrane for them.

- Sharing a reply port only happens in two places, one is to support reference counting, and the other is in authorisation code.  Since we already have a fix for reference counting, and it's not clear we even need process-internal authorisation in a capability system, I'm ok with this.

By the way, why am I porting parts of the HURD?  It's because I think that emulation is a woeful user experience.  I want programs I'm using right now to run in my operating system, with all of the magic of transactional behaviour, orthogonal persistence, and code migration available to every running program.

Oh - I also looked into porting the Direct Rendering Manager, especially since dri3 appears to be a capability system - and that's a much bigger job.  Some of the code can be translated mechanically, and it looks like it is easy to separate out the GPLv2-Only parts, but it only makes sense to do once we can run posix programs without fuss.

And device-initiated DMA for PCI devices?  I'll come back to that, too.  I guess that since existing drivers select for themselves if they will use scatterlists or bounce buffers, we might need to let them choose, at least for now.

No specific thoughts about building the Guix library for Coyotos yet, but noting that here as it's still on my list.

--
William ML Leslie

Alan Karp

Dec 10, 2022, 7:57:27 PM
to cap-...@googlegroups.com
Thanks for sending this.  It made fun reading for a rainy afternoon.  I have a couple of questions.

Is the Secretary what Unix folks call the user agent?  Or maybe it's the user agent + power box?

You said you decided to use notifications, but you listed problems with that choice.  Did you resolve them?

--------------
Alan Karp



William ML Leslie

Dec 14, 2022, 3:34:17 AM
to cap-...@googlegroups.com
On Sun, 11 Dec 2022 at 10:57, Alan Karp <alan...@gmail.com> wrote:
Thanks for sending this.  It made fun reading for a rainy afternoon.  I have a couple of questions.

Is the Secretary what Unix folks call the user agent?  Or maybe it's the user agent + power box?


To drill down from the top, there's the concept of the user shell, which embodies all of the actions a user can take.  It presents capabilities to the user, interprets user intent, and handles log-in to the system.  It allows you to run programs and interact with them.

For the most part, the user shell isn't specific to any user.  I mean, I hope that people feel comfortable writing and sharing their own shells, in the same way that there are so many launchers on Android or Window Managers on X11.  But just like launchers and window managers, the shell isn't usually responsible for tracking which capabilities are available to the person interacting with the machine.

To track what is available to a user, a Unix system will have the shell interact with the file system and the process tree.  The file system knows about which static resources are available to a user, and the process tree knows about the user's running programs.  Likewise, the secretary allows the shell to access the objects that the user can access, and when the user creates new objects (for example, runs a new program), the secretary can keep track of them.

A powerbox may have access to the secretary, allowing the user to select from their capabilities.

I think the secretary may also serve as a window into protected system functions, and may hold capabilities that aren't enumerable, supporting rights amplification.  I don't know if KeyKOS supported anything along those lines.

I'm not sure if there was more to it in KeyKOS, but that is how I've understood it.
 
You said you decided to use notifications, but you listed problems with that choice.  Did you resolve them?


Somewhat.  I added a flag to the invocation control word so that you could indicate that you wanted to receive notifications, independent of whether you were in an open or closed wait.  Programs using pthread then treat notification 0 as meaning that the thread may be taking ownership of a lock.  The end result is not racy and does not busy-wait.

While that works for implementing locks, it only works because no data and no capabilities are part of the message.  It still seems odd that if you want to reliably receive a nonempty message on Coyotos, you need to allocate a process and ensure that process is waiting for the message and doing nothing else.  It seems it should be possible to make the only requirement for receiving a message be having some appropriately-typed space to hold it.

 