Apologies for the confusion around atelet/ateom topics this week


Tim Hockin

Jun 5, 2026, 8:14:49 PM
to ate...@googlegroups.com
I admit this was not handled as transparently as I would have liked; that's on me.

Here's the summary:

In the past week, we have been exploring a different potential model for how Agent Substrate layers onto Kubernetes, in particular how actors and workers are mapped to pods and nodes.  

Over the last few years, as workloads have gotten “less traditional”, we have seen roughly 3 patterns for how workloads use Kubernetes.

1. Machine & macro-workload management
In this model Kubernetes is used mainly as a way to provision and manage machines, and to set up the “outer shell” of workloads.  A single workload consumes all (or most) of the machine in a way that is essentially invisible to Kubernetes.  The workload carves off a chunk of capacity (CPU, RAM, accelerators) and divides it up internally.

2. Workload management
In this model Kubernetes is additionally responsible for creating and managing workloads (Pods), including some notions of fine-grained resource provisioning, plus storage, networking, identity etc.  Each workload uses the resources assigned to it, and while it may further subdivide internally, it’s logically a single workload. 

3. Application management
In this model Kubernetes is involved more deeply in the application itself.  Workloads themselves use Kubernetes in explicit ways. 

The current design of Agent Substrate uses model #2.  We rely on Kubernetes to create and manage our “worker” pods, but inside those worker pods we are responsible for everything.  We made a choice to make each worker single-tenant, but we load tenants in and out over time.  Kubernetes is not involved in the hot path, though it is involved asynchronously (e.g. log collection, workload lifecycle).  This model is relatively simple to explain and manage, and makes good use of what Kubernetes already offers.

So what was the confusion about?

DRAM is at a premium right now and will be for the foreseeable future.  One of Agent Substrate’s primary objectives is to pack more actors onto limited resources.  In the current model, removing and replacing an actor on a worker means:
* take a snapshot
* dump it to disk (local or networked)
* load a different snapshot

While DRAM is scarce, network bandwidth is too, and SSD bandwidth is not far behind.  The act of taking and saving a snapshot has a real cost on those axes.  Worse, taking a snapshot is a linear-time operation.  Larger footprints require more IO to save.

The hypothesis is that saving SOME of a running agent is better than saving ALL of that agent in terms of IO.  That would suggest that using literal swap might be more effective than taking full snapshots (since the kernel has better knowledge of how much needs to be swapped, and because the kernel can evict clean pages before swapping).  So we dreamed up a model which maps more closely to engagement model #1 and set out to see if we could prove it.
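To make the hypothesis concrete, here is a rough back-of-the-envelope sketch of the write IO for each approach.  Everything in it is an assumption for illustration: the 4 KiB page size, the `ActorMemory` fields, and the simplification that a full snapshot writes every resident page while swap writes only the pages the kernel chooses to evict after dropping clean file-backed pages for free.  It is not how the real measurements were taken.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # bytes; assumed page size


@dataclass
class ActorMemory:
    """Hypothetical description of one actor's resident memory."""
    total_pages: int         # all resident pages of the actor
    clean_file_pages: int    # clean, file-backed pages that can be dropped without a write
    evicted_fraction: float  # share of the remaining pages the kernel ends up swapping out


def snapshot_write_bytes(mem: ActorMemory) -> int:
    # Assumed model: a full snapshot writes every resident page,
    # so IO grows linearly with the footprint.
    return mem.total_pages * PAGE_SIZE


def swap_write_bytes(mem: ActorMemory) -> int:
    # Assumed model: clean file-backed pages cost no write IO;
    # only the evicted share of the remaining pages is written to swap.
    swappable = mem.total_pages - mem.clean_file_pages
    return int(swappable * mem.evicted_fraction) * PAGE_SIZE


if __name__ == "__main__":
    lightly_contended = ActorMemory(total_pages=1000, clean_file_pages=200, evicted_fraction=0.5)
    heavily_contended = ActorMemory(total_pages=1000, clean_file_pages=0, evicted_fraction=1.0)
    for name, mem in [("light", lightly_contended), ("heavy", heavily_contended)]:
        print(name, snapshot_write_bytes(mem), swap_write_bytes(mem))
```

In this toy model the advantage of swap disappears as `evicted_fraction` approaches 1.0 and there are no clean pages to drop, at which point swap writes as much as a full snapshot.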

The idea was that each Kubernetes node would get a single “mega-atelet” which owns all of the actors of a given flavor (gVisor or microVM) on a given node.  The substrate scheduler would send actor assignments to the mega-atelet, which would manage everything internally.  The kernel could swap processes in and out as needed.  The idea of “workers” would probably not be needed at all, and actors would be invisible to Kubernetes.

You can see how that is a departure from the current design.

Some counter-arguments:
* If a machine is so contended that it needs to swap one actor out in order to load another, there’s a high likelihood that it needs to swap out more than just “some”.
* If we run low on swap space, we will need to actually evict actors, which would involve paging them back into RAM in order to serialize them to a checkpoint.
* Managing many actors is a much more complicated problem than managing one.  We would need to rebuild a significant fraction of Kubelet.

So this week we built a few model applications with different footprints and different page-dirtying behaviors, and measured time to checkpoint them and then restore them.
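For anyone who wants to poke at this themselves, a model application of this kind can be very small.  The sketch below is not the code we ran; the function names, the 4 KiB page size, and the "write one byte per page" approach are all assumptions chosen to show the two knobs (footprint and fraction of pages dirtied).

```python
import random

PAGE_SIZE = 4096  # bytes; assumed page size


def make_footprint(num_pages: int) -> bytearray:
    """Allocate num_pages worth of memory and write one byte per page
    so that each page is actually touched, not just reserved."""
    buf = bytearray(num_pages * PAGE_SIZE)
    for page in range(num_pages):
        buf[page * PAGE_SIZE] = 1
    return buf


def dirty_pages(buf: bytearray, fraction: float, seed: int = 0) -> int:
    """Rewrite the first byte of a random subset of pages.
    Returns the number of pages dirtied in this round."""
    num_pages = len(buf) // PAGE_SIZE
    rng = random.Random(seed)  # seeded so runs are repeatable
    chosen = rng.sample(range(num_pages), int(num_pages * fraction))
    for page in chosen:
        offset = page * PAGE_SIZE
        buf[offset] = (buf[offset] + 1) % 256
    return len(chosen)


if __name__ == "__main__":
    buf = make_footprint(num_pages=25_600)  # roughly 100 MiB
    touched = dirty_pages(buf, fraction=0.1)
    print(f"footprint={len(buf)} bytes, dirtied {touched} pages")
```

Varying `num_pages` and `fraction` gives a grid of footprints and page-dirtying behaviors to checkpoint and restore with whatever tooling is under test.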

Results:

We see some improvement in time-to-first-response after resume with swap vs. snapshots (on the order of 20-30%) for bare-metal, but it's much worse with nested virtualization.

These tests are imperfect, but we REALLY wanted to wrap this up by the end of the week.  So based on these results I think we should stay with the current model until we have more compelling evidence that an alternative is better enough to justify the additional complexity.  I am fairly confident that we can retrofit the proposed model if we need to, probably as a 2nd operational mode rather than as a replacement for the existing one.  If we get to that point, then we will have more compelling evidence and a better understanding of which shapes of workloads benefit from it.

Again, I apologize for the opacity -- I mostly didn't want to waste people's time chasing something that may not exist, but I understand I may have wasted some of your time inadvertently anyway.

Full steam ahead!

Tim