Praise of IBM's capability-based OS/400


Alan Karp

Apr 5, 2026, 1:09:17 PM
to <friam@googlegroups.com>, cap-...@googlegroups.com

Raoul Duke

Apr 5, 2026, 2:59:32 PM
to cap-...@googlegroups.com, <friam@googlegroups.com>

> Oh, absolutely — IBM i’s single‑level store is elegant, but it has produced some spectacular performance horror stories over the decades. The model guarantees correctness, but it cannot guarantee that correctness will be fast. And when the abstraction leaks, it leaks in dramatic, unforgettable ways.

Let me walk you through the real categories of pain — the ones IBM engineers quietly warn each other about.


🔥 1. The “Cold Object From Hell” Scenario

A pointer dereference triggers a tape mount.

This is the canonical horror story.

A developer writes code like:

read customer;

Or even just:

customer->address;

And unbeknownst to them:

  • The customer file object hasn’t been touched in months
  • Its pages were migrated to deep archive
  • The archive tier is literally tape

So the dereference triggers:

  • A page fault
  • A storage hierarchy walk
  • A tape mount
  • A multi‑minute stall

The program doesn’t crash.
It doesn’t error.
It just… waits.

From the developer’s perspective, it looks like the system froze.

From the OS’s perspective, everything is working perfectly.

This is the dark side of “location‑independent correctness.”
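The scenario above can be rendered as a toy model (plain Python, nothing to do with actual IBM i internals; the tier latencies are invented and scaled down from real minutes-long tape mounts):

```python
import time

# Hypothetical latencies per storage tier, in seconds. A real tape
# recall takes minutes; these are scaled down so the sketch runs fast.
TIER_LATENCY = {"ram": 0.0, "disk": 0.01, "tape": 0.5}

class StoredObject:
    """Toy single-level-store object: always addressable, but its
    pages may currently live on any tier of the storage hierarchy."""
    def __init__(self, name, tier="ram"):
        self.name = name
        self.tier = tier

    def deref(self):
        # The dereference never fails and never errors -- it simply
        # blocks until the pages are resident, however long that takes.
        time.sleep(TIER_LATENCY[self.tier])
        self.tier = "ram"  # pages are now in memory
        return f"data of {self.name}"

customer = StoredObject("customer", tier="tape")
start = time.monotonic()
data = customer.deref()  # looks instant in the source code...
print(f"stalled {time.monotonic() - start:.2f}s")  # ...but pays the tape mount
```

The point of the sketch is that nothing in the calling code distinguishes the fast path from the slow one; the stall is invisible at the language level, exactly as in the RPG example above.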


🐌 2. The “Everything Is a Pointer” Cascading Stall

One cold object triggers dozens of cold objects.

Because objects can reference other objects:

  • A program object references a service program
  • That service program references a user space
  • That user space references a database file
  • That file references a member
  • That member references an access path

If any of these are cold, the dereference chain becomes a multi‑stage page‑in cascade.

Developers have seen:

  • 30‑second delays
  • 2‑minute delays
  • 10‑minute delays (rare but real)

All from a single innocent pointer dereference.
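The cascade can be sketched the same way, assuming one toy stall per cold object (again, a real recall is minutes per object, not 50 ms):

```python
import time

COLD_LATENCY = 0.05  # toy page-in stall per cold object, seconds

class Obj:
    """Toy object that may be cold and may reference another object."""
    def __init__(self, name, cold, ref=None):
        self.name, self.cold, self.ref = name, cold, ref

    def deref(self):
        if self.cold:
            time.sleep(COLD_LATENCY)  # page-in stall for this object
            self.cold = False
        return self.ref

# program -> service program -> user space -> file -> member -> access path
chain = None
for name in ["access path", "member", "file",
             "user space", "service program", "program"]:
    chain = Obj(name, cold=True, ref=chain)

start = time.monotonic()
node = chain
while node is not None:        # one "innocent" pointer walk...
    node = node.deref()
elapsed = time.monotonic() - start  # ...pays the stall six times over
print(f"walked 6 cold objects in {elapsed:.2f}s")
```

Each hop is individually cheap-looking; the total stall is the sum of every cold link in the chain, which is why the delays compound so badly in practice.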


🧊 3. The “Dormant Library” Disaster

A library full of objects goes cold, then a job touches one.

Libraries can contain:

  • Programs
  • Service programs
  • Data areas
  • User spaces
  • Database files
  • Access paths

If the entire library is cold, the first job that touches it triggers a storm of page‑ins.

This can:

  • Stall the job
  • Stall the subsystem
  • Hammer the disk
  • Cause a system‑wide performance dip

It’s not a crash — it’s just the model doing what it promised.


🧱 4. The “Huge Object, Tiny Access” Problem

Touching one byte forces the system to restore megabytes or gigabytes.

Objects are coarse‑grained.

If you have:

  • A 4 GB user space
  • A 20 GB database file
  • A massive access path

…and it’s cold…

Touching any byte forces the system to restore the relevant pages.

Developers have seen:

  • 4 GB page‑ins
  • 20 GB page‑ins
  • Multi‑minute stalls

All because a single pointer dereference hit a cold region.
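A back-of-envelope sketch of the cost, assuming the whole region really must come back (as with an access-path rebuild or a full scan; a plain memory touch only faults in pages on demand). The page size and throughput figures are illustrative assumptions, not measured IBM i numbers:

```python
# Rough cost of bringing a cold region back before the job can proceed.
PAGE_SIZE = 4096                    # bytes per page (assumed)
DISK_BYTES_PER_SEC = 100 * 2**20    # assumed restore throughput: 100 MB/s

def restore_seconds(cold_bytes):
    """Time to page in `cold_bytes` of cold storage at the assumed rate."""
    pages = -(-cold_bytes // PAGE_SIZE)   # ceiling division
    return pages * PAGE_SIZE / DISK_BYTES_PER_SEC

for size in [4 * 2**30, 20 * 2**30]:      # the 4 GB and 20 GB cases above
    print(f"{size / 2**30:.0f} GB cold -> ~{restore_seconds(size):.0f}s")
```

Even at a healthy sequential rate, the 4 GB case is tens of seconds and the 20 GB case is minutes, which matches the multi-minute stalls described above.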


🧮 5. The “Object Table Bloat” Slowdown

Millions of objects → metadata pressure → slower pointer validation.

The object table is compact, but not free.

Large installations with:

  • millions of objects
  • thousands of libraries
  • tens of thousands of database files

…can hit a point where:

  • object table lookups slow down
  • pointer validation takes longer
  • LIC metadata walks become more expensive

This doesn’t break correctness, but it can degrade performance in subtle ways.


🧟 6. The “Zombie Object” Problem

Deleted objects whose pages still exist on disk.

When an object is deleted:

  • its metadata is removed
  • its pages are marked free

But the physical pages may linger until reused.

If the system is under storage pressure, the cleanup can cause:

  • spikes in disk I/O
  • long reclamation cycles
  • temporary stalls

Again: correctness is preserved, but performance can suffer.


🧠 Why these horror stories exist

Because IBM i’s model is built on two uncompromising principles:

  1. Pointers never break
  2. Objects never disappear unless explicitly deleted

This means:

  • The system must honor every pointer
  • Even if the data is on tape
  • Even if the object is decades old
  • Even if the object is enormous
  • Even if the object graph is deep

Correctness is absolute.
Performance is best‑effort.


🧭 The philosophical tradeoff

IBM i chooses:

Predictable correctness over predictable latency.

Unix chooses:

Predictable latency over predictable correctness.

That’s why IBM i can do things Unix cannot — but also why it can surprise developers in ways Unix never would.


Alan Karp

Apr 5, 2026, 3:25:55 PM
to fr...@googlegroups.com, cap-...@googlegroups.com
In the 1980s my wife worked on the HSM (hierarchical storage management) for MVS, which was similar in many ways, and I recall her talking about some of these issues.  I have no idea what they did about them, but her code was in production for many years.   Perhaps the problems weren't as significant when virtual addresses were 24 bits.

--------------
Alan Karp


--
You received this message because you are subscribed to the Google Groups "friam" group.
To unsubscribe from this group and stop receiving emails from it, send an email to friam+un...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/friam/CAJ7XQb7KazZ3uwJsNStq-Hg5BF5hSkbnfNqzAZb%3DzJEEVzsBiw%40mail.gmail.com.

William ML Leslie

Apr 5, 2026, 8:31:47 PM
to fr...@googlegroups.com
[removing cap-talk as I want to make a more casual point]

If you're up for good vibes about some unique operating systems, I've kept this video stashed in my Watch Later for years after I first watched it.  It covers Genera, Medley, TRON and IBM i.  It really changed my impression of Medley.


--
William ML Leslie
A tool for making incorrect guesses and generating large volumes of plausible-looking nonsense.  Who is this very useful tool for?

William ML Leslie

Apr 5, 2026, 8:53:34 PM
to fr...@googlegroups.com, cap-...@googlegroups.com
On Mon, 6 Apr 2026 at 04:59, Raoul Duke <rao...@gmail.com> wrote:

> Oh, absolutely — IBM i’s single‑level store is elegant, but it has produced some spectacular performance horror stories over the decades. The model guarantees correctness, but it cannot guarantee that correctness will be fast. And when the abstraction leaks, it leaks in dramatic, unforgettable ways.

Once Shap is done with the Book and I am done with async, we'll probably resume the loop where I suggest ideas for addressing swapping pathologies and Shap tells me why these are bad ideas.  At the least, I find that one entertaining.  You're more than welcome to join in :)
 

Alan Karp

Apr 6, 2026, 7:06:37 PM
to fr...@googlegroups.com
Thanks for the video.  I remember the Lisp Machine and the IBM AS/400.  I didn't know that a version of the latter is still in production as IBM i and has a reasonably large install base.

--------------
Alan Karp



Mark S. Miller

Apr 7, 2026, 6:05:17 PM
to fr...@googlegroups.com
E-like Communicating Event Loop systems cope with latency without endangering correctness by distinguishing between
1) local and remote, where local is much faster than remote;
2) immediate vs. eventual references, where remote references must be eventual;
3) immediate calls vs. eventual sends, where eventual references cannot carry immediate calls;
4) eventual sends, which return promises for their results, themselves eventual references;
5) promise pipelining;
6) eventual errors, which, like NaNs, use data contagion rather than control-flow contagion to properly poison dependent computation without disrupting pipelining.
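A toy asyncio rendering of points 2 through 6 (this is not E, Waterken, or Endo; `eventual_send`, `Broken`, and `Counter` are made-up names for illustration only):

```python
import asyncio

class Broken:
    """Stand-in for a broken promise: an error carried as data."""
    def __init__(self, problem):
        self.problem = problem

def eventual_send(target_fut, method, *args):
    """Eventual send: queue `method` to run once the target resolves,
    and return a new promise immediately (promise pipelining)."""
    result = asyncio.get_running_loop().create_future()
    def deliver(fut):
        t = fut.result()
        if isinstance(t, Broken):
            result.set_result(t)  # data contagion: pass the brokenness on
        else:
            result.set_result(getattr(t, method)(*args))
    target_fut.add_done_callback(deliver)
    return result

class Counter:
    def __init__(self, n):
        self.n = n
    def add(self, k):
        return Counter(self.n + k)
    def value(self):
        return self.n

async def main():
    loop = asyncio.get_running_loop()
    remote = loop.create_future()
    p1 = eventual_send(remote, "add", 5)  # pipelined before remote resolves
    p2 = eventual_send(p1, "value")
    remote.set_result(Counter(10))
    bad = loop.create_future()
    q1 = eventual_send(bad, "add", 1)     # this pipeline will be poisoned
    q2 = eventual_send(q1, "value")
    bad.set_result(Broken("partition"))
    return await p2, await q2

value, err = asyncio.run(main())
print(value, getattr(err, "problem", None))
```

The first pipeline resolves to 15 even though both sends were queued before the target existed; the second shows the NaN-like behavior, with the `Broken` value flowing through both sends instead of raising at either send site.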

Historical note: NaNs were invented by Konrad Zuse for the Z3 relay machine for exactly this purpose.

In any case, E itself overloaded the local vs remote distinction for failure atomicity. This endangers correctness in ways the AS/400 may not have found acceptable. But that bundling is unnecessary. Waterken and Agoric's use of Endo do all 6 of the above without recognizing partitions or crashes.


Alan Karp

Apr 7, 2026, 7:33:45 PM
to fr...@googlegroups.com
At least some of the problems Raoul listed can happen on immediate calls.  A local object that hasn't been used in a long time might reside only on an unmounted tape, resulting in a latency of many minutes.  To avoid that, you might have to give up the optimization of only touching modified objects at the end of a turn.

--------------
Alan Karp


Mark S. Miller

Apr 7, 2026, 9:01:18 PM
to fr...@googlegroups.com
On Tue, Apr 7, 2026 at 4:33 PM Alan Karp <alan...@gmail.com> wrote:
At least some of the problems Raoul listed can happen on immediate calls.  A local object that hasn't been used in a long time might reside only on an unmounted tape, resulting in a latency of many minutes. 

I understand you so far.
 
To avoid that, you might have to give up the optimization of only touching modified objects at the end of a turn.

I don't get it.
 

Alan Karp

Apr 7, 2026, 9:45:48 PM
to fr...@googlegroups.com
Waterken did incremental checkpoints at the end of a turn.  My understanding of AS/400 is that untouched objects were eligible to be migrated.  If you touch all the objects in the vat by including them in the backup, then none of them will be migrated to backing store.

--------------
Alan Karp


Mark S. Miller

Apr 7, 2026, 10:19:53 PM
to fr...@googlegroups.com
If you've got enough memory to not migrate, you wouldn't migrate them anyway. Didn't the "working set" theory already cover all this well?





--
  Cheers,
  --MarkM

Alan Karp

Apr 7, 2026, 11:54:11 PM
to fr...@googlegroups.com
Right.  You only migrate when you're running out of memory.  When you do, you migrate the objects that have gone the longest without being touched.  My understanding is that decision is independent of the object's relationship with other objects.  Touching all the objects referenced by a touched object might prevent surprisingly long latency.
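That policy can be sketched as follows (all names hypothetical; on a real IBM i the migration decision lives in the OS, not user code): touching an object also refreshes the LRU clock of everything it references, so no member of a live graph ages out alone.

```python
class Store:
    """Toy LRU migration model with transitive touching."""
    def __init__(self):
        self.clock = 0
        self.last_touched = {}   # object name -> logical time of last touch
        self.refs = {}           # object name -> names it references

    def touch(self, name, transitive=True):
        self.clock += 1
        stack, seen = [name], set()
        while stack:
            obj = stack.pop()
            if obj in seen:
                continue
            seen.add(obj)
            self.last_touched[obj] = self.clock
            if transitive:
                stack.extend(self.refs.get(obj, []))

    def migration_candidates(self, horizon):
        """Objects untouched for more than `horizon` ticks may go to tape."""
        return {o for o, t in self.last_touched.items()
                if self.clock - t > horizon}

s = Store()
s.refs = {"order": ["customer"], "customer": ["address"]}
for name in ["order", "customer", "address"]:
    s.touch(name, transitive=False)   # initial load

for _ in range(10):
    s.touch("order")                  # only the root is used directly...

# ...yet transitive touching kept "customer" and "address" warm too:
print(s.migration_candidates(horizon=5))  # -> set()
```

With `transitive=False` on the repeated touches, "address" would age out and its next dereference would pay the recall; the transitive version trades a little extra bookkeeping per touch for that surprise never happening.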

--------------
Alan Karp


Mark S. Miller

Apr 8, 2026, 4:06:22 PM
to fr...@googlegroups.com
Ok, I see what you mean now. In a multi-process system in which tiny processes are only coupled to each other asynchronously, such as vats or erlang processes, if we imagine an entire process as a unit to be paged in or out together, then we maximize the amount of not-actually-dependent computation that can proceed while we're waiting for a tiny process to be paged back in. Doing this for vats with promise pipelining dramatically accelerates that advantage.

Now that main memory is so much larger than the typical vat, the typical vat is already tiny by these criteria.

The shift from the working set theory I know is that we're not only trying to predict when a page will next be needed, but predicting how much useful computation can still happen while we're waiting for that page + organizing computation to maximize that in a way that helps the prediction.



