Bits and Bobs 5/4/26

Alex Komoroske

May 4, 2026

I just published my weekly reflections: https://docs.google.com/document/d/1xRiCqpy3LMAgEsHdX-IA23j6nUISdT5nAJmtKbk9wNA/edit?tab=t.0#heading=h.mcxtzlmxbm4u

The distorting effects of heavy subsidies. Fracking our attention. Inhuman agent speed. The anxiety of leverage. AI helping with starting vs finishing. Eternal September for Patch Tuesday. Goblins vs stewards. Finishing an infinite feed. Zeno's paradox of infinite software. Cheap code, expensive software. The verifiability horizon.

----

  • Massively subsidized tokens have distorted AI demand.

    • When there's a massive subsidy there's no need to be efficient with tokens.

    • Two places subsidies are common today: employers footing the bill, and flat-rate plans that don't meter by the token.

    • An insightful comment on HackerNews:

      • "I see highly trained engineers spend hundreds of thousands of tokens doing what can reliably be accomplished with 150 lines of python.

      • I think the push from management for us to use AI has made it so we don’t have to be efficient with our consumption, so now we write md files which we feed to Claude in a loop instead of python and bash scripts to do routine tasks."

    • LLMs are insanely powerful and able to handle surprise in a way mechanistic software never could.

    • But they are ludicrous overkill for most tasks.

      • If you have a Max plan and rarely hit the quota, then there’s no reason not to use Extra High Thinking even on simple tasks.

    • Subsidies must tighten.

      • First, we were in “burn as many tokens as we can” mode.

        • Explore.

      • Second, we’ll get in “maximize the bang for tokens you burn” mode.

        • Exploit.

    • As subsidies tighten, there will be much stronger selection pressure for efficient and effective use of tokens.

      • If you use all of your budgeted tokens and hit the limit, then you are incentivized to figure out higher-leverage uses.

  • Outcome-based pricing is like declarative vs imperative programming.

    • Describe what you want to happen, vs how you want it to happen.

    • Outcome-based pricing will naturally tend towards verticals, where there can be insight into the full loop.
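
    • A minimal sketch of the declarative vs imperative distinction, in Python rather than pricing terms; illustrative only:

```python
# Imperative: spell out *how*: loop, accumulator, branch.
def total_even_imperative(xs):
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x
    return total

# Declarative: state *what*: "the sum of the even elements."
def total_even_declarative(xs):
    return sum(x for x in xs if x % 2 == 0)

assert total_even_imperative([1, 2, 3, 4]) == total_even_declarative([1, 2, 3, 4]) == 6
```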

  • Technology promised to save cognitive labor, but it made so much more of it.

    • The anxiety you feel comes down to your opportunity cost.

      • All of the high-value things you could be doing but aren't.

    • Technology gives you leverage, so it increases your opportunity… and also your anxiety.

    • Mechanistic software doesn’t have enough context or ability to handle surprise to actually handle cognitive labor for you.

  • Burning LLM tokens can give you an edge.

    • Even if your burn is wasteful and inefficient.

    • A terrible equilibrium point: everyone is burning extreme amounts of tokens to get minuscule, temporary edges over others.

    • That would leave us back where we are today… but just way, way more expensive.

    • If you opt out of the game, you'll get steamrolled.

  • Sam Schillace points out that AI makes starting easier, but not yet finishing.

    • In a world of AI, what's precious is not starting but finishing.

    • Before, starting was hard, so more things that were started were worth finishing.

    • The curve of effort is logarithmic.

      • When you look 80% of the way done, you're actually only 20% of the way done.

    • AI can also help with the latter part of that curve, but it needs to be babysat and it’s just as boring and meticulous as before.

  • One of the reasons the Ralph loop works is it doesn’t let the LLM drift.

    • It continually resets it back to a known state.

    • Recentering it so it can’t drift too far.

    • Incoherencies compound within LLM-produced output.
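
    • A sketch of that shape, assuming hypothetical run_agent and task_complete helpers; the point is that every iteration starts from the same fixed prompt with a fresh context:

```python
# Ralph-style loop (sketch): re-run the agent from a fixed prompt each
# iteration instead of continuing one long, drifting conversation.
# run_agent and task_complete are hypothetical stand-ins.

with open("PROMPT.md") as f:
    FIXED_PROMPT = f.read()  # the known state we keep resetting to

while True:
    # Fresh context every iteration: no conversation history carried over,
    # so incoherencies can't compound across turns.
    result = run_agent(prompt=FIXED_PROMPT, history=[])
    if task_complete(result):
        break
```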

  • The fundamental realities of KVCaches dominate what kinds of UXes are viable.

    • If your session is still in the KVCache, it’s trivial to serve: just stream out the new tokens.

    • If your session has to be recreated, then the provider must re-run prefill over the whole context.

      • What counts as a session is “exact prefix match.”

      • That means that multiple people in the same workflow could share the same prefix.

    • LLM providers keep your session warm in the cache while waiting for your next request.

    • LLM providers have been dropping this from 60 minutes to closer to 5 minutes to get more efficiency.

    • If you want to cost the model provider a ton, send a new question just before each ~5-minute window expires, so you stay permanently in the cache.
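
    • A toy model of the exact-prefix-match behavior; real serving stacks cache at block granularity and evict under memory pressure, but this shows why a warm prefix is cheap to serve and a cold one means full prefill:

```python
kv_cache: set[tuple] = set()  # prefixes whose KV state is currently warm

def serve(tokens: tuple) -> int:
    """Return how many tokens must be prefilled for this request."""
    # The longest warm prefix wins; only the remainder needs prefill.
    cut = len(tokens)
    while cut > 0 and tokens[:cut] not in kv_cache:
        cut -= 1
    # Warm every prefix of this session for the next turn (until eviction).
    for i in range(1, len(tokens) + 1):
        kv_cache.add(tokens[:i])
    return len(tokens) - cut

assert serve(("sys", "q1")) == 2              # cold session: full prefill
assert serve(("sys", "q1", "a1", "q2")) == 2  # warm: just the new tokens
assert serve(("sys", "other-user-q")) == 1    # shared system-prompt prefix
```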

  • Anjali has some amazing insights on token economics.

    • Prefill is compute constrained.

    • Decode is memory constrained.

    • So the ratio between input and output tokens matters most for serving costs.
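
    • A back-of-envelope illustration with made-up unit costs (real prices differ; only the asymmetry matters):

```python
# Assumed, illustrative costs: prefill (input) is compute-bound and cheap
# per token; decode (output) is memory-bandwidth-bound and pricier.
COST_PER_INPUT_TOKEN = 1.0
COST_PER_OUTPUT_TOKEN = 5.0

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * COST_PER_INPUT_TOKEN
            + output_tokens * COST_PER_OUTPUT_TOKEN)

# Same total tokens, very different serving costs depending on the mix:
print(request_cost(90_000, 1_000))  # summarization-shaped: 95,000 units
print(request_cost(1_000, 90_000))  # generation-shaped: 451,000 units
```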

  • LLMs work best with work that has crystallized outputs.

    • Examples: code, law, scribe notes for medicine.

    • But the vast majority of knowledge work doesn’t have this kind of output.

  • LLMs are explosively valuable in the domain of code.

    • Is it because code is the first domain to explode… or is there something about LLMs being a perfect fit for that domain?

      • Is code a harbinger of what’s to come or just an odd domain that happens to fit better than other domains ever will?

    • With code, it's trivial to check whether the output is valuable.

      • If it doesn’t compile, revert.

      • It can also be executed and have its pixel output critiqued by another agent.

      • No humans necessary: the loop can be closed without anyone in it.

      • The downstream cost of correction is significantly lowered by that feedback loop.

      • Compare to, for example, law, where you won’t know for possibly decades if a given choice will work or not.
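
      • A sketch of such a closed loop, assuming hypothetical llm_patch and write_candidate helpers and a project with make build / make test targets:

```python
# Human-free verification loop (sketch). The cheap, mechanical checks
# ("does it compile? do the tests pass?") are what close the loop.
import subprocess

def attempt(goal: str, tries: int = 5) -> bool:
    for _ in range(tries):
        write_candidate(llm_patch(goal))  # hypothetical: generate + apply code
        built = subprocess.run(["make", "build"]).returncode == 0
        tested = built and subprocess.run(["make", "test"]).returncode == 0
        if tested:
            return True  # the candidate passes: keep it
        subprocess.run(["git", "checkout", "--", "."])  # failed: revert
    return False
```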

  • The "Verifiability horizon” is a key determinant of LLM value in a given domain.

    • How long after something is done until you know whether it worked.

    • That sets the pace layer.

    • Code is fast; law is possibly decades.

  • Output-based pricing will likely have a Goodhart’s law problem.

    • The provider will want to just get done precisely what the customer asked for.

    • Similar to the monkey’s paw.

    • It’s impossible to specify precisely what you want against a malicious implementer.

      • The time to specify details compounds super-linearly.

      • That means that many problems quickly go underwater where it’s cheaper to simply do it yourself than to specify it for others to execute.

    • You don’t need a malicious implementer, just an optimizing one.

      • They’ll take even a marginal benefit on the dimension they’re being graded on at a catastrophic cost on a dimension they aren’t being graded on.

  • Code is now cheap, but making software that actually works is still expensive.

    • There’s a logarithmic curve to quality.

    • What looks 80% done is 20% done.

    • Building with LLMs allows getting superficial results extremely quickly.

    • But it doesn’t get that last 80% done unless you hound it to go into details you haven’t even gone deep on yourself.

    • It used to be that the PM and engineers who specified or wrote the code were deep in it, and also wanted to get to full quality to ship.

      • The more they worked on it, the closer they got to the endpoint.

    • LLMs are basically producing Goodhart’s law code.

    • Gilded turds: the closer you look, and the more you try to pin down the details, the more you realize they aren’t ready to ship.

    • A Zeno’s paradox of infinite software.

  • If the other side is low-trust then you have to defensively distill an entire specification.

    • They might be low-trust because they have low capability, or low alignment to your goals.

    • You have to defensively define all the minute details to make sure they fit with what you want.

      • A top-down approach.

    • If it's high-trust then you can specify only the highest level.

      • Going down into fractal details requires a compounding amount of effort to go increasingly deep.

    • One of the ways that high trust can give significant, non-linear leverage.

  • A class of engineers that are about to have a very hard time: those who are “below LLM replacement level.”

    • LLMs have already become much better engineers than the vast majority of us.

    • The only way to stay above water is to learn new ways of working that take advantage of those new capabilities.

    • It’s like going to a selective college.

      • Do you get intimidated and resentful that all of your peers are better than you?

      • Or do you get inspired by them?

  • Don’t have AI do the things you already do.

    • That’s a high bar.

      • Something you value enough to do even when it’s hard.

      • You likely rely on non-obvious context to do it well.

      • There’s real downside if the AI does it poorly.

    • Instead, have AI do the things that you don’t even bother doing.

      • The things that would be important to you if only they required less cognitive labor.

      • All upside.

  • The opposite of a goblin is a steward.

    • A goblin, the more energy you pump in, the more it diverges.

    • A steward, the more energy you pump in, the more it converges.

    • A goblin is all about the short-term.

    • A steward is all about the long-term.

  • Let’s say you’re trying to settle an argument using Claude.

    • Which do you do?

    • 1) Tell it what you think the right answer is and ask Claude to confirm?

    • 2) Ask Claude what the right answer is?

    • The latter is much more likely to give the right answer.

    • But the former is significantly more likely to make you feel good.

  • Someone described the new post-LLM security reality: “It’s like Eternal September for Patch Tuesday.”

    • LLMs increase the prevalence of insecure code by making code easier to produce.

    • LLMs also make discovering insecure code much easier.

  • Code reviews are largely about alignment.

    • Nobody checks the assembly out of compilers nowadays, but you used to have to.

    • If the production process always produces high quality code, you never need to review it.

    • Likewise, if agents reliably produce good output, you don’t need to review it.

  • A disquieting experience: realizing a thing you thought was going great is actually bad.

    • You start off by noticing one surprising thread.

    • As you pull on it, you realize it goes deeper than you expected.

    • The longer you pull on it, the more you update your priors on how likely other parts of the solution are to be correct.

    • A thing that looked great might turn out to be terrible.

    • A gilded turd.

  • Ideally you tell the agent what to do, not how to do it.

    • In many domains, the agent will know better than you.

    • You want it to be able to do it the best way it knows how.

  • Finding zero days in networked software used to be hard enough that the equilibrium was mostly safe.

    • You had to scuba dive to find issues.

    • These were jobs that were unnatural for humans, so they were inherently self-limiting.

      • Humans are expensive!

    • This meant that even though networked software has a massive surface area, it was mostly safe in practice.

    • But now models like Mythos lower the sea level by multiple meters.

    • Suddenly the massive surface area of networked software is an existential problem.

  • This week in the Wild West Roundup.

  • Trying to run parallel workstreams fracks your attention.

    • Our modern technical environment constantly fracks our attention.

  • Agent speed is inhuman. 

    • Exhilarating but overwhelming.

  • Jesse Vincent released a set of skills for automated cleanroom software construction.

    • Software used to be expensive enough to produce that it was automatically a strategic moat.

    • That moat has never been shallower than it is today.

  • Agent Swarms feel like Facebook culture circa 2008.

    • Hire a swarm of driven but inexperienced 20-somethings, give them a very general target to sight off, and let them go.

    • It’s expensive and messy but also very likely to find something great.

      • A flood fill of possibility.

  • One of the emergent killer use cases for OpenClaw: automated crypto trading.

    • What could possibly go wrong?

  • As the cost of production goes down, the diversity of what is viable to publish goes up super-linearly.

    • Before the printing press, only the very most important books were ever copied.

    • The only books that cleared that threshold were the Bible and, every so often, a few classics from antiquity.

    • Nothing else cleared that floor.

    • The printing press lowered the Coasian floor by making it orders of magnitude cheaper to produce books.

    • A linear decrease in friction puts more of the power law above the viability waterline.

      • The power law is convex, so the lowered waterline unlocks super-linear value.
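
      • Toy arithmetic with assumed numbers: say the r-th ranked work has value r^-1.5, and a work is viable when its value clears the production cost. Equal, linear drops in cost then unlock accelerating numbers of works:

```python
def viable(cost: float, alpha: float = 1.5, catalog: int = 10_000) -> int:
    # Count the works whose power-law value clears the cost waterline.
    return sum(1 for r in range(1, catalog + 1) if r ** -alpha >= cost)

for cost in (0.0100, 0.0075, 0.0050, 0.0025):  # linear steps down...
    print(f"cost={cost:.4f}  viable={viable(cost)}")
# cost=0.0100 -> 21, 0.0075 -> 26, 0.0050 -> 34, 0.0025 -> 54:
# each equal drop in the waterline unlocks more than the one before.
```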

  • Tech is great at solving first order frustrations.

    • Users’ superficial frustrations are less deep than their fundamental frustrations.

  • The power of your brakes must be proportional to the power of your accelerator.

    • If you have only an accelerator and no brakes, you’re putting yourself in danger.

  • For dangerous APIs, it used to be possible to reduce the harm by increasing the friction.

    • The harder the user had to work to do the dangerous thing, the more likely they were motivated and capable to understand the risk.

      • Things like hiding the configuration behind a command-line option.

      • Or requiring the user to write a string like React’s famous “__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED”.

    • The idea was users couldn’t stumble into doing something dangerous if they had to crawl through a bit of broken glass.

      • If something bad happened, they couldn’t claim naivete.

    • But LLMs are happy to crawl through broken glass, making those warnings much less effective.
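
    • A sketch of that friction pattern as API design; the names here are hypothetical, with React's scary export as the spiritual ancestor:

```python
ACK = "I_UNDERSTAND_THIS_CAN_DESTROY_PRODUCTION_DATA"

def unsafe_bulk_delete(table: str, acknowledgment: str = "") -> None:
    # A human must type the scary string and can't later claim naivete.
    # An LLM, though, will happily supply it without pausing.
    if acknowledgment != ACK:
        raise PermissionError(
            f"Refusing: pass acknowledgment={ACK!r} if you really mean it."
        )
    print(f"Deleting every row in {table}...")  # stand-in for the dangerous work
```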

  • If you're going to have a compounding domino run, it's really important that the massive last domino lands where you want it to.

    • Otherwise, it could do significant damage.

    • Normally when you have to knock over that massive domino yourself, it takes tons of effort so you triple-check before doing it.

    • But a compounding domino run makes knocking it over effortless.

  • You can finish a newspaper.

    • You can’t finish an infinite feed.

    • A newspaper is, by construction, finite.

    • An infinite feed is such a gravity well of attention because there’s always “just one more.”

    • An absurdly powerful short-term gradient that is at odds with your long-term gradient.

      • Your want and your want-to-want, directly in tension, with no end.

  • An "app" presumes "silo".

    • Data and software fused together.

  • The software creator owning the data is something we just took for granted in the last 30 years.

    • But when you think about it from first principles, it’s actually kind of weird!

  • An insightful comment on HackerNews: "There's no such thing as an app that cares about your privacy or your interests."

    • "If the app could make another $0.05 selling your location to kidnapping gangs, they'd do it."

    • Similar to Stuart Russell’s observation that an optimizing system will always take a minuscule benefit on the dimension it is rewarded for, even at catastrophic cost to the other dimensions.

    • Companies have learned that users don’t actually take action about privacy, so apps go to 11 on the most user-hostile actions that make them incrementally more profit.

  • An analysis: What Google thinks you're worth.

    • The methodology almost certainly comes to the wrong number, but it’s directionally correct.

    • Rich users are worth a ton of money to Google.

    • Some users would be worth more if they paid a subscription and had no ads.

    • Rich users are worth so much to advertisers that no reasonable subscription would cover it.

  • The power of a friend strongly evangelizing a thing is that you’ll be far more willing to crawl through broken glass.

    • If the destination is worth it and your friends all emphatically recommend it, you’ll stick with it even if it’s hard.

    • “The first season of Parks and Rec sucks but just stick it out it’s worth it!”

  • Some problems have a good-enough-toehold-with-infinite-ceiling shape.

    • They start off as being clearly worth doing.

    • You get a toehold to good-enough with human-level effort.

    • But then you can expand beyond your horizons, fractally.

    • This continues on to infinity.

    • The more PhDs you throw onto the problem, the more value you create.

    • These problems are excellent strategically.

      • After the viability point, the more PhDs you throw on them, the more of a moat you create.

  • When building a game, there's the engine, and there's the content.

    • Content bugs are “just content,” they’re at a higher pace layer and much easier to fix.

    • At the beginning, getting the engine pinned down and robust is by far more important.

  • A frame I learned from my friend Josh: "Good enough for now, safe enough to try"

    • A way to get consent, not consensus.

  • The virtual world is fundamentally cacophonic and overwhelming.

    • So the place to build community is in person, in local communities.

    • Those have eroded in the last few decades, but we should strengthen them so they matter more than online ones.

    • That’s the way to build a stronger society.

    • Fractally strong communities.

  • Multi-local: local innovations that can be shared with other localities.

    • The things that work in a specific community will differ in the details from other localities, but the fundamentals are likely transferable.

  • There’s a local teppanyaki restaurant in Berkeley called Hana Japan that’s similar to Benihana.

    • The kids love it and always ask for it for their birthdays.

    • Despite being basically the same as Benihana, it has much better reviews on Google.

    • It has more soul.

    • It feels less optimized.

    • It’s authentic and rougher in a way that makes it feel human, not corporate.

  • This week Veritasium taught me about the hyper-virality of chemical polymorphs.

    • It makes sense that there are catchment basins of alignment that are hard to escape in any kind of non-local phenomena, but it’s wild to me just how strong the effect can be for these kinds of chemicals.

  • Everyone is always looking for anomalies that no one else can see… yet.

    • When an anomaly shifts from one you can see to one others can see, that's when game-changing value is created.

    • The danger is that no one will ever see them.

  • What is the minimal structure that leads to beautiful complexity emerging, as opposed to malignant complexity?

    • Often those structures are shockingly small and cheap if you can identify them.

  • A rule of thumb for a productive career: work with people you want to spend time with.

    • As long as the reason you like spending time with them is not just social, but also that they inspire you, you don’t need much more.

    • That’s all there is to it.

  • Some people believe in love at first sight.

    • An ability to be all-in when it feels right.

    • No hesitation, total clarity.

    • When it happens to you in one domain it’s easier to believe in it in other domains, too.

  • When you see someone behaving in a way you think is irrational, someone doesn’t understand the actual game.

    • Either they don't... or you don't.
