String processing errors


Alan Karp

Apr 14, 2026, 3:38:17 PM
to cap-...@googlegroups.com, <friam@googlegroups.com>
https://niyikiza.com/posts/map-territory/ lists a number of well-known problems when you deal with strings.  That page nicely explains the problem when you want to grant access to everything in /data.

Map vs. Territory

Let’s look at the core problem:

The Map: The string the LLM gives you. /data/../etc/passwd

The Territory: The inode the OS actually opens. /etc/passwd

The Vulnerability: Security checks usually validate the Map. Execution touches the Territory. When they disagree, attacks slip through.
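
A minimal sketch of that mismatch, assuming a policy of "anything under /data" and using only Python's `posixpath` module. The names `naive_check` and `lexical_check` are hypothetical, introduced here for illustration only.

```python
import posixpath

ROOT = "/data"  # assumed policy root, taken from the example above


def naive_check(path: str) -> bool:
    """Validate the Map: a plain string-prefix test."""
    return path.startswith(ROOT + "/")


def lexical_check(path: str) -> bool:
    """Collapse '.' and '..' textually before testing the prefix."""
    normalized = posixpath.normpath(path)
    return normalized == ROOT or normalized.startswith(ROOT + "/")


requested = "/data/../etc/passwd"
print(naive_check(requested))          # True: the string looks fine
print(posixpath.normpath(requested))   # /etc/passwd
print(lexical_check(requested))        # False
```

Textual normalization is still a check on the Map. It knows nothing about symlinks or mount points, so it can still disagree with what the OS opens.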

Is sanitization the best we can do?  Or do capabilities give us something better?  What about more complicated situations, such as SQL queries and JSON schemas?

--------------
Alan Karp

Chip Morningstar

Apr 14, 2026, 4:28:24 PM
to cap-...@googlegroups.com
This is exactly the idea captured by Norm’s dictum of “don’t separate designation from authority”.
Capabilities definitely seem like they figure into the answer.

An idea that occurs to me (and almost certainly not original with me) is to represent capabilities as specialized kinds of tokens in the token stream that the LLM works with, which the LLM itself is not able to synthesize, but which instead need to originate extrinsically.
This would not be a full solution, nor even actually a solution at all per se, but might be one of the building blocks with which a solution could be composed.

— Chip


--
You received this message because you are subscribed to the Google Groups "cap-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cap-talk+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/cap-talk/CANpA1Z3UwZtWtnyFAS_CpL1VUhCFrDQCxDkW_anWZmz16fdYZA%40mail.gmail.com.

Kris Kowal

Apr 14, 2026, 4:41:46 PM
to cap-talk
With the Endo Familiar/Daemon, we’re obliging the AI agents to interact with capabilities through petnames. That is, the AI assigns their own names to the capabilities they receive. They see “proposed names” on links embedded in messages, but they are suggestions. If the AI agent sees locators (strings, URL-like), even though they are not granted the ability to lift a locator to a capability, they could exfiltrate the locator to a party that could. So, petname systems save another day.

I’ve seen other systems munge API keys, which I think is analogous to the C-list solution. That is, the number is only meaningful for them in the context of their conversation with the mediator between their sandbox and the actual API.
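
A rough sketch of the petname / C-list idea, under the assumption that each agent has its own table and only ever handles the names it chose. `PetnameTable`, `receive`, and `invoke` are hypothetical names for illustration; this is not Endo's API.

```python
from typing import Callable, Dict, Optional


class PetnameTable:
    """Hypothetical per-agent table mapping agent-chosen names to capabilities."""

    def __init__(self) -> None:
        self._caps: Dict[str, Callable[[], str]] = {}

    def receive(self, capability: Callable[[], str], proposed_name: str,
                chosen_name: Optional[str] = None) -> str:
        """Store a capability under the agent's own name; the proposal is only a hint."""
        name = chosen_name or proposed_name
        if name in self._caps:
            raise ValueError(f"petname already in use: {name}")
        self._caps[name] = capability
        return name

    def invoke(self, petname: str) -> str:
        """Resolve a petname in this table only; unknown names raise KeyError."""
        return self._caps[petname]()


def read_report() -> str:
    return "quarterly numbers"


first_agent = PetnameTable()
other_agent = PetnameTable()
name = first_agent.receive(read_report, proposed_name="report", chosen_name="q-report")
print(first_agent.invoke(name))

# The same string carries no authority in another agent's table.
try:
    other_agent.invoke(name)
    leaked = True
except KeyError:
    leaked = False
print(leaked)
```

The point of the sketch is that exfiltrating the string "q-report" gives a third party nothing, because the name only resolves inside the table that assigned it.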

Alan Karp

Apr 14, 2026, 5:10:36 PM
to cap-...@googlegroups.com
The problem is when you want to delegate permission to a lot of things.  You could create a gazillion individual capabilities, one per entry in /data, say, but that wouldn't cover things you add after delegating.  

I was just hoping there was some way to select the individual entry other than using a string.  Chip may have gotten it, but I don't understand his proposal.

--------------
Alan Karp


Matt Rice

Apr 14, 2026, 5:33:32 PM
to fr...@googlegroups.com, cap-...@googlegroups.com
On Tue, Apr 14, 2026 at 7:38 PM Alan Karp <alan...@gmail.com> wrote:
>
> https://niyikiza.com/posts/map-territory/ lists a number of well-known problems when you deal with strings. That page nicely explains the problem when you want to grant access to everything in /data.
>
> Map vs. Territory
>
> Let’s look at the core problem:
>
> The Map: The string the LLM gives you. /data/../etc/passwd
>

I'm having difficulty finding a specific document that, from what I
recall, compares Unix filesystems to the directory objects of one of
KeyKOS/EROS/CapROS. In those systems ".." is not necessarily a thing:
there is no root, no implicit parent pointers, and no tree shape.
Directories may be cyclic, forming a directed graph.

I would argue that they do not "solve" this problem, but avoid it entirely
(there exists no mapping of the directory structure to strings including ".."
unless it was intentionally added and given the arbitrary name "..").

One thing to look at is hybrid capability systems like Capsicum, which
do attempt to deal with this by switching from ambient mode into
capability mode, where after entering capability mode
you can no longer turn strings
into capabilities via the filesystem.

Jonathan S. Shapiro

Apr 14, 2026, 6:09:10 PM
to cap-...@googlegroups.com, <friam@googlegroups.com>
It's slightly worse than that, because /data might be a symlink to an arbitrary place. In a poorly constructed chroot environment this could be used to trick the passwd program into accessing an entirely fabricated version of /etc/passwd.
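
A small sketch of the symlink case, assuming a Unix-like system where the process may create symlinks. It builds a throwaway directory in which `data` is a symlink, then compares the lexical view of a path with the resolved one. All file names here are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp = os.path.realpath(tmp)            # the temp dir itself may sit behind a symlink
    elsewhere = os.path.join(tmp, "elsewhere")
    os.mkdir(elsewhere)
    data = os.path.join(tmp, "data")
    os.symlink(elsewhere, data)            # "data" points at an arbitrary place

    requested = os.path.join(data, "secret.txt")

    # Lexical view: the name still sits under "data".
    lexical_inside = os.path.normpath(requested).startswith(data + os.sep)

    # Resolved view: follow symlinks the way the OS would.
    resolved = os.path.realpath(requested)
    resolved_inside = resolved.startswith(data + os.sep)

print(lexical_inside)    # True
print(resolved_inside)   # False
```

Resolving first and checking afterwards is itself only a snapshot: the binding can change between the check and the open.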

I don't think this is an "it should be a capability approach" situation. Ambient authority is involved at each traversal step (at least in UNIX descendants), but almost any problem you can introduce by exploiting ambient authority can also be introduced by exploiting name re-bindings. Name spaces and bindings are much harder to get right than people tend to imagine.

Path canonicalization in UNIX variants is such a well known problem that different shells do not agree on how they canonicalize paths before opening/accessing. Some handle it textually while others walk the presented path segment by segment. If the openat() system call is used to do segment-at-a-time traversal, it's constrained by a sequence of stepwise descriptor walks, albeit with ambient access rights. Since opening '/' is a guarded special case due to chroot enforcement, the risk is that canonicalization is neither transactional nor idempotent. Changes in directory entry bindings before, after, or during a walk can lead to different results. This hazard also exists in capability-based implementations.
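
A sketch of the segment-at-a-time walk, assuming a Unix-like system where Python's `os.open` supports `dir_fd` (the `openat()` path). `open_beneath` and `try_open` are hypothetical helpers written for this illustration, and the policy of rejecting ".." and symlinks outright is one possible choice, not the only one.

```python
import os
import tempfile


def open_beneath(root_fd: int, relative_path: str) -> int:
    """Open relative_path under root_fd one segment at a time.

    Rejects '..' and refuses to follow symlinks at any step.
    """
    segments = [s for s in relative_path.split("/") if s not in ("", ".")]
    if not segments:
        raise ValueError("empty path")
    current = os.dup(root_fd)  # work on a copy so the caller keeps root_fd
    try:
        for index, segment in enumerate(segments):
            if segment == "..":
                raise PermissionError("'..' is not allowed")
            flags = os.O_RDONLY | os.O_NOFOLLOW
            if index < len(segments) - 1:
                flags |= os.O_DIRECTORY  # intermediate steps must be directories
            next_fd = os.open(segment, flags, dir_fd=current)
            os.close(current)
            current = next_fd
    except BaseException:
        os.close(current)
        raise
    return current


def try_open(root_fd: int, path: str) -> bool:
    """Report whether open_beneath accepts the path."""
    try:
        os.close(open_beneath(root_fd, path))
        return True
    except OSError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "sub", "foo.txt"), "w") as handle:
        handle.write("hello")
    os.symlink("/etc", os.path.join(tmp, "link"))

    root_fd = os.open(tmp, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fd = open_beneath(root_fd, "sub/foo.txt")
        with os.fdopen(fd) as handle:
            content = handle.read()
        dotdot_allowed = try_open(root_fd, "../etc/passwd")
        symlink_allowed = try_open(root_fd, "link/passwd")
    finally:
        os.close(root_fd)

print(content, dotdot_allowed, symlink_allowed)
```

Each step is anchored to a descriptor already held, so a rename of an ancestor cannot redirect the remaining steps. The walk as a whole is still not transactional: entries can be rebound between steps, which is the hazard described above.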

These days, the standard specification of open(2) actually states how it is supposed to follow these paths, but the behavior in pre-SVR4 editions of UNIX was inconsistently specified or not specified at all.

One could argue - and there are various reasons to consider - that "directory" objects should implement deep hierarchies rather than single level hierarchies, but this doesn't address the idempotency issue.

For extra credit, say what additional issues are introduced by loop-back mount points, both in modern UNIX but especially in Plan 9, where these can be introduced by non-privileged users in name spaces they control. A stiff drink may be helpful before starting; works best if you give it 15 minutes or so to kick in before you start thinking about this one.

It is sometimes a wonder to me how such a bright group of people got almost everything wrong about namespace security and units of operation. The UNIX process model didn't have a well-specified account for the effect of signal arrival on process and system state until... 1988, when Roger Faulkner, Steve Rago, and Ron Gomes nailed that down during the SVR4 /proc work. I poked my nose in a couple of times while working on the associated debugger support, but didn't have a big hand in it. Roger, Brendan Eich, and I extended it further in 1989/90 to add watchpoint handling with specified behavior for Sun's Solaris and SGI's IRIX, respectively. So far as I'm aware, Linux still doesn't have a fully specified model, though there was work done in ptrace(2) to clean up a bunch of the bigger issues.


Jonathan



Mark S. Miller

Apr 15, 2026, 2:39:07 AM
to cap-...@googlegroups.com
Yes. For these purposes, think of petnames like lambda names and think of lambda names like c-list indexes.


On Tue, Apr 14, 2026 at 1:41 PM Kris Kowal <cowber...@gmail.com> wrote:


--
  Cheers,
  --MarkM

Mark S. Miller

Apr 15, 2026, 2:39:59 AM
to cap-...@googlegroups.com
On Tue, Apr 14, 2026 at 2:10 PM Alan Karp <alan...@gmail.com> wrote:
> The problem is when you want to delegate permission to a lot of things.  You could create a gazillion individual capabilities, one per entry in /data, say, but that wouldn't cover things you add after delegating.
>
> I was just hoping there was some way to select the individual entry other than using a string.  Chip may have gotten it, but I don't understand his proposal.

Why is a string worse than a c-list index?
 

Mark S. Miller

Apr 15, 2026, 2:52:21 AM
to cap-...@googlegroups.com
The problem with Unix paths is that they are close enough to hierarchical that people form hierarchical intuitions about them. There are two ways around this. Provide genuine hierarchies as Jonathan proposes, or fully give up on hierarchy. Consider the meaning of a dotted path like `foo.bar.baz` in most conventional programming languages. `foo` is a lexical name to be looked up in the lexical environment of the utterer. Its value is, let's say, a record. `.bar` looks up the "bar" property of that record, which is, let's say, another record. `.baz` looks up the "baz" property on that record, whose value can be any first-class value in that language. We all know that this is a naming path through an arbitrary graph with labeled edges. Likewise with path names in SPKI/SDSI. We know never to assume lack of cycles or aliasing. Because we're not led to thinking in hierarchy, the violation of hierarchy does not cause problems.

Both have their place. Aside from legacy, almost-hierarchical systems like most file systems do not.
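
To make the dotted-path reading concrete, here is a small sketch in Python, using `types.SimpleNamespace` as the "record". The helper `resolve` and the particular names are assumptions made for the example; the graph deliberately contains a cycle and an alias.

```python
from types import SimpleNamespace

foo = SimpleNamespace()
bar = SimpleNamespace()
foo.bar = bar        # edge labeled "bar"
bar.baz = 42         # edge labeled "baz" to a first-class value
bar.owner = foo      # a cycle: bar leads back to foo
foo.alias = bar      # aliasing: two labels reach the same record


def resolve(environment: dict, dotted_path: str):
    """Look up the first name lexically, then follow one labeled edge per segment."""
    head, *labels = dotted_path.split(".")
    value = environment[head]
    for label in labels:
        value = getattr(value, label)
    return value


env = {"foo": foo}
print(resolve(env, "foo.bar.baz"))              # 42
print(resolve(env, "foo.bar.owner.bar.baz"))    # 42, reached through the cycle
print(resolve(env, "foo.alias") is resolve(env, "foo.bar"))  # True
```

Nothing in `resolve` assumes a tree, so neither the cycle nor the alias is a special case.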


--
  Cheers,
  --MarkM

Matt Rice

Apr 15, 2026, 3:21:23 AM
to cap-...@googlegroups.com

Having almost overlooked the "in most conventional programming
languages" part, I initially thought you were talking about
foo.bar.baz as encoding the file extension ".bar.baz" on the file foo,
rather than e.g. the foo.bar.baz module or the member bar.baz in the
structure foo.
For a moment I was totally confused about where encoding random
information in filenames could be acceptable, when this practice causes
havoc with capabilities and petname systems, where passing the
information by capability redacts the information stored in the filename.

Alan Karp

Apr 15, 2026, 1:06:54 PM
to cap-...@googlegroups.com
On Tue, Apr 14, 2026 at 11:40 PM Mark S. Miller <eri...@gmail.com> wrote:
>
> Why is a string worse than a c-list index?

Where does the c-list index come from?  I want to give you permission to read everything in /data, even things created after the delegation.  You need a way to designate the exact item you want to access, say /data/foo.  The only relevant c-list entry is that for /data.  The only way I know how to do that is for you to specify both the c-list index for /data and a string to designate foo.

--------------
Alan Karp


Jonathan S. Shapiro

Apr 15, 2026, 8:40:21 PM
to cap-...@googlegroups.com
I like the framing, but the story is not entirely correct. Consider dynamic module resolution.


Jonathan

Mark S. Miller

Apr 16, 2026, 12:15:44 AM
to cap-...@googlegroups.com
I don't get it. Please explain dynamic module resolution




--
  Cheers,
  --MarkM

Rob Meijer

Apr 18, 2026, 4:57:07 PM
to cap-...@googlegroups.com
Are they really hierarchical intuitions, or are they arborescent intuitions? Or a combination of the two? I think you can get into trouble by assuming that a Unix file path or a dot-notation chain in a programming language, which in essence is a graph traversal route or path, refers to an arborescent DAG. As such, I feel the choice isn't between hierarchy and no hierarchy, but between different constraints on the graph being traversed by the path.

Matt Rice

Apr 18, 2026, 5:43:14 PM
to cap-...@googlegroups.com
I've considered it more of a compiler portability problem than an
actual programming language issue.
Essentially you can have a pre-parser that spits out the file contents
with a header attached including the
filesystem information, and then you have a complete data stream which
doesn't rely on ambient authority, and self-describes.

But when you start looking at porting compilers to capability systems
they are often not written in a way that takes a self describing data
stream as file input. Even worse, compiler tools like the bfd library
will open/close file descriptors by name arbitrarily as it reaches the
open file descriptor limit.
Thus usage of ambient authority tends to get pushed down to the bottom
of the library design.

Anyhow these are the problems I've noticed, but it always feels like
you could write a compiler for a language which encodes filesystem
assumptions for a capability system, either by wrapping individual
datastreams with the missing information or just using a filesystem
archive format as input (.tar, .zip, etc).
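
A sketch of the archive-as-input idea, using Python's standard `tarfile` module and an in-memory stream. The function names (`bundle`, `unbundle`, `resolve_include`) and the toy "include" syntax are assumptions for illustration, not taken from any real compiler.

```python
import io
import tarfile
from typing import Dict


def bundle(sources: Dict[str, str]) -> bytes:
    """Pack named sources into one in-memory tar stream."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, text in sources.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def unbundle(stream: bytes) -> Dict[str, str]:
    """Recover the name-to-contents mapping from the stream alone."""
    files: Dict[str, str] = {}
    with tarfile.open(fileobj=io.BytesIO(stream), mode="r") as archive:
        for member in archive.getmembers():
            if member.isfile():
                files[member.name] = archive.extractfile(member).read().decode("utf-8")
    return files


def resolve_include(files: Dict[str, str], name: str) -> str:
    """Resolve an include against the stream only; never touch the host filesystem."""
    if name not in files:
        raise FileNotFoundError(name)
    return files[name]


stream = bundle({"main.src": "include util.src", "util.src": "def helper"})
files = unbundle(stream)
print(resolve_include(files, "util.src"))
```

Because every name is resolved against the stream, a request such as "../etc/passwd" simply has no entry to find; the hierarchy the tool relies on is the one carried in its input.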

So my feeling is that the problem is reliance upon a
hierarchy which is not actually represented within the input stream.

Matt Rice

Apr 18, 2026, 6:32:13 PM
to cap-...@googlegroups.com
On Sat, Apr 18, 2026 at 9:42 PM Matt Rice <rat...@gmail.com> wrote:
>
> Anyhow these are the problems I've noticed, but it always feels like
> you could write a compiler for a language which encodes filesystem
> assumptions for a capability system, either by wrapping individual
> datastreams with the missing information or just using a filesystem
> archive format as input (.tar, .zip, etc).

I should probably add that these days the most relevant option,
instead of .tar, .zip, or some other filesystem archive format,
might be to use git repositories (or whatever version control
system) directly.

Mark S. Miller

Apr 18, 2026, 7:21:37 PM
to cap-...@googlegroups.com
On Sat, Apr 18, 2026 at 1:57 PM Rob Meijer <pib...@gmail.com> wrote:
> Are they really hierarchical intuitions, or are they arborescent intuitions?

TIL "arborescent". But having learned it, I'm confused. As I should have guessed anyway from the name, "arborescent" simply seems to me to mean "tree". What's the difference between "hierarchical" and "arborescent"? 


Rob Meijer

Apr 19, 2026, 11:24:26 AM
to cap-...@googlegroups.com
Arborescent differs from just "tree" mostly in its implied directionality and designated root. It differs from hierarchical in its strict single-root nature and its requirement that every non-root vertex has exactly one parent.
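
The definition above can be turned into a small check. This is a sketch in Python, assuming the graph is given as a set of vertices plus (parent, child) edges whose endpoints are all in that set; `is_arborescence` is a name chosen here for illustration.

```python
from typing import Hashable, Iterable, Set, Tuple


def is_arborescence(vertices: Set[Hashable],
                    edges: Iterable[Tuple[Hashable, Hashable]]) -> bool:
    """True if there is exactly one root, every other vertex has exactly
    one parent, and every vertex is reachable from the root."""
    indegree = {v: 0 for v in vertices}
    children = {v: [] for v in vertices}
    for parent, child in edges:
        indegree[child] += 1
        children[parent].append(child)

    roots = [v for v in vertices if indegree[v] == 0]
    if len(roots) != 1:
        return False
    root = roots[0]
    if any(indegree[v] != 1 for v in vertices if v != root):
        return False

    # Walk down from the root and confirm that nothing is left out.
    seen = set()
    stack = [root]
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        stack.extend(children[vertex])
    return seen == set(vertices)


tree = [("/", "data"), ("/", "etc"), ("data", "foo.txt")]
nodes = {"/", "data", "etc", "foo.txt"}
print(is_arborescence(nodes, tree))                          # True
print(is_arborescence(nodes, tree + [("etc", "foo.txt")]))   # False: two parents
print(is_arborescence({"a", "b"}, [("a", "b"), ("b", "a")])) # False: no root
```

The second case is what a hard link or other alias does to a directory graph: the path still resolves, but the single-parent assumption no longer holds.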

Alan Karp

Apr 22, 2026, 1:25:54 PM
to cap-...@googlegroups.com, <friam@googlegroups.com>
I've been pondering this problem and may have come up with a solution that doesn't rely on string sanitization.  To refresh your memory, Alice delegates to Bob access to everything in /data, even things added after the delegation.  Bob uses that capability to access some object by specifying a string, say "foo.txt".  Alice's machine interprets that request as being for /data/foo.txt.  Great, but what if the string is "../etc/passwd"?

The basic solution is that the capability to /data only allows asking for a capability to the thing designated by the specified string.  Alice, who knows everything in /data, keeps a map of strings to capabilities.  Bob then uses his /data capability to ask for a capability to foo.txt.  There won't be an entry in the map if he asks for ../etc/passwd.  You can avoid the round trip to retrieve the capability by having a /data service that accesses the map and forwards the request using the designated capability.
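
A minimal sketch of this proposal, assuming capabilities can be modelled as zero-argument callables. `DataDirectory`, `lookup`, and `read` are hypothetical names; Python does not enforce the encapsulation a real capability system would, so this only shows the shape of the protocol.

```python
from typing import Callable, Dict

Capability = Callable[[], str]  # hypothetical: a zero-argument read capability


class DataDirectory:
    """Alice's side: a map from names to capabilities for entries in /data."""

    def __init__(self) -> None:
        self._entries: Dict[str, Capability] = {}

    def add(self, name: str, contents: str) -> None:
        """Alice registers an entry; this can happen after delegation."""
        self._entries[name] = lambda: contents

    def lookup(self, name: str) -> Capability:
        """Return the capability stored under exactly this name, or fail."""
        if name not in self._entries:
            raise KeyError(f"no such entry: {name}")
        return self._entries[name]

    def read(self, name: str) -> str:
        """Service variant: look up and forward in one step, saving a round trip."""
        return self.lookup(name)()


data = DataDirectory()
bob_lookup = data.lookup      # what Bob is handed: only the lookup facet
bob_read = data.read          # or the forwarding service
data.add("foo.txt", "hello")  # added after the delegation

print(bob_lookup("foo.txt")())   # hello
print(bob_read("foo.txt"))       # hello

try:
    bob_lookup("../etc/passwd")
    escaped = True
except KeyError:
    escaped = False
print(escaped)                   # False
```

The string is never interpreted as a path. It is only a key, so "../etc/passwd" fails for the same reason any unknown name fails.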

Does this approach solve the problem?  Does it introduce any vulnerabilities?  Can it be adapted to solve other related problems, such as SQL queries?

--------------
Alan Karp

Rob Meijer

Apr 23, 2026, 7:16 AM
to cap-...@googlegroups.com, <friam@googlegroups.com>
It never grew a user base, so I haven't done maintenance on it for years, and I'm not 100% sure it runs in the current Python ecosystem (there was a weird issue a few years back, but following the readme did the trick). But this is basically what rumpeltree does, not with any map, but with a root sparse-cap (multi-rooted) and a single server-side key. If it still installs, play around with rumpelbox a bit. It's a demo tool for pyrumpeltree. As said, it's unmaintained because of a zero-size user base AFAIK, but I think it fills your need exactly:


I wanted to make it into a user-space file system like MinorFS and MattockFS, but never found time to look into the locking and random-access crypto needs properly.

I'd be happy to help anyone wanting to take over the project to get started, but I'm too filled up with other pet projects right now to work on pyrumpeltree or related stuff, so if you are interested in adopting it, or porting it, that would be great, if it's indeed the fit that I think it is.

If not, it will stay the unmaintained thing it is now, without a FUSE filesystem implementation on top.
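
For readers unfamiliar with the general idea of deriving capabilities from a parent sparse-cap and a name, here is a generic sketch using HMAC from Python's standard library. It is not pyrumpeltree's API and does not claim to reproduce its derivation scheme; `derive_child`, `storage_key`, and `Server` are invented for this illustration.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional


def derive_child(parent_cap: bytes, name: str) -> bytes:
    """Anyone holding the parent sparse-cap can derive the child cap for a name."""
    return hmac.new(parent_cap, name.encode("utf-8"), hashlib.sha256).digest()


def storage_key(server_key: bytes, cap: bytes) -> str:
    """Only the server, which holds server_key, can turn a cap into a location."""
    return hmac.new(server_key, cap, hashlib.sha256).hexdigest()


class Server:
    """Hypothetical flat store addressed by keyed hashes of sparse-caps."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)   # the single server-side key
        self._store: Dict[str, str] = {}

    def write(self, cap: bytes, contents: str) -> None:
        self._store[storage_key(self._key, cap)] = contents

    def read(self, cap: bytes) -> Optional[str]:
        return self._store.get(storage_key(self._key, cap))


server = Server()
root_cap = secrets.token_bytes(32)            # the delegated root sparse-cap

# Alice stores an entry under the cap derived from the root and the name.
server.write(derive_child(root_cap, "foo.txt"), "hello")

# Bob, holding only root_cap, derives caps from names himself.
found = server.read(derive_child(root_cap, "foo.txt"))
missing = server.read(derive_child(root_cap, "../etc/passwd"))
print(found, missing)
```

In a scheme of this shape there is no per-directory map of names to capabilities, and a name like "../etc/passwd" is just more input to the hash: it derives an unrelated token rather than walking anywhere.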


Rob Meijer

Apr 23, 2026, 7:28 AM
to cap-...@googlegroups.com, <friam@googlegroups.com>




Just for context, it's a really small project. Just about 150 lines of Python for the library and about 300 lines of code for the single-codebase demo tools (rumpelbox has a busybox-like symlink setup). Should be easy enough to port or take custody of if it matches your needs, but there are a few mental click moments for many before it makes sense.

I remember trying to explain it to Zooko many years ago when I didn't have code yet, and failing. But I think explaining it "with" code is probably easier. Especially because the core of it is only 150 lines.