contributing to an open problems list

Yudhister Joel Kumar

Jul 4, 2025, 2:39:43 AM
to guaranteed-safe-ai
Hi all,

Some colleagues and I are accumulating open problems in AIS to present at ODYSSEY (an alignment theory conference). Would love for y'all to contribute!

Flavors of what we have so far: ARC submitted some technical & conceptual questions from the heuristic arguments agenda, we have directions from the causal incentives group, and we're planning to include singular learning theory, scalable oversight, computational mechanics, etc.

We'd be excited about including material on formal verification! "Submission" is a strong word --- just feel free to share any thoughts you might have as a response to this email! Both specific and broad directions are appreciated; we just want to get a sense of what people think is important and/or tractable.

Please forward to anyone who might be interested! And email me if you have any questions. 

Best,
Yudhister

Quinn Dougherty

Jul 4, 2025, 9:55:51 AM
to Yudhister Joel Kumar, guaranteed-safe-ai
this is great! thanks for doing this.

I often struggle to come up with an open problems list for GSAI, since a lot of what I think about is very rudimentary engineering of "squeezing out better performance via scaffolds and prompts", which isn't an exciting thing to read in an open problems doc. A lot of ecosystem services, like transpilers between proof stacks, would be good, but I just found out about the IR that the CSLib team is going to do, and I think the best approach to a universal transpiler is to wait for that to mature and then ship transpilers from it. Anyway, that's more "here are some approaches I haven't had time for yet, free idea jubilee" and less "open problems, derive your own damn approaches".

That said, there's also the probabilistic world modeling angle, which has a ton of open problems. We could look through https://www.aria.org.uk/media/3nhijno4/aria-safeguarded-ai-programme-thesis-v1.pdf, which has a lot of it outlined, but it's unclear at this juncture exactly how much of it is being taken care of by ARIA creators and which parts need help.

Also, on the topic of SgAI: I've long thought gatekeeper architectures should be prototyped on toy problems (RIP my SFF funding request that suggested I do this in Minecraft), because you could build a UX laboratory to experiment with spec elicitation and proof interpretation, try to break the security guarantees of the architecture or implementation, and just have concrete codebases to sharpen arguments and counterarguments.
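
As a flavor of what the toy version buys you, here's a minimal Lean sketch of the gatekeeper shape, where an action only reaches execution packaged with a machine-checkable cert that it meets the spec (every name here is a hypothetical stand-in, just illustrating the idea):
```
-- Hypothetical sketch: the untrusted proposer must package each action with a
-- proof that it satisfies the spec; nothing uncertified can reach `run`.
structure CertifiedAction (Action : Type) (Spec : Action → Prop) where
  act  : Action
  cert : Spec act

-- The gatekeeper's interface only accepts certified actions.
def run {Action : Type} {Spec : Action → Prop}
    (ca : CertifiedAction Action Spec) : Action :=
  ca.act
```
The UX-laboratory part would then live in how a human elicits `Spec` and inspects `cert`, and the red-team part in attacking whatever sits between this idealized interface and a real system.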

Quinn Dougherty

Jul 4, 2025, 10:13:13 AM
to Yudhister Joel Kumar, guaranteed-safe-ai
there is one conceptual bottleneck that programming language theorists could work on, which is steganography-free semantics. I have some reasonable, but unsatisfying, notes on stegfree certs over the SKI calculus (unsatisfying because I don't have an example of a "message", in the everyday sense of the word, that can actually come through in my formalism of "covert channel"), but I'm curious whether it's the right place to start.
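
(To keep the sketches below self-contained in Lean, here's the minimal context I'm assuming; every one of these names is a hypothetical stand-in, not an existing library:)
```
axiom Term : Type                                  -- SKI-calculus terms
axiom DenotationallyEq : Term → Term → Prop        -- same extensional meaning
axiom OperationallyDifferent : Term → Term → Prop  -- distinguishable by some observer
axiom countSymbols : Term → Nat                    -- syntactic size
axiom stepCount : Term → Nat                       -- reduction steps to normal form
axiom hasUnusedArg : Term → Bool                   -- a crude structural observable
```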

For example, the existence of a covert channel could look like this:
```
/-- Hidden meaning detector: denotationally equal, but operationally distinct -/
def HasHiddenMeaning (a b : Term) : Prop :=
  DenotationallyEq a b ∧ OperationallyDifferent a b
```
and the stegfree cert would be a proof of the following type
```
/-- StegFree predicate: all observable outputs agree under different observers -/
def StegFree (t1 t2 : Term) : Prop :=
  DenotationallyEq t1 t2 ∧
  countSymbols t1 = countSymbols t2 ∧
  stepCount t1 = stepCount t2 ∧
  hasUnusedArg t1 = hasUnusedArg t2
```
which is provable far more often than the outright absence of hidden channels.
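
And to be explicit about what the cert buys you: under the premise that the listed observables exhaust what an observer can distinguish (which is exactly the load-bearing assumption, stated as an axiom below), StegFree rules out this notion of hidden channel:
```
-- Assumed premise (the contentious part): if every listed observable agrees,
-- no observer can tell the two terms apart.
axiom observables_determine_behavior (t1 t2 : Term) :
    countSymbols t1 = countSymbols t2 →
    stepCount t1 = stepCount t2 →
    hasUnusedArg t1 = hasUnusedArg t2 →
    ¬ OperationallyDifferent t1 t2

theorem stegFree_implies_no_hidden_meaning (t1 t2 : Term) :
    StegFree t1 t2 → ¬ HasHiddenMeaning t1 t2 :=
  fun ⟨_, hsym, hstep, harg⟩ ⟨_, hop⟩ =>
    observables_determine_behavior t1 t2 hsym hstep harg hop
```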

I don't know what I'd bet on the tractability of this, even restricted to programming languages, but I'm beginning to suspect there are weak forms of steg freedom that we can get, and it'd be great to deeply understand the limitations.

Not to mention, stegfree semantics wouldn't even try to address "I_WILL_PAY_YOU_ONE_MEELION_DOLLARS_IF_YOU_COLLUDE_WITH_ME = 1"-style variable naming (which Steve recently reminded me of, but which dates back to very old boxing/containment posts).

Then naturally you ask about natural language, or things other than programming languages. I had a 4o session where it gave some information-theoretic formalism for estimating steg capacity over natural language and suggested we could manipulate the upper bound on that capacity. But I'm not sure I know enough information theory to evaluate whether it's slop before I cosign it.
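
For orientation (my gloss, not the session's): the textbook quantity a formalism like that presumably bottoms out in is channel capacity, the maximum mutual information between the hidden intent X and the observed output Y:
```
C = \max_{p(x)} I(X;Y) = \max_{p(x)} \left( H(Y) - H(Y \mid X) \right)
```
"Manipulating the upper bound" would then mean driving I(X;Y) down, e.g. by paraphrasing or otherwise injecting noise between what the model emits and what the receiver sees, which caps how many bits per message any encoding scheme can smuggle through. Whether that's what the session actually proposed, I can't vouch for.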

Yudhister Joel Kumar

Jul 5, 2025, 3:35:18 PM
to Quinn Dougherty, guaranteed-safe-ai
Thanks! This is all very helpful. Will follow up privately with clarifying questions in a bit.

We would be really excited about including open questions relating to the probabilistic world modeling agenda! So if anyone here was working on projects for TA1 in the Safeguarded AI program, I'd love to chat! 

Yudhister
