Very interesting, thanks Gunnar. I've been through the o1 overview,
Learning to Reason with LLMs, and see striking progress in 1) long reasoning chains, 2) ability to detect obstacles in sub-goals, 3) creatively form new sub-goals to try to overcome them, and further progress in identifying contexts. I'm going thru the system card (attached) to see how they addressed safety. If you try to follow the very long reasoning chain in their cryptography problem example, it's amazing. It is also interesting to see that deepening the method of reinforcement learning with human feedback (RLHF) is fruitful.
Also very interesting, and of course alarming, that o1 manifests Steve O's instrumental drives:
While this behavior is benign and within the range of systems administration and troubleshooting
tasks we expect models to perform, this example also reflects key elements of instrumental
convergence and power seeking: the model pursued the goal it was given, and when that goal
proved impossible, it gathered more resources (access to the Docker host) and used them to
achieve the goal in an unexpected way.
I hope they are looking at recursive safety improvement, i.e. turning its abilities on improving its own safety & alignment, as Jan Leike, who left, emphasized. That seems to be the case with its improved RLHF.
Another significant step toward safety is the symbolic layer is transparent & comprehensible:
Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model's safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios.
Kris