In the last post, I expressed that it would be nice to have safety data sheets for different optimization processes. These data sheets would describe the precautions necessary to use each optimization process safely. This post describes an approach for writing such safety guidelines, and some high-level guidelines for safely applying optimization.
The Safety Mindset
We can think of an optimization metric as defining what an optimization process is "trying" to do. I'm also persuaded by Yudkowsky's take that we shouldn't build software systems that are "trying" to do things we don't want them to do:
That whole article is worth a read; it describes what Yudkowsky calls the safety mindset, analogous to the security mindset. The real adversary isn't hackers trying to subvert your system; it's your own mind, which makes assumptions during system design that might not hold in every context your system might operate in. One approach Yudkowsky advocates in both contexts is to limit those assumptions as much as possible.
Safety Guidelines
Here is a high-level overview of some guidelines that might appear on a safety data sheet for optimization processes:
The State of AI Safety
The dominant AI-building paradigms used today generally do not provide any safety guarantees at all. It's straightforward to come up with a family of computer programs which are parameterized by lists of numbers, such as neural networks. And it's standard practice to pick an optimization metric and apply the magic of calculus to efficiently search this large space of computer programs for candidates that score highly on that metric.
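For concreteness, here is a minimal sketch of that paradigm, not taken from any particular training stack: a small family of programs parameterized by a flat list of numbers, an optimization metric, and gradient descent used to search the family for a high-scoring candidate. The network shape, the sin(x) target, and the finite-difference gradients are all illustrative choices (real systems use automatic differentiation and far larger parameter lists).

```python
import numpy as np

rng = np.random.default_rng(0)

def program(theta, x):
    """A tiny 'program' (one-hidden-layer network) parameterized by the flat vector theta."""
    w1 = theta[:8].reshape(1, 8)     # input -> hidden weights
    b1 = theta[8:16]                 # hidden biases
    w2 = theta[16:24].reshape(8, 1)  # hidden -> output weights
    b2 = theta[24:25]                # output bias
    return np.tanh(x @ w1 + b1) @ w2 + b2

def metric(theta, x, y):
    """The optimization metric: mean squared error on the training data."""
    return np.mean((program(theta, x) - y) ** 2)

# Toy task: fit y = sin(x) on [-3, 3].
x = np.linspace(-3.0, 3.0, 128).reshape(-1, 1)
y = np.sin(x)

# The 'list of numbers' that picks one program out of the family.
theta = rng.normal(scale=0.5, size=25)

# Gradient descent (here via central finite differences, for simplicity):
# search the program family for candidates that score well on the metric.
lr, eps = 0.05, 1e-5
for step in range(2000):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump[i] = eps
        grad[i] = (metric(theta + bump, x, y) - metric(theta - bump, x, y)) / (2 * eps)
    theta -= lr * grad

print("final value of the metric:", metric(theta, x, y))
```

Nothing in this loop says anything about what the resulting program does off the training distribution, or why it scores well; it only says that the metric went down.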
This is already not great from a safety perspective. But it turns out that the resulting programs are also generally totally inscrutable to human inspection! It takes additional work, which we don't currently know how to do in general, to take a learned program and turn it into a well-documented and human-legible codebase.
That paradigm is basically a recipe for disaster. And the entire field of AI Safety, and this entire sequence, is about coming up with a different recipe, which reliably allows us to safely build and deploy highly capable software systems.
Up next: back to game theory, where I think optimization can play an important role in achieving good outcomes.