Presumably some kinds of AI systems, architectures, methods, and ways of building complex systems out of ML models are safer or more alignable than others. Holding capabilities constant, you'd be happier to see some kinds of systems than others.
For example, Paul Christiano suggests "LM agents are an unusually safe way to build powerful AI systems." He says "My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes."
My quick list is below; I'm interested in object-level suggestions, meta observations, reading recommendations, etc. I'm particularly interested in design-properties rather than mere safety-desiderata, but safety-desiderata may inspire lower-level design-properties.
All else equal, it seems safer if an AI system:
- Is more interpretable:
  - If its true thoughts are transparent and expressed in natural language (see e.g. Measuring Faithfulness in Chain-of-Thought Reasoning);
  - (what else?);
- Has humans in the loop (even better to the extent that they participate in or understand its decisions, rather than just approving inscrutable decisions);
- Decomposes tasks into subtasks in comprehensible ways, and in particular if the interfaces between subagents performing subtasks are transparent and interpretable (see the sketch after this list);
- Is more supervisable or amenable to AI oversight (what low-level properties determine this besides interpretable-ness and decomposing-tasks-comprehensibly?);
- Is feedforward-y rather than recurrent-y (because recurrent-y systems have hidden states? so this is part of interpretability/overseeability?);
- Is myopic;
- Lacks situational awareness;
- Lacks various dangerous capabilities (coding, weapon-building, human-modeling, planning);
- Is more corrigible (what lower-level desirable properties determine corrigibility? what determines whether systems have those properties?) (note to self: see 1, 2, 3, 4, and comments on 5);
- Is legible and process-based;
- Is composed of separable narrow tools;
- Can't be run on general-purpose hardware.
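To make the human-in-the-loop and comprehensible-decomposition items above a bit more concrete, here's a minimal sketch of a scaffold with those two properties. It's purely illustrative: `call_model` is a hypothetical stand-in for whatever LM API you'd actually use, and the planner/worker split is the simplest decomposition I could write down.

```python
# Minimal sketch of two properties from the list above:
#   (1) tasks decomposed into subtasks with natural-language interfaces, and
#   (2) a human approval gate before any subtask runs.
from dataclasses import dataclass, field


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LM API call."""
    raise NotImplementedError("plug in a real model here")


@dataclass
class Transcript:
    """Every inter-agent message is a plain-language string kept in one log,
    so a human (or an overseer model) can read the whole decomposition."""
    messages: list[tuple[str, str]] = field(default_factory=list)  # (role, text)

    def log(self, role: str, text: str) -> None:
        self.messages.append((role, text))
        print(f"[{role}] {text}")


def human_approves(subtask: str) -> bool:
    """Human-in-the-loop gate: the human sees the subtask in natural language,
    not an opaque internal representation."""
    return input(f"Approve subtask {subtask!r}? [y/N] ").strip().lower() == "y"


def run(task: str) -> str:
    transcript = Transcript()
    transcript.log("user", task)

    # The planner emits subtasks as plain text, one per line.
    plan = call_model(
        f"Break this task into short, independent subtasks, one per line:\n{task}"
    )
    transcript.log("planner", plan)

    results = []
    for subtask in filter(None, (line.strip() for line in plan.splitlines())):
        if not human_approves(subtask):
            transcript.log("human", f"rejected: {subtask}")
            continue
        # The worker sees only the natural-language subtask, not the planner's internals.
        result = call_model(
            f"Complete this subtask and report the result in plain English:\n{subtask}"
        )
        transcript.log("worker", result)
        results.append(result)

    summary = call_model(
        "Combine these subtask results into a final answer:\n" + "\n".join(results)
    )
    transcript.log("planner", summary)
    return summary
```

The point isn't the scaffold itself; it's that every interface between subagents is a human-readable string in a single transcript, and nothing executes without a human reading that string first.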
These properties overlap a lot. Also note that there are nice-properties at various levels of abstraction, like both "more interpretable" and [whatever low-level features make systems more interpretable].
If a path (like LM agents) or design feature is relatively safe, it would be good for labs to know that. An alternative framing for this question is: what should labs do to advance safer kinds of systems?
Obviously I'm mostly interested in properties that might not require much extra cost or capabilities sacrifice relative to unsafe systems. A method or path for safer AI is ~useless if it's far behind unsafe systems.
All else equal, I think minimizing model entropy (i.e. the number of weights) is desirable. In other words, you want to keep the size of the model class small.
Roughly, alignment could be viewed as constructing a list of constraints or criteria that a model must satisfy in order to be considered safe. As the size of the model class grows, more models will satisfy any particular constraint. The complexity of the constraints likely needs to grow along with the complexity of the model class.
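As a toy illustration of that scaling (the uniform fraction p and the independence assumption below are mine, just to make the counting visible):

```latex
% Toy counting sketch. Suppose the model class \mathcal{M} consists of all
% settings of n weights at b bits of precision, and suppose each of k
% constraints is satisfied (independently, by assumption) by a fraction p
% of models. Then
\[
  \lvert \mathcal{M} \rvert \le 2^{bn},
  \qquad
  \#\{\text{models satisfying all } k \text{ constraints}\}
  \;\approx\; p^{k}\,\lvert \mathcal{M} \rvert .
\]
% Driving the number of surviving models down to ~1 requires roughly
\[
  k \;\gtrsim\; \frac{\log_2 \lvert \mathcal{M} \rvert}{\log_2 (1/p)} ,
\]
% which scales with \log_2 |\mathcal{M}| \le bn, i.e. with the number of weights.
```

Under this toy model, minimizing model entropy just means shrinking the size of the model class, which shrinks the number of constraints you need to write down.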
If a large number of models satisfy all the constraints, there is a large amount of behavior that is unconstrained and unaccounted for. We've decided that we don't care about any of the behavioral differences between the models that satisfy all the constraints.
This isn't necessarily true, though. Modern DL models are semi-organically grown rather than engineered, so the set of SGD-discoverable models is much smaller than the set of all possible models. And techniques like iterative amplification further shrink the set of learnable models. Or maybe many of the models are behaviorally identical on the subset of inputs we care about.
That said, thinking about model entropy seems helpful.
Why would we expect the danger posed by a model of a certain size to rise, in expectation, as the set of potential solutions grows?