Presumably some kinds of AI systems, architectures, methods, and ways of building complex systems out of ML models are safer or more alignable than others. Holding capabilities constant, you'd be happier to see some kinds of systems than others.
For example, Paul Christiano suggests "LM agents are an unusually safe way to build powerful AI systems." He says "My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes."
My quick list is below; I'm interested in object-level suggestions, meta observations, reading recommendations, etc. I'm particularly interested in design-properties rather than mere safety-desiderata, but safety-desiderata may inspire lower-level design-properties.
All else equal, it seems safer if an AI system:
- Is more interpretable
  - If its true thoughts are transparent and expressed in natural language (see e.g. Measuring Faithfulness in Chain-of-Thought Reasoning)
  - (what else?);
- Has humans in the loop (even better to the extent that they participate in or understand its decisions, rather than just approving inscrutable decisions);
- Decomposes tasks into subtasks in comprehensible ways, and in particular if the interfaces between subagents performing subtasks are transparent and interpretable (see the sketch just after this list);
- Is more supervisable or amenable to AI oversight (what low-level properties determine this besides interpretable-ness and decomposing-tasks-comprehensibly?);
- Is feedforward-y rather than recurrent-y (because recurrent-y systems have hidden states? so this is part of interpretability/overseeability?);
- Is myopic;
- Lacks situational awareness;
- Lacks various dangerous capabilities (coding, weapon-building, human-modeling, planning);
- Is more corrigible (what lower-level desirable properties determine corrigibility? what determines whether systems have those properties?) (note to self: see 1, 2, 3, 4, and comments on 5);
- Is legible and process-based;
- Is composed of separable narrow tools;
- Can't be run on general-purpose hardware.
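
To make the decomposition/transparent-interfaces item a bit more concrete, here's a minimal sketch (mine, not drawn from any particular system; `lm()` is a placeholder for a generic language-model call) in which subagents communicate only via logged natural-language strings that an overseer can read:

```python
# A minimal sketch, assuming a generic `lm()` call (placeholder, not a real
# API): subagents only exchange plain natural-language strings, and every
# message is logged, so a human overseer can read exactly what each piece
# was asked to do and what it reported back.

from typing import List

transcript: List[str] = []  # human-readable log of every inter-agent message


def lm(prompt: str) -> str:
    """Placeholder for a real language-model call."""
    raise NotImplementedError


def send(sender: str, receiver: str, message: str) -> str:
    """All communication between subagents goes through this plain-text channel."""
    transcript.append(f"{sender} -> {receiver}: {message}")
    return message


def solve(task: str) -> str:
    # Planner subagent: break the task into natural-language subtasks.
    plan = lm(f"List the subtasks needed to do: {task}\nOne per line.")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # Worker subagent: handle each subtask, reporting results in plain English.
    reports = []
    for sub in subtasks:
        request = send("planner", "worker", sub)
        answer = lm(f"Complete this subtask and report in plain English: {request}")
        reports.append(send("worker", "planner", answer))

    # Final synthesis sees only the same plain-text reports the overseer sees.
    return lm("Combine these subtask reports into a final answer:\n" + "\n".join(reports))
```

The point is just that the interfaces (the `transcript`) are plain text at the level the overseer cares about, rather than hidden activations passed between models.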
These properties overlap a lot. Also note that there are nice-properties at various levels of abstraction, like both "more interpretable" and [whatever low-level features make systems more interpretable].
If a path (like LM agents) or design feature is relatively safe, it would be good for labs to know that. An alternative framing for this question is: what should labs do to advance safer kinds of systems?
Obviously I'm mostly interested in properties that might not require much extra cost or capability sacrifice relative to unsafe systems. A method or path for safer AI is ~useless if it's far behind unsafe systems.
My improved (but still somewhat bad, pretty non-exhaustive, and in-progress) list:
(Some comments reference the original list, so rather than edit it I put my improved list here.)
More along these lines (e.g., the sorts of things that might improve the safety of a near-human-level assistant AI):
- Architecture/design:
  - The system uses models trained with gradient descent as non-agentic pieces only, and combines them using classical AI (a sketch of what this could look like follows the list).
  - Models trained with gradient descent are trained on closed-ended tasks only (e.g. next-token prediction on a past dataset).
  - The system takes advantage of mild optimization (one formalization is sketched after the list).
    - Learning human reasoning patterns counts as mild optimization.
  - If recursion or amplification is used to improve results, we might want more formal m...
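
To gesture at what "non-agentic pieces combined using classical AI" could look like, here's a minimal sketch (mine; `score_step` and `candidate_steps` are illustrative placeholders) where a learned model is only ever queried as a pure scoring function and all of the plan-selection logic is ordinary, inspectable search code:

```python
# Hedged sketch (my illustration): a gradient-descent-trained model is used
# only as a non-agentic scoring function, while the part that searches over
# plans and picks one is plain, inspectable classical code (best-first search).

import heapq
from typing import Callable, List, Tuple


def score_step(partial_plan: List[str], candidate_step: str) -> float:
    """Placeholder for a learned model that rates how promising a step is."""
    raise NotImplementedError


def best_first_plan(
    candidate_steps: Callable[[List[str]], List[str]],  # classical step generator
    max_len: int = 5,
    beam: int = 10,
) -> List[str]:
    # Frontier of (negative cumulative score, plan) pairs; heapq pops best-first.
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    while frontier:
        neg_score, plan = heapq.heappop(frontier)
        if len(plan) == max_len:
            return plan  # classical stopping rule, visible to an overseer
        expansions = []
        for step in candidate_steps(plan):
            s = score_step(plan, step)  # the model only scores; it never picks
            expansions.append((neg_score - s, plan + [step]))
        for item in sorted(expansions)[:beam]:  # classical code does the picking
            heapq.heappush(frontier, item)
    return []
```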
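
And to make "mild optimization" a bit more concrete, here's a minimal sketch of one formalization, quantilization (my example; quantilization is just one way to cash this out, and the names and parameters are illustrative):

```python
# Hedged illustration: a quantilizer implements "mild optimization" by sampling
# from the top q fraction of actions drawn from a trusted base distribution,
# instead of taking the argmax of a (possibly misspecified) utility estimate.

import random
from typing import Callable, List, TypeVar

A = TypeVar("A")


def quantilize(
    sample_base: Callable[[], A],   # e.g. actions from a model imitating humans
    utility: Callable[[A], float],  # learned/estimated utility; may have errors
    q: float = 0.1,                 # fraction of top-scoring samples to keep
    n: int = 1000,                  # number of base-distribution samples
) -> A:
    samples = [sample_base() for _ in range(n)]
    samples.sort(key=utility, reverse=True)
    top = samples[: max(1, int(q * n))]
    return random.choice(top)  # good-but-not-extremal action


# Toy usage: prefer samples near 100, but only from what the base distribution
# actually produces.
print(quantilize(lambda: random.gauss(0, 50), utility=lambda x: -abs(100 - x), q=0.05))
```

With q = 1 this reduces to imitating the base distribution; as q approaches 0 it approaches a maximizer; intermediate q limits how hard the system can exploit errors in the utility estimate.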