I thought I was going to have more ideas, but after sitting around a bit I'll just post what I had in terms of architectural/design choices for friendliness that are disjoint from safety choices:
My improved (but still somewhat bad, pretty non-exhaustive, and in-progress) list:
(Some comments reference the original list, so rather than edit it I put my improved list here.)
More along these lines (e.g. sorts of things that might improve safety of a near-human-level assistant AI):
Maybe relatively safe if:
Another factor in which type of AGI is safest is whether it can practically be built soon.
The perfect is the enemy of the good. A perfectly safe system that will be deployed five years after the first self-improving AGI is probably useless.
Of course the safest path is to never build an agentic AGI. But that seems unlikely.
This criterion is another argument for language model agents. I've outlined their list of safety advantages here.
Of course, we don't know if language model agents will achieve full AGI.
Another path to AGI that seems both achievable and alignable is loosely brainlike AGI, along the lines of LeCun's proposed H-JEPA. Steve Byrnes' "plan for mediocre alignment" seems extensible to become quite a good plan for this type of AGI.
Someone anonymously suggests:
- Keep pretraining closely matching the human text distribution
- Require all rumination to be done in natural language
- Require approval before taking consequential actions
- Only deploy a model with a second adversarial monitor
- Only deploy a model if it's "dumb" in a quantifiable way
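Two of those suggestions (approval before consequential actions, a second adversarial monitor) are concrete enough to sketch. A minimal illustration follows; every name here (run_with_oversight, is_consequential, the agent/monitor/approval callables) is hypothetical, not an existing API:

```python
# Hypothetical sketch of "require approval before consequential actions" plus
# "only deploy with a second adversarial monitor". Nothing here is a real API;
# agent_step, monitor, and get_human_approval stand in for model/human calls.

def is_consequential(action: dict) -> bool:
    # Crude allowlist: anything that touches the outside world needs review.
    return action.get("tool") in {"send_email", "http_post", "execute_code"}

def run_with_oversight(agent_step, monitor, get_human_approval, state):
    action = agent_step(state)            # agent proposes a tool call
    verdict = monitor(state, action)      # independent adversarial model critiques it
    if verdict.get("flagged"):
        return {"status": "blocked_by_monitor", "reason": verdict.get("reason")}
    if is_consequential(action) and not get_human_approval(state, action):
        return {"status": "blocked_by_human"}
    return {"status": "approved", "action": action}
```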
A note on this part:
If its true thoughts are transparent and expressed in natural language...
Claim: it is impossible for the internal thoughts of a mind to both be expressed mostly in natural language, and to use that natural language in a way at all similar to humans (for instance, no steganography). The reason is that, the way humans use natural language in practice, the words are largely used to "pull mental levers", and then most of the load-bearing reasoning happens in the non-linguistic cognition those levers induce. So if language is used the way humans use it, then most of the cognition is hidden away in non-linguistic channels.
You can see this in practice every time someone has trouble expressing their thoughts in words. Or every time someone is able to express their thoughts in words to someone who already has a bunch of shared mental models (even without necessarily shared jargon to point to those models - e.g. one can say "you know that thing where..." and give a couple examples), but is unable to express their thoughts in words to someone who doesn't already have the relevant mental models.
The closest unblocked requirement would be an AI with lots of parts separated by internal interfaces, where those interfaces use natural language in a human-like way. There's still a lot of non-natural-language cognition going on in between the natural language in and the natural language out, but the interfaces might still provide a lot of useful visibility into intermediates. Basically all of today's LLMs would be examples of such an architecture: they take natural language in, do some magic, then spit out one more token of natural language.
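As a toy illustration of that architecture (not anyone's actual system; the stage names are made up): components that only talk to each other in natural language, with every inter-component message logged, while whatever happens inside a component stays opaque.

```python
from typing import Callable, List

def run_pipeline(stages: List[Callable[[str], str]], task: str) -> str:
    # Each stage is natural language in, natural language out; the only
    # visible intermediates are the strings passed between stages.
    transcript = [f"TASK: {task}"]
    message = task
    for stage in stages:
        message = stage(message)       # cognition inside the stage is still hidden
        transcript.append(message)
    print("\n---\n".join(transcript))  # human-reviewable interface traffic
    return message

# Trivial stand-in stages (real ones would be separate model calls):
plan = lambda m: f"PLAN: break '{m}' into steps"
critique = lambda m: f"CRITIQUE: the plan '{m}' looks fine"
act = lambda m: f"ACTION: proceeding, given '{m}'"
run_pipeline([plan, critique, act], "summarize the quarterly report")
```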
General comment on this list: the mental image behind most of the items seems to be roughly "break the system into parts, which are individually interpretable and nonmalicious". Intermediates expressed in natural language, decomposition into interpretable subtasks, myopia, minimal hidden state, lack of situational awareness, legibility, process-based, and composition of narrow tools all tie into that general pattern.
These all share similar shortcomings: interpretability/tool-ness/corrigibility/etc are not composable, and also (depending on which versions of these ideas one imagines) the not-yet-well-written-up problem of "you don't get to choose the ontology".
That's not really a problem for any of these properties individually - they're each still potentially worthwhile properties which make AI relatively safer. But in aggregate, bear in mind that there are decreasing marginal returns to properties which all pursue roughly-similar upsides and have roughly-similar shortcomings. If this list is your starting point, then there's probably a lot more marginal value in adding properties which address e.g. the composability problem, rather than adding more properties similar to those already listed.
I think narrow domain might be one of the most important properties. An AI that is very good at designing drugs and nothing else is much safer than an AGI; it doesn't understand itself or the world at large and can't go on a self improvement rampage. It still has risks but they're ordinary "don't let a radical political terrorist in a BSL4 lab" kinda risks.
Yo Shavit says:
I wish we talked more about which particular AI systems design principles give us confidence in safety.
For example: context is always erased; humans review ~all context and output tokens.
These principles are likely to disappear unless we center them in our analysis & demands.
Less worried rn about issues like "RLHF incentivizes deception"
Much more worried about "a pair of mostly-aligned AI systems talk to each other for 3 hours and then make POST requests, and no human actually reviews the full transcript"
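A rough sketch of what those two example principles might look like in code (my framing with invented names, not Yo Shavit's implementation): context is erased whenever a session ends, and no outbound request is made unless a human has reviewed the full transcript.

```python
class Session:
    def __init__(self, model):
        self.model = model
        self.transcript = []                 # everything the models say, end to end

    def chat(self, message: str) -> str:
        reply = self.model(message)
        self.transcript += [("user", message), ("model", reply)]
        return reply

    def post(self, url: str, payload: dict, full_transcript_reviewed: bool):
        # Principle: no POST request unless a human has read the entire transcript.
        if not full_transcript_reviewed:
            raise PermissionError("human has not reviewed the full transcript")
        print(f"POST {url}")                 # placeholder for the actual request

    def close(self):
        self.transcript.clear()              # principle: context is always erased
```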
I think it's good this post exists. But I really want to make the distinction between "safe" and "is a solution to the alignment problem," which this post elides. Or maybe "safe" vs. "friendly"?
If we build a superhuman AGI we'd better have solved the alignment problem in the sense of actually making that AGI want to do good things and not bad things. (Past a certain point, "just follow orders" isn't safe unless the order "do good things" works. If it wouldn't work, you've built an unsafe AI, and if it would work, you might as well give that order.)
OpenAI's "parallelizable alignment assistant" strategy can work for between 0 and 4 organizations in the world, because it relies on having enough of a lead that you can build something that is safe yet not a solution to the alignment problem, and nobody else will cause an accident in the weeks or months you spend trying to convert this into a solution to the alignment problem.
To look at one example property: taking a random AI and putting a human in the loop makes it more safe. But it does little to nothing for solving alignment. It helps when you're building an AI that's dumber than you, but doesn't really help when you're building an AI that's smarter than you.
Or lack of situational awareness. This is actively anti-alignment, because the state of the world is useful information for doing good things. But it's even more anti-capabilities, so it's a fine property to shoot for if you're making an AI that's safe because it has limited capabilities.
I'll come back later with a comment that actually makes suggestions, both ones that trade off for safety and for friendliness.
Well said. I mostly agree, but I'll note that safety-without-friendliness is good as a non-ultimate goal.
Re human in the loop, I mostly agree. Re situational awareness, I mostly agree, and I'll add that lack-of-situational-awareness is sometimes a good way to deprive a system of capabilities not relevant to the task it's designed for; "capabilities" isn't monolithic.
I think my list is largely bad. I think central examples of good ideas include LM agents and process-based systems. (Maybe because they're more fundamental / architecture-y? Maybe because they're more concrete?)
Looking forward to your future-comment-with-suggestions.
If its true thoughts are transparent and expressed in natural language (see e.g. Measuring Faithfulness in Chain-of-Thought Reasoning)
This seems technically true but a bit of a trap, since it may be easier to get ‘looks like it expresses its thoughts in natural language’ than ‘reliably actually does’ and specifying the difference may be too subtle for people.
what lower-level desirable properties determine corrigibility?
Has corrigibility (or alignment) properties embedded in the attention weights.
Here is a comparative analysis from a project in which I use datasets to instruct / hack / sanitize the whole attention mechanism of GPT2-xl in my experiments: a spreadsheet comparing QKV mean weights across various GPT2-xl builds. The spreadsheet currently has four builds, and the numbers shown are the mean weights (half of the attention mechanism, layers 1 to 48, not including the embedding layer):
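For readers who want to reproduce that kind of number, here's my guess at the computation (the spreadsheet's exact procedure isn't specified, so treat this as an assumption): per-layer means of the fused Q/K/V projection weights in GPT2-xl, via Hugging Face transformers.

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
n_embd = model.config.n_embd  # 1600 for GPT2-xl

# 48 transformer blocks; the embedding layer is excluded, matching the spreadsheet.
for i, block in enumerate(model.transformer.h, start=1):
    w = block.attn.c_attn.weight              # shape (n_embd, 3 * n_embd): fused Q, K, V
    q, k, v = w.split(n_embd, dim=1)
    print(f"layer {i:2d}  Q {q.mean().item():+.5f}  "
          f"K {k.mean().item():+.5f}  V {v.mean().item():+.5f}")
```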
Presumably some kinds of AI systems, architectures, methods, and ways of building complex systems out of ML models are safer or more alignable than others. Holding capabilities constant, you'd be happier to see some kinds of systems than others.
For example, Paul Christiano suggests "LM agents are an unusually safe way to build powerful AI systems." He says "My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes."
My quick list is below; I'm interested in object-level suggestions, meta observations, reading recommendations, etc. I'm particularly interested in design-properties rather than mere safety-desiderata, but safety-desiderata may inspire lower-level design-properties.
All else equal, it seems safer if an AI system:
These properties overlap a lot. Also note that there are nice properties at various levels of abstraction, like both "more interpretable" and [whatever low-level features make systems more interpretable].
If a path (like LM agents) or design feature is relatively safe, it would be good for labs to know that. An alternative framing for this question is: what should labs do to advance safer kinds of systems?
Obviously I'm mostly interested in properties that might not require much extra cost or capabilities sacrifice relative to unsafe systems. A method or path for safer AI is ~useless if it's far behind unsafe systems.