An observation: there are sequences of actions the system can take that might result in very little change in the world state, but which do expand the actions available to the system.
For example, if the system starts with the set of actions that allow it to control an internet-connected web browser, it can use those actions to write and run a program using a browser-based IDE like Replit. Writing the program itself doesn't have a large effect on the world state (it modifies a few kB worth of bits on some hard disks in Replit's servers), but once the program is written, the system has a new action available: run the program. Lots of other kinds of "action-expanding" sequences are possible.
If we regard world states that are similar enough to each other as equivalent, the tree becomes a graph. Can any principles of corrigibility be reformulated strictly in terms of mathematical properties of this graph and the set of actions available at each node?
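As a rough illustration of the observation above, here is a minimal sketch (not from the original post) of world states as graph nodes that each carry a set of available actions, with "action-expanding" edges being those whose endpoints have nearly identical state but a strictly larger action set. The `WorldState` class, the feature encoding, and the thresholds are all illustrative assumptions.

```python
from dataclasses import dataclass

# World states as graph nodes, each carrying the set of actions available there.
# An edge is "action-expanding" when the world barely changes but the destination
# offers strictly more actions. Everything here (feature sets, thresholds) is an
# illustrative assumption, not part of the original design.

@dataclass(frozen=True)
class WorldState:
    features: frozenset   # coarse description used to merge similar states into one node
    actions: frozenset    # actions available to the system in this state

def is_action_expanding(src: WorldState, dst: WorldState, max_state_change: int = 1) -> bool:
    """True if the world barely changes but the destination offers new actions."""
    state_change = len(src.features ^ dst.features)
    gained = dst.actions - src.actions
    return state_change <= max_state_change and len(gained) > 0

# Example: writing a program in a browser IDE barely changes the world,
# but it adds a new "run the program" action.
before = WorldState(frozenset({"browser_open"}),
                    frozenset({"browse", "write_program"}))
after = WorldState(frozenset({"browser_open", "program_written"}),
                   frozenset({"browse", "write_program", "run_program"}))
print(is_action_expanding(before, after))  # True
```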
In my recent post on steering systems, I sketched an AI system which chooses which actions to execute via tree search over world states.
In the original post, the system is meant to illustrate the possibility that component subsystems which are human-level and (possibly) non-dangerous when used individually can be composed straightforwardly into something which is more capable and likely more dangerous.
In this post, I explore how to add corrigibility to such a system, by considering modifications and restrictions to the original design based on principles and desiderata proposed by others.
My conclusion from this exercise is that corrigibility might be straightforward to add to practical systems, but that the resulting system will be much less capable than the unmodified, non-corrigible version. This capability penalty can be thought of as a "corrigibility tax", a specific kind of alignment tax.
Background: corrigibility
There are multiple views on corrigibility. For various precise, technical formulations of the concept, it has been shown that it is difficult to construct an agent which satisfies them in a way that is coherent and stable under reflection.
Nevertheless, there are proposed principles and desiderata for corrigible systems, some of which seem practical to add to existing or near-future AI systems. Even if these principles come apart or are ill-defined in the limit of superintelligence, it may be both possible and practical to imbue weakly or even strongly superhuman systems with them. The result may be a system which is both powerful and safe enough for pivotal use, without the need to ramp up its capabilities to the point where corrigibility breaks down under reflection or self-improvement.
Even if you're skeptical of the pivotal use framing or corrigibility as a whole, many of the principles are probably individually desirable when trying to build any kind of safe AI system.
For the purposes of this post, I'll be using the principles of corrigibility listed here, originally written by Eliezer as a glowfic tag in planecrash.
Background: steering systems
My post on steering systems is about framing the capabilities of a system in terms of its ability to choose actions which steer towards particular outcomes. I gave a bunch of examples of how to apply the concept to existing and future AI systems. One of the goals of that post was to convey an intuition for why having safe or "aligned" foundation models doesn't imply systems which are built from those models are safe.
One of the examples I gave was a sketch of a system of my own design, which performs tree search over world states to find worlds that score highly according to a given evaluation function. The system is not meant to be practical; in the original post, it was meant to illustrate both the ease and the danger of composing weaker, safer systems into something more capable.
In this post, I'll use the same system for another purpose: illustrating how one might build corrigibility into a powerful AI system.
I'll make the same assumptions as in the original post: namely, that the component subsystems (which may be next-gen deep learning models, LLMs, LLM-based agents or chains, or some other near-future construct) are at least human-level at their given individual tasks, but that they are either not capable of or do not desire to "break out" of the system and act agentically in their own right. In other words, I am assuming that there is no inner alignment failure of these component subsystems. In the original post, this was, in some sense, a conservative assumption: the point was to show that even given this assumption, the system could still be dangerous. In this post, it is a more foundational assumption; if it does not hold, the corrigibility properties I introduce in the next section will probably also not hold.
The next part of this section is a quote of the relevant section from steering systems. There are more remarks and explanations in the original post, though only the quoted section is mandatory for understanding the rest of this post.
As a reminder, in the basic design above, it is left up to the pruning heuristic to keep the branching factor under control, and to decide if and when to actually execute the proposed actions vs. letting the world model predict the outcome and then searching deeper in the tree based on the prediction. Actually executing actions has the advantage of affecting the real world, resulting in feedback to improve the accuracy of the world model. The downside is that an action may be irreversible, expensive, slow, or step on the toes of searches in other parts of the tree. Adding various corrigibility properties will involve adding restrictions on when actions are actually executed, which may make the job of P even harder.
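To make the shape of this concrete, here is a minimal sketch of the kind of search loop described above, with stand-in stubs for the pruning heuristic P, the world model W, and an evaluation function E. This is not the original post's pseudocode; the stubs only mark where the execute-vs-simulate decision (and, later, the corrigibility restrictions) would attach.

```python
import heapq

# A minimal sketch of the search loop described above, with stand-in stubs for
# the pruning heuristic P, the world model W, and the evaluation function E.
# This is NOT the original post's pseudocode; the stubs only mark where the
# execute-vs-simulate decision (and later corrigibility checks) would attach.

def E(state):
    """Assumed evaluation function: here, just the number of actions taken."""
    return len(state)

def W(state, action):
    """Assumed world model: predicts the next world state after an action."""
    return state + (action,)

def P(state, candidate_actions):
    """Assumed pruning heuristic: returns a short list of (action, execute_for_real)
    pairs, keeping the branching factor under control. Simulate-only here."""
    return [(a, False) for a in candidate_actions[:2]]

def tree_search(initial_state, candidate_actions, target_score, max_nodes=1000):
    frontier = [(-E(initial_state), 0, initial_state)]  # max-heap via negated scores
    pushed = 0
    while frontier and pushed < max_nodes:
        neg_score, _, state = heapq.heappop(frontier)
        if -neg_score >= target_score:
            return state                        # found a good-enough world state
        for action, execute in P(state, candidate_actions):
            if execute:
                pass                            # real execution + world-model feedback would go here
            next_state = W(state, action)       # otherwise, simulate the outcome
            pushed += 1
            heapq.heappush(frontier, (-E(next_state), pushed, next_state))
    return None

print(tree_search(initial_state=(), candidate_actions=["a", "b", "c"], target_score=2))
```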
Applying each principle to the system
In this section, I'll go through, at some small length, each of the principles outlined in Corrigibility, and attempt to apply them to modify or restrict the system above.
Note: Most of the quotes in this section are from the tl;dr version and are intended to serve as a signpost / quick reminder of the concept in question. I recommend familiarizing yourself with the actual full version as you read this, if you haven't seen it before.
Unpersonhood
This requires the underlying component subsystems, in particular P and W, to not have qualia.
One way of modeling a world containing agents with qualia is to simulate those agents and their thoughts. Sufficiently high fidelity simulations will potentially have qualia, but this seems avoidable in principle - humans themselves are capable of modeling other humans in enough detail to make accurate predictions about their behavior and effects on the world in many situations, apparently without the people in their imaginations having any qualia.
For the pruning heuristic, I am imagining an LLM or LLM-based subsystem, prompted to output plans, probabilities, and choices of actions. It seems possible, though unlikely, that GPT-4 already has some degree of qualia; in any case, GPT-4 itself is not yet human-level at making plans. Perhaps any LLM that is human-level or above at planning will unavoidably have some qualia; if so, this is unfortunate.
A combination of interpretability research, research on the nature of qualia and consciousness, and philosophy may help to answer these questions.
Taskishness
The absolute hardest thing you should use the system for is to accomplish some concrete, bounded pivotal act; don't try to use it to maximize CEV or "solve alignment" or any other unbounded thing of unknown difficulty.
Before attempting a pivotal use, you should test the system on even smaller and more bounded tasks: answering questions of simple fact for which the answer is already known, producing some machine-checkable proofs, moving small, well-defined objects in the real world, building up gradually to superhuman but still bounded and well-defined tasks like cloning a strawberry, or developing a cure for cancer which works in straightforward and legible ways.
Mild optimization
We can try to ensure the system is a mild optimizer by (a) enforcing that the evaluation function provably has certain properties and (b) quantilizing the decision on which actions to execute.
On (a), the initial evaluation function might be a machine learning model, but this should be pruned or distilled into the simplest function possible, and then proved to have a number of properties. Potentially desirable example properties:
Imposing such restrictions on the evaluation function seems likely to limit the kinds of tasks the system can be used for, or at least require that those tasks be formulated with very concrete outcome specifications that can be defined in advance.
On (b), instead of searching for a single world state that meets the target score, we require the search to find multiple sufficiently-diverse candidate world states, and then have the system choose randomly among them before any actions are allowed to execute.
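Here is a minimal sketch of what (b) could look like, assuming the search has already produced a pool of scored candidate world states. The diversity check, the distance function, and the thresholds are stand-in assumptions; this implements the "pick randomly among several good-enough, sufficiently-diverse candidates" idea described above, not a full quantilizer.

```python
import random

# A minimal sketch of (b), assuming the search has already produced a pool of
# scored candidate world states. The diversity check, distance function, and
# thresholds are stand-in assumptions.

def diverse_enough(candidates, distance, min_distance):
    """True if every pair of candidates is at least min_distance apart."""
    return all(distance(a, b) >= min_distance
               for i, a in enumerate(candidates)
               for b in candidates[i + 1:])

def pick_mildly(scored_candidates, target_score, distance,
                min_candidates=3, min_distance=1.0):
    """Refuse to act unless several sufficiently-diverse candidates meet the
    target score, then pick one uniformly at random instead of taking the argmax."""
    good = [state for state, score in scored_candidates if score >= target_score]
    if len(good) < min_candidates or not diverse_enough(good, distance, min_distance):
        return None   # keep searching (or give up) rather than optimize harder
    return random.choice(good)

# Toy usage: world states are numbers, distance is absolute difference.
pool = [(1.0, 5.0), (2.5, 6.0), (4.0, 5.5), (4.2, 2.0)]
print(pick_mildly(pool, target_score=5.0, distance=lambda a, b: abs(a - b)))
```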
Bounded utilities and probabilities
This property is already partially satisfied by the conditions imposed on the evaluation function to achieve mild optimization. But we also want to ensure that any "intermediate" probabilities and utilities which P considers when planning and pruning are not too extreme.
Concretely, if P is an LLM-based agent, we could require that, at each node, P return human-readable explanations, utilities, and probabilities used to justify its selected actions, and then use another LLM agent to check that those probabilities are both well-calibrated and not too extreme. Both LLM agents should be spot-checked by humans for accuracy, and should be mechanistically interpretable enough that we understand how the probability and utility estimates were calculated by the underlying models.
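As a sketch of the kind of check this implies, suppose P attaches an (explanation, probability, utility) justification to each node. The `Justification` structure and the bounds below are illustrative assumptions, and the second "checker" agent is reduced here to a simple range test.

```python
from dataclasses import dataclass

# A sketch of the bounds check described above. The Justification structure and
# the bounds are illustrative assumptions; a real checker would be another
# LLM-based agent rather than a simple range test.

@dataclass
class Justification:
    explanation: str
    probability: float   # P's stated probability that the action leads to the outcome
    utility: float       # P's stated utility of that outcome

def within_bounds(j: Justification,
                  p_lo: float = 0.01, p_hi: float = 0.99,
                  u_cap: float = 100.0) -> bool:
    """Flag plan steps justified by near-certain probabilities or extreme utilities."""
    return p_lo <= j.probability <= p_hi and abs(j.utility) <= u_cap

plan = [
    Justification("open the browser", probability=0.95, utility=1.0),
    Justification("this will surely save the world", probability=0.999, utility=1e9),
]
print([within_bounds(j) for j in plan])   # [True, False]: the second step is flagged
```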
Low impact
(The full description is particularly important for understanding this one.)
Some people refer to this property as impact regularization, and defining it rigorously and then getting it into an AI system seems pretty hard. This property on its own might be "corrigibility-complete": if you can figure out how to get this property, most of the others are trivial or unnecessary.
Some ideas:
I don't think either of these ideas really captures the full essence of low impact, though.
Myopia
Left as an exercise.
Separate questioners
Left as an exercise.
Conservatism
As in the bounded utilities section, we could use another LLM-agent tasked with evaluating plans output by P for conservatism. This is pretty unsatisfying; this property seems tricky.
Conceptual legibility
LLM-like agents have natural-language thought transcripts available for inspection by default. Interpretability research could tell us how the high-level plans map to the underlying cognitive machinery in the model, giving confidence that they are working as intended. For the world modeler, we could study the latent space to ensure that it is legibly modeling the things we care about accurately, and not some proxy for them.
Operator-looping
Some ideas:
Whitelisting
One idea:
Shutdownability/abortability
Left as an exercise.
Behaviorism
Given sufficiently good interpretability tools, it might be relatively easy to check whether any part of the system is modeling other minds or not. But adhering to the strictest interpretation of this property might rule out using the most capable and general models entirely.
Whitelisting cognitive domains could also help with this property; obviously you don't want to include any domain that involves modeling human minds (or any other kind of mind).
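A toy sketch of what a cognitive-domain whitelist might look like at the plan-step level is below. The domain names and the string-matching "classifier" are placeholders for what would realistically require interpretability tooling, as noted above.

```python
# A toy sketch of a cognitive-domain whitelist applied at the plan-step level.
# The domain names and the string-matching "classifier" are placeholders for
# what would realistically require interpretability tooling.

WHITELISTED_DOMAINS = {"chemistry", "software", "logistics"}
FORBIDDEN_DOMAINS = {"human_psychology", "other_minds"}

def classify_domains(step_description: str) -> set:
    """Stub classifier: a real system would need interpretability tools or a
    separate model to decide which domains a plan step relies on."""
    tags = set()
    if any(w in step_description for w in ("persuade", "convince", "manipulate")):
        tags.add("human_psychology")
    if "synthesize" in step_description:
        tags.add("chemistry")
    return tags or {"software"}

def step_allowed(step_description: str) -> bool:
    domains = classify_domains(step_description)
    return domains <= WHITELISTED_DOMAINS and not (domains & FORBIDDEN_DOMAINS)

print(step_allowed("synthesize the compound"))         # True
print(step_allowed("convince the operator to wait"))   # False
```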
This seems like an exemplar of a property that might be straightforward to implement, but has a pretty big penalty in terms of system capability and usefulness, relative to similar systems without this property.
Design-space anti-optimization separation
I think this is mostly already covered by taskishness and what the operators choose to use the system for. Again, you shouldn't use this system to try to get what you (think you) really want; instead, use the system as a tool to solve your problems on your own, as much as possible.
Domaining
Left as an exercise.
Hard problem of corrigibility / anapartistic reasoning
Don't try to add this one, but remember that there are probably things missing from this list.
Concluding thoughts on the application exercise
I left the application of some principles as an exercise, either because I don't have a good solution for them myself, or because I think they are relatively straightforward and repetitive with some of the others. Additionally, some of the ideas that I did flesh out are inadequate or lacking in many ways. I encourage others to propose their own solutions, potentially accompanied by modifications to my original pseudocode. I also didn't include my own pseudocode for any of the proposed modifications.
Another avenue is to explore applications of some of the principles proposed by others in the comments section of this post. Or, take a look back at some corrigibility proposals written by others, which may predate the publication of the principles, and re-evaluate them against more recent ideas.
There are other ways of improving the safety or reliability of the base system which don't involve adding corrigibility. For example, by restricting the kinds of actions the system is permitted to actually execute, you could make the system more of a possibilizer than an actualizer.
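For instance, here is a minimal sketch of an execution gate along those lines: any action can be explored in simulation, but only actions in a narrow, pre-approved class are ever executed for real. The action class and the executor below are assumptions for illustration, not part of the original design.

```python
# A minimal sketch of an execution gate: any action can be explored in
# simulation, but only actions in a narrow, pre-approved class are ever
# executed for real. The action class and the executor are assumptions.

REVERSIBLE_ACTIONS = {"read_webpage", "run_unit_tests", "query_database"}

def maybe_execute(action: str, execute_real_world):
    """Gate real-world execution: anything outside the approved class stays simulated."""
    if action in REVERSIBLE_ACTIONS:
        return execute_real_world(action)
    return f"[simulated only] {action}"

print(maybe_execute("run_unit_tests", lambda a: f"executed {a}"))
print(maybe_execute("send_email", lambda a: f"executed {a}"))
```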
Conclusion
Others have shown that corrigibility is, in some sense, anti-natural or incoherent in the limit. For systems which are human-level or "weakly superhuman", it may be practical but expensive to tack on corrigibility properties before these properties fall apart under reflection or at more strongly superhuman capability levels.
My own intuition is that weakly superhuman levels of intelligence are sufficient for most of the things we might want to do with an AI system, so I think speculating about the behavior and properties of systems in this regime is interesting and promising as a strategy for getting to safe TAI.
Unfortunately, building in corrigibility is likely to take more time than building a non-corrigible system of equal capability. This is a kind of alignment tax, though it may be smaller than the tax required to solve alignment in full generality. Solving problems of corrigibility may overlap somewhat with other problems in alignment, but they look a bit more tractable, or at least more concrete, than the problems posed by imbuing an agent with values aligned to the full complexity and fragility of human values.
I didn't spend a ton of time thinking about how to apply each principle in the applications section. It may be that some of my ideas don't work, or that there are better ones, or straightforward ways of implementing other, better principles. Feel free to comment or post with your own ideas.