Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by Safeguarded AI (formerly known as an Open Agency Architecture, or OAA), if it turns out to be feasible.

1. Value is fragile and hard to specify.

See: Specification gaming examples, Defining and Characterizing Reward Hacking[1]

OAA Solution:

1.1. First, instead of trying to specify "value", "de-pessimize": specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is harder than merely mitigating extinction risk. This doesn't mean limiting the power of underlying AI systems so that they can only do bounded tasks, but rather containing that power and limiting its use.

Note: The absence of a catastrophe is also still hard to specify and will take a lot of effort, but the hardness is concentrated on bridging between high-level human concepts and the causal mechanisms in the world by which an AI system can intervene. For that...

1.2. Leverage human-level AI systems to automate much of the cognitive labor of formalizing scientific models—from quantum chemistry to atmospheric dynamics—and formalizing the bridging relations between levels of abstraction, so that we can write specifications in a high-level language with a fully explainable grounding in low-level physical phenomena. Physical phenomena themselves are likely to be robust, even if the world changes dramatically due to increasingly powerful AI interventions, and scientific explanations thereof happen to be both robust and compact enough for people to understand. 

2. Corrigibility is anti-natural.

See: The Off-Switch Game, Corrigibility (2014)

OAA Solution: (2.1) Instead of building in a shutdown button, build in a shutdown timer. See You can still fetch the coffee today if you're dead tomorrow. This enables human stakeholders to change course periodically (as long as the specification of non-catastrophe is good enough to ensure that most humans remain physically and mentally intact).
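To make the distinction from a shutdown button concrete, here is a minimal illustrative sketch (mine, not OAA's actual machinery): a trajectory score that, by construction, discards everything after a fixed timer expiry, so nothing the system could cause after the timer can change its score. The names (`Event`, `SHUTDOWN_T`, `score`) and the scoring rule are hypothetical.

```python
from dataclasses import dataclass

SHUTDOWN_T = 100  # timer expiry, in abstract time steps (illustrative)

@dataclass
class Event:
    time: int
    catastrophe: bool
    task_done: bool

def score(trajectory: list[Event]) -> float:
    """Score in [0, 1]; events at or after SHUTDOWN_T are discarded, so no term
    in the score depends on what happens after the timer expires."""
    visible = [e for e in trajectory if e.time < SHUTDOWN_T]
    if any(e.catastrophe for e in visible):
        return 0.0
    return 1.0 if any(e.task_done for e in visible) else 0.5
```

Because the truncation happens inside the scoring function itself, persuading humans to extend the timer (or otherwise circumventing it) earns the policy nothing, absent an inner alignment failure.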

3. Pivotal processes require dangerous capabilities.

See: Pivotal outcomes and pivotal processes

OAA Solution: (3.1) Indeed, dangerous capabilities will be required. Push for reasonable governance. This does not mean creating one world government, but it does mean that the objectives of a pivotal process will need to be negotiated and agreed upon internationally. Fortunately, for now, dangerous capabilities seem to depend on having large amounts of computing hardware, which can be controlled like other highly dangerous substances.

4. Goals misgeneralize out of distribution.

See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning

OAA Solution: (4.1) Use formal methods with verifiable proof certificates[2]. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to "prove" that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.
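As a toy illustration of what a universally quantified, machine-checkable statement looks like in this setting (the tiny network and input box below are made up; real VNN-COMP-style tools are far more sophisticated), interval arithmetic can bound a network's output over an entire set of inputs at once, rather than over sampled test points:

```python
import numpy as np

# A made-up two-layer ReLU network: f(x) = w2 . relu(W1 x + b1) + b2
W1 = np.array([[1.0, -2.0], [0.5, 1.0]]); b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, 1.0]);                b2 = 0.5

def output_bounds(x_lo, x_hi):
    """Sound lower/upper bounds on f(x) for ALL x with x_lo <= x <= x_hi."""
    mid, rad = (x_lo + x_hi) / 2, (x_hi - x_lo) / 2
    h_mid = W1 @ mid + b1            # affine layer via interval arithmetic
    h_rad = np.abs(W1) @ rad
    h_lo, h_hi = h_mid - h_rad, h_mid + h_rad
    h_lo, h_hi = np.maximum(h_lo, 0), np.maximum(h_hi, 0)   # ReLU is monotone
    lo = np.minimum(w2 * h_lo, w2 * h_hi).sum() + b2
    hi = np.maximum(w2 * h_lo, w2 * h_hi).sum() + b2
    return lo, hi

lo, hi = output_bounds(np.array([-0.1, -0.1]), np.array([0.1, 0.1]))
print(f"for every x in the box, f(x) is in [{lo:.2f}, {hi:.2f}]")
# If lo > 0, the property "f(x) > 0" is certified for the whole box,
# not merely for the finitely many inputs we happened to test.
```

The stochastic-model-checking analogue replaces "for all inputs in a box" with "with probability at least p under all dynamics consistent with the model", but the shape of the guarantee is the same: universally quantified over a set, not sampled.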

5. Instrumental convergence.

See: The basic AI drives, Seeking power is often convergently instrumental

OAA Solution:

5.1. Specify a task which is bounded in the sense that it is tractable to search for a plan that actually maximizes the score function. Multiple subtle points must be observed to pull this off (a toy sketch of how they fit together follows 5.2):

5.1.1. The performance metric itself should be bounded on both ends, or even simply binary (did a "catastrophe" occur, or not?).

5.1.2. The domain of the performance metric should not be world-states, but rather entire trajectories (including the trajectory of the world-state and the trajectory of actions). Otherwise, the ends may always justify the means.

5.1.3. The score function cannot be the expected value of a future outcome, even a binary performance metric, because in an uncertain world one can always obtain higher probability of the desired outcome by gaining power. So, the score function should be capped at a reasonable probability, e.g. 98%.

5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).

5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated "status quo". Infrabayesian uncertainty about the dynamics is the final component that removes this incentive. In particular, the infrabayesian prior can (and should) have a high degree of Knightian uncertainty about human decisions and behaviour. This makes not interfering the most effective way to limit the maximum divergence of human trajectories from the status quo.

5.2. The shutdown timer (2.1) also constrains the extent to which generic power-seeking is instrumental, because it needs to pay off within a short period of time, and astronomical stakes like the cosmic endowment are out of the picture, truncated to a short finite time horizon.
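Here is the toy sketch promised under 5.1: one way 5.1.1–5.1.5 might compose into a single score function. Every name, constant, and data structure below is illustrative only; in particular, the `min` over candidate dynamics is a crude stand-in for genuine infrabayesian (worst-case over a credal set) evaluation, and the real system would score policies against formal world models, not Python objects.

```python
CAP = 0.98  # 5.1.3: cap the rewarded probability of the desired outcome

def divergence(world_traj, status_quo_traj) -> float:
    """5.1.4: a placeholder distance between the trajectory under the plan and
    the status-quo trajectory (here, trajectories are just lists of numbers)."""
    return sum(abs(a - b) for a, b in zip(world_traj, status_quo_traj)) / len(world_traj)

def score_under_model(world_traj, status_quo_traj, p_success, catastrophe) -> float:
    if catastrophe:                        # 5.1.1: bounded / binary catastrophe check
        return 0.0
    capped = min(p_success, CAP)           # 5.1.3: no credit for pushing past the cap
    penalty = divergence(world_traj, status_quo_traj)   # 5.1.2 + 5.1.4: score trajectories
    return capped - 0.1 * penalty

def score(outcomes_under_candidate_models) -> float:
    """5.1.5: evaluate under a whole *set* of plausible dynamics (Knightian
    uncertainty about humans) and take the worst case, so pinning down
    'irrelevant' variables to make the status quo predictable stops paying off."""
    return min(score_under_model(*o) for o in outcomes_under_candidate_models)
```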

6. Pivotal processes likely require incomprehensibly complex plans.

See: List of Lethalities #30

OAA Solution:

6.1. Instead of hoping that a team of humans will be able to check the plan, hope that a team of humans can check the ontology and the specification, in a formal and intelligible language. Generally, scientific principles and requirements are vastly simpler than engineering blueprints.

6.2. Then use formal methods (4.1) to check that the plan satisfies the specification. The formal methods can be accelerated by superintelligent heuristics to find a proof certificate, as long as the certificate checker is itself formally verified (8.2).

7. Superintelligence can fool human supervisors.

See: Reinforcement Learning from Human Feedback/Challenges, Obfuscated Arguments Problem

OAA Solution: (7.1) Instead of relying on human feedback about the apparent consequences of plans, rely on auditing formal specifications. Use human feedback about simulated consequences only to identify potentially forgotten formal safety properties. Because human feedback (or models thereof) is not then the optimization target for policy behaviour, the standard mechanism by which human-fooling outputs are convergently generated is absent.

8. Superintelligence can hack software supervisors.

See: Reward Tampering Problems and Solutions in Reinforcement Learning

OAA Solution:

8.1. Before training a model large enough to pose a serious accident risk, ensure that the computational environment is itself formally verified. That an unhackable software isolation environment is possible has been demonstrated by HACMS. It does require a large amount of software engineering effort to do this for AI-relevant compute systems, but if the effort is put in, it will work. It does not require homomorphic encryption or other gadgets with substantial runtime overhead, unless one's threat model includes output side-channels; see 12.4.

8.2. As mentioned briefly in 6.2, ensure that formal verifiers which are deployed alongside the superintelligent search process inside the secured computational environment are also, themselves, fully formally verified. This does not create an infinite regress, because verification processes themselves are amenable to verification by hand in an interactive theorem prover such as Lean. (A toy illustration of this pattern follows 8.3.)

8.3. Prevent "input side-channels" from outside the system by ensuring deterministic executions.
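As a toy illustration of the pattern in 8.2 (the checker below is made up and trivially small; real certificate checkers are much richer, but the proof obligation has the same shape), one can state and prove in Lean that a checker's acceptance implies the claimed fact, so that trusting the checker reduces to trusting one machine-checked theorem:

```lean
-- A made-up toy certificate checker: accepts (a, b, c) iff a + b really equals c.
def checkAdd (a b c : Nat) : Bool :=
  (a + b) == c

-- Soundness, proved once and for all: if the checker returns `true`,
-- the arithmetic claim holds.
theorem checkAdd_sound (a b c : Nat) (h : checkAdd a b c = true) : a + b = c := by
  simpa [checkAdd] using h
```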

9. Humans cannot be first-class parties to a superintelligence values handshake.

See: Values handshakes

OAA Solution: (9.1) Instead of becoming parties to a values handshake, keep superintelligent capabilities in a box and only extract plans that solve bounded tasks for finite time horizons and verifiably satisfy safety criteria that include not violating the natural boundaries of humans. This can all work without humans ever being terminally valued by AI systems as ends in themselves.

10. Humanlike minds/goals are not necessarily safe.

See: Joseph Stalin

OAA Solution: (10.1) Do not rely on humanlike minds to perform transformative tasks.

11. Someone else will deploy unsafe superintelligence first (possibly by stealing it from you).

See: Can the Singularity be avoided? (Vinge, 1993)

OAA Solution:

11.1. Implement effectively global Compute Thresholds,
11.2. and the necessary monitoring infrastructure to enforce them.

12. Unsafe superintelligence in a box might figure out what’s going on and find a way to exfiltrate itself by steganography and spearphishing.

See: AI Boxing

OAA Solution:

12.1. Limit and penalize the complexity of extracted plans, so there is not enough bandwidth to encode extra superintelligence-sized messages.

12.2. Strictly control access to the physical substrate so that no one is in a position to assist in exfiltration.

12.3. Use zeroization to ensure that attempts to invade the data centre cannot result in exfiltration either.

12.4. I think these measures together with 8.3 are roughly sufficient, but if civilization wanted to go the extra mile, we could wait until it’s tractable to use cryptographic boxing.

13. We are ethically obligated to propose pivotal processes that are as close as possible to fair Pareto improvements for all citizens, both by their own lights and from a depersonalized well-being perspective.

See: moral philosophy? But more instrumentally, we need the stakeholders (including engineers and governments) to feel good about what they are working on and be able to work with each other without hiding their goals.

OAA Solution: (13.1) Accept differing models and specifications from all stakeholders; search for a Nash bargaining solution with respect to the random dictator policy, which balances fairness and Pareto-optimality.
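As a toy numerical sketch of the bargaining rule named above (the three stakeholders, their utilities, and the three-option menu are invented; the actual proposal bargains over models and specifications, and would work with lotteries rather than a small finite list of outcomes):

```python
import math

# utilities[i][o] = utility of outcome o for stakeholder i (made-up numbers)
utilities = [
    [1.0, 0.0, 0.7],   # stakeholder A
    [0.0, 1.0, 0.7],   # stakeholder B
    [0.2, 0.2, 0.9],   # stakeholder C
]
n = len(utilities)

# Random dictator baseline: each stakeholder is dictator with probability 1/n
# and picks their favourite outcome; d[i] is i's expected utility under that lottery.
favourites = [u.index(max(u)) for u in utilities]
d = [sum(utilities[i][favourites[j]] for j in range(n)) / n for i in range(n)]

def nash_product(o):
    """Product of gains over the baseline; outcomes that leave anyone worse off
    than the random-dictator lottery are excluded."""
    gains = [utilities[i][o] - d[i] for i in range(n)]
    return math.prod(gains) if all(g >= 0 for g in gains) else -math.inf

best = max(range(len(utilities[0])), key=nash_product)
print("random-dictator baseline utilities:", [round(x, 3) for x in d])
print("Nash bargaining pick:", best)   # the compromise outcome 2 wins here
```

The random-dictator baseline supplies the fairness (everyone's favourite outcome gets equal weight in the disagreement point), and maximizing the product of gains over that baseline supplies Pareto efficiency among the acceptable outcomes.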

  1. ^

    For the record, I register an objection to the use of the phrase "reward hacking" for what others call "specification gaming" because I prefer to reserve the word "hacking" for behaviour which triggers the failure of a different software system to perform its intended function; most specification gaming examples do not actually involve hacking.

  2. ^

    Probably mostly not dependent-type-theory proofs. Other kinds of proof certificates include reach-avoid supermartingales (RASMs), LFSC proof certificates, and Alethe proofs. OAA will almost surely involve creating a new proof certificate language that is adapted to the modelling language and the specification language, and will support using neural networks or other learned representations as argument steps (e.g. as RASMs), some argument steps that are more like branch-and-bound, some argument steps that are more like tableaux, etc., but with a small and computationally efficient trusted core (unlike, say, Agda, or Metamath at the opposite extreme).

Comments

> 4. Goals misgeneralize out of distribution.
>
> See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning
>
> OAA Solution: (4.1) Use formal methods with verifiable proof certificates[2]. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to "prove" that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.

 

Based on the Bold Plan post and this one, my main point of concern is that I don't believe in the feasibility of the model checking, even in principle. The state space S and action space A of the world model will be too large for techniques along the lines of COOL-MC, which (if I understand correctly) have to first assemble a discrete-time Markov chain by querying the NN and then try to apply formal verification methods to that. I imagine that actually you are thinking of learned coarse-graining of both S and A, to which one applies something like formal verification.

Assuming that's correct, then there's an inevitable lack of precision on the inputs to the formal verification step. You have to either run the COOL-MC-like process until you hit your time and compute budget and then accept that you're missing state-action pairs, or you coarse-grain to some degree within your budget and accept a dependence on the quality of your coarse-graining. If you're doing an end-run around this tradeoff somehow, could you direct me to where I can read more about the solution?

I know there's literature on learned coarse-grainings of S and A in the deep RL setting, but I haven't seen it combined with formal verification. Is there a literature? It seems important.

I'm guessing that this passage in the Bold Plan post contains your answer:

> Defining a sufficiently expressive formal meta-ontology for world-models with multiple scientific explanations at different levels of abstraction (and spatial and temporal granularity) having overlapping domains of validity, with all combinations of {Discrete, Continuous} and {time, state, space}, and using an infra-bayesian notion of epistemic state (specifically, convex compact down-closed subsets of subprobability space) in place of a Bayesian state

In which case I see where you're going, but this seems like the hard part?

davidad:

I think you’re directionally correct; I agree about the following:

  • A critical part of formally verifying real-world systems involves coarse-graining uncountable state spaces into (sums of subsets of products of) finite state spaces.
  • I imagine these would be mostly if not entirely learned.
  • There is a tradeoff between computing time and bound tightness.

However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to justify confidence in the proposed policy network, in which case the policy would not exit the box, and the failure is a flop rather than a foom.)

Perhaps the keyphrase you’re missing is “interval MDP abstraction”. One specific paper that combines RL and model-checking and coarse-graining in the way you’re asking for is Formal Controller Synthesis for Continuous-Space MDPs via Model-Free Reinforcement Learning.
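To make "guaranteed probabilistic bounds from a coarse abstraction, in finite time" concrete, here is a toy version (a tiny made-up interval Markov chain with no actions; real interval-MDP model checkers handle actions and much larger learned abstractions): the true transition probabilities are only known to lie in intervals, yet finite-horizon value iteration against the worst consistent chain yields a sound lower bound on reaching the goal.

```python
# States: 0 = start, 1 = intermediate, 2 = goal (absorbing), 3 = bad (absorbing).
# lo[s][t], hi[s][t] bound the unknown true transition probability P(s -> t).
lo = [[0.0, 0.6, 0.1, 0.0],
      [0.0, 0.0, 0.5, 0.2],
      [0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
hi = [[0.1, 0.8, 0.3, 0.1],
      [0.1, 0.2, 0.7, 0.4],
      [0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
GOAL = 2

def worst_case_expectation(l, h, values):
    """Minimize sum_t p[t]*values[t] over distributions with l[t] <= p[t] <= h[t]:
    start from the lower bounds, then pour the leftover mass into the
    lowest-value successors first."""
    p = list(l)
    remaining = 1.0 - sum(p)
    for t in sorted(range(len(values)), key=lambda s: values[s]):
        add = min(h[t] - l[t], remaining)
        p[t] += add
        remaining -= add
    return sum(p[t] * values[t] for t in range(len(values)))

def lower_bound_reach(horizon):
    """Sound lower bound on P(reach GOAL within `horizon` steps from state 0),
    valid for every true chain consistent with the intervals."""
    v = [1.0 if s == GOAL else 0.0 for s in range(4)]
    for _ in range(horizon):
        v = [1.0 if s == GOAL else worst_case_expectation(lo[s], hi[s], v)
             for s in range(4)]
    return v[0]

print(lower_bound_reach(10))  # possibly loose, but guaranteed, and computed in finite time
```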

davidad:

That being said— I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in on the conclusion that the compositional world-modeling framework problem, as Owen Lynch and I have laid it out in this post, is potentially “the hard part” of AI safety, that would be wonderful…

Thanks, that makes a lot of sense to me. I have some technical questions about the post with Owen Lynch, but I'll follow up elsewhere.

Wei Dai:

"AI-powered memetic warfare makes all humans effectively insane" a catastrophe that I listed in an earlier comment, which seems one of the hardest to formally specify. It seems values-complete or metaphilosophy-complete to me, since without having specified human values or having solved metaphilosophy, how can we check whether an AI-generated argument is trying to convince us of something that is wrong according to actual human values, or wrong according to normative philosophical reasoning?

I don't see anything in this post or the linked OAA post that addresses or tries to bypass this difficulty?

davidad:

OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style).

There is still a misuse version: someone could remove the provision in 5.1.5 that the model of Earth-like situations should be largely agnostic about human behavior, and instead build a detailed model of how human nervous systems respond to language. (Then, even though the superintelligence in the box would still be making only descriptive arguments about a policy, the policy that comes out would likely emit normative arguments at deployment time.) Superintelligence misuse is covered under problem 11.

If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.

Wei Dai:

> If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.

Ok I'll quote 5.1.4-5 to make it easier for others to follow this discussion:

> 5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
>
> 5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated "status quo". Infrabayesian uncertainty about the dynamics is the final component that removes this incentive. In particular, the infrabayesian prior can (and should) have a high degree of Knightian uncertainty about human decisions and behaviour. This makes not interfering the most effective way to limit the maximum divergence of human trajectories from the status quo.

I'm not sure how these are intended to work. How do you intend to define/implement "divergence"? How does that definition/implementation combined with "high degree of Knightian uncertainty about human decisions and behaviour" actually cause the AI to "not interfere" but also still accomplish the goals that we give it?

In order to accomplish its goals, the AI has to do lots of things that will have butterfly effects on the future, so the system has to allow it to do those things, but also not allow it to "propagandize to humans". It's just unclear to me how you intend to achieve this.

This doesn't directly answer your questions, but since the OAA already requires global coordination and agreement to follow the plans spit out by the superintelligent AI, maybe propagandizing people is not necessary. Especially if we consider that by the time the OAA becomes possible, the economy and science are probably already largely automated by CoEms and don't need to involve motivated humans.

Then, the time-boundedness of the plan raises the chances that the plan doesn't concern itself with changing people's values and preferences as a side effect (which will be relevant for the ongoing work of shaping the constraints and desiderata for the next iteration of the plan). Some such interference with values will inevitably happen, though. That's what Davidad considers when he writes "A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is harder than merely mitigating extinction risk."

There are a lot of catastrophes that humans did or could do to themselves. In that regard, AI is like any multi-purpose tool, such as a hammer. We have to sort these out too, sooner or later, but isn't this orthogonal to the alignment question?

It is often considered as such, but my concern is less with “the alignment question” (how to build AI that values whatever its stakeholders value) and more with how to build transformative AI that probably does not lead to catastrophe. Misuse is one of the ways that it can lead to catastrophe. In fact, in practice, we have to sort misuse out sooner than accidents, because catastrophic misuses become viable at a lower tech level than catastrophic accidents.

There is a serious issue with your proposed solution to problem 13. Using a random dictator policy as a negotiation baseline is not suitable for a situation where billions of humans are negotiating about the actions of a clever and powerful AI. One problem with using this solution in this context is that some people have strong commitments to moral imperatives along the lines of "heretics deserve eternal torture in hell". The combination of these types of sentiments and a powerful and clever AI (which would be very good at thinking up effective ways of hurting heretics) leads to serious problems when one uses this negotiation baseline. A tiny number of people with sentiments along these lines can completely dominate the outcome.

Consider a tiny number of fanatics with this type of morality. They consider everyone else to be heretics, and they would like the AI to hurt all heretics as much as possible. Since a powerful and clever AI would be very good at hurting a human individual, this tiny number of fanatics can completely dominate negotiations. People who would be hurt as much as possible (by a clever and powerful AI) in a scenario where one of the fanatics is selected as dictator can be forced to agree to very unpleasant negotiated positions, if one uses this negotiation baseline (since agreeing to such an unpleasant outcome can be the only way to convince a group of fanatics not to ask the AI to hurt heretics as much as possible in the event that a fanatic is selected as dictator).

This post explores these issues in the context of the most recently published version of CEV: Parliamentarian CEV (PCEV). PCEV has a random dictator negotiation baseline. The post shows that PCEV results in an outcome massively worse than extinction (if PCEV is successfully implemented and pointed at billions of humans).

Another way to look at this is to note that the concept of "fair Pareto improvements" has counterintuitive implications when the question is about AI goals and some of the people involved have this type of morality. The concept was not designed with this aspect of morality in mind, and it was not designed to apply to negotiations about the actions of a clever and powerful AI. So it should not be very surprising to discover that the concept has counterintuitive implications when used in this novel context. If some change in the world improves the lives of heretics, then this makes the world worse from the perspective of those people that would ask an AI to hurt all heretics as much as possible. For example: reducing the excruciating pain of a heretic, in a way that does not affect anyone else in any way, is not a "fair Pareto improvement" in this context. If every person is seen as a heretic by at least one group of fanatics, then the concept of "fair Pareto improvements" has some very counterintuitive implications when it is used in this context.

Yet another way of looking at this is to take the perspective of a human individual, Steve, who will have no special influence over an AI project. In the case of an AI that is describable as doing what a group wants, Steve has a serious problem (and this problem is present regardless of the details of the specific Group AI proposal). From Steve's perspective, the core problem is that an arbitrarily defined abstract entity will adopt preferences that are about Steve. But if this is any version of CEV (or any other Group AI) directed at a large group, then Steve has had no meaningful influence regarding the adoption of those preferences that refer to Steve. Just like every other decision, the decision of what Steve-preferences the AI will adopt is determined by the outcome of an arbitrarily defined mapping that maps large sets of human individuals into the space of entities that can be said to want things. Different sets of definitions lead to completely different such "Group entities". These entities all want completely different things (changing one detail can, for example, change which tiny group of fanatics will end up dominating the AI in question). Since the choice of entity is arbitrary, there is no way for an AI to figure out that the mapping "is wrong" (regardless of how smart this AI is). And since the AI is doing what the resulting entity wants, the AI has no reason to object when that entity wants the AI to hurt an individual. Since Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, there is no reason for him to think that such an AI will want to help him, as opposed to want to hurt him. Combined with the vulnerability of a human individual to a clever AI that tries to hurt that individual as much as possible, this means that any Group AI would be worse than extinction, in expectation.

Discovering that doing what a group wants is bad for human individuals in expectation should not be particularly surprising. Groups and individuals are completely different types of things. So this should be no more surprising than discovering that any reasonable way of extrapolating Dave will lead to the death of every single one of Dave's cells. Doing what one type of thing wants might be bad for a completely different type of thing. And aspects of human morality along the lines of "heretics deserve eternal torture in hell" show up throughout human history, across cultures, religions, continents, and time periods. So, if an AI project is aiming for an alignment target that is describable as "doing what a group wants", then there is really no reason for Steve to think that the result of a successful project would want to help him, as opposed to want to hurt him. And given the large ability of an AI to hurt a human individual, the success of such a project would be massively worse than extinction (in expectation).

The core problem, from the perspective of Steve, is that Steve has no control over the adoption of those preferences that refer to Steve. One can give each person influence over this decision without giving anyone any preferential treatment (see for example MPCEV in the post about PCEV, mentioned above). Giving each person such influence does not introduce contradictions, because this influence is defined in "AI preference adoption space", not in any form of outcome space. This can be formulated as an alignment target feature that is necessary, but not sufficient, for safety. Let's refer to this feature as the Self Preference Adoption Decision Influence (SPADI) feature. (MPCEV is basically what happens if one adds the SPADI feature to PCEV. Adding the SPADI feature to PCEV solves the issue illustrated by that thought experiment.)

The SPADI feature is obviously very underspecified. There will be lots of border cases whose classification will be arbitrary. But there still exist many cases where it is in fact clear that a given alignment target does not have the SPADI feature. Since the SPADI feature is necessary, but not sufficient, these clear negatives are actually the most informative cases. In particular, if an AI project is aiming for an alignment target that clearly does not have the SPADI feature, then the success of this AI project would be worse than extinction, in expectation (from the perspective of a human individual that is not given any special influence over the AI project). While there are many border cases regarding which alignment targets could be described as having the SPADI feature, CEV is an example of a clear negative (in other words: there exists no reasonable set of definitions according to which there exists a version of CEV that has the SPADI feature). This is because building an AI that is describable as "doing what a group wants" is inherent in the core concept of building an AI that is describable as "implementing the Coherent Extrapolated Volition of Humanity".

In other words: the field of alignment target analysis is essentially an open research question. This question is also (i) very unintuitive, (ii) very underexplored, and (iii) very dangerous to get wrong. If one focuses on necessary, but not sufficient, alignment target features, then it is possible to mitigate dangers related to someone successfully hitting a bad alignment target, even if one does not have any idea of what it would mean for an alignment target to be a good alignment target. This comment outlines a proposed research effort aimed at mitigating this type of risk.

These ideas also have implications for the Membrane concept, as discussed here and here.

(It is worth noting explicitly that the problem is not strongly connected to the specific aspect of human morality discussed in the present comment (the "heretics deserve eternal torture in hell" aspect). The problem is about the lack of meaningful influence regarding the adoption of self-referring preferences; in other words, it is about the lack of the SPADI feature. It just happens to be the case that this particular aspect of human morality is both (i) ubiquitous throughout human history and (ii) well suited for constructing thought experiments that illustrate the dangers of alignment target proposals that lack the SPADI feature. If this aspect of human morality disappeared tomorrow, the basic situation would not change: the illustrative thought experiments would change, but the underlying problem would remain, and the SPADI feature would still be necessary for safety.)

niplav:

One issue I see with this plan is that it seems to rely on some mathematics that appear to me to not be fully worked out, e.g. infrabayesianism and «boundaries» (for which I haven't been able to find a full mathematical description), and it looks unclear to me whether they will actually be finished in time, and if they are, whether they lead to algorithms that are efficient enough to be scaled to such an ambitious project.

davidad:

That’s basically correct. OAA is more like a research agenda and a story about how one would put the research outputs together to build safe AI, than an engineering agenda that humanity entirely knows how to build. Even I think it’s only about 30% likely to work in time.

I would love it if humanity had a plan that was more likely to be feasible, and in my opinion that’s still an open problem!

Thanks for the clarification!

"Instead of building in a shutdown button, build in a shutdown timer."

Isn't that a form of corrigibility with an added constraint? I'm not sure what would prevent the AI from convincing humans that it's a bad thing to respect the timer, for example. Is it because we'll formally verify that we avoid deception instances? It's not clear to me, but maybe I've misunderstood.

A system with a shutdown timer, in my sense, has no terms in its reward function which depend on what happens after the timer expires. (This is discussed in more detail in my previous post.) So there is no reason to persuade humans or do anything else to circumvent the timer, unless there is an inner alignment failure (maybe that’s what you mean by “deception instance”). Indeed, it is the formal verification that prevents inner alignment failures.

I guess the shutdown timer would be most important in the training stage, so that it (hopefully) learns only to care about the short term.

Yes, the “shutdown timer” mechanism is part of the policy-scoring function that is used during policy optimization. OAA has multiple stages that could be considered “training”, and policy optimization is the one that is closest to the end, so I wouldn’t call it “the training stage”, but it certainly isn’t the deployment stage.

We hope not merely that the policy only cares about the short term, but also that it cares quite a lot about gracefully shutting itself down on time.

> Note: The absence of a catastrophe is also still hard to specify and will take a lot of effort, but the hardness is concentrated on bridging between high-level human concepts and the causal mechanisms in the world by which an AI system can intervene. For that...

Is the lack of a catastrophe intended to last forever, or only a fixed amount of time (i.e., 10 years, until turned off)?

For all Time.

Say this AI looks to the future and sees everything disassembled by nanobots. Self-replicating bots build computers. Lots of details about how the world was are being recorded. Those recordings are used in some complicated calculation. Is this a catastrophe?

The answer sensitively depends on the exact moral valence of these computations, which is not something easy to specify. If the catastrophe prevention AI bans this class of scenarios, it significantly reduces future value; if it permits them, it lets through all sorts of catastrophes.

For a while.

If the catastrophe prevention AI is only designed to last a while, while other AI is made, then we can wait for the uploading. But then an unfriendly AI can wait too. Unless the anti-catastrophe AI is supposed to ban all powerful AI systems that haven't been greenlit somehow? (With a greenlighting process set up by human experts, and the AI only considering something greenlit if it sees it signed with a particular cryptographic key.) And the supposedly omnipotent catastrophe prevention AI has been programmed to stop all other AIs from exerting excess optimization on us (in some way that lets us experiment while shielding us from harm).

Tricky. But maybe doable.

davidad:

Yes, it's the latter. See also the Open Agency Keyholder Prize.


> 2. Corrigibility is anti-natural.


Hello! I can see a route where corrigibility can become part of the AI's attention mechanism - and is natural to its architecture. 

If alignment properties are available in the training data and are amplified by a tuning dataset, that is very much possible.

Thanks!

There’s something to be said for this, because with enough RLHF, GPT-4 does seem to have become pretty corrigible, especially compared to Bing Sydney. However, that corrigible persona is probably only superficial, and the larger and more capable a single Transformer gets, the more of its mesa-optimization power we can expect will be devoted to objectives which are uninfluenced by in-context corrections.

Sorry for not specifying the method, but I wasn't referring to RL-based or supervised learning methods. There's a lot of promise in using a smaller dataset that explains corrigibility characteristics, as well as a shutdown mechanism, all fine-tuned through unsupervised learning.

I have a prototype at this link where I modified GPT2-XL to mention a shutdown phrase whenever all of its attention mechanisms activate and determine that it could harm humans due to its intelligence. I used unsupervised learning to allow patterns from a smaller dataset to achieve this.