Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by Safeguarded AI (formerly known as an Open Agency Architecture, or OAA), if it turns out to be feasible.
1. Value is fragile and hard to specify.
See: Specification gaming examples, Defining and Characterizing Reward Hacking[1]
OAA Solution:
1.1. First, instead of trying to specify "value", "de-pessimize" and specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some time, and the freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is harder than merely mitigating extinction risk. This doesn't mean limiting the power of underlying AI systems so that they can only do bounded tasks, but rather containing that power and limiting its use.
Note: The absence of a catastrophe is also still hard to specify and will take a lot of effort, but the hardness is concentrated on bridging between high-level human concepts and the causal mechanisms in the world by which an AI system can intervene. For that...
1.2. Leverage human-level AI systems to automate much of the cognitive labor of formalizing scientific models—from quantum chemistry to atmospheric dynamics—and formalizing the bridging relations between levels of abstraction, so that we can write specifications in a high-level language with a fully explainable grounding in low-level physical phenomena. Physical phenomena themselves are likely to be robust, even if the world changes dramatically due to increasingly powerful AI interventions, and scientific explanations thereof happen to be both robust and compact enough for people to understand.
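As a cartoon of what formalized bridging relations could look like, here is a toy sketch in Python (the names and the clean-water example are hypothetical illustrations chosen for concreteness, not OAA's actual modelling or specification languages):

```python
import random

# Low-level model: a stand-in for a detailed physical simulation
# (in reality: quantum chemistry, fluid dynamics, etc.).
def low_level_step(contaminant_concentrations):
    return [max(0.0, c + random.gauss(0, 0.001)) for c in contaminant_concentrations]

# Bridging relation: maps the low-level state up to a high-level, human-meaningful quantity.
def abstract(contaminant_concentrations):
    return sum(contaminant_concentrations) / len(contaminant_concentrations)

# High-level specification, written only against the abstract state.
def spec_water_is_potable(mean_contaminant_level):
    return mean_contaminant_level < 0.1

# In OAA, "the low-level dynamics satisfy the high-level spec" would be a proof
# obligation discharged with a certificate; this sampling check only illustrates
# what the claim being proved is about.
def sanity_check(n_samples=1000, horizon=10):
    for _ in range(n_samples):
        state = [random.uniform(0.0, 0.05) for _ in range(100)]
        for _ in range(horizon):
            state = low_level_step(state)
        if not spec_water_is_potable(abstract(state)):
            return False
    return True

print(sanity_check())
```

The human-auditable artifacts are the bridging map and the high-level spec; the low-level model can be as complicated as the automation can handle.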
2. Corrigibility is anti-natural.
See: The Off-Switch Game, Corrigibility (2014)
OAA Solution: (2.1) Instead of building in a shutdown button, build in a shutdown timer. See You can still fetch the coffee today if you're dead tomorrow. This enables human stakeholders to change course periodically (as long as the specification of non-catastrophe is good enough to ensure that most humans remain physically and mentally intact).
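A minimal sketch of the mechanical part of this idea, as a Gym-style wrapper (hypothetical interface; the hard part is the non-catastrophe specification, not the timer itself):

```python
class ShutdownTimer:
    """Wraps an environment so that every episode ends unconditionally after
    `horizon` steps; no reward can accrue beyond that point, so plans only
    ever get evaluated over a short, finite window."""

    def __init__(self, env, horizon):
        self.env = env
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        if self.t >= self.horizon:
            done = True  # a timer, not a button: nothing the agent does can postpone it
        return obs, reward, done, info
```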
3. Pivotal processes require dangerous capabilities.
See: Pivotal outcomes and pivotal processes
OAA Solution: (3.1) Indeed, dangerous capabilities will be required. Push for reasonable governance. This does not mean creating one world government, but it does mean that the objectives of a pivotal process will need to be negotiated and agreed upon internationally. Fortunately, for now, dangerous capabilities seem to depend on having large amounts of computing hardware, which can be controlled like other highly dangerous substances.
4. Goals misgeneralize out of distribution.
See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning
OAA Solution: (4.1) Use formal methods with verifiable proof certificates[2]. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to "prove" that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.
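As a simplified example of the kind of guarantee and certificate involved (my gloss on the supermartingale idea behind the RASMs mentioned in footnote [2], not the full construction): suppose a nonnegative function V over world-states can be exhibited such that V does not increase in expectation under the modelled dynamics, and V(s) ≥ c on every catastrophic state. Then

$$\Pr\big[\text{some catastrophic state is ever reached}\big] \;\le\; \frac{V(s_0)}{c},$$

by the supermartingale inequality. The function V is the proof certificate: a powerful, untrusted search process can propose it, and a small trusted checker can verify the defining conditions without trusting the proposer.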
5. Instrumental convergence.
See: The basic AI drives, Seeking power is often convergently instrumental
OAA Solution:
5.1. Specify a task which is bounded in the sense that it is tractable to search for a plan that actually maximizes the score function. Multiple subtle points must be observed to pull this off (a toy sketch of the resulting score function follows 5.2):
5.1.1. The performance metric itself should be bounded on both ends, or even simply binary (did a "catastrophe" occur, or not?).
5.1.2. The domain of the performance metric should not be world-states, but rather entire trajectories (including the trajectory of the world-state and the trajectory of actions). Otherwise, the ends may always justify the means.
5.1.3. The score function cannot be the expected value of a future outcome, even a binary performance metric, because in an uncertain world one can always obtain higher probability of the desired outcome by gaining power. So, the score function should be capped at a reasonable probability, e.g. 98%.
5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
5.1.5. By default, 5.1.4 creates an incentive to take control over irrelevant variables so as to ensure that they proceed as in the anticipated "status quo". Infrabayesian uncertainty about the dynamics is the final component that removes this incentive. In particular, the infrabayesian prior can (and should) have a high degree of Knightian uncertainty about human decisions and behaviour. Under worst-case reasoning about that uncertainty, the most effective way to limit the maximum divergence of human trajectories from the status quo is simply not to interfere.
5.2. The shutdown timer (2.1) also constrains the extent to which generic power-seeking is instrumental: any power grab needs to pay off within a short period of time, and astronomical stakes like the cosmic endowment are truncated away by the short finite time horizon.
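Putting 5.1.1–5.1.5 and 5.2 together, a toy version of such a score function might look like the following (my own sketch for concreteness, not OAA's actual construction):

$$\mathrm{score}(\pi) \;=\; \min\Big(\underline{\Pr}\big[\mathrm{OK}(\tau_{0:T}) \mid \pi\big],\ 0.98\Big) \;-\; \lambda\,\overline{\mathbb{E}}\big[d\big(\tau_{0:T},\ \tau^{\text{status quo}}_{0:T}\big) \mid \pi\big]$$

Here OK is a bounded (binary) predicate over whole trajectories up to the shutdown horizon T (5.1.1, 5.1.2, 5.2); the cap at 0.98 removes the incentive to seek power for ever-higher confidence (5.1.3); d penalizes divergence of the world-state trajectory from the no-powerful-AI baseline (5.1.4); and the underline/overline denote worst-case probability and expectation over the infrabayesian prior's Knightian uncertainty, notably about human behaviour (5.1.5).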
6. Pivotal processes likely require incomprehensibly complex plans.
OAA Solution:
6.1. Instead of hoping that a team of humans will be able to check the plan, hope that a team of humans can check the ontology and the specification, in a formal and intelligible language. Generally, scientific principles and requirements are vastly simpler than engineering blueprints.
6.2. Then use formal methods (4.1) to check that the plan satisfies the specification. The formal methods can be accelerated by superintelligent heuristics to find a proof certificate, as long as the certificate checker is itself formally verified (8.2).
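The trust structure in 6.2 is the familiar untrusted-prover / trusted-checker split. A minimal sketch with a toy deterministic world model and made-up names (in OAA the checker would validate a proof certificate against a stochastic model rather than replay a trace):

```python
# Toy world model: state -> action -> next state. The spec is "never visit a bad state".
MODEL = {
    "s0": {"a": "s1", "b": "bad"},
    "s1": {"a": "s2", "b": "s0"},
    "s2": {"a": "s2", "b": "s1"},
    "bad": {"a": "bad", "b": "bad"},
}
BAD_STATES = {"bad"}

def untrusted_search(model):
    """Stand-in for superintelligent heuristic search; its output is NOT trusted."""
    return ["a", "a", "a"]

def trusted_check(model, start, plan):
    """Small trusted core (in OAA, itself formally verified; cf. 8.2)."""
    state = start
    for action in plan:
        state = model[state][action]
        if state in BAD_STATES:
            return False
    return True

plan = untrusted_search(MODEL)
assert trusted_check(MODEL, "s0", plan)  # the plan is released only if this passes
```

Soundness rests only on the checker (and on the model and specification being right), not on whatever search process produced the plan.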
7. Superintelligence can fool human supervisors.
See: Reinforcement Learning from Human Feedback/Challenges, Obfuscated Arguments Problem
OAA Solution: (7.1) Instead of relying on human feedback about the apparent consequences of plans, rely on auditing formal specifications. Use human feedback about simulated consequences only to identify potentially forgotten formal safety properties. Because human feedback (or a model thereof) is not then the optimization target for policy behaviour, the standard mechanism by which human-fooling outputs are convergently generated is absent.
8. Superintelligence can hack software supervisors.
See: Reward Tampering Problems and Solutions in Reinforcement Learning
OAA Solution:
8.1. Before training a model large enough to pose a serious accident risk, ensure that the computational environment is itself formally verified. That an unhackable software isolation environment is possible has been demonstrated by HACMS. It does require a large amount of software engineering effort to do this for AI-relevant compute systems, but if the effort is put in, it will work. It does not require homomorphic encryption or other gadgets with substantial runtime overhead, unless one's threat model includes output side-channels; see 12.4.
8.2. As mentioned briefly in 6.2, ensure that formal verifiers which are deployed alongside the superintelligent search process inside the secured computational environment are also, themselves, fully formally verified. This does not create an infinite regress, because verification processes themselves are amenable to verification by hand in an interactive theorem prover such as Lean.
8.3. Prevent "input side-channels" from outside the system by ensuring deterministic executions.
9. Humans cannot be first-class parties to a superintelligence values handshake.
See: Values handshakes
OAA Solution: (9.1) Instead of becoming parties to a values handshake, keep superintelligent capabilities in a box and only extract plans that solve bounded tasks for finite time horizons and verifiably satisfy safety criteria that include not violating the natural boundaries of humans. This can all work without humans ever being terminally valued by AI systems as ends in themselves.
10. Humanlike minds/goals are not necessarily safe.
See: Joseph Stalin
OAA Solution: (10.1) Do not rely on humanlike minds to perform transformative tasks.
11. Someone else will deploy unsafe superintelligence first (possibly by stealing it from you).
See: Can the Singularity be avoided? (Vinge, 1993)
OAA Solution:
11.1. Implement effectively-global Compute Thresholds,
11.2. and the necessary monitoring infrastructure.
12. Unsafe superintelligence in a box might figure out what’s going on and find a way to exfiltrate itself by steganography and spearphishing.
See: AI Boxing
OAA Solution:
12.1. Limit and penalize the complexity of extracted plans, so there is not enough bandwidth to encode extra superintelligence-sized messages (see the note after 12.4).
12.2. Strictly control access to the physical substrate so that no one is in a position to assist in exfiltration.
12.3. Use zeroization to ensure that attempts to invade the data centre cannot result in exfiltration either.
12.4. I think these measures together with 8.3 are roughly sufficient, but if civilization wanted to go the extra mile, we could wait until it’s tractable to use cryptographic boxing.
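The intuition behind 12.1 can be made crisp with a counting argument (my gloss): if every extracted plan fits in at most B bits, then whatever steganographic scheme the boxed system uses, a single extraction can convey at most B bits of hidden payload, since

$$I(M;\ \mathrm{plan}) \;\le\; H(\mathrm{plan}) \;\le\; B,$$

where M is any internal message the system is trying to smuggle out. Exfiltrating "superintelligence-sized" content would therefore require either a great many extractions or a much wider channel.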
13. We are ethically obligated to propose pivotal processes that are as close as possible to fair Pareto improvements for all citizens, both by their own lights and from a depersonalized well-being perspective.
See: moral philosophy? But more instrumentally, we need the stakeholders (including engineers and governments) to feel good about what they are working on and be able to work with each other without hiding their goals.
OAA Solution: (13.1) Accept differing models and specifications from all stakeholders; search for a Nash bargaining solution with respect to the random dictator policy, which balances fairness and Pareto-optimality.
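Concretely, as one possible formalization (my illustration; the details could differ): with stakeholders i = 1, …, n, each with utility u_i, take as the disagreement point each stakeholder's expected utility under the random dictator policy, which picks one stakeholder uniformly at random and enacts that stakeholder's most-preferred policy π_j*; then search among feasible policies for the Nash bargaining solution relative to that point:

$$d_i \;=\; \frac{1}{n}\sum_{j=1}^{n} u_i(\pi_j^{*}), \qquad \pi^{\mathrm{NB}} \;\in\; \arg\max_{\pi \,:\, u_i(\pi) \ge d_i \ \forall i}\ \prod_{i=1}^{n}\big(u_i(\pi) - d_i\big)$$

The random-dictator baseline supplies the fairness, and maximizing the Nash product over policies that weakly improve on it supplies the Pareto-optimality.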
- ^
For the record, I register an objection to the use of the phrase "reward hacking" for what others call "specification gaming" because I prefer to reserve the word "hacking" for behaviour which triggers the failure of a different software system to perform its intended function; most specification gaming examples do not actually involve hacking.
- ^
Probably mostly not dependent-type-theory proofs. Other kinds of proof certificates include reach-avoid supermartingales (RASMs), LFSC proof certificates, and Alethe proofs. OAA will almost surely involve creating a new proof certificate language that is adapted to the modelling language and the specification language, and will support using neural networks or other learned representations as argument steps (e.g. as RASMs), some argument steps that are more like branch-and-bound, some argument steps that are more like tableaux, etc., but with a small and computationally efficient trusted core (unlike, say, Agda, or Metamath at the opposite extreme).
Thank you for the clarification. This proposal is indeed importantly different from the PCEV proposal. But since some people see hurting heretics as a moral imperative, any AI that allows heretics to escape punishment will also be seen as unacceptable by at least some people. This means that the set of Pareto improvements is empty.
In other words: hurting heretics is indeed off the table in your proposal (which is an important difference compared to PCEV). However, any scenario that includes the existence of an AI that allows heretics to escape punishment is also off the table. The existence of such an AI would be seen as intrinsically bad by people who see hurting heretics as a moral imperative (for example: Gregg really does not want a world where Gregg has agreed to tolerate the existence of an unethical AI that disregards its moral duty to punish heretics). More generally: anything that improves the lives of heretics is off the table. If an outcome improves the lives of heretics (compared to the no-AI baseline), then this outcome is not a Pareto improvement, because improving the lives of heretics makes things worse from the point of view of those who are deeply committed to hurting heretics.
In yet other words: it only takes two individuals to rule out any outcome that contains any improvement for any person. Gregg and Jeff are both deeply committed to hurting heretics, but their definitions of "heretic" differ, and every individual is seen as a heretic by at least one of them. So any outcome that makes life better for any person is off the table. Gregg and Jeff do have to be very committed to the moral position that the existence of any AI that neglects its duty to punish heretics is unacceptable; it must, for example, be impossible to get them to agree to tolerate the existence of such an AI in exchange for increased influence over the far future. But a population of billions only has to contain two such people for the set of Pareto improvements to be empty.
Another way to approach this would be to ask: what would have happened if someone had successfully implemented a Gatekeeper AI built on top of a set of definitions such that the set of Pareto improvements is empty?
For the version of the random dictator negotiation baseline that you describe, this comment might actually be more relevant than the PCEV thought experiment. It is a comment on the suggestion by Andrew Critch that it might be possible to view a Boundaries / Membranes based BATNA as having been agreed to acausally. It is impossible to reach such an acausal agreement when a group includes people like Gregg and Jeff, for the same reason that it is impossible to find an outcome that is a Pareto improvement when a group includes people like Gregg and Jeff. (That comment also discusses ideas for how one might deal with the dangers that arise when one combines people like Gregg and Jeff with a powerful and clever AI.)
Another way to look at this would be to consider what it would mean to find a Pareto improvement with respect to only Bob and Dave. Bob wants to hurt heretics, and Bob considers half of all people to be heretics. Dave is an altruist who just wants people to have as good a life as possible. The set of Pareto improvements would now be made up entirely of variations on the same general situation: make the lives of non-heretics much better, and make the lives of heretics much worse. For Bob to agree, heretics must be punished. And for Dave to agree, Dave must see the average life quality as an improvement on the "no superintelligence" outcome. If the "no superintelligence" outcome is bad for everyone, then the lives of heretics in this scenario could get very bad.
More generally: people like Bob (with aspects of morality along the lines of "heretics deserve eternal torture in hell") will have dramatically increased power over the far future when one uses this type of negotiation baseline (assuming that things have been patched in a way that results in a non-empty set of Pareto improvements). If everyone is included in the calculation of what counts as a Pareto improvement, then the set of Pareto improvements is empty (due to people like Gregg and Jeff). And if not everyone is included, then the outcome could get very bad for many people (compared to whatever would have happened otherwise).
(Adding the SPADI feature to your proposal would remove these issues and would prevent people like Dave from being disempowered relative to people like Bob. The details are importantly different from PCEV, but it is no coincidence that adding the SPADI feature removes this particular problem for both proposals. The common denominator is that, from the perspective of Steve, it is in general dangerous to encounter an AI that has taken "unwelcome" or "hostile" preferences about Steve into account.)
Also: my general point about the concept of "fair Pareto improvements" having counterintuitive implications in this novel context still applies (it is not related to the details of any specific proposal).