This is a linkpost for https://github.com/epurdy/saferl/blob/main/saferl_draft.pdf
Section 2.3 seems to be the part that addresses alignment, and the proposed solution is to use reinforcement learning (train the AI on examples of what humans would do) and then to give up (either by leaving a human in the loop forever or just deciding that turning people into paperclips really is better).
The way these kinds of problems keep getting buried deep in the writing (sometimes behind linked PDFs) makes me think this is some sort of Sokal-hoax-style prank.
What's so bad about keeping a human in the loop forever? Do we really think we can safely abdicate our moral responsibilities?
In this post, we link to an accessible, academically written control-theoretic account of how to perform safe model-based reinforcement learning in a domain with significant moral constraints.
The ethicophysics can most properly be thought of as a new scientific paradigm in AI safety designed to give rigorous theoretical and historical justification to a particular style of learning cost functions for safety-critical systems.
In the rest of this post, we simply copy and paste the introduction of the paper.
###
We describe LQPR (Linear-Quadratic-Program-Regulator), an algorithm for model-based reinforcement learning that allows us to prove PAC-style bounds on the behavior of the system. We describe proposed experiments that can be performed in the DeepMind AI Safety Gridworlds domain; we have not had time to implement these experiments yet, but we provide predictions as to the outcome of each experiment. Potential future work may include scaling up this work to non-trivial tasks using a neural network approximator, as well as proving additional theoretical results about the safety and stability of such a system. We believe that this system is a potential basis for aligning large language models and other powerful near-term AIs with human preferences.
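The paper does not spell out the algorithm in this excerpt, so the following is only a minimal sketch of the idea suggested by the name: a finite-horizon linear-quadratic regulator whose actions are additionally clipped to a linear safety constraint. Every name, matrix, and constraint here is a hypothetical illustration, not the paper's actual construction.

```python
import numpy as np

def lqr_gain(A, B, Q, R, horizon):
    """Finite-horizon discrete-time LQR gain via backward Riccati recursion."""
    P = Q
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Hypothetical example system: a double integrator (position, velocity),
# with acceleration as the control input.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # quadratic state cost
R = np.array([[1.0]])  # quadratic control cost
K = lqr_gain(A, B, Q, R, horizon=50)

x = np.array([5.0, 0.0])
for _ in range(100):
    u = -K @ x
    u = np.clip(u, -1.0, 1.0)  # hypothetical linear safety constraint on actions
    x = A @ x + B @ u
# x is driven close to the origin despite the clipped actions
```

In a fuller treatment, the action clipping would presumably be replaced by a quadratic program over the full set of linear safety constraints at each step.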
1 Introduction

In this section, we lay out necessary context.

1.1 Background
There are three dominant approaches to discussing ethics in philosophy: consequentialism, deontology, and virtue ethics. Any truly safe RL algorithm needs to be capable of incorporating all three of these approaches, since each has weaknesses that are addressed by the other two. The algorithm described in this document is capable of implementing all three. Many, most notably Yudkowsky, have noted that consequentialist agents are likely to be badly misaligned. This perspective seems quite accurate to us.
Meanwhile, it seems likely that deontological agents will be hamstrung by their strict adherence to rules, and will end up paying too heavy an alignment tax to be capable. Virtue ethics might be the sweet spot between these two extremes, since it seems likely to be both alignable and capable. For the sake of completeness, we provide implementations of all three ethical paradigms within a single unified framework. The wise reader is encouraged not to implement a consequentialist agent without proving a sufficiently reassuring number of theorems that are verified to the limits of human capability. Even given such reassurances, it still seems unwise to implement a strongly capable consequentialist agent for anything other than military purposes.
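To make the "unified framework" claim concrete, here is a hedged sketch of how the three paradigms could each contribute a term to a single scalar cost function. None of these function names, weights, or signatures come from the paper; they only illustrate the shape such a framework might take.

```python
# Hypothetical illustration: one scalar cost combining the three ethical
# paradigms. All names and weights are invented for this sketch.

def consequentialist_cost(outcome_value):
    # Penalize bad predicted outcomes: lower outcome value => higher cost.
    return -outcome_value

def deontological_cost(rule_violations):
    # Large fixed penalty per violated rule, regardless of outcome.
    return 1e6 * rule_violations

def virtue_cost(action_trait_score):
    # Penalize actions unlike those a virtuous agent would take
    # (trait score in [0, 1], 1 = fully virtuous).
    return 1.0 - action_trait_score

def total_cost(outcome_value, rule_violations, action_trait_score,
               w_conseq=1.0, w_deont=1.0, w_virtue=1.0):
    # The weights let a designer emphasize one paradigm over the others,
    # or zero one out entirely.
    return (w_conseq * consequentialist_cost(outcome_value)
            + w_deont * deontological_cost(rule_violations)
            + w_virtue * virtue_cost(action_trait_score))
```

Under this shape, a pure consequentialist agent corresponds to setting the deontological and virtue weights to zero, which is exactly the configuration the paragraph above warns against.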
1.2 Scope

This paper is intended to lay the theoretical foundations for safe model-based reinforcement learning. We do not discuss the problem of generalization, since that seems to require significant extensions that we do not have designs for yet. We also examine only smaller, simpler versions of the relevant problem. We will briefly indicate how these might be extended when such an extension seems relatively clear.