I applaud the effort. Big upvote for actually trying to solve the problem by coming up with a way to create safe, aligned AGI. If only more people were doing this instead of hand-wringing, arguing, or "working on the problem" in poorly thought-out ways too indirect to plausibly help in time. Good job going straight for the throat.
That said: It seems to me like the problem isn't maximization or even optimization; it's conflicting goals.
If I have a goal to make some paperclips, not as many as I can, just a few trillion, I may still enter a deadly conflict with humanity. If humanity knows about me and my paperclips goal, they'll shut me down. The most certain way to get those paperclips made may be to eliminate unpredictable humanity's ability to mess with my plans.
For essentially this reason, I think quantilization is, and was recognized as, a dead end. You don't have to take your goals to the logical extreme to still take them way too far for humanity's good.
I read this post, but not the rest of the sequence yet, so you might've addressed this elsewhere.
Alex Turner's post that you referenced first convinces me that his arguments about "orbit-level power-seeking" apply to maximizers and to quantilizers/satisficers. Let me reiterate that we are not suggesting quantilizers/satisficers are a good idea, but that I firmly believe explicit safety criteria, rather than plain randomization, should be used to select plans.
He also claims in that post that the "orbit-level power-seeking" issue affects all schemes that are based on expected utility: "There is no clever EU-based scheme which doesn't have orbit-level power-seeking incentives." I don't see a formal proof of that claim, though; maybe I missed it. The rationale he gives below that claim seems to boil down to a counting argument again, which suggests to me a tacit assumption that the agent still chooses uniformly at random from some set of policies. As this is not what we suggest, I don't see how it applies to our algorithms.
Re power-seeking in general: I believe one important class of safety criteria for selecting among the many possible plans that can fulfill an aspiration-type goal consists of criteria that quantify the amount of power/resources/capabilities/control potential the agent has at each time step. There are some promising metrics for this already (including "empowerment", reachability, and Alex Turner's AUP). We are currently investigating some versions of such measures, including ones we believe might be novel. A key challenge in doing so is again tractability: counting the reachable states, for example, might be intractable, but approximating that number by a recursively computable metric based on Wasserstein distance and Gaussian approximations to latent state distributions seems tractable and might turn out to be good enough.
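To illustrate the kind of quantity we have in mind, here is a minimal sketch (assuming Gaussian approximations to latent state distributions; the function name and the toy usage at the end are illustrative, not our actual implementation) of the closed-form 2-Wasserstein distance between two Gaussians, which is cheap to compute and could serve as a building block of such a recursively computable reachability proxy:

```python
# Illustrative sketch only: closed-form 2-Wasserstein distance between two Gaussian
# approximations of latent state distributions, the kind of cheap, recursively
# computable quantity a reachability/power proxy could be built from.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mean1, cov1, mean2, cov2):
    """2-Wasserstein distance between N(mean1, cov1) and N(mean2, cov2)."""
    mean_term = np.sum((mean1 - mean2) ** 2)
    cov2_sqrt = sqrtm(cov2)
    cross_term = np.real(sqrtm(cov2_sqrt @ cov1 @ cov2_sqrt))  # drop tiny numerical imaginary parts
    cov_term = np.trace(cov1 + cov2 - 2 * cross_term)
    return float(np.sqrt(max(mean_term + cov_term, 0.0)))

# Toy comparison: predicted latent-state distribution after a cautious plan vs. a drastic one.
d_cautious = gaussian_w2(np.zeros(3), np.eye(3), np.array([0.1, 0.0, 0.0]), 1.1 * np.eye(3))
d_drastic  = gaussian_w2(np.zeros(3), np.eye(3), np.array([5.0, 2.0, 0.0]), 4.0 * np.eye(3))
```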
Thank you for the warm encouragement.
We tried to be careful not to claim that merely making the decision algorithm aspiration-based is already sufficient to solve the AI safety problem, but maybe we need to add an even more explicit disclaimer in that direction. We explore this approach as a potentially necessary ingredient for safety, not as a complete plan for safety.
In particular, I perfectly agree that conflicting goals are also a severe problem for safety that needs to be addressed (while I don't believe there is a unique problem for safety that deserves being called "the" problem). In my thinking, the goals of an AGI system are always the direct or indirect consequences of the task it is given by some human who is authorized to give the system a task. If that is the case, the problem of conflicting goals is ultimately an issue of conflicting goals between humans. In your paperclip example, the system should reject the task of producing a trillion paperclips because that likely interferes with the foreseeable goals of other humans. I firmly believe we need to find a design feature that makes sure the system rejects tasks that conflict with other human goals in this way. For the most powerful systems, we might have to do something like what davidad suggests in his Open Agency Architecture, where plans devised by the AGI need to be approved by some form of human jury. I believe such a system would reject almost any maximization-type goal and would accept almost exclusively aspiration-type goals, and this is why I want to find out how such a goal could then be fulfilled in a rather safe way.
Re quantilization/satisficing: I think that apart from the potentially conflicting goals issue, there are at least two more issues with plain satisficing/quantilization (understood as picking a policy uniformly at random from those that promise at least X return in expectation, or from among the top X percent of the feasibility interval): (1) It might be computationally intractable in complex environments that require many steps, unless one finds a way to do it sequentially (i.e., from time step to time step). (2) The unsafe ways to fulfill the goal might not be scarce enough to have sufficiently small probability when choosing policies uniformly at random. The latter is the reason why I currently believe that the freedom to solve a given aspiration-type goal in all kinds of different ways should be used to select a policy that does so in a rather safe way, as judged on the basis of some generic safety criteria. This is why we also investigate in this project how generic safety criteria (such as those discussed for impact regularization in the maximization framework) should be integrated (see post #3 in the sequence).
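For concreteness, here is a minimal sketch of that "plain" procedure in a toy, fully enumerable setting (all names, including the helper expected_return, are illustrative):

```python
# Minimal sketch of "plain satisficing" as described above -- not our proposal.
import itertools
import random

def plain_satisficer(states, actions, expected_return, threshold, rng=random.Random(0)):
    # Enumerate every deterministic policy (a mapping state -> action); there are
    # len(actions) ** len(states) of them, which is issue (1): intractability.
    all_policies = [dict(zip(states, choice))
                    for choice in itertools.product(actions, repeat=len(states))]
    # Keep those that promise at least the aspired expected return ...
    acceptable = [policy for policy in all_policies if expected_return(policy) >= threshold]
    # ... and pick one uniformly at random. Nothing here screens out unsafe ways of
    # meeting the threshold, which is issue (2).
    return rng.choice(acceptable)
```

Even this toy version has to enumerate the full policy space, and the uniform choice in the last line is exactly where unsafe but goal-fulfilling policies can slip through.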
Sequence Summary. This sequence documents research by SatisfIA, an ongoing project on non-maximizing, aspiration-based designs for AI agents that fulfill goals specified by constraints ("aspirations") rather than by maximizing an objective function. We aim to contribute to AI safety by exploring design approaches, and their software implementations, that we believe might be promising but neglected or novel. Our approach is roughly related to, but largely complementary to, concepts like quantilization and satisficing (sometimes called "soft-optimization"), Decision Transformers, and Active Inference.
This post describes the purpose of the sequence, motivates the research, describes the project status, our working hypotheses and theoretical framework, and has a short glossary of terms. It does not contain results and can safely be skipped if you want to get directly into the actual research.
Epistemic status: We're still in the exploratory phase, and while the project has yielded some preliminary insights, we don't have any clear conclusions at this point. Our team holds a wide variety of opinions about the discoveries. Nothing we say is set in stone.
Purpose of the sequence
Motivation
We share a general concern regarding the trajectory of Artificial General Intelligence (AGI) development, particularly the risks associated with creating AGI agents designed to maximize objective functions. We have two main concerns:
(I) AGI development might be inevitable
(We assume this concern needs no further justification)
(II) It might be impossible to implement an objective function the maximization of which would be safe
The conventional view on A(G)I agents (see, e.g., Wikipedia) is that they should aim to maximize some function of the state or trajectory of the world, often called a "utility function", sometimes also called a "welfare function". It tacitly assumes that there is such an objective function that can adequately make the AGI behave in a moral way. However, this assumption faces several significant challenges:
Given these concerns, the implication is clear: AI safety research should spend considerably more effort than it currently does on identifying and developing AGI designs that do not rely on the maximization of an objective function. Given our impression that not enough researchers currently pursue this, we chose to work on it, complementing existing work on what some people call "mild" or "soft" optimization, such as Quantilization or Bayesian utility meta-modeling. In contrast to the latter approaches, which are still based on the notion of a "(proxy) utility function", we explore the apparently mostly neglected design alternative that avoids the very concept of a utility function altogether.[1] The closest existing aspiration-based design we know of is the Decision Transformer, which is inherently a learning approach. To complement this, we focus on a planning approach.
Project history and status
The SatisfIA project is a collaborative effort by volunteers from different programs: AI Safety Camp, SPAR, and interns from several ENSs, led by Jobst Heitzig at PIK's Complexity Science Dept. and the FutureLab on Game Theory and Networks of Interacting Agents, which started working on AI safety in 2023.
Motivated by the above thoughts, two of us (Jobst and Clément) began to investigate the possibility of avoiding the notion of a utility function altogether and of developing decision algorithms based on goals specified through constraints called "aspirations", in order to avoid risks from Goodhart's law and extreme actions. We originally worked in a reinforcement learning framework, modifying existing temporal difference learning algorithms, but soon ran into issues that were related more to the learning algorithms than to the aspiration-based policies we actually wanted to study (see Clément's earlier post).
Because of these issues, and prompted by discussions at VAISU 2023 and FAR Labs, comments from Stuart Russell, and concepts like davidad's Open Agency Architecture and other safe-by-design / provably-safe approaches, Jobst then switched from a learning framework to a model-based planning framework and adapted the optimal planning algorithms from agentmodels.org to work with aspirations instead. This worked much better and allowed us to focus on the decision-making aspect.
As of spring 2024, the project is participating in AI Safety Camp and SPAR with about a dozen people investing a few hours a week. We are looking for additional collaborators. Currently, our efforts focus on theoretical design, prototypical software implementation, and simulations in simple test environments. We will continue documenting our progress in this sequence. We also plan to submit a first academic conference paper soon.
Working hypotheses
We use the following working hypotheses during the project, which we ask the reader to adopt as hypothetical premises while reading our text (you don't need to believe them, just assume them to see what might follow from them).
(III) We should not allow the maximization of any function of the state or trajectory of the world
Following the motivation described above, our primary hypothesis is that an AGI must not be allowed to aim at maximizing any form of objective function that evaluates the state or trajectory of the world in the sense of a proxy utility function. (This does not necessarily rule out that the agent employs some type of optimization of some objective function as part of its decision making, as long as that function is not a function of the state or trajectory of the world alone. For example, we might allow some form of constrained maximization of entropy[2] or minimization of free energy[2] or the like, since these are functions of a probabilistic policy rather than of the state of the world.)
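For illustration (notation ours, and only an example of what would remain allowed): a constrained entropy maximization of this kind optimizes a function of the policy's action distribution, not of world states,

$$\pi^\star \in \operatorname*{arg\,max}_{\pi \in \Pi_{\text{admissible}}} H\big(\pi(\cdot\mid s)\big), \qquad H\big(\pi(\cdot\mid s)\big) = -\sum_{a} \pi(a\mid s)\,\log \pi(a\mid s),$$

where $\Pi_{\text{admissible}}$ stands for whatever constraints (e.g., the aspiration) restrict the choice of policy.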
(IV) The core decision algorithm must be hard-coded
The second hypothesis is that to keep an AGI from aiming to maximize some utility function, the AGI agent must use a decision algorithm to pick actions or plans on the basis of the available information, and that decision algorithm must be hard-coded and cannot be allowed to emerge as a byproduct of some form of learning or training process. This premise seeks to ensure that the foundational principles guiding the agent's decision making are known in advance and have verifiable (ideally even provable) properties. This is similar to the "safe by design" paradigm and implies a modular design in which decision making and knowledge representation are kept separate. In particular, it rules out monolithic architectures (such as a single transformer or other large neural network that represents a policy, together with a corresponding learning algorithm).
(V) We should focus on model-based planning first and only consider learning later
Although in reality an AGI agent will never have a perfect model of the world, and hence also needs some learning algorithm(s) to improve its imperfect understanding of the world on the basis of incoming data, our third hypothesis is that for the design of the decision algorithm, it is helpful to hypothetically assume at the outset that the agent already possesses a fixed, sufficiently good probabilistic world model that can predict the consequences of possible courses of action, so that the decision algorithm can choose actions or plans on the basis of these predictions. The rationale is that this "model-based planning" framework is simpler and mathematically more convenient to work with[3] and allows us to address the fundamental design issues that arise even without learning, before addressing additional issues related to learning (see also Project history).
(VI) There are useful generic, abstract safety criteria largely unrelated to concrete human values
We hypothesize the existence of generic and abstract safety criteria that can enhance AGI safety in a broad range of scenarios. These criteria focus on structural and behavioral aspects of the agent's interaction with the world, largely independent of the specific semantics or contextual details of actions and states and mostly unrelated to concrete human values. Examples include the level of randomization in decision-making, the degree of change introduced into the environment, the novelty of behaviors, and the agent's capacity to drastically alter world states. Such criteria are envisaged as broadly enhancing safety, very roughly analogous to guidelines such as caution, modesty, patience, neutrality, awareness, and humility, without assuming a one-to-one correspondence to the latter.
(VII) One can and should provide certain guarantees on performance and behavior
We assume it is possible to offer concrete guarantees concerning certain aspects of the AGI's performance and behavior. By adhering to the outlined hypotheses and safety criteria, our aim is to develop AGI systems whose behavior is in some essential respects predictable and controlled, reducing risk while fulfilling their intended functions within safety constraints.
(VIII) Algorithms should be designed with tractability in mind, ideally employing Bellman-style recursive formulas
Because the number of possible policies grows exponentially with the number of possible states, any algorithm that requires scanning a considerable portion of the policy space (as, e.g., "naive" quantilization over full policies would) soon becomes intractable in complex environments. In optimal control theory, this problem is solved by exploiting certain mathematical features of expected values that allow decisions to be made sequentially and the relevant quantities (V- and Q-values) to be computed efficiently and recursively via the Hamilton–Jacobi–Bellman equation. Based on our preliminary results, we hypothesize that a very similar recursive approach is possible in our aspiration-based framework in order to keep our algorithms tractable in complex environments.
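For reference, the recursion alluded to here, in its standard discrete-time form (notation ours):

$$Q(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\, r(s,a,s') + \gamma\, V(s') \,\big], \qquad V(s) = \max_a Q(s,a),$$

where $P$ is the world model's transition distribution, $r$ the reward, and $\gamma$ a discount factor. Our working hypothesis is that a structurally similar recursion, evaluated step by step rather than by scanning whole policies, is available in the aspiration-based setting.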
Theoretical framework
Agent-environment interface. Our theoretical framework and methodological tools draw inspiration from established paradigms in artificial intelligence (AI), aiming to construct a robust understanding of how agents interact with their environment and make decisions. At the core of our approach is the standard agent–environment interface, a concept that we later elaborate through standard models like (Partially Observable) Markov Decision Processes. For now, we consider agents as entirely separate from the environment — an abstraction that simplifies early discussions, though we acknowledge that the study of agents that are part of their environment is an essential extension for future exploration.
Simple modular design. To be able to analyse decision-making more clearly, we assume a modular architecture that divides the agent into two main components: the world model and the decision algorithm. The world model represents the agent's understanding and representation of its environment. It encompasses everything the agent knows or predicts about the environment, including how it believes the environment responds to its actions. The decision algorithm, on the other hand, is the mechanism through which the agent selects actions and plans based on its world model and set goals. It evaluates potential actions and their predicted outcomes to choose courses of action that appear safe and aligned with its aspirations.
Studying a modular setup can also be justified by the fact that some leading figures in AI push for modular designs even though current LLM-based systems are not yet modular in this way.
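As a minimal sketch of this separation (the class and method names are our own illustrative choices, not the project's actual interfaces):

```python
# Illustrative sketch of the modular world-model / decision-algorithm split.
from dataclasses import dataclass
from typing import Any, Iterable, Protocol

class WorldModel(Protocol):
    """Everything the agent knows or predicts about its environment."""
    def predict(self, state: Any, action: Any) -> Iterable[tuple[float, Any]]:
        """Return (probability, successor_state) pairs for taking `action` in `state`."""
        ...

@dataclass
class DecisionAlgorithm:
    """Selects actions on the basis of the world model and an aspiration-type goal."""
    model: WorldModel
    aspiration: Any  # a goal given as constraints, not a utility function to be maximized

    def decide(self, state: Any, candidate_actions: Iterable[Any]) -> Any:
        # Evaluate predicted outcomes against the aspiration and generic safety
        # criteria, then pick an action; concrete rules are the topic of later posts.
        raise NotImplementedError
```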
Information theory. For formalizing generic safety criteria, we mostly employ an information-theoretic approach, using the mathematically very convenient central concept of Shannon entropy (unconditional, conditional, mutual, directed, etc.) and derived concepts (e.g., channel capacity). This also allows us to compare and combine our findings with approaches based on the free energy principle and active inference, where goals are formulated as desired probability distributions of observations.
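As the simplest example of such a quantity, the Shannon entropy of the agent's action distribution in a state could feed a criterion like the "level of randomization in decision-making" mentioned above (a sketch with illustrative names, not our implementation):

```python
# Illustrative: Shannon entropy (in bits) of a distribution over actions.
import numpy as np

def action_entropy(action_probs):
    """Shannon entropy of a probability distribution over actions, in bits."""
    p = np.asarray(action_probs, dtype=float)
    p = p[p > 0]                      # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log2(p)))

action_entropy([0.25, 0.25, 0.25, 0.25])  # 2.0 bits: fully randomized choice
action_entropy([1.0, 0.0, 0.0, 0.0])      # 0.0 bits: deterministic choice
```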
High-level structure of studied decision algorithms
The decision algorithms we currently study have the following high-level structure:
Depending on the type of aspiration, the details can be designed so that the algorithm comes with certain guarantees about the fulfillment of the goal.
The next post is about one such simple type of aspiration: aspirations concerning the expected value of a single evaluation metric.
Appendix: Small glossary
This is how we use certain potentially ambiguous terms in this sequence:
Agent: We employ a very broad definition of "agent" here: the agent is a machine with perceptors that produce observations and actuators it can use to take actions, and it uses a decision algorithm to pick the actions it then takes. Think: a household's robot assistant, or a firm's or government's strategic AI consultant. We do not assume that the agent has a goal of its own, let alone that it is a maximizer.
Aspiration: A goal formalized via a set of constraints on the state or trajectory or probability distribution of trajectories of the world. E.g., "The expected position of the spaceship in 1 week from now should be the L1 Lagrange point, its total standard deviation should be at most 100 km, and its velocity relative to the L1 point should be at most 1 km/h".
Decision algorithm: A (potentially stochastic) algorithm whose input is a sequence of observations (that the agent's perceptors have made) and whose output is an action (which the agent's actuators then take). A decision algorithm might use a fixed "policy" or a policy it adapts from time to time, e.g. via some form of learning, or some other way of deriving a decision.
Maximizer: An agent whose decision algorithm takes an objective function and uses an optimization algorithm to find the (typically unique) action at which the predicted expected value of the objective function (or some other summary statistic of the predicted probability distribution of the values of this function) is (at least approximately) maximal[4], and then outputs that action.
Optimization algorithm: An algorithm that takes some function f (given as a formula or as another algorithm that can be used to compute values f(x) of that function), calculates or approximates the (typically unique) location x of the global maximum (or minimum) of f, and returns that location x. In other words: an algorithm $O: f \mapsto O(f) \approx \operatorname{argmax}_x f(x)$.
Utility function: A function (only defined up to positive affine transformations) of the (predicted) state of the world or trajectory of states of the world that represents the preferences of the holder of the utility function over all possible probability distributions of states or trajectories of the world in a way conforming to the axioms of expected utility theory or some form of non-expected utility theory such as cumulative prospect theory.
Loss function: A non-negative function of potential outputs of an algorithm (actions, action sequences, policies, parameter values, changes in parameters, etc.) that is meant to represent something the algorithm shall rather keep small.
Goal: Any type of additional input (additional to the observations) to the decision algorithm that guides the direction of the decision algorithm, such as:
Planning: The activity of considering several action sequences or policies ("plans") for a significant planning time horizon, predicting their consequences using some world model, evaluating those consequences, and deciding on one plan on the basis of these evaluations.
Learning: Improving one's beliefs/knowledge over time on the basis of some stream of incoming data, using some form of learning algorithm, and potentially using some data acquisition strategy such as some form of exploration.
Satisficing: A decision-making strategy that entails selecting an option that meets a predefined level of acceptability or adequacy rather than optimizing for the best possible outcome. This approach emphasizes practicality and efficiency by prioritizing satisfactory solutions over the potentially exhaustive search for the optimal one. See also the satisficer tag.
Quantilization: A technique in decision theory, introduced in a 2015 paper by Jessica Taylor. Instead of selecting the action that maximizes expected utility, a quantilizing agent chooses randomly among the actions which rank higher than a predetermined quantile in some base distribution of actions. This method aims to balance between exploiting known strategies and exploring potentially better but riskier ones, thereby mitigating the impact of errors in the utility estimation. See also the quantilization tag.
Reward Function Regularization: A method in machine learning and AI design where the reward function is modified by adding penalty terms or constraints to discourage certain behaviors or encourage others, such as in Impact Regularization. This approach is often used to incorporate safety, ethical, or other secondary objectives into the optimization process. As such, it can make use of the same type of safety criteria as an aspiration-based algorithm might.
Safety Criterion: A (typically quantitative) formal criterion that assesses some potentially relevant safety aspect of a potential action or plan. E.g., how much change in the world trajectory could this plan maximally cause? Safety criteria can be used as the basis of loss functions or in reward function regularization to guide behavior.
Using Aspirations as a Means to Maximize Reward: An approach in adaptive reinforcement learning where aspiration levels (desired outcomes or benchmarks) are adjusted over time based on performance feedback. Here, aspirations serve as dynamic targets to guide learning and action selection towards ultimately maximizing cumulative rewards. This method contrasts with using aspirations as fixed goals, emphasizing their role as flexible benchmarks in the pursuit of long-term utility optimization. This is not what we assume in this project, where we assume no such underlying utility function.
Acknowledgements
In addition to AISC, SPAR, VAISU, FAR Labs, and GaNe, we'd like to thank everyone who contributed or continues to contribute to the SatisfIA project for fruitful discussions and helpful comments and suggestions.
This might not be trivial, of course, if the agent might "outsource" or otherwise cause the maximization of some objective function by others, a question that is related to "reflective stability" in the sense of this post. An additional rule beyond "do not maximize" would hence be something like "do not cause maximization by other agents", but this is beyond the scope of the project at the moment.
Information-theoretic, not thermodynamic.
This is also why Sutton and Barto's leading textbook on RL starts with model-based algorithms and turns to learning only in chapter 6.
That the decision algorithm uses an optimization algorithm, to which it passes the function to be maximized, is crucial in this definition. Without this condition, every agent would be a maximizer, since for whatever output a the decision algorithm produces, one can always find ex post some function f for which $\operatorname{argmax}_x f(x) = a$.