I'm not excited by gridworlds, because they tend to skip straight to representing the high-level objects we're supposed to value, without bothering to represent all the low-level structure that actually lets us learn and generalize values in the real world.
Do you have plans for how to deal with this, or plans to think about richer environments?
This is an AI Safety Camp 10 project that I will be leading. With this post, I am looking for external collaborators, ideas, questions, resource suggestions, feedback, and other thoughts.
Summary
Based on various sources of anthropological research, I have compiled a preliminary list of universal (cross-cultural) human values. It seems to me that many of these universal values resonate with concepts from AI safety, but use different terminology. It might be useful to map these universal values to more concrete definitions using concepts from AI safety.
One notable detail in this research is that in the case of AI-human cooperation, the values are not symmetric in the way they would be in human-human cooperation. This arises because we can change the goal composition of agents, but not of humans. Additionally, there is the crucial difference that agents can be cloned relatively easily, while humans cannot. Therefore, for example, a human may have a universal need for autonomy, while an AI agent might conceivably not have that need built in. If that works out, then the agent would instead have a need to support human autonomy.
The objective of this project would be to implement these mappings of concepts into tangible AI safety benchmark environments.
The non-summary
A related subject is balancing multiple human values (as the title says, values are in the plural!). Human values and needs have to be met to a reasonable degree each, that is, while balancing all the other human values as well. In this context, balancing is not the same as “tradeoff”. In some interpretations and use cases, tradeoff means a linear rate of substitution between objectives, but as economists know well - humans generally prefer averages across all objectives to extremes in a few objectives. This means a naive approach of summing up the rewards of an AI agent would not yield aligned results. It is essential to apply nonlinear utility functions to transform the rewards before summing them up in the RL algorithm.
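To illustrate, here is a minimal sketch (the reward numbers and the square-root transform are arbitrary assumptions chosen purely for illustration) of how a linear sum of rewards favours the extreme allocation, while a concave transform applied before summation favours the balanced one:

```python
import numpy as np

# Two candidate behaviours, each yielding rewards on two objectives
# (illustrative numbers only).
balanced = np.array([5.0, 5.0])   # moderate progress on both objectives
extreme  = np.array([10.0, 0.5])  # maximises only the "easiest" objective

def linear_utility(rewards):
    # Naive aggregation: simple summation of raw rewards.
    return rewards.sum()

def concave_utility(rewards):
    # Apply a concave (diminishing-returns) transform, here a square root,
    # to each objective before summing.
    return np.sqrt(rewards).sum()

print(linear_utility(balanced), linear_utility(extreme))    # 10.0 vs 10.5 -> extreme wins
print(concave_utility(balanced), concave_utility(extreme))  # ~4.47 vs ~3.87 -> balanced wins
```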
The current compiled list of universal human values is available in this document: "Universal ethical values - Survey of values"
https://docs.google.com/document/d/1ZZiToC149g9vKJGZRhktFmLYdB5J63nbClvCN_CxqAM/edit?usp=sharing (We may publish it as a separate LW post in the future).
It might also be interesting to consider how agents could internally represent the diversity of human needs, for which there are more than a hundred words representing various nuances. Take a look, for example, at this list of needs from the framework of Nonviolent Communication (scroll down to the second half of the webpage to see the list of needs): https://www.orvita.be/en/card/#:~:text=meaning%20(1)-,purpose,-goal%0Avision%0Adream . One of the central ideas of NVC is making a distinction between expressed strategies / stances and implicit actual needs. The needs can be compared to ultimate values, while strategies are only instrumental values. One way to experiment with such scenarios would be utilising Sims. There have been LLM interfaces built for Sims. Among other Sims interfaces, you may want to take a look at this one: https://github.com/joonspk-research/generative_agents .
On a related note, in economics there are inherently multi-objective and nonlinear concepts like diminishing returns, concave utility functions, marginal utility, indifference curves, convex preferences, complementary goods, Cobb-Douglas utilities, willingness to accept and willingness to pay, prospect theory, etc. These and many other well-known formulations and phenomena from economics need to be introduced to AI safety in order for both humans and agents to better understand and implement our preferences and values. When planning new benchmarks, we can include some themes derived from these utility and preference theories in economics as well. A utility-monster-like AI would not only be unsafe, it would also be economically unsound.
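As one small illustration (the exponents and bundle quantities below are arbitrary assumptions, not values from any of the cited literature), a Cobb-Douglas utility exhibits both diminishing marginal utility in each good and a preference for balanced bundles over extreme ones:

```python
def cobb_douglas(x, y, alpha=0.5):
    # U(x, y) = x^alpha * y^(1 - alpha): a standard convex-preference utility.
    return (x ** alpha) * (y ** (1.0 - alpha))

# Diminishing marginal utility: each extra unit of x adds less than the previous one.
print(cobb_douglas(2, 4) - cobb_douglas(1, 4))  # ~0.83 gain from one extra unit of x
print(cobb_douglas(3, 4) - cobb_douglas(2, 4))  # ~0.64 smaller gain from the next unit

# Convex preferences: a balanced bundle beats an extreme bundle with the same total.
print(cobb_douglas(4, 4))  # balanced bundle: 4.0
print(cobb_douglas(7, 1))  # extreme bundle:  ~2.65
```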
For implementing these benchmarks, it may help that I have already created a convenient framework which enables implementing multi-agent, multi-objective environments. This framework was built as an elaborate fork of DeepMind's gridworlds framework. Additionally, I have already implemented about a dozen benchmarks using this framework, so the framework has been validated, and these existing benchmarks can also be used as example code for implementing the new environments. That said, we can also use different frameworks for implementing the benchmarks if the team prefers.
The multi-agent multi-objective gridworlds framework is available here: https://github.com/levitation-opensource/ai-safety-gridworlds This framework has been made compatible with the PettingZoo and Gym APIs, so testing AI agents on it is easy and follows industry-standard interfaces. At the same time, the framework is extended from the previously popular DeepMind Gridworlds, enabling easy adoption of many existing gridworld environments and their conversion into multi-objective, multi-agent scenarios. You can see screenshots of the framework in this working paper: "From homeostasis to resource sharing: Biologically and economically compatible multi-objective multi-agent AI safety benchmarks" https://arxiv.org/abs/2410.00081 .
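To give a sense of how agents interact with such an environment, here is a minimal sketch of the standard PettingZoo AEC interaction loop with a random policy; the environment constructor below is a hypothetical placeholder, not the framework's actual API (consult the repository for the real entry points):

```python
# Generic PettingZoo AEC interaction loop with a random placeholder policy.
# make_example_env() is hypothetical; the framework's actual constructor and
# environment names should be taken from the repository's documentation.
env = make_example_env()
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None  # PettingZoo convention for agents that are done
    else:
        action = env.action_space(agent).sample()  # random placeholder policy
    env.step(action)

env.close()
```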
Motivation
The present-day rapid advancement of AI technologies necessitates the development of safe and reliable AI systems that align with human values. While notable progress has been made in defining and implementing safety protocols in recent years, there remains a gap in systematically integrating universal human values into AI safety benchmarks. My project aims to bridge this gap by planning and potentially building new multi-objective, multi-agent AI safety benchmark environments that incorporate themes of universal human values.
Drawing from extensive anthropological research, I've compiled a list of universal (cross-cultural) human values. These values often resonate with AI safety concepts but are expressed using different terminology. Mapping these universal values to concrete definitions using AI safety concepts can provide a more robust framework for developing safe AI systems. Likewise, we can then better note the kinds of universal human values that might not yet have good coverage in the form of corresponding AI safety concepts. For example, human autonomy might be one such potentially neglected concept, differing from the usually assumed power and achievement values - if an AI does all we ask for, or even more, before we even ask, then that might conflict with our need for autonomy.
One critical aspect of this research is recognizing the asymmetry between AI and human cooperation. Unlike with humans, we can alter the goal composition of AI agents and clone them relatively easily. This difference means that agents can be designed without certain intrinsic needs (e.g., autonomy) and instead be programmed to support human autonomy. They may still gain a limited need for autonomy for instrumental reasons, but at least it might not need to be built in.
Implementing and balancing the plurality of these universal human values is essential, as humans prefer a harmonious average across all objectives rather than extremes in a few.
Theory of Change
By integrating universal human values into AI safety benchmarks, we can develop AI agents that better understand and align with human needs. These benchmarks will serve as testing grounds for AI systems, ensuring they perform optimally across multiple objectives that reflect human values. This approach can reduce the risk of misalignment between AI behaviour and human expectations, thereby mitigating potential hazards associated with AGI/TAI development.
This project mostly aims at outer alignment, though I think there are also a couple of ways in which inner alignment can be affected.
First, my hypothesis is that if the AI is trained on sufficiently many objectives pulling in different directions, then it becomes increasingly less likely that the model would overfit to some random objective. Instead, the model would hopefully find a middle ground between the objectives in the training data. This is similar to how old-fashioned machine learning models overfit less when you have more data points. Even if the model still has some alien objectives inside it, these alien objectives would be drowned out by the plurality of different human-values-based objectives that were explicitly present in the training data.
Secondly, the way we formulate the mathematics of balancing multiple objectives is closer to the theme of inner alignment. The formulation of the model may affect its personality somewhat. Think, for example, about the difference between RL models and control-systems models. The latter have the concept of optimal homeostatic setpoints baked in, while with RL models you need to tweak their maximising nature somewhat. Likewise, we move closer to inner alignment work with the general understanding that we need to use nonlinear utility functions. In other words, linear summation of rewards across objectives, without nonlinear transformations applied before summation, would not be acceptable - it would lead to maximisation of the single easiest-to-achieve objective. With certain objectives, or certain dynamics of those objectives, it might be easier to achieve outer alignment if the agent also has approximately the right inner alignment. You can read more about my earlier research on balancing in this paper: "Using soft maximin for risk averse multi-objective decision-making" https://link.springer.com/article/10.1007/s10458-022-09586-2 .
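As an illustration of one risk-averse balancing scheme, here is a minimal sketch of a soft-minimum (log-sum-exp based) aggregation over per-objective utilities; this is a generic textbook formulation offered for illustration, not necessarily the exact formulation used in the paper linked above:

```python
import numpy as np

def soft_min(utilities, temperature=1.0):
    # Smooth approximation of min(utilities): as temperature -> 0 it approaches
    # the hard minimum, so the aggregate is dominated by the worst-performing
    # objective (a risk-averse way of balancing objectives).
    u = np.asarray(utilities, dtype=float)
    return -temperature * np.log(np.mean(np.exp(-u / temperature)))

# Balanced outcomes score better than outcomes that sacrifice one objective.
print(soft_min([5.0, 5.0]))   # ~5.0
print(soft_min([10.0, 0.5]))  # ~1.19 - dominated by the neglected objective
```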
That being said, I definitely acknowledge the risk of a treacherous turn or “sharp left turn”. I imagine that this risk can manifest in various ways, and some of the related problems were the motivation for why I became interested in AI safety in the first place. In my mind, the approaches we explore in this project are not intended to solve all problems. The approaches we implement are not exclusive of other AI safety approaches - various approaches can be combined into a hybrid solution in the future.
Project Plan
Steps Involved:
First Step
The initial step is to perform an analysis of the universal human values list and map each value to corresponding AI safety concepts. This mapping will form the foundation for designing the benchmark environments.
Backup Plan
Potential Challenges:
Backup Strategies:
Scope
Included
Excluded
Most Ambitious Version
Least Ambitious Version
Output
At the end of the project, we will have:
Risks and downsides (externalities)
The project carries minimal risk of negative externalities. Since we are focusing on benchmark environments rather than advancing AI capabilities directly, the risk of inadvertently accelerating AI capabilities is low. There is a slight risk that misinterpretation of human values could lead to flawed benchmarks, but this can be mitigated through analysis, peer review, and open collaboration. This project is a conversation starter. No significant infohazards or ethical concerns are anticipated.