This is actually pretty cool! Feels like it's doing the type of reasoning that might result in critical insight, and maybe even is one itself. It's towards the upper tail of the distribution of research I've read by people I'm not already familiar with.
I think there are big challenges to this solving AGI alignment, including that this restriction probably bounds the AI's power a lot. But it still feels like a neat idea, and I hope you continue to explore the space of possible solutions.
Thanks, this was pretty interesting.
A big problem is the free choice of "conceptual language" (universal Turing machine) when defining simplicity/comprehensibility. You at various points rely on an assumption that there is one unique scale of complexity (one ladder of V1,V2,V3...Vn), and it'll be shared between the humans and the AI. That's not necessarily true, which creates a lot of leaks where an AI might do something that's simple in the AI's internal representation but complicated in the human's.
It's OK to make cars pink by using paint ("spots of paint" is an easier to optimize/comprehend variable). It's not OK to make cars pink by manipulating individual water droplets in the air to create an elaborate rainbow-like illusion ("individual water droplets" is a harder to optimize/comprehend variable).
This raises a second problem, which is the "easy to optimize" criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain. But if we let environmental availability weigh on "easy to optimize," then the agent will be happy to switch from real paint to a hologram or a human-hack once the technology for those becomes developed and commodified.
When the metric is a bit fuzzy and informal, it's easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.
You at various points rely on an assumption that there is one unique scale of complexity (one ladder of V1,V2,V3...Vn), and it'll be shared between the humans and the AI. That's not necessarily true, which creates a lot of leaks where an AI might do something that's simple in the AI's internal representation but complicated in the human's.
I think there are many somewhat different scales of complexity, but they're all shared between the humans and the AI, so we can choose any of them. We start with properties (X) which are definitely easy to understand for humans. Then we gradually relax those properties. According to the principle, the properties will capture all key variables relevant to human values long before top human mathematicians and physicists stop understanding what those properties might describe. (Because most of the time, living a value-filled life doesn't require using the best mathematical and physical knowledge of the day.) My model: "the entirety of human ontology >>> the part of human ontology a corrigible AI needs to share".
This raises a second problem, which is the "easy to optimize" criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain.
There are three important possibilities relevant to your hypothetical:
My point here is that I don't think technology undermines the usefulness of my metric. And I don't think that's a coincidence. According to the principle, one or both of the below should be true:
If neither were true, we would believe that technology radically changed fundamental human values at some point in the past. We would see life without technology as devoid of most non-trivial human values.
When the metric is a bit fuzzy and informal, it's easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.
The selling point of my idea is that it comes with a story for why it's logically impossible for it to fail, or why all of its flaws should be easy to predict and fix. Is it easy to come up with such a story for other ideas? I agree that it's too early to buy that story. But I think it's original and probable enough to deserve attention.
Remember that I'm talking about a Task-directed AGI, not a Sovereign AGI.
I apologize, I didn't read in full, but I'm curious if you considered the case of, for example, the Mandelbrot set? A very simple equation specifies an infinitely precise, complicated set. If human values have this property, then it would be correct to say the Kolmogorov complexity of human values is very low, but there are still very exacting constraints on the universe for it to satisfy human values.
Don't worry about not reading it all. But could you be a bit more specific about the argument you want to make or the ambiguity you want to clarify? I have a couple of interpretations of your question.
Interpretation A:
Interpretation B:
My response to B is that my metric of simplicity is different from Kolmogorov Complexity.
I want to show a philosophical principle which, I believe, has implications for many alignment subproblems. If the principle is valid, it might allow us to make progress on those subproblems.
This post clarifies and expands on ideas from here and here. Reading the previous posts is not required.
The Principle
The principle and its most important consequences:
Justification:
There are value systems for which the principle is false. In that sense, it's empirical. However, I argue that it's a priori true for humans, no matter how wrong our beliefs about the world are. So the principle is not supposed to be an "assumption" or "hypothesis", like e.g. the Natural Abstraction hypothesis.
Formalization
How do we define easiness of comprehension? We choose variables describing our sensory data. We choose what properties (X) of those variables count as "easily comprehensible". Now when we consider any variable (observable or latent), we check how much its behavior fits the properties. We can order all variables from the most comprehensible to the least comprehensible (V1,V2,V3...Vn).
Let's give a specific example of X properties. Imagine a green ball in your visual field. What properties would make this stimulus easier to comprehend? Continuous movement (the ball doesn't teleport from place to place), smooth movement (the ball doesn't abruptly change direction), low speed (the ball doesn't change too fast compared to other stimuli), low numerosity (the ball doesn't have countless distinct parts). Those are the kinds of properties we need to abstract and capture when defining X.
How do we define easiness of optimization? Some V variables are variables describing actions (A), ordered from the most comprehensible to the least comprehensible actions (A1,A2,A3...An). We can check if changes in an Aj variable are correlated with changes in a Vi variable. If yes, Aj can optimize Vi. Easiness of optimization of Vi is given by the index j of the most comprehensible such action. This is an incomplete definition, but it conveys the main idea.
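As a toy sketch of the two definitions above (not the eventual formalization), here's what the ordering and the optimization check could look like in code. The specific property scores (jumpiness, roughness), the correlation test, and the 0.5 threshold are placeholders chosen purely for illustration:

```python
import numpy as np

def comprehensibility_score(trace: np.ndarray) -> float:
    """Score a 1-D time series by rough analogues of the X properties:
    continuity (no big jumps) and smoothness (no abrupt direction changes).
    Higher score = easier to comprehend. The weights are placeholders."""
    velocity = np.diff(trace)
    acceleration = np.diff(velocity)
    jumpiness = np.mean(np.abs(velocity))       # penalizes teleport-like jumps
    roughness = np.mean(np.abs(acceleration))   # penalizes abrupt changes of direction
    return -(jumpiness + roughness)

def order_variables(traces: dict[str, np.ndarray]) -> list[str]:
    """Return variable names ordered V1, V2, ... Vn (most to least comprehensible)."""
    return sorted(traces, key=lambda name: comprehensibility_score(traces[name]),
                  reverse=True)

def easiness_of_optimization(target: np.ndarray, actions: dict[str, np.ndarray],
                             action_order: list[str]) -> int | None:
    """Index j of the most comprehensible action Aj whose changes correlate with
    changes in the target variable (None if no action correlates). Assumes all
    traces have equal length; the 0.5 threshold is arbitrary."""
    for j, name in enumerate(action_order, start=1):
        corr = np.corrcoef(np.diff(actions[name]), np.diff(target))[0, 1]
        if abs(corr) > 0.5:
            return j
    return None
```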
Formalizing all of this precisely won't be trivial. But I'll give intuition pumps for why it's a very general idea which doesn't require getting anything exactly right on the first try.
Example: our universe
Consider these things:
According to the principle:
This makes sense. Why would we rebind our utility to something which we couldn't meaningfully interact with, perceive or understand previously?
Example: Super Mario Bros.
Let's see how the principle applies to a universe very different from our own. A universe called Super Mario Bros.
When playing the game, it's natural to ask: what variables (observable or latent) change in ways which are the easiest to comprehend? Which of those changes are correlated with simple actions of the playable character or with simple inputs?
Let's compare a couple of things from the game:
According to the principle:
This makes sense. If you care about playing the game, it's hard to care about things which are tangential or detrimental to the main gameplay.
Example: reverse-engineering programs with a memory scanner
There's a fun and simple way to hack computer programs, based on searching and filtering variables stored in a program's memory.
For example, do you want to get infinite lives in a video game? Then do this:
Oftentimes you'll end up with at least two variables: one controlling the actual number of lives and the other controlling the number of lives displayed on the screen. Here's a couple of tutorial videos about this type of hacking: Cheat Engine for Idiots, Unlocking the Secrets of my Favorite Childhood Game.
It's a very general approach to reverse engineering a program. And the idea behind my principle is that the variables humans care about can be discovered in a similar way, by filtering out all variables which don't change according to certain simple rules.
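Here's a tiny illustration of that search-and-filter loop on a made-up "memory" dict. The addresses and values are invented; nothing here touches a real process:

```python
# Toy version of the memory-scanner workflow: keep only the addresses whose
# values keep matching what we see on screen (the lives counter).
def initial_scan(memory: dict[int, int], observed_value: int) -> set[int]:
    return {addr for addr, value in memory.items() if value == observed_value}

def refine(candidates: set[int], memory: dict[int, int], observed_value: int) -> set[int]:
    return {addr for addr in candidates if memory[addr] == observed_value}

# Scan while we have 3 lives, lose a life, rescan while we have 2, repeat.
snapshot_with_3_lives = {0x0040: 3, 0x0044: 3, 0x0100: 57, 0x0104: 3}
snapshot_with_2_lives = {0x0040: 2, 0x0044: 2, 0x0100: 58, 0x0104: 3}

candidates = initial_scan(snapshot_with_3_lives, observed_value=3)        # {0x40, 0x44, 0x104}
candidates = refine(candidates, snapshot_with_2_lives, observed_value=2)  # {0x40, 0x44}
# What's left is typically the real lives counter plus the copy used for display.
```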
If you still struggle to understand what "easiness of optimization/comprehension" means, check out additional examples in the appendix. There you'll also find a more detailed explanation of the claims behind the principle.
Philosophy: Anti-Copernican revolutions
(This is a vague philosophical point intended to explain what kind of "move" I'm trying to make by introducing my principle.)
There are Copernican revolutions and Anti-Copernican revolutions.
Copernican revolutions say "external things matter more than our perspective". The actual Copernican revolution is an example.
Anti-Copernican revolutions say "our perspective matters more than external things". The anthropic principle is an example: instead of asking "why are we lucky to have this universe?" we ask "why is this universe lucky to have us?". What Immanuel Kant called his "Copernican revolution" is another example: instead of saying "mental representations should conform to external objects" he said "external objects should conform to mental representations".[4] Arguably, Policy Alignment is also an example ("human beliefs, even if flawed, are more important than AI's galaxy-brained beliefs").
With my principle, I'm trying to make an Anti-Copernican revolution too. My observation is the following: for our abstractions to be grounded in anything at all, reality has to have certain properties — therefore, we can deduce properties of reality from introspective information about our abstractions.
Visual illustration
The green bubble is all aspects of reality humans can optimize or comprehend. It's a cradle of simplicity in a potentially infinite sea of complexity. The core of the bubble is V1, the outer layer is Vn.
The outer layer contains, among other things, the last theory of physics which has some intuitive sense. The rest of the universe, not captured by the theory, is basically just "noise".
We care about the internal structure of the bubble (its internals are humanly comprehensible concepts). We don't care about the internal structure of the "noise". Though we do care about predicting the noise, since the noise might accidentally accumulate into a catastrophic event.
The bubble has a couple of nice properties. It's humanly comprehensible and it has a gradual progression from easier concepts to harder concepts (just like in school). We know that the bubble exists, no matter how wrong our beliefs are. Because if it doesn't exist, then all our values are incoherent and the world is incomprehensible or uncontrollable.
Note that the bubble model applies to 5 somewhat independent things: the laws of physics, ethics, cognition & natural language, conscious experience, and mathematics.
Philosophy: a new simplicity measure
Idea 1. Is "objects that are easy to manipulate with the hand" a natural abstraction? I don't know. But imagine I build an AI with a mechanical hand. Now it should be a natural abstraction for the AI, because "manipulating objects with the hand" is one of the simplest actions the AI can perform. This suggests that it would be nice to have an AI which interprets reality in terms of the simplest actions it can take. Because it would allow us to build a common ontology between humans and the AI.
Idea 2. The simplest explanation of the reward is often unsafe because it's "too smart". If you teach a dumber AI to recognize dogs, it might learn the shape and texture of a dog; meanwhile a superintelligent AI will learn a detailed model of the training process and Goodhart it. This suggests that it would be nice to have an AI which doesn't just search for the simplest explanation with all of its intelligence, but looks for the simplest explanations at different levels of intelligence — and is biased towards "simpler and dumber" explanations.
The principle combines both of those ideas and gives them additional justification. It's a new measure of simplicity.
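As a rough sketch of Idea 2 (the "bias towards simpler and dumber explanations" part), assuming we already have some family of model classes ordered from least to most capable; `fit` and `loss` here stand in for whatever training and evaluation procedure is actually used:

```python
# Pick the explanation of the reward found by the *least capable* model class
# that still fits well enough, instead of the globally simplest/smartest one.
def dumbest_adequate_explanation(model_classes, data, tolerance):
    for model_class in model_classes:        # e.g. [linear, small net, large net, ...]
        model = model_class.fit(data)        # placeholder training call
        if model.loss(data) <= tolerance:    # "good enough" at this capability level
            return model                     # simpler-and-dumber explanation wins
    return None                              # no level explains the reward adequately
```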
Human Abstractions
Here I explain how the principle relates to the following problems: the pointers problem; the diamond maximizer problem; environmental goals; identifying causal goal concepts from sensory data; ontology identification problem; eliciting latent knowledge.
According to the principle, we can order all variables by how easy they are to optimize/comprehend (V1,V2,V3...Vn). We can do this without abrupt jumps in complexity or empty classes. Vj+1 can have greater predictive power than Vj, because it has fewer constraints.
That implies the following:
Related: Natural Abstraction Hypothesis
If NAH is true, referents of human concepts have relatively simple definitions.
However, my principle implies that referents of human concepts have a relatively simple definition even if human concepts are not universal (i.e. it's not true that "a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world").
Related: compositional language
One article by Eliezer Yudkowsky kinda implies that there could be a language for describing any possible universe on multiple levels, a language in which defining basic human goals would be pretty easy (no matter what kind of universe humans live in):
But why would such a language be feasible to figure out? It seems like creating it could require considering countless possible universes.
My principle explains "why" and proposes a relatively feasible method of creating it.
Related: ELK and continuity (Sam Marks)
The principle could be used to prove that the properties for enforcing continuity can't themselves be discontinuous functions of the underlying latent state (unless something really weird is going on, in which case humans should be alerted), provided we use X properties to define "continuity".
Related: ELK and modes of prediction
A proposal by Derek Shiller, Beth Barnes and Nate Thomas, Oam Patel:
The principle could be used to prove that we can force predictors to not be "fundamentally different" from each other, so we can force the reporter to change continuously.
Low Impact
Here I explain how the principle relates to Impact Regularization.
Allowed consequences
According to the principle, we can order all variables by how easy they are to optimize/comprehend (V1,V2,V3...Vn). We could use it to differentiate between "impacts explainable by more coarse-grained variables (Vj)" and "impacts explainable by more fine-grained variables (Vk)":
Sources of impact
Some hard to optimize/comprehend variables (Vk) are "contained" within easy to optimize/comprehend variables (Vj). For example:
We could use this fact to search for unusual sources of complicated impacts.
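A minimal sketch of how that differentiation could work, assuming we can regress an observed impact on sets of variables; `predict_from` is a placeholder for whatever predictor is trusted on the training distribution:

```python
# Classify an impact by which part of the ladder is needed to explain it.
def impact_class(impact, coarse_vars, fine_vars, predict_from, threshold):
    # coarse_vars ~ easy to optimize/comprehend (Vj), fine_vars ~ hard (Vk)
    coarse_error = abs(impact - predict_from(coarse_vars))
    if coarse_error <= threshold:
        return "allowed: explainable by Vj variables"
    fine_error = abs(impact - predict_from(coarse_vars + fine_vars))
    if fine_error <= threshold:
        return "suspicious: only explainable with Vk variables"
    return "unexplained: alert humans"
```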
Goodhart's Curse
According to the principle, we can order all variables by how easy they are to optimize/comprehend (V1,V2,V3...Vn). If we could map the variables inside an AI to this order, we could measure how much optimization the AI does. Which is relevant to the Goodhart problem.
The intuition here is that to become substantially better at optimizing something, you need to perform progressively more complicated actions and take into account progressively more incomprehensible considerations; you can become better by just trying different combinations of the same kind of actions/considerations (Vj), but then you're fundamentally limited. However, there's also a stronger claim:
An agent exploiting a certain kind of variables (Vj) can't game an outer objective, unless something starts exploiting less optimizable/comprehensible variables (Vk) — which the agent can notice or stop. Assuming the outer objective is chosen with minimal competence and the agent's environment is somewhat stable.
Example: collecting stones
Imagine a superintelligent robot tasked to place stones into a pile. A couple of cameras observe the robot. "Being seen collecting stones into a pile" is the outer objective. "Actually collecting stones into a pile" is the intent behind the outer objective.
The robot could get reward by going around and grabbing stones (the robot's and stones' movement can be described by Vj variables). The robot could also hack the cameras to produce fake video, but that would require manipulating variables which are substantially harder to optimize/comprehend (Vk). If the robot sticks to manipulating Vj, everything should be fine.
Another agent could hack the cameras. Less likely, the environment itself could transform into a state where the cameras are hacked. But any of that would imply that Vk variables have changed in a way directly related to how Vj variables can optimize the outer objective. (If cameras are hacked, collecting real stones suddenly becomes completely useless for optimizing the outer objective.) The robot can report or stop that.
In some parts of the environment, getting stones could be as hard as hacking the cameras. For example, maybe we need to hack the enemy's cameras to steal their stones. In such a case, we could whitelist exploiting Vk variables there. The robot can ensure that Vk optimization doesn't "spill" into the rest of the world.
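Here's a toy monitor for this example; the names and the linear "reward per stone" assumption are invented just to make the check concrete:

```python
# Watch whether the usual Vj-level link "more real stones collected => more
# reward from the cameras" still holds; if it breaks, some Vk-level variable
# (e.g. camera firmware) has probably changed, and the robot should report it.
def camera_link_intact(stones_collected: int, reward_from_cameras: float,
                       reward_per_stone: float = 1.0, tolerance: float = 0.5) -> bool:
    expected = stones_collected * reward_per_stone
    return abs(reward_from_cameras - expected) <= tolerance

def monitor_step(stones_collected: int, reward_from_cameras: float) -> str:
    if camera_link_intact(stones_collected, reward_from_cameras):
        return "continue: the outer objective is still explained by Vj-level behavior"
    return "halt and report: Vk variables seem to have changed (cameras possibly hacked)"
```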
Example: Speedrunning
Imagine you measure "how good one can play a video game" (the intent) by "the speed of completing the game" (the outer objective).
This outer objective can be Goodharted with glitches (anomalously fast movement, teleportation, invincibility, getting score points out of nothing). However, some of the below will be true:
If the player sticks to Vj, Goodharting the outer objective is impossible. But expert performance is still possible.
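A toy version of the "sticks to Vj" check for a recorded run; the speed bound stands in for whatever the game's glitchless physics actually allow:

```python
# A run "sticks to Vj" if the character's frame-to-frame movement stays inside
# the envelope of normal, glitchless physics. MAX_NORMAL_SPEED is a placeholder.
MAX_NORMAL_SPEED = 10.0

def run_sticks_to_vj(positions: list[float]) -> bool:
    speeds = [abs(b - a) for a, b in zip(positions, positions[1:])]
    return all(speed <= MAX_NORMAL_SPEED for speed in speeds)  # zips/teleports exceed the bound
```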
Related: Shard Theory
I believe the point of "Don't align agents to evaluations of plans" can be reformulated as:
Make agents terminally value easy to optimize/comprehend variables (Vj), so they won't Goodhart by manipulating hard to optimize/comprehend variables (Vk).
My principle supports this point.
More broadly, a big aspect of Shard Theory can be reformulated as:
Early in training, Reinforcement Learning agents learn to terminally value easy to optimize/comprehend variables ("shards" are simple computations about simple variables)... that's why they're unlikely to Goodhart their own values by manipulating hard to optimize/comprehend variables.
If Shard Theory is true, the principle should give insight into how shards behave in all RL agents. Because the principle is true for all agents whose intelligence & values develop gradually and who don't completely abandon their past values.
Related: Mechanistic Anomaly Detection
See glider example, strawberry example, Boolean circuit example, diamond example.
I believe the idea of Mechanistic Anomaly Detection can be described like this:
Any model (Mn) has "layers of structure" and therefore can be split into versions, ordered from versions with less structure to versions with more structure (M1,M2,M3...Mn). When we find the version with the least structure which explains the most instances[6] of a phenomenon we care about (Mj), it defines the latent variables we care about.
This is very similar to the principle, but more ambitious (makes stronger claims about all possible models) and more abstract (doesn't leverage even the most basic properties of human values).
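A sketch of that selection rule, assuming the versions M1...Mn are already given and `explains` stands in for whatever notion of "mechanistic explanation" the anomaly-detection setup provides:

```python
# Walk through model versions ordered from least to most structure and stop at
# the first one that explains enough of the training instances.
def least_structure_explainer(model_versions, instances, explains, coverage=0.95):
    for m in model_versions:                               # M1, M2, ..., Mn
        explained = sum(1 for x in instances if explains(m, x))
        if explained / len(instances) >= coverage:
            return m    # Mj: the version defining the latent variables we care about
    return None         # even Mn doesn't explain the phenomenon
```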
Interpretability
Claim A. Say you can comprehend Vj variables, but not Vk. You can still understand which Vk variable is most similar to a Vj variable (and whether the former causes the latter); whether a change of a Vk variable harms or helps your values (and whether the change is necessary or unnecessary); whether a Vk variable is contained within a particular part of your world-model or not. According to the principle, this knowledge can be obtained automatically.
Claim B. Take a Vk ontology (which describes real things) and a simpler Vj ontology (which might describe nonexistent things). Whatever the Vj ontology describes, we can automatically check whether there's anything in Vk that corresponds to it OR whether searching for a correspondence is too costly.
This is relevant to interpretability.
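Claim B can be read as a bounded search. Here's a minimal sketch where `similarity`, the search budget, and the match threshold are all placeholders:

```python
# Try to find something in the Vk ontology that corresponds to a given Vj
# concept, and give up explicitly when the search becomes too costly.
def find_correspondence(vj_concept, vk_variables, similarity, budget, match_threshold):
    for cost, vk in enumerate(vk_variables):  # assumed ordered from cheaper to costlier checks
        if cost >= budget:
            return ("searching for a correspondence is too costly", None)
        if similarity(vj_concept, vk) >= match_threshold:
            return ("correspondence found", vk)
    return ("nothing in Vk corresponds to it", None)
```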
Example: Health
Imagine you can't comprehend how the human body works. Consider those statements by your doctor:
Despite not understanding it all, you understand everything relevant to your values. For example, from the last statement you understand that the doctor doesn't respect your values.
Now, imagine you're the doctor. You have a very uneducated patient. The patient might say stuff like "inside my body <something> moves from one of my hands to another" or "inside my body <something> keeps expanding below my chest". Whatever they describe, you'll know whether you know any scientific explanation of it OR whether searching for an explanation is too costly.
Related: Ramsey–Lewis method
The above is similar to Ramsification.
Hypothetical ELK proposal
Claims A and B suggest a hypothetical interpretability method. I'll describe it with a metaphor:
In this metaphor, Einstein = an incomprehensible AI. Village idiot = an easily interpretable AI. It's like the broken telephone game, except we're fixing broken links.
If some assumptions hold (it's cheap to translate brains into models made of V variables; describing Einstein's cognition doesn't require variables more complex than Vn; producing a non-insane non-bizarre clone doesn't take forever), the proposal above gives a solution to ELK.
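The "fixing broken links" idea could look roughly like this in code; `distill_to_simpler`, `agreement`, and the probe questions are placeholders, not a worked-out procedure:

```python
# Distill a chain of models from the incomprehensible one ("Einstein") down to
# an easily interpretable one ("village idiot"), checking at every link that the
# simpler clone still answers the same probe questions the same way.
def build_interpretable_chain(einstein_model, simplicity_levels, distill_to_simpler,
                              agreement, probe_questions, min_agreement=0.9):
    chain = [einstein_model]
    for level in simplicity_levels:            # from Vn-like models down to Vj-like ones
        simpler = distill_to_simpler(chain[-1], level)
        if agreement(chain[-1], simpler, probe_questions) < min_agreement:
            raise ValueError(f"link broke at level {level}: the clone is insane or bizarre")
        chain.append(simpler)
    return chain[-1]                           # the model we can actually interpret
```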
Building AGI
Consider this:
If my principle is formalized, we might obtain a bounded solution to outer and inner alignment. (I mean Task-directed AGI level of outer alignment.) Not saying the procedure is gonna be practical.
Appendix
Comparing variables
Here are some additional examples of comparing variables based on their X properties.
Consider what's easier to optimize/comprehend:
Here are the answers:
The analysis above is made from the human perspective and considers "normal" situations (e.g. the situations from the training data, like in MAD).
A more detailed explanation of the claims behind the principle
Imagine you have a set of some properties (X). Let's say it's just one property, "speed". You can assign a speed to any "simple" variable (observable or latent). However, you can combine multiple simple variables, with drastically different speeds, into a single "complex" variable. As a result, you get a potentially infinite-dimensional space of variables (a complex variable made out of n simple variables is an n-dimensional object). You then reduce this high-dimensional space into a low-dimensional space V. For simplicity, let's say it's a one-dimensional space. Inevitably, such a reduction requires making arbitrary choices and losing information.
Here are the most important consequences of the principle, along with some additional justifications:
There are value systems for which claims 1-4 aren't true. In that sense, they're empirical. However, I argue that the claims are a priori true for humans, no matter how wrong our beliefs about the world are.
"Could" is important here. You can optimize/comprehend a thing (in principle) even if you aren't aware of its existence. For example: cave people could easily optimize "the amount of stone knives made of quantum waves" without knowing what quantum waves are; you could in principle easily comprehend typical behavior of red-lipped batfishes even if you never decide to actually do it.
For example, I care about subtle forms of pleasure because they're similar to simpler forms of pleasure. I care about more complex notions of "fairness" and "freedom" because they're similar to simpler notions of "fairness" and "freedom". I care about the concept of "real strawberries" because it's similar to the concept of "sensory information about strawberries". Etc.
Or consider prehistoric people. Even by today's standards, they had a lot of non-trivial positive values ("friendship", "love", "adventure", etc.) and could've easily lived very moral lives, if they avoided violence. Giant advances in knowledge and technology didn't change human values that much. Humans want to have relatively simple lives. Optimizing overly complex variables would make life too chaotic, uncontrollable, and unpleasant.
Note that it would be pretty natural to care about the existence of the pseudorandom number generator, but "the existence of the PRNG" is a much more comprehensible variable than "the current value of the PRNG".
Also, as far as I'm aware, Super Mario Bros. doesn't actually have a pseudorandom number generator. But just imagine that it does.
"Up to now it has been assumed that all our cognition must conform to the objects; but all attempts to find out something about them a priori through concepts that would extend our cognition have, on this presupposition, come to nothing. Hence let us once try whether we do not get farther with the problems of metaphysics by assuming that the objects must conform to our cognition, which would agree better with the requested possibility of an a priori cognition of them, which is to establish something about objects before they are given to us. This would be just like the first thoughts of Copernicus, who, when he did not make good progress in the explanation of the celestial motions if he assumed that the entire celestial host revolves around the observer, tried to see if he might not have greater success if he made the observer revolve and left the stars at rest." (c.) The Critique of Pure Reason, by Immanuel Kant, Bxvi–xviii
I mean, we do care about that model being inner-aligned. But this is a separate problem.
the most instances in training