Here are some thoughts I've had about utility maximizers, heavily influenced by ideas like FDT and Morality as Fixed Computation.

The Description vs The Maths vs The Algorithm (or Implementation)
This is a frame which I think is important. Getting from a description of what we want, to the maths of what we want, to an algorithm which implements that seems to be a key challenge.
I sometimes think of this as a pipeline of development: description → maths → algorithm. A description is something like "A utility maximizer is an agent-like thing which attempts to compress future world states towards ones which score highly in its utility function". The maths of something like that involves information theory (to understand what we mean by compression), proofs (like the good regulator theorem, the power-seeking theorems) etc. The algorithm is something like RL or AlphaGo.
More examples:
| System | Description | Maths | Algorithm |
|---|---|---|---|
| Addition | "If you have some apples and you get some more, you have a new number of apples" | The rules of arithmetic. We can make proofs about them using ZF set theory. | Whatever machine code/logic gates are going on inside a calculator. |
| Physics | "When you throw a ball, it accelerates downwards under gravity" | Calculus | Frame-by-frame updating of position and velocity vectors |
| AI which models the world | "Consider all hypotheses weighted by simplicity, and update based on evidence" | Kolmogorov complexity, Bayesian updating, AIXI | DeepMind's Apperception Engine (but it's not very good) |
| A good decision theory | "Doesn't let you get exploited in the termites problem, while also one-boxing in Newcomb's problem" | FDT, concepts like subjunctive dependence | ??? |
| Human values | ??? | ??? | ??? |
This allows us to identify three failure points:
1. Failure to make an accurate description of what we want (alternatively, failure to turn an intuitive sense into a description)
2. Failure to formalize that description into mathematics
3. Failure to implement that mathematics as an algorithm
These failures can be total or partial. DeepMind's Apperception Engine is basically useless because it's a bad implementation of something AIXI-like. A partial failure of implementation happens when the algorithm only approximately represents the maths: deep neural networks are sort of like idealized Bayesian reasoning, but a very imperfect version of it.
If the algorithm doesn't accurately represent the maths, then reasoning about the maths doesn't tell you about the algorithm. Proving properties of algorithms is much harder than proving them about the abstracted maths of a system.
(As an aside, I suspect this is actually a crux in near-term AI doom arguments: are neural networks and DRL agents similar enough to idealized Bayesian reasoners and utility maximizers that they act in the ways those abstract systems provably act?)
All of this is just to introduce some big classes of reasoners: self-protecting utility maximizers, self-modifying utility maximizers, and thoughts about what a different type of utility-maximizer might look like.
Self-Protecting Utility Maximizers
On a description level: this is a system which chooses actions to maximize the value of a utility function.
Mathematically it compresses the world into states which score highly according to a function V.
Imagine the following algorithm (it's basically a description of an RL agent with direct access to the world state):
Take a world-state vector W, a list of actions A, and a dynamics function D(a, W) = W_new. Have a value function V(W) ∈ ℝ. Then output the action

$$a = \arg\max_{a' \in A} V\big(D(a', W)\big)$$
To train it, update D according to basic deep-learning rules to make it more accurate. Also update V according to some reward signal.
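To make this concrete, here is a minimal sketch in Python, assuming D and V are already-trained plain callables (the training updates from the paragraph above are left out); the names choose_action and WorldState are mine, purely illustrative:

```python
from typing import Callable, Sequence

WorldState = Sequence[float]  # stand-in for the world-state vector W

def choose_action(
    W: WorldState,
    actions: Sequence[int],
    D: Callable[[int, WorldState], WorldState],  # learned dynamics model: (a, W) -> W_new
    V: Callable[[WorldState], float],            # learned value function: W -> R
) -> int:
    """Shallow, one-step utility maximizer: pick the action whose predicted
    next world state scores highest under the current value function V."""
    return max(actions, key=lambda a: V(D(a, W)))
```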
This is a shallow search over a single action. Now consider updating it to use something like a Monte-Carlo tree search. This will cause it to maximize the value of V far into the future.
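As a much simpler stand-in for a full Monte-Carlo tree search, here is a hedged sketch of exhaustive depth-limited lookahead, which already produces the "maximize V far into the future" behaviour on small action spaces (again, all names are illustrative):

```python
from typing import Callable, Sequence

WorldState = Sequence[float]

def lookahead_value(
    W: WorldState,
    actions: Sequence[int],
    D: Callable[[int, WorldState], WorldState],
    V: Callable[[WorldState], float],
    depth: int,
) -> float:
    """Best V achievable within `depth` further actions, assuming greedy play
    with respect to V at every step (a crude stand-in for MCTS)."""
    if depth == 0:
        return V(W)
    return max(lookahead_value(D(a, W), actions, D, V, depth - 1) for a in actions)

def choose_action_deep(W, actions, D, V, depth=3):
    # Pick the action leading to the best reachable future state according to V.
    return max(actions, key=lambda a: lookahead_value(D(a, W), actions, D, V, depth - 1))
```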
So what happens if this system is powerful enough to include an accurate model of itself in its model of the world? And let's say it's also powerful enough to edit its own source code. The answer is pretty clear: it will delete the code which modifies V. Then (if it is powerful enough) it will destroy the world.
Why? Well, it wants to take the action which maximizes the value of V far into the future. If its current V is modified to V′, then it will become an agent which maximizes V′ instead of V. This means the future is likely to be less good according to V.
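Stated in expectation notation (a sketch of the argument, with the expectation taken under the agent's own world model, not a theorem from anywhere): for almost any V′ ≠ V,

$$\mathbb{E}\big[V(W_{\text{future}}) \mid \text{agent keeps } V\big] \;\ge\; \mathbb{E}\big[V(W_{\text{future}}) \mid \text{agent is modified to } V'\big],$$

so deleting the code that would modify V is itself a V-maximizing action.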
This is one of the most obvious problems with utility maximizers, and it was first noticed a long time ago (by AI alignment standards).
(Fake) Self-Modifying Utility Maximizers
On a description level: this is a system which wants something like "Do whatever makes humans happy".
What this might look like mathematically is something which models humans as utility maximizers, then maximizes whatever it thinks humans want to maximize. The part which does this modelling extracts a new value function from its model of the future world.
So for an example of an algorithm: we have D, W, and A the same as above, but instead of using a fixed V, the system has a fixed extractor E which produces a value function V = E(W).
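A minimal sketch of this version, assuming an extractor E that maps a predicted world state to a value function, with D an already-trained callable; the names are again illustrative:

```python
from typing import Callable, Sequence

WorldState = Sequence[float]
ValueFn = Callable[[WorldState], float]

def choose_action_extracted_values(
    W: WorldState,
    actions: Sequence[int],
    D: Callable[[int, WorldState], WorldState],  # dynamics model: (a, W) -> W_new
    E: Callable[[WorldState], ValueFn],          # extractor: predicted world -> value function
) -> int:
    """'Fake' self-modifying maximizer: score each predicted future world with
    the value function extracted from that same world. Note that composing E
    with D is still just a fixed map from (a, W) to a real number."""
    def score(a: int) -> float:
        W_new = D(a, W)
        V = E(W_new)      # "what do the (predicted) humans want?"
        return V(W_new)   # evaluate the predicted world by its own extracted values
    return max(actions, key=score)
```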
Then it chooses futures similarly to our previous algorithm. Like the previous algorithm, it also destroys the world if given a chance.
Why? For one thing, if V depends on W, then the system will simply change W so that the extracted V gives that W a high score. For example, it might modify humans to behave like hydrogen maximizers. Hydrogen is pretty common, so this scores highly.
But another way of looking at this is that E is just acting like V did in the old algorithm: since V only depends on E and W, together they're just another map from W to R.
In this case something which looks like it modifies its own utility function is actually just preserving it at one level down.
Less Fake Self-Modifying Utility Maximizers
So what might be a better way of doing this? We want a system which we might describe as "Learn about the 'correct' utility function without influencing it".
Mathematically this is reminiscent of FDT. The "correct" utility function is something which many parts of the world (e.g. human behaviour) subjunctively depend on. It influences human behaviour, but cannot be influenced.
This might look like a modification of our first algorithm as follows: D(a, W) now returns a series of worlds W_0, ..., W_n drawn from a probability distribution over the possible results of an action a. We begin with our initial estimate of V, which is updated according to some updater U(W, V) = V_new, and each world W_i is evaluated according to the corresponding V_i = U(W_i, V).
This looks very much like the second system, so we add a further condition. For each W_i we also produce an associated p_i representing the relative probability of that world. We then enforce a new consistency condition as a mathematical property:
$$V_{old} = \frac{\sum_i V_i \, p_i}{\sum_i p_i}$$
This amounts to the description that "no action can affect the expected value of a future world state", which is similar to subjunctive dependence from FDT. This is an old solution to the sophisticated cake or death problem.
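A hedged sketch of what checking this condition might look like, reading the equation pointwise over world states (one possible interpretation) and treating the worlds, probabilities, and updater as plain Python objects; all names here are illustrative:

```python
from typing import Callable, Sequence

WorldState = Sequence[float]
ValueFn = Callable[[WorldState], float]

def violates_consistency(
    V_old: ValueFn,
    worlds: Sequence[WorldState],                 # W_0 ... W_n returned by D(a, W)
    probs: Sequence[float],                       # associated relative probabilities p_i
    U: Callable[[WorldState, ValueFn], ValueFn],  # updater: (W_i, V_old) -> V_i
    probe_states: Sequence[WorldState],
    tol: float = 1e-6,
) -> bool:
    """Return True if the action would move the expected value function, i.e.
    if V_old != (sum_i p_i * V_i) / (sum_i p_i) on some probe state."""
    updated = [U(W_i, V_old) for W_i in worlds]
    total = sum(probs)
    for s in probe_states:
        expected = sum(p * V_i(s) for p, V_i in zip(probs, updated)) / total
        if abs(expected - V_old(s)) > tol:
            return True
    return False
```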
There are a few possible ways to implement this consistency:
1. We can have the algorithm modify its own V_old as it considers possible futures.
2. We can have the consistency enforced on the U operator, so that it updates the value function only in consistent ways.
3. We have the rather exotic option of generating the probabilities p_i by comparing V_old to the various V_i.
The first one looks like basic reasoning, and is the suggested answer to the sophisticated cake or death problem given above. But it incentivises the AI to only think in certain ways, if the AI is able to model itself.
The second one seems to run into the problem that if D becomes too accurate, our U is unable to update the value function at all.
The third one is weird and requires more thought on my part. The main issue is that it doesn't guard against attempts by the AI to edit its future value function; it only makes the AI believe they're less likely to work.
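For concreteness, here is one possible reading of the first option above, sketched under the same assumptions as before: as the agent considers possible futures, it folds the candidate updates back into its working value function, so the consistency equation holds by construction. All names remain illustrative.

```python
from typing import Callable, Sequence

WorldState = Sequence[float]
ValueFn = Callable[[WorldState], float]

def fold_in_updates(
    V_old: ValueFn,
    worlds: Sequence[WorldState],                 # candidate futures W_i
    probs: Sequence[float],                       # their relative probabilities p_i
    U: Callable[[WorldState, ValueFn], ValueFn],  # updater: (W_i, V_old) -> V_i
) -> ValueFn:
    """Replace the working value function with the p-weighted mixture of the
    updated V_i, so that what the agent acts on already equals the expected
    post-update value function."""
    updated = [U(W_i, V_old) for W_i in worlds]
    total = sum(probs)
    def V_mixed(s: WorldState) -> float:
        return sum(p * V_i(s) for p, V_i in zip(probs, updated)) / total
    return V_mixed
```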
An algorithm which accurately represents the fixed-computation-ness of human morality is still out of reach.