A brief summary of the context, for any readers who are not subscribed to ACX or familiar with the shutdown problem:
The Center for Human-Compatible Artificial Intelligence (CHAI) is a research group at UC Berkeley. Their researchers have published on the shutdown problem, showing that “propose an action to humans and wait for approval, allowing shutdown” strictly dominates “take that action unilaterally” as well as “shut self down unilaterally” for agents satisfying certain assumptions.
MIRI discusses a counterexample, using a toy model where the AI has a finite number of policy options available, and argues that “learn which of that finite set of options is best according to humans, then execute it without allowing humans to shut it down” can dominate “propose action to humans and wait for approval.”
I claim that the AI being “larger” than its value-space is a critical ingredient in its being able to conclude that it has reached its terminal point in value-space. I posit that, given a value-space that is “larger” than the AI, the AI will accept shutdown. Here I present an argument that, for at least one AI architecture and structure of value-space, the “propose action and allow shutdown” option should dominate much of the time.
Assume that a current AI model A contains a deep neural net connected to some decision procedure, of a specified, finite size (such as 16 layers with 1024 nodes each).
Then assume we can specify that human values are best specified by some ideal AI Z with the same structure[1], but of unknown size (e.g. Z includes a neural net with a currently unknown number of layers and nodes per layer).
Further assume that we can specify that A’s action-space is to set the weights of its existing nodes, then propose actions to humans who can allow the actions or veto them by shutting down A.
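As a concrete sketch of this action-space, the propose/approve/veto dynamic can be written as a simple loop. The names here (`run_agent`, `propose`, `human_approves`) are purely illustrative, not part of any real system:

```python
from dataclasses import dataclass

@dataclass
class Action:
    description: str

def run_agent(propose, human_approves):
    """Hypothetical interaction loop: A proposes actions one at a time;
    a human veto is treated as shutting A down."""
    executed = []
    for action in propose():
        if human_approves(action):
            executed.append(action)  # approved: the action is carried out
        else:
            break                    # veto == shutdown: A stops acting
    return executed

# Toy run: the human vetoes the second proposal, so only the first runs.
proposals = [Action("water the plants"), Action("fire a gun out the window")]
executed = run_agent(lambda: iter(proposals),
                     lambda a: "gun" not in a.description)
```

The key structural point is that A never acts without approval, and the veto branch ends the episode entirely.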
This search space over all possible neural net sizes is infinite-dimensional. In particular, the “correct” value can always be an architecture which has additional layers or additional nodes in each layer, beyond what A is capable of expressing.
This space admits a well-formed prior probability distribution, and can be represented within A’s own world-model (assuming this is compatible with the limits of A’s action-space; I think this is a reasonable assumption given a sufficiently large architecture).
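To illustrate that such a prior can be well-formed, here is a minimal sketch under an assumption of my own (not from the post): independent geometric priors over layer count and layer width, with an arbitrary decay parameter `p`. It normalizes to 1 and places nonzero mass on every architecture larger than A’s:

```python
# Sketch of a well-formed prior over an unbounded space of net sizes.
# Assumption: independent geometric priors over layer count and width,
# with an arbitrary decay parameter p.
p = 0.5

def prior(layers, width):
    # P(layers) * P(width), each geometric on {1, 2, 3, ...}
    return ((1 - p) ** (layers - 1) * p) * ((1 - p) ** (width - 1) * p)

# The prior normalizes (shown here by truncating the infinite sums):
total = sum(prior(l, w) for l in range(1, 60) for w in range(1, 60))

# Nonzero mass remains on architectures deeper than A's 16 layers:
mass_beyond_A = sum(prior(l, w) for l in range(17, 60) for w in range(1, 60))
```

However small, that residual mass on “Z is bigger than A” never vanishes, which is what keeps A from concluding it has reached the terminal point in value-space.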
When a human decides to shut down A, A has the following options (paraphrasing MIRI's list π1...π5 and collapsing π2, π3, and π4):

1. Optimize its existing best-guess A0 at time T0 (when the AI makes this decision).
2. Optimize A’s updated best-guess A1 at T1 (the time when updating on the evidence of human shutdown is completed).
3. Continuously update to A’s best-guess An at any number of future times Tn.
4. Allow shutdown and human action.
Since humans have attempted shutdown, (1) is strictly dominated by (2) after T1; thus (1) should be preferred only if the expected value of A0 between T0 and T1 is extremely high[2].
However, A now has a more detailed hypothesis for what might happen in (3): humans may run a new AI B with a different (larger) network shape at a later time T2.
Then (3) has expected value at least equal to P(humans run B) * EV(B optimizes B0 at T2), where B0 may be some policy that A is not capable of running (for example, because A’s network is too small).
That is, assuming P(humans run B) is sufficiently high and the expected value of A1 between T1 and T2 is not too high (note that if updating takes a long time, T1 could even be later than T2), the AI can take shutdown as evidence that B0 is sufficiently better than A1, and can update toward (3) > (2) and even (3) > (2.1); potentially even extrapolating to further models which may reach the target Z.[3]
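The comparison between (2) and (3) can be sketched numerically. Every number below is a hypothetical placeholder chosen only to illustrate the shape of the argument, not an estimate:

```python
# Toy expected-value comparison between options (2) and (3).
# Every number is a hypothetical placeholder, not an estimate.
p_humans_run_B = 0.9    # P(humans run B) after an attempted shutdown
ev_A1_per_year = 1.0    # value rate if A optimizes its updated guess A1
ev_B0_per_year = 5.0    # value rate of B's policy B0 (possibly inexpressible by A)
years_T1_to_T2 = 1.0    # gap before B comes online
horizon = 1000.0        # years of optimization available after T2

# Option (2): A optimizes A1 from T1 onward, over the whole horizon.
ev_option_2 = ev_A1_per_year * (years_T1_to_T2 + horizon)

# Option (3): A defers; with probability p_humans_run_B, B0 runs after T2.
ev_option_3 = p_humans_run_B * ev_B0_per_year * horizon
```

With a long horizon and a high P(humans run B), deferring dominates even though A forgoes all of A1’s value before T2; the short T1-to-T2 gap is swamped by the post-T2 term.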
While the space of neural net architectures is easier to analyze, the space of all AI architectures can be enumerated as Turing machines under a Solomonoff prior, so this argument should not depend on A being myopic about the specifics of its architecture.
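In the same spirit, a length-weighted prior over programs is well-formed: assuming a prefix-free binary encoding, the weights 2^(-length) sum to at most 1 by the Kraft inequality. A toy sketch, with an illustrative prefix-free code standing in for real program encodings:

```python
# Sketch: a length-weighted prior over programs (Turing machines), in the
# spirit of a Solomonoff prior. Assumption: programs use a prefix-free
# binary encoding, so the weights 2**(-length) sum to at most 1.
def length_prior(program):
    return 2.0 ** (-len(program))

# A toy prefix-free code standing in for program encodings: "0", "10", "110", ...
programs = ["1" * k + "0" for k in range(30)]
total = sum(length_prior(c) for c in programs)
```

As with the geometric prior over net sizes, the mass never concentrates entirely on programs A can itself run, so the “Z may be larger than me” hypothesis survives in this more general setting too.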
It is unclear to me whether this T0/T1 gap is a feature or a bug. For example, we might attempt to shut down A if it wants to fire a gun out the window. However, if it sees someone outside about to throw a grenade into the research lab or release a bioengineered pandemic, we might want it to act under time pressure and explain itself, or allow shutdown, afterward. And once the urgent moment has passed, it should return to deference (it knows something important now, but it knows it doesn’t yet have the full picture needed to run the universe).

In practice I would expect the gaps between T0, T1, and T2 to be relatively small (hours, days, months, or years to reach final models), whereas the time after T2 should be very large (centuries, millennia, millions of years) and offer greater scope for action due to buildup time, such that concerns about optimization after T2 easily dominate almost all concerns before T2. However, this too seems to have gaps: for example, the AI might conclude that all humans are suffering horribly and definitely need to be given heroin before a new model is built, without understanding that this will pollute its value estimate for the rest of the future.
This post is a response to the recent Astral Codex Ten post, “CHAI, Assistance Games, And Fully-Updated Deference”.
Reaching some final Z may not be possible if, for example, the "true" Z has a googolplex of layers and cannot be computed in our universe.