I see little hope of a good agreement to pause AI development unless
leading AI researchers agree that a pause is needed, and help write
the rules. Even with that kind of expert help, there's a large risk
that the rules will be ineffective and cause arbitrary collateral
damage.
Yoshua Bengio has a reputation that makes him one of the best people to
turn to for such guidance. He has now suggested restrictions on AI
development that are targeted specifically at agenty AI.
If turned into a clear guideline, that would be a much more desirable
method of slowing the development of dangerous AI. Alas, Bengio seems to
admit that he isn't yet able to provide that clarity.
Clarifying Risky Agentiness
What do we want to limit via a restriction on agentiness? I'll start by
imagining what an omniscient standards authority would want, and later
examine how feasible the restrictions are.
Drexler's CAIS paper outlines an
approach that would produce superintelligence without ceding much agency
to AIs. Careful adherence to his guidelines would produce systems that
are powerful and fairly safe. Yet he sounds pessimistic about defining a
clear line that would distinguish excessively agenty systems from safe
ones.
The key factor here is the distinction between narrow and broad scope of
goals. Appropriately narrow goals cause systems to focus on limited time
periods and limited aspects of reality. E.g. a translator can care only
about outputting a translation of its current text input, and not care
about improving its ability to do future translations.
Simple deep learning systems have a clear distinction between training
and inference, with inference having a narrow short-term goal: applying
existing abilities to produce an immediate output.
Alas, anything that provides memory between successive inferences blurs
that distinction, making it hard to analyze the extent to which
longer-term goals are creeping into the system. ChatGPT's value depends on
having it know what tokens it has previously generated. That amounts to
giving it memory that could enable longer-term goals. So I see no easy
way to preserve an easily articulated distinction between short-term and
long-term goals.
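
To make the distinction concrete, here is a minimal sketch contrasting the two kinds of inference. It assumes a hypothetical `toy_model` stand-in (not a real API) that just reports how much context it was given; the names `stateless_inference` and `ChatSession` are likewise illustrative.

```python
from typing import Callable, List


def stateless_inference(model: Callable[[str], str], text: str) -> str:
    """Each call sees only its current input; nothing carries over."""
    return model(text)


class ChatSession:
    """Inference with memory: every call is conditioned on all earlier turns."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self.model = model
        self.history: List[str] = []

    def step(self, user_text: str) -> str:
        self.history.append(user_text)
        # The model's input now includes its own earlier outputs.
        reply = self.model("\n".join(self.history))
        self.history.append(reply)
        return reply


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a trained model: reports how many
    # lines of context it was shown.
    return f"saw {len(prompt.splitlines())} line(s)"


session = ChatSession(toy_model)
first = session.step("a")   # conditioned on one line
second = session.step("b")  # conditioned on three lines, including its own reply
```

The stateless function is easy to describe as having a short-term goal. The session is not: whatever the model writes in one turn becomes part of its input in the next, which is the kind of memory that blurs the line.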
TurnTrout's impact regularization ideas provide another path to
limiting the scope of AI goals: preserve attainable utility, and
minimize impact. His Conservative Agency via Attainable Utility
Preservation describes an AUP penalty which, if strong enough relative
to an AI's primary goals, will minimize the extent to which the AI
instrumentally converges on broader-than-intended power-seeking goals.
He suggests penalizing impact as much as possible, adjusting such
penalties to be as high as is consistent with the relevant demands for
increasing capability.
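
A rough sketch of how such a penalty might be computed, loosely following the shape of the AUP penalty as I understand it: compare how well the agent could pursue a set of auxiliary goals after taking an action versus after doing nothing, and subtract a scaled version of that difference from the primary reward. The function names, the averaging, and the `scale` parameter are assumptions made for illustration; the paper should be consulted for the exact formulation.

```python
from typing import Sequence


def aup_penalty(q_action: Sequence[float], q_noop: Sequence[float]) -> float:
    """Mean absolute change in attainable utility relative to doing nothing.

    q_action[i] / q_noop[i] are the (assumed given) values of auxiliary
    goal i after taking the action vs. after a no-op.
    """
    if len(q_action) != len(q_noop) or len(q_action) == 0:
        raise ValueError("need matching, non-empty lists of auxiliary values")
    total = sum(abs(a - n) for a, n in zip(q_action, q_noop))
    return total / len(q_action)


def aup_reward(primary_reward: float,
               q_action: Sequence[float],
               q_noop: Sequence[float],
               lam: float,
               scale: float = 1.0) -> float:
    """Primary reward minus the impact penalty, weighted by lam / scale."""
    return primary_reward - (lam / scale) * aup_penalty(q_action, q_noop)


# Hypothetical numbers: a modest action vs. a high-impact, higher-reward one.
q_noop = [1.0, 1.0]
modest = aup_reward(1.0, [1.0, 1.1], q_noop, lam=2.0)
drastic = aup_reward(2.0, [3.0, 0.0], q_noop, lam=2.0)
```

With these made-up values, the drastic action has the higher primary reward but the lower penalized reward. Raising `lam` corresponds to "penalizing impact as much as possible"; lowering it corresponds to giving way to demands for capability.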
I expect an omniscient authority could use this approach to ensure that
AIs retain a fairly safe tool-like focus on a pretty narrow
understanding of the goals that they're given.
This fits poorly within the framework of a stereotypical regulatory
authority. Any realistic attempt at measuring attainable utility, or
desires for increased AI capability, would become dominated by arbitrary
guessing. I also expect problems with detecting whether AI developers
are implementing it as intended.
In sum, this looks like a great approach if all AI developers earnestly
aim to implement it responsibly. I'm confused as to whether it has much
value in a less responsible world.
Restricting Compute
A majority of serious suggestions for slowing AI development involve
limiting how much compute can be used in training any AI. It's
attractive because we can imagine simple rules that might only need to
constrain a few companies with the biggest AI budgets.
We ought to be cautious about relying on what's easy to measure, rather
than what best describes risks. I previously wrote:
"It's far from obvious whether such a limit would slow capability
growth much.
"One plausible scenario is that it would mainly cause systems to be
developed in a more modular way. That might make us a bit safer by
pushing development more toward what Drexler recommends. Or it might
fool most people into thinking there's a pause, while capabilities
grow at 95% of the pace they would otherwise have grown at."
I'm growing more confident that limits on training compute would cause
some sort of slowdown in the rate at which AIs become more powerful.
I now support modest limits on how fast training compute can be
increased, provided that such proposals are accompanied by caveats to
the effect that this can't function as much more than a band-aid.
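
As an illustration of what a simple rule of this kind could look like, here is a sketch that estimates training compute with the common rule of thumb of roughly 6 FLOPs per parameter per training token, and compares it to a cap that is allowed to grow by a fixed factor each year. The cap, the growth factor, and all function names are hypothetical, chosen only to show the mechanics.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough estimate: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


def allowed_flops(base_cap: float, annual_growth: float,
                  years_since_base: float) -> float:
    """A cap that starts at base_cap and grows by annual_growth each year."""
    return base_cap * annual_growth ** years_since_base


def within_cap(n_params: float, n_tokens: float, base_cap: float,
               annual_growth: float, years_since_base: float) -> bool:
    """True if a single training run stays under the cap for that year."""
    used = training_flops(n_params, n_tokens)
    return used <= allowed_flops(base_cap, annual_growth, years_since_base)


# Hypothetical policy: 1e24 FLOPs today, doubling each year.
big_run_ok = within_cap(1e12, 1e13, base_cap=1e24,
                        annual_growth=2.0, years_since_base=0)
small_run_ok = within_cap(1e11, 1e12, base_cap=1e24,
                          annual_growth=2.0, years_since_base=0)
```

Note that a check like this looks at one training run at a time. Several smaller runs that each pass could still be combined into a more capable modular system, which is one reason to treat such a limit as a band-aid.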
Enforcing Standards
I'll only say a little here about whether a pause/slowdown would be
obeyed.
The feasibility of widespread obedience likely depends on an AI-induced
accident that's as scary as the worst features of Hiroshima and COVID
combined. With a lesser scare, or no scare, any pause will likely be too
weakly enforced to matter much.
I estimate the probability of a well-enforced worldwide pause at around
5 to 10%. That sounds discouraging. But it shouldn't surprise us to
notice that most actions have very little chance of saving or dooming
the world. In most plausible futures, this blog post won't matter. I'm
focusing my attention on futures where a pause will have important
effects.
I don't expect to find any single approach to AI that "solves"
alignment. Rather, I expect there are many small things we can do to
slightly improve our odds. It's plausible enough that we're close to
solving alignment that it seems useful to focus on small improvements in
our odds, when we can't find big improvements.
Ideas about Evaluation
There's no simple way to write standards so that developers will be
completely clear on whether and how they need to comply.
There are likely some adequately shared intuitions about which current
systems are sufficiently AGI-like to need their risks evaluated. But as
soon as billions of dollars start riding on these decisions, there will
be unpleasant disputes as to which systems need to be checked for
compliance.
One idea that comes to mind is to have a standardized procedure for
asking GPT-5 (or something of that power) to evaluate any new system.
The basic idea is that the developer needs to show all the relevant code
to a specially configured GPT, and then ask an exact set of questions
that are designed to, say, evaluate how much the candidate system cares
about events weeks or years in the future.
It should also include some measure of how broad the AI's scope is. I
don't want an AI that's specialized for predicting bond prices 10
years into the future to be considered riskier than an AI that cares
only about maximizing a company's current quarter profits. I'm very
unclear on how to ask GPT about this scope.
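
A minimal sketch of what such a standardized procedure might look like, assuming a hypothetical `ask(code, question)` callable that wraps the specially configured evaluator and returns a short text answer. The two questions are placeholders based on the time-horizon and scope concerns above, not a worked-out questionnaire, and the decision rule is just one plausible choice.

```python
from typing import Callable, Sequence

# Hypothetical placeholder questions; a real standard would need a much
# more carefully designed and fixed set.
QUESTIONS = (
    "Does the system choose actions based on events weeks or years in the future?",
    "Do the system's goals cover a broad scope rather than one narrow task?",
)


def evaluate_system(ask: Callable[[str, str], str],
                    code: str,
                    questions: Sequence[str] = QUESTIONS) -> str:
    """Ask every question about the submitted code and combine the answers.

    Any "yes" sends the system to further review; "cleared" requires an
    explicit "no" to every question; anything else is "inconclusive".
    """
    answers = [ask(code, q).strip().lower() for q in questions]
    if any(a == "yes" for a in answers):
        return "needs review"
    if all(a == "no" for a in answers):
        return "cleared"
    return "inconclusive"


# Stub evaluators standing in for the real model, for illustration only.
cleared = evaluate_system(lambda code, q: "no", "def translate(text): ...")
unsure = evaluate_system(lambda code, q: "I don't know", "def plan(): ...")
```

The rule is deliberately conservative: a clearly safe system gets a fast "cleared", while any hedged answer from the evaluator falls through to "inconclusive" rather than being treated as approval.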
Advantages:
- It should yield fast results for safe systems, so it would be less
  of a burden on those that are clearly safe than is the case with
  most complex standards. This is unlike most standards that require
  an authority to confirm compliance, where human-related delays can
  cost developers by holding up their decisions.
- It requires that some authority explicitly endorse the power of an
  AI. If regulators see their jobs being replaced by software, that
  will increase the political concerns about AI. This will help
  politicians decide that it's safe to say that there are bigger
  concerns than, say, hate speech.
- It limits the risk that standards will be used to entrench
  incumbents.
Disadvantages:
- It will often give inconclusive results? I assume the evaluating
  model will be trained to be more conservative than ChatGPT, i.e.
  quicker to say "I don't know".
- It's hard to predict whether GPT-5 (or whatever) is competent
  enough to detect risks. It wouldn't be too surprising if it could
  be readily fooled into approving dangerous software.
I'll estimate a 25% chance that AIs become competent enough to support
a valuable version of this before it's too late to benefit from a
slowdown in AI progress.
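DeepMind's work on Discovering Agents clarifies how to distinguish an
agent from a non-agent. I like this summary from the gears to ascension:
"Agency is a property of pulling the future back in time; it's when a
system selects actions by conditioning on the future. Agency is when
any object ... takes the shape of the future before the future does
and thereby steers the future."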
But DeepMind's approach isn't of much direct use for evaluating
compliance with a standard. They seem to need costly experiments on
fully trained systems, whereas I see a need for fairly cheap decisions
to be available at the start of training. Not to mention that banning
all agents would be drastic overkill - allowing myopic agents seems
pretty desirable. I still want to commend DeepMind for clarifying our
thoughts about what an agent is.
Concluding Thoughts
Now is not quite the right time to expect competent restrictions on AI
capabilities.
The situation is unstable. It seems moderately urgent to think more
clearly about what kinds of restrictions would be desirable and
effective.
I'm not smart enough to provide a clear proposal for how to buy more
time. I hope this post nudges people to move toward slightly better
guesses.
I won't be surprised if some sort of global restrictions are enacted in
a few years. I have very little idea whether they'll be wise.