New(ish) AI control ideas

Stuart_Armstrong

EDIT: this post is no longer being maintained, it has been replaced by this new one.

I recently went on a two-day intense solitary "AI control retreat", with the aim of generating new ideas for making safe AI. The "retreat" format wasn't really a success ("focused uninterrupted thought" was the main gain, not "two days of solitude"; it would have been more effective in three-hour sessions), but I did manage to generate a lot of new ideas. These ideas will now go before the baying bloodthirsty audience (that's you, folks) to test them for viability.

A central thread running through could be: if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach.

To provide inspiration and direction to my thought process, I first listed all the easy responses that we generally give to most proposals for AI control. If someone comes up with a new/old brilliant idea for AI control, it can normally be dismissed by appealing to one of these responses:

The AI is much smarter than us.
It’s not well defined.
The setup can be hacked.
- By the agent.
- By outsiders, including other AI.
- Adding restrictions encourages the AI to hack them, not obey them.
The agent will resist changes.
Humans can be manipulated, hacked, or seduced.
The design is not stable.
- Under self-modification.
- Under subagent creation.
Unrestricted search is dangerous.
The agent has, or will develop, dangerous goals.

Important background ideas:

I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections. A key concept is that we should never just expect a system to behave "nicely" by default (see e.g. here). If we want that, we should define what "nicely" is, and put that in by hand.

I came up with sixteen main ideas, of varying usefulness and quality, which I will be posting in the coming weekdays in comments (the following links will go live after each post). The ones I feel most important (or most developed) are:

Anti-pascaline agent
Anti-restriction-hacking (EDIT: I have big doubts about this approach, currently)
Creating a satisficer (EDIT: I have big doubts about this approach, currently)
Crude measures
False miracles
Intelligence modules
Models as definitions
Added: Utility vs Probability: idea synthesis

While the less important or developed ideas are:

Added: A counterfactual and hypothetical note on AI design
Added: Acausal trade barriers
Anti-seduction
Closest stable alternative
Consistent Plato
Defining a proper satisficer
Detecting subagents
Added: Humans get different counterfactual
Added: Indifferent vs false-friendly AIs
Resource gathering and pre-corrigied agent
Time-symmetric discount rate
Values at compile time
What I mean

Please let me know your impressions on any of these! The ideas are roughly related to each other as follows (where the arrow Y→X can mean "X depends on Y", "Y is useful for X", "X complements Y on this problem" or even "Y inspires X"):

EDIT: I've decided to use this post as a sort of central repository of my new ideas on AI control. So adding the following links:

Short tricks:

High-impact from low impact:

High impact from low impact, best advice:

Overall meta-thoughts:

Pareto-improvements to corrigible agents:

Predicted corrigibility: pareto improvements

AIs in virtual worlds:

Low importance AIs:

Counterfactual agents detecting agent's influence

Wireheading:

Superintelligence and wireheading

AI honesty and testing:

I'm not convinced that sufficiently intelligent agents would create subagents with utility functions that lack terms of the original's UF, at least with a suitable precaution. The example you used (an AI wanting to stay in the box letting out an agent to convert all box-hazards into raw material) seems as though the Boxed AI would want to ensure that the Unboxed Agent was Boxed-AI-Friendly. What would then happen if the Boxed AI had an unalterable belief that its utility function were likely to change in the future, and it couldn't predict how?

Some formalized difference between intentional probability manipulation and unintentional/unpredicted but causally-related happenings would be nice. Minimized intentional impact would then mean that the AI would not wish to take actions on issues of great impact and would defer to humans instead. I'm not sure how it would behave when a human then deferred back to the AI. It seems like it would be a sub-CEV result, because the human would be biased, scope-insensitive, prejudiced, etc. The natural improvement would then seem to be having the AI implement DWIM CEV.

Has much thought gone into defining utility functions piecewise, or multiplicatively with respect to some epistemic probabilities? I'm not sure if I'm just reiterating corrigibility here, but say an agent's utility function is built from some base utility function U as U/P("H") + H*P("H"), where P("H") is the probability that the Gatekeeper thinks the AI should be halted and H is a utility function that rewards halting and penalizes continuation. That was an attempt at a probabilistic piecewise UF that equals "if P("H") then H else U."

Apologies for any incoherency, this is a straight-up brain dump.
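
To make the proposed utility function in the comment above concrete, here is a minimal Python sketch that evaluates U/P("H") + H*P("H") as written and compares it with its stated target, "if P("H") then H else U". The function names, the use of plain numbers for U and H, and the linear mixture (1 - P("H"))*U + P("H")*H are illustrative assumptions added here, not something the comment proposes. The mixture is included only because it is one common way to interpolate between two branches, and it makes the difference easy to see.

```python
def formula_as_written(u: float, h: float, p_halt: float) -> float:
    """Evaluate U/P("H") + H*P("H") exactly as stated in the comment.

    u, h   : utilities assigned to one outcome by U and by H (plain numbers here)
    p_halt : P("H"), the probability that the Gatekeeper wants the AI halted
    """
    if p_halt <= 0.0:
        # The U/P("H") term is undefined at P("H") = 0.
        raise ValueError('formula is undefined when P("H") is 0')
    return u / p_halt + h * p_halt


def linear_mixture(u: float, h: float, p_halt: float) -> float:
    """Hypothetical alternative (an assumption, not from the comment):
    (1 - P("H")) * U + P("H") * H, which equals U at 0 and H at 1."""
    return (1.0 - p_halt) * u + p_halt * h


if __name__ == "__main__":
    u, h = 10.0, 2.0  # arbitrary illustrative values
    for p in (1.0, 0.5, 0.1):
        print(p, formula_as_written(u, h, p), linear_mixture(u, h, p))
```

Under these assumptions, the formula as written gives U + H rather than H when P("H") = 1, and its U term grows without bound as P("H") approaches 0, so it does not reduce to the piecewise target at either endpoint. The linear mixture does, although whether it has the incentive properties the comment is after is a separate question.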
