Recent AI safety work

paulfchristiano

30th Dec 2014

(Crossposted from ordinary ideas).

I’ve recently been thinking about AI safety, and some of the writeups might be interesting to some LWers:

Ideas for building useful agents without goals: approval-directed agents, approval-directed bootstrapping, and optimization and goals. I think this line of reasoning is very promising.
A formalization of one piece of the AI safety challenge: the steering problem. I am eager to see more precise, high-level discussion of AI safety, and I think this article is a helpful step in that direction. Since articulating the steering problem I have become much more optimistic about versions of it being solved in the near term. This mostly means that the steering problem fails to capture the hardest parts of AI safety. But it’s still good news, and I think it may eventually cause some people to revise their understanding of AI safety.
Some ideas for getting useful work out of self-interested agents, based on arguments: of arguments and wagers, adversarial collaboration [older], and delegating to a mixed crowd. I think these are interesting ideas in an interesting area, but they have a ways to go until they could be useful.

I’m excited about a few possible next steps:

Under the (highly improbable) assumption that various deep learning architectures could yield human-level performance, could they also predictably yield safe AI? I think we have a good chance of finding a solution---i.e. a design of plausibly safe AI, under roughly the same assumptions needed to get human-level AI---for some possible architectures. This would feel like a big step forward.
For what capabilities can we solve the steering problem? I had originally assumed none, but I am now interested in trying to apply the ideas from the approval-directed agents post. From easiest to hardest, I think there are natural lines of attack using any of: natural language question answering, precise question answering, sequence prediction. It might even be possible using reinforcement learners (though this would involve different techniques).
I am very interested in implementing effective debates, and am keen to test some unusual proposals. The connection to AI safety is more impressionistic, but in my mind these techniques are closely linked with approval-directed behavior.
I’m currently writing up a concrete architecture for approval-directed agents, in order to facilitate clearer discussion about the idea. This kind of work seems harder to do in advance, but at this point I think it’s mostly an exposition problem.

6 Comments

paulfchristiano (11y)

An approval-directed agent doesn’t simulate a person any more than a goal-directed agent simulates the universe. It tries to predict what actions the person would approve of, just as a goal-directed agent tries to predict what actions lead to good consequences. In the limit, the approval-directed agent is more like an emulation. This is analogous to the way in which a goal-directed agent approaches a simulation of the universe.

So there are two big differences:

You can implement it now; it's just an objective for your system, which it can satisfy to varying degrees of excellence---in the same way that you can build a system to rationally pursue a goal, with varying degrees of excellence.
The overseer can use the agent's help, when deciding what actions it approves of. This results in a form of implicit bootstrapping, since the agent is maximizing the approval of the (overseer+agent) system. In the limit of infinite computing the result would be an emulation with infinite time (or more precisely, the ability to instantiate copies of itself and immediately see their outputs, such that the copies can themselves delegate further). The hope is that a realistic system will converge to this ideal as well as it can, given its limited capabilities---in the same way that a goal-directed system would move towards perfect rational behavior.
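
The analogy in this comment can be made concrete with a minimal Python sketch. Both agents pick the action that maximizes a predicted score; they differ only in what the score predicts. The action names and the numbers in the two tables are invented for illustration, and `choose_action`, `goal_directed`, and `approval_directed` are hypothetical helpers, not part of any proposal in the linked posts.

```python
from typing import Callable, Dict, Sequence

def choose_action(actions: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the action with the highest predicted score (first one wins ties)."""
    return max(actions, key=score)

# Hypothetical predictions, made up for this example.
# A goal-directed agent predicts how good the consequences of each action are.
PREDICTED_OUTCOME_VALUE: Dict[str, float] = {
    "cut corners": 0.9,
    "ask first": 0.6,
    "do nothing": 0.1,
}
# An approval-directed agent predicts how highly the overseer would rate each action.
PREDICTED_APPROVAL: Dict[str, float] = {
    "cut corners": 0.2,
    "ask first": 0.8,
    "do nothing": 0.4,
}

def goal_directed(actions: Sequence[str]) -> str:
    """Pick the action whose predicted consequences score highest."""
    return choose_action(actions, lambda a: PREDICTED_OUTCOME_VALUE[a])

def approval_directed(actions: Sequence[str]) -> str:
    """Pick the action the overseer is predicted to approve of most."""
    return choose_action(actions, lambda a: PREDICTED_APPROVAL[a])

ACTIONS = ["cut corners", "ask first", "do nothing"]
print(goal_directed(ACTIONS))
print(approval_directed(ACTIONS))
```

In a realistic system the lookup tables would be replaced by learned predictors of varying quality, which is where the "varying degrees of excellence" come in. Neither agent needs a full simulation of the universe or of the overseer to act on its best current prediction.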

SteveG (11y)

Technology which can predict whether an action would be approved by a person or by an organization is:

-Practical to create, first applied to test cases, then to limited circumstances, then in more general cases.

-For the test cases and for the limited circumstances, it can be created using some existing machine learning technology without deploying full-scale natural language processing.

-Approval/disapproval is a binary value, and appropriate machine learning approaches would include logistic regression or forest-and-trees methods. We create a model using...

SteveG (11y)

In addition to determining whether an action would be approved using a priori reasoning, an approval-directed AI could also reference a large database of past actions which have either been approved or disapproved. Alternatively, in advance of ever making any real-world decision, the approval-directed AI could generate example scenarios and propose actions to people deemed effective moral reasoners many thousands of times. Their responses would greatly assist the system in constructing a model of whether an action is approvable, and by whom. A lot of approval data could be created fairly readily. The AI can train on this data.
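
As a rough illustration of the approach in these two comments, here is a minimal sketch that fits a logistic regression to a small table of past approve/disapprove decisions and then scores new actions. The two features (`risk` and `reversible`), the six training rows, and the function names are assumptions made up for this example; it uses plain batch gradient descent rather than any particular library.

```python
import math
from typing import List, Sequence, Tuple

# One training row: (feature vector, label), where label 1 = approved, 0 = disapproved.
Example = Tuple[Sequence[float], int]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_approval(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated probability that an action with features x would be approved."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def train_logistic(examples: Sequence[Example], epochs: int = 5000,
                   lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n = len(examples[0][0])
    m = len(examples)
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in examples:
            err = predict_approval(w, b, x) - y
            for i in range(n):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

# Hypothetical "database of past actions": features are (risk, reversible).
EXAMPLES: List[Example] = [
    ((0.1, 1.0), 1),
    ((0.2, 1.0), 1),
    ((0.3, 0.0), 1),
    ((0.7, 0.0), 0),
    ((0.8, 1.0), 0),
    ((0.9, 0.0), 0),
]

weights, bias = train_logistic(EXAMPLES)
# Score two actions that were not in the training data.
print(predict_approval(weights, bias, (0.15, 1.0)))
print(predict_approval(weights, bias, (0.85, 0.0)))
```

With data like this, the model only learns whatever pattern the labelers' past decisions happen to encode. Choosing the features and the people whose approval is recorded remains the hard part, and the sketch does not address it.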