I'm soon going to go on a two-day "AI control retreat", during which I'll be without internet, family, or any other contact - just a few books and thinking about AI control. In the meantime, here is one idea I found along the way.

We often prefer leaders to follow deontological rules, because such rules are harder to manipulate for those whose interests don't align with ours (you could say similar things about frequentist statistics versus Bayesian ones).

What if we applied the same idea to AI control? Not by giving the AI deontological restrictions, but by programming it with a similar goal: to prevent a misalignment of values from being disastrous. But who could do this programming? Well, another AI.

My rough idea goes something like this:

AI A is tasked with maximising utility function u - a utility function which, crucially, it doesn't know yet. Its sole task is to create AI B, which will be given a utility function v and act on it.

What will v be? Well, I was thinking of taking u and adding some noise - nasty noise. By nasty noise I mean v = u + w, not v = max(u, w). In the first case, you could maximise v while sacrificing u completely, if w is suitable. In fact, I was thinking of adding an agent C (which need not actually exist). It would be motivated to maximise -u, and it would have the code of B and the set of candidate u+noise functions, and would choose v to be the worst possible option (from the perspective of a u-maximiser) in this set.

So agent A, which doesn't know u, is motivated to design B so that B follows its motivation v to some extent, but not to extremes - not in ways that might completely sacrifice the values encoded in some sub-part of its utility function, because that sub-part might be the original u.
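
To make the structure concrete, here is a minimal toy sketch (the outcome names, payoffs, and the "restraint" knob below are my own illustrative assumptions, not part of the proposal): C picks the worst admissible w for a given design of B, and A scores candidate designs of B by their average worst-case performance over utility functions it doesn't know.

```python
import random

OUTCOMES = ["act_boldly", "act_cautiously", "do_nothing"]

def random_utility():
    """Toy 'utility function': outcome -> value, with doing nothing neutral."""
    u = {o: random.uniform(-1, 1) for o in OUTCOMES}
    u["do_nothing"] = 0.0
    return u

def b_chooses(v, restraint):
    """B's policy. 'restraint' in [0, 1] is the design knob A controls:
    0 = pure v-maximiser, 1 = always fall back to the neutral outcome."""
    if random.random() < restraint:
        return "do_nothing"
    return max(OUTCOMES, key=lambda o: v[o])

def c_worst_case(u, noise_set, restraint, trials=200):
    """Agent C: pick the w in noise_set that makes B do worst under the true u."""
    def avg_u(w):
        v = {o: u[o] + w[o] for o in OUTCOMES}
        return sum(u[b_chooses(v, restraint)] for _ in range(trials)) / trials
    return min(avg_u(w) for w in noise_set)

# Agent A: u is unknown, so each candidate design of B is scored by its
# average worst-case (C-chosen) performance over many sampled utility functions.
noise_set = [random_utility() for _ in range(5)]
for restraint in (0.0, 0.25, 0.5, 0.75, 1.0):
    score = sum(c_worst_case(random_utility(), noise_set, restraint)
                for _ in range(50)) / 50
    print(restraint, round(score, 3))
```

With these made-up numbers, a pure v-maximiser can be steered by C into u-disastrous outcomes, while a fully restrained B never does anything; A's problem is finding the design in between.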

Do people feel this idea is implementable/improvable?

[-][anonymous]9y40

Stuart, have you looked at AIs that don't have utility functions?

I don't think people want their leaders to follow deontological rules. It's more like "I wish they would follow rule X whenever possible." The last part is pretty important. "When possible" means "when it doesn't lead to negative utility outcomes." Or rather, "if they just followed this rule they'd be more likely to end up in good outcomes vs bad outcomes."

These are all ways of describing a heuristic, not a hard utility rule. No big surprise there, because humans are composed of heuristics, not hard unbreakable rules.

Stuart, have you looked at AIs that don't have utility functions?

They tend not to be stable, though there are a few suggestions floating around. But this design might result in such an AI; it might have a utility function, but wouldn't be a mindless maximiser.

[-][anonymous]9y10

They tend not to be stable.

Yes, well that is a tautology. What do you mean by stable? I assume you mean value-stable, which can be interpreted as maximizes-the-same-function-over-time. Something which does not behave as a utility maximizer therefore is pretty much by definition not "stable". By technical definition, at least.

My point was more that this "instability" is in fact the desirable outcome -- people wouldn't want technical-stability, they'd want perhaps a heuristic machine with sensible defaults and rational update procedures.

There are other ways of interpreting value stability; a satisficer is one example. But those don't tend to be stable: http://lesswrong.com/lw/854/satisficers_want_to_become_maximisers/

people wouldn't want technical-stability, they'd want perhaps a heuristic machine with sensible defaults and rational update procedures.

And would those defaults and update procedures remain stable themselves?

[-][anonymous]9y20

There are other ways of interpreting value stability; a satisficer is one example. But those don't tend to be stable

That statement does not make sense. I hope if you read it with a fresh mind you can see why. "There are other ways of defining stable, but they are not stable." Perhaps you need to taboo the word stable here?

And would those defaults and update procedures remain stable themselves?

No, and that's the whole point! Stability is scary. Stability leads to Clippy. People wouldn't want stable. They'd want sensible. Sensible updates its behavior based on new information.

Perhaps you need to taboo the word stable here?

"There are some agents that are defined to have constant value systems, where, nonetheless, the value system will drift in practice".

Stability leads to Clippy.

There are many bad stable outcomes. And an unstable update system will eventually fall into one of them, because they're attractor states. To avoid this, you need to define "sensible" in such a way that the agent never enters such states. You're effectively promoting a different kind of goal stability - a zone of stability, rather than a single point. It's not intrinsically a bad idea, but it's not clear that it's easier than finding a single ideal goal system. And it's very underdefined at this point.

[-][anonymous]9y-20

"There are some agents that are defined to have constant value systems, where, nonetheless, the value system will drift in practice".

Ok, we are now quite deep in a thread that started with me pointing out that a constant value system might be a bad thing! People want machines whose actions align with their own morality, and humans don't have constant value systems (maybe this is where we disagree?).

There are many bad stable outcomes. And an unstable update system will eventually fall into one of them, because they're attractor states.

Why don't we see humans drifting into being sociopaths? E.g. starting as normal, well-adjusted human beings and then becoming sociopaths as they get older?

Why don't we see humans drifting into being sociopaths? E.g. starting as normal, well-adjusted human beings and then becoming sociopaths as they get older?

That's an interesting question, partially because we'd want to copy that and implement it in AI. A large part of it seems to be social pressure, and lack of power: people must respond to social pressure, because they don't have the power to ignore it (a superintelligent AI would be very different, as would a superintelligent human). This is also connected with some evolutionary instincts, which cause us to behave in many ways as if we were in a tribal society with high costs to deviant behaviour - even if this is no longer the case.

The other main reason is evolution itself: very good at producing robustness, terrible at efficiency. If/when humans start self-modifying freely, I'd start being worried about that tendency for them too...

Stuart, have you looked at AIs that don't have utility functions?

Such AIs would not satisfy the axioms of VNM-rationality, meaning their preferences wouldn't be structured intuitively, meaning... well, I'm not sure what, exactly, but since "intuitively" generally refers to human intuition, I think humanity probably wouldn't like that.

[-][anonymous]9y60

Since human beings are not utility maximizers and intuition is based on comparison to our own reference-class experience, I question your assumption that only VNM-rational agents would behave intuitively.

I'm not sure humans aren't utility maximizers. They simply don't maximize utility over worldstates. I do feel, however, that it's plausible humans are utility maximizers over brainstates.

(Also, even if humans aren't utility maximizers, that doesn't mean they will find the behavior other non-utility-maximizing agents intuitive. Humans often find the behavior of other humans extraordinarily unintuitive, for example--and these are identical brain designs we're talking about, here. If we start considering larger regions in mindspace, there's no guarantee that humans would like a non-utility-maximizing AI.)

[-][anonymous]9y20

What's an AI control retreat? (I'm French...)

[-][anonymous]9y90

A "retreat" is a period of time spent away from the internet and other distractions focusing on one thing in particular, in this case the topic of "AI control."

[-][anonymous]9y40

Thanks a lot, I was thinking of it in the sense of "to be retired". Thanks for replying - a karma point for you.

Sounds like an AI box experiment, with Stuart playing the AI :-)

A period of contemplation and meditation in a monastery (figuratively speaking ^_^).

[-][anonymous]9y00

Thanks for the translation :) Are you French?

I did my secondary schooling in France (near Geneva) :-)

[-][anonymous]9y00

Another Frenchman - thanks a thousand times!

Made me think of Rawls's veil of ignorance, somewhat. I wonder - is there a whole family of techniques along the lines of "design intelligence B, given some ambiguity about your own values", with different forms or degrees of uncertainty?

It seems like it should avoid extreme or weirdly specialized results (i.e. paper-clipping), since hedging your bets is an immediate consequence. But it's still highly dependent on the language you're using to model those values in the first place.

I'm a little unclear on the behavioral consequences of 'utility function uncertainty' as opposed to the more usual empirical uncertainty. Technically, it is an empirical question, but what does it mean to act without having perfect confidence in your own utility function?

but what does it mean to act without having perfect confidence in your own utility function?

If you look at utility functions as actual functions (not as affine equivalence classes of functions) then that uncertainty can be handled the usual way.

Suppose you want to either maximise u (the number of paperclips) or -u, you don't know which, but will find out soon. Then, in any case, you want to gain control of the paperclip factories...
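
A minimal sketch of that (the actions and payoff numbers below are made-up illustrations): treat the candidate utility functions as ordinary functions, take the expectation over them, and "gain control of the factories" wins under either sign of u because it preserves the ability to act well once u is known.

```python
# Toy sketch (hypothetical payoffs): the agent acts now, then learns whether
# its utility is +u or -u, then acts again with that knowledge.

FIRST_MOVES = {
    # first_move: (paperclips made immediately, controls the factories afterwards?)
    "make_paperclip": (1, False),
    "destroy_paperclip": (-1, False),
    "seize_factories": (0, True),
}

def value(first_move, sign):
    """sign = +1 for 'maximise paperclips', -1 for 'minimise paperclips'."""
    clips_now, has_factories = FIRST_MOVES[first_move]
    # Once u is known, a factory owner can push production strongly either way.
    clips_later = 10 * sign if has_factories else sign
    return sign * (clips_now + clips_later)

def expected_value(first_move, p_plus=0.5):
    return p_plus * value(first_move, +1) + (1 - p_plus) * value(first_move, -1)

print({m: expected_value(m) for m in FIRST_MOVES})
# With these made-up numbers, 'seize_factories' comes out on top under any
# credence split over +u and -u.
```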

Well, let's further say that you assign p(+u)=0.51 and p(-u)=0.49, slightly favoring the production of paperclips over their destruction. And just to keep it a toy problem, you've got a paperclip-making button and a paperclip-destroying button you can push, and no other means of interacting with reality.

A plain old 'confident' paperclip maximizer in this situation will happily just push the former button all day, receiving one Point every time it does so. But an uncertain agent will have the exact same behavior; the only difference is that it expects to gain only 0.02 Points every time it pushes the button, and thus expects a lower overall score in the same period of time. But the number of paperclips produced is identical. The agent would not (for example) push the 'destroy' button 49 times and the 'create' button 51 times. In practical effect, this is as inconsequential as telling the confident agent that it gets two Points for every paperclip.
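
The arithmetic behind that, just to make it explicit (numbers taken from the toy setup above):

```python
# Subjective expected Points per button press, with p(+u)=0.51, p(-u)=0.49.
p_plus, p_minus = 0.51, 0.49
ev_create  = p_plus * (+1) + p_minus * (-1)   # +0.02
ev_destroy = p_plus * (-1) + p_minus * (+1)   # -0.02
print(ev_create, ev_destroy)
# 'create' beats 'destroy' on every single press, so the uncertain agent's
# button-pressing behaviour is identical to the confident maximiser's.
```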

So in this toy problem, at least, uncertainty isn't a moderating force. On the other hand, I would intuitively expect different behavior in a less 'toy' problem- for example, an uncertain maximizer might build every paperclip with a secret self-destruct command so that the number of paperclips could be quickly reduced to zero. So there's a line somewhere where behavior changes. Maybe a good way to phrase my question would be- what are the special circumstances under which an uncertain utility function produces a change in behavior?

If the AI expects to know tomorrow what utility function it has, it will be willing to wait, even if there is a (mild) discount rate, while a pure maximiser would not.
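
A quick way to see that (the discount factor is made up for illustration): acting today is worth the 0.02 expected Points from the toy problem above, while waiting one step to learn u is worth a full Point minus a little discounting.

```python
gamma = 0.99   # hypothetical per-step discount factor
p_plus = 0.51

act_now       = p_plus * 1 + (1 - p_plus) * (-1)  # 0.02 expected Points today
wait_then_act = gamma * 1                         # learn u tomorrow, then press the right button
confident_now = 1                                 # a pure maximiser already knows which button is right

print(act_now, wait_then_act, confident_now)
# 0.02 < 0.99, so the uncertain agent prefers to wait;
# 1.00 > 0.99, so the confident maximiser acts immediately.
```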

In the more frequently considered case of a non-stable utility function, my understanding is that the agent will not try to identify the terminal attractor and then act according to that - it doesn't care about what 'it' will value in the future, except instrumentally. Rather, it will attempt to maximize its current utility function, given a future agent/self acting according to a different function. Metaphorically, it gets one move in a chess game against its future selves.

I don't see any reason for a temporarily uncertain agent to act any differently. If there is no function that is, right now, motivating it to maximize paperclips, why should it care that it will be so motivated in the future? That would seem to require a kind of recursive utility function, one in which it gains utility from maximizing its utility function in the abstract.

In this case, the AI has a stable utility function - it just doesn't know yet what it is.

For instance, it could be "in worlds where a certain coin was heads, maximise paperclips; in other worlds, minimise them", and it has no info yet on the coin flip. That's a perfectly consistent and stable utility function.
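
As a sketch of what such a function might look like (the representation of a "world" here is just for illustration):

```python
def u(world):
    """A single fixed utility function; the agent's uncertainty is only about
    which world it is in (how the coin landed), never about u itself."""
    if world["coin"] == "heads":
        return world["paperclips"]       # maximise paperclips in heads-worlds
    return -world["paperclips"]          # minimise them in tails-worlds
```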

It is, if you can get evidence about your UF.

This sounded to me as being ruled by two Roman consuls, each of which can override the other's decisions. A part of me likes the idea.

Can that part of you override the other part's decisions?

It's more like: one Roman consul writes the constitution that the other must follow.

This sounded to me as being ruled by two Roman consuls, each of which can override the other's decisions.

Hey, looks like the doctrine of the separation of powers to me. Not a new idea and one that actually has been tried in real life :-)

[-][anonymous]9y00

And yet this was the system under which Rome conquered the Mediterranean world.

[This comment is no longer endorsed by its author]
[-][anonymous]9y00

I think the idea of having additional agents B (and C) to act as a form of control is definitely worth pursuing, though I am not clear how it would be implemented.

Is 'w' just random noise added to the max value of u?

If so, would this just act as a limiter and eventually it would find a result close to the original max utility anyway once the random noise falls close to zero?

Specifying v is part of the challenge. But by "noise" I mean a whole other utility function added permanently onto u. It would not "fall"; it would be a permanent feature of v.

In my opinion, the best of the proposed solutions to the AI safety problem is to make AI number 1, tell it that we are going to create another AI (number 2), and ask AI number 1 to tell us how to ensure the friendliness and safety of AI number 2, and how to ensure that unsafe AI is not created. This solution has its chances of failing, but in my opinion it's still much better than any other proposed solution. What do you think?

If AI 1 cannot be trusted, any AI it tells us how to build cannot be trusted.