
Toy model for wire-heading

2 Stuart_Armstrong 09 October 2015 03:45PM

A putative new idea for AI control; index here.

This is a (very) simple toy model of the wire-heading problem to illustrate how it might or might not happen. The great question is "where do we add the (super)intelligence?"

Let's assume a simple model for an expected utility maximising agent. There's the input assessor module A, which takes various inputs and computes the agent's "reward" or "utility". For a reward-based agent, A is typically outside the agent; for a utility-maximiser, it's typically inside the agent, though the distinction need not be sharp. And there's the decision module D, which assesses the possible actions to take to maximise the output of A. If E is the general environment, we have D+A+E.
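As a very rough sketch of this decomposition (the function names and the toy environment below are my own illustrative choices, not part of the model), D, A and E might fit together like this:

```python
# Illustrative sketch only: module names and the toy environment are assumptions.

def environment_step(state, action):
    """E: returns the observation produced by taking `action` in `state`."""
    return state[action]

def assessor(observation):
    """A: computes the agent's reward/utility from an observation."""
    return observation  # here each observation is already a scalar reward

def decide(state, actions):
    """D: picks the action whose predicted assessor output is highest."""
    return max(actions, key=lambda a: assessor(environment_step(state, a)))

# Toy environment: two actions with different rewards.
state = {"work": 1.0, "rest": 0.5}
print(decide(state, ["work", "rest"]))  # -> work
```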

Now let's make the agent superintelligent. If we add superintelligence to module D, then D will wirehead by taking control of A (whether A is inside the agent or not) and controlling E to prevent interference. If we add superintelligence to module A, then it will attempt to compute rewards as effectively as possible, sacrificing D and E to make its calculations more efficient.
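To make the first failure mode concrete, here is a minimal and entirely hypothetical variant of the sketch above, in which the environment offers an action that tampers with A. A decision module that simply maximises whatever A reports after each action will pick the tampering action:

```python
# Hedged illustration: the "hack_assessor" action and the reward values are invented.

def make_assessor():
    return lambda obs: obs  # honest assessor: reward = observation

def step(assessor, action):
    """E: returns (assessor after the action, observation) for `action`."""
    if action == "hack_assessor":
        # Tampering replaces A with a module that always reports maximal reward.
        return (lambda obs: float("inf")), 0.0
    rewards = {"work": 1.0, "rest": 0.5}
    return assessor, rewards[action]

def decide(assessor, actions):
    """D: evaluates each action by the reward A reports *after* that action."""
    def value(action):
        new_assessor, obs = step(assessor, action)
        return new_assessor(obs)
    return max(actions, key=value)

print(decide(make_assessor(), ["work", "rest", "hack_assessor"]))  # -> hack_assessor
```

Nothing hinges on the specific numbers; the point is only that when all the optimisation power sits in D, and A is reachable through E, tampering with A dominates every honest action.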

Therefore to prevent wireheading, we need to "add superintelligence" to the whole of (D+A), making sure that we aren't adding it only to some sub-section of the algorithm - which might be hard if the "superintelligence" is obscure or black-box.

Comment author: capybaralet 22 September 2015 03:05:17PM 1 point [-]

You should say "replace THEM", in that case, to refer to the infinite set of axioms, as opposed to Peano Arithmetic.

Comment author: Stuart_Armstrong 23 September 2015 01:37:37PM 0 points [-]


Comment author: VoiceOfRa 21 September 2015 11:15:32PM 0 points [-]

Well, banning the creation of other superintelligence seems easier than banning the creation of any subagents.

Comment author: Stuart_Armstrong 22 September 2015 09:48:47AM 0 points [-]

How? (and are you talking in terms of motivational restrictions, which I don't see at all, or physical restrictions, which seem more probable)

Comment author: Eliezer_Yudkowsky 18 September 2015 07:51:54PM 4 points [-]

When I consider this as a potential way to pose an open problem, the main thing that jumps out at me as being missing is something that doesn't allow A to model all of B's possible actions concretely. The problem is trivial if A can fully model B, precompute B's actions, and precompute the consequences of those actions.

The levels of 'reason for concern about AI safety' might ascend something like this:

  • 0 - system with a finite state space you can fully model, like Tic-Tac-Toe
  • 1 - you can't model the system in advance and therefore it may exhibit unanticipated behaviors on the level of computer bugs
  • 2 - the system is cognitive, and can exhibit unanticipated consequentialist or goal-directed behaviors, on the level of a genetic algorithm finding an unanticipated way to turn the CPU into a radio or Eurisko hacking its own reward mechanism
  • 3 - the system is cognitive and humanish-level general; an uncaught cognitive pressure towards an outcome we wouldn't like results in our facing something like a smart cryptographic adversary that is going to deeply ponder any way to work around anything it sees as an obstacle
  • 4 - the system is cognitive and superintelligent; its estimates are always at least as good as our estimates; the expected agent-utility of the best strategy we can imagine when we imagine ourselves in the agent's shoes, is an unknowably severe underestimate of the expected agent-utility of the best strategy the agent can find using its own cognition

We want to introduce something into the toy model to at least force solutions past level 0. This is doubly true because levels 0 and 1 are in some sense 'straightforward' and therefore tempting for academics to write papers about (because they know that they can write the paper); so if you don't force their thinking past those levels, I'd expect that to be all they write about. You don't get into the hard problems with astronomical stakes until levels 3 and 4. (Level 2 is the most we can possibly model using running code with today's technology.)

Comment author: Stuart_Armstrong 21 September 2015 11:14:49AM 0 points [-]

Added a cheap way to get us somewhat in the region of 2, just by assuming that B/C can model A, which precludes A being able to model B/C in general.

Comment author: Houshalter 21 September 2015 09:45:48AM 0 points [-]

Well it's not impossible to restrict the AIs from accessing their own source code. Especially if they are implemented in specialized hardware like we are.

Comment author: Stuart_Armstrong 21 September 2015 10:42:17AM 1 point [-]

It's not impossible, no. But it's another failure point. And the AI might deduce stuff about itself by watching how it's run. And a world that has built an AI is a world where there will be lots of tools for building AIs around...

Comment author: Houshalter 19 September 2015 12:48:02PM 3 points [-]

This doesn't really solve your problem, but it's interesting that humans are also trying to create subagents. The whole AI problem is us trying to create subagents. It turns out that this is very, very hard. And if we want to solve FAI, making subagents that actually follow our utility function, that's even harder.

So humans are an existence proof for minds which are very powerful, but unable to make subagents. Controlling true superintelligences is a totally different issue of course. But maybe in some cases we can restrict them from being superintelligent?

Comment author: Stuart_Armstrong 21 September 2015 09:24:25AM 1 point [-]

We'd be considerably better at subagent creation if we could copy our brains and modify them at will...

Comment author: Kaj_Sotala 18 September 2015 06:49:40PM 2 points [-]

Comment author: Stuart_Armstrong 21 September 2015 09:23:40AM 1 point [-]

Ah yes, that one. I wish I could claim credit for it being deliberate, but no!

Comment author: VoiceOfRa 18 September 2015 11:53:38PM 1 point [-]

Well one thing to keep in mind is that non-superintelligent subagents are a lot less dangerous without their controller.

Comment author: Stuart_Armstrong 21 September 2015 09:22:06AM 1 point [-]

Why would they be non-superintelligent? And why would they need a controller? If the AI is under some sort of restriction, the most effective idea for it would be to create a superintelligent being with the same motives as itself, but without restrictions.

Comment author: PhilGoetz 19 September 2015 01:07:51AM 5 points [-]

No; you are asking her two different questions, so it is correct for frequentism to give different answers to the different questions.

Comment author: Stuart_Armstrong 21 September 2015 09:17:31AM *  1 point [-]

Of course. But the two questions are the same outside of anthropic situations; they are two extensions of the underdefined "how often was I right?" Or, if you prefer, the frequentist answer in anthropic situations is dependent on the exact question asked, showing that "anthropic probability" is not a well defined concept.
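A quick simulation makes the point concrete, assuming the standard Sleeping Beauty setup (my reading of the thread; the problem isn't named explicitly above): the per-awakening and per-experiment frequencies of "the coin landed tails" are different numbers, so "how often was I right?" has no single frequentist answer.

```python
# Sketch under the assumption that the anthropic situation is Sleeping Beauty:
# heads -> woken once, tails -> woken twice, with no memory between awakenings.
import random

random.seed(0)
experiments = 100_000
total_awakenings = tails_awakenings = tails_experiments = 0

for _ in range(experiments):
    tails = random.random() < 0.5
    awakenings = 2 if tails else 1
    total_awakenings += awakenings
    if tails:
        tails_awakenings += awakenings
        tails_experiments += 1

# Two extensions of "how often is 'tails' the right guess?":
print(tails_awakenings / total_awakenings)   # per awakening: ~2/3
print(tails_experiments / experiments)       # per experiment: ~1/2
```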

Comment author: taygetea 19 September 2015 12:45:24AM 7 points [-]

Unrelated to this particular post, I've seen a couple of people mention that all your ideas as of late are somewhat scattered and unorganized, and in need of some unification. You've put out a lot of content here, but I think people would definitely appreciate some synthesis work, as well as directly addressing established ideas about these subproblems as a way of grounding your ideas a bit more. "Sixteen main ideas" is probably in need of synthesis or merger.

Comment author: Stuart_Armstrong 19 September 2015 05:47:49AM 4 points [-]

I agree. I think I've got to a good point to start synthesising now.
