A putative new idea for AI control; index here.

This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.

It's almost trivially simple. Have the AI construct a module that models humans and models human understanding (including natural language understanding). This is the kind of thing that any AI would want to do, whatever its goals were.

Then transfer that module (using corrigibility) into another AI, and use it as part of the definition of the new AI's motivation. The new AI will then use this module to follow instructions humans give it in natural language.
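As a toy sketch of the two-stage structure (every name below is a made-up placeholder, and the lookup table stands in for the genuinely hard part, namely actually modelling humans):

```python
# Toy sketch only: hypothetical names, with a dictionary standing in for the
# first AI's learned model of what humans mean.

from typing import Dict, List


class HumanModel:
    """Stage 1 output: a predictive model of what humans mean by an instruction.

    The first AI builds this purely to be accurate (something it would want
    anyway, whatever its goals); it is then extracted via corrigibility.
    """

    def __init__(self, readings: Dict[str, List[str]]):
        self._readings = readings  # instruction -> plausible interpretations

    def interpret(self, instruction: str) -> List[str]:
        return self._readings.get(instruction, [])


class InstructionFollowingAI:
    """Stage 2: a new AI whose motivation is defined through the frozen module."""

    def __init__(self, human_model: HumanModel):
        self.human_model = human_model

    def score(self, instruction: str, outcome: str) -> float:
        # Reward outcomes that match the module's reading(s) of the instruction.
        readings = self.human_model.interpret(instruction)
        return float(any(r in outcome for r in readings))


# Toy usage: the dictionary stands in for the first AI's learned human model.
model = HumanModel({"give humans something nice": ["humans receive a pleasant gift"]})
ai = InstructionFollowingAI(model)
print(ai.score("give humans something nice", "humans receive a pleasant gift"))  # 1.0
```

The point of the split is that the module is built by an AI that is only trying to be accurate, while the motivation built on top of it belongs to a different AI.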

 

Too easy?...

This approach essentially solves the whole friendly AI problem by loading it onto the AI, in a way that avoids the "defining goals (or meta-goals, or meta-meta-goals) in machine code" or the "grounding everything in code" problems. As such it is extremely seductive, and will sound better, and easier, than it likely is.

I expect this approach to fail. For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined. Then we have to have a good understanding of the various ways in which the approach might fail, before we can even begin to talk about how it might succeed.

The first issue that springs to mind is what happens when multiple definitions fit the AI's model of human intentions and understanding. We might want the AI to try to accomplish all the things it is asked to do, according to all the definitions. Therefore, similarly to this post, we want to phrase the instructions carefully so that a "bad instantiation" simply means the AI does something pointless, rather than something negative. E.g. "Give humans something nice" seems much safer than "give humans what they really want".
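A toy illustration of "satisfy every reading" (the interpretations and scores are invented, and worst-casing over readings is just one way to cash the idea out):

```python
# Toy sketch: pick the action whose *worst* interpretation still looks fine,
# and default to doing nothing otherwise. All scores are illustrative only.

from typing import Dict


def worst_case(scores_by_reading: Dict[str, float]) -> float:
    # How well an action does under its least favourable interpretation.
    return min(scores_by_reading.values()) if scores_by_reading else 0.0


def choose(candidates: Dict[str, Dict[str, float]], threshold: float = 0.0) -> str:
    # candidates: action -> {interpretation: satisfaction score}.
    # If even the best action fails under some reading, do something pointless
    # (nothing) rather than something negative.
    best = max(candidates, key=lambda a: worst_case(candidates[a]))
    return best if worst_case(candidates[best]) > threshold else "do nothing"


print(choose({
    "hand out flowers": {"nice = pleasant": 0.6, "nice = what they really want": 0.3},
    "rewire brains":    {"nice = pleasant": 0.9, "nice = what they really want": -1.0},
}))  # -> "hand out flowers"
```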

And then of course there are those orders where humans really don't understand what they themselves want...

I'd want a lot more issues like that discussed and solved, before I'd recommend using this approach to getting a safe FAI.

Comments

Problem one: It would require the human to be able to correctly design utopia (or at least not a dystopia - being able to design a not-dystopia is probably rarer than one might think).

Problem two: There are moral problems in letting an AI simulate a human in sufficiently high detail.

Problem three: In certain cases, the human might command things that they do not want. In particular, if the human-module is simulated as essentially a human within the AI, the AI might wirehead the human and ask if humans in general should be wireheaded (or something like that).

Problem one can be addressed by only allowing certain questions/orders to be given.

Problem two is a real problem, with no solution currently.

Problem three sounds like it isn't a problem - the initial model the AI has of a human is not of a wireheaded human (though it is of a wireheadable human). What exactly did you have in mind?

  1. Which leads to the obvious question of whether figuring out the rules about the questions is much simpler than figuring out the rules for morality. Do you have a specific, simple class of questions/orders in mind?

  2. Yes, but it seems to me that your approach is dependent on an 'immoral' system: simulating humans in too high detail. In other cases, one might attempt to make a nonperson predicate and eliminate all models that fail, or something. However, your idea seems to depend on simulated humans.

  3. Well, it depends on how the model of the human works and how it is asked questions. That would probably depend a lot on how the original AI structured the model of the human, and we don't currently have any AIs to test that with. The point is, though, that in certain cases, the AI might compromise the human, for instance by wireheading it or convincing it of a religion or something, and then the compromised human might command destructive things. There's a huge, hidden amount of trickiness, such as determining how to give the human correct information to decide, etc.

3 is the general problem of AIs behaving badly. The way that this approach is supposed to avoid that is by constructing a "human interpretation module" that is maximally accurate, and then using that module + human instructions as the motivation of the AI.

Basically I'm using a lot of the module approach (and the "false miracle" stuff to get counterfactuals): the AI that builds the human interpretation module will build it for the purpose of making it accurate, and the one that uses it will have it as part of its motivation. The old problems may rear their heads again if the process is ongoing, but "module X" + "human instructions" + "module X's interpretation of human instructions" seems rather solid as a one-off initial motivation.

The problem is that the 'human interpretation module' might give the wrong results. For instance, if the AI convinces people that X is morally obligatory, the module might interpret that as X being morally obligatory. It is not entirely obvious to me that it would be useful to have a better model. It probably depends on what the original AI wants to do.

The module is supposed to be a predictive model of what humans mean or expect, rather than something that "convinces" or does anything like that.

I know, but my point is that such a model might be very perverse, such as "Humans do not expect to find out that you presented misleading information." rather than "Humans do not expect that you present misleading information."

You're right. This sort of thing can come up if the module is built around "predicting human behaviour", and the AI is sneaky enough. It wouldn't come up if the module is built around "comparing human models of the world to reality". So there are subtle nuances there to dig into...
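Roughly, the distinction I have in mind (toy booleans and made-up names; just one way to formalise it):

```python
# Toy contrast between scoring the module on predicted human expectations
# versus scoring it on how well human beliefs match reality.

def score_by_expectation(human_believes_all_is_well: bool) -> float:
    # Perverse reading: "humans do not expect to *find out* that they were
    # given misleading information". A sneaky AI can score well here simply
    # by controlling what humans come to believe.
    return 1.0 if human_believes_all_is_well else 0.0


def score_by_match_to_reality(human_believes_all_is_well: bool,
                              all_is_actually_well: bool) -> float:
    # Safer reading: compare the human's model of the world to reality.
    # Misleading the human now costs points instead of earning them.
    return 1.0 if human_believes_all_is_well == all_is_actually_well else 0.0


# Toy case: the AI has misled the human, who believes everything is fine.
print(score_by_expectation(True))              # 1.0 -- deception pays
print(score_by_match_to_reality(True, False))  # 0.0 -- deception is penalised
```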

  1. If the human is one of the very few with a capacity or interest in grand world-changing schemes, they might have trouble coming up with a genuine utopia. If they are one of the great majority without, all you can expect out of them is incremental changes.

  2. And there isn't a moral dilemma in building the AI in the first place, even though it is, by hypothesis, a superset of the human? You are making an assumption or two about qualia, and they are bound to be unjustified assumptions.

  1. Most people I've talked to have one or two world-changing schemes that they want to implement. This might be selection bias, though.

  2. It is not at all obvious to me that any optimizer would be personlike. Sure, it would be possible (maybe even easy!) to build a personlike AI, but I'm not sure it would "necessarily" happen. So I don't know if those problems would be there for an arbitrary AI, but I do know that they would be there for its models of humans.

It is not at all obvious to me that any optimizer would be personlike

It is not at all obvious to me that being personlike is necessary for having qualia at all, however necessary it might be for having personlike qualia.

I dislike the concept of qualia because it seems to me that it's just a confusing name for "how inputs feel from the inside of an algorithm".

In a sense you should be confused about qualia/TWAAFFTI, because we know next to nothing about the subject. It might be the case that "qualia" adds some extra level of confusion... although it might alternatively be the case that TWAAFFTI is something that sounds like an explanation without actually being an explanation. In particular, TWAAFFTI sets no constraints on what kind of algorithm would have morally relevant feelings, which reinforces my original point: if you think an embedded simulation of a human is morally relevant, how can you deny relevance to the host, even at times when it isn't simulating a human?

Maybe it would be clearer if we looked at some already existing maximization processes. Take for instance evolution. Evolution maximizes inclusive genetic fitness. You can punish it by not donating sperm/eggs. I don't care, because evolution is not a personlike thing.

One could argue that it reduces to "know the rigorous actual semantics of human language" instead of the nominal ones. At least analytical philosophy would be solved if one attained this capability. It doesn't sound that easy. One could say that the core problem of AI is that nobody knows with sufficient accuracy what intelligence means.

Indeed. What I'm trying to do here is to see if there is a way to safely let the AI solve the semantics problem (probably not, but worth pondering).

Let me play devil's advocate for this position.

"defining goals (or meta-goals, or meta-meta-goals) in machine code" or the "grounding everything in code" problems.

  1. An AI that is super intelligent will "know what I mean" when I tell it to do something. The difficulty is specifying the AI's goals (at compile time / in machine code) so that the AI "wants" to do what I mean.
  2. Solving the "specify the correct goals in machine code" problem is thus necessary and sufficient for making a friendly AI. A lot of my arguments depend on this claim.
  3. How to specify goals at compile time is a technical question, but we can do some a priori theorizing as to how we might do it. Roughly, there are two high-level approaches: simple hard-coded goals, and goals fed in from more complex modules. A simple hard-coded goal might be something like current reinforcement learners where the reward signal is human praise (or a simple-to-hard-code proxy for human praise, such as pressing a reward button). The other alternative is to make a few modules (e.g. one for natural language understanding, one for modeling humans) and "use it/them as part of the definition of the new AI's motivation." (A toy sketch contrasting the two routes appears after these responses.)

  4. Responses to counterarguments:

4.1: needing to specify commands carefully (e.g. "give humans what they really want").

And then of course there are those orders where humans really don't understand what they themselves want...

The whole point of intelligence is being able to specify tasks in an ambiguous way (e.g. you don't have to specify what you want in such detail that you're practically programming a computer). An AI that actually wants to make you happier (since its goals were specified at compile time using a module that models humans) will ask you to clarify your intentions if you give it vague goals.
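Here is the toy contrast of the two routes from point 3 above (all the components are throwaway stubs I invented for illustration; nothing here is a real RL or NLP interface):

```python
# Toy stubs only: the real language-understanding and human-modeling modules
# are exactly the parts nobody knows how to build yet.

class StubLanguageModule:
    def parse(self, instruction: str) -> str:
        return instruction.lower()  # stand-in for real natural language understanding


class StubHumanModel:
    def satisfaction(self, intended: str, outcome: str) -> float:
        return 1.0 if intended in outcome else 0.0  # stand-in for modeling humans


# Route 1: a simple hard-coded goal, e.g. a reward button as a proxy for praise.
def hardcoded_reward(button_pressed: bool) -> float:
    return 1.0 if button_pressed else 0.0


# Route 2: the goal is defined through the modules fed into the goal system.
def module_mediated_reward(lang: StubLanguageModule, humans: StubHumanModel,
                           instruction: str, outcome: str) -> float:
    return humans.satisfaction(lang.parse(instruction), outcome)


print(hardcoded_reward(True))                                            # 1.0
print(module_mediated_reward(StubLanguageModule(), StubHumanModel(),
                             "Tidy the room", "tidy the room, gently"))  # 1.0
```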

Some other thoughts:

For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined.

It will be hard to accomplish this, since nobody knows how to go about building such modules. Modeling language, humans, and human values are hard problems. Building the modules is a technical question. But it is necessary and sufficient to build the modules and feed them into the goal system of another AI to build a friendly AI. In fact, one could make a stronger argument that any AGI that's built with a goal system must have its goal system specified with natural language modules (e.g. reinforcement learning sucks). Thus, it is likely that any built AGIs would be FAIs.

EDITED to add: Tool-AI arguments. If you can build the modules to feed into an AI with a goal system, then you might be able to build a "tool-AI" that doesn't have a goal system. I think it's hard to say a priori that such an architecture isn't more likely than an architecture that requires a goal system. It's even harder to say that a tool-AI architecture is impossible to build.

In summary, I think the chief issues with building friendly AI are technical issues related to actually building the AI. I don't see how decision theory helps. I do think that unfriendly humans with a tool AI are something to be concerned about, but doing math research doesn't seem related to that. (Incidentally, MIRI's math research has intrigued people like Elon Musk, which helps with the "unfriendly humans" problem.)