I would like to propose an idea for aligning AI.
First, I will provide some motivation for it. Suppose you are a programmer who's having a really hard time implementing a function in a program you're developing. Most of the code is fine, but there's this one function that you can't figure out how to implement correctly. Still, you need to run the program. So you do the following: first, you add a breakpoint inside the function you're having trouble implementing, so that whenever execution reaches that function, the program halts. Once this happens, you come up on your own with a reasonable value v for the function to return. Finally, in your debugger you type "return v", making the function return v, and then you resume execution.
As long as you can come up with reasonable return values for the function on your own, I bet the above would make the program work pretty well. And why not? Everything outside that function is implemented well, and you are manually making sure the hard-to-implement function also returns reasonable values. So there's no part of the program that isn't doing what it's supposed to do.
My basic idea is to do this, but with the AI's utility function.
Now, you don't need to literally put a breakpoint in the AI's utility function and then have the developers type into a debugger. Instead, inside the AI's utility function, you can just have the AI pause execution, send a message to a developer or other individual containing a description of a possible world, and then wait for a response. Once someone sends a message in response, the AI will use the returned value as the value of its utility function. That is, you could do something like:
def utility(outcome):
    # Describe the outcome to the human controllers and wait for their reply.
    message_ai_controllers(make_readable(outcome))
    response = wait_for_controller_response()
    # Use whatever utility the controllers report as the value of this outcome.
    return parse_utility(response)
(Error-handling code could be added if the returned utility is invalid.)
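For concreteness, here is a minimal sketch of what that error handling might look like, reusing the hypothetical helpers from the snippet above and assuming parse_utility raises a ValueError on a malformed response (this is just one possible way to do it):

def utility(outcome):
    # Keep asking until the controllers send back a utility we can actually parse.
    message_ai_controllers(make_readable(outcome))
    while True:
        response = wait_for_controller_response()
        try:
            return parse_utility(response)
        except ValueError:
            # The response wasn't a valid utility; ask the controllers to resend it.
            message_ai_controllers("Couldn't parse that utility value; please resend.")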
Using the above utility function would, in theory at least, be equivalent to actually having a breakpoint in the code, then manually returning the right value with a debugger.
You might imagine this AI would be incredibly inefficient due to how slow people would be in answering the AI's queries. However, with the right optimization algorithm, I'm not sure this would be much of a problem. The AI would have an extremely slow utility function, but I don't see a reason to think it's impossible to make an optimization algorithm that performs well even on extremely slow objective functions.
I'll provide one potential approach to making such an algorithm. The optimization algorithm would, based on the known values of its objective function, learn fast approximations to it. Then, the AI could use these fast approximations to come up with a plan that scores well on them. Finally, if necessary, the AI can query its (slow) objective function for the value of the plan's results. After doing so, it would also update its fast approximations with what it's learned. The optimization algorithm could be designed so that if the AI is particularly unsure whether something would be desirable according to the objective function, it consults the actual (slow) objective function. The algorithm could also potentially be programmed to do the same for any outcomes with high impact or strategic significance.
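To make this concrete, here is a rough sketch of such a loop, under some assumptions of my own: the slow objective is the human-query utility function above, candidate plans are encoded as rows of a feature matrix, and a Gaussian process stands in for the "fast approximation". This isn't meant as the implementation, just one way the query-when-uncertain idea could look in code:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

UNCERTAINTY_THRESHOLD = 0.1  # query the slow objective when the model is this unsure

def optimize(slow_utility, propose_candidates, n_rounds=100):
    """Optimize a very slow objective by learning a fast approximation of it.

    slow_utility: the expensive oracle (e.g. asking the human controllers).
    propose_candidates: returns a 2D array of candidate plans (rows = plans).
    """
    X, y = [], []  # plans whose true utility we've already asked about
    model = GaussianProcessRegressor()

    best_plan, best_value = None, -np.inf
    for _ in range(n_rounds):
        candidates = propose_candidates()
        if X:
            # Fast approximation: predicted utility plus an uncertainty estimate.
            mean, std = model.predict(candidates, return_std=True)
        else:
            mean = np.zeros(len(candidates))
            std = np.full(len(candidates), np.inf)  # we know nothing yet

        i = int(np.argmax(mean))
        if std[i] > UNCERTAINTY_THRESHOLD:
            # Too unsure: fall back to the slow objective and record the answer.
            value = slow_utility(candidates[i])
            X.append(candidates[i])
            y.append(value)
            model.fit(np.array(X), np.array(y))
        else:
            value = mean[i]

        if value > best_value:
            best_plan, best_value = candidates[i], value

    return best_plan

The design choice that matters here is the fallback: whenever the learned approximation is uncertain (or, in a fuller version, whenever a plan looks high-impact), the loop pays the cost of the slow query and folds the answer back into the fast model.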
My technique is intended to provide both outer alignment and corrigibility. By directly asking people about the desirability of outcomes, the AI would, if I'm reasoning correctly, be outer-aligned. If the AI uses learned fast approximations to its utility function, then the system also provides a degree of hard-coded corrigibility. The AI's optimization algorithm is hard-coded to query its slow utility function at some points and to update its fast models appropriately, which allows errors in the fast approximations to be corrected.
This works great when you can recognize good things within the representation the AI uses to think about the world. But what if that's not true?
Here's the optimistic case:
Suppose you build a Go-playing AI that defers to you for its values, but the only things it represents are states of the Go board, and functions over states of the Go board. You want to tell it to win at Go, but it doesn't represent that concept; you have to tell it what "win at Go" means in terms of a value function from states of the Go board to real numbers. If (like me) you have a hard time telling when you're winning at Go, maybe you just generate as many obviously-winning positions as you can and label them all as high-value, everything else low-value. And this sort of works! The Go-playing AI tries to steer the gameboard into one of these obviously-winning states, and then it stops, and maybe it could win more games of Go if it also valued the less-obviously-winning positions, but that's alright.
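As an illustration of that labeling scheme (my own sketch, not anything from the comment itself), the resulting value function just checks membership in a hand-built set of obviously-won positions:

# Hypothetical sketch: board states are represented as hashable encodings.
# The programmer fills this set with whatever positions they can recognize
# as obviously won (generated by hand or by simple heuristics).
OBVIOUSLY_WINNING_POSITIONS = set()

def go_value(board_state):
    # High value only for the positions we managed to label; everything else low.
    return 1.0 if board_state in OBVIOUSLY_WINNING_POSITIONS else 0.0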
Why is that optimistic?
Because it doesn't scale to the real world. An AI that learns about and acts in the real world doesn't have a simple gameboard that we just need to find some obviously-good arrangements of. At the base level it has raw sensor feeds and motor outputs, which we are not smart enough to define success in terms of directly. And as it processes its sensory data it (by default) generates representations and internal states that are useful for it, but not simple for humans to understand, or good things to try to put value functions over. In fact, an entire intelligent system can operate without ever internally representing the things we want to put value functions over.
Here's a nice post from the past: https://www.lesswrong.com/posts/Mizt7thg22iFiKERM/concept-safety-the-problem-of-alien-concepts
I hadn't fully appreciated the difficulty that could result from AIs having alien concepts, so thanks for bringing it up.
However, it seems to me that this would not be a big problem, provided the AI is still interpretable. I'll provide two ways to handle this.
For one, you could potentially translate the human concepts you care about into statements using the AI's concepts. Even if the AI doesn't use the same concepts people do, AIs are still incentivized to form a detailed model of the world. If you can have access to all the AI's world model, but still ca...