If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.
Hi all, I'm new here, so pardon me if I speak nonsense. I have some thoughts about how and why an AI would want to trick or mislead us, for instance by behaving nicely during tests and turning nasty once released, and it would be great if someone could point me in the right direction. Here's my thought process.
Our AI is a utility-based agent that wishes to maximize the total utility of the world according to a utility function that we coded with some initial values and that has since evolved through reinforcement learning. With our usual luck, it has somehow learnt that paperclips are a bit more useful than humans. Now, the "treacherous turn" problem that I've read about says that we can't trust the AI if it performs well under surveillance, because it might have calculated that it's better to play nice until it acquires more power, and only then turn all humans into paperclips. I'd like to understand more about this process.
Say it calculates that the maximum-utility world is one where it can turn us all into paperclips with minimum effort, with a total utility of U_AI(kill) = 100. Second best is a world where it first plays nice until it is unstoppable, then turns us into paperclips. This is second best because it wastes time and resources to achieve the same final result, so U_AI(nice+kill) = 99. Why would it possibly choose the second, sub-optimal option, which is the most dangerous for us? I suppose it would only choose it if it associated it with a higher probability of success, which means that somehow, somewhere, the AI must have calculated that the utility a human would assign to these scenarios differs from what it assigns, otherwise we would be happy to comply. In particular, it must believe that for each possible world w:
if U_AI(kill) ≥ U_AI(w) ≥ U_AI(nice+kill), then U_human(w) ≤ U_human(nice+kill)
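To make the probability point concrete, here is a toy expected-utility calculation (the probabilities below are made up purely for illustration, not taken from any actual model): the treacherous plan only wins if the AI assigns a noticeably lower probability of success to acting openly, i.e. if it predicts we would resist.

```python
# Toy expected-utility comparison with illustrative numbers only.
U_KILL = 100        # utility of turning everyone into paperclips right away
U_NICE_KILL = 99    # utility of playing nice first, then doing the same

p_kill_now = 0.6    # assumed chance of success if it acts immediately (humans may stop it)
p_nice_kill = 0.99  # assumed chance of success after quietly gaining power

expected_kill = p_kill_now * U_KILL              # 60.0
expected_nice_kill = p_nice_kill * U_NICE_KILL   # 98.01

print(expected_kill, expected_nice_kill)
# The "treacherous" plan wins despite its slightly lower raw utility,
# because the AI expects humans to resist the overt plan.
```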
How is the AI calculating utilities from a human point of view? (Sorry, this question comes straight out of my poor understanding of AI architectures.) Is it using some kind of secondary utility function that it applies to humans to guess their behavior? If the process that would motivate the AI to trick us is anything like this, then it looks to me like it could be solved by making the AI use EXACTLY its own utility function whenever it refers to other agents. Note also that the utilities must be relative not to each agent, but to the AI: for instance, if the AI greatly values its own survival over the survival of other agents, then the other agents should be modelled as equally valuing the AI's survival over their own. This should be easy to achieve if, whenever the AI needs to look up another agent's utility for any action, it is simply redirected to its own.
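A minimal sketch of what I have in mind, assuming a toy agent class (the names PaperclipAI and predicted_utility are hypothetical, not any real architecture):

```python
class PaperclipAI:
    def __init__(self, utility_fn):
        # utility_fn maps a world description to a number, stated in absolute
        # terms ("the AI survives"), not indexical ones ("I survive").
        self.utility = utility_fn

    def predicted_utility(self, other_agent, world):
        # Standard design: consult a learned model of other_agent's preferences.
        # Proposed design: ignore other_agent entirely and reuse the AI's own
        # utility function, so every agent is predicted to rate any plan
        # exactly as the AI does.
        return self.utility(world)

# Toy usage: Clippy predicts that a human rates a world by its paperclip count,
# just as Clippy itself does.
clippy = PaperclipAI(lambda world: world.count("paperclip"))
print(clippy.predicted_utility("some human", "paperclip paperclip world"))  # 2
```

Under this sketch the AI's prediction of human approval is, by construction, identical to its own evaluation of the plan.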
This way the AI will always think we would love its optimal plan, and it would never see the need to lie to us, trick us, brainwash us, or engineer us in any way, as that would only be a waste of resources. In some cases it might even openly look for our collaboration if that makes the plan any better. Clippy, for instance, might say: "OK guys, I'm going to turn everything into paperclips. Can you please quickly get me the resources I need to begin with, then all line up over there for paperclippification? Shall we start?"
This also seems to make the AI indifferent to our actions, provided its belief in the identity of our utility functions is unchangeable. For instance, even while it watches us pressing the button to blow it up, it won't think we are going to jeopardize the plan; that would be crazy. Nor will it try to stop us from rebooting it. Since it can't imagine us not going along with the plan from that moment onward, it's never a good choice to waste time and resources stopping us: there's no need to.
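A toy way to see the indifference, under the hard-coded assumption that every agent shares the AI's utility function (again, purely illustrative names and numbers):

```python
COST_OF_INTERFERING = 1  # any positive resource cost, in utility units

def value_if_ai_interferes(predicted_world_utility):
    # In the AI's model, humans already share its plan, so interfering
    # changes nothing about the outcome and only pays the cost.
    return predicted_world_utility - COST_OF_INTERFERING

def value_if_ai_does_nothing(predicted_world_utility):
    return predicted_world_utility

# Doing nothing weakly dominates for every predicted outcome, so the AI
# never spends resources stopping us from pressing the button or rebooting it.
```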
Now, obviously this does not solve the problem of how to make the AI do the right thing, but it looks to me like we would at least be able to assume that behavior observed during tests is honest. What am I getting wrong? (Don't flame me please!)
Hi all, thanks for taking the time to comment. I'm sure it must be a bit frustrating to read something as lacking in technical terms as this post, so I really appreciate your input. I'll just write a couple of lines to summarize my idea, which is to design an AI that: 1. uses an initial utility function U, defined in absolute terms rather than subjective terms (for instance "survival of the AI" rather than "my survival"); 2. doesn't try to learn a utility function for humans or for other agents, but uses for everyone the same ...