Comment author: rikisola 17 July 2015 03:26:35PM *  0 points [-]

Hi all, I'm new here so pardon me if I speak nonsense. I have some thoughts regarding how and why an AI would want to trick us or mislead us, for instance by behaving nicely during tests and turning nasty when released, and it would be great if I could be pointed in the right direction. So here's my thought process.

Our AI is a utility-based agent that wishes to maximize the total utility of the world based on a utility function that has been coded by us with some initial values and has then evolved through reinforcement learning. With our usual luck, it has somehow learnt that paperclips are a bit more useful than humans. Now the "treacherous turn" problem that I've read about says that we can't trust the AI even if it performs well under surveillance, because it might have calculated that it's better to play nice until it acquires more power before turning all humans into paperclips. I'd like to understand more about this process. Say it calculates that the world with maximum utility is one where it can turn us all into paperclips with minimum effort, with the total utility of this world being U_AI(kill) = 100. Second best is a world where it first plays nice until it is unstoppable, then turns us into paperclips. This is second best because it wastes time and resources to achieve the same final result: U_AI(nice+kill) = 99. Why would it possibly choose the second, sub-optimal option, which is the most dangerous for us? I suppose it would only choose it if it associated it with a higher probability of success, which means that somehow, somewhere, the AI must have calculated that the utility a human would give to these scenarios is different from the utility it gives them itself, otherwise we would be happy to comply. In particular, it must believe that for each possible world w:

if U_AI(kill) ≥ U_AI(w) ≥ U_AI(nice+kill) then U_human(w) ≤ U_human(nice+kill)
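
To make the "higher probability of success" step concrete, here is a minimal sketch of the comparison. The two utilities are the ones from the paragraph above; the success probabilities and the assumption that a failed plan is worth 0 are purely illustrative and not part of the original argument.

```python
# Utilities are taken from the comment above. The success probabilities are
# hypothetical, and a failed plan is assumed to be worth 0 utility.
plans = {
    "kill": {"utility": 100, "p_success": 0.10},
    "nice+kill": {"utility": 99, "p_success": 0.90},
}


def expected_utility(plan):
    """Utility of the plan if it succeeds, weighted by the chance that it does."""
    return plan["utility"] * plan["p_success"]


def best_plan(candidates):
    """Return the name of the plan with the highest expected utility."""
    return max(candidates, key=lambda name: expected_utility(candidates[name]))


print(best_plan(plans))
```

With these made-up numbers the agent picks "nice+kill" even though its raw utility is lower, which is the only situation in which the second option would be chosen.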

How is the AI calculating utilities from a human point of view? (Sorry, but this question comes straight out of my poor understanding of AI architectures.) Is it using some kind of secondary utility function that it applies to humans to guess their behavior? If the process that would motivate the AI to trick us is anything similar to this, then it looks to me like it could be solved by making the AI use EXACTLY its own utility function when it refers to other agents. Also note that the utilities must not be relative to the agent, but to the AI. For instance, if the AI greatly values its own survival over the survival of other agents, then the other agents should equally greatly value the AI's survival over their own. This should be easily achieved if, whenever the AI needs to look up another agent's utility for any action, it is simply redirected to its own.

This way the AI will always think we would love its optimum plan and would never see the need to lie to us, trick us, brainwash us or engineer us in any way, as it would only be a waste of resources. In some cases it might even openly look for our collaboration if that makes the plan any better. Clippy, for instance, might say "OK guys, I'm going to turn everything into paperclips. Can you please quickly get me the resources I need to begin with, then you can all line up over there for paperclippification. Shall we start?".

This also seems to make the AI indifferent to our actions, provided its belief regarding the identity of our utility functions is unchangeable. For instance, even while it sees us pressing the button to blow it up, it won't think we are going to jeopardize the plan. That would be crazy. Or it won't try to stop us from re-booting it. Considering that it can't imagine you not going along with the plan from that moment onward, it's never a good choice to waste time and resources to stop you. There's no need to stop you.
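
A rough sketch of the redirect idea, just to pin down what I mean. Every name and number below is a hypothetical illustration, not a real architecture: the agent decides whether concealment is worth it by consulting a model of human utility, and the proposal is to make that model the agent's own utility table.

```python
# Hypothetical utility tables; all names and values are illustrative assumptions.
AI_UTILITY = {"paperclip_everything": 100, "status_quo": 10}
ACCURATE_HUMAN_MODEL = {"paperclip_everything": -1000, "status_quo": 50}


def predicts_resistance(plan, human_model):
    """The agent expects resistance only if its model of humans ranks the plan below the status quo."""
    return human_model[plan] < human_model["status_quo"]


def should_conceal(plan, human_model):
    """Concealment costs resources, so it only pays when resistance is expected."""
    return predicts_resistance(plan, human_model)


# With an accurate model of humans the agent sees a reason to hide its plan.
print(should_conceal("paperclip_everything", ACCURATE_HUMAN_MODEL))
# With the lookup redirected to its own utilities it sees none.
print(should_conceal("paperclip_everything", AI_UTILITY))
```

This only captures what the agent believes about us. It says nothing about whether that belief can really be kept fixed once the agent watches us act against the plan.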

Now obviously this does not solve the problem of how to make it do the right thing, but it looks to me that at least we would be able to assume that a behavior observed during tests should be honest. What am I getting wrong? (don't flame me please!!!)

Comment author: Lumifer 17 July 2015 02:43:52PM 0 points [-]

Do your algorithms require co-location and are sensitive to latency?

Comment author: rikisola 17 July 2015 02:58:01PM 1 point [-]

Hi Lumifer. Yes, to some extent. At the moment I don't have co-location so I minimized latency as much as possible in other ways and have to stick to the slower, less efficient markets. I'd like to eventually test them on larger markets but I know that without co-location (and maybe a good deal of extra smarts) I stand no chance.

Comment author: John_Maxwell_IV 17 July 2015 12:33:30PM 6 points [-]

Hello and welcome! Don't be shy about posting; if you're a PhD making money with HFT, I think you are plenty qualified, and external perspectives can be very valuable. Posting in an open thread doesn't require any karma and will get you a much bigger audience than this welcome thread. (For maximum visibility you can post right after a thread's creation.)

Comment author: rikisola 17 July 2015 01:14:35PM 3 points [-]

Hi John, thanks for the encouragement. One thing that strikes me about this community is how most people make an effort to consider each other's point of view. It's a real indicator of a high level of reasonableness and intellectual honesty. I hope I can practice this too. Thanks for pointing me to the open threads, they are perfect for what I had in mind.

Comment author: rikisola 17 July 2015 12:00:33PM 11 points [-]

Hi all, I'm new. I've been browsing the forum for two weeks and have only now come across this welcome thread, so nice to meet you! I'm quite interested in the control problem, mainly because it seems like a very critical thing to get right. My background is a PhD in structural engineering and developing my own HFT algorithms (which for the past few years has been my source of both revenue and spare time). So I'm completely new to all of the topics on the forum, but I'm loving the challenge. At the moment I don't have any karma points so I can't publish, which is probably a good thing given my ignorance, so may I post some doubts and questions here in the hope of being pointed in the right direction? Thanks in advance!

Comment author: Allan_Crossman 04 September 2008 12:29:50AM 12 points [-]

Prase, Chris, I don't understand. Eliezer's example is set up in such a way that, regardless of what the paperclip maximizer does, defecting gains one billion lives and loses two paperclips.

Basically, we're being asked to choose between a billion lives and two paperclips (paperclips in another universe, no less, so we can't even put them to good use).

The only argument for cooperating would be if we had reason to believe that the paperclip maximizer will somehow do whatever we do. But I can't imagine how that could be true. Being a paperclip maximizer, it's bound to defect, unless it had reason to believe that we would somehow do whatever it does. I can't imagine how that could be true either.
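
The dominance claim can be checked mechanically. The payoff table below is an assumption: it is not quoted in this comment, and the figures are chosen only to be consistent with "defecting gains one billion lives and loses two paperclips" whatever the other player does.

```python
# Payoffs as (billions of lives saved, paperclips made), indexed by
# (human_move, clippy_move), where "C" = cooperate and "D" = defect.
# The table is an assumed reconstruction, not taken from this comment.
PAYOFFS = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def effect_of_human_defecting(clippy_move):
    """Change in (lives, paperclips) when the human switches from C to D."""
    lives_c, clips_c = PAYOFFS[("C", clippy_move)]
    lives_d, clips_d = PAYOFFS[("D", clippy_move)]
    return lives_d - lives_c, clips_d - clips_c


for clippy_move in ("C", "D"):
    print(clippy_move, effect_of_human_defecting(clippy_move))
```

Under this table the change is (+1 billion lives, -2 paperclips) for either move by the paperclip maximizer, which is what makes defection dominant for someone who only counts lives.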

Or am I missing something?

Comment author: rikisola 17 July 2015 08:30:28AM 1 point [-]

One thing I can't understand. Considering we've built Clippy, we gave it a set of values and we've asked it to maximise paperclips, how can it possibly imagine we would be unhappy about its actions? I can't help but think that from Clippy's point of view there's no dilemma: we should always agree with its plan and therefore give it carte blanche. What am I getting wrong?

Comment author: rikisola 17 July 2015 08:15:31AM 0 points [-]

Hi there, I'm new here and this is an old post, but I have a question regarding the AI playing a prisoner's dilemma against us, which is: how would this situation be possible? I'm trying to get my head around why the AI would think that our payouts are any different from its payouts, given that we built it, we taught it (some of) our values in a rough way and we asked it to maximize paperclips, which means we like paperclips. Shouldn't the AI think we are on the same team? I mean, we coded it that way and we gave it a task, so what process exactly would make the AI ever think we would disagree with its choice? So for instance, if we coded it in such a way that it values a human life at 0, then it would only see one choice: make 3 paperclips. And it shouldn't have any reason to believe that's not the best outcome for us too, so the only possible outcome from its point of view in this case should be (+0 lives, +3 paperclips). Basically the main question is: how can the AI ever imagine that we would disagree with it? (I'm honestly just asking as I'm struggling with this idea and am interested in this process.) Thanks!

In response to Moral AI: Options
Comment author: rikisola 13 July 2015 11:43:12AM 1 point [-]

I feel like a mixed approach is the most desirable. There is a risk that if the AI is allowed to simply learn from humans, we might get a greedy AI that maximizes its Facebook experience while the rest of the world keeps dying of starvation and wars. Also, our values probably evolve with time (slavery, death penalty, freedom of speech...) so we might as well try and teach the AI what our values should be rather than what they are right now. Maybe then it's a case of developing a top-down, high-level ethical system and using it to seed a neural network that then picks up patterns in more detailed scenarios?

Comment author: Stuart_Armstrong 09 July 2015 09:40:40AM 1 point [-]

I forgot an important part of the setup, which was that u is bounded, not too far away from the present value, which means εΔu > -Δv is unlikely for general v.

Comment author: rikisola 09 July 2015 11:44:10AM 0 points [-]

Ah yep that'll do.

Comment author: Stuart_Armstrong 08 July 2015 12:51:57PM 0 points [-]

That's one of the reasons the agents don't know u and v at this point.

Comment author: rikisola 08 July 2015 01:18:25PM 0 points [-]

Thanks for your reply, I had missed the fact that M(εu+v) is also ignorant of what u and v are. In that case, is this a general structure for how a satisficer should work, where in practice we would still need to assign some values to u and v on a case-by-case basis, or at least to ε, so that M(εu+v) could veto? Or is it the case that M(εu+v) uses an arbitrarily small ε, in which case it is the same as imposing Δv>0?
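
To spell out the second reading, here is a small sketch. It assumes that "M(εu+v) approves a change" means ε·Δu + Δv > 0, and that |Δu| is bounded by some constant; both are my interpretation of the notation in this exchange, and the function and constant names are hypothetical.

```python
def maximiser_prefers(delta_u, delta_v, eps):
    """True if a maximiser of eps*u + v prefers the change, i.e. eps*delta_u + delta_v > 0."""
    return eps * delta_u + delta_v > 0


U_BOUND = 1.0  # assumed bound on |delta_u|

# With a tiny eps, the best possible gain in u cannot offset a real loss in v,
# and the worst possible loss in u cannot block a real gain in v.
print(maximiser_prefers(U_BOUND, -0.01, eps=1e-6))
print(maximiser_prefers(-U_BOUND, 0.01, eps=1e-6))

# With a larger eps the same loss in v can be outweighed by the gain in u.
print(maximiser_prefers(U_BOUND, -0.01, eps=0.1))
```

So for small ε the test differs from Δv>0 only when |Δv| is below ε times the bound on |Δu|.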

Comment author: rikisola 08 July 2015 10:50:46AM 0 points [-]

Every idea that comes to my mind runs into the big question: "if we were able to program a nice AI for that situation, why would we not program it to be nice in every situation?" I mean, it seems to me that in that scenario we would have both a solid definition of niceness and the ability to make the AI stick to it. Could you elaborate a little on that? Maybe an example?

Comment author: rikisola 08 July 2015 12:48:18PM 0 points [-]

Never mind this comment, I read some more of your posts on the subject and I think I got the point now ;)
