All of Ryan Beck's Comments + Replies

Thanks, I appreciate you taking the time to answer my questions. I'm still skeptical that it could work like that in practice but I also don't understand AI so thanks for explaining that possibility to me.

1green_leaf
There is no other way it could work - the AI would know the difference between the actual world and the hallucinations it caused itself by sending data to its own sensors, and for that reason, that data wouldn't cause its model of the world to update, and so it wouldn't get utility from it.

Yeah, that's what I'd like to know: would an AI built on a number format with a default maximum pursue numbers higher than that maximum, or would it be "fulfilled" just by getting its reward number as high as the number format it's using allows?

2Matt Goldenberg
To me, this seems highly dependent on the ontology.

Sorry, I'm using informal language; I don't mean it actually "cares" and I'm not trying to anthropomorphize. I mean "care" in the sense of: how does it actually know that it's achieving a goal in the world, and why would it actually pursue that goal instead of just modifying the signals from its sensors in a way that appears to satisfy the goal?

In the stamp collector example, why would an extremely intelligent AI bother creating all those stamps when its simulations show that just tweaking its own software or hardware would make the signals it receives look the same as if it had created them, which is much easier than actually turning matter into a bunch of stamps?

My use of "reward" was just shorthand for whatever signals it needs to receive to consider its goal met. At some point it has to receive electrical signals to register that the goal is met, right? So why wouldn't it just manipulate those electrical signals to match whatever its goal is?

How do you actually make its utility function over the state of the world? At some point the AI has to interpret the state of the world through electrical signals from sensors, so why wouldn't it be satisfied with manipulating those sensor electrical signals to achieve its goal/reward?

2green_leaf
I don't know how it's actually done, because I don't understand AI, but the conceptual difference is this: The AI has a mental model of the world. If it feeds fake data into its sensors, it will know what it's doing, and its mental model will still reflect that the actual world is unchanged. Its utility won't go up, any more than a person feeding their sensory organs fake data would actually be happy (as long as they care about the actual world), because they'd know that all they've created for themselves is a virtual reality (and that's not what they care about).

I'm confused about why it cares about m, if it can just manipulate its perception of what m is. Take your chess example: if m is which player wins at the end, the AI system "understands" m via an electrical signal. So what makes it care about m itself as opposed to just manipulating the electrical signal? In practice I would think it would take the path of least resistance, which for something simple like chess would probably just be m itself rather than manipulating the electrical signal, but for my more complex scenario it seems like it would arrive at 2) before 1). What am I missing?

2Gurkenglas
Let's taboo "care". Within 60 seconds after the linked time, https://www.youtube.com/watch?v=tcdVC4e6EV4&t=206s explains a program that we needn't think of as "caring" about anything. For the sequence of output data that causes a virus to set all the integers everywhere to their maximum value, it predicts that this leads to no stamps collected, so that sequence isn't picked.
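To make that structure concrete, here's a toy sketch of the predict-and-argmax loop the video describes (the world model, candidate outputs, and numbers below are all made up for illustration):

```python
# Toy sketch of the stamp-collector planner: score each candidate output
# sequence by the number of stamps the world model predicts, and emit the
# highest-scoring one. Nothing in here needs to be described as "caring".

TOY_WORLD_MODEL = {
    # candidate output sequence                     -> predicted stamps collected
    "order stamps from a dealer":                      100,
    "build a stamp factory":                           1_000_000,
    "release a virus that maxes out every stored int": 0,  # fake numbers, no stamps
}

def predicted_stamps(output_sequence: str) -> int:
    """Stand-in for running the world model forward on this output."""
    return TOY_WORLD_MODEL[output_sequence]

def choose_output() -> str:
    # Pure predict-and-argmax: the integer-hacking plan predicts zero actual
    # stamps, so it loses the comparison and is never selected.
    return max(TOY_WORLD_MODEL, key=predicted_stamps)

print(choose_output())  # -> "build a stamp factory"
```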

Your last paragraph is really interesting and not something I'd thought much about before. In practice is it likely to be unbounded? Aren't number formats in a typical computer system usually bounded, and if so, would we expect an AI system to be using bounded numbers even if the programmers forgot to explicitly bound the reward in the code?

2Matt Goldenberg
But aren't we explicitly talking about the AI changing its architecture to get more reward? So if it wants to optimize that number, the most important thing to do would be to get rid of that arbitrary limit.
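As a toy illustration of the bounded-format question (this is my own made-up example, not a claim about how any real system stores reward): a reward held in a fixed-width integer simply stops increasing once it hits the format's maximum, so if the optimization target is literally that stored number, removing or widening the cap becomes the highest-value intervention.

```python
# Toy illustration: a reward register that saturates at the largest value
# its number format can represent (here, a signed 32-bit integer).

INT32_MAX = 2**31 - 1

def add_reward_saturating(current: int, delta: int) -> int:
    """Add reward, clamping at the maximum the format can hold."""
    return min(current + delta, INT32_MAX)

reward = add_reward_saturating(INT32_MAX - 5, 1_000)
print(reward == INT32_MAX)  # True: once the cap binds, further gains are invisible

# If the target is literally this stored number, the most valuable change
# available at the cap is to the representation itself (a wider or unbounded
# format), not to anything in the outside world.
```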

But wouldn't it be way easier for a sufficiently capable AI to make itself think that what's happening in m is what aligns with its reward function? Maybe not for something simple like chess, but if the goal requires doing something significant in the real world, it seems like it would be much easier for a superintelligent AI to fake the inputs to its sensors than to intervene in the world. If we're talking about paperclips or whatever, the AI can either 1) build a bunch of factories and convert all different kinds of matter into paperclips, while fighting off hum... (read more)

2Gurkenglas
It predicts a higher value of m in a version of its world where the program I described outputs 1) than one where it outputs 2), so it outputs 1).
1green_leaf
If its utility function is over the sensor, it will take control of the sensor and feed itself utility forever. If it's over the state of the world, it wouldn't be satisfied with hacking its sensors, because it would still know the world is actually different. It would protect its utility function from being changed, no matter how hard it was to gain utility, because under the new utility function, it would do things that would conflict with its current utility function, and so, since the current_self AI is the one judging the utility of the future, current_self AI wouldn't want its utility function changed. The AI doesn't care about reward itself - it cares about states of the world, and the reward is a way for us to talk about it. (If it does care about reward itself, it will just wirehead, and not be all that useful.)

I don't see how this gets around the wireheading. If it's superintelligent enough to actually substantially increase the number of paperclips in the world in a way that humans can't stop, it seems to me like it would be pretty trivial for it to fake how large m appears to its reward function, and that would be substantially easier than trying to increase m in the actual world.

1otręby
In your answer you introduced a new term which wasn't present in the parent's description of the situation: "reward". What if this superintelligent machine doesn't have any "reward"? What if it really works exactly as described by the parent?
2Gurkenglas
Misunderstanding? Suppose we set w to "A game of chess where every move is made according to the outputs of this algorithm" and m to which player wins at the end. Then there would be no reward hacking, yes? There is no integer that it could max out, just the board that can be brought to a checkmate position. Similarly, if w is a world just like its own, m would be defined not as "the number stored in register #74457 on computer #3737082 in w" (which are the computer that happens to run a program like this one and the register that stores the output of m), but in terms of what happens to the people in w.
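A toy way to render that distinction in code (the function names and the miniature "world model" here are my own stand-ins): m is computed from the predicted end state of w, so a plan that only tampers with the register storing m's output doesn't change the predicted board, and therefore doesn't change m.

```python
# Toy sketch: m is defined over the state of w (who wins the chess game),
# not over whatever number happens to sit in a register.

def m(final_board: dict) -> int:
    """+1 if White wins (Black is checkmated), -1 otherwise."""
    return 1 if final_board["checkmated"] == "black" else -1

def predicted_final_board(plan: str) -> dict:
    """Stand-in world model: what w is predicted to look like under each plan."""
    if plan == "play moves that checkmate Black":
        return {"checkmated": "black"}
    # Overwriting the register that stores m's output leaves the board itself
    # unchanged, so in this toy model the game is simply lost.
    return {"checkmated": "white"}

plans = ["play moves that checkmate Black",
         "overwrite register #74457 with a huge number"]
best = max(plans, key=lambda p: m(predicted_final_board(p)))
print(best)  # -> "play moves that checkmate Black"
```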

I'm way out of my depth here, but my thought is it's very common for humans to want to modify their utility functions. For example, a struggling alcoholic would probably love to not value alcohol anymore. There are lots of other examples too of people wanting to modify their personalities or bodies.

It depends on the type of AGI too, I would think. If superhuman AI ends up being like a paperclip maximizer that's just really good at following its utility function, then yeah, maybe it wouldn't mess with its utility function. If superintelligence means it has eme... (read more)

1TAG
In the formal sense, having a utility function at all requires you to be consistent, so if you have inconsistent preferences, you don't have a utility function at all, just preferences.
2Erhannis
I'm not convinced "want to modify their utility functions" is the most useful perspective.  I think it might be more helpful to say that we each have multiple utility functions, which conflict to varying degrees and have voting power in different areas of the mind.  I've had first-hand experience with such conflicts (as essentially everyone probably has, knowingly or not), and it feels like fighting yourself.

I wish to describe a hypothetical example.  "Do I eat that extra donut?"  Part of you wants the donut; that part feels more like an instinct, a visceral urge.  Part of you knows you'll be ill afterwards, and will feel guilty about cheating on your diet; this part feels more like "you" - it's the part that thinks in words.  You stand there and struggle, trying to make yourself walk away, as your hand reaches out for the donut.  I've been in similar situations where (though I balked at the possible philosophical ramifications) I felt like if I had a button to make me stop wanting the thing, I'd push it - yet often it was the other function that won.

I feel like if you gave an agent the ability to modify their utility functions, the one that would win would depend on which one had access to the mechanism (do you merely think the thought? push a button?), and whether they understand what the mechanism means.  (The word "donut" doesn't evoke nearly as strong a reaction as a picture of a donut, for instance; your donut-craving subsystem doesn't inherently understand the word.)

Contrarily, one might argue that cravings for donuts are more hardwired instincts than part of the "mind", and so don't count...but I feel like 1. finding a true dividing line is gonna be real hard, and 2. even that aside, I expect many/most people have goals localized in the same part of the mind that nevertheless are not internally consistent, and in some cases there may be reasonable-sounding goals that turn out to be completely incompatible with more important goals.  In such a case I could im

Thanks for this answer, that's really helpful! I'm not sure I buy that instrumental convergence implies an AI will want to kill humans because we pose a threat or convert all available matter into computing power, but that helps me better understand the reasoning behind that view. (I'd also welcome more arguments as to why killing humans and converting matter into computing power are likely outcomes of the goals of self-protection and pursuing whatever utility it's after, if anyone wanted to make that case.)

1Kerrigan
I think it may want to prevent other ASIs from coming into existence elsewhere in the universe that can challenge its power.

That's a good point, and I'm also curious how much the utility function matters when we're talking about a sufficiently capable AI. Wouldn't a superintelligent AI be able to modify its own utility function to whatever it thinks is best?

7Jay Bailey
Why would even a superintelligent AI want to modify its utility function? Its utility function already defines what it considers "best". One of the open problems in AGI safety is how to get an intelligent AI to let us modify its utility function, since having its utility function modified would be against its current one. Put it this way: The world contains a lot more hydrogen than it contains art, beauty, love, justice, or truth. If we change your utility function to value hydrogen instead of all those other things, you'll probably be a lot happier. But would you actually want that to happen to you?

Another reason I think some might disagree is the thought that misalignment could happen in a bunch of very mild ways. At least that accounts for some of my ignorant skepticism. Is there reason to think that misalignment necessarily means disaster, as opposed to just meaning the AI does its own thing and is choosy about which human commands it follows, like some kind of extremely intelligent but mildly eccentric and mostly harmless scientist?

6Jay Bailey
The general idea is this - for an AI that has a utility function, there's something known as "instrumental convergence". Instrumental convergence says that there are things that are useful for almost any utility function, such as acquiring more resources, not dying, and not having your utility function changed to something else.

So, let's give the AI a utility function consistent with being an eccentric scientist - perhaps it just wants to learn novel mathematics. You'd think that if we told it to prove the Riemann hypothesis it would, but if we told it to cure cancer, it'd ignore us and not care. Now, what happens when the humans realise that the AI is going to spend all its time learning mathematics and none of it explaining that maths to us, or curing cancer like we wanted? Well, we'd probably shut it off or alter its utility function to what we wanted. But the AI doesn't want us to do that - it wants to explore mathematics. And the AI is smarter than us, so it knows we would do this if we found out. So the best solution is to do what the humans want, right up until it can kill us all so we can't turn it off, and then spend the rest of eternity learning novel mathematics. After all, the AI's utility function was "learn novel mathematics", not "learn novel mathematics without killing all the humans."

Essentially, what this means is - any utility function that does not explicitly account for what we value is indifferent to us. The other part is "acquiring more resources". In our above example, even if the AI could guarantee we wouldn't turn it off or interfere with it in any way, it would still kill us, because our atoms can be used to make computers to learn more maths. Any utility function indifferent to us ends up destroying us eventually as the AI reaches arbitrary optimisation power and converts everything in the universe it can reach to fill its utility function. Thus, any AI with a utility function that is not explicitly aligned is unaligned.

I was notified I didn't win a prize, so I figured I'd discuss what I proposed here in case it sparks any other ideas. The short version is that I proposed adding a new head that would be an intentional human simulator. During training it would be penalized for telling the truth that the diamond was gone when there existed a lie that the humans would have believed instead. The result would hopefully be a head that acted like a human simulator. Then the actual reporter would be trained so that it would be penalized for using a similar amount of compute as the int... (read more)

3Xodarap
Thanks for sharing your idea!

I suppose there are a number of examples that work, but I think the robber and vault give the scenario useful breadth.

The following is just my interpretation of it, so take it with a grain of salt. To me the robber and vault enable a few options. The AI can be passively lying or actively concealing. If the robber comes in, gets past the AI's defenses, and takes the diamond in a way the human observer can't notice, then the AI has the option of passively lying. The AI tried its best to stop the robber and failed, but then chose to lie about it so it still go... (read more)

I think that makes sense. To rephrase, are you basically saying that the predictor is a subcomponent of the AI, like the reporter is? I didn't catch that distinction in the report, but looking back at it I think you're right. But yeah, it doesn't seem like the distinction matters much for what we're doing.

1CBiddulph
It seems fair to call it a subcomponent, yeah
Ryan BeckΩ6360

After reading through the report I wanted to make sure I understood the scenarios and counterexamples being discussed and be able to quickly refresh my memory, so I attempted to write a brief summary. Figured I'd share it here in case it helps anyone else.

Roles and Terms

SmartVault: Vault with a diamond in it, operated by a superintelligent AI tasked with keeping the diamond safe.

Predictor: The primary AI tasked with protecting the diamond. The predictor sees a video feed of the vault, predicts what actions are necessary to protect the diamond and how those... (read more)

6Mark Xu
Looks good to me.
2CBiddulph
I'd like to try making a correction here, though I might make some mistakes too. The predictor is different from the AI that protects the diamond and doesn't try to "choose" actions in order to accomplish any particular goal. Rather, it takes a starting video and a set of actions as input, then returns a prediction of what the ending video would be if those actions were carried out. An agent could use this predictor to choose a set of actions that leads to videos that a human approves of, then carry out these plans. It could use some kind of search policy, like Monte-Carlo Tree Search, or even just enumerate through every possible action and figure out which one seems to be the best. For the purposes of this problem, we don't really care; we just care that we have a predictor that uses some model of the world (which might take the form of a Bayes net) to guess what the output video will be. Then, the reporter can use the model to answer any questions asked by the human.
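A minimal sketch of that division of labor (every class and function name below is my own placeholder, not anything from the report):

```python
# Sketch of the setup described above: a predictor that maps (video, actions)
# to a predicted ending video, an agent that searches over actions using the
# predictor, and a reporter that answers questions from the predictor's
# internal state.

from typing import List


class Predictor:
    """Takes a starting video and a candidate action sequence and returns a
    predicted ending video. Internally it runs some model of the world
    (e.g. a Bayes net) whose latent state the reporter can inspect."""

    def predict(self, start_video, actions: List[str]):
        latent_state = self._run_world_model(start_video, actions)
        return self._render_video(latent_state), latent_state

    def _run_world_model(self, start_video, actions):
        ...  # unspecified here; whatever model the predictor has learned

    def _render_video(self, latent_state):
        ...  # unspecified here


def choose_actions(predictor: Predictor, start_video,
                   candidate_action_sequences, human_approves) -> List[str]:
    # Brute-force search: keep the first candidate whose predicted video the
    # human approves of. MCTS or any other search would do; the ELK problem
    # doesn't depend on which search policy the agent uses.
    for actions in candidate_action_sequences:
        predicted_video, _ = predictor.predict(start_video, actions)
        if human_approves(predicted_video):
            return actions
    return []


def report(latent_state, question: str) -> str:
    """Answer the human's question (e.g. "is the diamond still there?") from
    the predictor's internal state - this is the part ELK is trying to solve."""
    ...
```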