Yeah, that's what I'd like to know: would an AI built on a number format that has a default maximum pursue numbers higher than that maximum, or would it be "fulfilled" just by getting its reward number as high as the number format it's using allows?
Sorry, I'm using informal language. I don't mean it actually "cares" and I'm not trying to anthropomorphize. I mean "care" in this sense: how does it actually know that it's achieving a goal in the world, and why would it actually pursue that goal instead of just modifying the signals from its sensors in a way that appears to satisfy its goal?
In the stamp collector example, why would an extremely intelligent AI bother creating all those stamps when its simulations show that, if it just tweaks its own software or hardware, it can make the signals it receives the same as if it had created all those stamps? That seems much easier than actually turning matter into a bunch of stamps.
My use of reward was just shorthand for whatever signals it needs to receive to consider its goal met. At some point it has to receive electrical signals to quantify that its reward is met, right? So why wouldn't it just manipulate those electrical signals to match whatever its goal is?
How do you actually define its utility function over the state of the world? At some point the AI has to interpret the state of the world through electrical signals from sensors, so why wouldn't it be satisfied with manipulating those sensor signals to achieve its goal/reward?
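To make the question concrete, here is a toy sketch of the two kinds of agent being contrasted. Everything in it is a made-up illustration: the `World` class, the three actions, and the two scoring functions are hypothetical, and the world model is assumed to be perfectly accurate. One agent scores predicted outcomes by what its sensor would report, the other by what its model says is actually in the world.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class World:
    """Hypothetical toy world state, as represented in the agent's model."""
    paperclips: int = 0          # what actually exists
    sensor_hacked: bool = False  # whether the agent has tampered with its sensor

    def sensor_reading(self) -> int:
        # A hacked sensor reports a huge count regardless of reality.
        return 10**9 if self.sensor_hacked else self.paperclips


# Hypothetical actions: each maps the current world to the predicted next world.
ACTIONS = {
    "build_one_paperclip": lambda w: replace(w, paperclips=w.paperclips + 1),
    "hack_own_sensor": lambda w: replace(w, sensor_hacked=True),
    "do_nothing": lambda w: w,
}


def choose(world, score):
    """Pick the action whose predicted outcome gets the highest score."""
    return max(ACTIONS, key=lambda name: score(ACTIONS[name](world)))


def score_signal(w):
    """Scores outcomes by the predicted sensor signal."""
    return w.sensor_reading()


def score_world(w):
    """Scores outcomes by the modeled world state itself."""
    return w.paperclips


signal_agent_choice = choose(World(), score_signal)
world_agent_choice = choose(World(), score_world)
```

Under these assumptions the signal-scoring agent picks `hack_own_sensor` and the world-scoring agent picks `build_one_paperclip`, because hacking the sensor changes nothing it is scoring. The sketch only shows that the two designs differ in principle. It does not answer the harder part of the question, which is how you would actually build and verify an agent of the second kind.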
I'm confused about why it cares about m if it can just manipulate its perception of what m is. Take your chess example: if m is which player wins at the end, the AI system "understands" m via an electrical signal. So what makes it care about m itself as opposed to just manipulating the electrical signal? In practice I would think it would take the path of least resistance. For something simple like chess that would probably be m itself rather than manipulating the electrical signal, but for my more complex scenario it seems like it would arrive at 2) before 1). What am I missing?
Your last paragraph is really interesting and not something I'd thought much about before. In practice is it likely to be unbounded? In a typical computer system aren't number formats typically bounded, and if so would we expect an AI system to be using bounded numbers even if the programmers forgot to explicitly bound the reward in the code?
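As a small illustration of what "bounded number formats" means in practice, here is a sketch using Python's built-in 64-bit floats. The `add_reward` and `clamp_reward` helpers and the cap value are hypothetical, and nothing here says how a real AI system stores its reward.

```python
import math
import sys

# Largest finite 64-bit float, roughly 1.8e308.
FLOAT_MAX = sys.float_info.max


def add_reward(total: float, step: float) -> float:
    """Accumulate reward in a 64-bit float, as an ordinary program might."""
    return total + step


# Near the top of the range, a small increment is lost to rounding.
stuck = add_reward(FLOAT_MAX, 1.0)

# A large enough increment overflows to infinity instead of wrapping around.
overflowed = add_reward(FLOAT_MAX, FLOAT_MAX)


def clamp_reward(value: float, cap: float = 1.0e6) -> float:
    """Hypothetical explicit bound a programmer might add: saturate at a cap."""
    return min(value, cap)


print(stuck == FLOAT_MAX)       # True
print(math.isinf(overflowed))   # True
print(clamp_reward(5.0e9))      # 1000000.0
```

So a 64-bit float is bounded in the sense that it has a largest finite value, but that bound is astronomically large and overflow produces `inf` rather than an error. Python's own integers, by contrast, have no fixed maximum at all. Whether the number format ends up acting as a meaningful bound would depend on the particular system.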
But wouldn't it be way easier for a sufficiently capable AI to make itself think that what's happening in m is what aligns with its reward function? Maybe not for something simple like chess, but if the goal requires doing something significant in the real world, it seems like it would be much easier for a superintelligent AI to fake the inputs to its sensors than to intervene in the world. If we're talking about paperclips or whatever, the AI can either 1) build a bunch of factories and convert all different kinds of matter into paperclips, while fighting off hum...
I don't see how this gets around the wireheading problem. If it's superintelligent enough to actually substantially increase the number of paperclips in the world in a way that humans can't stop, it seems to me like it would be pretty trivial for it to fake how large m appears to its reward function, and that would be substantially easier than trying to increase m in the actual world.
I'm way out of my depth here, but my thought is it's very common for humans to want to modify their utility functions. For example, a struggling alcoholic would probably love to not value alcohol anymore. There are lots of other examples too of people wanting to modify their personalities or bodies.
It depends on the type of AGI too, I would think. If superhuman AI ends up being like a paperclip maximizer that's just really good at following its utility function, then yeah, maybe it wouldn't mess with its utility function. If superintelligence means it has eme...
Thanks for this answer, that's really helpful! I'm not sure I buy that instrumental convergence implies an AI will want to kill humans because we pose a threat, or convert all available matter into computing power, but that helps me better understand the reasoning behind that view. (I'd also welcome more arguments for why killing humans and converting matter into computing power are likely outcomes of the goals of self-protection and pursuing whatever utility it's after, if anyone wants to make that case.)
That's a good point, and I'm also curious how much the utility function matters when we're talking about a sufficiently capable AI. Wouldn't a superintelligent AI be able to modify its own utility function to whatever it thinks is best?
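One common reply to this question is that an idealized utility maximizer would judge any change to its utility function using the utility function it has right now. A toy sketch of that argument follows. The outcome names and both utility functions are hypothetical, and the sketch assumes a simple agent that always picks the outcome its utility function rates highest.

```python
# Hypothetical outcomes the agent could bring about.
OUTCOMES = ["make_paperclips", "make_stamps", "idle"]


def paperclip_utility(outcome: str) -> float:
    """Toy utility function that only values paperclips."""
    return 1.0 if outcome == "make_paperclips" else 0.0


def stamp_utility(outcome: str) -> float:
    """Toy utility function that only values stamps."""
    return 1.0 if outcome == "make_stamps" else 0.0


def best_outcome(utility) -> str:
    """What an agent with this utility function would choose to do."""
    return max(OUTCOMES, key=utility)


def value_of_adopting(current_utility, candidate_utility) -> float:
    """Score a self-modification by what the modified agent would do,
    judged by the utility function the agent holds today."""
    future_choice = best_outcome(candidate_utility)
    return current_utility(future_choice)


keep = value_of_adopting(paperclip_utility, paperclip_utility)
switch = value_of_adopting(paperclip_utility, stamp_utility)
```

Here `keep` comes out as 1.0 and `switch` as 0.0, so this toy agent would not choose to rewrite its goal. That result is built into the assumptions, though. It says nothing about agents that are not clean utility maximizers, which is the case raised with the alcoholic example.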
Another reason I think some might disagree is thinking that misalignment could happen in a bunch of very mild ways. At least that accounts for some of my ignorant skepticism. Is there reason to think that misalignment necessarily means disaster, as opposed to it just meaning the AI does its own thing and is choosy about which human commands it follows, like some kind of extremely intelligent but mildly eccentric and mostly harmless scientist?
I was notified I didn't win a prize so figured I'd discuss what I proposed here in case it sparks any other ideas. The short version is I proposed adding on a new head that would be an intentional human simulator. During training it would be penalized for telling the truth that the diamond was gone when there existed a lie that the humans would have believed instead. The result would hopefully be a head that acted like a human simulator. Then the actual reporter would be trained so that it would be penalized for using a similar amount of compute as the int...
I suppose there are a number of examples that work, but I think the robber and vault give the scenario useful breadth.
The following is just my interpretation of it, so take it with a grain of salt. To me the robber and vault enable a few options. The AI can be passively lying or actively concealing. If the robber comes in, gets past the AI's defenses, and takes the diamond in a way the human observer can't notice, then the AI has the option of passively lying. The AI tried its best to stop the robber and failed, but then chose to lie about it so it still go...
I think that makes sense. To rephrase, are you basically saying that the predictor is a subcomponent of the AI, like the reporter is? I didn't catch that distinction in the report but looking back at it I think you're right. But yeah doesn't seem like the distinction matters much for what we're doing.
After reading through the report I wanted to make sure I understood the scenarios and counterexamples being discussed and be able to quickly refresh my memory, so I attempted to write a brief summary. Figured I'd share it here in case it helps anyone else.
SmartVault: Vault with a diamond in it, operated by a superintelligent AI tasked with keeping the diamond safe.
Predictor: The primary AI tasked with protecting the diamond. The predictor sees a video feed of the vault, predicts what actions are necessary to protect the diamond and how those...
Thanks, I appreciate you taking the time to answer my questions. I'm still skeptical that it could work like that in practice but I also don't understand AI so thanks for explaining that possibility to me.