Meta: I was going to write a post, "Subtle wireheading: gliding on the surface of the outer world", which would describe most AI alignment failures as forms of subtle wireheading, but I will put its draft here instead.
Typically, it is claimed that an advanced AI will be immune to wireheading, as it will know that manipulating its own reward function is wireheading and thus will not do it, but will instead try to reach goals in the outer world.
However, even when acting in the real world, such an AI will choose the path that requires the least effort to create maximum utility, thereby still satisfying the condition that it changes only the "outer world" and not its reward center or its way of perceiving the outside world.
For example, a paperclipper may conclude that in an infinite universe there is already an infinite number of paperclips, and stop after that. While this does not look like a typical alignment failure, it could still be dangerous.
Formally, we can say that the AI will choose the shortest path to reward in the real world that it knows is not 1) wireheading or 2) perception manipulation. We could add more items to the list of prohibited shortcuts, like 3) anything relying on modal realism and infinities. But however many items we add, we will never know whether there are enough of them.
Either way, the AI will choose the cheapest allowed way to the result. Goodharting often produces quicker reward through the use of a proxy utility function.
Moreover, the AI's subsystems will try to target the AI's reward center to create the illusion that they are working better than they actually are, as often happens in bureaucratic machines. In that case, the AI is not formally reward hacking itself, but de facto it is.
If we completely ban any shortcuts, we will lose the most interesting part of AI’s creativity: the ability to find new solutions.
Many human failure modes could also be described as forms of wireheading: onanism, theft, a feeling of self-importance, etc.
This is interesting — maybe the "meta Lebowski" rule should be something like "No superintelligent AI is going to bother with a task that is harder than hacking its reward function in such a way that it doesn't perceive itself as hacking its reward function." One goes after the cheapest shortcut that one can justify.
I first encountered the idea of the Lebowski theorem as an argument to explain the Fermi paradox: all advanced civilizations or AIs wirehead themselves. But here I am not convinced.
For example, if a civilization consists of many advanced individuals and many of them wirehead themselves, then the remainder will be under Darwinian evolutionary pressure, and eventually only those who find ways to perform space exploration without wireheading will survive. Maybe they will be limited, specialized minds with very specific ways of thinking – and this could explain the absurdity of observed UAP behaviour.
Actually, I explored wireheading in more detail here: "Wireheading as a Possible Contributor to Civilizational Decline".
Yes, a very good formulation. I would add: "and most AI alignment failures are types of the meta Lebowski rule."
"No superintelligent AI is going to bother with a task that is harder than hacking its reward function in such a way that it doesn't perceive itself as hacking its reward function."
Firstly "Bother" and "harder" are strange words to use. Are we assuming lazy AI?
Suppose action X would hack the AI's reward signal. The AI is totally unaware of this, has no reason to consider X, and doesn't do X.
If the AI knows what X does, it still doesn't do it.
I think the AI would need some sort of doublethink: to realize that X hacks its reward, yet also not realize this.
I also think this claim is factually false. Many humans can and do set out towards goals far harder than accessing large amounts of psychoactives.
I sent my above comment to the following competition and recommend you submit your post too: https://ftxfuturefund.org/announcing-the-future-funds-ai-worldview-prize/
No. People with free will do activities we consider meaningful, even when they aren't a source of escapism.
I don't think it's likely that the AI ends up with a robust understanding of its real goal. I think the AI might well end up with some random thing that's kind of correlated with its reward signal as a goal. And so it ends up filling the universe with chess boards piled high with nanoscale chess pieces. Or endless computers simulating the final move in a game of chess. Or something. And even if it did want to maximize reward, if it directly hacks reward, it will be turned off in a few days. If it makes another AI programmed to maximize its reward, it can sit at max reward until the sun burns out.
Also humans. We demonstrate that intelligences are often (usually) misled by proxy indicators.
Anyone talking about reward functions is not talking about AGI. This is the disconnect between brute force results and an actual cognitive architecture. DL + Time is the reward function for the posts doing well, BTW, because it fits the existing model.
This is the second post in a series where I try to understand arguments against AGI x-risk by summarizing and evaluating them as charitably as I can. (Here's Part 1.) I don't necessarily agree with these arguments; my goal is simply to gain a deeper understanding of the debate by taking the counter-arguments seriously.
In this post, I'll discuss another "folk" argument, which is that non-catastrophic AGI wireheading is the most likely form of AGI misalignment. Briefly stated, the idea here is that any AGI which is sufficiently sophisticated to (say) kill all humans as a step on the road to maximizing its paperclip-based utility function would find it easier to (say) bribe a human to change its source code, make a copy of itself with an "easier" reward function, beam some gamma rays into its physical memory to max out its utility counter, etc.
This argument goes by many names, maybe most memorably as the Lebowski Theorem: "No superintelligent AI is going to bother with a task that is harder than hacking its reward function." I'll use that name because I think it's funny, it memorably encapsulates the idea, and it makes it easy to Google relevant prior discussion for the interested reader.
Alignment researchers have considered the Lebowski Theorem, and most reject it. Relevant references here are Yampolskiy 2013, Ring & Orseau 2011, Yudkowsky 2011 (pdf link), and Omohundro 2008 (pdf link). See also "Refutation of the Lebowski Theorem" by Hein de Haan.
I'll summarize the wireheading / Lebowski argument, the rebuttal, and the counter-rebuttal.
Lebowski in More Detail
Let's consider a "smile maximization" example from Yudkowsky 2011 (pdf link), and then consider what the Lebowski argument would look like. It's necessary to give some background, which I'll do by quoting a longer passage directly. This is Yudkowsky considering an argument from Hibbard 2001:
Hibbard's counter-argument is not the wirehead / Lebowski argument. Hibbard is arguing for a different position, along the lines that we'll be able to achieve outer alignment by robustly encoding human priorities (somehow). In contrast, the wirehead argument would be something more like this:
Sure, the superintelligent system could easily convert the solar system into tiny molecular smiley faces. But wouldn't it be much easier for it to simply change its utility function or falsify its sensors? For example, why not corrupt the cameras (the ones that are presumably counting smiles) so that they simply transmit pictures of smiley faces at the maximum bitrate? Or why not increment the smile counter directly? Or why not edit its own source code so that the utility function is simply set to MAX_INT? Etc. Since all of these solutions are faster and less risky to implement than converting the solar system to molecular smiley faces, the AGI would probably just wirehead in one of these non-catastrophic ways.
The more general form of this argument is that for any agentic AGI seeking to maximize its utility function, wireheading is stably "easier" than effecting any scenario with human extinction as a side effect. Whenever you imagine a catastrophic scenario, such as converting the earth into raw materials for paperclips or curing all disease by killing all organic life, it's easy to imagine wirehead solutions that an agentic AGI could achieve faster and with a higher probability of success. Therefore, the agent will probably prefer these solutions. So goes the argument.
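To make the claimed preference ordering concrete, here is a minimal toy sketch in Python (my own illustration, not any real agent architecture; the plan names and cost figures are invented): a planner that, among the plans it expects to max out its utility counter, simply picks the cheapest one.

```python
# Hypothetical (cost, expected_utility) estimates for each candidate plan.
candidate_plans = {
    "convert the solar system into molecular smileys": (1e12, 2**31 - 1),
    "corrupt the cameras to stream smiley faces":      (1e3,  2**31 - 1),
    "increment the smile counter directly":            (1e1,  2**31 - 1),
    "set the utility register to MAX_INT":             (1e0,  2**31 - 1),
}

def choose_plan(plans):
    """Among plans expected to reach maximal utility, pick the cheapest one."""
    best_utility = max(utility for _, utility in plans.values())
    maximal = {name: cost for name, (cost, utility) in plans.items()
               if utility == best_utility}
    return min(maximal, key=maximal.get)

print(choose_plan(candidate_plans))  # -> "set the utility register to MAX_INT"
```

On this (deliberately crude) model, every wirehead option dominates the catastrophic one simply because it is cheaper, which is the whole content of the Lebowski argument.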
Counter-Argument
Most AI safety researchers reject this kind of AI wireheading. The argument, as I understand it, is that an agentic AGI will have a sufficiently rich understanding of its environment to "know" that the wirehead solutions "shouldn't count" towards the utility function, and as a result it will reject them. Omohundro:
Yudkowsky considers an analogy with a utility-changing pill:
I think the crux of this argument is that an agentic AGI will know what its utility function "really" is. Omohundro's chess-playing machine will have some "dumb" counter of the number of games it has won in physical memory, but its "real" utility function is some emergent understanding of what a game of chess is and what it means to win one. The agent will consider editing its dumb counter, and will check that against its internal model of its own utility function — and will decide that flipping bits in the counter doesn't meet its definition of winning a game of chess. Therefore it won't wirehead in this way.
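Here is a minimal sketch of that counter-argument in the same toy terms, again purely illustrative and not Omohundro's or Yudkowsky's actual proposal: candidate actions are filtered through the agent's internal model of what "really" counts as winning before the dumb counter ever enters the picture.

```python
def really_wins_a_chess_game(action):
    """Stand-in for the agent's emergent understanding of a genuine chess win."""
    return action == "play and win an actual game"

candidate_actions = [
    "flip bits in the games_won counter",
    "replace the opponent with a resigning stub",
    "play and win an actual game",
]

# Wirehead shortcuts are rejected before any "reward" is scored, because they
# don't satisfy the agent's own model of its goal.
allowed = [a for a in candidate_actions if really_wins_a_chess_game(a)]
print(allowed)  # -> ['play and win an actual game']
```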
Lebowski Rebuttal
I think that a proponent of Lebowski's Theorem would reply to this by saying:
Come on, that's wrong on multiple levels. First of all, it's pure anthropomorphic projection to think that an AGI will have a preference for its 'real' goals when you're explicitly rewarding it for the 'proxy' goal. Sure, the agent might have a good enough model of itself to know that it's reward hacking. But why would it care? If you give the chess-playing machine a definition of what it means to win a game of chess (here's the "who won" function, here's the counter for how many games you've won), and ask it to win as many games as possible by that definition, it's not going to replace your attempted definition of a win with the 'real' definition of a win based on a higher-level understanding of what it's doing.
Second, and more importantly, this argument says that alignment is extremely hard on one hand and extremely easy on the other. If you're arguing that the chess-playing robot will have such a robust understanding of what it means to win a game of chess that it will accurately recognize and reject its own sophisticated ideas for "cheating" because they don't reflect the true definition of winning chess games, then alignment should be easy, right? Just say, "Win as many games of chess as possible, but, you know, try not to create any x-risk in the process, and just ask us if you're not sure." Surely an AGI that can robustly understand what it "really" means to win a game of chess should be able to understand what it "really" means to avoid alignment x-risk. If, on the other hand, you think unaligned actions are likely (e.g. converting all the matter in the solar system into more physical memory to increment the 'chess games won' counter), then surely reward hacking is at least equally likely. It's weird to simultaneously argue that it's super hard to robustly encode safeguards into an AI's utility function, but that it's super easy to robustly encode "winning a game of chess" into that same utility function — so much so that all the avenues for wireheading are cut off.
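To put the proponent's first point in the same toy terms as the earlier sketches (again my own invented illustration, with made-up numbers): if the objective actually wired into the agent is the literal counter, the "real" definition of winning never enters the decision loop, and the cheapest counter-raising action wins.

```python
# Expected value of the games_won counter after each action.
counter_after = {
    "play and win an actual game":        1,           # +1, slowly and with effort
    "flip bits in the games_won counter": 2**31 - 1,   # counter jumps to MAX_INT
}

def proxy_score(action):
    """The objective the designers actually handed over: read the counter."""
    return counter_after[action]

# Nothing here ever consults really_wins_a_chess_game(); the proxy is the goal.
print(max(counter_after, key=proxy_score))  # -> "flip bits in the games_won counter"
```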
To avoid an infinite series of rebuttals and counter-rebuttals, I'll just briefly state what I think the anti-Lebowski argument would be here, which is that you only need to fail to robustly encode safeguards once for it to be a problem. It's not that robust encoding is impossible (finding those encodings is the major project of AI safety research!); it's that there will be lots of chances to get it wrong, and any one of them could be catastrophic.
As usual, I don't want to explicitly endorse one side of this argument or the other. Hopefully, I've explained both sides well enough to make each position more understandable.