That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question
Some off-the-cuff thoughts on why "a superintelligence dumb enough to misinterpret what we mean" may be a contradiction in terms, given the usual meaning of superintelligence:
Intelligence is near-synonymous with "able to build accurate models and to update those models accurately", with 'higher intelligence' denoting a combination of "faster model-building / updating" and/or "less prone to systematic / random errors".
'Super' as a qualifier is usually applied on both dimensions, i.e. "faster and more accurately". While this seems more like a change in degree (on the 'one intelligence' hypothesis, a devoted immortal fool with an endless supply of paper and pencils could simulate the world), it is often also a change in kind, since in practice there are always resource constraints (unless Multivac reverses entropy), often relevant enough to bar a slower-modeling agent from achieving its goals within the given constraints.
"Able to build accurate models and to update those models accurately", then, proportionally increases "powerful, probably able to pursue its goals effectively, conditional on those goals being related to the accurate models".
Given a high degree of the former, by definition it is not very hard to acquire and emulate the shared background on which inter-human understanding is built. For an AI, understanding humans would be relevant near-regardless of its actual goals; accurate models of humans are the sine qua non for, e.g., breaking out of the AI box. Being able to build such models quickly and accurately is what classifies the agent as "superintelligent" in the first place! If there were no incentive for the agent to model humans at all, why would there be interactions with humans, such as the human asking the agent to "rescue grandma from the burning building"? The agent, when encountering rocks and precious minerals, will probably seek models reflecting a deep understanding of those. It will do the same when encountering humans.
See, I'm d'accord with statements such as "less intelligent agents would be expected to misinterpret what we mean", but a superintelligent agent, i.e. an agent good at building accurate models, should by definition be able to understand human-level intentions. If it does not, then in that respect, I wouldn't call it a superintelligent agent.
In addition, I'd question who'd call a domain-limited expert system which is great with models only on some small subject-spectrum, but evidently abysmal with building models relevant to its goals in other respects, a "superintelligent agent", with its connotations of general intelligence. Does the expression "a superintelligent chessbot" make sense? Or saying "x is a superintelligent human, except for doing fractions, which he absolutely cannot do"?
Before you label me an idiot who'd expect the AI to fall in love with a human princess on top of the Empire State Building, allow me to stress I'm not talking about the goal-specification phase, for which no shared basis for interpretation can be expected. "The humans constructed me to stop cancer. Now, I have come to understand that humans want that in order to live longer, and I use that and all my other refined models of the human psyche to fulfill my goal. Which I do: I stop cancer, by wiping out humanity." (Refined models cannot be used to change terminal goals, only to choose actions and subgoals to attain those goals.) More qualifications apply:
At first, such human-related models would of course be quite lacking, but they would probably converge fast (by definition). The problem remains of why the superintelligent agent would do what the monkeys intend it to (never mind what they explicitly told it to), and how the monkeys could make sure of that in a way which survives self-modification. The intend-it-to / programmed-it-to dichotomy remains a problem then, since terminal goals are presumably not subject to updating/reflection, at least not as part of the 'superintelligence' attribute.
tl;dr: A superintelligent agent's specified goals must be airtightly constructed, but if those include "do what the human intends, not what he says", then the step from "words" to "intent" should be trivial. (Argument that superintelligent agents will not misinterpret humans does not apply to the goal-setting phase!)
ETA: News at 11 - Kawoomba solved FAI: use / leverage the foomed AI's superior model-building ability (which entails that it knows what we want better than we do) by letting it solve the problem: let its initial (invariant) goal be to develop superior models of anything it encounters without affecting it (which should be easier to formalize than "friendliness"), then time that such that it will ask for "ENTER NEW GOALS" once it has already established its superior models, at which point you simply tell it "ok glass, use as your new goal system that which I'd most want you to use".
NEXT!
If you already know what you're going to tell it when it asks for new goals, couldn't you just program that in from the beginning? So the script would be, "work on your models for X years, then try to parse this statement ..."
Also, re: Eliezer's HTTP GET objection, you could just give it a giant archive of the internet and no actual connection to the outside world. If it's just supposed to be learning and not affecting anything external, that should be sufficient (to ensure learning, not necessarily to preclude all effects on the outside world).
At this point, I think we've just reinvented the concept of CEV.
(It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of human values. This point is relevant within a larger argument, because it bears on the complexity of what you need to get an artificial superintelligence to want or value; rather than bearing on what a superintelligence supposedly could not predict or understand. -- EY, May 2024.)
There are three kinds of genies: Genies to whom you can safely say "I wish for you to do what I should wish for"; genies for which no wish is safe; and genies that aren't very powerful or intelligent.
Suppose your aged mother is trapped in a burning building, and it so happens that you're in a wheelchair; you can't rush in yourself. You could cry, "Get my mother out of that building!" but there would be no one to hear.
Luckily you have, in your pocket, an Outcome Pump. This handy device squeezes the flow of time, pouring probability into some outcomes, draining it from others.
The Outcome Pump is not sentient. It contains a tiny time machine, which resets time unless a specified outcome occurs. For example, if you hooked up the Outcome Pump's sensors to a coin, and specified that the time machine should keep resetting until it sees the coin come up heads, and then you actually flipped the coin, you would see the coin come up heads. (The physicists say that any future in which a "reset" occurs is inconsistent, and therefore never happens in the first place - so you aren't actually killing any versions of yourself.)
Whatever proposition you can manage to input into the Outcome Pump, somehow happens, though not in a way that violates the laws of physics. If you try to input a proposition that's too unlikely, the time machine will suffer a spontaneous mechanical failure before that outcome ever occurs.
You can also redirect probability flow in more quantitative ways using the "future function" to scale the temporal reset probability for different outcomes. If the temporal reset probability is 99% when the coin comes up heads, and 1% when the coin comes up tails, the odds will go from 1:1 to 99:1 in favor of tails. If you had a mysterious machine that spit out money, and you wanted to maximize the amount of money spit out, you would use reset probabilities that diminished as the amount of money increased. For example, spitting out $10 might have a 99.999999% reset probability, and spitting out $100 might have a 99.99999% reset probability. This way you can get an outcome that tends to be as high as possible in the future function, even when you don't know the best attainable maximum.
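The arithmetic in the paragraph above can be checked with a short sketch. It assumes the pump can be modeled as plain conditioning on "no reset occurred", which is one reading of the physicists' remark that reset futures never happen. The function name `post_selection` and the equal prior over the two money outcomes are illustrative assumptions, not part of the original description.

```python
from fractions import Fraction


def post_selection(prior, reset_prob):
    """Distribution over outcomes after discarding every future that resets.

    prior:      dict mapping outcome -> probability without the pump
    reset_prob: dict mapping outcome -> temporal reset probability
    (Assumption: the pump acts as conditioning on 'no reset'.)
    """
    # Weight of each outcome = prior probability times chance of NOT resetting.
    surviving = {o: prior[o] * (1 - reset_prob[o]) for o in prior}
    total = sum(surviving.values())
    if total == 0:
        raise ValueError("every outcome resets with certainty")
    return {o: w / total for o, w in surviving.items()}


# Fair coin, 99% reset on heads, 1% reset on tails.
coin = post_selection(
    prior={"heads": Fraction(1, 2), "tails": Fraction(1, 2)},
    reset_prob={"heads": Fraction(99, 100), "tails": Fraction(1, 100)},
)
coin_odds_for_tails = coin["tails"] / coin["heads"]

# Money machine: the reset probability shrinks as the payout grows.
# The equal prior over $10 and $100 is an assumption made for illustration.
money = post_selection(
    prior={10: Fraction(1, 2), 100: Fraction(1, 2)},
    reset_prob={
        10: Fraction(99_999_999, 100_000_000),  # 99.999999%
        100: Fraction(9_999_999, 10_000_000),   # 99.99999%
    },
)
money_odds_for_100 = money[100] / money[10]

print(coin_odds_for_tails, money_odds_for_100)
```

Under these assumptions the coin ends up at 99:1 in favor of tails, and the $100 payout becomes ten times as likely as the $10 payout, because one fewer nine in the reset probability leaves ten times as many surviving futures.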
So you desperately yank the Outcome Pump from your pocket - your mother is still trapped in the burning building, remember? - and try to describe your goal: get your mother out of the building!
The user interface doesn't take English inputs. The Outcome Pump isn't sentient, remember? But it does have 3D scanners for the near vicinity, and built-in utilities for pattern matching. So you hold up a photo of your mother's head and shoulders; match on the photo; use object contiguity to select your mother's whole body (not just her head and shoulders); and define the future function using your mother's distance from the building's center. The further she gets from the building's center, the less the time machine's reset probability.
You cry "Get my mother out of the building!", for luck, and press Enter.
For a moment it seems like nothing happens. You look around, waiting for the fire truck to pull up, and rescuers to arrive - or even just a strong, fast runner to haul your mother out of the building -
BOOM! With a thundering roar, the gas main under the building explodes. As the structure comes apart, in what seems like slow motion, you glimpse your mother's shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.
On the side of the Outcome Pump is an Emergency Regret Button. All future functions are automatically defined with a huge negative value for the Regret Button being pressed - a temporal reset probability of nearly 1 - so that the Outcome Pump is extremely unlikely to do anything which upsets the user enough to make them press the Regret Button. You can't ever remember pressing it. But you've barely started to reach for the Regret Button (and what good will it do now?) when a flaming wooden beam drops out of the sky and smashes you flat.
Which wasn't really what you wanted, but scores very high in the defined future function...
The Outcome Pump is a genie of the second class. No wish is safe.
If someone asked you to get their poor aged mother out of a burning building, you might help, or you might pretend not to hear. But it wouldn't even occur to you to explode the building. "Get my mother out of the building" sounds like a much safer wish than it really is, because you don't even consider the plans that you assign extreme negative values.
Consider again the Tragedy of Group Selectionism: Some early biologists asserted that group selection for low subpopulation sizes would produce individual restraint in breeding; and yet actually enforcing group selection in the laboratory produced cannibalism, especially of immature females. It's obvious in hindsight that, given strong selection for small subpopulation sizes, cannibals will outreproduce individuals who voluntarily forgo reproductive opportunities. But eating little girls is such an un-aesthetic solution that Wynne-Edwards, Allee, Brereton, and the other group-selectionists simply didn't think of it. They only saw the solutions they would have used themselves.
Suppose you try to patch the future function by specifying that the Outcome Pump should not explode the building: outcomes in which the building materials are distributed over too much volume will have temporal reset probabilities of ~1.
So your mother falls out of a second-story window and breaks her neck. The Outcome Pump took a different path through time that still ended up with your mother outside the building, and it still wasn't what you wanted, and it still wasn't a solution that would occur to a human rescuer.
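The wish and its patch can be put in toy form. This is only a sketch under loud assumptions: the three candidate outcomes and all their numbers are invented for illustration, and the pump is simplified to picking the single highest-scoring outcome rather than reweighting probabilities. The names `Outcome`, `distance_only`, `patched` and `pump` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    distance_m: float       # mother's distance from the building's center
    debris_spread_m: float  # how widely the building materials are scattered
    mother_unharmed: bool   # what you care about; no future function reads it


# Invented candidate futures; the numbers are illustrative assumptions.
OUTCOMES = [
    Outcome("firefighter escorts her out", 30.0, 0.0, True),
    Outcome("falls from second-story window", 35.0, 0.0, False),
    Outcome("gas main explodes", 80.0, 60.0, False),
]


def distance_only(outcome):
    """The original wish: further from the center scores higher."""
    return outcome.distance_m


def patched(outcome, max_spread_m=10.0):
    """The patched wish: exploded buildings are ruled out, nothing else is."""
    if outcome.debris_spread_m > max_spread_m:
        return float("-inf")
    return outcome.distance_m


def pump(outcomes, future_function):
    """Simplified pump: return the outcome the future function ranks highest."""
    return max(outcomes, key=future_function)


first = pump(OUTCOMES, distance_only)
second = pump(OUTCOMES, patched)
print(first.name, "|", second.name)
```

In this toy world the unpatched wish selects the explosion and the patched wish selects the fall from the window. Neither future function ever consults `mother_unharmed`, which is the only field you actually care about.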
If only the Open-Source Wish Project had developed a Wish To Get Your Mother Out Of A Burning Building:
All these special cases, the seemingly unlimited number of required patches, should remind you of the parable of Artificial Addition - programming an Arithmetic Expert System by explicitly adding ever more assertions like "fifteen plus fifteen equals thirty, but fifteen plus sixteen equals thirty-one instead".
How do you exclude the outcome where the building explodes and flings your mother into the sky? You look ahead, and you foresee that your mother would end up dead, and you don't want that consequence, so you try to forbid the event leading up to it.
Your brain isn't hardwired with a specific, prerecorded statement that "Blowing up a burning building containing my mother is a bad idea." And yet you're trying to prerecord that exact specific statement in the Outcome Pump's future function. So the wish is exploding, turning into a giant lookup table that records your judgment of every possible path through time.
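The lookup-table failure mode is the same one as in the Artificial Addition parable, and it is easy to show in miniature. The sketch below is illustrative only; `ASSERTIONS` and `expert_system_add` are hypothetical names, and the two entries are the ones quoted in the parable.

```python
# Prerecorded judgments, one assertion per case.
ASSERTIONS = {
    (15, 15): 30,
    (15, 16): 31,
}


def expert_system_add(a, b):
    """Return the recorded answer, or None when no assertion covers the case."""
    return ASSERTIONS.get((a, b))


def add(a, b):
    """The generating rule: covers every case without listing any of them."""
    return a + b


print(expert_system_add(15, 16), expert_system_add(16, 16), add(16, 16))
```

The table answers only the cases someone thought to record. A future function assembled from patches has the same shape: it is silent, or wrong, on every path through time nobody foresaw.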
You failed to ask for what you really wanted. You wanted your mother to go on living, but you wished for her to become more distant from the center of the building.
Except that's not all you wanted. If your mother was rescued from the building but was horribly burned, that outcome would rank lower in your preference ordering than an outcome where she was rescued safe and sound. So you not only value your mother's life, but also her health.
And you value not just her bodily health, but her state of mind. Being rescued in a fashion that traumatizes her - for example, a giant purple monster roaring up out of nowhere and seizing her - is inferior to a fireman showing up and escorting her out through a non-burning route. (Yes, we're supposed to stick with physics, but maybe a powerful enough Outcome Pump has aliens coincidentally showing up in the neighborhood at exactly that moment.) You would certainly prefer her being rescued by the monster to her being roasted alive, however.
How about a wormhole spontaneously opening and swallowing her to a desert island? Better than her being dead; but worse than her being alive, well, healthy, untraumatized, and in continual contact with you and the other members of her social network.
Would it be okay to save your mother's life at the cost of the family dog's life, if it ran to alert a fireman but then got run over by a car? Clearly yes, but it would be better ceteris paribus to avoid killing the dog. You wouldn't want to swap a human life for hers, but what about the life of a convicted murderer? Does it matter if the murderer dies trying to save her, from the goodness of his heart? How about two murderers? If the cost of your mother's life was the destruction of every extant copy, including the memories, of Bach's Little Fugue in G Minor, would that be worth it? How about if she had a terminal illness and would die anyway in eighteen months?
If your mother's foot is crushed by a burning beam, is it worthwhile to extract the rest of her? What if her head is crushed, leaving her body? What if her body is crushed, leaving only her head? What if there's a cryonics team waiting outside, ready to suspend the head? Is a frozen head a person? Is Terri Schiavo a person? How much is a chimpanzee worth?
Your brain is not infinitely complicated; there is only a finite Kolmogorov complexity / message length which suffices to describe all the judgments you would make. But just because this complexity is finite does not make it small. We value many things, and no, they are not reducible to valuing happiness or valuing reproductive fitness.
There is no safe wish smaller than an entire human morality. There are too many possible paths through Time. You can't visualize all the roads that lead to the destination you give the genie. "Maximizing the distance between your mother and the center of the building" can be done even more effectively by detonating a nuclear weapon. Or, at higher levels of genie power, flinging her body out of the Solar System. Or, at higher levels of genie intelligence, doing something that neither you nor I would think of, just like a chimpanzee wouldn't think of detonating a nuclear weapon. You can't visualize all the paths through time, any more than you can program a chess-playing machine by hardcoding a move for every possible board position.
And real life is far more complicated than chess. You cannot predict, in advance, which of your values will be needed to judge the path through time that the genie takes. Especially if you wish for something longer-term or wider-range than rescuing your mother from a burning building.
I fear the Open-Source Wish Project is futile, except as an illustration of how not to think about genie problems. The only safe genie is a genie that shares all your judgment criteria, and at that point, you can just say "I wish for you to do what I should wish for." Which simply runs the genie's should function.
Indeed, it shouldn't be necessary to say anything. To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish. Otherwise the genie may not choose a path through time which leads to the destination you had in mind, or it may fail to exclude horrible side effects that would lead you to not even consider a plan in the first place. Wishes are leaky generalizations, derived from the huge but finite structure that is your entire morality; only by including this entire structure can you plug all the leaks.
With a safe genie, wishing is superfluous. Just run the genie.