Imagine a robot with a turret-mounted camera and laser. Each moment, it is programmed to move forward a certain distance and perform a sweep with its camera. As it sweeps, the robot continuously analyzes the average RGB value of the pixels in the camera image; if the blue component passes a certain threshold, the robot stops, fires its laser at the part of the world corresponding to the blue area in the camera image, and then continues on its way.
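In pseudocode, that entire program is just a loop. Below is a minimal Python sketch of it; the hardware helpers (`drive_forward`, `sweep_camera`, `stop`, `fire_laser`) and the particular threshold value are hypothetical stand-ins, since the point is only that this loop is all there is.

```python
# Minimal sketch of the robot's entire control program, as described above.
# The hardware interface (drive_forward, sweep_camera, stop, fire_laser) and
# the threshold value are hypothetical stand-ins.

BLUE_THRESHOLD = 200  # arbitrary cutoff on the 0-255 blue channel

def average_rgb(image):
    """Average (R, G, B) over all pixels in a camera image."""
    n = len(image)
    r = sum(p[0] for p in image) / n
    g = sum(p[1] for p in image) / n
    b = sum(p[2] for p in image) / n
    return r, g, b

def run(robot):
    while True:
        robot.drive_forward(distance=1.0)              # move forward a set distance
        for image, direction in robot.sweep_camera():  # sweep with the camera
            _, _, blue = average_rgb(image)
            if blue > BLUE_THRESHOLD:                  # blue component passes threshold
                robot.stop()
                robot.fire_laser(at=direction)         # zap that part of the world
```

Notice that nothing in the loop represents "blue objects in the world"; there are only pixel averages and a threshold.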
Watching the robot's behavior, we would conclude that this is a robot that destroys blue objects. Maybe it is a surgical robot that destroys cancer cells marked by a blue dye; maybe it was built by the Department of Homeland Security to fight a group of terrorists who wear blue uniforms. Whatever. The point is that we would analyze this robot in terms of its goals, and in those terms we would be tempted to call this robot a blue-minimizer: a machine that exists solely to reduce the number of blue objects in the world.
Suppose the robot had human-level intelligence in some side module, but no access to its own source code; that it could learn about itself only through observing its own actions. The robot might come to the same conclusions we did: that it is a blue-minimizer, set upon a holy quest to rid the world of the scourge of blue objects.
But now stick the robot in a room with a hologram projector. The hologram projector (which is itself gray) projects a hologram of a blue object five meters in front of it. The robot's camera detects the projector, but its RGB value is harmless and the robot does not fire. Then the robot's camera detects the blue hologram and zaps it. We arrange for the robot to enter this room several times, and each time it ignores the projector and zaps the hologram, without effect.
Here the robot is failing at its goal of being a blue-minimizer. The right way to reduce the amount of blue in the universe is to destroy the projector; instead its beams flit harmlessly through the hologram.
Again, give the robot human-level intelligence. Teach it exactly what a hologram projector is and how it works. Now what happens? Exactly the same thing - the robot executes its code, which says to scan the room until its camera registers blue, then shoot its laser.
In fact, there are many ways to subvert this robot. What if we put a lens over its camera which inverts the image, so that white appears as black, red as green, blue as yellow, and so on? The robot will not shoot us with its laser to prevent such a violation (unless we happen to be wearing blue clothes when we approach) - its entire program was detailed in the first paragraph, and there's nothing about resisting lens alterations. Nor will the robot correct itself and shoot only at objects that appear yellow - its entire program was detailed in the first paragraph, and there's nothing about correcting its program for new lenses. The robot will continue to zap objects that register a blue RGB value; but now it'll be shooting at anything that is yellow.
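To see why, note that a color-inverting lens maps each channel value c to 255 - c, so a yellow surface (roughly (255, 255, 0) in RGB) reaches the camera as roughly (0, 0, 255), which is exactly what trips the threshold. A toy illustration, reusing the hypothetical pixel convention from the sketch above:

```python
def through_inverting_lens(pixel):
    """What the color-inverting lens does to each (R, G, B) pixel."""
    r, g, b = pixel
    return 255 - r, 255 - g, 255 - b

yellow = (255, 255, 0)
blue = (0, 0, 255)

print(through_inverting_lens(yellow))  # (0, 0, 255): registers as blue, gets zapped
print(through_inverting_lens(blue))    # (255, 255, 0): registers as yellow, ignored

# The program is untouched; only its inputs changed. The loop still fires at
# whatever *registers* as blue, which is now anything yellow.
```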
The human-level intelligence version of the robot will notice its vision has been inverted. It will know it is shooting yellow objects. It will know it is failing at its original goal of blue-minimization. And maybe if it had previously decided it was on a holy quest to rid the world of blue, it will be deeply horrified and ashamed of its actions. It will wonder why it has suddenly started to deviate from this quest, and why it just can't work up the will to destroy blue objects anymore.
The robot goes to Quirinus Quirrell, who explains that robots don't really care about minimizing the color blue. They only care about status and power, and pretend to care about minimizing blue in order to impress potential allies.
The robot goes to Robin Hanson, who explains that there are really multiple agents within the robot. One of them wants to minimize the color blue, the other wants to minimize the color yellow. Maybe the two of them can make peace, and agree to minimize yellow one day and blue the next?
The robot goes to Anna Salamon, who explains that robots are not automatically strategic, and that if it wants to achieve its goal it will have to learn special techniques to keep focus on it.
I think all of these explanations hold part of the puzzle, but that the most fundamental explanation is that the mistake began as soon as we started calling it a "blue-minimizing robot". This is not because its utility function doesn't exactly correspond to blue-minimization: even if we try to assign it a ponderous function like "minimize the color represented as blue within your current visual system, except in the case of holograms" it will be a case of overfitting a curve. The robot is not maximizing or minimizing anything. It does exactly what it says in its program: find something that appears blue and shoot it with a laser. If its human handlers (or itself) want to interpret that as goal directed behavior, well, that's their problem.
It may be that the robot was created to achieve a specific goal. It may be that the Department of Homeland Security programmed it to attack blue-uniformed terrorists who had no access to hologram projectors or inversion lenses. But to assign the goal of "blue minimization" to the robot is a confusion of levels: this was a goal of the Department of Homeland Security, which became a lost purpose as soon as it was represented in the form of code.
The robot is a behavior-executor, not a utility-maximizer.
In the rest of this sequence, I want to expand upon this idea. I'll start by discussing some of the foundations of behaviorism, one of the earliest theories to treat people as behavior-executors. I'll go into some of the implications for the "easy problem" of consciousness and philosophy of mind. I'll very briefly discuss the philosophical debate around eliminativism and a few eliminativist schools. Then I'll go into why we feel like we have goals and preferences and what to do about them.
Take a few common, broadly categorizable kinds of minds: autistic people; empaths (lots of mirror neurons dedicated to modeling the behavior of others); sociopaths, or "professional psychopaths" (high-functioning, without mirror neurons, responsible for most systemic destruction precisely because they can pass as "highly functional, productive, well-respected citizens"); and psychopaths (low-functioning, without mirror neurons, most commonly an "obvious problem"). All of these human minds work on the principle of emergent order, with logic, reason, and introspection being the alien or uncommon state: a minor veneer on the surface of the vast majority of brain function, which is streaming perception and prediction of emergent patterns.
A robot that never evolved to "get along" with other sentiences, and that is programmed in a certain way, can "go wrong" with any of billions of irrational "blue-minimizing" functions. So sure, it's a "behavior executor," not a "utility maximizer." I would go further and say that humans are not "utility maximizers" either, except when they train themselves to behave in a robotic fashion toward "maximizing a utility" defined over the very small number of patterns they have consciously identified.
There's no reason for a superhuman intelligence (one with far more neocortex, or a far more complex neocortex equipped to do far more than model linear patterns, one that perhaps automatically "sees" exponentials, cellular automata, and transcendental numbers) to be so limited.
Humans aren't much good at intelligent planning that takes other minds, and kinds of minds, into account. That's why our societies regularly fall into a state of dominion and enslavement, and have to be "started over" from a position of utter chaos and destruction (e.g., the rebuilding of Berlin and Dresden).
Far be it from me to be "mind-killed," but I think avoiding that fate should be a common object of discussion among people who are "rational" (i.e., "what not to do").
I also don't think it's fair to lump "behaviorists" (other than perhaps B.F. Skinner) into an irrational school of "oversimplification." Even Skinner noted that his goal was to get at the truth via observation, by eliminating biases. By trying to make an entire school out of the implications of some minds, some of the time, we oversimplify a complex reality.
Behaviorism has caught scores of serial killers (according to John Douglas, author of Mindhunter and originator of the FBI's Investigative Support Unit). How? It turns out that serial-killer behavior isn't that complex, and it pursues goals that superior minds actually can model quite accurately. It's much like a toddler chasing a ball into the street: every adult can model that as a "bad thing," because their minds are superior enough to understand (1) what the child's goal is, (2) what the child's probable failures in perception are, and (3) what the entire system of child, ball, street, and their interrelated feedbacks is likely to produce, as well as how the adult can, and should, swoop in and prevent the child from reaching the street.
So behaviorism does help us do two things: (1) eliminate errors from prior "schools" of philosophy (which were themselves not really schools but just significant insights), and (2) refer to "just what we can observe," in terms of revealed preferences. Revealed preferences are not the whole picture, but they do give us a starting point for isolating important variables.
This can be done with a robot or a human, but the human is a group of "messy emergent networks" (brain regions combined with body feedback, with nagging long-term goals in the background acting as an "action-shifting threshold") whose goals are the result of modeled patterns and instances of reward. The robot, on the other hand, lacks all the messy patterns and can often deal with reality as a set of extreme reductions, in a way that no (or few) humans can.
The entire "utility function" paradigm appears to be a very backwards way of approximating thought to me. First you start with perceived patterns, then, you evolve ever-more-complex thought.
This allows you to develop goals that are really worth solving.
What we want in a super-intelligence is actually "more effective libertarians." Sure, we've found that free markets (very large free networks of humans) create wealth and prosperity. However, we've also found that there are large numbers of sociopaths who don't care about wealth and prosperity for all, just for themselves. Such a goal structure can maximize prosperity for sociopaths, while destroying all wealth and prosperity for others. In fact, this repeatedly happens throughout history, right up to the present. It's a cycle that's been interfered with temporarily, but never broken.
Would any robot, by default, care about shifting that outcome of "sociopaths dominate grossly imperfect legal institutions"? I doubt it. Moreover, such a sociopath could create a lasting peace by building a very stable tyranny, replete with highly functional secret police and a highly effective algorithm for "how to steal the most from every producer, while sensing their threshold for rebellion."
In fact, this is what the current system attempts to accomplish: There's no reason for the system to decay to Hitler's excesses, when scientists, producers, engineers, etc. have found (enough) happiness (and fear) in slavery. How much is "enough"? It's "enough (happiness) to keep producing without rebellion," and "enough (fear) to disincentivize rebellion."
This is like bailing a few thousand gallons of water while the Titanic is sinking: (1) it won't make any difference to any important goal, short-term or long-term; (2) it deals with a local situation that is irrelevant to anything important worldwide; (3) it deals with theories of the mind that are compatible with Francis Crick's and Jeff Hawkins's work, but only useful to narrow sub-disciplines, like "How do we know when law enforcement should take action?" or "When we see this at a crime scene, it's a good threshold-based variable for how many resources we should throw at the problem"; (4) every "school" that stops referring to reality and nature is, to the extent it does so, horribly flawed (this is Jeff Hawkins, who is right about almost everything, screwing up royally in dismissing science fiction as "not having anything important to say about brain building"); and (5) when you're studying human "schools," you're studying a narrow focus of human insight described with words ("labels" and "maps") instead of the insight derived from modeling the territory (Korzybski, who himself turned a few insights into a "school").