Followup to: Building Phenomenological Bridges
Summary: AI theorists often use models in which agents are crisply separated from their environments. This simplifying assumption can be useful, but it leads to trouble when we build machines that presuppose it. A machine that believes it can only interact with its environment in a narrow, fixed set of ways will not understand the value, or the dangers, of self-modification. By analogy with Descartes' mind/body dualism, I refer to agent/environment dualism as Cartesianism. The open problem in Friendly AI (OPFAI) I'm calling naturalized induction is the project of replacing Cartesian approaches to scientific induction with reductive, physicalistic ones.
I'll begin with a story about a storyteller.
Once upon a time — specifically, 1976 — there was an AI named TALE-SPIN. This AI told stories by inferring how characters would respond to problems from background knowledge about the characters' traits. One day, TALE-SPIN constructed a most peculiar tale.
Henry Ant was thirsty. He walked over to the river bank where his good friend Bill Bird was sitting. Henry slipped and fell in the river. Gravity drowned.
Since Henry fell in the river near his friend Bill, TALE-SPIN concluded that Bill rescued Henry. But for Henry to fall in the river, gravity must have pulled Henry. Which means gravity must have been in the river. TALE-SPIN had never been told that gravity knows how to swim; and TALE-SPIN had never been told that gravity has any friends. So gravity drowned.
TALE-SPIN had previously been programmed to understand involuntary motion in the case of characters being pulled or carried by other characters — like Bill rescuing Henry. So it was programmed to understand 'character X fell to place Y' as 'gravity moves X to Y', as though gravity were a character in the story.1
For us, the hypothesis 'gravity drowned' has low prior probability because we know gravity isn't the type of thing that swims or breathes or makes friends. We want agents to seriously consider whether the law of gravity pulls down rocks; we don't want agents to seriously consider whether the law of gravity pulls down the law of electromagnetism. We may not want an AI to assign zero probability to 'gravity drowned', but we at least want it to neglect the possibility as Ridiculous-By-Default.
When we introduce deep type distinctions, however, we also introduce new ways our stories can fail.
Hutter's cybernetic agent model
Russell and Norvig's leading AI textbook credits Solomonoff with setting the agenda for the field of AGI: "AGI looks for a universal algorithm for learning and acting in any environment, and has its roots in the work of Ray Solomonoff[.]" As an approach to AGI, Solomonoff induction presupposes a model with a strong type distinction between the 'agent' and the 'environment'. To make its intuitive appeal and attendant problems more obvious, I'll sketch out the model.
A Solomonoff-inspired AI can most easily be represented as a multi-tape Turing machine like the one Alex Altair describes in An Intuitive Explanation of Solomonoff Induction. The machine has:
- three tapes, labeled 'input', 'work', and 'output'. Each initially has an infinite strip of 0s written in discrete cells.
- one head per tape, with the input head able to read its cell's digit and move to the right, the output head able to write 0 or 1 to its cell and move to the right, and the work head able to read, write, and move in either direction.
- a program, consisting of a finite, fixed set of transition rules. Each rule says when heads read, write, move, or do nothing, and how to transition to another rule.
A three-tape Turing machine.
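The machine described above can be sketched in a few lines of Python. This is only an illustrative toy, not Altair's or Hutter's formalism: the class name `ThreeTapeMachine`, the rule format, and the bit-flipping example program are all assumptions introduced here.

```python
from collections import defaultdict


class ThreeTapeMachine:
    """Toy machine: read-only input tape, read/write work tape, write-only output tape."""

    def __init__(self, rules, input_bits, start_state="start"):
        # Hypothetical rule format: (state, input_bit, work_bit) maps to
        # (next_state, work_write, work_move, input_move, output_write).
        self.rules = rules
        self.state = start_state
        self.input = defaultdict(int, enumerate(input_bits))  # unwritten cells read as 0
        self.work = defaultdict(int)
        self.output = []  # the output head only writes and moves right
        self.input_pos = 0
        self.work_pos = 0

    def step(self):
        """Apply one transition rule. Return False if no rule applies (halt)."""
        key = (self.state, self.input[self.input_pos], self.work[self.work_pos])
        if key not in self.rules:
            return False
        next_state, work_write, work_move, input_move, output_write = self.rules[key]
        self.work[self.work_pos] = work_write
        self.work_pos += work_move    # -1, 0, or +1
        self.input_pos += input_move  # 0 or +1: the input head never moves left
        if output_write is not None:
            self.output.append(output_write)
        self.state = next_state
        return True

    def run(self, max_steps):
        for _ in range(max_steps):
            if not self.step():
                break
        return self.output


# Example program: write the opposite of each input bit, leaving the work tape unchanged.
flip_rules = {
    ("start", b, w): ("start", w, 0, 1, 1 - b)
    for b in (0, 1)
    for w in (0, 1)
}

machine = ThreeTapeMachine(flip_rules, input_bits=[1, 0, 1, 1])
flipped = machine.run(max_steps=4)
```

The `max_steps` bound is there because this example program never halts on its own: the input tape is conceptually infinite, so the machine would keep reading 0s forever.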
We could imagine two such Turing machines communicating with each other. Call them 'Agent' and 'Environment', or 'Alice' and 'Everett'. Alice and Everett take turns acting. After Everett writes a bit to his output tape, that bit magically appears on Alice's input tape; and likewise, when Alice writes to her output tape, it gets copied to Everett's input tape. AI theorists have used this setup, which Marcus Hutter calls the cybernetic agent model, as an extremely simple representation of an agent that can perceive its environment (using the input tape), think (using the work tape), and act (using the output tape).2
A Turing machine model of agent-environment interactions. At first, the machines differ only in their programs. ‘Alice’ is the agent we want to build, while ‘Everett’ stands for everything else that’s causally relevant to Alice’s success.
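As a rough sketch of the turn-taking loop, we can stand in for each machine with a plain Python function. The function names and the `history` representation are assumptions made for illustration; the point is only that bits on tapes are the sole channel between the two.

```python
def run_cybernetic_loop(agent, environment, rounds):
    """Alternate turns: the agent acts, then the environment responds with a percept."""
    history = []  # list of (action, percept) pairs seen so far
    for _ in range(rounds):
        action = agent(history)                 # Alice writes a bit to her output tape
        percept = environment(history, action)  # Everett reads it and writes a reply
        history.append((action, percept))       # the reply appears on Alice's input tape
    return history


def alice_always_one(history):
    """A trivial agent that ignores its history and always outputs 1."""
    return 1


def everett_echo(history, action):
    """An environment that outputs its input."""
    return action


def everett_opposite(history, action):
    """An environment that outputs the opposite of its input."""
    return 1 - action


echo_run = run_cybernetic_loop(alice_always_one, everett_echo, rounds=3)
opposite_run = run_cybernetic_loop(alice_always_one, everett_opposite, rounds=3)
```

Notice that nothing in `run_cybernetic_loop` lets `environment` touch the agent's code or memory. That restriction is built into the shape of the loop itself.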
We can define Alice and Everett's behavior in terms of any bit-producing Turing machines we'd like, including ones that represent probability distributions and do Bayesian updating. Alice might, for example, use her work tape to track four distinct possibilities and update probabilities over them:3
- (a) Everett always outputs 0.
- (b) Everett always outputs 1.
- (c) Everett outputs its input.
- (d) Everett outputs the opposite of its input.
Alice starts with a uniform prior, i.e., 25% probability each. If Alice's first output is 1, and Everett responds with 1, then Alice can store those two facts on her work tape and conditionalize on them both, treating them as though they were certain. This results in 0.5 probability each for (b) and (c), 0 probability for (a) and (d).
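The update just described can be checked with a short script. This is a minimal sketch using exact fractions; the names `HYPOTHESES` and `conditionalize` are illustrative choices, and each hypothesis is treated as deterministic, so its likelihood for a given observation is either 1 or 0.

```python
from fractions import Fraction

# Each hypothesis maps Alice's output bit to the bit Everett is predicted to return.
HYPOTHESES = {
    "a": lambda action: 0,           # Everett always outputs 0
    "b": lambda action: 1,           # Everett always outputs 1
    "c": lambda action: action,      # Everett outputs its input
    "d": lambda action: 1 - action,  # Everett outputs the opposite of its input
}


def conditionalize(beliefs, action, percept):
    """Bayesian update on one (action, percept) pair, treated as certain."""
    unnormalized = {
        name: (p if HYPOTHESES[name](action) == percept else Fraction(0))
        for name, p in beliefs.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {name: p / total for name, p in unnormalized.items()}


prior = {name: Fraction(1, 4) for name in HYPOTHESES}
posterior = conditionalize(prior, action=1, percept=1)
```

The `ValueError` branch marks the limit of this toy: if Everett behaves in a way none of the four hypotheses allows, Alice has nothing left to update on.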
We care about an AI's epistemology only because it informs the AI's behavior — on this model, its bit output. If Alice outputs whatever bits maximize her expected chance of receiving 1s as input, then we can say that Alice prefers to perceive 1. In the example I just gave, such a preference predicts that Alice will proceed to output 1 forever. Further exploration is unnecessary, since she knows of no other importantly different hypotheses to test.
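To see why Alice settles on outputting 1, here is a small sketch of the decision step, assuming she greedily maximizes the probability that her next input is 1. The function names are hypothetical, and the beliefs are the post-update values from the example above.

```python
from fractions import Fraction

# Same four hypotheses as in the running example: Alice's action -> Everett's reply.
HYPOTHESES = {
    "a": lambda action: 0,
    "b": lambda action: 1,
    "c": lambda action: action,
    "d": lambda action: 1 - action,
}


def probability_of_one(beliefs, action):
    """Probability that the next percept is 1, given Alice's beliefs and her action."""
    return sum(p * HYPOTHESES[name](action) for name, p in beliefs.items())


def best_action(beliefs):
    """Pick the output bit that maximizes the chance of perceiving 1."""
    return max((0, 1), key=lambda action: probability_of_one(beliefs, action))


# Beliefs after Alice output 1 and Everett answered 1.
beliefs = {
    "a": Fraction(0),
    "b": Fraction(1, 2),
    "c": Fraction(1, 2),
    "d": Fraction(0),
}

chance_if_zero = probability_of_one(beliefs, 0)
chance_if_one = probability_of_one(beliefs, 1)
choice = best_action(beliefs)
```

Outputting 1 yields a 1 under both surviving hypotheses, while outputting 0 yields a 1 only under (b). Since Alice has no reason to test (b) against (c), she never needs to deviate.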
Enriching Alice's set of hypotheses for how Everett could act will let Alice win more games against a wider variety of Turing machines. The more programs Alice can pick out and assign a probability to, the more Turing machines Alice will be able to identify and intelligently respond to. If we aren't worried about whether it takes Alice ten minutes or a billion years to compute an update, and Everett will always patiently wait his turn, then we can simply have Alice perform perfect Bayesian updates; if her priors are right, and she translates her beliefs into sensible actions, she'll then be able to optimally respond to any environmental Turing machine.
For AI researchers following Solomonoff's lead, that's the name of the game: Figure out the program that will let Alice behave optimally while communicating with as wide a range of Turing machines as possible, and you've at least solved the theoretical problem of picking out the optimal artificial agent from the space of possible reasoners. The agent/environment model here may look simple, but a number of theorists see it as distilling into its most basic form the task of an AGI.2
Yet a Turing machine, like a cellular automaton, is an abstract machine — a creature of thought experiments and mathematical proofs. Physical computers can act like abstract computers, in just the same sense that heaps of apples can behave like the abstract objects we call 'numbers'. But computers and apples are high-level generalizations, imperfectly represented by concise equations.4 When we move from our mental models to trying to build an actual AI, we have to pause and ask how well our formalism captures what's going on in reality.
The problem with Alice
'Sensory input' or 'data' is what I call the information Alice conditionalizes on; and 'beliefs' or 'hypotheses' is what I call the resultant probability distribution and representation of possibilities (in Alice's program or work tape). This distinction seems basic to reasoning, so I endorse programming agents to treat them as two clearly distinct types. But in building such agents, we introduce the possibility of Cartesianism.
René Descartes held that human minds and brains, although able to causally interact with each other, can each exist in the absence of the other; and, moreover, that the properties of purely material things can never fully explain minds. In his honor, we can call a model or procedure Cartesian if it treats the reasoner as a being separated from the physical universe. Such a being can perceive (and perhaps alter) physical processes, but it can't be identified with any such process.5
The relevance of Cartesians to AGI work is that we can model them as agents experiencing a strong type distinction between 'mind' and 'matter', and an unshakable belief in the metaphysical independence of those two categories; because they're of such different kinds, they can vary independently. So we end up with AI errors that are the opposite of TALE-SPIN's — like an induction procedure that distinguishes gravity's type from embodied characters' types so strongly that it cannot hypothesize that, say, particles underlie or mediate both phenomena.
My claim is that if we plug in 'Alice's sensory data' for 'mind' and 'the stuff Alice hypothesizes as causing the sensory data' for 'matter', then agents that can only model themselves using the cybernetic agent model are Cartesian in the relevant sense.6
The model is Cartesian because the agent and its environment can only interact by communicating. That is, their only way of affecting each other is by trading bits printed to tapes.
If we build an actual AI that believes it's like Alice, it will believe that the environment can't affect it in ways that aren't immediately detectable, can't edit its source code, and can't force it to halt. But that makes the Alice-Everett system almost nothing like a physical agent embedded in a real environment. Under many circumstances, a real AI's environment will alter it directly. E.g., the AI can fall into a volcano. A volcano doesn't harm the agent by feeding unhelpful bits into its environmental sensors. It harms the agent by destroying it.
A more naturalistic model would say: Alice outputs a bit; Everett reads it; and then Everett does whatever the heck he wants. That might be feeding a new bit into Alice. Or it might be vandalizing Alice's work tape, or smashing Alice flat.
A robotic Everett tampering with an agent that mistakenly assumes Cartesianism. A real-world agent’s computational states have physical correlates that can be directly edited by the environment. If the agent can't model such scenarios, its reasoning (and resultant decision-making) will suffer.
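One way to make the contrast concrete is to list the kinds of things Everett can do to Alice and ask which of them a Cartesian hypothesis space can even describe. The sketch below is purely illustrative; `Agent`, the three event functions, and `CARTESIAN_EVENT_TYPES` are all invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    work_tape: list = field(default_factory=lambda: [0, 0, 0, 0])
    input_tape: list = field(default_factory=list)
    running: bool = True


def send_bit(agent, bit):
    """The only interaction the cybernetic agent model allows."""
    agent.input_tape.append(bit)


def vandalize(agent, cell, value):
    """Everett edits Alice's work tape directly; nothing appears on her input tape."""
    agent.work_tape[cell] = value


def smash(agent):
    """Everett destroys Alice outright."""
    agent.work_tape.clear()
    agent.running = False


# A Cartesian Alice only has hypotheses about which bit arrives next.
CARTESIAN_EVENT_TYPES = {"send_bit"}


def cartesian_can_represent(event_name):
    return event_name in CARTESIAN_EVENT_TYPES


alice = Agent()
send_bit(alice, 1)
vandalize(alice, cell=2, value=1)
tape_after_vandalism = list(alice.work_tape)
inputs_after_vandalism = list(alice.input_tape)
smash(alice)
```

After `vandalize`, Alice's memory has changed even though her input tape records nothing unusual. That is exactly the sort of event her model gives her no way to hypothesize.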
A still more naturalistic approach would be to place Alice inside of Everett, as a subsystem. In the real world, agents are surrounded by their environments. The two form a cohesive whole, bound by the same physical laws, freely interacting and commingling.
If Alice only worries about whether Everett will output a 0 or 1 to her sensory tape, then no matter how complex an understanding Alice has of Everett's inner workings, Alice will fundamentally misunderstand the situation she's in. Alice won't be able to represent hypotheses about how, for example, a pill might erase her memories or otherwise modify her source code.
Humans, in contrast, can readily imagine a pill that modifies our memories. It seems childishly easy to hypothesize being changed by avenues other than perceived sensory information. The limitations of the cybernetic agent model aren't immediately obvious, because it isn't easy for us to put ourselves in the shoes of agents with alien blind spots.
There is an agent-environment distinction, but it's a pragmatic and artificial one. The boundary between the part of the world we call 'agent' and the part we call 'not-agent' (= 'environment') is frequently fuzzy and mutable. If we want to build an agent that's robust across many environments and self-modifications, we can't just design a program that excels at predicting sensory sequences generated by Turing machines. We need an agent that can form accurate beliefs about the actual world it lives in, including accurate beliefs about its own physical underpinnings.
From Cartesianism to naturalism
What would a naturalized self-model, a model of the agent as a process embedded in a lawful universe, look like? As a first attempt, one might point to the pictures of Cai in Building Phenomenological Bridges.
Cai has a simple physical model of itself as a black tile at the center of a cellular automaton grid. Cai's phenomenological bridge hypotheses relate its sensory data to surrounding tiles' states.
But this doesn't yet specify a non-Cartesian agent. To treat Cai as a Cartesian, we could view the tiles surrounding Cai as the work tape of Everett, and the dynamics of Cai's environment as Everett's program. (We can also convert Cai's perceptual experiences into a binary sequence on Alice/Cai's input tape, with a translation like 'cyan = 01, magenta = 10, yellow = 11'.)
Alice/Cai as a cybernetic agent in a Turing machine circuit.
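The translation from color percepts to input-tape bits is easy to sketch. Only the three codes given above are used; the code `00` is left unassigned here because the text does not assign it, and the function names are hypothetical.

```python
# Two-bit codes taken from the translation given in the text.
COLOR_TO_BITS = {"cyan": "01", "magenta": "10", "yellow": "11"}
BITS_TO_COLOR = {bits: color for color, bits in COLOR_TO_BITS.items()}


def encode_percepts(colors):
    """Turn a sequence of color percepts into a binary string for the input tape."""
    return "".join(COLOR_TO_BITS[color] for color in colors)


def decode_percepts(bits):
    """Invert encode_percepts; raises KeyError on the unassigned code '00'."""
    if len(bits) % 2 != 0:
        raise ValueError("expected an even number of bits")
    return [BITS_TO_COLOR[bits[i:i + 2]] for i in range(0, len(bits), 2)]


tape = encode_percepts(["cyan", "yellow", "magenta"])
recovered = decode_percepts(tape)
```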
The problem isn't that Cai's world is Turing-computable, of course. It's that if Cai's hypotheses are solely about what sorts of perception-correlated patterns of environmental change can occur, then Cai's models will be Cartesian.
Cai as a Cartesian treats its sensory experiences as though they exist in a separate world.
Cartesian Cai recognizes that its two universes, its sensory experiences and hypothesized environment, can interact. But it thinks they can only do so via a narrow range of stable pathways. No actual agent's mind-matter connections can be that simple and uniform.
If Cai were a robot in a world resembling its model, it would itself be a complex pattern of tiles. To form accurate predictions, it would need to have self-models and bridge hypotheses that were more sophisticated than any I've considered so far. Humans are the same way: No bridge hypothesis explaining the physical conditions for subjective experience will ever fit on a T-shirt.
Cai's world divided up into a 9x9 grid. Cai is the central 3x3 grid. Barely visible: Complex computations like Cai's reasoning are possible in this world because they're implemented by even finer tile patterns at smaller scales.
Changing Cai's tiles' states — from black to white, for example — could have a large impact on its computations, analogous to changing a human brain from solid to gaseous. But if an agent's hypotheses are all shaped like the cybernetic agent model, 'my input/output algorithm is replaced by a dust cloud' won't be in the hypothesis space.
If you programmed something to think like Cartesian Cai, it might decide that its sequence of visual experiences will persist even if the tiles forming its brain completely change state. It wouldn't be able to entertain thoughts like 'if Cai performs self-modification #381, Cai will experience its environment as smells rather than colors' or 'if Cai falls into a volcano, Cai gets destroyed'. No pattern of perceived colors is identical to a perceived smell, or to the absence of perception.
To form naturalistic self-models and world-models, Cai needs hypotheses that look less like conversations between independent programs, and more like worlds in which it is a fairly ordinary subprocess, governed by the same general patterns. It needs to form and privilege physical hypotheses under which it has parts, as well as bridge hypotheses under which those parts correspond in plausible ways to its high-level computational states.
Cai wouldn't need a complete self-model in order to recognize general facts about its subsystems. Suppose, for instance, that Cai has just one sensor, on its left side, and a motor on its right side. Cai might recognize that the motor and sensor regions of its body correspond to its introspectible decisions and perceptions, respectively.
A naturalized agent can recognize that it has physical parts with varying functions. Cai's top and bottom lack sensors and motors altogether, making it clearer that Cai's environment can impact Cai by entirely non-sensory means.
We care about Cai's models because we want to use Cai to modify its environment. For example, we may want Cai to convert as much of its environment as possible into grey tiles. Our interest is then in the algorithm that reliably outputs maximally greyifying actions when handed perceptual data.
If Cai is able to form sophisticated self-models, then Cai can recognize that it's a grey tile maximizer. Since it wants there to be more grey tiles, it also wants to make sure that it continues to exist, provided it believes that it's better than chance at pursuing its goals.
More specifically, Naturalized Cai can recognize that its actions are some black-box function of its perceptual computations. Since it has a bridge hypothesis linking its perceptions to its middle-left tile, it will then reason that it should preserve its sensory hardware. Cai's self-model tells it that if its sensor fails, then its actions will be based on beliefs that are much less correlated with the environment. And its self-model tells it that if its actions are poorly calibrated, then there will be fewer grey tiles in the universe. Which is bad.
A naturalistic version of Cai can reason intelligently from the knowledge that its actions (motor output) depend on a specific part of its body that's responsible for perception (environmental input).
A physical Cai might need to foresee scenarios like 'an anvil crashes into my head and destroys me', and assign probability mass to them. Bridge hypotheses expressive enough to consider that possibility would not just relate experiences to environmental or hardware states; they would also recognize that the agent's experiences can be absent altogether.
An anvil can destroy Cai's perceptual hardware by crashing into it. A Cartesian might not worry about this eventuality, expecting its experience to persist after its body is smashed. But a naturalized reasoner will form hypotheses like the above, on which its sequence of color experiences suddenly terminates when its sensors are destroyed.
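The difference can be sketched as a difference in the return type of a bridge hypothesis. In the toy below, a Cartesian bridge always returns a color, while a naturalized bridge may return `None` to stand for 'no experience at all'. The world representation and both function names are assumptions made for illustration.

```python
from typing import Optional


def cartesian_bridge(world: dict) -> str:
    """Predicts a color no matter what has happened to the agent's own hardware."""
    return world["neighbor_color"]


def naturalized_bridge(world: dict) -> Optional[str]:
    """Predicts a color only while the sensor exists; otherwise predicts no percept."""
    if not world["sensor_intact"]:
        return None  # the sequence of experiences terminates
    return world["neighbor_color"]


before_anvil = {"neighbor_color": "grey", "sensor_intact": True}
after_anvil = {"neighbor_color": "grey", "sensor_intact": False}

cartesian_prediction = cartesian_bridge(after_anvil)
naturalized_prediction = naturalized_bridge(after_anvil)
```

Both bridges agree while the sensor is intact. They diverge only in the scenario the Cartesian cannot represent, which is why ordinary sensory evidence alone may never push the Cartesian toward the naturalized view.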
This point generalizes to other ways Cai might self-modify, and to other things Cai might alter about itself. For example, Cai might learn that other portions of its brain correspond to its hypotheses and desires.
Another very simple model of how different physical structures are associated with different computational patterns.
This allows Cai to recognize that its goals depend on the proper functioning of many of its hardware components. If Cai believes that its actions depend on its brain's goal unit's working a specific way, then it will avoid taking pills that foreseeably change its goal unit. If Cai's causal model tells it that agents like it stop exhibiting future-steering behaviors when they self-modify to have mad priors, then it won't self-modify to acquire mad priors. And so on.
If Cai's motor fails, its effect on the world can change as a result. The same is true if its hardware is modified in ways that change its thoughts, or its preferences (i.e., the thing linking its conclusions to its motor).
Once Cai recognizes that its brain needs to work in a very specific way for its goals to be achieved, its preferences can take its physical state into account in sensible ways, without our needing to hand-code Cai at the outset to have the right beliefs or preferences over every individual thing that could change in its brain.
Just the opposite is true for Cartesians. Since they can't form hypotheses like 'my tape heads will stop computing digits if I disassemble them', they can only intelligently navigate such risks if they've been hand-coded in advance to avoid perceptual experiences the programmer thought would correlate with such dangers.
In other words, even though all of this is still highly informal, there's already some cause to think that a reasoning pattern like Naturalized Cai can generalize in ways that Cartesians can't. The programmers don't need to know everything about Cai's physical state, or anticipate everything about what future changes Cai might undergo, if Cai's epistemology allows it to easily form accurate reductive beliefs and behave accordingly. An agent like this might be adaptive and self-correcting in very novel circumstances, leaving more wiggle room for programmers to make human mistakes.
Bridging maps of worlds and maps of minds
Solomonoff-style dualists have alien blind spots that lead them to neglect the possibility that some hardware state is equivalent to some introspected computation '000110'. TALE-SPIN-like AIs, on the other hand, have blind spots that lead to mistakes like trying to figure out the angular momentum of '000110'.
A naturalized agent doesn't try to do away with the data/hypothesis type distinction and acquire a typology as simple as TALE-SPIN's. Rather, it tries to tightly interconnect its types using bridges. Naturalizing induction is about combining the dualist's useful map/territory distinction with a more sophisticated metaphysical monism than TALE-SPIN exhibits, resulting in a reductive monist AI.7
Alice's simple fixed bridge axiom, {environmental output 0 ↔ perceptual input 0, environmental output 1 ↔ perceptual input 1}, is inadequate for physically embodied agents. And the problem isn't just that Alice lacks other bridge rules and can't weigh evidence for or against each one. Bridge hypotheses are a step in the right direction, but they need to be diverse enough to express a variety of correlations between the agent's sensory experiences and the physical world, and they need a sensible prior. An agent that only considers bridge hypotheses compatible with the cybernetic agent model will falter whenever it and the environment interact in ways that look nothing like exchanging sensory bits.
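To illustrate why the prior over bridge hypotheses matters, here is a toy in which Alice is uncertain about both the world and the bridge. Everything specific in it is an assumption made for the example: the two world hypotheses, the two bridges, and especially the prior weights, which are arbitrary.

```python
from fractions import Fraction
from itertools import product

# World hypotheses: which bit the environment physically emits.
WORLDS = {"emits_0": 0, "emits_1": 1}
WORLD_PRIOR = {"emits_0": Fraction(1, 2), "emits_1": Fraction(1, 2)}

# Bridge hypotheses: how the physical bit relates to the perceived bit.
BRIDGES = {
    "identity": lambda bit: bit,      # Alice's fixed axiom
    "inverted": lambda bit: 1 - bit,  # one alternative bridge
}
BRIDGE_PRIOR = {"identity": Fraction(3, 4), "inverted": Fraction(1, 4)}  # arbitrary weights


def posterior_given_percept(percept):
    """Update over (world, bridge) pairs after Alice perceives a single bit."""
    weights = {}
    for world, bridge in product(WORLDS, BRIDGES):
        predicted = BRIDGES[bridge](WORLDS[world])
        if predicted == percept:
            weights[(world, bridge)] = WORLD_PRIOR[world] * BRIDGE_PRIOR[bridge]
    total = sum(weights.values())
    return {pair: weight / total for pair, weight in weights.items()}


posterior = posterior_given_percept(0)
```

Perceiving 0 is equally well predicted by 'the world emits 0 and the bridge is the identity' and by 'the world emits 1 and the bridge is inverted'. The data cannot separate the two, so the posterior simply inherits the ratio set by the bridge prior.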
With the help of an inductive algorithm that uses bridge hypotheses to relate sensory data to a continuous physical universe, we can avoid making our AIs Cartesians. This will make their epistemologies much more secure. It will also make it possible for them to want things to be true about the physical universe, not just about the particular sensory experiences they encounter. Actually writing a program that does all this is an OPFAI. Even formalizing how bridge hypotheses ought to work in principle is an OPFAI.
In my next post, I'll move away from toy models and discuss AIXI, Hutter's optimality definition for cybernetic agents. In asking whether the best Cartesian can overcome the difficulties I've described, we'll get a clearer sense of why Solomonoff inductors aren't reflective and reductive enough to predict drastic changes to their sense-input-to-motor-output relation — and why they can't be that reflective and reductive — and why this matters.
Notes
1 Meehan (1977). Colin Allen first introduced me to this story. Dennett discusses it as well.
2 E.g., Durand, Muchnik, Ushakov & Vereshchagin (2004), Epstein & Betke (2011), Legg & Veness (2013), Solomonoff (2011). Hutter (2005) uses the term "cybernetic agent model" to emphasize the parallelism between his Turing machine circuit and control theory's cybernetic systems.
3 One simple representation would be: Program Alice to write to her work tape, on round one, 0010 (standing for 'if I output 0, Everett outputs 0; if I output 1, Everett outputs 0'). Ditto for the other three hypotheses, 0111, 0011, and 0110. Then write the hypothesis' probability in binary (initially 25%, represented '11001') to the right of each, and program Alice to edit this number as she receives new evidence. Since the first and third digit stay the same, we can simplify the hypotheses' encoding to 00, 11, 01, 10. Indeed, if the hypotheses remain the same over time there's no reason to visibly distinguish them in the work tape at all, when we can instead just program Alice to use the left-to-right ordering of the four probabilities to distinguish the hypotheses.
4 To the extent our universe perfectly resembles any mathematical structure, it's much more likely to do so at the level of gluons and mesons than at the level of medium-sized dry goods. The resemblance of apples to natural numbers is much more approximate. Two apples and three apples generally make five apples, but when you start cutting up or pulverizing or genetically altering apples, you may find that other mathematical models do a superior job of predicting the apples' behavior. It seems likely that the only perfectly general and faithful mathematical representation of apples will be some drastically large and unwieldy physics equation.
Ditto for machines. It's sometimes possible to build a physical machine that closely mimics a given Turing machine — but only 'closely', as Turing machines have unboundedly large tapes. And although any halting Turing machine can in principle be simulated with a bounded tape (Cockshott & Michaelson (2007)), nearly all Turing machine programs are too large to even be approximated by any physical process.
All physical machines structurally resemble Turing machines in ways that allow us to draw productive inferences from the one group to the other. See Piccinini's (2011) discussion of the physical Church-Turing thesis. But, for all that, the concrete machine and the abstract one remain distinct.
5 Descartes (1641): "[A]lthough I certainly do possess a body with which I am very closely conjoined; nevertheless, because, on the one hand, I have a clear and distinct idea of myself, in as far as I am only a thinking and unextended thing, and as, on the other hand, I possess a distinct idea of body, in as far as it is only an extended and unthinking thing, it is certain that I (that is, my mind, by which I am what I am) am entirely and truly distinct from my body, and may exist without it."
From this it’s clear that Descartes also believed that the mind can exist without the body. This interestingly parallels the anvil problem, which I'll discuss more in my next post. However, I don't build immortality into my definition of 'Cartesianism'. Not all agents that act as though there is a Cartesian barrier between their thoughts and the world think that their experiences are future-eternal. I'm taking care not to conflate Cartesianism with the anvil problem because the formalism I'll discuss next time, AIXI, does face both of them. Though the problems are logically distinct, it's true that a naturalized reasoning method would be much less likely to face the anvil problem.
6 This isn't to say that a Solomonoff inductor would need to be conscious in anything like the way humans are conscious. It can be fruitful to point to similarities between the reasoning patterns of humans and unconscious processes. Indeed, this already happens when we speak of unconscious mental processes within humans.
Parting ways with Descartes (cf. Kirk (2012)), many present-day dualists would in fact go even further than reductionists in allowing for structural similarities between conscious and unconscious processes, treating all cognitive or functional mental states as (in theory) realizable without consciousness. E.g., Chalmers (1996): "Although consciousness is a feature of the world that we would not predict from the physical facts, the things we say about consciousness are a garden-variety cognitive phenomenon. Somebody who knew enough about cognitive structure would immediately be able to predict the likelihood of utterances such as 'I feel conscious, in a way that no physical object could be,' or even Descartes's 'Cogito ergo sum.' In principle, some reductive explanation in terms of internal processes should render claims about consciousness no more deeply surprising than any other aspect of behavior."
7 And since we happen to live in a world made of physics, the kind of monist we want in practice is a reductive physicalist AI. We want a 'physicalist' as opposed to a reductive monist that thinks everything is made of monads, or abstract objects, or morality fluid, or what-have-you.
References
∙ Chalmers (1996). The Conscious Mind: In Search of a Fundamental Theory. Oxford University Press.
∙ Cockshott & Michaelson (2007). Are there new models of computation? Reply to Wegner and Eberbach. The Computer Journal, 50: 232-247.
∙ Descartes (1641). Meditations on first philosophy, in which the existence of God and the immortality of the soul are demonstrated.
∙ Durand, Muchnik, Ushakov & Vereshchagin (2004). Ecological Turing machines. Lecture Notes in Computer Science, 3142: 457-468.
∙ Epstein & Betke (2011). An information-theoretic representation of agent dynamics as set intersections. Lecture Notes in Computer Science, 6830: 72-81.
∙ Hutter (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer.
∙ Kirk (2012). Zombies. In Zalta (ed.), The Stanford Encyclopedia of Philosophy.
∙ Legg & Veness (2013). An approximation of the Universal Intelligence Measure. Lecture Notes in Computer Science, 7070: 236-249.
∙ Meehan (1977). TALE-SPIN, an interactive program that writes stories. Proceedings of the 5th International Joint Conference on Artificial Intelligence: 91-98.
∙ Piccinini (2011). The physical Church-Turing thesis: Modest or bold? British Journal for the Philosophy of Science, 62: 733-769.
∙ Russell & Norvig (2010). Artificial Intelligence: A Modern Approach. Prentice Hall.
∙ Solomonoff (2011). Algorithmic probability — its discovery — its properties and application to Strong AI. In Zenil (ed.), Randomness Through Computation: Some Answers, More Questions (pp. 149-157).
That's right, if you mean 'representations exist, so they must be implemented in physical systems'.
But the Cartesian agrees with 'the map is part of the territory' on a different interpretation. She thinks the mental and physical worlds both exist (as distinct 'countries' in a larger territory). Her error is just to think that it's impossible to redescribe the mental parts of the universe in physical terms.
An attempt at a Cartesian seed AI would probably just break, unless it overcame its Cartesianness by some mostly autonomous evolutionary algorithm for generating successful successor-agents. A human programmer could try to improve it over time, but the programmer wouldn't be able to rely much on the AI's own intelligence (because self-modification is precisely where the AI has no defined hypotheses), so I'd expect the process to become increasingly difficult and slow and ineffective as we reached the limits of human understanding.
I think the main worry with Cartesians isn't that they're dumb-ish, so they might become a dangerously unpredictable human-level AI or a bumbling superintelligence. The main worry is that they're so dumb that they'll never coalesce into a working general intelligence of any kind. Then, while the build-a-clean-AI people (who are trying to design simple, transparent AGIs with stable, defined goals) are busy wasting their time in the blind alley of Cartesian architectures, some random build-an-ugly-AI project will pop up out of left field and eat us.
Build-an-ugly-AI people care about sloppy, quick-and-dirty search processes, not so much about AIXI or Solomonoff. So the primary danger of Cartesians isn't that they're Unfriendly; it's that they're shiny objects distracting a lot of the people with the right tastes and competencies for making progress toward Friendliness.
The bootstrapping idea is probably a good one: There's no way we'll succeed at building a perfect FAI in one go, so the trick will be to cut corners in all the ways that can get fixed by the system, and that don't make the system unsafe in the interim. I'm not sure Cartesianism is the right sort of corner to cut. Yes, the AI won't care about self-preservation; but it also won't care about any other interim values we'd like to program it with, except ones that amount to patterns of sensory experience for the AI.
The "build a clean Cartesian AI" folks, Schmidhuber and Hutter, are much closer to "describe how to build a clean naturalistic AI given unlimited computing power" than, say, Lenat's Eurisko is to AIXI. It's just that AIXI won't actually work as a conceptual foundation for the reasons given, nay it is Solomonoff induction itself which will not work as a conceptual foundation, hence considering naturalized induction as part of the work to be done along the way to OPFAI. The worry from Eurisko-style AI is not that it will be Cartesian an...