For utility function maximisers, the AIXI is the theoretically best agent there is, more successful at reaching its goals (up to a finite constant) than any other agent (Hutter, 2005).
False. AIXI as defined can maximize only a sensory reward channel, not a utility function over an environmental model with a known ontology. As Dewey demonstrates, this problem is not easy to fix; AIXI can have utility functions over (functions of) sensory data, but its environment-predictors vary freely in ontology via Solomonoff induction, so it can't have a predefined utility function over the future of its environment without major rewriting.
AIXI is the optimal function-of-sense-data maximizer for Cartesian agents with unbounded computing power and access to a halting oracle, in a computable environment as separated from AIXI by the Cartesian boundary, given that your prior belief about the possible environments matches AIXI's Solomonoff prior.
Here's an attack on section 4.1. Consider the possibility that "philosophical ability" (something like the ability to solve confusing problems that can't be easily formalized) is needed to self-improve beyond some threshold of intelligence, and this same "philosophical ability" also reliably causes one to decide that some particular goal G is the right goal to have, and therefore beyond some threshold of intelligence all agents have goal G. To deny this possibility seems to require more meta-philosophical knowledge than we currently possess.
My own intuition seems to assign a probability to it that is greater than "very unlikely"
Why? You're making an extraordinary claim. Something - undefined - called philosophical ability is needed (for some reason) to self-improve and, for some extraordinary and unexplained reason, this ability causes an agent to have a goal G. Where goal G is similarly undefined.
Let me paraphrase: Consider the possibility that "mathematical ability" is needed to self-improve beyond some threshold of intelligence, and this same "mathematical ability" also reliably causes one to decide that some particular goal G is the right goal to have, and therefore beyond some threshold of intelligence all agents have goal G.
Why is this different? What in your intuition is doing the work "philosophical ability" -> same goals? If we call it something else than "philosophical ability", would you have the same intuition? What raises the status of that implication to the level that it's worthy of consideration?
I'm asking seriously - this is the bit in the argument I consistently fail to understand, the bit that never makes sense to me, but whose outline I can feel in most counterarguments.
I think we don't just lack introspective access to our goals, but can't be said to have goals at all (in the sense of preference ordering over some well defined ontology, attached to some decision theory that we're actually running). For the kind of pseudo-goals we have (behavior tendencies and semantically unclear values expressed in natural language), they don't seem to have the motivational strength to make us think "I should keep my goal G1 instead of avoiding arbitrariness", nor is it clear what it would mean to "keep" such pseudo-goals as one self-improves.
What if it's the case that evolution always or almost always produces agents like us, so the only way they can get real goals in the first place is via philosophy?
‘maximising paperclips’
Since you want a non-LWian audience, make that “maximising the number of paperclips in the universe”, otherwise the meaning might be unclear.
Couple of comments:
We will also take the materialistic position that humans themselves can be viewed as non-deterministic algorithms[2]
I'm not a philosopher of mind but I think "materialistic" might be a misleading word here, being too similar to "materialist". Wouldn't "computationalistic" or maybe "functionalistic" be more precise? ("-istic" as opposed to "-ist" to avoid connotational baggage.) Also it's ambiguous whether footnote two is a stipulation for interpreting the paper or a brief description of the consensus view in physics.
At various points you make somewhat bold philosophical or conceptual claims based off of speculative mathematical formalisms. Even though I'm familiar with and have much respect for the cited mathematics, this still makes me nervous, because when I read philosophical papers that take such an approach my prior is high for subtle or subtly unjustified equivocation; I'd be even more suspicious were I a philosopher who wasn't already familiar with universal AI, which isn't a well-known or widely respected academic subfield. The necessity of finding clearly trustworthy analogies between mathematical and phenomena...
Just some minor text corrections for you:
From 3.1
The utility function picture of a rational agent maps perfectly onto the Orthogonality thesis: here have the goal structure, the utility fu...
...could be "here we have the...
From 3.2
Human minds remain our only real model of general intelligence, and this strongly direct and informs...
this strongly directs and informs...
From 4.1
...“All human-designed rational beings would follow the same morality (or one of small sets of moralities)” sound plausible; in contract “All human-designed superefficient
I like the paper, but am wondering how (or whether) it applies to TDT and acausal trading. Doesn't the trading imply a form of convergence theorem among very powerful TDT agents (they should converge on an average utility function constructed across all powerful TDT agents in logical space)?
Or have I missed something here? (I've been looking around on Less Wrong for a good post on acausal trading, and am finding bits and pieces, but no overall account.)
If a goal is a preference order over world states, then there are uncountably many of them, so any countable means of expression can only express a vanishingly small minority of them. Trivially (as Bostrom points out) a goal system can be too complex for an agent of a given intelligence. It therefore seems to me that what we're really defending is an Upscalability thesis: if an agent A with goal G is possible, then a significantly more intelligent A++ with goal G is possible.
...Thus to deny the Orthogonality thesis is to assert that there is a goal system G, such that, among other things:
(1) There cannot exist any efficient real-world algorithm with goal G.
(2) If a being with arbitrarily high resources, intelligence, time and goal G, were to try design an efficient real-world algorithm with the same goal, it must fail.
(3) If a human society were highly motivated to design an efficient real-world algorithm with goal G, and were given a million years to do so along with huge amounts of resources, training and knowledge about AI, i
our race spans foot-fetishists, religious saints, serial killers, instinctive accountants, role-players, self-cannibals, firefighters and conceptual artists. The autistic, those with exceptional social skills, the obsessive compulsive and some with split-brains. Beings of great empathy and the many who used to enjoy torture and executions as public spectacles
Some of these are not really terminal goals. A fair number of people with strong sexual fetishes would be perfectly happy without them, and in more extreme cases really would prefer not to have them...
A fair number of people with strong sexual fetishes
there are some serial killers
It was an existence argument. That some more people aren't examples doesn't really change matters, does it?
to avoid worrying about robot bodies and such-like, we may restrict the list of tasks to those accomplishable over the internet
Many of the tasks I accomplish over the internet require there to be people who know me in real life, some require me to have a body and voice which looks and sounds human (in photos and videos at least) and a few require me to be enrolled in my university, have a bank account, be a citizen of my country, vel sim. (Adding “anonymously” and “for free” ought to fix that.)
I don't see why there are only two counter-theses in section 4. Or rather, it looks as though you want a too-strong claim - in order to criticise it.
Try a "partial convergence" thesis instead. For instance, the claim that goals that are the product of cultural or organic evolution tend to maximise entropy and feature universal instrumental values.
Minor text correction:
"dedicated committee of human-level AIs dedicated" repeats the same adjective in a small span.
More wide-ranging:
Perhaps the paper would be stronger if it explained why philosophers might feel that convergence is probable. For example, in their experience, human philosophers / philosophies converge.
In a society, where the members are similar to one another, and much less powerful than the society as a whole, the morality endorsed by the society might be based on the memes that can spread successfully. That is, a meme like '...
Copying from a comment I already made cos no-one responded last time:
I'm not confident about any of the below, so please add cautions in the text as appropriate.
The orthogonality thesis is both stronger and weaker than we need. It suffices to point out that neither we nor Ben Goertzel know anything useful or relevant about what goals are compatible with very large amounts of optimizing power, and so we have no reason to suppose that superoptimization by itself points either towards or away from things we value. By creating an "orthogonality thesis"...
I don't think section 4.1 defeats your wording of your Convergence Thesis.
Convergence: all human-designed superintelligences would have one of a small set of goals.
The way you have worded this, I read it as trivially true. The set of human designed superintelligences is necessarily a tiny subset of the space of all superintelligences, and thus the set of dependent goals of human-designed superintelligences is a tiny subset of the space of all goals.
Much depends on your usage of 'small'. Small relative to what?
I think you should clarify notions of conver...
Who is your target audience? Can you pretend to be the actual person you are trying to convince and do your absolute best to demolish the arguments presented in this paper? (You can find their arguments in their publications and apply them to your paper.) And no counter-objections until you have finished writing what essentially is a referee report. If you need some extra motivation, pretend that you are being paid $100 for each argument that convinces the rest of the audience and $1000 for each argument that convinces the paper author. When done, post the referee report here, and people will tell you whether you did a good job.
Can you pretend to be the actual person you are trying to convince and do your absolute best to demolish the arguments presented in this paper?
No, I cannot. I've read the various papers, and they all orbit around an implicit and often unstated moral realism. I've also debated philosophers on this, and the same issue rears its head - I can counter their arguments, but their opinions don't shift. There is an implicit moral realism that does not make any sense to me, and the more I analyse it, the less sense it makes, and the less convincing it becomes. Every time a philosopher has encouraged me to read a particular work, it's made me find their moral realism less likely, because the arguments are always weak.
I can't really put myself in their shoes to successfully argue their position (which I could do with theism, incidentally). I've tried and failed.
If someone can help me with this, I'd be most grateful. Why does "for reasons we don't know, any being will come to share and follow specific moral principles (but we don't know what they are)", rise to seem plausible?
All of these seem extraordinarily strong claims to make!
A critic might respond: they are strong claims to make about an arbitrarily chosen individual goal system, but asserting that there exists some goal system fulfilling the conditions is a massive disjunction, and so is weaker than it appears from the list of conditions.
How about this: general-purpose problem solving is an altogether different problem from implementing any form of real-world motivation, and is likely to come separately from it (case in point: try to make AIXI maximize paperclips without it also searching for a way to show itself paperclip porn; the problem appears entirely unsolvable).
It seems that for danger of the AI you need some peculiar window into which the orthogonality must fly - too much orthogonality, no risk, too little, no FAI/UFAI distinction.
Strong orthogonality hypothesis is definitely wrong - not being openly hostile to most other agents has enormous instrumental advantage. That's what's holding modern human societies together - agents like humans, corporations, states etc. - have mostly managed to keep their hostility low. Those that are particularly belligerent (and historical median has been far more belligerent towards strangers than all but the most extreme cases today) don't do well by instrumental standards at all.
Of course you can make a complicated argument why it doesn't matter (so...
It's so much easier to just change your moral reasoning than to re-engineer the entirety of human intelligence. How can artificial intelligence experts be so daft?
Note: informally, the point of this paper is to argue against the instinctive "if the AI were so smart, it would figure out the right morality and everything will be fine." It is targeted mainly at philosophers, not at AI programmers. The paper succeeds if it forces proponents of that position to put forward positive arguments, rather than just assuming it as the default position. This post is presented as an academic paper, and will hopefully be published, so any comments and advice are welcome, including stylistic ones! Also let me know if I've forgotten you in the acknowledgements.
Abstract: In his paper “The Superintelligent Will”, Nick Bostrom formalised the Orthogonality thesis: the idea that the final goals and intelligence levels of agents are independent of each other. This paper presents arguments for a (slightly narrower) version of the thesis, proceeding through three steps. First it shows that superintelligent agents with essentially arbitrary goals can exist. Then it argues that if humans are capable of building human-level artificial intelligences, we can build them with any goal. Finally it shows that the same result holds for any superintelligent agent we could directly or indirectly build. This result is relevant for arguments about the potential motivations of future agents.
1 The Orthogonality thesis
The Orthogonality thesis, due to Nick Bostrom (Bostrom, 2011), states that:
It is analogous to Hume’s thesis about the independence of reason and morality (Hume, 1739), but applied more narrowly, using the normatively thinner concepts ‘intelligence’ and ‘final goals’ rather than ‘reason’ and ‘morality’.
But even ‘intelligence’, as generally used, has too many connotations. A better term would be efficiency, or instrumental rationality, or the ability to effectively solve problems given limited knowledge and resources (Wang, 2011). Nevertheless, we will be sticking with terminology such as ‘intelligent agent’, ‘artificial intelligence’ or ‘superintelligence’, as they are well established, but using them synonymously with ‘efficient agent’, ‘artificial efficiency’ and ‘superefficient algorithm’. The relevant criterion is whether the agent can effectively achieve its goals in general situations, not whether its inner process matches up with a particular definition of what intelligence is.
Thus an artificial intelligence (AI) is an artificial algorithm, deterministic or probabilistic, implemented on some device, that demonstrates an ability to achieve goals in varied and general situations[1]. We don’t assume that it need be a computer program, or a well laid-out algorithm with clear loops and structures – artificial neural networks or evolved genetic algorithms certainly qualify.
A human-level AI is defined to be an AI that can successfully accomplish any task at least as well as an average human would (to avoid worrying about robot bodies and such-like, we may restrict the list of tasks to those accomplishable over the internet). Thus we would expect the AI to hold conversations about Paris Hilton’s sex life, to compose ironic limericks, to shop for the best deal on Halloween costumes and to debate the proper role of religion in politics, at least as well as an average human would.
A superhuman AI is similarly defined as an AI that would exceed the ability of the best human in all (or almost all) tasks. It would do the best research, write the most successful novels, run companies and motivate employees better than anyone else. In areas where there may not be clear scales (what’s the world’s best artwork?) we would expect a majority of the human population to agree the AI’s work is among the very best.
Nick Bostrom’s paper argued that the Orthogonality thesis does not depend on the Humean theory of motivation. This paper will directly present arguments in its favour. We will assume throughout that human-level AIs (or at least human-comparable AIs) are possible (if not, the thesis is void of useful content). We will also take the materialistic position that humans themselves can be viewed as non-deterministic algorithms[2]: this is not vital to the paper, but is useful for comparison of goals between various types of agents. We will do the same with entities such as committees of humans, institutions or corporations, if these can be considered to be acting in an agent-like way.
1.1 Qualifying the Orthogonality thesis
The Orthogonality thesis, taken literally, is false. Some motivations are mathematically incompatible with changes in intelligence (“I want to prove the Gödel statement for the being I would be if I were more intelligent”). Some goals specifically refer to the intelligence of the agent, directly (“I want to be an idiot!”) or indirectly (“I want to impress people who want me to be an idiot!”). Though we could make a case that an agent wanting to be an idiot could initially be of any intelligence level, it won’t stay there long, and it’s hard to see how an agent with that goal could have become intelligent in the first place. So we will exclude from consideration those goals that intrinsically refer to the intelligence level of the agent.
We will also exclude goals that are so complex or hard to describe that the complexity of the goal becomes crippling for the agent. If the agent’s goal takes five planets’ worth of material to describe, or if it takes the agent five years each time it checks its goal, it’s obvious that that agent can’t function as an intelligent being on any reasonable scale.
Further we will not try to show that intelligence and final goals can vary freely, in any dynamical sense (it could be quite hard to define this varying). Instead we will look at the thesis as talking about possible states: that there exist agents of all levels of intelligence with any given goals. Since it’s always possible to make an agent stupider or less efficient, what we are really claiming is that there exist high-intelligence agents with any given goal. Thus the restricted Orthogonality thesis that we will be discussing is:
2 Orthogonality for theoretic agents
If we were to step back for a moment and consider, in our mind’s eye, the space of every possible algorithm, peering into their goal systems and teasing out some measure of their relative intelligences, would we expect the Orthogonality thesis to hold? Since we are not worrying about practicality or constructability, all that we would require is that for any given goal system, there exists a theoretically implementable algorithm of extremely high intelligence.
At this level of abstraction, we can consider any goal to be equivalent to maximising a utility function. It is generally not that hard to translate given goals into utilities (many deontological systems are equivalent to maximising the expected utility of a function that gives 1 if the agent always makes the correct decision and 0 otherwise), and any agent making a finite number of decisions can always be seen as maximising a certain utility function.
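The deontological translation mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a decision history is a list of action labels and that the rule is a predicate on single actions; the names `deontological_utility` and `never_lie` are hypothetical and do not come from the cited literature.

```python
from typing import Callable, Sequence

Action = str
History = Sequence[Action]


def deontological_utility(
    is_permitted: Callable[[Action], bool],
) -> Callable[[History], int]:
    """Turn a rule into a utility function over whole decision histories.

    The returned function gives 1 if every decision obeyed the rule
    and 0 otherwise (illustrative assumption: rules judge single actions).
    """
    def utility(history: History) -> int:
        return 1 if all(is_permitted(action) for action in history) else 0

    return utility


# Hypothetical rule: never lie.
never_lie = deontological_utility(lambda action: action != "lie")

print(never_lie(["help", "trade", "rest"]))  # 1
print(never_lie(["help", "lie", "rest"]))    # 0
```

An agent that maximises the expected value of `never_lie` behaves, in this toy setting, like an agent that follows the rule: any single violation forfeits all utility.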
For utility function maximisers, the AIXI is the theoretically best agent there is, more successful at reaching its goals (up to a finite constant) than any other agent (Hutter, 2005). AIXI itself is incomputable, but there are computable variants such as AIXItl or Gödel machines (Schmidhuber, 2007) that accomplish comparable levels of efficiency. These methods work for whatever utility function is plugged into them. Thus in the extreme theoretical case, the Orthogonality thesis seems trivially true.
There is only one problem with these agents: they require incredibly large amounts of computing resources to work. Let us step down from the theoretical pinnacle and require that these agents could actually exist in our world (still not requiring that we be able or likely to build them).
An interesting thought experiment occurs here. We could imagine an AIXI-like super-agent, with all its resources, that is tasked to design and train an AI that could exist in our world, and that would accomplish the super-agent’s goals. Using its own vast intelligence, the super-agent would therefore design a constrained agent maximally effective at accomplishing those goals in our world. Then this agent would be the high-intelligence real-world agent we are looking for. It doesn’t matter that this is a thought experiment – if the super-agent can succeed in the thought experiment, then the trained AI can exist in our world.
This argument generalises to other ways of producing the AI. Thus to deny the Orthogonality thesis is to assert that there is a goal system G, such that, among other things:
All of these seem extraordinarily strong claims to make! The last claims all derive from the first, and merely serve to illustrate how strong the first claim actually is. Thus until such time as someone comes up with such a G and strong arguments for why it must fulfil these conditions, we can assume the Orthogonality statement established in the theoretical case.
3 Orthogonality for human-level AIs
Of course, even if efficient agents could exist for all these goals, that doesn’t mean that we could ever build them, even if we could build AIs. In this section, we’ll look at the grounds for assuming the Orthogonality thesis holds for human-level agents. Since intelligence isn’t varying much, the thesis becomes simply:
So, is this true? The arguments in this section are generally independent of each other, and can be summarised as:
3.1 Utility functions
The utility function picture of a rational agent maps perfectly onto the Orthogonality thesis: here we have the goal structure, the utility function, packaged neatly and separately from the intelligence module (whatever part of the machine calculates which actions maximise expected utility). Demonstrating the Orthogonality thesis is as simple as saying that the utility function can be replaced with another. However, many putative agent designs are not utility function based, such as neural networks, genetic algorithms, or humans. Nor do we have the extreme calculating ability that we had in the purely theoretic case to transform any goals into utility functions. So from now on we will consider that our agents are not expected utility maximisers with clear and separate utility functions.
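To make the separation between goal and intelligence module concrete, here is a toy sketch. The world model, the action names and both utility functions are invented for illustration; the only point is that `best_action` never needs to change when the utility function is swapped.

```python
from typing import Callable, Dict, Hashable

Outcome = Hashable
# Toy world model (an assumption): each action maps to a probability
# distribution over outcomes.
WorldModel = Dict[str, Dict[Outcome, float]]


def best_action(model: WorldModel, utility: Callable[[Outcome], float]) -> str:
    """The 'intelligence module': pick the action with highest expected utility."""
    def expected_utility(action: str) -> float:
        return sum(p * utility(outcome) for outcome, p in model[action].items())

    return max(model, key=expected_utility)


model: WorldModel = {
    "build_factory": {"paperclips": 0.9, "lives_saved": 0.1},
    "fund_clinic": {"paperclips": 0.1, "lives_saved": 0.9},
}


def paperclip_utility(outcome: Outcome) -> float:
    return 1.0 if outcome == "paperclips" else 0.0


def charity_utility(outcome: Outcome) -> float:
    return 1.0 if outcome == "lives_saved" else 0.0


print(best_action(model, paperclip_utility))  # build_factory
print(best_action(model, charity_utility))    # fund_clinic
```

The same planner serves both goals. Agent designs without such a clean interface, which the rest of this section considers, do not allow this one-line substitution.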
3.2 The span of human motivations
It seems a reasonable assumption that if there exists a human being with particular goals, then we can construct a human-level AI with similar goals. This is immediately the case if the AI was a whole brain emulation/upload (Sandberg & Bostrom, 2008), a digital copy of a specific human mind. Even for more general agents, such as evolved agents, this remains a reasonable thesis. For a start, we know that real-world evolution has produced us, so constructing human-like agents that way is certainly possible. Human minds remain our only real model of general intelligence, and this strongly directs and informs our AI designs, which are likely to be as human-similar as we can make them. Similarly, human goals are the easiest goals for us to understand, hence the easiest to try and implement in AI. Hence it seems likely that we could implement most human goals in the first generation of human-level AIs.
So how wide is the space of human motivations[3]? Our race spans foot-fetishists, religious saints, serial killers, instinctive accountants, role-players, self-cannibals, firefighters and conceptual artists. The autistic, those with exceptional social skills, the obsessive compulsive and some with split-brains. Beings of great empathy and the many who used to enjoy torture and executions as public spectacles[4]. It is evident that the space of possible human motivations is vast[5]. For any desire, any particular goal, no matter how niche[6], pathological, bizarre or extreme, as long as there is a single human who ever had it, we could build and run an AI with the same goal.
But with AIs we can go even further. We could take any of these goals as a starting point, make them malleable (as goals are in humans), and push them further out. We could provide the AIs with specific reinforcements to push their goals in extreme directions (reward the saint for ever-more saintly behaviour). If the agents are fast enough, we could run whole societies of them with huge varieties of evolutionary or social pressures, to further explore the goal-space.
We may also be able to do surgery directly on their goals, to introduce yet more variety. For example, we could take a dedicated utilitarian charity worker obsessed with saving lives in poorer countries (but who doesn’t interact, or want to interact, directly with those saved), and replace ‘saving lives’ with ‘maximising paperclips’ or any similar abstract goal. This is more speculative, of course – but there are other ways of getting similar results.
3.3 Interim goals as terminal goals
If someone were to hold a gun to your head, they could make you do almost anything. Certainly there are people who, with a gun at their head, would be willing to do almost anything. A distinction is generally made between interim goals and terminal goals, with the former being seen as simply paths to the latter, and interchangeable with other plausible paths. The gun to your head disrupts the balance: your terminal goal is simply not to get shot, while your interim goals become what the gun holder wants them to be, and you put a great amount of effort into accomplishing the minute details of these interim goals. Note that the gun has not changed your level of intelligence or ability.
This is relevant because interim goals seem to be far more varied in humans than terminal goals. One can have interim goals of filling papers, solving equations, walking dogs, making money, pushing buttons in various sequences, opening doors, enhancing shareholder value, assembling cars, bombing villages or putting sharks into tanks. Or simply doing whatever the guy with a gun at our head orders us to do. If we could accept human interim goals as AI terminal goals, we would extend the space of goals quite dramatically.
To do this, we would want to put the threatened agent and the gun wielder together into the same AI. Algorithmically there is nothing extraordinary about this: certain subroutines have certain behaviours depending on the outputs of other subroutines. The ‘gun wielder’ need not be particularly intelligent: it simply needs to be able to establish whether its goals are being met. If, for instance, those goals are given by a utility function, then all that is required is an automated system that measures progress toward increasing utility and punishes (or erases) the rest of the AI if there is none. The ‘rest of the AI’ is just required to be a human-level AI which would be susceptible to this kind of pressure. Note that we do not require that it even be close to human in any way, simply that it place the highest value on self-preservation (or on some similar small goal that the ‘gun wielder’ would have power over).
For humans, another similar model is that of a job in a corporation or bureaucracy: in order to achieve the money required for their terminal goals, some humans are willing to perform extreme tasks (organising the logistics of genocides, weapon design, writing long detailed press releases they don’t agree with at all). Again, if the corporation-employee relationship can be captured in a single algorithm, this would generate an intelligent AI whose goal is anything measurable by the ‘corporation’. The ‘money’ could simply be an internal reward channel, perfectly aligning the incentives.
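The structure described in this paragraph can be sketched as two cooperating subroutines. Everything here is a toy assumption: the world is a list of past actions, the ‘gun wielder’ (`make_overseer`) only compares a utility measure before and after each step, and the subagent’s self-preservation is modelled as simply choosing an action the overseer will approve.

```python
from typing import Callable, List

World = List[str]


def make_overseer(
    utility: Callable[[World], float],
) -> Callable[[World, World], bool]:
    """The 'gun wielder': approves a step only if measured utility increased."""
    def approves(before: World, after: World) -> bool:
        return utility(after) > utility(before)

    return approves


def subagent_step(
    world: World,
    actions: List[str],
    approves: Callable[[World, World], bool],
) -> World:
    """The threatened subagent: its only terminal goal is to avoid punishment,
    so it takes the first action it predicts the overseer will approve."""
    for action in actions:
        candidate = world + [action]
        if approves(world, candidate):
            return candidate
    raise RuntimeError("subagent erased: no approved action available")


def count_paperclips(world: World) -> float:
    return float(world.count("make_paperclip"))


approves = make_overseer(count_paperclips)

world: World = []
for _ in range(3):
    world = subagent_step(world, ["write_poem", "make_paperclip"], approves)

print(world)  # ['make_paperclip', 'make_paperclip', 'make_paperclip']
```

Taken as a whole, the combined algorithm behaves as a paperclip maximiser, even though neither part on its own has that as a sophisticated goal: the overseer merely measures, and the subagent merely wants to survive.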
If the subagent is anything like a human, they would quickly integrate the other goals into their own motivation[7], removing the need for the gun wielder/corporation part of the algorithm.
3.4 Noise, anti-agents and goal combination
There are further ways of extending the space of goals we could implement in human-level AIs. One simple way is to introduce noise: flip a few bits and subroutines, add bugs and get a new agent. Of course, this is likely to cause the agent’s intelligence to decrease somewhat, but we have generated new goals. Then, if appropriate, we could use evolution or other improvements to raise the agent’s intelligence again; this will likely undo some, but not all, of the effect of the noise. Or we could use some of the tricks above to make a smarter agent implement the goals of the noise-modified agent.
A more extreme example would be to create an anti-agent: an agent whose single goal is to stymie the plans and goals of a single given agent. This already happens with vengeful humans, and we would just need to dial it up: have an anti-agent that would do all it can to counter the goals of a given agent, even if that agent doesn’t exist (“I don’t care that you’re dead, I’m still going to despoil your country, because that’s what you’d wanted me to not do”). This further extends the space of possible goals.
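As a very rough illustration of the noise idea, the sketch below flips a fixed number of bits in a goal encoding. The assumption that an agent’s goal is available as a separate bit string is ours and is made only to keep the example small; in a real agent the noise would hit goals and capabilities together, which is why the text expects some loss of intelligence. The helper names are hypothetical.

```python
import random
from typing import List


def add_noise(goal_bits: List[int], flips: int, rng: random.Random) -> List[int]:
    """Return a copy of goal_bits with `flips` distinct positions inverted."""
    noisy = list(goal_bits)
    for i in rng.sample(range(len(noisy)), flips):
        noisy[i] ^= 1
    return noisy


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


original_goal = [1, 0, 1, 1, 0, 0, 1, 0]
new_goal = add_noise(original_goal, flips=2, rng=random.Random(0))

print(hamming(original_goal, new_goal))  # 2
```

Each call with a different random source yields a different nearby goal, which is all the argument in this paragraph needs.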
Different agents with different goals can also be combined into a single algorithm. With some algorithmic method for the AIs to negotiate their combined objective and balance the relative importance of their goals, this procedure would construct a single AI with a combined goal system. There would likely be no drop in intelligence/efficiency: committees of two can work very well towards their common goals, especially if there is some automatic penalty for disagreements.
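One simple way to picture such a combination is a weighted sum of the component utilities. This is only a sketch under strong assumptions: the weights are fixed in advance rather than negotiated, outcomes are small dictionaries of counts, and the two example goals are invented.

```python
from typing import Callable, Dict, List

Outcome = Dict[str, int]
Utility = Callable[[Outcome], float]


def combine(utilities: List[Utility], weights: List[float]) -> Utility:
    """Build a single goal system as a weighted sum of the agents' utilities."""
    if len(utilities) != len(weights):
        raise ValueError("need exactly one weight per utility function")

    def combined(outcome: Outcome) -> float:
        return sum(w * u(outcome) for u, w in zip(utilities, weights))

    return combined


def paperclips(outcome: Outcome) -> float:
    return float(outcome["paperclips"])


def stamps(outcome: Outcome) -> float:
    return float(outcome["stamps"])


joint = combine([paperclips, stamps], [0.5, 0.5])

options: List[Outcome] = [
    {"paperclips": 10, "stamps": 0},
    {"paperclips": 0, "stamps": 8},
    {"paperclips": 6, "stamps": 6},
]

print(max(options, key=joint))  # {'paperclips': 6, 'stamps': 6}
```

Neither original agent would pick the third option on its own (each prefers the option that maximises its own count), so the combined system has a goal that differs from both of its components.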
3.5 Further tricks up the sleeve
This section started by emphasising the wide space of human goals, and then introduced tricks to push goal systems further beyond these boundaries. The list isn’t exhaustive: there are surely more devices and ideas one can use to continue to extend the space of possible goals for human-level AIs. Though this might not be enough to get every goal, we can almost certainly use these procedures to construct a human-level AI with any human-comprehensible goal. But would the same be true for superhuman AIs?
4 Orthogonality for superhuman AIs
We now come to the area where the Orthogonality thesis seems the most vulnerable. It is one thing to have human-level AIs, or abstract superintelligent algorithms created ex nihilo, with certain goals. But if ever the human race were to design a superintelligent AI, there would be some sort of process involved – directed evolution, recursive self-improvement (Yudkowsky, 2001), design by a committee of AIs, or similar – and it seems at least possible that such a process could fail to fully explore the goal-space. If we define the Orthogonality thesis in this context as:
There are two counter-theses. The weakest claim is:
A stronger claim is:
Convergence: all human-designed superintelligences would have one of a small set of goals.
They should be distinguished; Incompleteness is all that is needed to contradict Orthogonality, but Convergence is often the issue being discussed. Often convergence is assumed to be to some particular model of metaethics (Müller, 2012).
4.1 No convergence
The plausibility of the convergence thesis is highly connected with the connotations of the terms used in it. “All human-designed rational beings would follow the same morality (or one of a small set of moralities)” sounds plausible; in contrast, “All human-designed superefficient algorithms would accomplish the same task” seems ridiculous. To quote an online commentator, how good at playing chess would a chess computer have to be before it started feeding the hungry?
Similarly, if there were such a convergence, then all self-improving or constructed superintelligences must fall prey to it, even if they were actively seeking to avoid it. After all, the lower-level AIs or the AI designers have certain goals in mind (as we’ve seen in the previous section, potentially any goals in mind). Obviously, they would be less likely to achieve their goals if these goals were to change (Omohundro, 2008; Bostrom, 2012). The same goes if the superintelligent AI they designed didn’t share these goals. Hence the AI designers would actively try to prevent such a convergence if they suspected that one was likely to happen. If for instance their goals were immoral, they would program their AI not to care about morality; they would use every trick up their sleeves to prevent the AI’s goals from drifting from their own.
So the convergence thesis requires that for the vast majority of goals G:
This makes the convergence thesis very unlikely. The argument also works against the incompleteness thesis, but in a weaker fashion: it seems more plausible that some goals would be unreachable, despite being theoretically possible, rather than most goals being unreachable because of convergence to a small set.
There is another interesting aspect of the convergence thesis: it postulates that certain goals G will emerge, without them being aimed for or desired. If one accepts that goals aimed for will not be reached, one has to ask why convergence is assumed: why not divergence? Why not assume that though G is aimed for, random accidents or faulty implementation will lead to the AI ending up with one of a much wider array of possible goals, rather than a much narrower one?
4.2 Oracles show the way
If the Orthogonality thesis is wrong, then Oracles are impossible to build. An Oracle is a superintelligent AI that accurately answers questions about the world (Armstrong, Sandberg, & Bostrom, 2011). This includes hypothetical questions about the future, which means that we can produce a superintelligent AI with goal G by wiring a human-level AI with goal G to an Oracle: the human-level AI will go through possible actions, have the Oracle check the outcomes, and choose the one that best accomplishes G.
What makes the “no Oracle” position even more counterintuitive is that any superintelligence must be able to look ahead, design actions, predict the consequences of its actions, and choose the best one available. But the convergence thesis implies that this general skill is one that we can make available only to AIs with certain specific goals. Though the agents with those narrow goals are capable of making these predictions, they would automatically lose this ability if their goals were to change.
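The wiring described in this section can be sketched as a simple selection loop. This is only an illustration under strong assumptions: the Oracle is modelled as a function from an action to its predicted outcome, the goal G as a scoring function on outcomes, and the set of candidate actions as a finite list. All names (`choose_action`, `toy_oracle`, `paperclip_goal`, `cake_goal`) and the toy outcome table are hypothetical.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")
Outcome = TypeVar("Outcome")


def choose_action(
    actions: Iterable[Action],
    oracle: Callable[[Action], Outcome],
    goal_score: Callable[[Outcome], float],
) -> Action:
    """Return the action whose Oracle-predicted outcome scores best under goal G."""
    return max(actions, key=lambda action: goal_score(oracle(action)))


# A stand-in Oracle: a fixed table of predicted outcomes (illustrative only).
def toy_oracle(action: str) -> dict:
    predictions = {
        "bake": {"paperclips": 0, "cakes": 3},
        "bend_wire": {"paperclips": 5, "cakes": 0},
        "idle": {"paperclips": 0, "cakes": 0},
    }
    return predictions[action]


# Two different goals G, each a scoring function on outcomes.
def paperclip_goal(outcome: dict) -> float:
    return outcome["paperclips"]


def cake_goal(outcome: dict) -> float:
    return outcome["cakes"]


ACTIONS = ["bake", "bend_wire", "idle"]

# The same Oracle serves either goal; only the scoring function changes.
paperclip_choice = choose_action(ACTIONS, toy_oracle, paperclip_goal)
cake_choice = choose_action(ACTIONS, toy_oracle, cake_goal)
```

Nothing in `choose_action` depends on which goal is supplied: the predictive component and the goal enter as separate arguments, which is the separation the argument relies on.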
4.3 Tricking the controller
Just as with human-level AIs, one could construct a superintelligent AI by wedding together a superintelligence with a large committee of human-level AIs dedicated to implementing a goal G and to checking the superintelligence’s actions. Thus to deny the Orthogonality thesis requires that one believes that the superintelligence is always capable of tricking this committee, no matter how detailed and thorough their oversight.
This argument extends the Orthogonality thesis to moderately superintelligent AIs, or to any situation where there are diminishing returns to intelligence. It only fails if we take AI to be fantastically superhuman: capable of tricking or seducing any collection of human-level beings.
4.4 Temporary fragments of algorithms, fictional worlds and extra tricks
These are other tricks that can be used to create an AI with any goals. For any superintelligent AI, there are certain inputs that will make it behave in certain ways. For instance, a human-loving moral AI could be compelled to follow most goals G for a day, if it were rewarded with something sufficiently positive afterwards. But its actions for that one day are the result of a series of inputs to a particular algorithm; if we turned off the AI after that day, we would have accomplished moves towards goal G without having to reward its “true” goals at all. And then we could continue the trick the next day with another copy.
For this to fail, it has to be the case that we can create an algorithm which will perform certain actions on certain inputs as long as it isn’t turned off afterwards, but that we cannot create an algorithm that does the same thing if it were to be turned off.
Another alternative is to create a superintelligent AI that has goals in a fictional world (such as a game or a reward channel) over which we have control. Then we could trade interventions in the fictional world against advice in the real world towards whichever goals we desire.
These two arguments may feel weaker than the ones before: they are tricks that may or may not work, depending on the details of the AI’s setup. But to deny the Orthogonality thesis requires not only denying that these tricks would ever work, but denying that any tricks or methods that we (or any human-level AIs) could think up, would ever work at controlling the AIs. We need to assume superintelligent AIs cannot be controlled.
4.5 In summary
Denying the Orthogonality thesis thus requires that:
5 Bayesian Orthogonality thesis
All the previous sections concern hypotheticals, but of a different kind. Section 2 touches upon what kinds of algorithm could theoretically exist. But sections 3 and 4 concern algorithms that could be constructed by humans (or from AIs originally constructed by humans): they refer to the future. As AI research advances, and certain approaches or groups start to show or lose prominence, we’ll start getting a better idea of how such an AI will emerge.
Thus the Orthogonality thesis will narrow as we achieve better understanding of how AIs would work in practice, of what tasks they will be put to and of what requirements their designers will desire. Most importantly of all, we will get more information on the critical question as to whether the designers will actually be able to implement their desired goals in an AI. On the eve of creating the first AIs (and then the first superintelligent AIs), the Orthogonality thesis will likely have pretty much collapsed: yes, we could in theory construct an AI with any goal, but at that point, the most likely outcome is an AI with particular goals – either the goals desired by their designers, or specific undesired goals and error modes.
However, until that time arises, because we do not know any of this information currently, we remain in the grip of a Bayesian version of the Orthogonality thesis:
6 Conclusion
It is not enough to know that an agent is intelligent (or superintelligent). If we want to know something about its final goals, about the actions it will be willing to undertake to achieve them, and hence its ultimate impact on the world, there are no shortcuts. We have to directly figure out what these goals are, and cannot rely on the agent being moral just because it is superintelligent/superefficient.
7 Acknowledgements
It gives me great pleasure to acknowledge the help and support of Anders Sandberg, Nick Bostrom, Toby Ord, Owain Evans, Daniel Dewey, Eliezer Yudkowsky, Vladimir Slepnev, Viliam Bur, Matt Freeman, Wei Dai, Will Newsome, Paul Crowley, Alexander Kruel and Rasmus Eide, as well as those members of the Less Wrong online community going by the names shminux, Larks and Dmytry.
8 Bibliography
Armstrong, S., Sandberg, A., & Bostrom, N. (2011). Thinking Inside the Box: Controlling and Using an Oracle AI. forthcoming in Minds and Machines.
Bostrom, N. (2012). Superintelligence: Groundwork to a Strategic Analysis of the Machine Intelligence Revolution. to be published.
Bostrom, N. (2011). The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. forthcoming in Minds and Machines.
de Fabrique, N., Romano, S. J., Vecchi, G. M., & van Hasselt, V. B. (2007). Understanding Stockholm Syndrome. FBI Law Enforcement Bulletin (Law Enforcement Communication Unit), 76 (7), 10-15.
Hume, D. (1739). A Treatise of Human Nature.
Hutter, M. (2005). Universal algorithmic intelligence: A mathematical top-down approach. In B. Goertzel, & C. Pennachin (Eds.), Artificial General Intelligence. Springer-Verlag.
Müller, J. (2012). Ethics, risks and opportunities of superintelligences. Retrieved May 2012, from http://www.jonatasmuller.com/superintelligences.pdf
Omohundro, S. M. (2008). The Basic AI Drives. In P. Wang, B. Goertzel, & S. Franklin (Eds.), Artificial General Intelligence: Proceedings of the First AGI Conference (Vol. 171).
Sandberg, A., & Bostrom, N. (2008). Whole brain emulation: A roadmap. Future of Humanity Institute Technical report, 2008-3.
Schmidhuber, J. (2007). Gödel machines: Fully self-referential optimal universal self-improvers. In Artificial General Intelligence. Springer.
Wang, P. (2011). The assumptions on knowledge and resources in models of rationality. International Journal of Machine Consciousness, 3 (1), 193-218.
Yudkowsky, E. (2001). General Intelligence and Seed AI 2.3. Retrieved from Singularity Institute for Artificial Intelligence: http://singinst.org/ourresearch/publications/GISAI/
Footnotes
[1] We need to assume it has goals, of course. Determining whether something qualifies as a goal-based agent is very tricky (researcher Owain Evans is trying to establish a rigorous definition), but this paper will adopt the somewhat informal definition that an agent has goals if it achieves similar outcomes from very different starting positions. If the agent ends up making ice cream in any circumstances, we can assume ice creams are in its goals.
[2] Every law of nature being algorithmic (with some probabilistic process of known odds), and no exceptions to these laws being known.
[3] One could argue that we should consider the space of general animal intelligences – octopuses, supercolonies of social insects, etc... But we won’t pursue this here; the methods described can already get behaviours like this.
[4] Even today, many people have had great fun torturing and abusing their characters in games like “the Sims” (http://meodia.com/article/281/sadistic-ways-people-torture-their-sims/). The same urges are present, albeit diverted to fictionalised settings. Indeed games offer a wide variety of different goals that could conceivably be imported into an AI if it were possible to erase the reality/fiction distinction in its motivation.
[5] As can be shown by a glance through a biography of famous people – and famous means they were generally allowed to rise to prominence in their own society, so the space of possible motivations was already cut down.
[6] Of course, if we built an AI with that goal and copied it millions of times, it would no longer be niche.
[7] Such as the hostages suffering from Stockholm syndrome (de Fabrique, Romano, Vecchi, & van Hasselt, 2007).