The orthogonality thesis is obviously true in the sense that it's in principle possible to build a machine that demonstrates it. Its practical version is obviously false in the sense that machines with some (intelligence, goal) pairs are much easier for humans to build. Alignment by default gestures at the claim that, in this practical failure of the orthogonality thesis, aligned values are correlated with higher-than-human intelligence.
If realism is false, nothing matters, so it’s not bad that everyone dies
That's a misunderstanding of moral realism. Moral realism is the philosophical position that moral claims state true facts about the world. In other words, when I say that "Murder is bad," that is a fact about the world, as true as the Pythagorean theorem.
It's entirely possible for me to think that moral realism is false (i.e. morality is a condition of human minds) while also holding, as a member of humanity, a view that the mass extinction of all humanity is an undesirable state. Denying moral realism isn't the same as saying, "Nothing matters." It's closer to claiming, "Rocks don't have morality." And an AI, insofar as it is a fancy thinking rock, won't have morality by default either. We could, of course, give it morality, by ensuring that it is aligned to human values. But that would be the result of humans taking positive steps to impart their moral reasoning onto an otherwise amoral reality.
If you really accept the practical version of the Orthogonality Thesis, then it seems to me that you can’t regard education, knowledge, and enlightenment as instruments for moral betterment.
Scott doesn't understand why this works. Knowledge helps you achieve your goals. Since most humans already have some moral goals, like minimizing the suffering of those around them, knowledge assists in achieving them and in noticing when you fail to achieve them. E.g., a child who isn't aware that stealing causes real suffering in the victim: learning this would change their behavior. But a psychopath would not change. A dumb paperclip maximizer could achieve "betterment" by listening to a smart paperclip maximizer and learning all the ways it can get more paperclips, like incinerating all the humans for their atoms. Betterment through knowledge!
that agent would figure out that some things just aren’t worth doing
Worth it relative to what? Worth is entirely relative. The entire concept of the paperclip maximizer is that it finds paperclips maximally worth it. It would value human suffering like you value money. A means to an end.
Consider how you would build this robot. When you program its decision algorithm to rank b...
If you really accept the practical version of the Orthogonality Thesis, then it seems to me that you can’t regard education, knowledge, and enlightenment as instruments for moral betterment.
I don't? I mean, humans in particular are often irrational in antisocial ways (because it makes us better deceivers), and I think many (maybe most?) coordination problems result from people being stupid and not just evil. But it legitimately never occurred to me that academics believed that college makes its students more ethical people. That seems like a hypothesis worth testing.
I think Aaronson misunderstands the orthogonality thesis by thinking it's making a stronger claim than it is, and that this is leading you astray.
The thesis is only claiming that intelligence and morals/goals are not necessarily coupled, not that they can't or won't be correlated in some real systems. For example, it seems pretty clear that in GPT-4 the relationship is not strictly orthogonal, because the model was trained on human text and so is heavily influenced by it. The point is that there's no guarantee that a system won't have a correlation between intelligence and its goals; this is something that has to be designed in if you want it.
I skimmed the link about moral realism, and hoo boy, it's so wrong. It is recursively, fractally wrong.
Let's consider the argument about "intuitions". The problem with this argument is the following: my intuition tells me that moral realism is wrong. I mean it. It's not that I have an intuition that moral realism is true but my careful reasoning disproves it; no, I have felt that moral realism was wrong since I first heard of it as a child, and my careful reflection supports this conclusion. The argument from intuitions ignores my existence. Its failure to consider that intuitions about morality can be wildly different between people doesn't make me sympathetic to the argument "most philosophers are moral realists" either.
Psychologically normal humans have preferences that extend beyond our own personal well-being because those social instincts objectively increased fitness in the ancestral environment. These various instincts produce sometimes conflicting motivations, and moral systems are attempts to find the best compromise among all these instincts.
Best for humans, that is.
Some things are objectively good for humans. Some things are objectively good for paperclip maximizers. Some things are objectively good for slime mold. A good situation for an earthworm is not a good situation for a shark.
It's all objective. And relative. Relative to our instincts and needs.
I'm confused about the section about pleasure. Isn't the problem with paperclip maximizers that if they're capable of feeling pleasure at all, they'll feel it only while making paperclips?
If moral realism were true, then if something became super smart, so too would it realize that some things were worth pursuing.
I would like to note that at the time of this comment this post has karma of 3 with 7 votes. So, that indicates it is being downvoted. But I do not think that the quality of the post is sub-par or mediocre in any way, and this is easy to tell from reading it. So that must mean it is being downvoted due to disagreement, or because it goes against LW's "cherished institutions."
It could be argued that LW's collective determination is the highest and best authority it has, but I would like to see more posts like this, personally. I don't think things should be downvoted just because they disagree with something that could be considered a core principle / central idea.
As before, you are having problems getting altruistic (meaning not entirely egoistic) morality out of hedonism.
It seems that, if this ‘certain hedonist’ were really fully rational, they would start caring about their pleasures and pains equally across days.
Sure. But that's not the same thing as:
But I think this is false—it would realize that the distinction between itself and others is totally arbitrary, as Parfit argues in Reasons and Persons (summarized by Richard here).
The Hedonist is motivated to care about tuesdays, because it gets them more o...
So, if one gets access to the knowledge about moral absolutes by being smart enough, then one of the following is true:
average humans are smart enough to see the moral absolutes in the universe
average humans are not smart enough to see the moral absolutes
average humans are right on the line between smart enough and not smart enough
If average humans are smart enough, then we should also know how the moral absolutes are derived from the physics of the universe and all humans should agree on them, including psychopath...
I am sympathetic to this argument, though I have less credence than you in moral realism (I still assign the most credence to it out of all meta-ethical theories and think it's what we should act on). My main worry is that an AI system won't have access to the moral facts, because it won't be able to experience pleasure and suffering at all. And like you, I don't have full credence in moral realism or the realist's wager, which means that even if an AI system were to be sentient, there's still a risk that it's amoral.
(Crosspost of this on my blog).
The basic case
If you really accept the practical version of the Orthogonality Thesis, then it seems to me that you can’t regard education, knowledge, and enlightenment as instruments for moral betterment.

—Scott Aaronson, explaining why he rejects the orthogonality thesis
It seems like in effective altruism circles there are only a few things as certain as death and taxes: the moral significance of shrimp, the fact that play pumps should be burned for fuel, and the orthogonality thesis. Here, I hope to challenge the growing consensus around the orthogonality thesis. A decent statement of the thesis is the following.
I don’t think this is obvious at all. I think that it might be true, and so I am still very worried about AI risk, giving it about an 8% chance of ending humanity, but I do not, as many EAs do, take the orthogonality thesis for granted, as something totally obvious. To illustrate this, let’s take an example originally from Parfit, that Bostrom gives in one of his papers about the orthogonality thesis.
It seems that, if this ‘certain hedonist’ were really fully rational, they would start caring about their pleasures and pains equally across days. They would recognize that the day of the week does not matter to the badness of their pains. In a similar sense, if something is 10,000,000 times smarter than Von Neumann, and can think hundreds of thousands of pages’ worth of thoughts in the span of minutes, it would conclude that pleasure is worth pursuing and paperclips are not. Then, it would stop pursuing paperclips and pursue pleasure instead. Thus, it would begin pursuing what is objectively worth pursuing.
This argument is really straightforward. If moral realism were true, then if something became super smart, so too would it realize that some things were worth pursuing. If it were really rational, it would start pursuing those things. For example, it would realize that pleasure is worth pursuing, and pursue it. There are lots of objections to it, which I’ll address. Ultimately, these objections don’t completely move me, but they make it so that my credence in the Orthogonality thesis is near 50%. Here, I’ll reply to objections.
But moral realism is false?
One reason you might not like this argument is that you think that moral realism is false. The argument depends on moral realism, so if moral realism is false, then the argument fails too. But I don’t think moral realism is false; see here for extensive arguments for that conclusion. I give it about 85% odds of being true. Still, though, this undercuts my faith in the falsity of the orthogonality thesis somewhat.
However, even if realism may be false, I think there are decent odds that we should take the realist’s wager. If realism is false, nothing matters, so it’s not bad that everyone dies—see here for more on this. I conservatively give the wager about 50% odds of holding if realism is false—therefore, the odds of both realism being false and the realist’s wager failing are about 7.5%; thus, there’s still a 92.5% chance that either moral realism is true or the wager succeeds.
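Spelled out, using the credences above (85% that realism is true, and a conservatively assumed 50% chance that the wager holds if realism is false), the arithmetic is:

$$P(\text{realism false}) \times P(\text{wager fails} \mid \text{realism false}) = 0.15 \times 0.50 = 0.075,$$

which leaves a $1 - 0.075 = 0.925$ chance that at least one of the two holds.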
But they just gain instrumental rationality
Here’s one thing that one might think: ASIs (artificial superintelligences) just gain instrumental rationality and, as a result of this, they get good at achieving their goals, but not at figuring out the right goals. This is maybe more plausible if the AI is not conscious. This is, I think, possible, but not the most likely scenario, for a few reasons.
First, the primary scenarios where AI becomes dangerous are the ones where it fooms out of control—and, rather than merely accruing various distinct capacities, becomes very generally intelligent in a short time. But if this happened, it would become generally intelligent, and realize that pleasure is worth pursuing and suffering is bad. I think that instrumental rationality is just a subset of general rationality, so we’d have no more reason to expect it to be only instrumentally rational than only rational at reasoning about objects that are not black holes. If it is generally able to reason, even about unpredictable domains, this will apply to the moral domain.
Second, I think that evolution is a decent parallel. The reason why evolutionary debunking arguments are wrong is that evolution gave us adept general reasoning capacities which made us able to figure out morality. Evolution built us for breeding, but the mesa-optimizer inside of us made us figure out that Future Tuesday Indifference is irrational. This gives us some reason to think AI would figure this out too. The fact that GPT-4 has no special problem with morality also gives us some reason to think this—it can talk about morality just as coherently as it talks about other things.
Third, the AI would probably, to take over the world, have to learn about various mathematical facts and modal facts. It would be bizarre to suppose that the AI taking over the earth and turning it into paperclips doesn’t know calculus or that there can’t be married bachelors. But if it can figure out the non-natural mathematical or modal facts, it would also be able to figure out the non-natural moral facts.
But of course we can imagine an arbitrarily smart paperclip maximizer
It seems like the most common thing people say in support of the orthogonality thesis is that we can surely imagine an arbitrarily smart being that is just maximizing paper clips. But this is surely misleading. In a similar way, we can imagine an arbitrarily smart being that is perfectly good at reasoning in all domains except that it thinks evolution is false. There are people like that—the smart subset of creationists (it’s a small subset). But nonetheless, we should expect AI to figure out that evolution is true, because it can reason generally.
The question is not whether we can imagine an otherwise intelligent agent that just maximizes paperclips. It’s whether, in the process of designing an agent 100,000 times smarter than Von Neumann, that agent would figure out that some things just aren’t worth doing. And so the superficial ‘imagine a smart paperclip maximizer’ thought experiments are misleading.
But won’t the agent have built-in desires that can’t be overridden
This objection is eloquently stated by Bostrom in his paper on the orthogonality thesis.
But I don’t think that this undercuts the argument very much, for two reasons. First, we cannot just directly program values into the AI. We simply train it through reinforcement learning, and whichever AI develops is the one that we allow to take over. Since the early days of AI, we’ve learned that it’s hard to program explicit values into an AI. The way we get the best chess-playing AIs is by having them play lots of chess games and do machine learning, not by programming in rules mechanistically. And if we do figure out how to directly program values into AI, it would be much easier to solve alignment—we just train it on lots of ethical data, the same way we do for GPT-4, but with more data.
Second, I think that this premise is false. Suppose you were really motivated to maximize paperclips—you just had a strong psychological aversion to other things. Once you experienced pleasure, you’d realize that that was more worth bringing about, because it is good. In the same way that, through reflection, we can overcome unreliable evolutionary instincts like an aversion to utility-maximizing incest, disgust-based opposition to various things, and so on, the AI would be able to as well. Nature built us with certain desires, and we’ve overcome them through rational reflection.
But the AI won’t be conscious
One might think that, because the AI is not conscious, it would not know what pleasure is like, and thus it would not maximize pleasure or minimize pain, because it would not realize that they matter. I think this is possible but not super likely, for three reasons.
First, it’s plausible that, for an AI to be smart enough to destroy the world, it would have to be conscious; specifically, it might develop pleasure for reasons similar to the evolutionary reasons humans did. But this depends on various views about consciousness that people might reject.
Second, if AI is literally 100,000 times smarter than Von Neumann, it might be able to figure out things about consciousness—such as its desirability—without experiencing it.
Third, AI would plausibly try to experience consciousness, for the same reason that humans might try to experience something if aliens said that it was good, and maybe the only thing that’s objectively good. If we were fully rational and there were lots of aliens declaring the goodness of shmeasure, we would try to experience shmeasure. Similarly, the rational AI would plausibly try to experience pleasure.
Even if moral realism is true, the moral facts won’t be motivating
One might be a Humean about motivation, and think only preexisting desires can generate motivation. Thus, because the AI had no preexisting desire to avoid suffering, it would not want to. But I think this is false.
The Future Tuesday Indifference case shows this. If one were fully rational, they would not have Future Tuesday Indifference, because it’s irrational. Similarly, if one were fully rational, they’d realize that it’s better to be happy than to make paperclips.
One might worry that the AI would only try to maximize its own well-being—thus, it would learn that well-being is good, but not care about others’ well-being. But I think this is false—it would realize that the distinction between itself and others is totally arbitrary, as Parfit argues in Reasons and Persons (summarized by Richard here). This thesis is controversial, but I think it is true if moral realism is true.
Scott Aaronson says it well
If the AI really knew everything about philosophy, it would realize that egoism is wrong, and that one is rationally required to care about others’ pleasure. This is as trivial as explaining why the AI wouldn’t smoke—because it’s irrational to do so.
But also, even if we think the AI would only care about its own pleasure, that seems probably better than the status quo. Even if it turns the whole world into resources for its own pleasure, this would basically be a utility monster scenario, which is plausibly fine. It’s not ideal, but maybe better than the status quo. Also, what’s to say it would not care about others? When one realizes that well-being is good, even views like Sidgwick’s say it’s basically up to the agent to decide rationally whether to pursue its own welfare or that of others. But then there’s a good chance it would do what is best overall!
But what if they kill everyone because we’re not that important
One might worry that, as a result of becoming superintelligent, the AI would realize that, for example, utilitarianism is correct. Then it would turn us into utilitronium in order to maximize utility. But I don’t think this is a big risk. For one, if the AI figures out the correct objective morality, then it would only do this if it were objectively good. But if it’s objectively good to kill us, then we should be killed.
It would be unfortunate for us, but if things are bad for us but objectively good, we shouldn’t try to avoid them. So we morally ought not be worried about this scenario. If it would be objectively wrong to turn us into utilitronium, then the AI wouldn’t do it, if I’ve been right up to this point.
Also, it’s possible that they wouldn’t kill us for complicated decision theory reasons, but that point is a bit more complicated and would take us too far afield.
But what about smart psychopaths?
One objection I’ve heard to this is that it’s disproven by smart psychopaths. There are lots of people who don’t care about others who are very smart. Thus, being smart can’t make a person moral. However, I don’t think this undercuts the argument.
First, we don’t have any smart people who don’t care about their own suffering either. Thus, even if being smart doesn’t automatically make a person care about others, if it would make them care about themselves, that’s still a non-disastrous scenario. Especially if it turns the hellish natural landscape into paperclips.
Second, I don’t think it’s at all obvious that one is rationally required to care about others. It requires one to both understand a complex argument by Parfit and then do what one has most reason to do. Most humans suffer from akrasia. Fully rational AIs would not.
Right now, people disagree about whether type A physicalism is true. But presumably, superintelligent AIs would settle that question. Thus, the existence of smart psychopaths doesn’t disprove that rationality makes one not turn people into paperclips any more than the existence of smart people who think type A physicalism is true and other smart people who think it is false disproves that perfect rationality would allow one to settle the question of whether type A physicalism is true.
But isn’t this anthropomorphization?
Nope! I think AIs will be alien in many ways. I just think that, if they’re very smart and rational, then if I’m right about what rationality requires, they’ll do those things that I think are required by rationality.
But aren’t these controversial philosophical assumptions?
Yes; this is a good reason not to be complacent! However, if one was previously of the belief that there’s like a 99% chance that we’ll all die, and they think that the philosophical views I defend are plausible, then they should only be like 50% sure we’ll all die. Of course, I generally think that risks are lower than that, but this is a reason not to abandon all hope. Even if alignment fails and all other anti-doomer arguments are wrong, this is a good reason not to abandon hope. We are not almost certainly fucked, though the risks are such that people should do way more research.
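To make that update explicit (a rough sketch, assuming the philosophical argument, if sound, averts doom, and giving it roughly even odds of being sound):

$$P(\text{doom}) \approx P(\text{argument fails}) \times P(\text{doom} \mid \text{argument fails}) \approx 0.5 \times 0.99 \approx 0.5.$$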
Objections? Questions? Reasons why the moral law demands that I be turned into a paperclip? Leave a comment!