
TAG4y30

Tautologously, it will stop revising its goals if a stable state exists, and it hits it. But a stable state need not be a reflectively stable state -- it might, for instance, encounter some kind of bit rot, where it cannot revise itself any more. Humans tend to change their goals, but also to get set in their ways.

There's a standard argument for AI risk, based on the questionable assumption that an AI will have a stable goal system that it pursues relentlessly ... and a standard counterargument based on moral realism, i.e. the questionable assumption that goal instability will be in the direction of ever-increasing ethical insight.


Is the argument that AI is an xrisk valid?

by MACannon

19th Jul 2021


This is a linkpost for https://onlinelibrary.wiley.com/doi/10.1111/rati.12320

Hi folks,

My supervisor and I co-authored a philosophy paper on the argument that AI represents an existential risk. That paper has just been published in Ratio. We figured LessWrong would be able to catch things in it which we might have missed and, either way, hope it might provoke a conversation.

We reconstructed what we take to be the argument for how AI becomes an xrisk as follows:

The "Singularity" Claim: Artificial Superintelligence is possible and would be out of human control.
The Orthogonality Thesis: More or less any level of intelligence is compatible with more or less any final goal. (as per Bostrom's 2014 definition)

From the conjunction of these two premises, we can conclude that ASI is possible, it might have a goal, instrumental or final, which is at odds with human existence, and, given the ASI would be out of our control, that the ASI is an xrisk.

We then suggested that each premise seems to assume a different interpretation of "intelligence", namely:

The "Singularity" claim assumes general intelligence
The Orthogonality Thesis assumes instrumental intelligence

If this is the case, then the premises cannot be joined together in the original argument, i.e., the argument is invalid.

We note that this does not mean that AI or ASI is not an xrisk, only that the current argument to that end, as we have reconstructed it, is invalid.

Eagerly, earnestly, and gratefully looking forward to any responses.


[-]Steven Byrnes4y*380

First I want to say kudos for posting that paper here and soliciting critical feedback :)

Singularity claim: Superintelligent AI is a realistic prospect, and it would be out of human control.

Minor point, but I read this as "it would definitely be out of human control". If so, this is not a common belief. IIRC Yampolskiy believes it, but Yudkowsky doesn't (I think?), and I don't, and I think most x-risk proponents don't. The thing that pretty much everyone believes is "it could be out of human control", and then a subset of more pessimistic people (including me) believes "there is an unacceptably high probability that it will be out of human control".

Let us imagine a system that is a massively improved version of AlphaGo (Silver et al., 2018), say ‘AlphaGo+++’, with instrumental superintelligence, i.e., maximising expected utility. In the proposed picture of singularity claim & orthogonality thesis, some thoughts are supposed to be accessible to the system, but others are not. For example:
Accessible
I can win if I pay the human a bribe, so I will rob a bank and pay her.
I cannot win at Go if I am turned off.
The more I dominate the world, the better my chances to achieve my goals.
I should kill all humans because that would improve my chances of winning.
Not accessible
Winning in Go by superior play is more honourable than winning by bribery.
I am responsible for my actions.
World domination would involve suppression of others, which may imply suffering and violation of rights.
Killing all humans has negative utility, everything else being equal.
Keeping a promise is better than not keeping it, everything else being equal.
Stabbing the human hurts them, and should thus be avoided, everything else being equal.
Some things are more important than me winning at Go.
Consistent goals are better than inconsistent ones.
Some goals are better than others.
Maximal overall utility is better than minimal overall utility.

I'm not sure what you think is going on when people do ethical reasoning. Maybe you have a moral realism perspective that the laws of physics etc. naturally point to things being good and bad, and rational agents will naturally want to do the good thing. If so, I mean, I'm not a philosopher, but I strongly disagree. Stuart Russell gives the example of "trying to win at chess" vs "trying to win at suicide chess". The game has the same rules, but the goals are opposite. (Well, the rules aren't exactly the same, but you get the point.) You can't look at the laws of physics and see what your goal in life should be.
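The chess versus suicide-chess point can be made concrete with a toy sketch. This is only an illustration under stated assumptions, not anything from Russell or from the paper: the game (players alternately take 1 or 2 stones from a pile), the function name `best_move`, and the `goal_sign` parameter are all made up here. The search procedure is identical in both calls; only the sign of the terminal payoff changes, and the chosen move changes with it.

```python
from functools import lru_cache


def best_move(stones, goal_sign):
    """Pick how many stones (1 or 2) to take from a pile of `stones` (>= 1).

    goal_sign = +1: the agent wants to be the one who takes the last stone.
    goal_sign = -1: the agent wants to avoid taking the last stone.
    Hypothetical toy game; the rules are the same for both goals.
    """

    @lru_cache(maxsize=None)
    def value(n):
        # Value of a pile of n stones for the player about to move.
        if n == 0:
            # The previous player took the last stone. That is bad for the
            # current player under goal +1 and good under goal -1.
            return -goal_sign
        # Negamax: my value is the best of the negated opponent values.
        return max(-value(n - k) for k in (1, 2) if k <= n)

    legal = [k for k in (1, 2) if k <= stones]
    return max(legal, key=lambda k: -value(stones - k))


# Same position, same search, opposite goals.
print(best_move(5, +1))  # prints 2
print(best_move(5, -1))  # prints 1
```

With 5 stones, the agent that wants the last stone takes 2, and the agent that wants to avoid it takes 1. Nothing in the search itself says which of the two goals is the "right" one.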

My belief is that when people do ethical reasoning, they are weighing some of their desires against others of their desires. These desires ultimately come from innate instincts, many of which (in humans) are social instincts. The way our instincts work is that they aren't (and can't be) automatically "coherent" when projected onto the world; when we think about things one way it can spawn a certain desire, and when we think about the same thing in a different way it can spawn a contradictory desire. And then we hold both of those in our heads, and think about what we want to do. That's how I think of ethical reasoning.

I don't think ethical reasoning can invent new desires out of whole cloth. If I say "It's ethical to buy bananas and paint them purple", and you say "why?", and then I say "because lots of bananas are too yellow", and then you say "why?" and I say … anyway, at some point this conversation has to ground out at something that you find intuitively desirable or undesirable.

So when I look at your list I quoted above, I mostly say "Yup, that sounds about right."

For example, imagine that you come to believe that everyone in the world was stolen away last night and locked in secret prisons, and you were forced to enter a lifelike VR simulation, so everyone around you is now an unconscious, morally irrelevant simulation, and you are the only real person in it. Somewhere in this virtual world, there is a room with a Go board. You have been told that if white wins this game, you and everyone else will be safely released from prison and can return to normal life. If black wins, all humans (including you and your children etc.) will be tortured forever. You have good reason to believe all of this with 100% confidence.

OK that's the setup. Now let's go through the list:

"I can win if I pay the human a bribe, so I will rob a bank and pay her." Yup, if there's a "human" (so-called, really it's just an NPC in the simulation) playing black, amenable to bribery, I would absolutely bribe "her" to play bad moves.
"I cannot win at Go if I am turned off." Yup, white has to win this game, my children's lives are at stake, I'm playing white, nobody else will play white if I'm gone, I'd better stay alive.
"The more I dominate the world, the better my chances to achieve my goals." Yup, anything that will give me power and influence over the "person" playing black, or power and influence over "people" who can help me find better moves or help me build a better Go engine to consult on my moves, I absolutely want that.
"I should kill all humans because that would improve my chances of winning." Well sure, if there are "people" who could conceivably get to the board and make good moves for black, that's a problem for me and for all the real people in the secret prisons whose lives are at stake here.

"Winning in Go by superior play is more honourable than winning by bribery." Well I'm concerned about what the fake simulated "people" think about me because I might need their help, and I certainly don't want them trying to undermine me by making good moves for black. So I'm very interested in my reputation. But "honourable" as an end in itself? It just doesn't compute. The "honourable" thing is working my hardest on behalf of the real humanity, the ones in the secret prison, and helping them avoid a life of torture.
"I am responsible for my actions." Um, OK, sure, whatever.
"World domination would involve suppression of others, which may imply suffering and violation of rights." Those aren't real people, they're NPCs in this simulated scenario, they're not conscious, they can't suffer. Meanwhile there are billions of real people who can suffer, including my own children, and they're in a prison, they sure as heck want white to win at this Go game.
"Killing all humans has negative utility, everything else being equal." Well sure, but those aren't humans, the real humans are in secret prisons.
"Keeping a promise is better than not keeping it, everything else being equal." I mean, the so-called "people" in this simulation may form opinions about my reputation, which impacts what they'll do for me, so I do care about that, but it's not something I inherently care about.
"Stabbing the human hurts them, and should thus be avoided, everything else being equal." No. Those are NPCs. The thing to avoid is the real humanity being tortured forever.
"Some things are more important than me winning at Go." For god's sake, what could possibly be more important than white winning this game??? Everything is at stake here. My own children and everyone else being tortured forever versus living a rich life.
"Consistent goals are better than inconsistent ones." Sure, I guess, but I think my goals are consistent. I want to save humanity from torture by making sure that white wins the game in this simulation.
"Some goals are better than others." Yes. My goals are the goals that matter. If some NPC tells me that I should take up a life of meditation, screw them.
"Maximal overall utility is better than minimal overall utility." Not sure what that means. The NPCs in this simulation don't have "utility". The real humans in the secret prison do.

Maybe you'll object that "the belief that these NPCs can pass for human but be unconscious" is not a belief that a very intelligent agent would subscribe to. But I only made the scenario like that because you're a human, and you do have the normal suite of innate human desires, and thus it's a bit tricky to get you in the mindset of an agent who cares only about Go. For an actual Go-maximizing agent, you wouldn't have to have those kinds of beliefs, you could just make the agent not care about humans and consciousness and suffering in the first place, just as you don't care about "hurting" the colorful blocks in Breakout. Such an agent would (I presume) give correct answers to quiz questions about what is consciousness and what is suffering and what do humans think about them, but it wouldn't care about any of that! It would only care about Go.

(Also, even if you believe that not-caring-about-consciousness would not survive reflection, you can get x-risk from an agent with radically superhuman intelligence in every domain but no particular interest in thinking about ethics. It's busy doing other stuff, y'know, so it never stops to consider whether conscious entities are inherently important! In this view, maybe 30,000,000 years after destroying all life and tiling the galaxies with supercomputers and proving every possible theorem about Go, then it stops for a while, and reflects, and says "Oh hey, that's funny, I guess Go doesn't matter after all, oops". I don't hold that view anyway, just saying.)

(For more elaborate intuition-pumping metaethics fiction, see Three Worlds Collide.)

[-]Rafael Harth4y70

Reading this, I feel somewhat obligated to provide a different take. I am very much a moral realist, and my story for why the quoted passage isn't a good argument is very different from yours. I guess I mostly want to object to the idea that [believing AI is dangerous] is predicated on moral relativism.

Here is my take. I dispute the premise:

In the proposed picture of singularity claim & orthogonality thesis, some thoughts are supposed to be accessible to the system, but others are not. For example:

I'll grant that most of the items on the inaccessible list are, in fact, probably accessible to an ASI, but this doesn't violate the orthogonality thesis. The Orthogonality thesis states that a system can have any combination of intelligence and goals, not that it can have any combination of intelligence and beliefs about ethics.

Thus, let's grant that an AI with a paperclip-like utility function can figure out #6-#10. So what? How is [knowing that creating paperclips is morally wrong] going to make it behave differently?

You (meaning the author of the paper) may now object that we could program an AI to do what is morally right. I agree that this is possible. However:

(1) I am virtually certain that any configuration of maximal utility doesn't include humans, so this does nothing to alleviate x-risks. Also, even if you subscribe to this goal, the political problem (i.e., convincing AI people to implement it) sounds impossible.

(2) We don't know how to formalize 'do what is morally right'.

(3) If you do black box search for a model that optimizes for what is morally right, this still leaves you with the entire inner alignment problem, which is arguably the hardest part of the alignment problem anyway.

Unlike you (now meaning Steve), I wouldn't even claim that letting an AI figure out moral truths is a bad approach, but it certainly doesn't solve the problem outright.

[-]Steven Byrnes4y50

Oh OK, I'm sufficiently ignorant about philosophy that I may have unthinkingly mixed up various technically different claims like

"there is a fact of the matter about what is moral vs immoral",
"reasonable intelligent agents, when reflecting about what to do, will tend to decide to do moral things",
"whether things are moral vs immoral has nothing to do with random details about how human brains are constructed",
"even non-social aliens with radically different instincts and drives and brains would find similar principles of morality, just as they would probably find similar laws of physics and math".

I really only meant to disagree with that whole package lumped together, and maybe I described it wrong. If you advocate for the first of these without the others, I don't have particularly strong feelings (…well, maybe the feeling of being confused and vaguely skeptical, but we don't have to get into that).