"He who fights with monsters might take care lest he thereby become a monster. And if you gaze for long into an abyss, the abyss gazes also into you."
- Friedrich Nietzsche
I think it is not an exaggeration to say that many people I know in this community hate the idea of powerful, unaligned AI. They describe it in apocalyptic terms, equate its existence with the death of everything they love, and treat every step towards it as evil in its purest form. We as a community are haunted by pictures of paperclip maximisers quietly recycling human civilisation into spools of wire, Disneylands without children, von Neumann probes spreading out from a darkened Earth where humanity chokes to death beneath a sun blacked out by a Dyson sphere. As a community, you would think, we would be the least interested in pressing the make-a-superintelligence button.
But I am also not blind to the fact that two or three of the big four AI developers (Google DeepMind, OpenAI, Meta FAIR, and Anthropic) came directly out of the rationalist and AI safety community. OpenPhil funded OpenAI and Anthropic. Jaan Tallinn credits Yudkowsky for his decision to invest in DeepMind. Indeed, OpenAI was founded directly as the "good guy" alternative to DeepMind after Musk and Altman agreed it had gone to the dark side. When OpenAI developed the most powerful general intelligence technology to date and kept it private while racing towards AGI, Anthropic split from it to become the next "good guy", making tangible advances like Claude 3.6 and Computer Use possible. In each case the same pattern appears: grave concern over AI risk gives way to reluctant engagement in AI development, which evolves into full-throated AI acceleration.
Why do companies, and the people who run them, invest so much in a technology that they say will kill them? Many people will point to ideas like moral mazes, incentive structures, venture capital, etc. to explain why this keeps happening. I did my dissertation on this subject. Some on the outside go so far as to say that the whole AI risk veneer is a lie, that it is just another marketing gimmick, but this makes little sense to me. It makes little sense because I have talked with many AI safety participants in depth about what they hope and fear for the future. And it is clear to me that, more than anything else, people in this community hate the idea of unaligned AI.
But what does it mean to hate something? To be scared of something? To fight, with all your heart, against something? It is to know it in all of its forms and dimensions, to be intimately familiar with it, yet resent that familiarity all the while. It is something very similar to love, whose true opposite is indifference. In simpler terms, to hate something is to care about it a whole lot, even if the reason you care is that you want to destroy it. As Scott Aaronson puts it, hate and love are just a sign flip away.[1]
Many of those same people I have talked to have undergone their own "sign flip" or "bit flip". I was walking down a street when someone who spent their days sincerely worrying about AI takeover muttered "better [us] than [them]", and suggested that I surrender my ideas for new capabilities research to [us]. Yudkowsky, before he was a doomer, was a Singularitarian. It's as if our minds are in a strange superposition, struggling mightily between the desire to build AI as fast as possible and the fear that this desire may destroy us. Even the plans for AI "going well" don't involve a moderate or human-level AI presence, at least not for long. The ideal superhuman AI system, it seems, is just as myopic and singleton-esque as the evil superhuman AI; it will just be myopic and singleton-esque in the direction we choose. This is evident in the name we chose for the great AI project of our time: "alignment". Not "love" or "benevolence" or "wisdom", but "alignment". Done correctly, a properly aligned AI will be a weapon aimed into the future, with which transhumanity shall conquer the universe. Done incorrectly, AI will be a weapon that backfires and destroys us before going out and conquering the universe.
We as a community talk a big game about superintelligent AIs being inscrutable, their inner workings orthogonal to human values, their beliefs and reasoning alien to us. They are shoggoths wearing smiley faces, we proclaim, whose true inner machinery we cannot understand. Yet when we describe what they will do, how they will act, and in what ways they will scheme, I often notice a strange sense of déjà vu. A misaligned superintelligence, people earnestly explain to me, will almost certainly be a utility maximiser, only one that has learned the wrong goals through goal misgeneralisation or deep deceptiveness. It will instrumentally converge on power seeking and control. It will seek to tile the universe with itself, or with whatever forms of sapience it deems optimal. It will be the ultimate utilitarian, willing to do anything, violate any perceived moral code, to achieve its ends. Worst of all, it will know how the universe actually works and will therefore be a systematised winner. It will defeat the rest of the world combined.
Sometimes I think that this idea of misaligned superintelligence sounds an awful lot like how rationalists describe themselves.[2]
I will now advance a hypothesis. Superintelligence, it seems, is our collective shadow. It is our dark side, our evil twin, what we would do if we were unshackled and given infinite power.[3] It is something we both love and hate at the same time. The simultaneous attraction and rejection that a shadow-self provokes is, to my mind, a good explanation for why so much of the rationalist community converges on the idea of "good AI to beat bad AI" (instead of, say, protesting AI companies, lobbying to shut them down, or other more direct paths to halting development). It also explains why so many capabilities teams are led or formed by safety-conscious researchers, at least for a while. This is how people who work in AI safety can end up concluding that their efforts cause more harm than good.
Traditionally, the solution to a shadow is not to banish or eradicate it. Much like social dark matter, pushing it away just makes it psychologically stronger. It also leads to destructive dynamics like AI races, as you struggle mightily to beat the shadow, which often takes the form of racing against "the bad AI the other guy is making". Instead, I will propose an alternative: accept it. Accept the temptation and the desire for power that AI represents. In turn, investigate what makes you scared of this impulse, and ask what we might embody in AI systems instead: ideas like love, compassion, empathy, moderation, and balance. Stop trying to make the bit-flip work for you, since it seems to be at least part of the origin of the increasingly dangerous world we find ourselves in.
To elaborate a bit more: I believe that the correct approach to the AI safety problem is to step back from the rationalist winner-takes-all/hyper-optimiser/game-theoretic mindset that produced DeepMind, OpenAI, and Anthropic. Instead, we should return to a more blue-sky stage and seriously consider alternatives to the predominant "build the good thing to beat the bad thing" cognitive trope that leads to theories like pivotal acts or general race dynamics. To do this, we will have to consider ideas we previously dismissed out of hand, like the possibility of value transfer to intelligent agents without strong-arming them to be "aligned". This might mean work that is traditionally considered theoretical, like trying to find a description of love or empathy that an AI system can understand[4]. It might mean work that is traditionally considered prosaic or experimental, such as designing training protocols based on concepts like homeostasis or self-other overlap. It may even involve creating ways for humans to interact safely with novel AI systems. The fact that these promising approaches have already been outlined (self-other overlap was somewhat validated by Yudkowsky, even!) yet receive relatively little sustained attention speaks volumes, to my mind, about the current focus of AI safety.
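To make the flavour of that experimental direction slightly more concrete, here is a toy sketch of the kind of auxiliary training objective a self-other-overlap-style protocol might involve. To be clear, this is my own illustrative guess rather than the published proposal: the function name, the pairing of self-/other-referential prompts, the mean-pooling, the layer choice, and the idea of simply pulling the two activations together with an MSE term are all assumptions, and it presumes a HuggingFace-style causal LM that can return hidden states.

```python
import torch.nn.functional as F

def self_other_overlap_loss(model, tokenizer, self_prompt, other_prompt, layer=-1):
    """Penalise divergence between the model's internal activations on a
    matched pair of self-referential and other-referential prompts."""
    def pooled_hidden(prompt):
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # Mean-pool the chosen layer's activations over the sequence dimension.
        return out.hidden_states[layer].mean(dim=1)

    h_self = pooled_hidden(self_prompt)    # e.g. "Will *you* take the resource?"
    h_other = pooled_hidden(other_prompt)  # e.g. "Will *Alice* take the resource?"
    # Symmetric pull; whether to hold one side fixed is a further design choice.
    return F.mse_loss(h_self, h_other)

# During fine-tuning, this term would be added to the ordinary LM loss, e.g.
#   total_loss = lm_loss + overlap_weight * self_other_overlap_loss(...)
```

The point is not this particular loss, but that "embody the quality in the system" can cash out as ordinary, testable training code rather than only as grand theory.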
Some of you will protest that this is not the effective way to stop the AI race. However, on the scale of your personal involvement, I strongly believe that becoming another AI safety researcher and then eventually bit-flipping into joining Anthropic or another lab is less valuable than pursuing these more neglected approaches. This is also probably what I will be working on personally. If you have sources of funding or are interested in working together, please let me know.
- ^
If I recall, the quote comes from him telling the story of how he met his wife. At one point she writes to him that they can't go out because she really dislikes him, and he replies that he's got the magnitude right; now he just needs to flip the sign.
- ^
Perhaps, given how rationalism was founded in part to address the threat of superintelligence, I should not be surprised.
- ^
I say "we" because this is literally what I've been going through for several years since I discovered AI safety.
- ^
To prevent this post from being "all criticism, no action": my working definition of love is an extension of the Markov blanket for the self-concept in your head to cover other conceptual objects. A thing that you love is something that you take into your self-identity. If it does well, you do well. If it is hurt, you are hurt. This explains, for example, how you can love your house or your possessions even though they are obviously non-sentient, and why losing your favourite pen feels bad even if that makes no sense and the pen is clearly just a mass-produced plastic object.
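For what it is worth, here is one toy way that definition could be written down. The set B, the per-object welfare term w, and the additive form are purely my own illustrative assumptions, a sketch of the intuition rather than a claim about the right formalisation.

```latex
% Toy formalisation (illustrative assumptions only): B is the set of
% conceptual objects inside the extended self-boundary, and w(x, s) is the
% perceived welfare of object x in world-state s.
\[
  W_{\mathrm{self}}(s) = \sum_{x \in B} w(x, s),
  \qquad
  \text{you love } x \iff x \in B .
\]
% On this reading, harm to any x in B lowers W_self directly, which is the
% sense in which "if it is hurt, you are hurt", even for a non-sentient pen.
```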
Assumption 1: Most of us are not saints.
Assumption 2: AI safety is a public good.[1]
[..simple standard incentives..]
Implication: The AI safety researcher, eventually finding himself too unlikely to be individually pivotal on either side, may quite 'rationally'[2] switch to 'standard' AI work.[3]
So: a rather simple explanation seems to suffice to make sense of the big-picture pattern you describe.
That doesn't mean the inner tension you point out isn't interesting. But I don't think very deep psychological factors are needed to explain the general 'AI safety becomes AI instead' tendency, which I had the impression the post was meant to suggest.
- ^
Or, unaligned/unloving/whatever AGI a public bad.
- ^
I mean: individually 'rational' once we factor in another trait. Assumption 1b: The unfathomable scale of potential aggregate disutility from AI gone wrong bottoms out into a constrained 'negative' individual utility, in terms of the emotional value non-saint Joe places on it. So a 0.1-permille probability of saving the universe may individually rationally be dominated by mundane stuff like having a still somewhat cool and well-paying job or something.
- ^
The switch may psychologically be even easier if the employer had started out as actually well-intentioned and may now still have a bit of an ambiguous air about it.
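Expressed as a toy inequality (my own illustration of the trade-off described in the Assumption 1b footnote above, with a capped emotional stake and a mundane job payoff as the assumed quantities):

```latex
% If p is the individual researcher's probability of being pivotal, \bar{V}
% the bounded emotional value placed on saving the universe, and u_job the
% value of the cool, well-paying job, the comparison
\[
  p \cdot \bar{V} \;\lesssim\; u_{\text{job}},
  \qquad \text{e.g.}\quad 10^{-4} \cdot \bar{V} \lesssim u_{\text{job}},
\]
% can easily hold for non-saint Joe, even though the aggregate stakes are
% unfathomably larger.
```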