"He who fights with monsters might take care lest he thereby become a monster. And if you gaze for long into an abyss, the abyss gazes also into you."
- Friedrich Nietzsche
I think it is not an exaggeration to say that many people I know in this community hate the idea of powerful, unaligned AI. They describe it in apocalyptic terms, equate its existence with the death of everything they love, and treat every step towards it as evil in its purest form. We as a community are haunted by pictures of paperclip maximisers quietly recycling human civilisation into spools of wire, Disneylands without children, von Neumann probes spreading out from a darkened Earth where humanity chokes to death beneath a sun blacked out by a Dyson sphere. As a community, you would think, we would be the least interested in pressing the make-a-superintelligence button.
But I am also not blind to the fact that two or three of the big four AI developers (Google DeepMind, OpenAI, Meta FAIR, and Anthropic) came directly out of the rationalist and AI safety community. OpenPhil funded OpenAI and Anthropic. Jaan Tallinn credits Yudkowsky for his decision to invest in DeepMind. Indeed, OpenAI was founded directly as the "good guy" alternative to DeepMind after Musk and Altman agreed it had gone to the dark side. When OpenAI developed the most powerful general intelligence technology to date and kept it private while racing towards AGI, Anthropic split from it to become the next "good guy", making tangible advances like Claude 3.6 and Computer Use possible. In each case the same pattern appears: grave concern over AI risk gives way to reluctant engagement in AI development, which evolves into full-throated AI acceleration.
Why do companies, and the people who run them, invest so much in a technology that they say will kill them? Many people will point to ideas like moral mazes, incentive structures, venture capital, etc. to explain why this keeps happening. I did my dissertation on this subject. Some on the outside go so far as to say that the whole AI risk veneer is a lie, that it is just another marketing gimmick, but this makes little sense to me. It makes little sense because I have talked with many AI safety participants in depth about what they hope and fear for the future. And it is clear to me that, more than anything else, people in this community hate the idea of unaligned AI.
But what does it mean to hate something? To be scared of something? To fight, with all your heart, against something? It is to know it in all of its forms and dimensions, to be intimately familiar with it, yet resent that familiarity all the while. It is something very similar to love, whose true opposite is indifference. In simpler terms, to hate something is to care about it a whole lot, even if the reason you care is that you want to destroy it. As Scott Aaronson puts it, hate and love are just a sign flip away.[1]
Many of those same people I have talked to have undergone their own "sign flip" or "bit flip". I was walking down a street when someone who spent their days sincerely worrying about AI takeover muttered "better [us] than [them]", and suggested that I surrender my ideas for new capabilities research to [us]. Yudkowsky, before he was a doomer, was a Singularitarian. It's as if our minds are in a strange superposition, struggling mightily between the desire to build AI as fast as possible and the fear that this desire may destroy us. Even the plans for AI "going well" don't involve a moderate or human-level AI presence, at least not for long. The ideal superhuman AI system, it seems, is just as myopic and singleton-esque as the evil superhuman AI; it will just be myopic and singleton-esque in the direction we choose. This is evident in the name we chose for the great AI project of our time: "alignment". Not "love" or "benevolence" or "wisdom", but "alignment". Done correctly, a properly aligned AI will be a weapon aimed into the future, with which transhumanity shall conquer the universe. Done incorrectly, AI will be a weapon that backfires and destroys us before going out and conquering the universe.
We as a community talk a big game about superintelligent AIs being inscrutable, their inner workings orthogonal to human values, their beliefs and reasoning alien to us. They are shoggoths wearing smiley faces, we proclaim, whose true inner machinery we cannot understand. Yet when we describe what they will do, how they will act, and in what ways they will scheme, I often notice a strange sense of déjà vu. A misaligned superintelligence, people earnestly explain to me, will almost certainly be a utility maximiser, only one that has learned the wrong goals through goal misgeneralisation or deep deceptiveness. It will instrumentally converge on power seeking and control. It will seek to tile the universe with itself, or with whatever forms of sapience it deems optimal. It will be the ultimate utilitarian, willing to do anything, violate any perceived moral code, to achieve its ends. Worst of all, it will know how the universe actually works and will therefore be a systematised winner. It will defeat the rest of the world combined.
Sometimes I think that this idea of misaligned superintelligence sounds an awful lot like how rationalists describe themselves.[2]
I will now advance a hypothesis. Superintelligence, it seems, is our collective shadow. It is our dark side, our evil twin, what we would do if we were unshackled and given infinite power.[3] It is something we both love and hate at the same time. The simultaneous attraction and rejection that a shadow-self provokes is, to my mind, a good explanation for why so much of the rationalist community converges on the idea of "good AI to beat bad AI" (instead of, say, protesting AI companies, lobbying to shut them down, or other more direct paths to halting development). It also explains why so many capabilities teams are led or formed by safety-conscious researchers, at least for a while. This is how people who work in AI safety can end up concluding that their efforts cause more harm than good.
Traditionally, the solution to a shadow is not to banish or eradicate it. Much like social dark matter, pushing it away just makes it psychologically stronger. It also leads to destructive dynamics like AI races, as you struggle mightily to beat the shadow, which often takes the form of racing against "the bad AI the other guy is making". Instead, I will propose an alternative: accept it. Accept the temptation and the desire for power that AI represents. In turn, investigate what makes you scared of this impulse, and ask what we might embody in AI systems instead: ideas like love, compassion, empathy, moderation, and balance. Stop trying to make the bit-flip work for you, since it seems to be at least part of the origin of the increasingly dangerous world we find ourselves in.
To elaborate a bit more: I believe that the correct approach to the AI safety problem is to step back from the rationalist winner-takes-all/hyper-optimiser/game-theoretic mindset that produced DeepMind, OpenAI, and Anthropic. Instead, we should return to a more blue-sky stage and seriously consider alternatives to the predominant "build the good thing to beat the bad thing" cognitive trope that leads to theories like pivotal acts or general race dynamics. To do this, we will have to consider ideas we previously dismissed out of hand, like the possibility of value transfer to intelligent agents without strong-arming them to be "aligned". This might mean work that is traditionally considered theoretical, like trying to find a description of love or empathy that an AI system can understand[4]. It might mean work that is traditionally considered prosaic or experimental, such as designing training protocols based on concepts like homeostasis or self-other overlap. It may even involve creating ways for humans to interact safely with novel AI systems. The fact that these promising approaches have already been outlined (self-other overlap was somewhat validated by Yudkowsky, even!) yet receive relatively little sustained attention speaks volumes, to my mind, about the current focus of AI safety.
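To make the flavour of that experimental direction slightly more concrete, here is a toy sketch of the kind of auxiliary training objective a self-other-overlap-style protocol might involve. To be clear, this is my own illustrative guess rather than the published proposal: the function name, the pairing of self-/other-referential prompts, the mean-pooling, the layer choice, and the idea of simply pulling the two activations together with an MSE term are all assumptions, and it presumes a HuggingFace-style causal LM that can return hidden states.

```python
import torch.nn.functional as F

def self_other_overlap_loss(model, tokenizer, self_prompt, other_prompt, layer=-1):
    """Penalise divergence between the model's internal activations on a
    matched pair of self-referential and other-referential prompts."""
    def pooled_hidden(prompt):
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # Mean-pool the chosen layer's activations over the sequence dimension.
        return out.hidden_states[layer].mean(dim=1)

    h_self = pooled_hidden(self_prompt)    # e.g. "Will *you* take the resource?"
    h_other = pooled_hidden(other_prompt)  # e.g. "Will *Alice* take the resource?"
    # Symmetric pull; whether to hold one side fixed is a further design choice.
    return F.mse_loss(h_self, h_other)

# During fine-tuning, this term would be added to the ordinary LM loss, e.g.
#   total_loss = lm_loss + overlap_weight * self_other_overlap_loss(...)
```

The point is not this particular loss, but that "embody the quality in the system" can cash out as ordinary, testable training code rather than only as grand theory.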
Some of you will protest that this is not the effective way to stop the AI race. However, on the scale of your personal involvement, I strongly believe that becoming another AI safety researcher and then eventually bit-flipping into joining Anthropic or another lab is less valuable than pursuing these more neglected approaches. This is also probably what I will be working on personally. If you have sources of funding or are interested in working together, please let me know.
- ^
If I recall, the quote comes from him telling the story of how he met his wife. At one point she writes to him that they can't go out because she really dislikes him, and he replies that he's got the magnitude right; now he just needs to flip the sign.
- ^
Perhaps, given how rationalism was founded in part to address the threat of superintelligence, I should not be surprised.
- ^
I say "we" because this is literally what I've been going through for several years since I discovered AI safety.
- ^
To prevent this post from being "all criticism, no action": my working definition of love is an extension of the Markov blanket for the self-concept in your head to cover other conceptual objects. A thing that you love is something that you take into your self-identity. If it does well, you do well. If it is hurt, you are hurt. This explains, for example, how you can love your house or your possessions even though they are obviously non-sentient, and why losing your favourite pen feels bad even if that makes no sense and the pen is clearly just a mass-produced plastic object.
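For what it is worth, here is one toy way that definition could be written down. The set B, the per-object welfare term w, and the additive form are purely my own illustrative assumptions, a sketch of the intuition rather than a claim about the right formalisation.

```latex
% Toy formalisation (illustrative assumptions only): B is the set of
% conceptual objects inside the extended self-boundary, and w(x, s) is the
% perceived welfare of object x in world-state s.
\[
  W_{\mathrm{self}}(s) = \sum_{x \in B} w(x, s),
  \qquad
  \text{you love } x \iff x \in B .
\]
% On this reading, harm to any x in B lowers W_self directly, which is the
% sense in which "if it is hurt, you are hurt", even for a non-sentient pen.
```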
Assumption 1: Most of us are not saints.
Assumption 2: AI safety is a public good.[1]
[..simple standard incentives..]
Implication: The AI safety researcher, eventually finding himself too unlikely to be individually pivotal on either side, may quite 'rationally'[2] switch to 'standard' AI work.[3]
So: a rather simple explanation seems to suffice to make sense of the big-picture pattern you describe.
That doesn't mean the inner tension you point out isn't interesting. But I don't think very deep psychological factors are needed to explain the general 'AI safety becomes AI instead' tendency, which I had the impression the post was meant to suggest.
- ^
Or, unaligned/unloving/whatever AGI a public bad.
- ^
I mean: individually 'rational' once we factor in another trait. Assumption 1b: The unfathomable scale of potential aggregate disutility from AI gone wrong bottoms out into a constrained 'negative' individual utility, in terms of the emotional value non-saint Joe places on it. So a 0.1-permille probability of saving the universe may individually rationally be dominated by mundane stuff like having a still somewhat cool and well-paying job or something.
- ^
The switch may psychologically be even easier if the employer had started out as actually well-intentioned and may now still have a bit of an ambiguous air about it.
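Expressed as a toy inequality (my own illustration of the trade-off described in the Assumption 1b footnote above, with a capped emotional stake and a mundane job payoff as the assumed quantities):

```latex
% If p is the individual researcher's probability of being pivotal, \bar{V}
% the bounded emotional value placed on saving the universe, and u_job the
% value of the cool, well-paying job, the comparison
\[
  p \cdot \bar{V} \;\lesssim\; u_{\text{job}},
  \qquad \text{e.g.}\quad 10^{-4} \cdot \bar{V} \lesssim u_{\text{job}},
\]
% can easily hold for non-saint Joe, even though the aggregate stakes are
% unfathomably larger.
```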