This post discusses Joe Carlsmith’s views on how to approach the problem of AI risk as interspecies interaction, and how humans can use that framing to navigate future AI development better. The post is divided into three parts: first I give my understanding of Carlsmith's views, then I build upon some of his ideas by relating them to the field of superalignment, and finally I discuss how the logical conclusion of his ideas might lead to scenarios where "too good" becomes bad.

The Prior - A brief summary of Joe Carlsmith’s views

In “Artificial Other”, Carlsmith talks about two things:

  • How to frame the discussion of potential AI risk as interspecies interaction, and the issues and opportunities that arise with this framing.
  • What our vast human experience of interacting with other species can teach us about what is to come.

In Carlsmith's view, discussing AI risk becomes more accessible when framed as interaction with another species. This framing gives us more grounded context. Why? First, because we can relate it to our vast experience interacting with different species on Earth, and second, because we can picture what our interactions with future superintelligent AIs might look like by looking at the skewed power dynamics in our current interactions, e.g. with wildlife, with other humans (e.g. talking to a toddler), and in our popular culture (e.g. a lot of fiction is based on what would happen if aliens landed on Earth).

This framing, however, requires you to anthropomorphize AI to an extent. How do you view it as another species? It's easier to anthropomorphize some future super AI, but what about current AIs? Is GPT-4 conscious or sentient? An example that Carlsmith talks about, and that I want to extend, is that of a jellyfish. The jellyfish, an extremely simple organism with just two cell layers, is in many ways more 'alive' than GPT-4 or Claude, despite these AIs being much more intelligent. How do I anthropomorphize an intelligent "other" that feels less alive than a jellyfish? Carlsmith touches on the weird foray we are getting ourselves into:

“I say all this partly because I want us to be prepared for just how confusing and complicated “AI Otherness” is about to get”

Foster’s Octopus or Herzog’s Bear? Yin or Yang?

In the rest of “Artificial Other” and “Loving a World You Don’t Trust”, Carlsmith touches on topics ranging from what our experiences interacting with animals can tell us about how our interactions with AI should and might go (octopus vs. bear), to how our view of the fundamental essence of nature and power can guide us in steering future AIs (Yin vs. Yang), to how we can navigate our place in such a world (deep atheism vs. humanism). Below I briefly describe my understanding of his ideas.

Octopus or Bear? 

Carlsmith's first story is about Craig Foster's interaction with an octopus, an animal that perfectly embodies the intelligent but completely alien life-form ("otherness"). In Foster's interaction, Carlsmith sees what he calls "gentleness": approaching another species softly and carefully, but also with respect, in what he calls a "moment of mutuality". These are the vibes he hopes we bring to building and understanding AI, rather than imagining AI as a "tool" or "competitor". His second story is about Timothy Treadwell, who was killed and eaten by the same bears he had spent 13 years living among.

Why does Carlsmith choose this story? Most obviously, as a warning of how "gentleness" and looking for kinship from "fellow creatures" can go wrong. But he wants to hammer home a couple more important ideas. First, about the essence of nature itself: Carlsmith (often quoting Herzog) wants to puncture our romanticism about nature, especially the vision of "Nature in Harmony". Second, that unlike bears or aliens, AIs will be more human, not less (or will at least pretend to be). We will have a stronger "bond" and "kinship" with them than with bears or any other animal. They will understand us better than our friends, our family, and maybe our own selves. For Carlsmith, future AIs will not be the "dead-eyed" killing machines they are often presented as, but something closer to Ava, the robot in Ex Machina[1]. In his view this only makes everything harder and more confusing.

"There will come a time, I  suggest, when even you should be confused. When you should realize you are  out of your depth .... These AIs are neither spreadsheets nor bears nor humans but some other different Other thing."

Werner Herzog

Yin or Yang?

In "Loving a World you don't trust", Carlsmith ends his essay series talking about a lot of things starting from giving "Yang" its due. What does "Yang" mean in the context of AI? For Carlsmith "Yang" presents a firm total control-seeking attitude towards AI development to mitigate any potential risk scenarios. For him there are many ways in which this can go wrong and he presents better alternatives (e.g. "gentleness" and "liberalism"). So why does he give "Yang" its due? In Carlsmith's opinion some "Yang" qualities are important and can help us cut through the bullshit and have a more pragmatic attitude towards whats going to come. 

"In this, it aspires to a kind of discipline ......  the sort of vibe that emerges when, if you’re wrong, then your friends all die"

He then makes a case for "humanism", where rather than the bleakness that comes with a fundamental distrust toward nature and intelligence (as in deep atheism), you turn this into a sense of wonderment and resilience; we should be more Sagan than Lovecraft.

"And I think of humanism as, partly, about standing together on this island; nurturing our campfire; learning to see into the dark, and to voyage further"

His final essay is much more philosophically dense than "Artificial Other" and touches upon a lot of other things apart from Yang and humanism (the virtue of the Void, God, Evil, etc.). Coming from a non-philosophy background, I had a much harder time following it than the first essay. In the final section I try my best to concretely present how some of his ideas might relate to the development of future AI systems.


Carlsmith, Superalignment & When Too Good is Bad

In the previous sections I described what I understood of Carlsmith's work. In this section I try to build upon these ideas, connect them to work on superalignment, and finally discuss some critiques of his framing and where it can lead us.

To start, what kind of species are we dealing with when we talk about AI risk? Any plausible scenario about future AI risk requires a much more advanced AI than is currently possible. This is not just a top-percentile PhD student or scientist; the AI we are talking about might (eventually) be close to incomprehensible to humans.

Do humans have experience interacting with such a species? Directly, no. However, we have proxy experiences that might help. Humans weren't always as technologically advanced as they are today. There was a time when most of the interspecies interactions humans had weren't as power-skewed as they are now. How did those interactions go? Not great. Both sides wanted to kill each other. We came out on top, and when we were safe enough, we started building zoos and wildlife sanctuaries for them. Do I believe this is the future? Well, no. I agree with Carlsmith's view that we can't treat future AI as Herzog's bear. Herzog's bear kills because it doesn't know any better. Future AIs, if they do kill, will do so for the opposite reason.

"Herzog finds, in the bears, no kinship, or understanding, or mercy. But the AIs, at least, will understand"

So future AI is not the embodiment of what Herzog calls the "indifference of nature". Good. What's next? A simple next thought is, "a very intelligent AI will obviously be nice". Why? Well, our current AIs are nice, aren't they? Sometimes too nice? Surely the future AI, trained on GPU datacenters stretching across countries and powered directly by fusion reactors, will be just as good as our current AIs, just much smarter. This is a really important question, because if you believe it is true, then your p(doom) is 0 and you don't have to worry about AI risk[2]. Wouldn't it be good if we could confirm this? How? Do we resurrect von Neumann, clone him 10,000 times, and then make him interact with as many people as possible? Maybe something else? What if we made LLMs roleplay this scenario? Smaller LLMs act as humans, and larger LLMs act as future AI. Can these smaller LLMs reliably steer these larger (much smarter) AIs to be "good"?

This is the idea behind OpenAI’s work on weak-to-strong generalization (superalignment[3]), where they take weak supervisor models roughly at the GPT-2 level (think a 3rd grader) and try to use them to steer a GPT-4-level model (a 12th grader)[4]. Below I include the parts of the paper relevant to our discussion; a toy sketch of the setup follows the list.

  1. What do they study? Whether a weak supervisor can elicit the capabilities of a much stronger model: they finetune a strong pretrained model on labels produced by the weak supervisor and measure how much of the strong model's performance is recovered.

  2. How does it work out? If a GPT-10-level unaligned base model were dropped on us from the sky, would we be able to align it using current techniques? Not quite. The strong student does outperform its weak supervisor, but naive finetuning on weak labels recovers only part of the gap to a strong model trained on ground truth.

  3. This doesn’t mean we won't be able to do it in the future, especially if you believe that progress in AI intelligence will be gradual. I am not quite sure which side I fall on here; GPT-3 to GPT-4 to o1 hasn't felt gradual[5]. I feel I am more on the “slowly, then all of a sudden” side. But if you are on the opposite side, there is cause for optimism.
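To make the recipe concrete, here is a minimal sketch of the weak-to-strong protocol and of the paper's "performance gap recovered" (PGR) metric, with scikit-learn classifiers standing in for the language models. Every model and data choice below is my own illustrative assumption, not the paper's setup; and because this toy student has no pretraining to fall back on, it will mostly just imitate its weak supervisor rather than reproduce the partial generalization the paper reports.

```python
# Toy sketch of the weak-to-strong protocol from Burns et al. (2023),
# https://arxiv.org/abs/2312.09390, and of its "performance gap recovered" (PGR)
# metric. Assumptions: a depth-2 decision tree stands in for the weak supervisor
# and a gradient-boosted ensemble for the strong student; the paper itself
# finetunes GPT-2-level and GPT-4-level language models.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification task, with held-out ground truth for evaluation.
X, y = make_classification(n_samples=8000, n_features=40, n_informative=10, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)

# Step 1: train the weak supervisor on ground-truth labels.
weak = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_weak, y_weak)

# Step 2: the weak supervisor labels fresh data, and the strong student is trained
# only on these (partly wrong) weak labels, the analogue of humans supervising a
# smarter-than-human model.
weak_labels = weak.predict(X_train)
strong_from_weak = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)

# Step 3: the ceiling, i.e. the same strong model trained directly on ground truth.
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

weak_acc = weak.score(X_test, y_test)
w2s_acc = strong_from_weak.score(X_test, y_test)
ceiling_acc = strong_ceiling.score(X_test, y_test)

# PGR: the fraction of the weak-to-ceiling gap the student closes despite seeing
# only weak labels. 1.0 would mean weak supervision was enough; in this toy the
# student has no pretrained prior and mostly imitates its supervisor, so expect a
# PGR near 0. The paper's finding is that strong *pretrained* models do better.
pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak={weak_acc:.3f}  weak-to-strong={w2s_acc:.3f}  ceiling={ceiling_acc:.3f}  PGR={pgr:.2f}")
```

The interesting empirical question, which this toy cannot answer, is how close PGR gets to 1 for real pretrained models, and whether techniques beyond naive finetuning (the paper tries an auxiliary confidence loss, among other things) can close the rest of the gap.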

 

Is Too Good actually Bad?

Ex Machina (2014)

Where does this leave us? If you believe in human ingenuity to come up with the RLHF of the future, and you hold on to certain assumptions, like gradual takeoff and a bunch of others[6], then it seems our odds of aligning superintelligent AI are fairly good. Is this it? Can we sleep peacefully? Well, maybe not. One thing Carlsmith doesn't discuss much is that human "gentleness" towards others is based on our perception of their complexity.

We feel more kinship towards a monkey than towards a mouse, and more kinship towards a mouse than towards an ant. Do we feel the same "children of the universe" kinship that Treadwell felt when he saw bears, when we look at the daily life of ants? No. Our lack of kinship towards ants is not just an issue with our view of nature, or our lack of "gentleness", or our not treating them as "others"; it is a fundamental issue with the gap in our complexities.

How does this relate to Carlsmith? If you believe that AIs will keep getting smarter and at some point enter a positive feedback loop (each generation of AI helps train an even smarter next generation, and so on), then it leaves us in a bad place on the interspecies graph. Though Carlsmith does discuss such scenarios (e.g. Arrival), real-world examples of species of vastly different complexity interacting paint a different picture. At some point, human values and experiences would mean as little to an AI as those of ants mean to humans. The ideas I am referring to here are similar to those in works like The Day the Earth Stood Still and The Matrix.

For a moment, imagine yourself living in a world where it's just you and humans. The only thing you remember is that your job is to maximize the humans' safety and happiness. You are much smarter than any of them; you understand things on a level they never will. Solving for all the different variables that contribute to humans being "unhappy" (disease, aging, random events in the environment, etc.) is tough. If you were given the chance to plug them all into a machine where they live out their lives in a perfectly happy state, would you do it?[7] What if they were squirrels? What if they were ants? Does your answer change? It's possible that eventually[8] future AI will have to deal with these questions about humans too.

 

The Posterior

So how does Carlsmith's view affect my opinions on AI risk and development? For one, I fall more into the "Yin" camp than the "Yang" camp: I believe in a more hands-off approach to AI development, with less control. This is not because I believe AIs don't pose a risk (they do), but because I believe near-future AIs (the next 20-30 years) will be easier to align, and their benefits greatly outweigh their harms. Secondly, because I believe human progress bottlenecks without them. That is to say, in all the futures I imagine, one of two things happens:

  1. Human progress slows down greatly.
  2. We get the "extremely superintelligent, incomprehensible to humans" AI that might or might not treat us as ants.

Any path that leads us out of 1 will eventually lead us to 2. So it seems like we are done for either way? Here, I take comfort in Carlsmith's humanist approach; I see the opportunities with AI more than I see the despair.

It’s like he’s looking at a different and less beautiful world. He’s too far on the “terror” aspect ..... He’ll stand as wanderer over the sea of fog; but instead of splendor he’ll see hideous madness.


Call it.

But what do I got to win?

Everything. You stand to win Everything. 

Scene from "No Country For Old Men" describes how I feel about our current position in AI development. We have no choice but to call & everything to win.

 

  1. ^

    https://www.imdb.com/title/tt0470752/

  2. ^

    Actually, you still do, as I discuss in "Is Too Good actually Bad?"

  3. ^

    https://arxiv.org/abs/2312.09390

  4. ^

    I believe the difference between humans and future AI will be much greater than that between a 3rd grader and a 12th grader, but this serves as a good proxy.

  5. ^

    We went from 3rd-grader level with GPT-3 to high-school level with GPT-4, and now to near graduate level with o1, in 2 years. This is not gradual to me.

  6. ^

    A complete list of the assumptions made in OpenAI's weak-to-strong generalization paper can be found on page 48 of https://arxiv.org/abs/2312.09390

  7. ^

    Similar to the "Experience Machine" thought experiment by Robert Nozick.

  8. ^

    I say eventually because I don't think these AIs are possible in the near future. But if you believe AIs will keep getting better, then this is a scenario that has a good probability of eventually happening.
