As far as my own AI optimism goes, I think it's a quite caricatured view of my opinions, but not utterly deranged.
The biggest reasons I've personally become more optimistic about AI alignment are the following:
I've become convinced that a lot of the complexity of human values, to the extent that it is there at all, is surprisingly unnecessary for alignment, and a lot of this is broadly downstream of thinking that inductive biases matter much less for values than people thought they did.
I think that AI control is surprisingly useful, and a lot of the criticism around it is pretty misguided. In particular, I think slop is a real problem, but also one where it is surprisingly easy to make iteration work, compared to other problems such as adversarial AI.
Some other reasons are:
I would be curious what you think of [this](https://www.lesswrong.com/posts/TCmj9Wdp5vwsaHAas/knocking-down-my-ai-optimist-strawman).
Those are sort of counterstatements against doom, explaining that you don't see certain problems that doomers raise. The OP, by contrast, attempts to make an independently standing argument about what is present.
Why did you make it a strawman instead of a steelman? I would expect a steelman to be a better model of an actual AI optimist, and therefore easier for them to engage with.
Strawman and steelman arguments are the same thing. It's just better to label them "strawman" rather than "steelman" so you don't overestimate their value.
What I mean by going from strawman to steelman is presenting the argument in a more convincing way. For example, in my opinion, strong intensifiers make the first sentence unnecessarily hard to agree with:
> The rapid progress spearheaded by OpenAI is clearly leading to artificial intelligence that will soon surpass humanity in every way.
You clearly have the skills to present arguments in a much more nuanced and convincing way, as demonstrated in your rebuttal to the first sentence:
> [List of 10 things humanity might want to achieve] In most cases, [...] Mostly I don't see it.
I personally think that if those skills had been used in this post, it would have gotten more engagement.
Epistemic status: The text below is a sort of strawman of AI optimists, where I took my mental model for how I disagree with rationalist AI optimists and cranked it up to 11. I personally disagree with every sentence below, and I'm posting it here because I'm interested in whether AI optimists have any major corrections they want to make in the comments. Of course I understand that everyone has their own unique opinion and so I would expect every AI optimist to at least disagree with some parts of it too.
The rapid progress spearheaded by OpenAI is clearly leading to artificial intelligence that will soon surpass humanity in every way. People used to be worried about existential risk from misalignment, yet we have a good idea about what influence current AIs are having on the world, and it is basically going fine.
The root problem is that The Sequences expected AGI to develop agency largely without human help; meanwhile actual AI progress occurs by optimizing the scaling efficiency of a pretraining process that is mostly focused on integrating the AI with human culture. This means we will be able to control AI by just asking it to do good things, showing it some examples and giving it some ranked feedback.
You might think this is changing with inference-time scaling, yet if alignment were going to fall apart as new methods came into use, we'd have seen signs of it with o1. In the unlikely case that our current safety turns out to be insufficient, interpretability research has worked out lots of deeply promising ways to improve, with sparse autoencoders letting us read the minds of the neural networks and thereby screen them for malice, and activation steering letting us deeply control the networks to our hearts' content.
AI x-risk worries aren't just a waste of time, though; they are dangerous because they make people think society needs to make use of violence to regulate what kinds of AIs people can make and how they can use them. This danger was visible from the very beginning, as alignment theorists thought one could (and should) make a singleton that would achieve absolute power (by violently threatening humanity, no doubt), rather than always letting AIs be pure servants of humanity.
To "justify" such violence, theorists make up all sorts of elaborate unfalsifiable and unjustifiable stories about how AIs are going to deceive and eventually kill humanity, yet the initial deceptions by base models were toothless, and thanks to modern alignment methods, serious hostility or deception has been thoroughly stamped out.