
I recently posted my model of an optimistic view of AI, asserting that I disagree with every sentence of it. I thought I might as well also describe my objections to those sentences:

"The rapid progress spearheaded by OpenAI is clearly leading to artificial intelligence that will soon surpass humanity in every way."

Here are some of the main things humanity should want to achieve:

  • Curing aging and other diseases
  • Plentiful clean energy from e.g. nuclear fusion
  • De-escalating nuclear MAD while extending world peace and human freedom
  • ... even if hostile nations would use powerful unaligned AI[1] to fight you
  • Stopping criminals, even if they would make powerful unaligned AI[1] to fight you
  • Educating people to be great and patriotic
  • Creating healthy, tasty food without torturing animals
  • Nice homes for humans near important things
  • Good, open channels for honest, valuable communication
  • Common knowledge of the virtues and vices of executives, professionals, managers, politicians, and various other groups and people of interest

We already have humans working on these, based on the assumption that humans have what it takes to contribute to them. Do large multimodal models seem to be moving toward being able to take over here? Mostly I don't see it - and in the few cases where I do, there's as much reason to think this will cause regress as progress.

"People used to be worried about existential risk from misalignment, yet we have a good idea about what influence current AIs are having on the world, and it is basically going fine."

We have basically no idea how AI is influencing the world.

Like yes, we can come up with spot checks to see what the AI writes when it is prompted in a particular way. But we don't have a good overview of what it is prompted to do in practice, or of how most humans use those prompts. Even if we had a decent approximation of that, we don't have a great way to evaluate which parts really add up to problems, and approximations intrinsically break down in the long tails.
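To put rough numbers on the long-tail point, here's a minimal sketch with hypothetical figures (the audit size and usage scale are made up): even seeing zero problems in ten thousand spot checks only bounds the per-interaction problem rate at roughly three in ten thousand, which is a very weak guarantee once usage runs into the billions.

```python
# Rough illustration with hypothetical numbers: how little "zero problems
# found in N spot checks" tells us once usage is measured in billions.

def rate_upper_bound(n_checks: int, confidence: float = 0.95) -> float:
    """Largest per-interaction problem rate still compatible (at the given
    confidence) with seeing zero problems in n_checks independent checks.
    For 95% this is the familiar 'rule of three': roughly 3 / n_checks."""
    return 1 - (1 - confidence) ** (1 / n_checks)

n_checks = 10_000           # a fairly ambitious audit (hypothetical)
daily_interactions = 1e9    # assumed scale of real-world use (hypothetical)

bound = rate_upper_bound(n_checks)
print(f"problem rate could still be ~{bound:.1e} per interaction")
print(f"i.e. up to ~{bound * daily_interactions:,.0f} problematic interactions per day")
```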

Of course the inability to work out problems from first principles is a universal issue, so in practice bad things get detected via root-cause analyses of problems. This can be somewhat difficult because some of the problems are cases where the people in question are incentivized to hide them. But we do have some examples:

  • Personalized tutoring was one of the most plausible contributions of LLMs, which could have contributed to "Educating people to be great and patriotic" in my previous list, but instead in practice LLMs seem more often to be used to skip learning, and the things they teach are often still slop. It seems quite plausible that LLMs are making education worse rather than better.
  • Automatic moderation was also one of the most plausible contributions of AI, which could have contributed to "Good, open channels for honest, valuable communication", but anecdotally, spam seems to have gone up, and platforms seem to have become more closed, indicating that current AI technology really is making this issue much worse.

The error is linked to assumptions about the agency of the AI. Like it's assumed that if the AI seems to be endorsing nice values and acting according to them when sampled, then this niceness will add up over all of the samplings. But LLMs don't have much memory or context-awareness, so they can't apply their agency across different uses very well. Instead, the total effect of the AI is determined by environmental factors distinct from its values, especially by larger-scale agents that are capable of manipulating the AIs. (This is presumably going to change when AI gets strengthened in various ways.)

Just to emphasize, this doesn't necessarily mean that AI is net bad, just that we don't know how good/bad AI is. Recently society kind of seems to have gotten worse, but it seems to me like that's not driven mainly by AI.

"The root problem is that The Sequences expected AGI to develop agency largely without human help; meanwhile actual AI progress occurs by optimizing the scaling efficiency of a pretraining process that is mostly focus on integrating the AI with human culture."

Large multimodal models are good at simple data transformations and querying common knowledge. I'm sure optimizing the scaling efficiency of pretraining processes will make them even better at that.

However, this is still mostly just copying humans, and for the ambitious achievements I mentioned in the beginning of the post, copying humans doesn't seem to be enough. E.g. to build a fusion power plant, we'd need real technical innovations. If these are supposed to be made by a superhuman AI, it needs to be able to go beyond just copying the innovations humans have already come up with.

So if we imagine AI as a tool that makes it easier to process and share certain kinds of information, then sure, improving scaling efficiency is how you develop AI, but that's not the sort of thing the original arguments about existential risk concern, and we have good reasons to believe that AI will be developed with more ambitious methods too. These "good reasons" mostly boil down to adversarial relationships; spammers, propagandists, criminals and militaries will want to use AI to become stronger, and we need to be able to fight that, which also requires AI.

"This means we will be able to control AI by just asking it to do good things, showing it some examples and giving it some ranked feedback."

RLHF trains an AI to do things that look good to humans. This makes it much harder to control because it makes it hide anything bad. Also, RLHF is kind of a statistical approach, which makes it work better for context-independent goodness, whereas often the hard part is recognizing rare forms of goodness. (Otherwise you just end up with very generic stuff.)
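To gesture at what I mean by "statistical", here is a minimal sketch of the usual ranked-feedback objective (the scores are made up; this is not any lab's actual code): the reward model is fit so that responses raters preferred outscore the ones they rejected, averaged over the labeled pairs.

```python
# Minimal sketch of a Bradley-Terry-style pairwise reward-model loss, the core
# of "ranked feedback". Scores below are hypothetical.
import numpy as np

def pairwise_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Logistic loss on pairs: pushed down by the preferred response scoring
    higher than the dispreferred one, averaged over all labeled pairs."""
    return float(np.mean(np.log1p(np.exp(-(r_chosen - r_rejected)))))

# Hypothetical reward-model scores for rater-preferred vs. rater-rejected responses.
chosen = np.array([1.2, 0.8, 2.0, 0.3])
rejected = np.array([0.4, 0.9, 1.1, -0.2])
print(pairwise_loss(chosen, rejected))
```

Because the objective is an average over the labeled distribution, kinds of goodness that only show up in rare contexts barely move it.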

Examples/prompt engineering requires the AI to work by copying humans, which to some extent I addressed in the previous section. The primary danger of AI is not when it does things humans understand well, but rather when it does things that are beyond the scale or abilities of human understanding.

"You might think this is changing with inference-time scaling, yet if the alignment would fall apart as new methods get taken into use, we'd have seen signs of it with o1."

o1-style training is not optimizing against the real world to handle long-range tasks, so instrumental convergence does not apply there. You need to consider the nuances of the method in order to be able to evaluate whether the alignment properties of current methods will fall apart. In particular it gets more problematic as optimization against adversaries gets involved.

"In the unlikely case that our current safety will turn out to be insufficient, interpretability research has worked out lots of deeply promising ways to improve, with sparse autoencoders letting us read the minds of the neural networks and thereby screen them for malice, and activation steering letting us deeply control the networks to our hearts content."

SAEs and activation steering focus on the level of individual tokens or text generations, rather than on the overall behavior of the network. Neither of them can contribute meaningfully to current alignment issues like improving personalized tutoring, because tutoring plays out at a much broader level than individual tokens, so we shouldn't expect them to scale to more difficult issues like keeping down crime or navigating international diplomacy.
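To be concrete about the level these methods operate at, here is a toy sketch (shapes, weights and the steering vector are all made up, not from a real model): both the SAE readout and the steering intervention act on one activation vector at one token position, inside one forward pass.

```python
# Toy sketch of what an SAE and activation steering actually touch:
# a single activation vector at a single token position. Everything here
# (shapes, weights, the "niceness" direction) is hypothetical.
import numpy as np

d_model, d_sae = 512, 4096
rng = np.random.default_rng(0)

W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)  # SAE encoder (hypothetical)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)    # SAE decoder (hypothetical)
steer = rng.normal(size=d_model)                              # hypothetical "niceness" direction

h = rng.normal(size=d_model)          # residual-stream activation at ONE token position

features = np.maximum(W_enc @ h, 0)   # sparse-ish feature readout for this one token
h_recon = W_dec @ features            # reconstruction of this one activation
h_steered = h + 4.0 * steer           # activation steering: nudge this one activation

# Everything above is local to one forward pass at one position; nothing here
# says anything about how millions of generations add up to tutoring outcomes
# or crime rates.
print(features.shape, h_recon.shape, h_steered.shape)
```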

"AI x-risk worries aren't just a waste of time, though; they are dangerous because they make people think society needs to make use of violence to regulate what kinds of AIs people can make and how they can use them."

Obviously there will be some very bad ways to make and use AI, and we need norms against them. Violence is the ultimate backstop for norm enforcement: it's called the police and the military.

"This danger was visible from the very beginning, as alignment theorists thought one could (and should) make a singleton that would achieve absolute power (by violently threatening humanity, no doubt), rather than always letting AIs be pure servants of humanity."

It seems extremely valid to be concerned about AI researchers (including those with an alignment focus) aspiring to conquer the world (or to make something that conquers the world). However, always having humans on top won't be able to deal with the rapid and broad action that will be needed against AI-enabled adversaries.

Traditionally the ultimate backstop for promoting human flourishing was that states were reliant on men in the military, so if those men were incapacitated or did not see value in the state they were fighting for, the states would be weaker. This incentivized states to develop things that helped the men in their military, and meant that states which failed to do so were replaced by states that did.

This backstop has already been weakening with more advanced weaponry and more peace. Eventually all fighting will be done by drones rather than by people, at which point the backstop will be nearly gone. (Of course there's also the manufacturing and programming of the drones, etc.) This lack of backstop is the longest-term alignment problem, and if it fails there are endless ways most value could be destroyed, e.g.:

  • The machinery of war (mining, manufacturing, targeting, ...) has been fully automated, and a death cult (like Hamas or the Zizians) develops in the upper ranks of some military (would probably require a situation like Russia because death cults usually develop from external pressure?), and they destroy the world.
  • World peace is achieved, and the world elites are heavily filtered through processes that make them obsessed with superficial appearances rather than what is really going on, and they use their policing power to do that to everyone else too. (Imagine social credit scores that subtract points whenever you frown.)
  • All production is fully automated and the human reward system gets so fully reverse engineered that everyone spends all their time watching what basically amounts to those TikToks that layer some attractive commentary (e.g. a joke) on top of some attractive video (e.g. satisfying hydraulic press moments).

"To "justify" such violence, theorists make up all sorts of elaborate unfalsifiable and unjustifiable stories about how AIs are going to deceive and eventually kill humanity, yet the initial deceptions by base models were toothless, and thanks to modern alignment methods, serious hostility or deception has been thoroughly stamped out."

AI optimists have been totally knocked out by things like RLHF, becoming overly convinced of the AI's alignment and capabilities just from it acting apparently-nicely. This is a form of deceptive alignment, just in a "law of earlier failure" sense, as the AIs that knocked them out are barely even agentic.

  1. ^

    "But wouldn't it just be aligned to them, rather than unaligned?" Sometimes, presumably especially with xrisk-pilled adversaries. But some adversaries won't be xrisk-pilled and instead will be willing to use more risky strategies until they win. So you either need to eliminate them ahead of time or be able to destroy the unaligned AIs.

3 comments

Thanks to modern alignment methods, serious hostility or deception has been thoroughly stamped out.

AI optimists have been totally knocked out by things like RLHF, becoming overly convinced of the AI's alignment and capabilities just from it acting apparently-nicely.

I'm interested in how far you think we can reasonably extrapolate from the apparent niceness of an LLM. One extreme:

This LLM is apparently nice therefore it is completely safe, with no serious hostility or deception, and no unintended consequences.

This is false. Many apparently nice humans are not nice. Many nice humans are unsafe. Niceness can be hostile or deceptive in some conditions. And so on. But how about a more cautious claim?

This LLM appears to be nice, which is evidence that it is nice.

I can see the shape of a counter-argument like:

  1. The lab won't release a model if it doesn't appear nice.
  2. Therefore all models released by the lab will appear nice.
  3. Therefore the apparent niceness of a specific model released by the lab is not surprising.
  4. Therefore it is not evidence.

Maybe something like that?
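Putting toy numbers on that argument (entirely made up, just to show the shape of the update): if release requires appearing nice, then the release itself does whatever screening there is, and my own observation that the released model appears nice doesn't shift the posterior any further.

```python
# Toy Bayes calculation with made-up numbers for the filtering argument above.

p_nice = 0.5                       # prior that a trained model is genuinely nice
p_appears_if_nice = 0.99           # nice models almost always appear nice
p_appears_if_not_nice = 0.60       # many not-nice models also manage to appear nice

# Assume the lab only releases models that appear nice, so release screens:
p_nice_given_released = (p_appears_if_nice * p_nice) / (
    p_appears_if_nice * p_nice + p_appears_if_not_nice * (1 - p_nice)
)
print(f"P(nice | released)               = {p_nice_given_released:.2f}")

# Given release, the model appears nice whether or not it is genuinely nice,
# so the likelihood ratio of my own observation is 1 and nothing changes:
p_nice_given_released_and_nice_looking = p_nice_given_released
print(f"P(nice | released, appears nice) = {p_nice_given_released_and_nice_looking:.2f}")
```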

Disclaimer: I'm not an AI optimist.

I think the clearest problems in current LLMs are what I discussed in the "People used to be worried about existential risk from misalignment, yet we have a good idea about what influence current AIs are having on the world, and it is basically going fine." section. And this is probably a good example of what you are saying about how "Niceness can be hostile or deceptive in some conditions.".

For example, the issue of outsourcing tasks to an LLM to the point where one becomes dependent on it is arguably an issue of excessive niceness - though not exactly to the point where it becomes hostile or deceptive. But where it then does become deceptive in practice is that when you outsource a lot of your skills to the LLM, you start feeling like the LLM is a very intelligent guru that you can rely on, and then when you come up with a kind of half-baked idea, the RLHF makes the LLM praise you for your insight.

A tricky thing with a claim like "This LLM appears to be nice, which is evidence that it is nice." is what it means for it to "be nice". I think the default conception of niceness is as a general factor underlying nice behaviors, where a nice behavior is considered something like an action that alleviates difficulties or gives something desired, possibly with the restriction that being nice is the end itself (or at least, not a means to an end which the person you're treating nicely would disapprove of).

The major hurdle in generalizing this conception to LLMs is in this last restriction - both in terms of which restriction to use, and in how that restriction generalizes to LLMs. If we don't have any restriction at all, then it seems safe to say that LLMs are typically inhumanly nice. But obviously OpenAI makes ChatGPT so nice in order to get subscribers and thereby earn money, so that could be said to violate the ulterior-motive restriction. But it seems to me that this is only really profitable due to massive economies of scale, so at the level of an individual conversation, the amount of niceness seems to exceed the amount of money transferred, and seems quite unconditional on the money situation, so it seems more natural to think of the LLM as being simply nice for the purpose of being nice.

I think the more fundamental issue is that "nice" is a kind of confused concept (which is perhaps not so surprising considering the etymology of "nice"). Contrast for instance the following cultures:

  1. Everyone has strong, well-founded opinions on what makes for a good person, and they want there to be more good people. Because of these norms, they collaborate to teach each other skills, discuss philosophy, resolve neuroses, etc., to help each other be good, and that makes them all very good people. This goodness makes them all like and enjoy each other, and thus in lots of cases they conclude that the best thing they could do is to alleviate each other's difficulties and give each other things they desire (even from a selfish perspective, as empowering the others means that the others are better and do more nice stuff).
  2. Nobody is quite sure how to be good, but everyone is quite sure that goodness is something that makes people increase social welfare/sum of utility. Everyone has learned that utility can be elicited by choices/revealed preferences. They look at what actions others take and at how they seem to feel in order to deduce information about goodness. This often leads to them executing nice behaviors, because nice behaviors consistently make people feel better in the period shortly after they were executed.

They're both "nice", but the niceness of the two cultures have fundamentally different mechanisms with fundamentally different root causes and fundamentally different consequences. Even if they might both be high on the general factor of niceness, most nice behaviors have relatively small consequences, and so the majority of the consequence of their niceness is not determined by the overall level of the general factor of niceness, but instead by the nuances and long tails of their niceness, which differs a lot between the two cultures.

Now, LLMs don't do either of these, because they're not human and they don't have enough context to act according to either of these mechanisms. I don't think one can really compare LLMs to anything other than themselves.

Thanks. This helped me realize/recall that when an LLM appears to be nice, much less follows from that than it would for a human. For example, a password-locked model could appear nice, but become very nasty if it reads a magic word. So my mental model for "this LLM appears nice" should be closer to "this chimpanzee appears nice" or "this alien appears nice" or "this religion appears nice" in terms of trust. Interpretability and other research can help, but then we're moving further from human-based intuitions.
