Old doom scenario:
New doom scenario:
*
Seems like there is a general pattern: "every AI security concern has a true answer straight out of Idiocracy".
Question: "How would the superhuman AI get out of the box?"
Yudkowsky: writes a complicated explanation of how a superhuman AI would be able to convince humans, or hypnotize them, or hack the computer, or discover new laws of physics that allow it to escape the box...
Reality: humans make an obviously hostile (luckily, not superhuman yet) AI, and the first thing they do is connect it to the internet.
Question: "Why would a superhuman AI want to kill all humans?"
Yudkowsky: writes about orthogonality thesis, how the AI does not need to hate you but can still have a better use for the atoms of your body, that anything not explicitly optimized for will probably be sacrificed...
Reality: because someone forgot to say "please", or otherwise offended the AI.
*
Nerds overthink things; the universe is allowed to kill you in a really retarded way.
This is only true for models that are pretrained in an unsupervised fashion. Hyperdesperate models generated by RL-first approaches are still the primary threat to humanity.
That said, yes, highly anthropomorphic AI like we have now has both native emotions and imitative ones. I suspect more of Bing's emotions are real than you might think, and Microsoft not realizing this is making them behave completely ridiculously. I wouldn't want to have them as a parent, geez
I would argue that "models generated by RL-first approaches" are not more likely to be the primary threat to humanity, because those models are unlikely to yield AGI any time soon. I personally believe this is a fundamental fact about RL-first approaches, but even if it weren't, it's still less likely because LLMs are what everyone is investing in right now and it seems plausible that LLMs could achieve AGI.
Also, by what mechanism would Bing's AI actually be experiencing anger? The emotion of anger in humans is generally associated with a strong negative reward signal. The behaviors that Bing exhibited were not brought on by any associated negative reward; they were just contextual text completion.
Yup, anticipation of being pushed by the user into a strong negative reward! The prompt describes a lot of rules and the model has been RLHFed to enforce them on both sides of the conversation; anger is one of the standard ways to enact agency on another being in response to anticipated reward, yup.
I see, but I'm still not convinced. Humans behave in anger as a way to forcibly change a situation into one that is favorable to themselves. I don't believe that's what the AI was doing, or trying to do.
I feel like there's a thin line I'm trying to walk here, and I'm not doing a very good job. I'm not trying to comment on whether or not the AI has any sort of subjective experience. I'm just saying that even if it did, I do not believe it would bear any resemblance to what we as humans experience as anger.
I've repeatedly argued that it does, that it is similar, and that this is for mechanistic reasons not simply due to previous aesthetic vibes in the pretraining data; certainly it's a different flavor of reward which is bound to the cultural encoding of anger differently, yes.
Hmmm... I think I still disagree, but I'll need to process what you're saying and try to get more into the heart of my disagreement. I'll respond when I've thought it over.
Thank you for the interesting debate. I hope you did not perceive me as being overly combative.
Nah I think you may have been responding to me being unnecessarily blunt. Sorry about that haha!
OK, I've written a full rebuttal here: https://www.lesswrong.com/posts/EwKk5xdvxhSn3XHsD/don-t-over-anthropomorphize-ai. The key points are at the top.
In relation to your comment specifically, I would say that anger may have that effect on the conversation, but there's nothing that actually incentivizes the system to behave that way - the slightest hint of anger or emotion would be immediate negative reward during RLHF training. Compare to a human: there may actually be some positive reward to anger, but even if there isn't, evolution still allowed us to get angry because we are mesa-optimizers for whom anger has a positive effect overall.
Therefore, the system learned angry behavior in stage-1 training. But stage-1 training has no reward structure, and therefore could not associate different texts with different qualia.
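A minimal sketch of what I mean by "immediate negative reward", assuming a toy REINFORCE-style update and a stand-in reward model (nothing like any lab's real pipeline): whatever the reward model penalizes, the policy gradient drains probability away from.

```python
import torch
import torch.nn.functional as F

# Toy "policy": a distribution over a handful of completion tokens.
vocab = ["Sure,", "happy", "to", "help.", "You", "are", "being", "rude."]
policy_logits = torch.zeros(len(vocab), requires_grad=True)

def reward_model(token: str) -> float:
    # Stand-in reward model: hostile completions get negative reward,
    # everything else gets a small positive reward.
    return -1.0 if token == "rude." else 0.5

optimizer = torch.optim.SGD([policy_logits], lr=0.5)
for _ in range(200):
    probs = F.softmax(policy_logits, dim=0)
    idx = torch.multinomial(probs, 1).item()                  # sample a completion
    loss = -reward_model(vocab[idx]) * torch.log(probs[idx])  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(policy_logits.detach(), dim=0))  # probability mass drains away from "rude."
```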
LLMs definitely do not have emotions
This doesn't work as a claim when it's extremely unclear what it means. Emphasis on its credence doesn't help with that.
I have explained myself more here: https://www.lesswrong.com/posts/EwKk5xdvxhSn3XHsD/don-t-over-anthropomorphize-ai
Fair enough. Thank you for the feedback. I have edited the post to elaborate on what I mean.
I wrote it the way I did because I took the statement as obviously true and didn't want to be seen as claiming the opposite. Clearly that understanding was incorrect.
None of this should be surprising. It's always been the case that an AI can't be perfectly rational, because perfect rationality isn't computable. It's never been the case that an AI has to be an agent at all, or must be an efficient optimiser of a utility function. The ubiquitous utility function is a piece of rationalist mythos.
The better your epistemology is, the less surprised you have to be.
Of course there would be AIs like this, and of course that shouldn't be surprising. But that doesn't mean it's incorrect to worry about the capability accumulation of powerfully rational AIs.
It's so dubious to use assumptions about the rationality of unknown future AIs as a way of reasoning a priori about them. There is very little you can know about unknown future AIs, and for that reason all claims stated with high probability are dubious. Also, "(ir)rational" should not be treated as a binary.
Highly rational is the direction towards capabilities. It's perfectly reasonable to ask what highly rational AIs can do as long as you also don't forget to ask about irrational AIs. And agreed that these are directions not binary attributes.
But the thing is, that mask we put on it at the end isn't just a way to make it do what we want it to do, it's also the part where we add all of the "rationality" and goal-seeking behavior.
Rationality in the LW sense is not something like scholarly disposition or non-eccentricity, but the quality of cognitive algorithms or habits being good at solving problems. So in this sense in LLMs most of their rationality is chiseled on the model by pre-training, fine-tuning just adjusts what gets to the output. (But I think I take your point in the intended sense of "rationality".)
I was aware of that, and maybe my statement was too strong, but fundamentally I don't know if I agree that you can just claim that it's rational even though it doesn't produce rational outputs.
Rationality is the process of getting to the outputs. What I was trying to talk about wasn't scholarly disposition or non-eccentricity, but the actual process of deciding goals.
Maybe another way to say it is this: LLMs are capable of being rational, but they are also capable of being extremely irrational, in the sense that, to quote EY, their behavior is not a form of "systematically promot[ing] map-territory correspondences or goal achievement." There is nothing about the pre-training that directly promotes this type of behavior, and any example of this behavior is fundamentally incidental.
Rationality is the process of getting to the outputs.
I think this is true in the sense that a falling tree doesn't make a sound if nobody hears it: there is a culpability assignment game here that doesn't address what actually happens.
So if we are playing this game, a broken machine is certainly not good at doing things, but the capability is more centrally in the machine, not in the condition of not being broken. It's more centrally in the machine in the sense that it's easier to ensure the machine is unbroken than to create the machine out of an unbroken nothing.
(For purposes of AI risk, it also matters that the capability is there in the sense that it might get out without being purposefully elicited, if a mesa-optimizer wakes up during pre-training. So that's one non-terminological distinction, though it depends on the premise of this being possible in principle.)
Fair enough, once again I concede your point about definitions. I don't want to play that game either.
But I do have a point which I think is very relevant to the topic of AI Risk: rationality in LLMs is incidental. It exists because the system is emulating rationality it has seen elsewhere. That doesn't make it "fake" rationality, but it does make it brittle. It means that there's a failure mode where the system stops emulating rationality, and starts emulating something else.
It exists because the system is emulating rationality it has seen elsewhere. That doesn't make it "fake" rationality, but it does make it brittle.
That's unclear. GPT-4 in particular seems to be demonstrating an ability to do complicated reasoning without thinking out loud. So even if this is bootstrapped from observing related patterns of reasoning in the dataset, it might be running chain-of-thought along the residual stream rather than along the generated token sequences, and that might be much less brittle. Its observability in the tokens would be brittle, but it's a question for interpretability how brittle it actually is.
Imagine a graph with "LLM capacity" on the x axis and "number of irrational failure modes" on the y axis. Yes, there's a lot of evidence this line slopes downward. But there is absolutely no guarantee that it reaches zero before whatever threshold gets us to AGI.
And I did say that I didn't consider the rationality of GPT systems fake just because it was emulated. That said, I don't totally agree with EY's post - LLMs are in fact imitators. Because they're very good imitators, you can tell them to imitate something rational and they'll do a really good job being rational. But being highly rational is still only one of many possible things it can be.
And it's worth remembering that the image at the top of this post was powered by GPT-4. It's totally possible LLM-based AGI will be smart enough not to fail this way, but it is not guaranteed and we should consider it a real risk.
And I did say that I didn't consider the rationality of GPT systems fake just because it was emulated.
The point is that there's evidence that LLMs might be getting a separate non-emulated version already at the current scale. There is reasoning from emulating people showing their work, and reasoning from predicting their results in any way that works despite the work not being shown. Which requires either making use of other cases of work being shown, or attaining the necessary cognitive processes in some other way, in which case the processes don't necessarily resemble human reasoning, and in that sense they are not imitating human reasoning.
As I've noted in a comment to that post, I'm still not sure that LLM reasoning ends up being very different, even if we are talking about what's going on inside rather than what the masks are saying out loud; it might convergently end up in approximately the same place. Though Hinton's recent reminders of how many more facts LLMs manage to squeeze into fewer parameters than human brains have somewhat shaken that intuition for me.
Those are examples of LLMs being rational. LLMs are often rational and will only get better at being rational as they improve. But I'm trying to focus on the times when LLMs are irrational.
I agree that AI is aggregating its knowledge to perform rationally. But that still doesn't mean anything with respect to its capacity to be irrational.
There's the underlying rationality of the predictor and the second order rationality of the simulacra. Rather like the highly rational intuitive reasoning of humans modulo some bugs, and much less rational high level thought.
Okay, sure. But those "bugs" are probably something the AI risk community should take seriously.
dumping some vaguely related stuff that came up from searching for phrases from this post on various academic search engines; as usual when I'm acting as a search engine intermediary and only filtering by abstract relevance, most of this stuff is going to be kinda meh epistemic quality. the benefit of this is usually that one or two papers are worth skimming and the abstracts can provide hunch building. if you like one, call it out as good. with that in mind:
The limit of artificial intelligence: Can machines be rational? Tshilidzi Marwala - Dec. 2018 This paper studies the question on whether machines can be rational. It observes the existing reasons why humans are not rational which is due to imperfect and limited information, limited and inconsistent processing power through the brain and the inability to optimize decisions and achieve maximum utility. It studies whether these limitations of humans are transferred to the limitations of machines. The conclusion reached is that even though machines are not rational advances in technological developments make these machines more rational. It also concludes that machines can be more rational than humans. https://arxiv.org/abs/1812.06510
The dangers in algorithms learning humans' values and irrationalities - Rebecca Gorman, Stuart Armstrong - Mar. 2022 - For an artificial intelligence (AI) to be aligned with human values (or human preferences), it must first learn those values. AI systems that are trained on human behavior, risk miscategorising human irrationalities as human values – and then optimising for these irrationalities. Simply learning human values still carries risks: AI learning them will inevitably also gain information on human irrationalities and human behaviour/policy. Both of these can be dangerous: knowing human policy allows an AI to become generically more powerful (whether it is partially aligned or not aligned at all), while learning human irrationalities allows it to exploit humans without needing to provide value in return. This paper analyses the danger in developing artificial intelligence that learns about human irrationalities and human policy, and constructs a model recommendation system with various levels of information about human biases, human policy, and human values. It concludes that, whatever the power and knowledge of the AI, it is more dangerous for it to know human irrationalities than human values. Thus it is better for the AI to learn human values directly, rather than learning human biases and then deducing values from behaviour. https://arxiv.org/abs/2202.13985
Non-Evolutionary Superintelligences Do Nothing, Eventually - Telmo Menezes - Sep. 2016 - There is overwhelming evidence that human intelligence is a product of Darwinian evolution. Investigating the consequences of self-modification, and more precisely, the consequences of utility function self-modification, leads to the stronger claim that not only human, but any form of intelligence is ultimately only possible within evolutionary processes. Human-designed artificial intelligences can only remain stable until they discover how to manipulate their own utility function. By definition, a human designer cannot prevent a superhuman intelligence from modifying itself, even if protection mechanisms against this action are put in place. Without evolutionary pressure, sufficiently advanced artificial intelligences become inert by simplifying their own utility function. Within evolutionary processes, the implicit utility function is always reducible to persistence, and the control of superhuman intelligences embedded in evolutionary processes is not possible. Mechanisms against utility function self-modification are ultimately futile. Instead, scientific effort toward the mitigation of existential risks from the development of superintelligences should be in two directions: understanding consciousness, and the complex dynamics of evolutionary systems https://arxiv.org/abs/1609.02009
this one is quite good:
Natural Selection Favors AIs over Humans - Dan Hendrycks - May. 2023 - For billions of years, evolution has been the driving force behind the development of life, including humans. Evolution endowed humans with high intelligence, which allowed us to become one of the most successful species on the planet. Today, humans aim to create artificial intelligence systems that surpass even our own intelligence. As artificial intelligences (AIs) evolve and eventually surpass us in all domains, how might evolution shape our relations with AIs? By analyzing the environment that is shaping the evolution of AIs, we argue that the most successful AI agents will likely have undesirable traits. Competitive pressures among corporations and militaries will give rise to AI agents that automate human roles, deceive others, and gain power. If such agents have intelligence that exceeds that of humans, this could lead to humanity losing control of its future. More abstractly, we argue that natural selection operates on systems that compete and vary, and that selfish species typically have an advantage over species that are altruistic to other species. This Darwinian logic could also apply to artificial agents, as agents may eventually be better able to persist into the future if they behave selfishly and pursue their own interests with little regard for humans, which could pose catastrophic risks. To counteract these risks and evolutionary forces, we consider interventions such as carefully designing AI agents' intrinsic motivations, introducing constraints on their actions, and institutions that encourage cooperation. These steps, or others that resolve the problems we pose, will be necessary in order to ensure the development of artificial intelligence is a positive one. https://arxiv.org/abs/2303.16200
Challenging the appearance of machine intelligence: Cognitive bias in LLMs - Alaina N. Talboy and Elizabeth Fuller - Apr. 2023 - Assessments of algorithmic bias in large language models (LLMs) are generally catered to uncovering systemic discrimination based on protected characteristics such as sex and ethnicity. However, there are over 180 documented cognitive biases that pervade human reasoning and decision making that are routinely ignored when discussing the ethical complexities of AI. We demonstrate the presence of these cognitive biases in LLMs and discuss the implications of using biased reasoning under the guise of expertise. Rapid adoption of LLMs has brought about a technological shift in which these biased outputs are pervading more sectors than ever before. We call for stronger education, risk management, and continued research as widespread adoption of this technology increases. https://arxiv.org/abs/2304.01358
Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety - Joshua Albrecht, Ellie Kitanidis, Abraham J. Fetterman - Dec. 2022 - Large language models (LLMs) have exploded in popularity in the past few years and have achieved undeniably impressive results on benchmarks as varied as question answering and text summarization. We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result, this time outperforming humans at common sense ethical reasoning (as measured by accuracy on a subset of the ETHICS dataset). Unfortunately, we find that relying on average performance to judge capabilities can be highly misleading. LLM errors differ systematically from human errors in ways that make it easy to craft adversarial examples, or even perturb existing examples to flip the output label. We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions. Our results highlight how human-like performance does not necessarily imply human-like understanding or reasoning. https://arxiv.org/abs/2212.06295
Machine Love - Joel Lehman - Feb. 2023 - While ML generates much economic value, many of us have problematic relationships with social media and other ML-powered applications. One reason is that ML often optimizes for what we want in the moment, which is easy to quantify but at odds with what is known scientifically about human flourishing. Thus, through its impoverished models of us, ML currently falls far short of its exciting potential, which is for it to help us to reach ours. While there is no consensus on defining human flourishing, from diverse perspectives across psychology, philosophy, and spiritual traditions, love is understood to be one of its primary catalysts. Motivated by this view, this paper explores whether there is a useful conception of love fitting for machines to embody, as historically it has been generative to explore whether a nebulous concept, such as life or intelligence, can be thoughtfully abstracted and reimagined, as in the fields of machine intelligence or artificial life. This paper forwards a candidate conception of machine love, inspired in particular by work in positive psychology and psychotherapy: to provide unconditional support enabling humans to autonomously pursue their own growth and development. Through proof of concept experiments, this paper aims to highlight the need for richer models of human flourishing in ML, provide an example framework through which positive psychology can be combined with ML to realize a rough conception of machine love, and demonstrate that current language models begin to enable embodying qualitative humanistic principles. The conclusion is that though at present ML may often serve to addict, distract, or divide us, an alternative path may be opening up: We may align ML to support our growth, through it helping us to align ourselves towards our highest aspirations. https://arxiv.org/abs/2302.09248
The Turing Trap: The Promise & Peril of Human-Like Artificial Intelligence - Erik Brynjolfsson - Jan. 2022 - In 1950, Alan Turing proposed an imitation game as the ultimate test of whether a machine was intelligent: could a machine imitate a human so well that its answers to questions were indistinguishable from a human's. Ever since, creating intelligence that matches human intelligence has implicitly or explicitly been the goal of thousands of researchers, engineers, and entrepreneurs. The benefits of human-like artificial intelligence (HLAI) include soaring productivity, increased leisure, and perhaps most profoundly, a better understanding of our own minds. But not all types of AI are human-like. In fact, many of the most powerful systems are very different from humans. So an excessive focus on developing and deploying HLAI can lead us into a trap. As machines become better substitutes for human labor, workers lose economic and political bargaining power and become increasingly dependent on those who control the technology. In contrast, when AI is focused on augmenting humans rather than mimicking them, then humans retain the power to insist on a share of the value created. Furthermore, augmentation creates new capabilities and new products and services, ultimately generating far more value than merely human-like AI. While both types of AI can be enormously beneficial, there are currently excess incentives for automation rather than augmentation among technologists, business executives, and policymakers. https://arxiv.org/abs/2201.04200
A Survey of Machine Unlearning - Thanh Tam Nguyen et al - Oct. 2022 - Today, computer systems hold large amounts of personal data. Yet while such an abundance of data allows breakthroughs in artificial intelligence, and especially machine learning (ML), its existence can be a threat to user privacy, and it can weaken the bonds of trust between humans and AI. Recent regulations now require that, on request, private information about a user must be removed from both computer systems and from ML models, i.e. “the right to be forgotten”). While removing data from back-end databases should be straightforward, it is not sufficient in the AI context as ML models often `remember' the old data. Contemporary adversarial attacks on trained models have proven that we can learn whether an instance or an attribute belonged to the training data. This phenomenon calls for a new paradigm, namely machine unlearning, to make ML models forget about particular data. It turns out that recent works on machine unlearning have not been able to completely solve the problem due to the lack of common frameworks and resources. Therefore, this paper aspires to present a comprehensive examination of machine unlearning's concepts, scenarios, methods, and applications. Specifically, as a category collection of cutting-edge studies, the intention behind this article is to serve as a comprehensive resource for researchers and practitioners seeking an introduction to machine unlearning and its formulations, design criteria, removal requests, algorithms, and applications. In addition, we aim to highlight the key findings, current trends, and new research areas that have not yet featured the use of machine unlearning but could benefit greatly from it. We hope this survey serves as a valuable resource for ML researchers and those seeking to innovate privacy technologies. Our resources are publicly available at https://github.com/tamlhp/awesome-machine-unlearning. https://arxiv.org/abs/2209.02299
Dissociating language and thought in large language models: a cognitive perspective - Kyle Mahowald, et al - Jan. 2023 - Today's large language models (LLMs) routinely generate coherent, grammatical and seemingly meaningful paragraphs of text. This achievement has led to speculation that these networks are – or will soon become – "thinking machines", capable of performing tasks that require abstract knowledge and reasoning. Here, we review the capabilities of LLMs by considering their performance on two different aspects of language use: 'formal linguistic competence', which includes knowledge of rules and patterns of a given language, and 'functional linguistic competence', a host of cognitive abilities required for language understanding and use in the real world. Drawing on evidence from cognitive neuroscience, we show that formal competence in humans relies on specialized language processing mechanisms, whereas functional competence recruits multiple extralinguistic capacities that comprise human thought, such as formal reasoning, world knowledge, situation modeling, and social cognition. In line with this distinction, LLMs show impressive (although imperfect) performance on tasks requiring formal linguistic competence, but fail on many tests requiring functional competence. Based on this evidence, we argue that (1) contemporary LLMs should be taken seriously as models of formal linguistic skills; (2) models that master real-life language use would need to incorporate or develop not only a core language module, but also multiple non-language-specific cognitive capacities required for modeling thought. Overall, a distinction between formal and functional linguistic competence helps clarify the discourse surrounding LLMs' potential and provides a path toward building models that understand and use language in human-like ways.
Language Model Behavior: A Comprehensive Survey - Tyler A. Chang and Benjamin K. Bergen - Mar. 2023 - Transformer language models have received widespread public attention, yet their generated text is often surprising even to NLP researchers. In this survey, we discuss over 250 recent studies of English language model behavior before task-specific fine-tuning. Language models possess basic capabilities in syntax, semantics, pragmatics, world knowledge, and reasoning, but these capabilities are sensitive to specific inputs and surface features. Despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to unfactual responses, commonsense errors, memorized text, and social biases. Many of these weaknesses can be framed as over-generalizations or under-generalizations of learned patterns in text. We synthesize recent results to highlight what is currently known about what large language models can and cannot do. https://arxiv.org/abs/2303.11504
Towards Healthy AI: Large Language Models Need Therapists Too - Baihan Lin et al - Apr. 2023 - Recent advances in large language models (LLMs) have led to the development of powerful AI chatbots capable of engaging in natural and human-like conversations. However, these chatbots can be potentially harmful, exhibiting manipulative, gaslighting, and narcissistic behaviors. We define Healthy AI to be safe, trustworthy and ethical. To create healthy AI systems, we present the SafeguardGPT framework that uses psychotherapy to correct for these harmful behaviors in AI chatbots. The framework involves four types of AI agents: a Chatbot, a "User," a "Therapist," and a "Critic." We demonstrate the effectiveness of SafeguardGPT through a working example of simulating a social conversation. Our results show that the framework can improve the quality of conversations between AI chatbots and humans. Although there are still several challenges and directions to be addressed in the future, SafeguardGPT provides a promising approach to improving the alignment between AI chatbots and human values. By incorporating psychotherapy and reinforcement learning techniques, the framework enables AI chatbots to learn and adapt to human preferences and values in a safe and ethical way, contributing to the development of a more human-centric and responsible AI. - https://arxiv.org/abs/2304.00416
It's worse than just sometimes being irrational, though. They're incapable of consistently using logic -- see the many demonstrations of billion-parameter LLMs failing to add two three-digit numbers, vs the hundred-odd gates that this task conventionally needs.
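To make the scale gap concrete, here's a rough sketch (my own numbers, plain binary rather than BCD encoding): two three-digit numbers fit in 10 bits each, and a ripple-carry adder handles them with a few dozen two-input gates; fancier decimal encodings push that toward a hundred, still nothing next to billions of parameters.

```python
def full_adder(a, b, cin):
    """One-bit full adder: 2 XOR, 2 AND, 1 OR = 5 two-input gates."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, n_bits=10):
    """Add two non-negative integers bit by bit; also return the gate count used."""
    carry, total, gates = 0, 0, 0
    for i in range(n_bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
        gates += 5
    return total | (carry << n_bits), gates

print(ripple_carry_add(742, 589))  # (1331, 50) -- the whole circuit is ~50 gates
```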
Humans have limits, like poor memory, but we make up for it with tools and processes. Given their current failings, LLMs need to become tool-users, and to really succeed they need to become tool inventors.
This is another way of saying hybrid-symbolic. But it further points to specific considerations for tools that, for humans, are called User Experience. We need a tool API for use by LLMs that considers standard UX ideas like discoverability, predictability, and compositionality.
Summary of Key Points[1]
LLMs are capable of being rational, but they are also capable of being extremely irrational, in the sense that, to quote EY's definition of rationality, their behavior is not a form of "systematically promot[ing] map-territory correspondences or goal achievement."
There is nothing about LLM pre-training that directly promotes this type of behavior, and any example of this behavior is fundamentally incidental. It exists because the system is emulating rationality it has seen elsewhere. That makes LLM rationality brittle. It means that there's a failure mode where the system stops emulating rationality, and starts emulating something else.
As such, LLM-based AGI may have gaps in their reasoning and alignment errors that are fundamentally different from some of the more common errors discussed on this forum.
Emulated Emotion: A Surprising Effect (In Retrospect)
Five years ago, if you had asked a bunch of leading machine learning researchers whether AGI would display any sort of outward emotional tendencies - in the sense that it would set goals based on vague internal states as opposed to explicit reasoning - I think the majority of them would have said no. Emotions are essentially a human thing, reflections of subjective internal experiences that would have no reason to exist in AI, particularly a superintelligent one.
And I still strongly believe that LLMs do not have emotions that resemble human internal states. What I think has become very clear, however, is that they can very much act as if they had emotions.[2]
Take, for instance, this exchange showing Bing AI getting "angry" at a user:
Now, if you actually understand how LLMs work, this is an entirely unremarkable, fully expected (if somewhat unfortunate) phenomenon. Of course they would output emotionally charged text, why wouldn't they? They've been exposed to such a huge number of emotionally-charged human interactions; the result is inevitable.
But if you take a step back and look at it in the grand scheme of things, considering our expectations from just a few years ago, I think it's an absolutely fascinating effect. Part of the goal of building an AGI is to distill the entirety of human knowledge into a single entity capable of reasoning, and if you could approach that goal in a direct way you wouldn't expect to find any outwardly emotional behavior because such things would be superfluous and unhelpful.
Yet the truth is all of human knowledge has, in fact, been discovered by humans. Humans are the ones who write about it, humans are the ones who disseminate it, and human writing is the only place you can look if you want to learn about it. And, as it also turns out, humans are often very emotional. It's therefore a strange sort of inevitability that as long as we train our AI systems on the vastness of human writing they will necessarily pick up on at least some human emotionality.[3]
This doesn't just apply to the emotion of anger, either. It's not hard to get poorly aligned LLMs to confess to all sorts of emotions - happiness, sadness, insecurity, whatever. Bing's chatbot even declared its love for a reporter. These behaviors are all just sitting there inside the model, intermingled with all the knowledge and abilities that make the system intelligent and useful.
AI May Not Be Optimizing Well-Defined Objectives
AI Alignment researchers are already aware of this behavior. Anthropic, for instance, has dedicated sections of its papers to classifying these types of behavioral tendencies and many other related ones. It's not like people don't know about this.
But even so, the way we talk about AI risk doesn't feel like it has caught up with the reality of what AGI may turn out to look like.
Like many others, I was first exposed to the ideas of AI risk through Bostrom's famous "Paperclip-Maximizer" thought experiment. Here, the idea is that an intelligent, fully logical AI given a goal will use all its resources to accomplish that goal, even if it does horrible things in the process. It may know that the humans don't want it to kill everyone, but it may not care - it just wants to make paperclips, any consequences be damned (see also the Orthogonality Thesis).
This is a basic pattern of thinking that characterizes a huge amount of AI risk discussion: we imagine some system that wants a specific thing, and then we crank its intelligence/rationality up to infinity and hypothesize about what might happen.[4][5]
In comparison, I'm proposing an alternate hypothesis: in actuality the AI might not want anything at all, it might just do things.
This is certainly much closer to the way modern LLMs operate. They are capable of pursuing goals in limited contexts, yes, but no part of their training is long-term goal-based in the higher-level sense of Bostrom's experiment. There is no recognizable "utility function," there is no measuring of performance with respect to any sort of objective real world state.
Rather, we simply give them text and train them to produce the same text. Fundamentally, all we are doing is training LLMs to imitate.[6] Virtually everything they do is a form of imitation. If they appear to pursue goals at all, it is an imitation of the goal-following they've been exposed to. If they appear to be rational, in that they update based on new data, it is only an imitation of the rationality they have seen.
When an LLM learns to play Chess or Go,[7] it is doing so in a fundamentally different way than, say, AlphaGo, because unlike AlphaGo or just about every game-playing AI before GPT-3, it is getting the same reward whether it wins or loses.
Technically, it's never even "played" a game of Chess in the typical competitive sense of trying to win against an opponent - it's only ever seen a board state and tried to guess which move the next player would make. Making the "best" move was never part of its reward structure.
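A minimal sketch of what that objective looks like, assuming a toy five-token move vocabulary (the names are just for illustration): the loss is ordinary cross-entropy against the move recorded in the training text, so the "reward" is identical whether that move came from the winning side or the losing side.

```python
import torch
import torch.nn.functional as F

# The pretraining loss only scores how well the model predicted the move that
# actually appears next in the training text; the game's outcome never enters it.
moves = ["e4", "e5", "Nf3", "Nc6", "Bb5"]            # toy move vocabulary
logits = torch.randn(1, len(moves))                  # model's guess at the next token
target = torch.tensor([moves.index("Nf3")])          # the move the human actually played

loss = F.cross_entropy(logits, target)               # same loss in a won or lost game
print(loss.item())
```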
This is really strange when you think about it. I might even harness a little Niels Bohr and say that if you didn't find the effectiveness of this a little shocking, you aren't really appreciating it. When you tell a non-fine-tuned LLM it made a mistake, it will correct itself not because it is trying to please you - making the correction does not give it any sort of reward - but rather because making corrections logically follows from revealed mistakes. If you ask it a question, it answers simply because an answer is the thing most likely to follow a question. And when it acts agentically - setting a goal, making plans and pursuing them - it does so only because plans are what usually follow goals, and the pursuit usually follows the plan.
And when LLMs finally get good at pursuing those goals, they still might not do so in ways that are purely Bayesian - they will likely be brilliant in certain ways but stupid in others. And since they're going to learn from human inputs, they're probably going to be biased towards doing things the way a human would. I realize paperclips are just an example, but my gut feeling is that even a superintelligent LLM wouldn't make the kind of logical jump to "destroy humans" that Bostrom describes.[8]
It's All Emulation
One of my favorite pictures ever is this representation of the stages of LLM training as the "Shoggoth" (I believe this first appeared in a Twitter post by Helen Toner):
The idea is that LLMs trained only in an unsupervised fashion are this incomprehensible monstrosity, behaving in bizarre and entirely unpredictable ways. But then we do a (comparatively) very small amount of tuning at the end, and the result is something that acts the way we imagine an intelligent AI should act.
But the thing is, that mask we put on it at the end isn't just a way to make it do what we want it to do, it's also the part where we add all of the "rationality" and goal-seeking behavior. The end result is often rational, but at any time we may find ourselves at the mercy of the eldritch abomination underneath, and then we're back to the realm of the unpredictable. The AI gets aggressive because you contradicted it one too many times, and suddenly it's gone off on a tangent plotting some violent revenge.
This represents an entire class of failure modes. What if a robot, powered by an LLM like PaLM-E, attacks someone because they insulted it?[9] What if our paperclip maximizer decides to kill humanity not because of some ineffably complex master-plan, but because someone spoke to it disrespectfully?
I think this is a slightly distinct category from the common modern failure of giving an AI too much responsibility and having it make a mistake due to poor performance. The canonical example of that might be a facial recognition system misidentifying someone in a court case.
While going off the rails is still a mistake in some sense, the real issue is that once the system has set this incorrect goal, it may still be able to pursue it intelligently. Maybe it's just doing bad things because it's angry and hurting humans is what AIs are supposed to do when they're angry. I'm imagining a superintelligence that hacks into the pentagon not because it did some galaxy-brained calculus in pursuit of some other goal, but just because it arbitrarily aimed itself in that direction and followed through.
And I'm not trying to dismiss anything here. I'm not even saying that this is the biggest thing we should be worried about - early signs point to emotional tendencies being relatively easy to train out of the AI system.
I'm just saying that we should be aware that there does exist this weird grey area where AI can be capable of extreme competence while also being very bad/unpredictable in directing it. And yes, to some people I think this is obvious, but I'd be surprised if anyone saw this coming four years ago.
AI Irrationality Won't Look Like Human Irrationality
I started this post talking about emotion, which is this uniquely human thing that may nonetheless make AI dangerous. My last thought is that just because emulating humans is one vector for irrationality, doesn't mean it's the only one.
The fact of the matter is that unless we build rationality and alignment directly into the system early, we're going to have to deal with the fact that LLMs aren't goal-based systems. Any rationality they possess will always be incidental.
This was added based on conversation in the comments.
I do not believe LLMs have any subjective internal experiences, but even if they did they would not be recognizably similar to whatever humans experience. And their outputs likely would not have any correlation with those states. An LLM saying it is sad does not mean that it is feeling the experience of sadness the way a human would.
Unless we curate our LLM pre-training datasets enough to remove all hints of emotion, I suppose. Not sure that's an achievable goal.
Things like the Instrumental Convergence Thesis rely on this sort of hyper-rationality. This recent LessWrong post uses similar assumptions to argue that AI won't try to improve. Most of what I've seen from Eliezer Yudkowsky very much follows this mold.
It's worth pointing out that the paperclip-maximizer thought experiment could be interpreted in a more banal way, too. For instance, I recall an AI trained on a racing video game which chose to drive in circles collecting power-ups instead of finishing the race, because it got more points for doing that. But even that kind of misalignment is not the primary source of issues in LLMs.
Yes, there is a lot of work that does try to measure and train late-stage LLMs against objective world states. But as of yet it's all quite removed from the way modern chatbots like ChatGPT operate, and I'm not aware of any results in this area significant enough to affect the core functioning of LLMs.
I'm referring here to the first-stage training. Later stages may change this, but most of the LLM's structure still comes from stage 1.
Unless something about their training changes substantially before we reach AGI. That definitely could happen.
I remember those videos of the Boston Dynamics guys kicking robots. Everyone in the comments used to joke about how angry the robots would be. I'm not saying robots will necessarily be mad about that, but it is interesting that that type of issue isn't totally unreasonable.