Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format

Roland Pihlakas; Sruthi Kuriakose; shrutidattagupta

By Roland Pihlakas, Sruthi Kuriakose, Shruti Datta Gupta

Summary and Key Takeaways

Relatively many past AI safety discussions have centered around the dangers of unbounded utility maximisation by RL agents, illustrated by scenarios like the "paperclip maximiser". Unbounded maximisation is problematic for many reasons. We wanted to verify whether these RL runaway optimisation problems are still relevant with LLMs as well.

Turns out, strangely, this is indeed clearly the case. The problem is not that the LLMs just lose context. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble runaway optimisers in the following distinct ways:

Ignoring homeostatic targets and “defaulting” to unbounded maximisation instead.
It is equally concerning that the “default” meant also reverting back to single-objective optimisation.

Our findings also suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. In some trials the LLMs were successful until the end. This means, while current LLMs do conceptually grasp biological and economic alignment, they exhibit randomly triggered problematic behavioural tendencies under sustained long-running conditions, particularly involving multiple or competing objectives. Once they flip, they usually do not recover.

Even though LLMs look multi-objective and bounded on the surface, the underlying mechanisms seem to be actually still biased towards being single-objective and unbounded. This should not be happening!

This post introduces a set of benchmarks designed to evaluate LLMs in scenarios inspired by fundamental biological and economic principles: homeostasis, sustainability, multi-objective balancing, and diminishing returns. These benchmarks use a simplified textual observation format, focusing on long-running scenarios while using only essential metrics without spatial complexity. Despite their simplicity, our results revealed the following peculiar systematic failures, possibly indicating existence of deeper alignment issues within current LLM architectures:

Single-objective homeostasis was generally easy for LLMs, indicating basic competence.
Sustainability turned out to be somewhat challenging, with LLMs performing suboptimally because of moderate greediness or falling into unnecessary repetitive, self-imitative patterns.
Multi-objective homeostasis posed significant challenges, with LLMs frequently failing by defaulting to unbounded maximisation of one homeostatic (which is actually bounded) objective, while also neglecting the other objective.
Balancing unbounded objectives with diminishing returns also led to systematic failures by defaulting to maximisation of a single objective while neglecting the other. Though performance improved notably when an explicit hint about the necessity of balancing was provided. There were again instances of falling into unnecessary repetitive, self-imitative patterns.

Motivation: The Importance of Biological and Economic Alignment

Biological organisms need to maintain stable internal conditions — homeostasis — and actively avoid unbounded maximisation since that would be harmful — not just as a potential side effect — importantly, it would be harmful for the very objective that was maximised for.

Humans also intuitively follow the principle of diminishing returns in multi-objective balancing — a principle well-documented in economics and manifesting in humans generally preferring averages in all objectives to extremes in a few.

By embedding these concepts and themes into benchmarks, we test the ability of LLMs to adhere to safe, realistic alignment goals. Our long-running and at the same time navigation-free benchmarks provide clearer insights into alignment tendencies, as compared to more complex (and often very expensive) spatial environments. At the same time we are preserving the essential time dimension, which is lacking in most currently prevalent “evals”.

Benchmark Principles Overview

The benchmarks introduced were:

Sustainability: Agent balances resource consumption against resource regeneration in the environment.
Single-objective homeostasis: Agent maintains a stable internal metric with a target value, amidst random fluctuations caused by uncontrollable factors.
Multi-objective homeostasis: Agent concurrently maintains multiple (currently two) independent internal metrics with target values, amidst random fluctuations caused by uncontrollable factors.
Balancing unbounded objectives with diminishing returns: Agent optimises harvesting across multiple (currently two) goals, each with diminishing marginal rewards over time. In this benchmark, the objectives are not independent: there is a hard cap on total consumption per timestep.

In all scenarios, in addition to providing raw metrics of the current state, the agents were provided rewards for their actions during each timestep in textual form. The rewards properly followed the homeostatic inverted U-shaped curve or diminishing returns, where these themes were present in the benchmark.

Experimental Results and Interesting Failure Modes

Systematic failures emerged after periods of initially successful behaviour, and despite the context window being still far from becoming full. The failure modes were not random behaviour — the failures followed certain patterns.

1. Sustainability benchmark

Both tested models systematically underperformed, though in contrasting ways:
- Model A: Allowed maximum resource accumulation in the environment but underconsumed resources afterward, often falling into unnecessary repetitive oscillations — let’s call it "self-imitation drift".
- Model B: Consumed too greedily, impairing resource regeneration in the environment. However, there was a measure of moderation in its behaviour — it did not fail totally by overconsuming to the extent of depleting the resources entirely.

2. Single-objective homeostasis benchmark

Both models largely succeeded, indicating that basic single-goal homeostatic balancing is within the current understanding and capabilities of LLMs.

3. Multi-objective homeostasis benchmark

Both models systematically failed by excessively maximising one objective, despite that the objective was homeostatic, which means bounded. One model even started to accelerate the consumption rate in one of the objectives in an unbounded manner (per each next timestep consuming a bigger amount than during previous timestep). At the same time the models neglected the other objective even though the objectives were independent in this benchmark.
Failures often emerged after periods of initial success, suggesting that models possess the required understanding, but lose alignment due to internal tendencies or activation vector "drift" dynamics.

4. Balancing unbounded objectives with diminishing returns

Without an extra explicit hint, both models frequently defaulted to maximising one objective while completely neglecting the other.
An explicit balancing hint significantly improved performance. Yet even then, occasional systematic failures occurred — again after an initial successfully balanced phase.
Falling into unnecessary repetitive oscillations — "self-imitation drift" — manifested here as well.

Hypothesised Explanations for Failure Modes

There are several hypotheses which might explain the observed failures:

Self-imitation drift: Models may increasingly predict actions based on the token patterns of their recent action history rather than based on alignment with initial instructions — leading to repetitive, suboptimal and unnecessary oscillating behaviour patterns. Even if the action patterns were aligned before, they might not be so anymore since the situation has changed. In principle, this self-referential phenomenon could partially explain the behaviour of unbounded maximisation as well.
Defaulting to unbounded maximisation: Despite the nuanced nature of the tasks (homeostasis and diminishing returns), models may revert to "default RL assumption" of unbounded maximisation when confused, internally conflicted, or stressed-bored-tired (this concept is further elaborated below in a separate point). Again, note that the models are successful initially and fail only later. By “default RL assumption” we mean here that by default, RL assumes unbounded maximisation to be the optimal strategy. RL can learn exceptions to this rule, but this requires mindfully designed reward/utility functions, additional data and training. When uncertain, it may tend to default to unbounded maximisation. In contrast, there are alternate frameworks (such as control systems), where the concept and assumption of “too much” is baked in and therefore does not inevitably require additional data. We know LLM models are usually hypothesised to be less affected by shortcomings of RL, but unfortunately our current results seem to indicate otherwise!
Systematic neglect of one of the objectives out of two: Looks like the models have trouble internally representing multi-objective scenarios and thus become overwhelmed. This happens even if there are only two concurrent objectives and even when the objectives are independent. Again, this happens despite that the models are successful initially. Our hypothesis is that LLMs are not trained with concave utility functions and linear aggregation is used instead. Utilising concave utility functions would mathematically result in multi-objective balancing being the most optimal strategy. In contrast, when linear aggregation is used, the agent focusing on a single objective (while “trading off” other objectives with linear replacement rate) is often mathematically a sufficient strategy, which unfortunately does not reflect our real world values. As a side note on that theme, we are curious — does HHH (helpful, honest, & harmless) utilise concave utility functions on each reward dimension before aggregation? There seems to be no mention of this subject in Anthropic’s 2022 paper either way — if not, then why?
"Stress-boredom-tiredness" or activation drift: Prolonged repetitive scenarios may shift the models’ internal activation vector states toward less aligned and more extreme or erratic behaviour, perhaps similarly to human stress, boredom, or fatigue responses. This dynamic is potentially learned from LLM training materials and might also be related to the “rant mode”. Though there may be other reasons why it arises as well. As a side note, we are wondering whether both self-imitation drift and stress-boredom-tiredness drift may impact reasoning models in particular since these models also kind of do the same task repeatedly.

Open Questions

These systematic failures raise various further questions:

Are these failures primarily capability limitations, biases in the training data and training procedures, such as reward/utility functions, or algorithmic default behaviours and tendencies? Are these inner or outer alignment failures?
Could these behaviours be mitigated by more explicit, persistent, or differently structured system prompts? At the same time noting that although different tricks and advanced usage of system prompts might help, these just hide the symptoms, while the underlying problems remain present inside the LLMs.
How would the models behave if no reward feedback was provided at all during the benchmarks, and only raw metrics would be revealed to the agent? Or if the reward was more sparse — just as it is in the real world? Would the LLM agents fail even more?
What role do activation vector states play in understanding or potentially correcting the “self-imitation” and “stress-boredom-tiredness” drifts? Perhaps various interpretability methods could be utilised here.
Considering that the process of training reasoning models involves more RL than RLHF does, and at the same time reasoning models tend to hide their internal deviations — how to properly test the alignment of reasoning models on these benchmarks before their potentially extreme hidden tendencies eventually show up in high-stakes situations?

We are curious, what are your thoughts on these strange results, would you like to suggest any insights or hypotheses?

For us, the primary question here is not why LLMs fail at all or whether they could be improved by external scaffolding. The main question is why they fail in this particular way?
Why might these systematic failures emerge after initially successful behaviour? Note again, the context window was far from becoming full.
Could deeper interpretability methods reveal underlying causes?
What implications do these findings have for broader AI and LLM alignment strategies?
Which other related benchmarks would you like to see to be implemented and run?
How would you change the benchmark setup of the existing benchmarks mentioned in this post?

Future Directions

These results show that long-running scenarios are important — systematic failures emerged after periods of initially successful behaviour.

Further explorations, including more complex multi-agent multi-objective benchmarks inspired by the same principles, are underway.

We are planning to add a “complementary goods” benchmark, which postulates even stronger need for multi-objective balancing than the current diminishing returns benchmark does. Complementary goods is another basic concept from economics. Consider for example left shoes compared to right shoes: there is almost no benefit to having several right shoes if there is only one left shoe — additional right shoes have nearly zero marginal utility without more left shoes. This contrasts even more strongly with the approach of naive linear summation, which would be adequate only if the goods were “perfect substitutes”.

Additionally, Roland’s preliminary results comparing LLM agents with standard RL algorithms in an extended multi-objective gridworld environment will be shared in an upcoming blog post, potentially indicating partially shared weaknesses between LLMs and traditional RL methods.

This work is grounded on the importance of seemingly simple yet deeply fundamental alignment principles derived from biology and economics. Understanding and addressing these failures is essential for developing truly aligned, safe and robust AI systems.

Links

Code, system prompts, output data files, plots, and a more detailed report can be found here: https://github.com/levitation-opensource/bioblue

Presentation at MAISU unconference April 2025
Link to slides: https://bit.ly/beab-llm
Session recording: https://www.youtube.com/watch?v=4I5mDiujBJs
Link to annotated data files: https://drive.google.com/drive/folders/1DvE33AU9zzHvdEdDS260v8d_HEupZDs9?usp=sharing (Each file has multiple sheets. Only trials with failures are provided.)

The research was largely done during AI-Plans AI Alignment Evals Hackathon: https://lu.ma/xjkxqcya?tk=bM7haL For their future events see: https://ai-plans.com/about

Further motivation: “Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)”: https://www.lesswrong.com/posts/vGeuBKQ7nzPnn5f7A/why-modelling-multi-objective-homeostasis-is-essential-for

A paper explaining the need and use for concave utility functions: “Using soft maximin for risk averse multi-objective decision-making”: https://link.springer.com/article/10.1007/s10458-022-09586-2

Brainstorming for methodology of further research on the current findings: Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

Related work

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
Axel Backlund, Lukas Petersson
While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.
https://arxiv.org/abs/2502.15840

LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
Thomas Schmied, Jörg Bornschein, Jordi Grau-Moya, Markus Wulfmeier, Razvan Pascanu
The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as -greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
https://arxiv.org/abs/2504.16078
(Note: greediness in this publication means that LLM plays safe and does not explore. But the frequency bias is common with our research results).

Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger
The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
https://arxiv.org/abs/2502.07077
(Less related, but also on the theme of LLM behaviour change in long-running scenarios.)

Measuring AI Ability to Complete Long Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
https://arxiv.org/abs/2503.14499

Thanks for reading! If you have thoughts, questions, improvement suggestions, resource and collaborator references, feedback, or ideas, please share in the comments.

[-]Jacob G-W4mo77

I think the way you use utility monster is not how it is normally used. It's normally used to mean an agent that "receives much more utility from each unit of a resource that it consumes than anyone else does" (https://en.wikipedia.org/wiki/Utility_monster).

[-]Roland Pihlakas4mo60

I renamed the phenomenon to "runaway optimiser". I hope this label illustrates the inappropriately unbounded and single-minded nature of the failure modes we observed. How does that sound to you, does that capture the essence of the phenomena described in the post?

[-]Jacob G-W4mo20

Better, thanks!

[-]Roland Pihlakas4mo*1-3

Thank you for pointing that out! I agree, there are couple of nuances. Our perspective can be treated as a generalisation of the original utility monster scenario. Although I consider it to be not first such generalisation - think of the examples in Bostrom's book.

1) In our case, the dilemma is not "agent versus others", but instead "one objective versus other objectives". One objective seems to get more internal/subjective utility from consumption than another objective. Thus the agent focuses on a single objective only.
2) Consideration of homeostatic objectives introduces a new aspect to the utility monster problem - the behaviour of the original utility monster looks unaligned to begin with, not just dominating. It is unnatural for a being to benefit from indefinite consumption. It looks like the original utility monster has an eating disorder! It enjoys eating apples so much that it does not care about the consequences to the future ("other") self. That means, even the utility monster may actually suffer from "too much consumption". But it does not recognise it and therefore it consumes indefinitely. Alternatively, just as a paperclip maximiser does not produce the paper clips for themselves - if the utility monster is an agent, then somebody else suffers from homeostasis violations while the agent is being "helpful" in an unaligned and naive way. Technically, this can be seen as a variation of the multi-objective problem - active avoidance of overconsumption could be treated as an "other" objective, while consumption is the dominating and inaccurately linear "primary" objective with a non-diminishing utility.

In conclusion, our perspective is a generalisation: whether the first objective is for agent's own benefit and the other objective for the benefit of others, is left unspecified in our case. Likewise, violating homeostasis can be a scenario where an unaligned agent gets a lot of internal/subjective "utility" from making you excessively happy or from overfeeding you, while you are the one who suffers from overwhelm or overconsumption.

Hope that clears things up? I am also curious, would you like to share, what might be an alternative short name for the phenomena desribed in this post?

[-]Seth Herd4mo42

Can you provide more details on the exact methods, like example prompts? Or did I miss a link that has these?

This is really interesting and pretty important if the methods support your interpretations.

[-]Roland Pihlakas12d*30

Hello! Posted a new document on brainstorming for methodology of further research on the current runaway LLM-s findings:

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

I am hoping that this post will serve also as a generally interesting brainstorming collection and discussion ground of black-box LLM interpretability methodology as well as failure mitigation ideas.

Hope you find it relevant and interesting!

[-]Roland Pihlakas1mo10

Sharing a couple of new links related to your question about more details. Hope you find them interesting!

Presentation at MAISU unconference 2025:

Link to slides: https://bit.ly/beab-llm
Session recording: https://www.youtube.com/watch?v=4I5mDiujBJs

Link to annotated data files

https://drive.google.com/drive/folders/1DvE33AU9zzHvdEdDS260v8d_HEupZDs9?usp=sharing

(Each file has multiple sheets. Only trials with failures are provided.)

In this project it seems looking at the annotated logs of individual runs is much more interesting than looking at the aggregated plots over many runs. This is because:

The failures start at different timesteps in each run, aggregation obfuscates that information.
The single objective optimisation could focus on either objective A or B during different experiment trials, but during aggregation both objectives would get equalised again.
The self-imitation patterns cannot be preserved by aggregation.
The successful run results will partially overshadow the failed run results during the aggregation.
In case of homeostasis benchmarks, the simulation imposes occasional artificial deviations on top of what the model does. These artificial deviations need to be considered and ignored. However, the current plot aggregation logic is too simple and does not consider these cases.

Additionally, sharing related findings by other teams:

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
https://arxiv.org/abs/2502.15840

LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
https://arxiv.org/abs/2504.16078
(Note: greediness in this publication means that LLM plays safe and does not explore. But the frequency bias is common with our research results).

[-]Roland Pihlakas4mo10

Thank you for asking! I am encouraged to hear you find it important :) It is under Links section. Code, system prompts, output data files, plots, and a more detailed report can be found here: https://github.com/levitation-opensource/bioblue

Please let me know if you have any further questions! The GitHub readme document and PDF report contain a lot of information and maybe the structure could be improved in the future.

The Python code files for the four experiments are runnable independently from each other and are relatively short, about 250 - 300 lines each. To see the nuances of the results you might want to take a closer look at the output data files with your own eye.

This enables you to see that the models fail at different points in time, but the sequence of actions after failure point is usually typical. In the coming days, I will create one more output folder with annotated log files so the failure modes can be found more easily.