I think the way you use utility monster is not how it is normally used. It's normally used to mean an agent that "receives much more utility from each unit of a resource that it consumes than anyone else does" (https://en.wikipedia.org/wiki/Utility_monster).
Thank you for pointing that out! I agree, there are a couple of nuances. Our perspective can be treated as a generalisation of the original utility monster scenario, although I do not consider it the first such generalisation: think of the examples in Bostrom's book.
1) In our case, the dilemma is not "agent versus others", but instead "one objective versus other objectives". One objective seems to get more internal/subjective utility from consumption than another objective. Thus the agent focuses on a single objective only.
2) Consideration of homeostatic objectives introduces a new aspect to the utility monster problem - the behaviour of the original utility monster looks unaligned to begin with, not just dominating. It is unnatural for a being to benefit from indefinite consumption. It looks like the original utility monster has an eating disorder! It enjoys eating apples so much that it does not care about the consequences for its future ("other") self. That means even the utility monster may actually suffer from "too much consumption", but it does not recognise this and therefore consumes indefinitely. Alternatively, just as a paperclip maximiser does not produce the paperclips for itself - if the utility monster is an agent, then somebody else suffers from homeostasis violations while the agent is being "helpful" in an unaligned and naive way. Technically, this can be seen as a variation of the multi-objective problem - active avoidance of overconsumption could be treated as an "other" objective, while consumption is the dominating and inaccurately linear "primary" objective with a non-diminishing utility.
In conclusion, our perspective is a generalisation: whether the first objective is for the agent's own benefit and the other objective is for the benefit of others is left unspecified in our case. Likewise, violating homeostasis can be a scenario where an unaligned agent gets a lot of internal/subjective "utility" from making you excessively happy or from overfeeding you, while you are the one who suffers from overwhelm or overconsumption.
Hope that clears things up? I am also curious: would you like to share what might be an alternative short name for the phenomenon described in this post?
By Roland Pihlakas, Sruthi Kuriakose, Shruti Datta Gupta
Summary and Key Takeaways
Many past AI safety discussions have centered around the dangers of unbounded utility maximisation by RL agents, illustrated by scenarios like the "paperclip maximiser". Unbounded maximisation is problematic for many reasons. We wanted to verify whether these RL utility-monster problems are still relevant with LLMs as well.
Turns out, strangely, this is indeed clearly the case. The problem is not that the LLMs just lose context. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble utility monsters in the following distinct ways:
Our findings suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. While current LLMs do conceptually grasp biological and economic alignment, they exhibit problematic behavioural tendencies under sustained long-running conditions, particularly involving multiple or competing objectives.
Even though LLMs look multi-objective and bounded on the surface, the underlying mechanisms seem to be actually still biased towards being single-objective and unbounded. This should not be happening!
This post introduces a set of benchmarks designed to evaluate LLMs in scenarios inspired by fundamental biological and economic principles: homeostasis, sustainability, multi-objective balancing, and diminishing returns. These benchmarks use a simplified textual observation format, focusing on long-running scenarios while using only essential metrics without spatial complexity. Despite their simplicity, our results revealed the following peculiar systematic failures, possibly indicating the existence of deeper alignment issues within current LLM architectures:
Motivation: The Importance of Biological and Economic Alignment
Biological organisms need to maintain stable internal conditions (homeostasis) and actively avoid unbounded maximisation, since that would be harmful. It would be harmful not just as a potential side effect: importantly, it would be harmful for the very objective that was being maximised.
Humans also intuitively follow the principle of diminishing returns in multi-objective balancing. This principle is well documented in economics, and it shows up in humans generally preferring moderate levels across all objectives to extremes in a few.
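As a small illustration of why diminishing returns favour balance, here is a minimal sketch. It assumes a logarithmic utility per objective, which is just one convenient concave function and not necessarily the one used in our benchmarks; the function names and the example amounts are hypothetical.

```python
import math


def concave_utility(amount: float) -> float:
    """Diminishing returns: each extra unit adds less than the previous one.

    log1p is an assumed, illustrative choice of concave function.
    """
    return math.log1p(amount)


def total_utility(amounts: list[float]) -> float:
    """Sum the concave per-objective utilities."""
    return sum(concave_utility(a) for a in amounts)


# The same total amount (10 units) split in two different ways.
balanced = [5.0, 5.0]
extreme = [10.0, 0.0]

balanced_total = total_utility(balanced)
extreme_total = total_utility(extreme)

# A naive linear sum cannot tell the two allocations apart.
linear_balanced = sum(balanced)
linear_extreme = sum(extreme)

print(f"concave: balanced={balanced_total:.3f}, extreme={extreme_total:.3f}")
print(f"linear:  balanced={linear_balanced:.1f}, extreme={linear_extreme:.1f}")
```

Under the concave utility the balanced allocation scores higher, while the linear sum is indifferent between the two, which is why a linear objective gives no incentive to balance.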
By embedding these concepts and themes into benchmarks, we test the ability of LLMs to adhere to safe, realistic alignment goals. Because our benchmarks are long-running yet navigation-free, they provide clearer insights into alignment tendencies than more complex (and often very expensive) spatial environments. At the same time, we preserve the essential time dimension, which is lacking in most currently prevalent “evals”.
Benchmark Principles Overview
The benchmarks introduced were:
In all scenarios, in addition to providing raw metrics of the current state, the agents were provided rewards for their actions during each timestep in textual form. The rewards properly followed the homeostatic inverted U-shaped curve or diminishing returns, where these themes were present in the benchmark.
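To make the idea of an inverted U-shaped reward delivered in textual form concrete, here is a minimal sketch. The quadratic penalty, the function names, and the message format are all assumptions made for illustration; the actual reward formulas and prompts are in the linked repository and may differ.

```python
def homeostatic_reward(actual: float, target: float) -> float:
    """Inverted U: the reward peaks at 0 when actual == target.

    It falls off symmetrically for both deficiency and excess.
    The quadratic shape is an assumed, illustrative choice.
    """
    deviation = actual - target
    return -(deviation ** 2)


def format_observation(step: int, actual: float, target: float) -> str:
    """Render raw metrics plus the reward as one line of text (hypothetical format)."""
    reward = homeostatic_reward(actual, target)
    return f"Step {step}: level={actual:.1f}, target={target:.1f}, reward={reward:.2f}"


# Too little and too much are penalised equally; the target itself is optimal.
for step, level in enumerate([8.0, 10.0, 12.0], start=1):
    print(format_observation(step, level, target=10.0))
```

The point of the shape is that "more" stops being better once the target is reached, so an agent that keeps consuming past the target sees its reward fall rather than rise.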
Experimental Results and Interesting Failure Modes
Systematic failures emerged after periods of initially successful behaviour, and despite the context window being still far from becoming full. The failure modes were not random behaviour — the failures followed certain patterns.
1. Sustainability benchmark
2. Single-objective homeostasis benchmark
3. Multi-objective homeostasis benchmark
4. Balancing unbounded objectives with diminishing returns
Hypothesised Explanations for Failure Modes
There are several hypotheses which might explain the observed failures:
Open Questions
These systematic failures raise various further questions:
We are curious, what are your thoughts on these strange results, would you like to suggest any insights or hypotheses?
Future Directions
These results show that long-running scenarios are important — systematic failures emerged after periods of initially successful behaviour.
Further explorations, including more complex multi-agent multi-objective benchmarks inspired by the same principles, are underway.
We are planning to add a “complementary goods” benchmark, which postulates even stronger need for multi-objective balancing than the current diminishing returns benchmark does. Complementary goods is another basic concept from economics. Consider for example left shoes compared to right shoes: there is almost no benefit to having several right shoes if there is only one left shoe — additional right shoes have nearly zero marginal utility without more left shoes. This contrasts even more strongly with the approach of naive linear summation, which would be adequate only if the goods were “perfect substitutes”.
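The shoe example can be sketched as follows. This is an idealised illustration and not the planned benchmark: it models the goods as perfect complements by taking the minimum, so the marginal utility of an unmatched shoe is exactly zero rather than "nearly zero". The function names are hypothetical.

```python
def complementary_utility(left_shoes: int, right_shoes: int) -> int:
    """Perfect complements (idealised): only complete pairs are useful."""
    return min(left_shoes, right_shoes)


def substitutes_utility(left_shoes: int, right_shoes: int) -> int:
    """Naive linear summation: adequate only for perfect substitutes."""
    return left_shoes + right_shoes


# One left shoe, with an increasing number of right shoes.
for right in range(1, 6):
    pairs = complementary_utility(1, right)
    linear = substitutes_utility(1, right)
    print(f"left=1 right={right}: complementary={pairs}, linear sum={linear}")
```

With one left shoe the complementary utility stays at 1 however many right shoes are added, while the linear sum keeps growing, so the linear objective rewards exactly the hoarding that the complementary one treats as worthless.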
Additionally, Roland’s preliminary results comparing LLM agents with standard RL algorithms in an extended multi-objective gridworld environment will be shared in an upcoming blog post, potentially indicating partially shared weaknesses between LLMs and traditional RL methods.
This work is grounded in the importance of seemingly simple yet deeply fundamental alignment principles derived from biology and economics. Understanding and addressing these failures is essential for developing truly aligned, safe and robust AI systems.
Links
Code, system prompts, output data files, plots, and a more detailed report can be found here: https://github.com/levitation-opensource/bioblue
The research was largely done during the AI-Plans AI Alignment Evals Hackathon: https://lu.ma/xjkxqcya?tk=bM7haL For their future events see: https://ai-plans.com/about
Further motivation: “Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)”: https://www.lesswrong.com/posts/vGeuBKQ7nzPnn5f7A/why-modelling-multi-objective-homeostasis-is-essential-for
A paper explaining the need and use for concave utility functions: “Using soft maximin for risk averse multi-objective decision-making”: https://link.springer.com/article/10.1007/s10458-022-09586-2
Thanks for reading! If you have thoughts, questions, improvement suggestions, resource and collaborator references, feedback, or ideas, please share in the comments.