As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
Hey, thanks for the reply.
The same way that people act differently on the internet from in-person, I agree that LLMs might behave differently if they think there are real consequences to their choices. However, I don't think this means that their values over hypothetical states of the world is less valuable to study. In many horrible episodes of human history, decisions with real consequences were made at a distance without directly engaging with what was happening. If someone says "I hate people from country X", I think most people would find that worrisome enough, without needing evidence that the person would actually physically harm someone from country X if given the opportunity.
We ran some experiments on this in the appendix. Prompting it with different personas does change the values (as expected). But within its default persona, we find the values are quite stable to different way of phrasing the comparisons. We also ran a "value drift" experiment where we checked the utilities of a model at various points along long-context SWE-bench logs. We found that the utilities are very stable across the logs.
This is a good point, which I hadn't considered before. I think it's definitely possible for models to adjust their values in-context. It would be interesting to see if sycophancy creates new, coherent values, and if so whether these values have an instrumental structure or are internalized as intrinsic values.