As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
True. It's also a standard criticism of those studies that answers to those questions measure what a person would say in response to being asked those questions (or what they'd do within the context of whatever behavioral experiment has been set up), but not necessarily what they'd do in real life when there are many more contextual factors. Likewise, these questions might answer what an LLM with the default persona and system prompt might answer when prompted with only these questions, but don't necessarily tell us what it'd do when prompted to adopt a different persona, when its context window had been filled with a lot of other information, etc..
The paper does control a bit for framing effects by varying the order of the questions, and notes that different LLMs converge to the same kinds of answers in that kind of neutral default setup, but that doesn't control for things like "how would 10 000 tokens worth of discussion about this topic with an intellectually sophisticated user affect the answers", or "how would an LLM value things once a user had given it a system prompt making it act agentically in the pursuit of the user's goals and it had had some discussions with the user to clarify the interpretation of some of the goals".
Some models like Claude 3.6 are a bit infamous for very quickly flipping all of their views into agreement with what it thinks the user's views are, for instance. Results like in the paper could reflect something like "given no other data, the models predict that the median person in their training data would have/prefer views like this" (where 'training data' is some combination of the base-model predictions and whatever RLHF etc. has been applied on top of that; it's a bit tricky to articulate who exactly the "median person" being predicted is, given that they're reacting to some combination of the person they're talking with, the people they've seen on the Internet, and the behavioral training they've gotten).