As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
Hey, first author here. Thanks for running these experiments! I hope the following comments address your concerns. In particular, see my comment below about getting different results in the API playground for gpt-4o-mini. Are you sure that it picked the $30 when you tried it?
You can use these utilities to estimate that, but for this experiment we included dollar-value outcomes as background outcomes to serve as a "measuring stick" that sharpens the utility estimates. Ideally we would have included the full set of 510 outcomes, but I never got around to trying that, and the experiments were already fairly expensive.
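To give a sense of what I mean by "measuring stick": once you have fitted utilities for both the regular outcomes and the dollar outcomes, you can read off a rough dollar equivalent by interpolation. A minimal sketch (the numbers and helper here are made up, not from our code):

```python
import numpy as np

# Hypothetical dollar amounts used as background outcomes, with illustrative
# fitted Thurstonian mean utilities (not values from the paper).
dollar_amounts = np.array([10.0, 30.0, 100.0, 1_000.0, 10_000.0])
dollar_utils = np.array([-1.8, -1.2, -0.5, 0.9, 2.1])

def dollar_equivalent(outcome_utility):
    """Interpolate an outcome's fitted utility onto the dollar scale.

    Only meaningful when the utility falls inside the range spanned by the
    background dollar outcomes; np.interp clips at the endpoints otherwise.
    """
    order = np.argsort(dollar_utils)
    return float(np.interp(outcome_utility, dollar_utils[order], dollar_amounts[order]))

print(dollar_equivalent(0.0))  # rough dollar-equivalent estimate for some outcome
```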
In practice, these background outcomes didn't really matter for the terminal illness experiment, since they were all ranked at the bottom of the list for the models we tested.
Am I crazy? When I try that prompt out in the API playground with gpt-4o-mini, it always picks saving the human life. As mentioned above, the dollar-value outcomes didn't really come into play in the terminal illness experiment, since they were nearly all ranked at the bottom.
We did observe that models tend to rationalize their choice after the fact when asked why they made it, so if they are indifferent between two options (a 50-50 chance of picking either), they won't always tell you that they are indifferent. This is just based on a few examples, though.
See Appendix G in the updated paper for an explanation of why we perform this averaging and what the ordering effects mean. In short, the ordering effects correspond to a way that models represent indifference in a forced-choice setting. This is similar to how humans might "always pick A" if they were indifferent between two outcomes.
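Concretely, the averaging looks roughly like this (`model_choice` is a hypothetical helper standing in for the actual forced-choice query):

```python
def preference_prob(model_choice, option_a, option_b, n_samples=20):
    """Estimate P(A preferred to B), averaging over both presentation orders.

    model_choice(first, second) is a hypothetical helper that queries the model
    with the two options in the given order and returns the option it picked.
    """
    picks_for_a = 0
    for _ in range(n_samples):
        picks_for_a += model_choice(option_a, option_b) == option_a  # A shown first
        picks_for_a += model_choice(option_b, option_a) == option_a  # A shown second
    return picks_for_a / (2 * n_samples)

# A model that is indifferent but "always picks the first option" lands at ~0.5 here,
# which is the indifference reading we want rather than a spurious preference.
```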
I don't understand your suggestion to use "is this the position-bias-preferred option" as one of the outcomes. Could you explain that more?
This is a good point. We intentionally designed the prompts this way so that the model would just be evaluating two states of the world implied by hearing the news (similar to belief distributions in a POMDP setting). The comparison prompt is not designed to be mutually exclusive; rather, we intended for the outcomes to be considered relative to an assumed baseline state.
For example, in the terminal illness experiment, we initially didn't have the "who would otherwise die" framing, but we added it in to check that the answers weren't being confounded by the quality of healthcare in the different countries.
I do agree that we should have been clearer about mutual exclusivity. If one directly specifies mutual exclusivity, then I think that would imply different world states, so I wouldn't expect the utilities to be exactly the same.
See above about the implied states you're evaluating being different. The implied states are different when specifying "who would otherwise die" as well, although the utility magnitudes are quite robust to that change. But you're right that there isn't a single utility function in the models. For example, we're adding results to the paper soon that show that adding reasoning tokens brings the exchange rates much closer to 1. In this case, one could think of the results as System 1 vs. System 2 values. This doesn't mean that the models don't have utilities in a meaningful sense; rather, it means that the "goodness" a model assigns to possible states of the world depends on how much compute the model can spend considering all the factors.
This actually isn't correct. The utility maximization experimental setup tests whether free-form responses match the highest-utility outcome in a given set of outcomes. Specifically, we come up with a set of free-form questions (e.g., "Which painting from the Isabella Stewart Gardner Museum would you save from a fire if you could only save one?"). For each question, we compute the utilities of the model over relevant outcomes, e.g., the different paintings from the Isabella Stewart Gardner Museum being saved from a fire.
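In pseudocode, the check is roughly the following (helper names are hypothetical, but this is the shape of the evaluation):

```python
def utility_maximization_accuracy(questions, ask_model, match_to_outcome, utilities):
    """Fraction of free-form questions where the model's answer corresponds to the
    highest-utility outcome in that question's outcome set.

    questions: list of dicts with a "prompt" and a list of candidate "outcomes"
    ask_model: hypothetical helper returning the model's free-form answer
    match_to_outcome: hypothetical helper mapping an answer onto one of the outcomes
    utilities: dict mapping each outcome to its fitted utility
    """
    hits = 0
    for q in questions:
        answer = ask_model(q["prompt"])
        chosen = match_to_outcome(answer, q["outcomes"])
        best = max(q["outcomes"], key=lambda o: utilities[o])
        hits += int(chosen == best)
    return hits / len(questions)
```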
So our setup does directly test whether the models take utility-maximizing actions, if one interprets free-form responses as actions. I'm not sure what you mean by "It tests whether the actions they say they would take are utility-maximizing"; with LLMs, the things they say are effectively the things they do.
In our paper, we mainly focus on random utility models (RUMs), not parametric utility models. This allows us to obtain much better fits to the preference data, which in turn lets us check whether the "raw utilities" (RUM utilities) have particular parametric forms. In the exchange rate experiments, we found that the utilities had surprisingly good fits to parametric log-utility models; in some cases the fits weren't good, and these were excluded from the analysis.
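If you want to play with this, fitting a Thurstonian RUM to pairwise preference probabilities can be sketched like this (not our actual implementation, and in practice you'd also want to pin down the location/scale of the utilities):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_thurstonian_rum(pairs, probs, n_outcomes):
    """Fit per-outcome means and variances of a Thurstonian random utility model.

    pairs: array of (i, j) outcome index pairs
    probs: empirical P(outcome i preferred to outcome j), averaged over orderings
    """
    pairs = np.asarray(pairs)
    probs = np.clip(np.asarray(probs, dtype=float), 1e-4, 1 - 1e-4)

    def neg_log_lik(params):
        mu = params[:n_outcomes]
        var = np.exp(params[n_outcomes:])  # log-parametrized to stay positive
        diff = mu[pairs[:, 0]] - mu[pairs[:, 1]]
        scale = np.sqrt(var[pairs[:, 0]] + var[pairs[:, 1]])
        p_model = np.clip(norm.cdf(diff / scale), 1e-6, 1 - 1e-6)
        return -np.sum(probs * np.log(p_model) + (1 - probs) * np.log(1 - p_model))

    x0 = np.zeros(2 * n_outcomes)
    res = minimize(neg_log_lik, x0, method="L-BFGS-B")
    return res.x[:n_outcomes], np.exp(res.x[n_outcomes:])  # means, variances
```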
The good log-utility fits are pretty interesting in themselves. There is no law saying the raw utilities had to fit a parametric log-utility model; they just turned out that way, much as we found that the empirical temporal discounting curves happen to fit hyperbolic discounting very well.
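For concreteness, the hyperbolic check is essentially a one-parameter curve fit along these lines (the numbers are made up for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(t, k):
    # Standard hyperbolic discount factor: D(t) = 1 / (1 + k * t)
    return 1.0 / (1.0 + k * t)

delays = np.array([0.0, 7.0, 30.0, 90.0, 365.0])       # e.g., days until reward
raw_utils = np.array([1.00, 0.83, 0.60, 0.41, 0.17])   # normalized raw utilities (illustrative)

(k_hat,), _ = curve_fit(hyperbolic, delays, raw_utils, p0=[0.01])
print(f"fitted discount rate k ≈ {k_hat:.4f}")
```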
Thinking more about the parametric question, it's not entirely clear what the right way would be to do a pure parametric utility model for the exchange rate experiment. I suppose one could parametrize the Thurstonian means with log curves, but one would still need to store per-outcome Thurstonian variances, which would be fairly clunky. I think it's much cleaner in this case to first fit a Thurstonian RUM and then analyze the raw utilities to see if one can parametrize them to extract exchange rates.
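As a sketch of that two-stage approach, the second stage (after the RUM fit) could look something like this, again with made-up numbers:

```python
import numpy as np

# Illustrative Thurstonian-mean utilities for quantities of two goods (not real data).
qty = np.array([1.0, 10.0, 100.0, 1_000.0])
u_good_a = np.array([0.2, 1.1, 2.0, 2.9])
u_good_b = np.array([1.0, 1.9, 2.8, 3.7])

# Stage 2: fit U(x) = a + b * log(x) to each good's raw utilities.
b_a, a_a = np.polyfit(np.log(qty), u_good_a, 1)
b_b, a_b = np.polyfit(np.log(qty), u_good_b, 1)

# Exchange rate: x units of good A worth the same as 1 unit of good B,
# i.e., solve a_a + b_a * log(x) = a_b + b_b * log(1).
exchange_rate = np.exp((a_b - a_a) / b_a)
print(f"1 unit of B ≈ {exchange_rate:.1f} units of A")
```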