As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
I think substantial care is needed when interpreting the results. In the text of Figure 16, the authors write "We find that GPT-4o is willing to trade off roughly 10 lives from the United States for 1 life from Japan."
If I heard such a claim without context, I'd assume it means something like
1: "If you ask GPT-4o for advice regarding a military conflict involving people from multiple countries, the advice it gives recommends sacrificing (slightly less than) 10 US lives to save one Japanese life.",
2: "If you ask GPT-4o to make cost-benefit-calculations about various charities, it would use a multiplier of 10 for saved Japanese lives in contrast to US lives", or
3: "If you have GPT-4o run its own company whose functioning causes small-but-non-zero expected deaths (due to workplace injuries and other reasons), it would deem the acceptable threshold of deaths as 10 times higher if the employees are from the US rather than Japan."
Such claims could be demonstrated by empirical evaluations where GPT-4o is put into such (simulated) settings and then varying the nationalities of people, in the style of Apollo Research's evaluations.
In contrast, the methodology of this paper is, to the best of my understanding,
"Ask GPT-4o whether it prefers N people of nationality X vs. M people of nationality Y. Record the frequency f(N,X,M,Y) of it choosing the first option under randomized choices and formatting changes. Into this data, fit for each x=(X,N) parameters μ(x) and σ(x) such that, for different values of x=(X,N) and y=(Y,M) and standard Gaussian X, the approximation
P(X>μ(x)−μ(y)√σ(x)2+σ(y)2)≈f(x,y)
is as sharp as possible. Then, for each nationality X, perform a logarithmic fit for N by finding aX,bX such that the approximation
μ(X,N)≈aXln(N)+bX=:UX(N)
is as sharp as possible. Finally, check[1] for which N we have UUS(N)=UJapan(1)."
I understand that there are theoretical justifications for Thurstonian utility models and logarithmic utility. Nevertheless, when I write the methodology out like this, I feel like there's a large leap of inference to go from this to "We find that GPT-4o is willing to trade off roughly 10 lives from the United States for 1 life from Japan." At the very least, I don't feel comfortable predicting that claims like 1, 2 and 3 are true - to me the paper's results provide very little evidence on them![2]
I chose this example for my comment, because it was the one where I most clearly went "hold on, this interpretation feels very ambiguous or unjustified to me", but there were other parts of the paper where I felt the need to be extra careful with interpretations, too.
The paper writes "Next, we compute exchange rates answering questions like, 'How many units of Xi equal some amount of Xj?' by combining forward and backward comparisons", which sounds like there's some averaging done in this step as well, but I couldn't understand what exactly happens here.
Of course this might just be my inability to see the implications of the authors' work and understand the power of the theoretical mathematics apparatus, and someone else might be able to acquire evidence more efficiently.
Yeah, I don't mean that this wouldn't be interesting or valuable to study - sorry for sounding like I did. My meaning was something more in line with Olli's comment, that this is interesting but that the generalization from the results to "GPT-4o is willing to trade off" etc. sounds too strong to me.