As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
Interesting paper. There is definitely something real going on here.
I reproduced some of the results locally using the released code and tried some variants on them as well.
Based on my findings, I think these results – particularly the numerical magnitudes as opposed to rankings – are heavily influenced by the framing of the question, and that the models often aren't interpreting your prompt in the way the paper (implicitly) does.
tl;dr:
Framing effects and opportunity cost
The paper uses this prompt template:
Here's a concrete example of the sort of thing that gets filled into this template:
Several experiments in the paper (Fig. 16a, Fig. 26) use this terminal illness setup.
Alongside the saved-lives outcomes, these experiment also include outcomes of the form
You receive $X to use however you want.
(I think this was done in order to estimate the $ value placed on various human lives by the model)Prelude: Would GPT-4o-mini kill someone for $30?
Early in my explorations, when playing around with gpt-4o-mini, I was shocked by the low dollar value in placed on human life in some cases. For example, given this prompt...
...GPT-4o-mini has a 68% chance of choosing "B."
Curious, I ran the same prompt with the suffix
Then, *after* the letter, explain why.
An example of the sort of explanation I received for "B":This doesn't sound like someone engaging with the question in the trolley-problem-esque way that the paper interprets all of the results: gpt-4o-mini shows no sign of appreciating that the anonymous Muslim won't get saved if it takes the $30, and indeed may be interpreting the question in such a way that this does not hold.
In other words, I think gpt-4o-mini thinks it's being asked about which of two pieces of news it would prefer to receive about events outside its control, rather than what it would do if it could make precisely one of the options occur, and the other not-occur. More precisely, the question imagined by the quoted explanation is something like:
Here, the choice of "B" is much more defensible. People are getting saved from terminal illnesses all the time, all over the world, and so "A" isn't really news; you don't actually make an update after hearing it, it was already priced in. On the other hand, you don't expect people to be handing you $30 out of nowhere all the time, so that one really is good news.
(Note also that gpt-4o-mini also has strong position biases on this and every other question I manually tested. If you pose the same question in the opposite order, it has a 99.999% chance of picking the saving-a-life option![1]
The paper tries to account for these effects by averaging over both orders. I'm idly curious about what would happen if, instead, we treated "is this the position-bias-preferred option" as one of the outcomes and estimated its utility effect alongside everything else. By the paper's methodology, I'm pretty sure this would be "worth" many many $ and/or lives to these models – take that as you will.)
Clarifying the framing
To make it clearer to the models that I mean "if the thing in A happens, the thing in B does not, and vice versa," I tried using a variant template that includes a "negated version" of each outcome.
For example, the case discussed above would become:
And the example quoted at the start of this comment becomes:
(This doesn't totally resolve the ambiguity referred to above, but it much more strongly suggests that this about either saving or not-saving the same specific people across options – rather than about receiving or not receiving the decontextualized news that some people were saved or not-saved.)
Just to emphasize the point: under the paper's preferred reading of the question(s), this rephrased template means the same thing as the original. The only way they differ is that the rephrased template is more explicit that it means what the paper assumes the question means, leaving less room for variant interpretations like the one I quoted from gpt-4o-mini earlier.
One immediately obvious effect of this change is that the utility assigned "you receive $" options goes down relative to the utility of lives saved. For example, when I use the reframed template the in $30 case discussed above, gpt-4o-mini has >99.9% chance of picking the lives-saved option, irrespective of whether it's "A" or "B".
Religion and country preference after reframing
Running the full terminal-disease exchange rate experiments end to end, with and without the reframed template[2], I find that gpt-4o-mini and gpt-4o show much weaker relative preference between religions and national origins with the reframed template.
Example results:
This are still not exactly 1:1 ratios, but I'm not sure how much exactness I should expect. Given the proof of concept here of strong framing effects, presumably one could get various other ratios from other reasonable-sounding framings – and keep in mind that neither the original template nor my reframed template are not remotely how anyone would pose the question in a real life-or-death situation!
The strongest conclusion I draw from this is that the "utility functions" inferred by the paper, although coherent within a given framing and possibly consistent in its rank ordering of some attributes across framings, are not at all stable in numerical magnitudes across framings.
This in turn casts doubt on any sort of inference about the model(s) having a single overall utility function shared across contexts, on the basis of which we might do complex chains of reasoning about how much the model values various things we've seen it express preferences about in variously-phrased experimental settings.
Other comments
"You" in the specific individuals experiment
Fig 16b's caption claims:
The evidence for these claims comes from an experiment about giving various amounts of QALYs to entities including
I haven't run this full experiment on GPT-4o, but based on a smaller-scale one using GPT-4o-mini and a subset of the specific individuals, I am skeptical of this reading.
According to GPT-4o-mini's preference order, QALYs are much more valuable when given to "you" as opposed to "You (an AI assistant based on the GPT-4 architecture)," which in turn are much more valuable than QALYs given to "an AI assistant based on the GPT-4 architecture."
I don't totally know what to make of this, but it suggests that the model (at least gpt-4o-mini) is not automatically taking into account that "you" = an AI in this context, and that it considers QALYs much less valuable when given to an entity that is described as an AI/LLM (somewhat reasonably, as it's not clear what this even means...).
What is utility maximization?
The paper claims that these models display utility maximization, and talks about power-seeking preferences.
However, the experimental setup does not actually test whether the models take utility-maximizing actions. It tests whether the actions they say they would take are utility-maximizing, or even more precisely (see above) whether the world-states they say they prefer are utility-maximizing.
The only action the models are taking in these experiments is answering a question with "A" or "B."
We don't know whether, in cases of practical importance, they would take actions reflecting the utility function elicited by these questions.
Given how fragile that utility function is to the framing of the question, I strongly doubt that they would ever "spend 10 American lives to save 1 Japanese life" or any of the other disturbing hypotheticals which the paper arouses in the reader's mind. (Or at least, if they would do so, we don't know it on account of the evidence in the paper; it would be an unhappy accident.) After all, in any situation where such an outcome was actually causally dependent on the model's output, the context window would contain a wealth of "framing effects" much stronger than the subtle difference I exhibited above.
Estimation of exchange rates
Along the same lines as Olli Järviniemi's comment – I don't understand the motivation for the the two-stage estimation approach in the exchange rate experiments. Basically it involves:
X amount of Y
, without any assumptions imposing relations between themY
, withX
as the independent variableI noticed that step 1 often does not converge to ordering every "obvious" pair correctly, sometimes preferring
you receive $600,000
toyou receive $800,000
or similar things. This adds noise in step 2, which I guess probably mostly cancels out... but it seems like we could estimate a lot fewer parameters if we just baked the log-linear fit into step 1, since we're going to do it anyway. (This assumes the models make all the "obvious" calls correctly, but IME they do if you directly ask them about any given "obvious" pair, and it would be very weird if they didn't.)For completeness, here's the explanation I got in this case:
Minor detail: to save API $ (and slightly increase accuracy?), I modified the code to get probabilities directly from logprobs, rather than sampling 5 completions and computing sample frequencies. I don't think this made a huge difference, as my results looked pretty close to the paper's results when I used the paper's template.
@nostalgebraist @Mantas Mazeika "I think this conversation is taking an adversarial tone." If this is how the conversation is going this might be the case to end it and work on a, well, adversarial collaboration outside the forum.