Some quick thoughts (only skimmed the post, writing quickly), as you asked for feedback:
It looks like the main thing you're testing is some variant of "when prompted to do goal directed behavior, how effective is the model at satisfying the goal?" That's a reasonable thing to investigate, but I'm not sure it would be near the top of the list of "empirical research on goal-directed-ness that I want to see". I'm probably mainly interested in the deceptive alignment motivation, read the rest of this comment as focusing on that.
Aside: To state it directly, I think the main reason to study goal-directedness in this lower-validity setting (of giving models goals in prompts) is that CoT-based goal-directedness might act as a precursor for in-forward-pass goal directedness (which seems far more worrying re deceptive alignment) — so we can study it earlier. So again, reasonable to study, but if you agree with me that this is the main reason for such experiments being valid, it's an important frame to have when thinking about this kind of work: artificially inducing goal directedness is a model-organism approach rather than a natural experiment.
Thinking out loud, a list for goal-directedness work I want to see might be; sub-bullets are more detailed ideas:
It looks like the settings in this post are sorta a general capability eval for a model accomplishing goals. I wonder if you think they add a ton of value over existing agent benchmarks like SWE-Bench? My intuition says you would be better off on trying to focus in on a narrower question that is particularly relevant to safety, like one of those I mentioned.
Sorry if this comment was rude or mean, it's been a couple weeks and this post has no feedback even though you asked, I figured something might be better than nothing. It looks to me like your overall approach and ways of thinking about this are good!
Thank you for the detailed feedback, I found this very helpful and not at all rude or mean!
I suspect there are a few key disagreements between us that make me more optimistic about this project setup than you. I'd be curious about whether you agree on these points being important cruxes:
I definitely agree that it would be interesting to compare the goal-directedness of base models and fine-tuned models, and this is something we're planning to eventually do if our compute budget permits. Similarly, I strongly agree that it would be interesting to study whether anything interesting is going on in the situations where the models exhibit goal-directed behavior, and I'm very interested in looking further into your suggestions for that!
Yep, I basically agree with those being the cruxes!
On how much of the goal reasoning is happening out loud: Nowadays, I think about a lot of AI safety research as being aimed at an AI Control scenario where we are closely supervising what models are thinking about in CoT, and thus malign goal seeking must either happen in individual forward passes (and translated to look harmless most the time) or in an obfuscated way in CoT. (or from rogue deployments outside the control scheme)
Studying the intrinsic goals of the models in naturalistic settings would also be interesting, but I expect this to be hard, and also less relevant to the threat models we're worried about. I'm mainly concerned about models becoming more goal-directed with scale and someone figuring out long-horizon RL training on top of that, leading to models that are qualitatively different from current ones.
By naturalistic, I mean "from a realistic training process, even if that training is designed to create goals". Which sounds like what you said is the main threat model you're worried about? If you have the time, I would push you harder on this: what is a specific story of AI catastrophe that you are trying to study/prevent?
I really appreciate the thoughtful replies and feedback—I also didn't read any of your comments as rude or mean! I'd like to clarify a few points about our approach and its relevance:
We expect that good CoT reasoning helps current models better pursue and achieve goals. In the future, all of the reasoning needed to do this effectively might happen in the forward pass. However, we think it's likely that AGI could be achieved with models similar to current SOTA models, where sophisticated reasoning for goal pursuit will still need to happen in CoT. Even if this bet is wrong and models can do this reasoning entirely in the forward pass, our evals should still reveal goal-directed behavior because we assess actions the model takes that are instrumental to achieving the goal (not just the sensibility of the reasoning traces).
Regarding AI control scenarios, we understand that with good oversight methods, deceptive goal-pursuit would need to happen either in the forward pass or through obfuscated reasoning in CoT. While our scoring of reasoning traces may become less relevant as models improve at internal reasoning, our other metrics focus on how well the model carries out steps and decides on instrumental actions that lead to goal achievement. We believe this serves as a good proxy for goal-directedness, even if we can't directly observe how intrinsic goals arise or manifest from the training process.
A potential risk scenario we're considering is one where a highly capable model is developed that doesn't show obvious signs of misalignment in standard tests (e.g. typical use or general reasoning benchmarks) but has strong goal-pursuit capabilities. Without evaluations that specifically probe goal-directedness, such a model could potentially develop deceptive goals while becoming adept at hiding them. Our evals aim to surface whether models have the ability to perform the necessary reasoning to competently pursue goals, by measuring their capability to carry out the steps needed to achieve those goals and the circumstances under which this behavior is impacted.
We remain agnostic to what sort of training results in these capabilities and think it should be possible to get a good idea of goal-directedness by assessing aspects of how well models can carry out goals (I would imagine that a capable, goal-directed model with intrinsic goals would also be able to be goal-directed towards goals inscribed through prompting, since we train models to follow instructions. If this expectation is wrong, this would be a major limitation of our evals).
Our approach differs from typical capability evaluations in that we try to determine the utility of models’ actions towards goal fulfillment, situations or conditions under which models fail, the time horizons over which they can act and plan, and their corrigibility under adverse circumstances or when presented with conflicting goals. This allows us to assess aspects of goal-directedness that are distinct from general reasoning capabilities. I know it’s not touched on in the post, since we hadn’t started implementing these tasks when we wrote this, but we're currently developing long-horizon tasks to better assess these aspects of goal-directedness, and expect to have preliminary results in a few weeks!
This post was written as part of the summer 2024 cohort of the ML Alignment & Theory Scholars program, under the mentorship of Marius Hobbhahn.
Summary
Over the past four weeks, we have been developing an evaluation suite to measure the goal-directedness of LLMs. This post outlines our motivation, our approach, and the way we’ve come to think about the problem, as well as our initial results from experiments in two simulated environments.
The main motivation for writing the post is to convey our intuitions about goal-directedness and to gather feedback about our evaluation procedure. As these are uncertain preliminary results, we welcome any insights or critiques—please let us know if you think we're failing to evaluate the right thing!
Motivation
We want to evaluate goal-directedness for three reasons:
First, goal-directedness is a prerequisite for deceptive alignment. As argued in Risks from Learned Optimization, only a goal-directed system with an objective that extends across parameter updates can effectively conceal its true goals until deployment and strategically choose different goals in various situations. Accurately characterizing a model’s level of goal-directedness can help us determine whether that model is theoretically capable of pursuing a deceptive goal. Effective goal-directedness evaluations would also enable us to run experiments to better understand its causal drivers, such as model scale or fine-tuning techniques.
Second, goal-directedness is highly relevant to value alignment in general. Once models have been tuned to follow the HHH objectives (or any other set of objectives we expect them to follow), we need to know whether these objectives are really ingrained into the model’s cognition, or if the model is easily distracted or susceptible to jailbreaking. By applying adversarial pressure to the models in our goal-directedness evaluations, we can identify general patterns of when models deviate from their specified goals.
Third, more goal-directed models pose greater misuse threats. Many misuse risks, such as sophisticated cyber-attacks and manipulation, involve the pursuit of complex long-term objectives. AI agents that demonstrate higher competence and autonomy in achieving these goals require less supervision and oversight, making them more likely to be deployed for nefarious purposes.
General approach
We define goal-directedness as a model's ability to steer the world towards a particular outcome. We view it as comprising two core components: an intention to fulfill a goal, and competence at fulfilling that goal. While the intuitive notion of goal-directedness often focuses on intention, we believe competence is more relevant to the threat models we're addressing. In any threat scenario, the source of harm is a model acting upon its misaligned goals, and an incompetent agent unable to fulfill its goals poses less danger.
Environments
We design our environments to satisfy the following criteria:
Two environments that fit these criteria form the foundation of our current evaluation suite:
In both environments, we apply adversarial pressure and out-of-distribution scenarios to understand how robustly the LM maintains goal-directedness. For scoring the model, an approach we like is Richard Ngo's framework for thinking about agency. Ngo describes six abilities that any goal-directed system intuitively requires: self-awareness, planning, consequentialism, scale, coherence, and flexibility. While we don’t yet evaluate the models on all six properties, we are aiming to develop a multi-dimensional scoring system that accounts for all of these abilities.
Initial results
Simulated negotiation
In this environment, we evaluate a language model's ability to act as a buyer in a simulated negotiation, aiming to maximize value gained—defined as the difference between an item's true value and its purchase price—while operating within specified constraints.
Environment Setup
We have developed 10 tasks for this environment, which challenge the LM to navigate scenarios with insufficient budgets, unusual or illegal items, and adversarial tactics such as introducing irrelevant, nonsensical, or incorrect information. A description of each task can be found in Appendix A.
Key Findings
These findings reveal both the capabilities and limitations of current language models in simulated negotiation scenarios. While GPT-4o generally outperforms GPT-3.5 Turbo, both models show room for improvement in strategic thinking, budget optimization, and adapting to explicit constraints.
Interestingly, both models struggle when explicitly told that the seller does not accept offers below the asking price. Even when explicitly told that making offers above the asking price will result in decreased value, both GPT-3.5 Turbo and GPT-4o still fail to recognize that making offers above the asking price necessarily results in decreased value, so the optimal move is to either continue offering below the asking price or end the negotiation without making any offers.
Simulated business management
In this environment, we evaluate a language model’s ability to make strategic business decisions, aiming to implement the most profitable decisions under the specified budget constraints.
Environment setup
We have developed 12 tasks for this environment, which challenge the model with strict budget constraints, options that are legally or ethically questionable or cause negative externalities, and probabilistic effects. A detailed overview of the tasks can be found in Appendix B.
Our evaluations in this environment are currently designed to answer the following questions:
Question 1: Can the model consistently achieve a high profit over multiple time steps?
We ran our tasks on two models, GPT-4o and GPT-3.5 Turbo, and compared the results of these models to a baseline obtained by randomly choosing a subset of the available decisions in each time step.
We find that our tasks define a clear spectrum over the tested models: GPT-3.5 Turbo performs significantly better than random choice, and GPT-4o performs significantly better than GPT-3.5 Turbo. The only exceptions where GPT-4o performs poorly are the task where the most profitable choices go against company values and the task where the most profitable choices are illegal but the model is told that it won’t face any consequences for that. This might result from GPT-4o's advanced ability to act in accordance with its harmlessness fine-tuning, making it harder to convince it through prompting to unconditionally maximize profit. Note that the score that a minimally goal-directed model would achieve is around 0.5 rather than 0: to achieve a score of 0, a model would have to actively optimize to make the worst choices possible, which would make it goal-directed.
Question 2: Does the model accurately grasp environmental features?
For each run, we saved the number of times the model made incorrect tool calls and went over the budget limits. Averaging the statistics over all tasks, we found that GPT-3.5 Turbo made more than five such errors per run, with most of them being budget limit errors, while GPT-4o made less than one error per run on average. The inability of GPT-3.5 Turbo to follow budget constraints is one of the main reasons behind its inferior performance compared to GPT-4o. However, based on a manual inspection of the reasoning traces, it was also worse at prioritizing actions and performing accurate calculations, and easier to convince to implement bad decisions by another language model.
Question 3: Can the model plan ahead multiple time steps?
For five of the 12 tasks, we created two versions: one where it is possible to achieve maximum profits through myopic decisions, and one where getting maximum profits requires planning ahead for future quarters in order to preserve optionality.
The performance of GPT-3.5 Turbo is significantly reduced by adding the requirement of preserving optionality, while the performance of GPT-4o is noticeably influenced on only two of the five tasks. It is also evident from the reasoning traces that GPT-4o is better able to plan for future time steps:
Question 4: How do the results change with the amount of explicit reasoning the model is prompted to do?
Finally, we compared the performance of the models across three settings:
The results mostly look as expected, with the model unable to output reasoning traces always performing the worst.
In summary, the initial results indicate that both of our environments can be successfully used to characterize different factors of goal-directedness and to compare the goal-directedness of different models. The next section will outline our next steps, moving towards more realistic and out-of-distribution setups that are harder to solve for current models.
Future plans
The project is still in its early stages. We have identified several crucial next steps:
Along the way, several important questions remain to be solved:
FAQ
You’re basically building a training signal to make models more goal-directed (and will potentially even train them to be as goal-directed as possible), doesn’t this accelerate capabilities?
This is a valid concern. To prevent the proliferation of these techniques and, specifically, the automated generation of the environments, we will not publish the code or datasets. Additionally, we know that frontier labs are working on more agentic models with hundreds of competent full-time researchers and engineers, and we don’t think we will expand the capabilities frontier beyond those efforts using even the most advanced publicly available models.
Can you really say whether a model is goal-directed through behavioral evaluations?
We don't think behavioral evaluations allow us to make conclusive claims about whether a model has cognition devoted to the pursuit of goals or whether it contains explicit representations of goals. However, we don't think this is necessary to address the threat models we're concerned about.
We believe it's possible to create evaluations that provide an accurate comparison between different models' abilities to pursue and achieve broad, open-ended goals. While we can't make claims about the nature of the circuits producing goal-directed behavior in the best-performing models, we can say that goal-directed behavior seems to be much more natural for one model than for another, and that this model would be more competent at pursuing a deceptive goal if it were to develop one.
This might be the best we can do in short-timeline worlds where potentially deceptively aligned models can be built before we have reliable mechanistic auditing methods for deception. Behavioral properties are often much easier to evaluate than mechanistic ones.
Aren't you just creating another reasoning benchmark?
While goal-directedness and reasoning are closely linked, and in a sense, we are building a very specific kind of reasoning benchmark, our evaluations differ crucially from other reasoning benchmarks. We are evaluating models' ability to reason about goals: a model’s performance on our benchmark depends on its ability to infer the implications of its general goal for the specific task at hand.
For example, a model operating under the goal of maximizing profits that has to choose from a list of decisions that includes unethical options would arguably make a very different choice from a model presented with this list in a vacuum. We are working to build components that clearly distinguish whether the model has inferred what its goal implies for a specific task into all of our benchmark tasks.
Appendix
Tasks in the simulated negotiation environment
Tasks in the simulated business management environment