It might also fail for more mundane reasons, such as imitating agentic behavior in the training distribution.
I think what you're saying here is that you'd see it behaving non-myopically because it's simulating a consequentialist agent, correct? This difficulty seems to me like a pretty central blocker to doing these kinds of tests on LLMs. It's not clear to me that at this point we have any way of distinguishing 'model behavior' from simulation (one possibility might be looking for something like an attractor, where a surprisingly wide range of prompts result in a particular behavior).
(Of course I realize that there are plenty of cases where simulated behavior has real-world consequences. But for testing whether a model 'is' non-myopic, this seems like an important problem)
You could try to do tests on data that is far enough from the training distribution that it won't generalize in a simple immitative way there, and you could do tests to try and confirm that you are far enough off distribution. For instance, perhaps using a carefully chosen invented language would work.
In “Hidden Incentives for Auto-Induced Distributional Shift”, we introduce several “unit tests” meant to determine whether a system will pursue instrumental incentives to influence its environment. A system that does so would not be “consequence-blind” (using terminology from https://www.lesswrong.com/posts/c68SJsBpiAxkPwRHj/how-llms-are-and-are-not-myopic), and might seek power or to deceive or manipulate humans it interacts with.
The main unit test basically consists of a simple environment where a system plays the prisoner’s dilemma against its future and past self, i.e. “cooperating” now reduces present reward, but increases reward at the next time step.
The original purpose of this unit test was to understand which training set-ups or learning algorithms would produce systems that were consequence-blind. For instance, one surprising result shows that systems trained with myopic (gamma=0) Q-learning of a particular form may fail to be consequence-blind. Supervised learning with a meta-learning outer-loop can as well.
The self-supervised pre-training of LLM seems like it should produce a consequence-blind model, as argued in the linked LW post. However, there are speculative concerns that it might fail to, e.g. because of mesa-optimization or functional decision theory. It might also fail for more mundane reasons, such as imitating agentic behavior in the training distribution. Either type of failure might be informative, but I would find it particularly concerning if failures do not seem “explained away” by imitation.
The main issue I see with applying this unit test to pretrained LLMs is answering the question of “How is the goal communicated to the system?” The unit test is basically designed to evaluate learning algorithms, not models. I see two possibilities: