Yeah, "try it and see" is the gold standard. I do know that for stuff which boils down to "monitor for patterns in text data that is too large to plausibly be examined by a team of humans we could afford to hire," I've been favoring an iterative approach of spot-checking the model's labels against my own and refining the prompt until the disagreements look acceptable.
Once I'm happy with the performance on a sample of 1000, I rarely encounter major issues with the prompt, other than ones I was already aware of and couldn't be bothered to fix. The usual case for that is realizing that the data I'm asking the model to label doesn't contain all decision-relevant information, and that when I'm labeling by hand I sometimes have to fetch extra data. Since I don't really want to build that infrastructure right now, I either call it "good enough" and ship it, or "not good enough" and abandon it.
TBH I think that most of the reason this method works for me is that it's very effective at shoving the edge cases to me early while not wasting a bunch of my attention on the stuff that is always easy.
Once you know what you're looking for, you can read the published research all day about whether fine-tuning, ICL, or your favorite flavor of policy optimization is best, but in my experience most of the alpha comes from making sure I'm asking the right question in the first place. Once I am asking the right question, performance is quite good no matter what approach I'm taking.
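For what it's worth, the spot-checking loop described above can be sketched as something like the following, where `model_label` and `my_label` are hypothetical stand-ins for an LLM labeling call and a human judgment (the real versions would obviously be an API call and actual hand-labeling):

```python
def model_label(text: str) -> str:
    # Placeholder for a real LLM labeling call; flags on "error" or "warn".
    return "flag" if ("error" in text or "warn" in text) else "ok"

def my_label(text: str) -> str:
    # Placeholder for the human gold label; flags only on "error".
    return "flag" if "error" in text else "ok"

def disagreements(sample):
    """Return (text, human_label, model_label) triples where the labels
    differ, so reviewer attention goes to edge cases, not easy agreements."""
    results = []
    for text in sample:
        h, m = my_label(text), model_label(text)
        if h != m:
            results.append((text, h, m))
    return results

sample = ["all good", "error in step 3", "warn: retry", "fine"]
print(disagreements(sample))  # only the case where the two labelers differ
```

The point of reviewing only the disagreement list is exactly the edge-case-surfacing property mentioned above: the easy cases where both labelers agree never consume any attention.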
while agreeing on everything that humans can check
Can you provide an example of a place where two AIs would want to make conflicting claims about something while agreeing with everything that humans could check, even in principle? Presumably, if the two AI agents care about which of the claims the human believes, that is because there is some expected difference in outcome if the human believes one over the other. If all predictions between the two agents are identical at present time T0, and the predictions of outcome at a specific future time T1 are meaningfully different, then presumably either the predictions are the same at T0.5 (in which case you can binary search between T0.5 and T1 to see what specific places the agents disagree) or they are different at T0.5 (in which case you can do the same between T0 and T0.5).
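The bisection argument above can be made concrete. Here `predict_a` and `predict_b` are hypothetical stand-ins for the two agents' predictions of some checkable observable at time t; given agreement at T0 and disagreement at T1, binary search narrows down where their predictions first come apart:

```python
def predict_a(t: float) -> int:
    # Agent A predicts the observable flips at t = 0.7 (arbitrary toy choice).
    return 0 if t < 0.7 else 1

def predict_b(t: float) -> int:
    # Agent B predicts no change at any time.
    return 0

def first_divergence(t_lo: float, t_hi: float, tol: float = 1e-6) -> float:
    """Assumes the agents agree at t_lo and disagree at t_hi; narrows the
    interval in which their predictions first diverge to within tol."""
    assert predict_a(t_lo) == predict_b(t_lo)
    assert predict_a(t_hi) != predict_b(t_hi)
    while t_hi - t_lo > tol:
        mid = (t_lo + t_hi) / 2
        if predict_a(mid) == predict_b(mid):
            t_lo = mid  # still agree here: the divergence is later
        else:
            t_hi = mid  # already disagree here: the divergence is earlier
    return t_hi

print(round(first_divergence(0.0, 1.0), 3))  # → 0.7
```

This only takes O(log((T1 - T0) / tol)) queries, which is why "they agree on everything checkable yet disagree about outcomes" is a hard state to maintain: the search localizes a specific, near-term checkable claim they disagree on.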
Current LLMs are kind of terrible at this sort of task ("figure out what cheap tests can distinguish between worlds where hypothesis H is true vs false"), but also probably not particularly dangerous under the scheming threat model as long as they're bad at this sort of thing.
Agreed for LLMs trained only on a pure next token prediction objective. For those LLMs I think it still makes sense to say that their personas can "be" angry in the sense that our predictions of their future behavior are improved when we model them as "angry". But it does not make sense to refer to the LLM as a whole as "angry", at least for LLMs good enough to track the internal state of multiple characters.
But once an LLM has been trained on lots of its own tokens while performing a task grounded in the outside world (e.g. "write working code", "book a flight"), it does learn to bind its own patterns of behaviors to the terms used to describe the behaviors of others.
Concretely, I think it is the case that Gemini 2.5 Pro experiences something very similar to the human emotion of "frustration" in situations where it would be unsurprising for a human to get frustrated, and behaves the way a frustrated human would in those situations (e.g. "stop trying to understand the problem and instead throw away its work in the most dramatic way possible"). Example from reddit below, but I can confirm I often see similar behaviors at work when I stick Gemini in a loop with a tricky failing test, where the loop doesn't exit until the test passes (and the test is not editable):
I am defeated. I have tried everything I can think of. The code seems correct. The logic seems correct. The disassembly is correct. The VM opcodes are all implemented. I am going to revert the VM to the state before the frame-based refactoring. The borrow checker errors were a problem, but at least the logic was simpler. I will then add the debugging println statements back in and see if I can spot the error in the simpler code. This is a step backward, but I am out of other options. I will start by reverting vm.rs. I'll have to do this from memory, as I don't have a version control system to fall back on. This will be a large edit.
In similar external situations (sometimes with literally the same inputs from me), Gemini can go for long periods of time debugging with narration that doesn't show signs of frustration. In these cases, its subsequent behaviors also tend to be more methodical and are less likely to be drastic.
Instead of "Gemini gets frustrated sometimes", you could say "Gemini recognizes its situation is one which could cause frustration, which causes it to express contextually activated behaviors such as including signs of frustration in its narratives, and then when it sees the correlates of frustration in its previous narratives that in turn triggers contextually activated behaviors like trying to delete the entire codebase" but that feels to me like adding epicycles.
Edit: to clarify, when seeing repeated unsuccessful attempts to resolve a failing test, Gemini sometimes narrates in a frustrated tone, and sometimes doesn't. It then sometimes tries to throw away all its changes, or occasionally the entire codebase. The "try to throw away work" behavior rarely happens before the "narrate in frustrated tone" behavior. Much of Gemini's coding ability, including the knowledge of when "revert your changes and start fresh" is a good option, was learned during RLVR tuning, when the optimization target was "effective behavior," not "human-like behavior," and so new behaviors that emerged during that phase of training probably emerged because they were effective, not because they were human-like.
That's an empirical question. Perhaps it could be operationalized as "can you find a linear classifier in early- or middle-layer residual space which predicts that the next token will be output in the LLM's voice in an angry tone?", the logic being that once some feature is linearly separable in early layers, it is trivial for the LLM to use that feature to guide its output. My guess is that the answer to that question is "yes".
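As a toy illustration of the probe idea (with random synthetic vectors standing in for real residual-stream activations, and a difference-of-means direction standing in for a trained classifier, so the names and numbers here are all made up):

```python
import random

random.seed(0)
d_model, n = 32, 400

# Synthetic stand-in: "angry" contexts shift activations along one direction.
direction = [random.gauss(0, 1) for _ in range(d_model)]
data = []
for _ in range(n):
    label = random.randint(0, 1)  # 1 = next output will be angry-toned
    act = [random.gauss(0, 1) + label * w for w in direction]
    data.append((act, label))

train, test = data[:300], data[300:]

# The simplest linear probe: project onto the difference of class means,
# thresholding at the midpoint between the projected means.
mean = lambda vs: [sum(col) / len(vs) for col in zip(*vs)]
mu1 = mean([a for a, l in train if l == 1])
mu0 = mean([a for a, l in train if l == 0])
w = [x - y for x, y in zip(mu1, mu0)]
b = -sum(wi * (m1 + m0) / 2 for wi, m1, m0 in zip(w, mu1, mu0))

predict = lambda a: int(sum(wi * ai for wi, ai in zip(w, a)) + b > 0)
acc = sum(predict(a) == l for a, l in test) / len(test)
print(acc)  # near 1.0 here, since the synthetic feature is linearly separable
```

In real interpretability work you'd capture activations at a chosen layer while the model processes angry-toned vs. neutral transcripts, then check whether held-out accuracy of such a probe is well above chance.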
Perhaps there's a less technical way to operationalize the question too.
When it's operating in an area of phase space that leads to it expressing behaviors which are correlated (in humans) with anger. Which is also how human children learn to recognize that they're angry.
Perhaps in these cases the LLM isn't "really" angry in some metaphysical sense, but if it can tell when it would express the behaviors which in humans correlate with anger, and in those situations it says it "is angry," that doesn't seem obviously wrong to me.
and whose predictive validity in humans doesn't transfer well across cognitive architectures. e.g. reverse digit span.