The way you use the word "eval", it sounds like some technical term from some particular context. What do you mean by it? Would the first sentence mean the same if it said instead "I was thinking about tests that would tell us when we're getting to models that are capable of deception."?
Somewhat related thread (which I think was super valuable for me at least, independently) Experimentally evaluating whether honesty generalizes - LessWrong
I was thinking about possible evals that would tell us when we're getting to models that are capable of deception. One not-very-good idea I had was just to measure zero-shot understanding of relevant deception scenarios in a language model. I don't think this tells us very much about whether the model is in question is actually trying to deceive us, but it's a tiny bit interesting. Anyway, it seems like large language models look like they can do decent enough deception + theory of mind in a story-like setting that this is mostly already passed.
If you're more careful, there might be ways to more rigorously test whether the theory of mind is actually correct/mechanistic, versus just mimicking relevant text snippets, but I haven't tried to do that.
I edited and restarted once or twice when it started repeating sentences but otherwise this is not particularly cherrypicked. My prompts are bold.