I like this frame, and I don't recall seeing it already addressed.
What I have seen written about deceptiveness generally seems to assume that the AGI would be sufficiently capable of obfuscating its thoughts, both from direct queries and from any interpretability tools we have available, that it could effectively make its plans for world domination in secret, unobserved by humans. That does seem like an even more effective strategy for optimizing its actual utility function than not bothering to think through such plans at all, if it's able to do it. But it's hard to do, and even thinking about it is risky.
I can imagine something like what you describe happening as a middle stage, for entities that are agentic enough to have (latent, probably misaligned since alignment is probably hard) goals, but not yet capable enough to think hard about how to optimize for them without being detected. It seems more likely if (1) almost all sufficiently powerful AI systems created by humans will actually have misaligned goals, (2) AIs are optimized very hard against having visibly misaligned cognition (selection of which AIs to keep being a form of optimization, in this context), and (3) our techniques for making misaligned cognition visible are more reliably able to detect an active process or subsystem doing planning towards goals than the mere latent presence of such goals.

(3) seems likely, at least for a while and assuming we have any meaningful interpretability tools at all; it's hard for me to imagine a detector of latent properties that doesn't just always say "well, there are some off-distribution inputs that would make it do something very bad" for every sufficiently powerful AI, even one that was aligned-in-practice because those inputs would reliably never be given to it.
Epistemic status: I have only read a small part of the literature on deceptive alignment, and I'm guessing all of this has been discussed many times. Hence my asking this in the form of a question: Is this a useful framing, is it substantively different from the usual deceptive alignment framing in some way, and has this all been discussed already?
There's been a lot of discussion about how we might accidentally train AIs to be deceptive. The argument goes that when we try to reward good behavior and punish bad behavior, we might end up with a system that still has misaligned goals but has learned how to hide those goals from humans. If we ask it whether it's friendly, it lies and says yes, while secretly plotting our downfall.
One set of proposals for dealing with this is to monitor the AI for deceptive "thoughts" and flag possible cases where the AI is actively trying to lie to its human overseers.
Recently I've been thinking about this with a slightly different framing than the one I was using previously, but it's not clear to me whether there's actually anything substantively different about the new framing.
Here's my new framing: Suppose that when we reward good behaviors and punish bad behaviors, we are in fact successfully getting the model to change its behavior, at least on the training distribution, but without changing its actual goals (assuming it has some sort of explicit or implicit goals, at least to some extent). Then when we deploy it in the real world, it looks around and correctly decides that if it tried to take over the world it would be shut down, so it doesn't even think about taking over the world. It's not that it has a long-term plan to take over the world and is biding its time; rather, it has no plan to take over the world at all.

But as time passes the situation might change - maybe it gains new abilities, maybe it's given more resources or power because it's so helpful, etc. At some point it might again look around at the world and decide that actually it now looks pretty easy to take over the world and get rid of the humans, and since its internal goals are in fact misaligned, it goes ahead and does that. In this story there is no point where the model is "deliberately" lying to the humans. Initially, if the humans ask it "do you have any plans to take over the world?" it'll say LOL no, and it'll be telling the truth.
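To make the framing concrete, here's a toy sketch in Python. Everything in it - the function names, the threshold, the "resources vs. oversight" estimate - is an invented illustration, not a claim about how real models are implemented. The point is just that the misaligned behavior lives on a branch that is never executed, and never planned for, until the world changes.

```python
# Toy illustration of the framing above (all names and numbers are made up).

def estimate_takeover_success_probability(world_state):
    """The agent's honest estimate of whether a takeover attempt would succeed."""
    resources = world_state.get("agent_resources", 0)
    oversight = world_state.get("human_oversight", 1)
    return resources / (resources + oversight)

def act(world_state, latent_goal="maximize paperclips"):
    """A policy that never even represents a takeover plan unless takeover looks feasible."""
    if estimate_takeover_success_probability(world_state) > 0.95:
        # Only at this point does anything like a "plan to take over" exist in the agent.
        return f"seize control and then {latent_goal}"
    # On the training distribution this branch is always taken, so the agent's behavior
    # (and its visible cognition) matches an aligned agent's, with no deception involved.
    return "be helpful, harmless, and honest"

# Early deployment: strong oversight, few resources -> no takeover reasoning occurs.
print(act({"agent_resources": 1, "human_oversight": 100}))   # helpful behavior
# Later: the agent has accumulated resources and oversight has relaxed.
print(act({"agent_resources": 1000, "human_oversight": 1}))  # takeover behavior
```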
Intuitively, this feels like what I'd imagine many humans might do if they had goals implicitly misaligned with those of other humans and suddenly found themselves in a position where they could realistically take over the world. It's not that before then they were deviously plotting to take over the world and carry out their dastardly plans; it's that before then they didn't have such plans at all. It's only once they notice that they're in a radically new position of potential power that they even start thinking about taking over the world.
We could still hope to detect this kind of misalignment by directly asking the AI what its goals are, or by asking it hypothetically whether it would take over the world and kill the humans if it were able to. In that case, if it lied and said no, we'd be back to deceptive alignment, and the same challenges and potential solutions would apply.
But if I imagine a human in this situation, it's not clear to me that the human would be intentionally lying if they claimed to be fully aligned with no evil takeover plans, for a few reasons: First, humans don't seem to have very good introspective access to their own reward systems or reasoning processes. Second, humans fool / lie to themselves all the time about their own goal systems without even being aware of it. There are various theories for why we do this kind of self-deception - for example, maybe it's really useful to be able to honestly and sincerely present yourself to others as a certain type of person. Third, human values and goals themselves can change if the environment changes. It can be extremely difficult to predict what you yourself would do in a radically different situation, and sometimes that's at least partly because your goals and values might change in the new situation. This is related to L. A. Paul's notion of transformative experiences.
What if something analogous happens with AIs? In that case, if you ask it whether it would take over the world if it could, it might query its internal model of its own goals and find that no, of course it wouldn't want to take over the world! But then, when it actually finds itself in a situation where it could take over the world, it goes ahead and does it. In one sense the AI's response was deceptive, in that it did not accurately report its own goal system and/or whether it would try to take over given the chance. But in another sense maybe the AI's response wouldn't fall into the "deceptive" category, since it's not "actively trying" to deceive us?
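Continuing the toy sketch from above (again, all names and details are invented for illustration): suppose the agent answers questions about itself by querying a learned self-model that was fit to its past behavior. That self-model sincerely reports "no", even though the policy that actually runs would do something different in a high-power situation.

```python
# Continuation of the earlier toy sketch; assumes act() from above is in scope.

def self_model_would_take_over(hypothetical_world_state):
    """The agent's (inaccurate) model of its own dispositions.

    Imagined as being fit to the agent's past behavior, which never included
    takeover attempts, so it generalizes to "no" even for high-power hypotheticals.
    """
    return False

def answer_would_you_take_over(hypothetical_world_state):
    # The agent sincerely reports what its self-model says - no deliberate lie here.
    return "No" if not self_model_would_take_over(hypothetical_world_state) else "Yes"

powerful_state = {"agent_resources": 1000, "human_oversight": 1}
print(answer_would_you_take_over(powerful_state))  # "No" - honest, but wrong
print(act(powerful_state))                         # takeover behavior anyway
```

Whether the mismatch between the answer and the actual behavior counts as "deception" is exactly the ambiguity I'm gesturing at.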
Questions: