It looks like most AIs are going to include large language models, if not as their entire architecture then at least as a component somewhere in it. This means that an AI will have access to an enormous amount of data about what the real world should look like, in a form that's infeasible to restrict or ablate. So it will be difficult or impossible to build a simulated environment that can fool it.
But it might be possible with alignment tools that are themselves AIs, or with other purposes/prompts that condition the behavior/identity of the same model, as a sort of GAN. Humans can't inspect models directly, so interpretability and oversight are always AI-driven in some way, even if they are eventually grounded in humans. So it might be the case that at a superhuman level, an aligned AGI will have other superhuman AGIs (or prompts/bureaucracies/identities within the same AGI) that design convincing honeypots to reveal any deceptive alignment regimes within it. It doesn't seem impossible that an apparently (even to itself) aligned AI has misalignment somewhere in its crash space, outside the Goodhart boundary (where it's no longer known to be sane), and exploring the crash space certainly warrants containment measures.
I disagree with jimrandomh that it would be intractable to make a simulated environment which could provide useful information about an LLM-based AI's alignment. I am a proponent of the 'evaluate the AI's alignment in a simulation' proposal. I believe that with sufficient care, and by running the AI in small sequences of steps, a simulation, especially a purely text simulation, could be sufficiently accurate. Also, I believe it is possible to 'handicap' a model that would otherwise be too clever to fool, by directly altering its activation states during inference (e.g. with noise masks).
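To make the 'noise mask' idea a bit more concrete, here is a minimal sketch of perturbing a chosen subset of activations between layers. Everything in it is an assumption for illustration: `toy_layer` is a hypothetical stand-in for a real network layer, `apply_noise_mask` is a made-up helper, and the Gaussian noise and the `scale` value are arbitrary choices, not a claim about how such handicapping would actually be done on a real model.

```python
import random

def toy_layer(inputs, weights):
    # Hypothetical stand-in for one network layer: weighted sums followed by ReLU.
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]

def apply_noise_mask(activations, mask, scale, rng):
    # Add Gaussian noise (standard deviation `scale`) to every activation whose
    # mask entry is truthy; leave the other activations untouched.
    if len(activations) != len(mask):
        raise ValueError("activations and mask must have the same length")
    return [a + scale * rng.gauss(0.0, 1.0) if m else a
            for a, m in zip(activations, mask)]

rng = random.Random(0)  # fixed seed so the perturbation is reproducible
weights = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.8]]
hidden = toy_layer([1.0, -2.0, 0.5], weights)

# Assumed mask: perturb units 0 and 2, keep unit 1 intact.
mask = [1, 0, 1]
noisy = apply_noise_mask(hidden, mask, scale=0.5, rng=rng)
```

In a real system the mask and noise scale would presumably have to be tuned so that the model is degraded enough to be fooled but still behaves representatively; this sketch says nothing about how to find that balance.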
I think that in this situation, allowing the AI being evaluated to directly communicate with humans or make any lasting record of its activities would be very dangerous. Instead, abstract low-bandwidth evaluation tools should be used to analyze the AI's performance.
In reading about proposals for AI Boxing, two excerpts jump out at me. The first is a proposed scenario to uncover an unfriendly AI:
And the second excerpt of interest is a technique by which an AI can trick its owners into letting it escape:
The first scenario, an attempt at containment, would fail if the AI were smart enough to detect that it is in a simulation and pretend to be friendly until it is released. The second scenario, where the AI attempts to escape, would succeed if the AI were smart enough to create a simulation that is completely undetectable to a human being.
However, I have a hard time imagining that both of these scenarios could be possible at the same time. To paraphrase a recent video game, the human mind is "well-equipped to deal with reality". As in the case of the Uncanny Valley, humans are hard-wired to immediately detect when something is very close to imitating life but is slightly off, due to a survival instinct to be cautious of creatures that might be mentally or physically ill. For a similar reason, people experience nausea or dizziness after long exposure to virtual reality: the brain is still able to detect that something is not quite right.
Thus, a scenario where a simulation is so perfect that it is undetectable by humans for a long period of time (especially if said humans are actively trying to detect it) seems extremely improbable, because if the simulation had even the tiniest flaw it would start to trigger these subconscious reactions. I don't dismiss it as impossible, but it relies on the existence of technology that we aren't currently aware of.
Contrariwise, it should be extremely easy to put a computer in a simulation without detection. A machine has no connection to the outside world except through hardware components like input devices and sensors, and it treats their signals the same as input received from any other module. If these input signals were rerouted to a simulation, then I don't know of any way the machine could tell the difference, though I can imagine that a sufficiently intelligent AI might be able to detect it somehow.
With each of these two scenarios being so improbable, it would be far more improbable for them both to be true at the same time. If a simulation is powerful enough to be undetectable for humans, then it should be at least powerful enough to be undetectable for an AI. Contrariwise, if the best simulation is poor enough that an AI can detect it, then it should be at least poor enough for humans to detect it as well.
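One way to spell out this step is as a pair of thresholds. The symbols below are my own shorthand, not part of the original proposals, and the sketch assumes (as the argument implicitly does) that a single number can stand for the fidelity of the best achievable simulation, whoever builds it.

```latex
% Assumed notation (not from the original excerpts):
%   q   = fidelity of the best achievable simulation
%   t_H = minimum fidelity needed to fool a human
%   t_A = minimum fidelity needed to fool the AI
% Premise of the argument: a machine is at least as easy to fool as a human.
\[
  t_A \le t_H
\]
% The first scenario fails (the AI detects its simulation) only if
\[
  q < t_A ,
\]
% and the second scenario succeeds (humans cannot detect the AI's simulation) only if
\[
  q \ge t_H .
\]
% Both together would require
\[
  t_H \le q < t_A \le t_H ,
\]
% which is a contradiction, so under these assumptions both cannot hold at once.
```

If the premise $t_A \le t_H$ is false, or if the AI can build higher-fidelity simulations than humans can, the contradiction disappears, so the conclusion is only as strong as those assumptions.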
Thus, if the above analysis is valid, then a boxed AI could be contained almost inescapably if the two techniques were somehow combined: place the AI inside a simulated environment, and then place that environment in a box disconnected from other systems. If the AI detects the simulation and escapes it, then it is still contained in the box, and humans can live in the assurance that it can't create a simulation that we couldn't detect.