I don't see how something like this could be a natural problem in my setting. It all seems to depend on how the goal definition (i.e. the WBE research team evaluated by the outer AGI) thinks such issues through, e.g. by making sure that they don't participate in any bargaining when they are not ready, at human level or using poorly understood tools. Whatever problem you can notice now, they could also notice while being evaluated by the outer AGI, if evaluation of the goal definition does happen, so the only critical issues for the goal definition are those that can't be solved at all. It's more plausible to get the decision theory in the goal-evaluating AGI wrong, so that it can itself lose bargaining games and end up effectively abandoning goal evaluation, which is a clue in favor of it being important to understand bargaining/blackmail pre-AGI. PM/email me?
I agree that if the WBE team (who already know that a powerful AI exists and that they're in a simulation) can resist all blackmail, then the problem goes away.
I think I've found a new argument, which I'll call X, against Paul Christiano's "indirect normativity" approach to FAI goals. I just discussed X with Paul, who agreed that it's serious.
This post won't describe X in detail because it's based on basilisks, which are a forbidden topic on LW, and I respect Eliezer's requests despite sometimes disagreeing with them. If you understand Paul's idea and understand basilisks, figuring out X should take you about five minutes (there's only one obvious way to combine the two ideas), so you might as well do it now. If you decide to discuss X here, please try to follow the spirit of LW policy.
In conclusion, I'd like to ask Eliezer to rethink his position on secrecy. If more LWers understood basilisks, somebody might have come up with X earlier.