It seems like it would be really easy to come up with a lot of moral questions and answers and then ask an AI to tell us what it predicts humans preferring as an outcome.
There's a possibility that AI is not good at modeling human preferences, but if that's the case, it'll be very apparent at lower levels because that will mean commands will have to be very specific to get results. Any model that can't answer basic questions about it's intended goals is not going to be given the (metaphorical) nuclear codes.
In fact, why wouldn't you just test every AI by asking it to explain how it's going to solve your problem before it actually solves it?
How do I die?
This article (by Eliezer Yudkowsky) explains why the suggestion in your 2nd paragraph won't work: https://arbital.com/p/goodharts_curse/
I'm afraid I'll butcher the argument in trying to summarize, but essentially it is because even slight misalignments will get blown up (i.e. it will pursue the areas where it is misaligned at the expense of everything else) at increasing optimization pressure. So you might have something aligned fairly well, and at the test optimization level, you can check that it is indeed aligned pretty well - but then when you turn up the pressure, it will find weaker points in the specification and optimize for that instead. And this problem recurs at the meta-level, so there's not an obvious way to say "well obviously just don't do that" in a way that would actually work.
The problem with asking the AI how it will solve your problem is that if it is misaligned, it will just lie to you if that helps it complete its objective more effectively.
I think the question of you/Adele miscommunicating is mostly under-specification of what features you want your test-AGI to have.
If you throttle its ability to optimize for its goals, see EY and Adele's arguments.
If you don't throttle in this way, you run into goal-specification/constraint-specification issues, instrumental convergence concerns and everything that goes along with it.
I think most people here will strongly feel a (computationally) powerful AGI with any incentives is scary, and that any test-versions should require using at-most a mu... (read more)