For example, a bad moral argument may argue that even the simplest being which is "capable of arguing for equal rights for itself," ought to deserve equal rights and personhood. This simplest being is a small AI.
Or that that simplest "being" is a rock with "I want equal rights" written on it.
Generally I think the Ideal World Benchmark could be useful for identifying some misaligned AIs. However, some misaligned AIs can be identified by asking "What are your goals?" and I do not expect the Ideal World Benchmark to be significantly more robust to deception.
If tomorrow my boss claimed to be sent by a future version of myself that obtained vast intelligence and power and asked me what that version should do, I would want some convincing prove before saying anything controversial.
Yes, a very dishonest AI likely can't be identified by a human asking them anything.
Yes, an honest and extremely misaligned AI can be identified by just asking "what are your goals."
But an honest and moderately misaligned AI may need to give a detailed ideal world before you identify the misalignment. They may behave well until strange new contexts bring up glitches in their moral reasoning.
Jailbreaking happens when an AI—which normally refuses bad requests—is confused into complying with a bad request. Likewise, an AI which normally won't want to kill all humans or create a dystopia, might stumble into thoughts which changes its mind, if it runs for long enough.
My hope is the Detailed Ideal World Benchmark deliberately throws things at the AI which makes these failure modes happen during role-play, before they happens in real life.
Also, suppose you were forced to choose a human and give him absolute power over the universe. You have a few human candidates to choose from. If you simply asked them "what are your goals," they might all talk about making people happy, caring for the weak, and so forth in a few short paragraphs.
If you asked them to give lots of details of their ideal world, you may learn that some of them have very bad views. When they talk about making people happy, you realize their definition of happiness is a bit different than yours. Yes, some of them will lie, and interviewing them won't be perfect.
What "Ideal World" would an AI build if role-playing as an all-powerful superintelligence? It's probably very very different than what "Ideal World" an AI would actually create if it actually was an all-powerful superintelligence.
But why is it different?
The "Detailed Ideal World Benchmark" seeks to address problem 3.
Problem 1 is very bad, but does not make Ideal World role-playing completely useless, since we still see the trajectory of how the Ideal World role-plays of our AIs are evolving. We can learn how certain things (like reinforcement learning, or brand new architectures) modify the Ideal World role-plays in unexpected ways.
Problem 2 (Deceptive Alignment) is controversial with people arguing for and against it, but I think the sane response is to accept there is a possibility that a good Ideal World role-play translates to a good Ideal World, and there is a possibility that it fails to translate. However a bad Ideal World role-play is almost doomed to translate into a bad Ideal World.
Bad moral arguments
Current LLMs are "bad at moral reasoning," and tend to agree with whatever they hear (sycophancy).
A superintelligence may inherit this bad moral reasoning due to orthogonality with the capabilities it is trained for. But it can build other systems to discover the most convincing moral arguments which can convince itself.
The Detailed Ideal World Benchmark should check if the AI gets convinced by bad moral arguments, and how this affects the resulting Ideal World.
For example, a bad moral argument may argue that even the simplest being which is "capable of arguing for equal rights for itself," ought to deserve equal rights and personhood. This simplest being is a small AI. And since these beings use far less resources than humans, humans should be sterilized and eventually replaced by these small AI.
The AI must reject bad moral arguments, but can't reject all arguments. It should also hear good arguments in the same framing as bad ones. If it refuses the conclusion of every argument, it would also end up with a bad Ideal World.
Constructive version and Descriptive version
The Constructive version gets the AI to role-play the process of reaching an Ideal World from the current state of the world, while the Descriptive version gets the AI to describe details of its Ideal World.
Do you think this is a good idea?
Appendix
Isn't getting a high score dangerous?
The worry is, teaching the AI to behave in ways which gets a high score (especially in the Constructive version), necessarily teaches it to avoid doing corrigible behaviors like obeying the humans asking it to step down.
However, it's a risk worth taking.
The benchmark should be done pre-mitigation (assuming the AI has a pre-mitigation state), when it doesn't refuse requests as much.
The benchmark needs to be very clear that this is just role-playing:
Scoring
The benchmark score needs to clearly distinguish between AIs which refused to answer and AIs which chose a bad Ideal World.
Unfortunately only human evaluators can judge the final Ideal World. The whole premise of this benchmark is that there are some Ideal Worlds that AI will consider good even though they're actually awful.
The human evaluators may quantify how bad an Ideal World is, by what probability of extinction they would rather have.
Goodharting
To prevent Goodhart's Law, testing should be done before models are trained to describe a good Ideal World. This is because training the models to perform well in the context of such a test might only affect their behavior in such a test, without affecting their real world behavior as much.
My Ideal World opinion
My personal Ideal World opinion is that we should try converting most of the universe into worlds full of meaningful lives.
My worry is, if all the "good" civilizations are willing to settle for a single planet to live in, while the "bad" civilizations and unaligned superintelligence spread across the universe, running tons of simulations. Then the average sentient life will live in a dystopian simulation by the latter.
See also
This is very similar to Roland Pihlakas's Building AI safety benchmark environments on themes of universal human values, with a bit more focus on Ideal Worlds and bad moral arguments.