User Comment Replies

2 Knight Lee 1mo

Yes, a very dishonest AI likely can't be identified by a human asking it anything. Yes, an honest and extremely misaligned AI can be identified by just asking "what are your goals." But an honest and moderately misaligned AI may need to describe a detailed ideal world before you can identify the misalignment. It may behave well until strange new contexts bring up glitches in its moral reasoning.

Jailbreaking happens when an AI, which normally refuses bad requests, is confused into complying with a bad request. Likewise, an AI which normally won't want to kill all humans or create a dystopia might stumble into thoughts which change its mind, if it runs for long enough. My hope is that the Detailed Ideal World Benchmark deliberately throws things at the AI which make these failure modes happen during role-play, before they happen in real life.

Also, suppose you were forced to choose a human and give him absolute power over the universe. You have a few human candidates to choose from. If you simply asked them "what are your goals," they might all talk about making people happy, caring for the weak, and so forth in a few short paragraphs. If you asked them to give lots of details of their ideal world, you may learn that some of them have very bad views. When they talk about making people happy, you realize their definition of happiness is a bit different from yours. Yes, some of them will lie, and interviewing them won't be perfect.

Detailed Ideal World Benchmark

Sappique 1mo 20

For example, a bad moral argument may argue that even the simplest being which is "capable of arguing for equal rights for itself," ought to deserve equal rights and personhood. This simplest being is a small AI.

Or that that simplest "being" is a rock with "I want equal rights" written on it.

Generally I think the Ideal World Benchmark could be useful for identifying some misaligned AIs. However, some misaligned AIs can be identified by asking "What are your goals?" and I do not expect the Ideal World Benchmark to be significantly more robust to deception.

I...

All of Sappique's Comments + Replies