It is commonly asserted that aligning AI is extremely hard because
- human values are complex: they have a high Kolmogorov complexity, and
- they're fragile: if you get them even a tiny bit wrong, the result is useless, or worse than useless.
If these statements are both true, then the alignment problem is really, really hard, we probably only get one try at it, so we're likely doomed. So it seems worth thinking a bit about whether the problem really is quite that hard. At a Fermi-estimate level, just how big do we think the Kolmogorov complexity of human values might be? Just how fragile are they? If we had human values, say, 99.9% right, and... (read 6035 more words →)
However, if we found that the classifier was using a feature for "How smart is the human asking the question?" to decide what answer to give (as opposed to how to then phrase it), that would be a red flag.