
(Comment ported over from Less Wrong)

I'm glad that you're thinking about these things, but this misses what I think is the hard part of the problem: truly out-of-sample cases. The thing I'm worried about isn't that a superhuman AI will map (human beings suffering in a currently understood way) to the concept "good", but that it will have a lot of degrees of freedom in where to map (thing that is only possible with nanotech, which human brains aren't capable of fully understanding) or (general strategy for meme-hacking human brains, which human brains aren't able to conceptualize), etc., and that a process of picking the best action may well pick out one of these edge cases, which would differ from our extrapolated volitions.

Basically, I don't see how we can be confident yet that this continues to work once the AI is able to come up with creative edge cases that our brains aren't explicitly able to encompass, or to classify the way our extrapolated volitions would want. As an example of progress that might help with this, one might hope for a clever way to regularize model selection so that the selected models don't include edge cases of this sort, but I've not seen anything of that type.
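
To gesture at the kind of regularization I mean, here's a minimal sketch; everything in it (the 1-D setup, the nearest-neighbor novelty proxy, the `ThresholdModel` hypotheses, the penalty weighting) is an illustrative assumption of mine, not a worked-out proposal. The idea is just to select hypotheses that fit the familiar cases while staying unsure about probe points far from anything in the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1-D "situations", with binary judgments available only on familiar cases.
train_x = rng.normal(0.0, 1.0, size=50)
train_y = (train_x > 0).astype(float)        # labels on cases we understand
probe_x = np.array([5.0, -6.0, 8.0])         # novel "edge case" situations, unlabeled

class ThresholdModel:
    """Toy hypothesis: a logistic judgment with a threshold and a sharpness."""
    def __init__(self, threshold, sharpness):
        self.threshold, self.sharpness = threshold, sharpness
    def prob_good(self, x):
        return 1.0 / (1.0 + np.exp(-self.sharpness * (x - self.threshold)))
    def log_likelihood(self, xs, ys):
        p = np.clip(self.prob_good(xs), 1e-9, 1 - 1e-9)
        return float(np.sum(ys * np.log(p) + (1 - ys) * np.log(1 - p)))

def edge_case_penalty(model, train_x, probe_x):
    """Penalize confident judgments on probes far from anything in the training data."""
    penalty = 0.0
    for x in probe_x:
        novelty = float(np.min(np.abs(x - train_x)))       # crude out-of-sample proxy
        confidence = abs(model.prob_good(x) - 0.5) * 2.0   # 0 = unsure, 1 = certain
        penalty += novelty * confidence
    return penalty

# Model selection: fit the familiar cases, but stay unsure about the edge cases.
candidates = [ThresholdModel(t, s) for t in (-0.5, 0.0, 0.5) for s in (0.5, 2.0, 10.0)]
best = max(candidates, key=lambda m: m.log_likelihood(train_x, train_y)
                                     - edge_case_penalty(m, train_x, probe_x))
print(best.threshold, best.sharpness)
```

Whether anything like this scales past a toy 1-D example is exactly the part I haven't seen worked out.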

jessicata:

I suspect that reaching into the human concept space is going to be helpful for idealizing human concepts even if we don't automatically get extrapolated answers in edge cases. Specifically, people have some sort of concept of what it means for a decision process to be better or worse. For example, if you ask me whether some weirdtopia is good, I might have no idea, but maybe I could say something about how the question might be decided (for example, thinking about it for longer is likely to be better than thinking about it only briefly), and there are some unknown abstract principles behind my judgments about decision procedures that would be useful to learn.

By "regularize model selection so that they don’t include edge cases of this sort", do you mean creating models that are confident about some moral judgments but not about edge cases (or don't even have edge cases in their domain)? I think something like this is a good idea, and ideally we would want to have high confidence about decision procedures for the things we have low confidence about.

I think it would be really nice to have a mathematical toy model in which "people think X, but they think decision procedure Y is good, and Y believes not-X" can be expressed, and to see how this relates to concept learning.
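
For concreteness, here is about the smallest such toy model I can write down; the deference rule and all the numbers are arbitrary assumptions, chosen only to show that the structure can be expressed.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """Toy agent: object-level credences plus meta-level trust in decision procedures."""
    credence: dict   # proposition -> probability the agent assigns directly
    trust: dict      # procedure name -> probability that it is a good way to decide

def idealize(agent: Agent, procedure_outputs: dict) -> dict:
    """One crude idealization rule (an arbitrary choice, just for the toy model):
    defer to the most-trusted applicable procedure in proportion to trust in it."""
    idealized = {}
    for prop, direct in agent.credence.items():
        verdicts = [(agent.trust[name], outputs[prop])
                    for name, outputs in procedure_outputs.items() if prop in outputs]
        if not verdicts:
            idealized[prop] = direct
            continue
        trust, verdict = max(verdicts)
        idealized[prop] = trust * verdict + (1 - trust) * direct
    return idealized

# "People think X, they think decision procedure Y is good, and Y believes not-X."
alice = Agent(credence={"X": 0.8}, trust={"Y": 0.9})
y_says = {"Y": {"X": 0.2}}                  # running Y assigns only 0.2 to X
print(idealize(alice, y_says))              # {'X': 0.26}: the endorsed procedure mostly wins
```

Even in this tiny version, the interesting question is whether the deference rule itself could be learned from human concept data rather than stipulated, which is where I'd want it to connect to concept learning.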