I'd recommend that the story be labelled as fiction/illustrative from the very beginning.
Thanks, modified!
I believe I do.
Thanks!
Having done a lot of work on corrigibility, I believe that it can't be implemented in a value-agnostic way; it needs a subset of human values to make sense. I also believe that it requires so much of human values that it is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.
Instead, you could have a satisficer that tries to maximise the probability that the utility is above a certain value. This leads to different dynamics than maximising expected utility. What do you think?
If U is the utility and u is the value it needs to exceed, define a new utility V, which is 1 if U > u and 0 otherwise. This is a well-defined utility function, and the design you described is exactly equivalent to being an expected V-maximiser.
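A minimal sketch of this equivalence (the distribution and threshold are hypothetical, chosen only for illustration): for any set of samples of U, the estimated probability that U > u is identical, term by term, to the sample mean of V = 1[U > u], so maximising one objective is maximising the other.

```python
import random

random.seed(0)

# Hypothetical setup: a policy induces some distribution over the utility U.
u = 0.5                                                       # satisficing threshold
samples = [random.gauss(0.6, 0.3) for _ in range(100_000)]    # draws of U

# P(U > u), estimated directly from the samples
prob_above = sum(s > u for s in samples) / len(samples)

# V = 1 if U > u else 0, and its sample mean E[V]
V = [1 if s > u else 0 for s in samples]
expected_V = sum(V) / len(V)

# The two quantities are equal by construction, not just approximately:
print(prob_above == expected_V)  # True
```

Since the equality holds sample by sample, any agent ranking actions by P(U > u) ranks them identically to an expected-V maximiser.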
Thanks! Corrected.
Thanks! Corrected.
Great and fun :-)
Thanks! Corrected (though it is indeed a good hard problem).
Pre-training and domain-specific knowledge are not needed.
Run them on examples such as frown-with-red-bar and smile-with-blue-bar.
Which problems are you thinking of?