But if you don't know what human values are, how can you be sure that the AI will learn them correctly?
So you make an AI and tell it: "Go forth and learn human values!" It goes off, and after a while it comes back and says, "Behold, I have learned them." How do you know this is true?
If I train a neural network to recognize dogs, I have no direct way of knowing whether it learned correctly. I can't look at the weights and see that they are correct dog-recognizing weights rather than something else. But I can trust the process of training and validation: if the network performs well on held-out examples it never saw during training, I have evidence that it really has learned what dogs look like.
It's a similar principle with learning human values. Of course it's more complicated than feeding the AI images of dogs, but the important part is the principle of letting AIs learn models from real-world data and then checking those models against data they weren't trained on.
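The train-then-validate idea above can be sketched in a few lines. This is a toy illustration, not a real image classifier: a single synthetic feature stands in for "dog-ness", and the "model" is one learned threshold. The point is that we never inspect the learned parameter to judge correctness; we judge by accuracy on a held-out split the training step never touched.

```python
import random

random.seed(0)

# Hypothetical stand-in for image features: "dog" examples (label 1) cluster
# near +1.0 on one synthetic feature, "not dog" examples (label 0) near -1.0.
def make_examples(n, label, center):
    return [(random.gauss(center, 0.5), label) for _ in range(n)]

data = make_examples(100, 1, 1.0) + make_examples(100, 0, -1.0)
random.shuffle(data)

# Hold out a validation split that training never sees.
train, valid = data[:150], data[150:]

# A one-parameter "model": a decision threshold fit on training data only.
# (A real network has millions of opaque weights; the principle is the same.)
mean_dog = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
mean_not = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
threshold = (mean_dog + mean_not) / 2

def predict(x):
    return 1 if x > threshold else 0

# We don't "read" the learned parameter to check it is a correct
# dog-recognizing parameter; we trust the held-out accuracy instead.
accuracy = sum(predict(x) == y for x, y in valid) / len(valid)
print(f"held-out accuracy: {accuracy:.2f}")
```

The trust here is statistical, not introspective: a high score on unseen examples is the evidence, and the same kind of evidence is what we'd have to construct for a system that claims to have learned values.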
If it's worth saying, but not worth its own post, then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should start on Monday and end on Sunday.
4. Unflag the two options "Notify me of new top level comments on this article" and "