If I train a neural network to recognize dogs, I have no way of knowing if it learned correctly. I can't look at the weights and see if they are correct dog image recognizing weights and not something else. But I can trust the process of training and validation, that the AI has learned to recognize what dogs look like.
It's a similar principle with learning human values. Of course it's more complicated than just feeding it images of dogs, but the principle of letting AIs learn models from real world data is the important part.
If I train a neural network to recognize dogs, I have no way of knowing if it learned correctly.
Of course you do. You test it. You show it a lot of images (that it hasn't seen before) of dogs and not-dogs and check how good it is at differentiating them.
How would that process work for an AI and human values?
the principle of letting AIs learn models from real world data
Right, human values: “A man's greatest pleasure is to defeat his enemies, to drive them before him, to take from them that which they possessed, to see those whom they cherished in tears, to ride their horses, and to hold their wives and daughters in his arms.”
If it's worth saying, but not worth its own post, then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should start on Monday, and end on Sunday.
4. Unflag the two options "Notify me of new top level comments on this article" and "