Kyre comments on Concept Safety: Producing similar AI-human concept spaces - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
I totally agree with you that AIs should be able to learn what humans mean by different concepts. I never really understood that objection. I think the problem is a bit deeper. This sentence right here:
What's to stop the AI from instead learning that "good" and "bad" are just subjective mental states or words from the programmer, rather than some deep natural category of the universe? So instead of doing things it thinks the human programmer would call "good", it just tortures the programmer and forces them to say "good" repeatedly.
The AI understands what you mean, it just doesn't care.
What stops it? The pictures and videos of torture in the training set that are labelled "bad".
It is not perfect, but the idea is that with a large and diverse training set, alternative models of "good/bad" become extremely contrived, and the human one you are aiming for becomes the simplest model.
I found the material in the post very interesting. It holds out hope that after training, your world model might not be as opaque as people fear.