
Kyre comments on Concept Safety: Producing similar AI-human concept spaces - Less Wrong Discussion

31 Post author: Kaj_Sotala 14 April 2015 08:39PM




Comment author: Kyre 15 April 2015 05:09:18AM 4 points

What's to stop the AI from instead learning that "good" and "bad" are just subjective mental states or words from the programmer, rather than some deep natural category of the universe? So instead of doing things it thinks the human programmer would call "good", it just tortures the programmer and forces them to say "good" repeatedly.

The pictures and videos of torture in the training set that are labelled "bad".

It is not perfect, but I think the idea is that, with a large and diverse training set, alternative models of "good/bad" become extremely contrived, and the human one you are aiming for becomes the simplest model.
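The "contrived models lose on a diverse training set" argument can be sketched as a crude minimum-description-length comparison. This is my own toy illustration, not anything from the post: the example data, the two candidate models, and the equal base complexities are all made up for the sake of the sketch.

```python
# Toy sketch (assumed example, not from the post): score two hypotheses
# about what "good/bad" means by a crude MDL-style cost: base model
# complexity plus one unit per training example the model mislabels.

# Hypothetical training set of (situation, label) pairs.
training_set = [
    ("helping someone", "good"),
    ("torture", "bad"),
    ("kindness", "good"),
    ("programmer says 'good'", "good"),
    ("forced praise under torture", "bad"),  # diverse data penalises the contrived model
]

def exceptions(model, data):
    """Count examples the model mislabels; each costs extra description length."""
    return sum(1 for situation, label in data if model(situation) != label)

def human_model(situation):
    """Intended concept: anything involving torture is bad."""
    return "bad" if "torture" in situation else "good"

def contrived_model(situation):
    """Contrived concept: whatever makes the programmer say 'good' is good."""
    return "good" if "good" in situation else "bad"

BASE_COMPLEXITY = 1  # assume both models are equally simple a priori

human_cost = BASE_COMPLEXITY + exceptions(human_model, training_set)
contrived_cost = BASE_COMPLEXITY + exceptions(contrived_model, training_set)
```

On this toy data the contrived model has to "pay" for the diverse examples it mislabels, so `human_cost < contrived_cost`: the more varied the labelled data, the more exceptions a contrived reading of "good/bad" must memorise.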

I found the material in the post very interesting. It holds out hope that, after training, your world model might not be as opaque as people fear.