
Kaj_Sotala comments on Concept Safety: Producing similar AI-human concept spaces

Post author: Kaj_Sotala | 14 April 2015 08:39PM | 31 points




Comment author: Kaj_Sotala | 16 April 2015 09:04AM | 2 points

To add to the other comments: "the AI understands what you mean, it just doesn't care" refers to a situation where we have failed to teach the AI to care about the things we care about. At that point, it can likely figure out what we actually wanted it to do, but it isn't motivated to act on that.

This post describes part of a strategy for figuring out how the AI might be made to care about the same things we do: give it an internal understanding of the world that's similar to the human one, and then ground its goals in terms of that understanding.
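(To make the "grounding goals in a shared concept space" idea slightly more concrete, here is a minimal toy sketch in Python. Everything in it, including the random linear encoder, the prototype observation, and the similarity-based reward, is a hypothetical stand-in rather than anything from the post itself; a real system would need a learned encoder and a much richer goal specification.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned encoder that maps raw observations into the AI's
# concept space. In the strategy sketched above, this mapping would be
# trained so that the resulting space is structured similarly to a human's.
ENCODER = rng.standard_normal((4, 8))       # concept_dim x observation_dim

def embed(observation):
    v = ENCODER @ observation
    return v / np.linalg.norm(v)            # unit-length concept vector

# A "goal concept" is a point in that same space, obtained here by embedding
# a prototypical observation that a human has labelled as the desired state.
prototype_goal_observation = rng.standard_normal(8)
goal_concept = embed(prototype_goal_observation)

def reward(observation):
    # The goal is grounded in the concept space: reward is the similarity
    # between the current state's concept representation and the
    # human-specified concept, not a function written over raw sensor values.
    return float(embed(observation) @ goal_concept)

print(reward(prototype_goal_observation))   # ~1.0: matches the goal concept
print(reward(rng.standard_normal(8)))       # typically much lower
```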

It's not a complete solution (or even a complete subsolution), but rather hacking away at the edges. As you mention, if things go badly it is possible, for example, for the AI to escape the box and rewire its reward function. The intended way of avoiding that would be to program it to inherently care about the same things humans do before letting it out of the box. At that point it would no longer be primarily motivated by the programmer's feedback, but by its own internalized values, which would hopefully be human-friendly.