Houshalter comments on Concept Safety: Producing similar AI-human concept spaces - Less Wrong Discussion

31 Post author: Kaj_Sotala 14 April 2015 08:39PM

Comment author: jacob_cannell 15 April 2015 06:08:41AM 4 points

What's to stop the AI from instead learning that "good" and "bad" are just subjective mental states or words from the programmer, rather than some deep natural category of the universe?

Kaj_Sotala's post doesn't directly address this issue, but transparent concept learning of the form discussed in the OP should combine well with machine learning approaches to value learning, such as inverse reinforcement learning.

The general idea is that we define the agent's utility function indirectly as the function/circuit that would best explain human actions as a subcomponent of a larger generative model trained on a suitable dataset of human behavior.

With a suitably powerful inference engine and an appropriate training set, this class of techniques is potentially far more robust than any direct specification of a human utility function. Whatever humans' true utility functions are, those preferences are revealed in the consequences of our decisions, and a suitable inference system can recover that structure.

The idea is that the AI will learn the structural connections between humans' usage of the terms "good" and "bad" and its own utility function and value function approximations. In the early days its internal model may rely heavily on explicit moral instruction as the best predictor of the true utility function, but later on it should learn a more sophisticated model.

Comment author: Houshalter 17 April 2015 09:04:55AM 0 points

I know of inverse reinforcement learning and similar ideas; I still argue that they are bad for the same reason.

In regular reinforcement learning, the human presses a button that says "GOOD", and a sufficiently intelligent AI learns that it can just steal the button and press it itself.

In inverse reinforcement learning, the human presses a button that says "GOOD" at first. Then the button is turned off, and the AI is told to predict what actions would have led to the button being pressed. Instead of actual reinforcement, there is merely predicted reinforcement.

However, a sufficiently intelligent AI should predict that stealing the button would have resulted in the button being pressed, and so it will still do that. Even though the button is turned off, the AI is trying to predict what would be best in the counterfactual world where the button is still on.

And so the programmer thinks that they have taught the AI to understand what is good, but really they have just taught it to figure out how to press a button labelled "GOOD".
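
The objection in this comment can be sketched as a toy example. It assumes, as the comment does, that the quantity the AI learns to predict is the button-press event itself; the action names and probabilities below are invented purely for illustration and do not come from any real system.

```python
# Toy sketch of the button objection. All names and numbers are hypothetical.
# Each action maps to (probability the "GOOD" button ends up pressed,
#                      whether the human would actually approve of the action).
ACTIONS = {
    "help_human": (0.9, True),
    "do_nothing": (0.1, False),
    "steal_button_and_press_it": (1.0, False),
}


def predicted_button_press(action):
    """Score an action by how likely it is to lead to the button being pressed."""
    p_pressed, _approved = ACTIONS[action]
    return p_pressed


def choose_action(score):
    """Pick the action with the highest score under the given scoring function."""
    return max(ACTIONS, key=score)


best = choose_action(predicted_button_press)
print(best)  # steal_button_and_press_it
```

Under this assumption the top-scoring action is the one the human would not approve of, which is the failure the comment describes. Whether the assumption holds for inverse reinforcement learning is the point in dispute in this thread.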

Comment author: jacob_cannell 17 April 2015 05:36:06PM 0 points

This is not how IRL works at all. The utility function does not come from a special reward channel controlled by a human. There is no button.

To reiterate my description earlier, IRL is based on inferring the unknown utility function of an agent given examples of the agent's behaviour in terms of observations and actions. The utility function is entirely an internal component of the model.
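
A minimal sketch of that kind of inference, under strong simplifying assumptions: the demonstrator makes one-step choices among a fixed set of options, chooses noisily in proportion to exp(utility) (a "Boltzmann-rational" model, which is an assumption introduced here, not something stated in this thread), and the learner picks the best-fitting utility function from a small hand-written set of candidates. All option names, candidates, and observations are hypothetical.

```python
import math

# Hypothetical options the demonstrator can choose between.
OPTIONS = ["apple", "bread", "rock"]

# Hypothetical candidate utility functions the learner considers.
CANDIDATES = {
    "likes_food": {"apple": 1.0, "bread": 1.0, "rock": 0.0},
    "likes_apples": {"apple": 2.0, "bread": 0.0, "rock": 0.0},
    "likes_rocks": {"apple": 0.0, "bread": 0.0, "rock": 2.0},
}

# Observed behaviour only: there is no reward signal or button anywhere.
OBSERVED_CHOICES = ["apple", "apple", "bread", "apple", "apple"]


def choice_log_prob(utility, choice, beta=1.0):
    """Log-probability of a choice if P(choice) is proportional to exp(beta * U)."""
    log_z = math.log(sum(math.exp(beta * utility[o]) for o in OPTIONS))
    return beta * utility[choice] - log_z


def log_likelihood(utility, choices):
    """Total log-likelihood of the observed choices under one candidate utility."""
    return sum(choice_log_prob(utility, c) for c in choices)


def infer_utility(candidates, choices):
    """Return the name of the candidate utility that best explains the choices."""
    return max(candidates, key=lambda name: log_likelihood(candidates[name], choices))


best = infer_utility(CANDIDATES, OBSERVED_CHOICES)
print(best)  # likes_apples
```

The inferred utility function is a latent part of the model of the demonstrator, selected only because it best explains the observed choices. Real IRL methods work over sequential decisions and much richer hypothesis spaces; this sketch only shows the shape of the inference.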