Clarity comments on Concept Safety: Producing similar AI-human concept spaces - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (45)
Kaj_Sotala's post doesn't directly address this issue, but transparent concept learning of the form discussed in the OP should combine well with machine learning approaches to value learning, such as inverse reinforcement learning.
The general idea is that we define the agent's utility function indirectly as the function/circuit that would best explain human actions as a subcomponent of a larger generative model trained on a suitable dataset of human behavior.
With a suitably powerful inference engine and an appropriate training set, this class of techniques is potentially far more robust than any direct specification of a human utility function. Whatever humans' true utility functions are, those preferences are revealed in the consequences of our decisions, and a suitable inference system can recover that structure.
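To make the "recover the utility function from observed decisions" idea concrete, here is a toy sketch, not a real value-learning system. It assumes a simulated human who picks among four options with probability proportional to exp(utility) (a Boltzmann-rational choice model), and a utility that is linear in two made-up features. The feature vectors, the weights, and all function names are hypothetical and chosen purely for illustration.

```python
import math
import random

def softmax(scores):
    """Turn utilities into choice probabilities (Boltzmann-rational model)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def utilities(weights, options):
    """Linear utility: dot product of weights with each option's features."""
    return [sum(w * f for w, f in zip(weights, feats)) for feats in options]

def simulate_choices(true_weights, options, n, rng):
    """Stand-in for a dataset of human behavior: n noisy choices."""
    probs = softmax(utilities(true_weights, options))
    return rng.choices(range(len(options)), weights=probs, k=n)

def infer_weights(options, choices, steps=2000, lr=0.5):
    """Maximum-likelihood fit of the weights by gradient ascent.

    The gradient of the mean log-likelihood is the observed average
    feature vector minus the average predicted by the current weights.
    """
    dim = len(options[0])
    n = len(choices)
    observed = [sum(options[c][d] for c in choices) / n for d in range(dim)]
    weights = [0.0] * dim
    for _ in range(steps):
        probs = softmax(utilities(weights, options))
        predicted = [sum(p * feats[d] for p, feats in zip(probs, options))
                     for d in range(dim)]
        weights = [w + lr * (o - p)
                   for w, o, p in zip(weights, observed, predicted)]
    return weights

# Hypothetical setup: two features per option, hidden "true" preferences.
OPTIONS = [(1, 0), (0, 1), (0, 0), (1, 1)]
TRUE_WEIGHTS = [2.0, -1.0]

rng = random.Random(0)
observed_choices = simulate_choices(TRUE_WEIGHTS, OPTIONS, 5000, rng)
inferred = infer_weights(OPTIONS, observed_choices)
print([round(w, 2) for w in inferred])
```

The learner never sees `TRUE_WEIGHTS`; it only sees which options were chosen, and the fitted weights should land close to the hidden ones. Real inverse reinforcement learning deals with sequential decisions, unknown features, and humans who are not this tidily rational, so treat this as the smallest possible illustration of the principle.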
The idea is that the AI will learn the structural connections between humans' usage of the terms "good" and "bad" and its own utility function and value function approximations. In the early days its internal model may rely heavily on explicit moral instruction as the best predictor of the true utility function, but later on it should learn a more sophisticated model.
I'm trying to wrap my head around all this, and as someone with no programming or AI background, I found this the clearest article on inverse reinforcement learning, with the gentlest learning curve.