jazzkingrt comments on Learning values versus learning knowledge - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (17)
I don't think this problem is very hard to resolve. If an AI is programmed to make sense of natural-language concepts like "chocolate bar", there should be a mechanism to acquire a best-effort understanding. So you could rewrite the motivation as:
"create things which the greatest number of people understand to be a chocolate bar"
or alternatively:
"create things which the programmer is most likely to have understood to be a chocolate bar".
That's just rephrasing one natural-language requirement in terms of another, unless these concepts can be phrased in something other than natural language (but then those other phrasings may be susceptible to manipulation).
Another way of putting the objection is "don't design a system whose goal system is walled off from its updateable knowledge base". Loosemore's argument is that that is in fact the natural design, and so the "general counter argument" isn't general.
It would be like designing a car whose wheels fall off when you press a button on the dashboard: 1) it's possible to build it that way, 2) there's no motivation to build it that way, and 3) it's more effort to build it that way.
Connecting the goal system to the knowledge base is not sufficient at all. You have to ensure that the labels used in the goal system converge to the meaning that we desire them to have.
I'll try and build practical examples of the failures I have in mind, so that we can discuss them more formally, instead of very nebulously as we are now.
Ok, assuming you are starting from a compartmentalised system, it has to be connected in the right way. That is more of a nitpick than a knockdown.
But the deeper issue is whether you are starting from a system with a distinct utility function:
The problem exists for reinforcement learning agents and many other designs as well. In fact RL agents are more vulnerable, because of the risk of wireheading on top of everything else. See Laurent Orseau's work on that: https://www6.inra.fr/mia-paris/Equipes/LInK/Les-anciens-de-LInK/Laurent-Orseau/Mortal-universal-agents-wireheading
Simpler AIs may adopt a simpler version of a goal than the human programmers' intentions. It's not clear that they do so because they have a motivation to do so. In a sense, an RL agent is only motivated to avoid negative reinforcement. But simpler AIs don't pose much of a threat. Wireheading doesn't pose much of a threat either.
AFAICS, it's an open question whether the goal-simplifying behaviour of simple AIs is due to limitation or motivation.
The contentious claims concern AIs that are human-level or above: sophisticated enough to appreciate human intentions directly, but nonetheless getting them wrong. An RL AI that has NL but nonetheless misunderstands "chocolate" or "happiness", and does so only in the context of its goals and not in its general world knowledge, needs an architecture that allows it to do that, one that allows it to engage in compartmentalisation or doublethink. Doublethink is second nature to humans, because we are optimised for primate politics.