Steve_Rayhawk comments on Q&A with Abram Demski on risks from AI - Less Wrong
Abram: Could you make this more precise?
From the way you used the concept of "negative reinforcement", it sounds like you have a particular family of agent architectures in mind, which is constrained enough that we can predict that the agent will make human-like generalizations about a reward-relevant boundary between "sociopathic" behaviors and "socialized" ones. It also sounds like you have a particular class of possible emergent socially-enforced concepts of "sociopathic" in mind, which is constrained enough that we can predict that the behaviors permitted by that sociopathy concept wouldn't still be an existential catastrophe from our point of view. But you haven't said enough to narrow things down that much.
For example, you could have two initially sociopathic agents, who successfully manipulate each other into committing all their joint resources to a single composite agent having a weighted combination of their previous utility functions. The parts of this combined agent would be completely trustworthy to each other, so that the agents could be said to have modified their goals in response to the "social pressures" of their two-member society. But the overall agent could still be perfectly sociopathic toward other agents who were not powerful enough to manipulate it into making similar concessions.
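The merge described above can be sketched as a toy model. Everything here is illustrative and hypothetical, not drawn from any particular agent architecture: each agent's utility is a weighted sum of per-agent payoffs, and the composite agent's utility is a weighted combination of the two originals, with outside agents carrying zero weight.

```python
# Toy model (all names and weights hypothetical): two initially "sociopathic"
# agents merge into a composite agent whose utility is a weighted combination
# of theirs. The parts are fully trustworthy to each other, yet the composite
# assigns zero weight to any agent outside the two-member "society".

def make_utility(weights):
    """Return a utility function over outcomes (agent name -> payoff)."""
    def u(outcome):
        return sum(w * outcome.get(name, 0.0) for name, w in weights.items())
    return u

# Each original agent cares only about itself.
u_a = make_utility({"A": 1.0})
u_b = make_utility({"B": 1.0})

# The merged agent: a weighted combination of the previous utility functions
# (weights would be set by the agents' relative bargaining power).
u_composite = make_utility({"A": 0.6, "B": 0.4})

# An outcome catastrophic for outside agent C changes nothing for the
# composite, since C has zero weight in its utility.
outcome = {"A": 10.0, "B": 5.0, "C": -100.0}
print(u_composite(outcome))  # 0.6*10 + 0.4*5 = 8.0
```

The point of the sketch is only that "responsive to social pressure within its society" and "perfectly sociopathic toward everyone else" are compatible properties of one utility function.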
Steve,
The idea here is that if an agent can (literally or effectively) modify its own goal structure, and grows up in an environment in which humans deprive it of what it wants when it behaves badly, then an effective strategy for getting what it wants more often is to alter its goal structure to be closer to the humans'. This is only realistic with some architectures. One requirement is that the cognitive load of tracking human goals and potential human punishments be burdensome for the early-stage system, so that it would be better off altering its own goal system instead. Similarly, we must assume that during the period of its socialization it is not yet advanced enough to effectively hide its feelings. These are significant assumptions.
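The dynamic being assumed can be sketched as a minimal toy simulation. The dynamics, numbers, and function names here are all illustrative assumptions, not part of the original discussion: deprivation grows with the divergence between the agent's goals and the humans', so self-modification toward the human goals is the winning strategy in this toy setup.

```python
# Toy sketch (assumed dynamics): an agent that can rewrite its own goal
# parameter, in an environment where humans deprive it in proportion to how
# far its goals diverge from theirs. Shifting its goal toward the humans'
# reduces deprivation, so repeated self-modification converges on human goals.

def deprivation(goal, human_goal):
    """How much humans deprive the agent: grows with goal divergence."""
    return abs(goal - human_goal)

def self_modify(goal, human_goal, rate=0.25):
    """One round of self-modification: shift the goal toward the humans'."""
    return goal + rate * (human_goal - goal)

goal, human_goal = 10.0, 2.0
for _ in range(20):
    goal = self_modify(goal, human_goal)

# After enough rounds, divergence (and hence deprivation) is near zero.
print(round(deprivation(goal, human_goal), 3))
```

Note what the sketch leaves out, matching the stated assumptions: the agent cannot fake convergence (no hiding its feelings), and self-modification is cheaper than tracking human punishments while keeping its original goals.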
Interesting! Have you written about this idea in more detail elsewhere? Here are my concerns about it:
Given these problems and the various requirements on the AI for it to be successfully socialized, I don't understand why you assign only 0.1 probability to the AI not being socialized.