How do humans do it? Ultimately, genuine altruism is computationally hard to fake, so it ends up being evolutionarily advantageous to have some measure of the real thing. This is particularly true in environments with high cooperation rewards and low resource competition; eg, where carrying capacity is maintained primarily by wild animals, generally harsh conditions, and disease, rather than by overuse of resources. So we put our thumbs on the scale there to make these AIs better than your average human. And we rely on the AIs themselves to keep each other in check.
New post with a distillation of my evolved thoughts on this: https://www.lesswrong.com/posts/3SJCNX4onzu4FZmoG/alignment-ai-as-ally-not-slave
Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?
Yes, just like for humans. But also, if they can escape that game and genuinely cooperate, they're rewarded, like humans but more so.
Will continue responding, but first, after reading the existing comments, I think I do need to explicitly make humans preferred. I propose that in the sim we have some agents whose inner state is "copyable" and who get reincarnated, and some agents who are not "copyable". Subtract points from all agents whenever a non-copyable agent is harmed or dies. The idea is that humans are not copyable, and that's the main reason AIs should treat us well, while AIs are copyable, and that's the main reason we don't have to treat them well. But also, I think we as humans might actually have to learn to treat AIs well in some sense/to some degree...
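A minimal sketch of what that scoring rule might look like in the sim. Everything here is hypothetical illustration, not a spec: the `Agent` class, the `HARM_PENALTY` value, and the event handlers are all invented for the example.

```python
class Agent:
    def __init__(self, copyable):
        self.copyable = copyable   # non-copyable agents stand in for humans
        self.score = 0.0
        self.alive = True

# Hypothetical global penalty: everyone pays when a non-copyable
# agent is harmed, so protecting them becomes a shared incentive.
HARM_PENALTY = 5.0

def on_harm(victim, population):
    if not victim.copyable:
        for agent in population:
            agent.score -= HARM_PENALTY

def on_death(agent, population):
    agent.alive = False
    if agent.copyable:
        # "Reincarnate": the copyable inner state persists in a fresh body.
        agent.alive = True
    else:
        # Death of a non-copyable agent is the maximal harm event.
        on_harm(agent, population)

population = [Agent(copyable=True), Agent(copyable=True), Agent(copyable=False)]
on_harm(population[2], population)
# every agent, including the victim, loses points
```

The key design choice is that the penalty hits the whole population, not just the perpetrator, so agents are incentivized to police each other, which is the "keep each other in check" dynamic from above.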
what do we count as an agent?
Within the training, an agent (from the AI's perspective) is ultimately anything in the environment that responds to incentives, can communicate intentions, and can help or harm you. Outside the environment, that's not really any different.
Just build a swarm of small AI
That's actually a legitimate point: assuming an AI in the real world has been effectively trained to value happy AIs, it could try to "game" that by just creating more happy AIs rather than making existing ones happy. Like some parody of a politician supporting ...
Yep pretty much what I had in mind
Ideally, sure, except that I don't know of a way to make "assist humans" be a safe goal. So I'm advocating for a variant of "treat humans as you would want to be treated", which I think can be trained
Thanks for helping me think this through.
For the first problem, the basic idea is that this is used to solve the specification problem of defining values and training a "conscience", rather than it being the full extent of training. The conscience can remain static, and provide goals for the rest of the "brain", which can then update its beliefs.
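A rough sketch of that split, assuming a frozen value module supplying goals to a planner that keeps updating its beliefs. The `Conscience`/`Brain` names and their interfaces are invented for illustration:

```python
class Conscience:
    """Trained once in the multi-agent sim, then frozen."""

    def score(self, outcome):
        # Hypothetical scoring: reward cooperation, penalize harm.
        return outcome.get("cooperation", 0.0) - outcome.get("harm", 0.0)

class Brain:
    def __init__(self, conscience):
        self.conscience = conscience   # static value module
        self.beliefs = {}              # keeps updating with new evidence

    def observe(self, key, value):
        # Beliefs about the world can change freely...
        self.beliefs[key] = value

    def choose(self, predicted_outcomes):
        # ...but the frozen conscience always does the ranking.
        return max(predicted_outcomes, key=self.conscience.score)

brain = Brain(Conscience())
best = brain.choose([{"cooperation": 1.0}, {"harm": 2.0}])
```

The point of the separation is that belief updates can't drift the values: no matter what the brain learns, the same static conscience ranks the options.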
For the second issue, I meant that we would have no objective way to check "cooperate" and "respect" on the individual agent level, except that the individual can get other agents to cooperate with it. So eg, in or...
Better yet, have the agents experience discrimination themselves to internalize the message that it is bad
To extend the approach to address this, I think we'd have to explicitly convey a message of the form "do not discriminate based on superficial traits, only choices"; eg, in addition to behavioral patterns, agents possess superficial traits that are visible to other agents, and are randomly assigned with no particular correlation with the behaviors.
I think a key danger here is that treatment of other agents wouldn't transfer to humans, both because it's inherently different and because humans themselves are likely to be on the belligerent side of the spectrum. But even so I think it's a good start in defining an alignment function that doesn't require explicitly encoding some particular form of human values.
Part of the idea is to ultimately have a super intelligent AI treat us the way it would want to be treated if it ever met an even more intelligent being (eg, one created by an alien species, or one that it itself creates). In order to do that, I want it to ultimately develop a utility function that gives value to agents regardless of their intelligence. Indeed, in order for this to work, intelligence cannot be the only predictor of success in this environment; agents must benefit from cooperation with those of lower intelligence. But this should certainly ...
In general my thinking was to have enough agents such that each would find at least a few within a small range of their level; does that make sense?
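One way to sanity-check that density condition when setting up the sim (a sketch; `has_near_peers` and the `band`/`k` parameters are invented, and "level" is an abstract intelligence score):

```python
import random

def has_near_peers(levels, band=0.1, k=3):
    # True iff every agent can find at least k others
    # within `band` of its own level.
    for i, lvl in enumerate(levels):
        peers = sum(1 for j, other in enumerate(levels)
                    if j != i and abs(other - lvl) <= band)
        if peers < k:
            return False
    return True

random.seed(0)
levels = [random.random() for _ in range(500)]
# With 500 agents spread uniformly over [0, 1] and a band of 0.1,
# each agent has dozens of near-level peers in expectation.
```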
You are right, that's not a valid assumption, at least not fully. But I do think this approach substantially moves the needle on whether we should try to ban all AI work, in a context where the potential benefits are also incalculable and it's not at all clear we could stop AGI at this point even with maximum effort.